• BenchmarkQED: Automated benchmarking of RAG systems

    One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation (RAG) as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics.
    To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub. It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.  
    BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks. 
    In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.
    In the GraphRAG paper, we distinguish between local queries, whose answers are found in a small number of text regions (sometimes even a single region), and global queries, which require reasoning over large portions of the dataset, or even all of it.
    Conventional vector-based RAG excels at local queries because the regions containing the answer resemble the query itself and can be retrieved as nearest neighbors in the vector space of text embeddings. However, it struggles with global questions, such as “What are the main themes of the dataset?”, which require understanding dataset qualities not explicitly stated in the text.
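    To make this concrete, here is a minimal sketch of vector-based retrieval. The embedding model and example chunks are illustrative placeholders, not part of BenchmarkQED.

```python
# Minimal sketch of vector-RAG retrieval: local queries succeed because the
# answer-bearing chunk is the query's nearest neighbor in embedding space.
# Model name and chunk texts are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The new vaccine reduced hospitalizations by 30% in the trial.",
    "Officials debated the budget for rural hospital funding.",
    "A heat wave strained emergency rooms across the region.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    # Cosine similarity reduces to a dot product on normalized vectors.
    scores = chunk_vecs @ q
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

print(retrieve("How effective was the vaccine?"))
```

    A global question like “What are the main themes?” has no single nearest-neighbor chunk, which is exactly the gap the methods below address.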
    AutoQ: Automated query synthesis
    This limitation motivated the development of GraphRAG, a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.
    AutoQ extends this approach by generating synthetic queries across the full spectrum, from local to global. It defines four distinct classes based on the source and scope of the query (Figure 1, top), forming a logical progression along the spectrum (Figure 1, bottom).
    Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum. 
    AutoQ can be configured to generate any number and distribution of synthetic queries along these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.
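    As an illustration of what specifying the number and distribution of queries might look like, here is a hypothetical configuration. DataLocal and ActivityLocal appear later in this post, and the other two class names follow from the 2×2 design in Figure 1, but the interface shown is an assumption, not the published BenchmarkQED API.

```python
# Hypothetical AutoQ specification: 50 synthetic queries per class, matching
# the experiment described below. Class names follow Figure 1; the dict-based
# interface is an assumption for illustration only.
autoq_spec = {
    "DataLocal": 50,      # grounded in specific text regions, narrow scope
    "DataGlobal": 50,     # grounded in the data, dataset-wide scope
    "ActivityLocal": 50,  # derived from user activities, narrow scope
    "ActivityGlobal": 50, # derived from user activities, dataset-wide scope
}
```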
    Figure 2. Synthesis process and example query for each of the four AutoQ query classes. 

    AutoE: Automated evaluation framework 
    Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation:

    Comprehensiveness: Does the answer address all relevant aspects of the question? 
    Diversity: Does it present varied perspectives or insights? 
    Empowerment: Does it help the reader understand and make informed judgments? 
    Relevance: Does it address what the question is specifically asking?  

    The AutoE component scales evaluation of these qualities using the LLM-as-a-Judge method. It presents pairs of answers to an LLM, along with the query and target metric, in counterbalanced order. The model determines whether the first answer wins, loses, or ties with the second. Over a set of queries, whether from AutoQ or elsewhere, this produces win rates between competing methods. When ground truth is available, AutoE can also score answers on correctness, completeness, and related metrics.
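    The following sketch shows the general LLM-as-a-Judge pattern under stated assumptions: the prompt wording, verdict parsing, and client call are illustrative, not AutoE’s actual implementation, which lives in the BenchmarkQED repository.

```python
# Sketch of pairwise LLM-as-a-Judge scoring with counterbalanced order,
# assuming an OpenAI-compatible client. Prompt and parsing are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are judging two answers to the same question on one
metric: {metric}. Question: {query}

Answer A: {a}

Answer B: {b}

Reply with exactly one word: A, B, or TIE."""

def judge(query: str, ans1: str, ans2: str, metric: str, flipped: bool) -> float:
    # Counterbalance presentation order to cancel position bias; alternate
    # `flipped` across trials of the same pair.
    a, b = (ans2, ans1) if flipped else (ans1, ans2)
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            metric=metric, query=query, a=a, b=b)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    if verdict == "TIE":
        return 0.5
    ans1_wins = (verdict == "A") != flipped  # un-flip before scoring
    return 1.0 if ans1_wins else 0.0
```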
    An illustrative evaluation is shown in Figure 3. Using a dataset of 1,397 AP News articles on health and healthcare, AutoQ generated 50 queries per class (200 total). AutoE then compared LazyGraphRAG to each competing RAG method, running six trials per query across four metrics, using GPT-4.1 as a judge.
    These trial-level results were aggregated using metric-based win rates, where each trial is scored 1 for a win, 0.5 for a tie, and 0 for a loss, and then averaged to calculate the overall win rate for each RAG method.
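    A minimal expression of this scoring scheme:

```python
# Aggregating trial outcomes into a win rate: 1 for a win, 0.5 for a tie,
# 0 for a loss, averaged over all trials in a (method, metric, class) cell.
def win_rate(outcomes: list[float]) -> float:
    return sum(outcomes) / len(outcomes)

# e.g. 3 wins, 2 ties, 1 loss across six trials of one query:
print(win_rate([1, 1, 1, 0.5, 0.5, 0]))  # 0.666...
```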
    Figure 3. Win rates of four LazyGraphRAG (LGR) configurations across methods, broken down by the AutoQ query class and averaged across AutoE’s four metrics: comprehensiveness, diversity, empowerment, and relevance. LazyGraphRAG outperforms comparison conditions where the bar is above 50%.
    The four LazyGraphRAG conditions (LGR_b200_c200, LGR_b50_c200, LGR_b50_c600, LGR_b200_c200_mini) differ by query budget (b50, b200) and chunk size (c200, c600). All used GPT-4o mini for relevance tests and GPT-4o for query expansion (to five subqueries) and answer generation, except for LGR_b200_c200_mini, which used GPT-4o mini throughout.
    Comparison systems were GraphRAG (Local, Global, and Drift Search), Vector RAG with 8k- and 120k-token context windows, and three published methods: LightRAG, RAPTOR, and TREX. All methods were limited to the same 8k tokens for answer generation. GraphRAG Global Search used level 2 of the community hierarchy.
    LazyGraphRAG outperformed every comparison condition using the same generative model (GPT-4o), winning all 96 comparisons, with all but one reaching statistical significance. The best overall performance came from the larger budget, smaller chunk size configuration (LGR_b200_c200). For DataLocal queries, the smaller budget (LGR_b50_c200) performed slightly better, likely because fewer chunks were relevant. For ActivityLocal queries, the larger chunk size (LGR_b50_c600) had a slight edge, likely because longer chunks provide a more coherent context.
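    The post does not name the significance test used. For paired win/loss outcomes, one common choice is a two-sided binomial sign test that discards ties; the sketch below uses hypothetical counts purely to illustrate the idea.

```python
# Illustrative significance check on paired trial outcomes: a two-sided
# binomial sign test, discarding ties. The counts here are hypothetical;
# the actual test used in the BenchmarkQED evaluation is not specified.
from scipy.stats import binomtest

wins, losses, ties = 210, 60, 30
result = binomtest(wins, n=wins + losses, p=0.5, alternative="two-sided")
print(result.pvalue)  # small p-value: win rate differs reliably from 50%
```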
    Competing methods performed relatively better on the query classes for which they were designed: GraphRAG Global for global queries and Vector RAG for local queries. GraphRAG Drift Search, which combines both strategies, posed the strongest challenge overall.
    Increasing Vector RAG’s context window from 8k to 120k tokens did not improve its performance compared to LazyGraphRAG. This raised the question of how LazyGraphRAG would perform against Vector RAG with a 1M-token context window containing most of the dataset.
    Figure 4 shows the follow-up experiment comparing LazyGraphRAG to Vector RAG using GPT-4.1, the model whose 1M-token context window enabled this comparison. Even against the 1M-token window, LazyGraphRAG achieved higher win rates across all comparisons, failing to reach significance only for the relevance of answers to DataLocal queries. These queries tend to benefit most from Vector RAG’s ranking of directly relevant chunks, making it hard for LazyGraphRAG to generate answers with greater relevance to the query, even though those answers may be dramatically more comprehensive, diverse, and empowering overall.
    Figure 4. Win rates of LazyGraphRAG (LGR) over Vector RAG across different context window sizes, broken down by the four AutoQ query classes and four AutoE metrics: comprehensiveness, diversity, empowerment, and relevance. Bars above 50% indicate that LazyGraphRAG outperformed the comparison condition.
    AutoD: Automated data sampling and summarization
    Text datasets have an underlying topical structure, but the depth, breadth, and connectivity of that structure can vary widely. This variability makes it difficult to evaluate RAG systems consistently, as results may reflect the idiosyncrasies of the dataset rather than the system’s general capabilities.
    The AutoD component addresses this by sampling datasets to meet a target specification, defined by the number of topic clusters (breadth) and the number of samples per cluster (depth). This creates consistency across datasets, enabling more meaningful comparisons, as structurally aligned datasets lead to comparable AutoQ queries, which in turn support consistent AutoE evaluations.
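    A minimal sketch of this kind of two-parameter sampling, assuming k-means clustering over document embeddings; AutoD’s actual sampler may cluster and select documents differently.

```python
# Sketch of breadth/depth sampling: cluster document embeddings into a target
# number of topic clusters (breadth), then draw a fixed number of documents
# per cluster (depth). KMeans is an assumption, not AutoD's actual method.
import numpy as np
from sklearn.cluster import KMeans

def sample_dataset(doc_vecs: np.ndarray, n_clusters: int, per_cluster: int,
                   seed: int = 0) -> list[int]:
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init="auto").fit_predict(doc_vecs)
    rng = np.random.default_rng(seed)
    picked: list[int] = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        picked.extend(rng.choice(members, size=take, replace=False))
    return picked  # indices of sampled documents
```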
    AutoD also includes tools for summarizing input or output datasets in a way that reflects their topical coverage. These summaries play an important role in the AutoQ query synthesis process, but they can also be used more broadly, such as in prompts where context space is limited.
    Since the release of the GraphRAG paper, we’ve received many requests to share the dataset of the Behind the Tech podcast transcripts we used in our evaluation. An updated version of this dataset is now available in the BenchmarkQED repository, alongside the AP News dataset containing 1,397 health-related articles, licensed for open release.  
    We hope these datasets, together with the BenchmarkQED tools, help accelerate benchmark-driven development of RAG systems and AI question-answering. We invite the community to try them on GitHub. 
  • Alibaba Qwen Team Releases Qwen3-Embedding and Qwen3-Reranker Series – Redefining Multilingual Embedding and Ranking Standards

    Text embedding and reranking are foundational to modern information retrieval systems, powering applications such as semantic search, recommendation systems, and retrieval-augmented generation (RAG). However, current approaches often face key challenges—particularly in achieving both high multilingual fidelity and task adaptability without relying on proprietary APIs. Existing models frequently fall short in scenarios requiring nuanced semantic understanding across multiple languages or domain-specific tasks like code retrieval and instruction following. Moreover, most open-source models either lack scale or flexibility, while commercial APIs remain costly and closed.
    Qwen3-Embedding and Qwen3-Reranker: A New Standard for Open-Source Embedding
    Alibaba’s Qwen Team has unveiled the Qwen3-Embedding and Qwen3-Reranker Series—models that set a new benchmark in multilingual text embedding and relevance ranking. Built on the Qwen3 foundation models, the series includes variants in 0.6B, 4B, and 8B parameter sizes and supports a wide range of languages (119 in total), making it one of the most versatile and performant open-source offerings to date. These models are now open-sourced under the Apache 2.0 license on Hugging Face, GitHub, and ModelScope, and are also accessible via Alibaba Cloud APIs.
    These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

    Technical Architecture
    Qwen3-Embedding models adopt a dense transformer-based architecture with causal attention, producing embeddings by extracting the hidden state corresponding to the [EOS] token. Instruction-awareness is a key feature: input queries are formatted as {instruction} {query}<|endoftext|>, enabling task-conditioned embeddings. The reranker models are trained with a binary classification format, judging document-query relevance in an instruction-guided manner using a token likelihood-based scoring function.
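    A sketch of that extraction step using the Hugging Face transformers API. The prompt format follows the article’s description, and last-token pooling is the standard pattern for causal-attention embedders; treat the model card’s recommended usage as authoritative.

```python
# Sketch of last-token embedding extraction for a causal-attention embedder.
# Prompt format follows the article ({instruction} {query}<|endoftext|>);
# see the Qwen3-Embedding model card for the exact recommended usage.
import torch
from transformers import AutoModel, AutoTokenizer

name = "Qwen/Qwen3-Embedding-0.6B"
tok = AutoTokenizer.from_pretrained(name, padding_side="left")
model = AutoModel.from_pretrained(name)

def embed(instruction: str, query: str) -> torch.Tensor:
    text = f"{instruction} {query}<|endoftext|>"
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # With causal attention, only the final token has attended to the whole
    # sequence, so its hidden state serves as the embedding; normalize it
    # for cosine-similarity search.
    vec = out.last_hidden_state[0, -1]
    return torch.nn.functional.normalize(vec, dim=0)
```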

    The models are trained using a robust multi-stage training pipeline:

    Large-scale weak supervision: 150M synthetic training pairs generated using Qwen3-32B, covering retrieval, classification, STS, and bitext mining across languages and tasks.
    Supervised fine-tuning: 12M high-quality data pairs, selected by cosine similarity (>0.7), refine performance on downstream applications.
    Model merging: Spherical linear interpolation (SLERP) of multiple fine-tuned checkpoints ensures robustness and generalization (see the sketch below).
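    Since SLERP is just a formula, a small parameter-wise illustration may help. This is the textbook interpolation; how the Qwen team weights or chains multiple checkpoints is not specified, and the toy state dicts are placeholders.

```python
# Textbook SLERP between two checkpoints, applied parameter-wise.
# Interpolating along the hypersphere preserves weight norms better than
# plain linear averaging. Toy state dicts stand in for real checkpoints.
import torch

def slerp(w1: torch.Tensor, w2: torch.Tensor, t: float,
          eps: float = 1e-7) -> torch.Tensor:
    a, b = w1.flatten(), w2.flatten()
    cos = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    omega = torch.arccos(cos)          # angle between the two weight vectors
    if omega.abs() < eps:              # near-parallel: fall back to lerp
        return (1 - t) * w1 + t * w2
    s = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / s) * w1 + (torch.sin(t * omega) / s) * w2

sd1 = {"w": torch.randn(4, 4)}  # placeholder fine-tuned checkpoint A
sd2 = {"w": torch.randn(4, 4)}  # placeholder fine-tuned checkpoint B
merged = {k: slerp(sd1[k], sd2[k], t=0.5) for k in sd1}
```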

    This synthetic data generation pipeline enables control over data quality, language diversity, task difficulty, and more—resulting in a high degree of coverage and relevance in low-resource settings.
    Performance Benchmarks and Insights
    The Qwen3-Embedding and Qwen3-Reranker series demonstrate strong empirical performance across several multilingual benchmarks.

    On MMTEB (216 tasks across 250+ languages): Qwen3-Embedding-8B achieves a mean task score of 70.58, surpassing Gemini and the GTE-Qwen2 series.
    On MTEB (English v2): Qwen3-Embedding-8B reaches 75.22, outperforming other open models including NV-Embed-v2 and GritLM-7B.
    On MTEB-Code: Qwen3-Embedding-8B leads with 80.68, excelling in applications like code retrieval and Stack Overflow QA.

    For reranking:

    Qwen3-Reranker-0.6B already outperforms Jina and BGE rerankers.
    Qwen3-Reranker-8B achieves 81.22 on MTEB-Code and 72.94 on MMTEB-R, marking state-of-the-art performance.

    Ablation studies confirm the necessity of each training stage. Removing synthetic pretraining or model merging led to significant performance drops (up to 6 points on MMTEB), emphasizing their contributions.
    Conclusion
    Alibaba’s Qwen3-Embedding and Qwen3-Reranker Series present a robust, open, and scalable solution to multilingual and instruction-aware semantic representation. With strong empirical results across MTEB, MMTEB, and MTEB-Code, these models bridge the gap between proprietary APIs and open-source accessibility. Their thoughtful training design—leveraging high-quality synthetic data, instruction-tuning, and model merging—positions them as ideal candidates for enterprise applications in search, retrieval, and RAG pipelines. By open-sourcing these models, the Qwen team not only pushes the boundaries of language understanding but also empowers the broader community to innovate on top of a solid foundation.

    Check out the Paper, Technical details, Qwen3-Embedding, and Qwen3-Reranker. All credit for this research goes to the researchers of this project.
  • Bring Receipts: New NVIDIA AI Blueprint Detects Fraudulent Credit Card Transactions With Precision

    Editor’s note: This blog, originally published on October 28, 2024, has been updated.
    Financial losses from worldwide credit card transaction fraud are projected to reach more than $403 billion over the next decade.
    The new NVIDIA AI Blueprint for financial fraud detection can help combat this burgeoning epidemic — using accelerated data processing and advanced algorithms to improve AI’s ability to detect and prevent credit card transaction fraud.
    Launched this week at the Money20/20 financial services conference, the blueprint provides a reference example for financial institutions to identify subtle patterns and anomalies in transaction data based on user behavior to improve accuracy and reduce false positives compared with traditional methods.
    It shows developers how to build a financial fraud detection workflow by providing reference code, deployment tools and a reference architecture.
    Companies can streamline the migration of their fraud detection workflows from traditional compute to accelerated compute using the NVIDIA AI Enterprise software platform and NVIDIA accelerated computing. The NVIDIA AI Blueprint is available for customers to run on Amazon Web Services, with availability coming soon on Dell Technologies and Hewlett Packard Enterprise. Customers can also use the blueprint through service offerings from NVIDIA partners including Cloudera, EXL, Infosys and SHI International.

    Businesses embracing comprehensive machine learning (ML) tools and strategies can observe up to an estimated 40% improvement in fraud detection accuracy, boosting their ability to identify and stop fraudsters faster and mitigate harm.
    As such, leading financial organizations like American Express and Capital One have been using AI to build proprietary solutions that mitigate fraud and enhance customer protection.
    The new AI Blueprint accelerates model training and inference, and demonstrates how these components can be wrapped into a single, easy-to-use software offering, powered by NVIDIA AI.
    Currently optimized for credit card transaction fraud, the blueprint could be adapted for use cases such as new account fraud, account takeover and money laundering.
    Using Accelerated Computing and Graph Neural Networks for Fraud Detection
    Traditional data science pipelines lack the compute acceleration to handle the massive data volumes required for effective fraud detection. ML models like XGBoost are effective for detecting anomalies in individual transactions but fall short when fraud involves complex networks of linked accounts and devices.
    Helping address these gaps, NVIDIA RAPIDS — part of the NVIDIA CUDA-X collection of microservices, libraries, tools and technologies — enables payment companies to speed up data processing and transform raw data into powerful features at scale. These companies can fuel their AI models and integrate them with graph neural networks (GNNs) to uncover hidden, large-scale fraud patterns by analyzing relationships across different transactions, users and devices.
    The use of gradient-boosted decision trees — a type of ML algorithm — tapping into libraries such as XGBoost, has long been the standard for fraud detection.
    The new AI Blueprint for financial fraud detection enhances the XGBoost ML model with NVIDIA CUDA-X Data Science libraries including GNNs to generate embeddings that can be used as additional features to help reduce false positives.
    The GNN embeddings are fed into XGBoost to create and train a model that can then be orchestrated. In addition, NVIDIA Dynamo-Triton, formerly NVIDIA Triton Inference Server, boosts real-time inferencing while optimizing AI model throughput, latency and utilization.
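    The following sketch shows that feature-fusion pattern under stated assumptions: random arrays stand in for real tabular features and GNN embeddings, and the XGBoost hyperparameters are arbitrary.

```python
# Sketch of the GNN-embeddings-into-XGBoost pattern: concatenate
# per-transaction GNN embeddings with tabular features, then train a
# gradient-boosted classifier on the combined matrix. All data is synthetic.
import numpy as np
import xgboost as xgb

n = 10_000
tabular = np.random.rand(n, 20)           # amount, merchant category, etc.
gnn_emb = np.random.rand(n, 64)           # per-transaction GNN embeddings
labels = np.random.randint(0, 2, size=n)  # 1 = fraudulent

features = np.hstack([tabular, gnn_emb])
clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
clf.fit(features, labels)
scores = clf.predict_proba(features)[:, 1]  # fraud probability per transaction
```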
    NVIDIA CUDA-X Data Science and Dynamo-Triton are included with NVIDIA AI Enterprise.
    Leading Financial Services Organizations Adopt AI
    At a time when many large North American financial institutions report that online and mobile fraud losses continue to increase, AI is helping to combat this trend.
    American Express, which began using AI to fight fraud in 2010, leverages fraud detection algorithms to monitor all customer transactions globally in real time, generating fraud decisions in just milliseconds. Using a combination of advanced algorithms, one of which tapped into the NVIDIA AI platform, American Express enhanced model accuracy, advancing the company’s ability to better fight fraud.
    European digital bank bunq uses generative AI and large language models to help detect fraud and money laundering. Its AI-powered transaction-monitoring system achieved nearly 100x faster model training speeds with NVIDIA accelerated computing.
    BNY announced in March 2024 that it became the first major bank to deploy an NVIDIA DGX SuperPOD with DGX H100 systems, which will help build solutions that support fraud detection and other use cases.
    And now, systems integrators, software vendors and cloud service providers can integrate the new NVIDIA blueprint for fraud detection to boost their financial services applications and help keep customers’ money, identities and digital accounts safe.
    Explore the NVIDIA AI Blueprint for financial fraud detection and read this NVIDIA Technical Blog on supercharging fraud detection with GNNs.
    Learn more about AI for fraud detection by visiting the AI Summit at Money20/20, running this week in Amsterdam.
    See notice regarding software product information.
    #bring #receipts #new #nvidia #blueprint
    Bring Receipts: New NVIDIA AI Blueprint Detects Fraudulent Credit Card Transactions With Precision
    Editor’s note: This blog, originally published on October 28, 2024, has been updated. Financial losses from worldwide credit card transaction fraud are projected to reach more than billion over the next decade. The new NVIDIA AI Blueprint for financial fraud detection can help combat this burgeoning epidemic — using accelerated data processing and advanced algorithms to improve AI’s ability to detect and prevent credit card transaction fraud. Launched this week at the Money20/20 financial services conference, the blueprint provides a reference example for financial institutions to identify subtle patterns and anomalies in transaction data based on user behavior to improve accuracy and reduce false positives compared with traditional methods. It shows developers how to build a financial fraud detection workflow by providing reference code, deployment tools and a reference architecture. Companies can streamline the migration of their fraud detection workflows from traditional compute to accelerated compute using the NVIDIA AI Enterprise software platform and NVIDIA accelerated computing. The NVIDIA AI Blueprint is available for customers to run on Amazon Web Services, with availability coming soon on Dell Technologies and Hewlett Packard Enterprise. Customers can also use the blueprint through service offerings from NVIDIA partners including Cloudera, EXL, Infosys and SHI International. Businesses embracing comprehensive machine learningtools and strategies can observe up to an estimated 40% improvement in fraud detection accuracy, boosting their ability to identify and stop fraudsters faster and mitigate harm. As such, leading financial organizations like American Express and Capital One have been using AI to build proprietary solutions that mitigate fraud and enhance customer protection. The new AI Blueprint accelerates model training and inference, and demonstrates how these components can be wrapped into a single, easy-to-use software offering, powered by NVIDIA AI. Currently optimized for credit card transaction fraud, the blueprint could be adapted for use cases such as new account fraud, account takeover and money laundering. Using Accelerated Computing and Graph Neural Networks for Fraud Detection Traditional data science pipelines lack the compute acceleration to handle the massive data volumes required for effective fraud detection. ML models like XGBoost are effective for detecting anomalies in individual transactions but fall short when fraud involves complex networks of linked accounts and devices. Helping address these gaps, NVIDIA RAPIDS — part of the NVIDIA CUDA-X collection of microservices, libraries, tools and technologies — enables payment companies to speed up data processing and transform raw data into powerful features at scale. These companies can fuel their AI models and integrate them with graph neural networksto uncover hidden, large-scale fraud patterns by analyzing relationships across different transactions, users and devices. The use of gradient-boosted decision trees — a type of ML algorithm — tapping into libraries such as XGBoost, has long been the standard for fraud detection. The new AI Blueprint for financial fraud detection enhances the XGBoost ML model with NVIDIA CUDA-X Data Science libraries including GNNs to generate embeddings that can be used as additional features to help reduce false positives. The GNN embeddings are fed into XGBoost to create and train a model that can then be orchestrated. 
In addition, NVIDIA Dynamo-Triton, formerly NVIDIA Triton Inference Server, boosts real-time inferencing while optimizing AI model throughput, latency and utilization. NVIDIA CUDA-X Data Science and Dynamo-Triton are included with NVIDIA AI Enterprise. Leading Financial Services Organizations Adopt AI During a time when many large North American financial institutions are reporting online or mobile fraud losses continue to increase, AI is helping to combat this trend. American Express, which began using AI to fight fraud in 2010, leverages fraud detection algorithms to monitor all customer transactions globally in real time, generating fraud decisions in just milliseconds. Using a combination of advanced algorithms, one of which tapped into the NVIDIA AI platform, American Express enhanced model accuracy, advancing the company’s ability to better fight fraud. European digital bank bunq uses generative AI and large language models to help detect fraud and money laundering. Its AI-powered transaction-monitoring system achieved nearly 100x faster model training speeds with NVIDIA accelerated computing. BNY announced in March 2024 that it became the first major bank to deploy an NVIDIA DGX SuperPOD with DGX H100 systems, which will help build solutions that support fraud detection and other use cases. And now, systems integrators, software vendors and cloud service providers can integrate the new NVIDIA blueprint for fraud detection to boost their financial services applications and help keep customers’ money, identities and digital accounts safe. Explore the NVIDIA AI Blueprint for financial fraud detection and read this NVIDIA Technical Blog on supercharging fraud detection with GNNs. Learn more about AI for fraud detection by visiting the AI Summit at Money20/20, running this week in Amsterdam. See notice regarding software product information. #bring #receipts #new #nvidia #blueprint
    Bring Receipts: New NVIDIA AI Blueprint Detects Fraudulent Credit Card Transactions With Precision
    blogs.nvidia.com
Editor’s note: This blog, originally published on October 28, 2024, has been updated.

Financial losses from worldwide credit card transaction fraud are projected to reach more than $403 billion over the next decade. The new NVIDIA AI Blueprint for financial fraud detection can help combat this burgeoning epidemic, using accelerated data processing and advanced algorithms to improve AI’s ability to detect and prevent credit card transaction fraud.

Launched this week at the Money20/20 financial services conference, the blueprint provides a reference example for financial institutions to identify subtle patterns and anomalies in transaction data based on user behavior, improving accuracy and reducing false positives compared with traditional methods. It shows developers how to build a financial fraud detection workflow by providing reference code, deployment tools and a reference architecture. Companies can streamline the migration of their fraud detection workflows from traditional compute to accelerated compute using the NVIDIA AI Enterprise software platform and NVIDIA accelerated computing.

The NVIDIA AI Blueprint is available for customers to run on Amazon Web Services, with availability coming soon on Dell Technologies and Hewlett Packard Enterprise. Customers can also use the blueprint through service offerings from NVIDIA partners including Cloudera, EXL, Infosys and SHI International.

Businesses embracing comprehensive machine learning (ML) tools and strategies can see up to an estimated 40% improvement in fraud detection accuracy, boosting their ability to identify and stop fraudsters faster and mitigate harm. Leading financial organizations like American Express and Capital One have been using AI to build proprietary solutions that mitigate fraud and enhance customer protection.

The new AI Blueprint accelerates model training and inference, and demonstrates how these components can be wrapped into a single, easy-to-use software offering, powered by NVIDIA AI. Currently optimized for credit card transaction fraud, the blueprint could be adapted for use cases such as new account fraud, account takeover and money laundering.

Using Accelerated Computing and Graph Neural Networks for Fraud Detection

Traditional data science pipelines lack the compute acceleration to handle the massive data volumes required for effective fraud detection. ML models like XGBoost are effective for detecting anomalies in individual transactions but fall short when fraud involves complex networks of linked accounts and devices.

Helping address these gaps, NVIDIA RAPIDS, part of the NVIDIA CUDA-X collection of microservices, libraries, tools and technologies, enables payment companies to speed up data processing and transform raw data into powerful features at scale. These companies can fuel their AI models and integrate them with graph neural networks (GNNs) to uncover hidden, large-scale fraud patterns by analyzing relationships across different transactions, users and devices.

The use of gradient-boosted decision trees, a type of ML algorithm, tapping into libraries such as XGBoost has long been the standard for fraud detection. The new AI Blueprint for financial fraud detection enhances the XGBoost ML model with NVIDIA CUDA-X Data Science libraries, including GNNs, to generate embeddings that can be used as additional features to help reduce false positives. The GNN embeddings are fed into XGBoost to create and train a model that can then be orchestrated.
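
To make the embeddings-as-features pattern concrete, here is a minimal sketch under stated assumptions: the feature columns, the embedding dimensionality, and the random `gnn_embeddings` placeholder are all hypothetical stand-ins, and this is not blueprint code. Only the concatenate-then-boost idea comes from the text above.

```python
# Minimal sketch: augmenting XGBoost features with precomputed GNN node
# embeddings, following the pipeline described above. The tabular columns and
# the `gnn_embeddings` array are hypothetical placeholders.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_tx = 10_000

# Tabular transaction features (amount, merchant category, hour of day, ...).
tabular = rng.normal(size=(n_tx, 8))

# Per-transaction embeddings produced upstream by a GNN over the
# account/device/merchant graph (here: random placeholders).
gnn_embeddings = rng.normal(size=(n_tx, 32))

labels = rng.integers(0, 2, size=n_tx)  # 1 = fraud, 0 = legitimate

# Concatenate graph-derived embeddings onto the tabular features.
features = np.hstack([tabular, gnn_embeddings])

dtrain = xgb.DMatrix(features, label=labels)
params = {"objective": "binary:logistic", "eval_metric": "aucpr", "max_depth": 6}
model = xgb.train(params, dtrain, num_boost_round=200)

scores = model.predict(xgb.DMatrix(features))  # predicted fraud probabilities
```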
In addition, NVIDIA Dynamo-Triton, formerly NVIDIA Triton Inference Server, boosts real-time inferencing while optimizing AI model throughput, latency and utilization. NVIDIA CUDA-X Data Science and Dynamo-Triton are included with NVIDIA AI Enterprise.

Leading Financial Services Organizations Adopt AI

At a time when many large North American financial institutions report that online and mobile fraud losses continue to increase, AI is helping to combat this trend.

American Express, which began using AI to fight fraud in 2010, leverages fraud detection algorithms to monitor all customer transactions globally in real time, generating fraud decisions in just milliseconds. Using a combination of advanced algorithms, one of which tapped into the NVIDIA AI platform, American Express enhanced model accuracy, advancing the company’s ability to better fight fraud.

European digital bank bunq uses generative AI and large language models to help detect fraud and money laundering. Its AI-powered transaction-monitoring system achieved nearly 100x faster model training speeds with NVIDIA accelerated computing.

BNY announced in March 2024 that it became the first major bank to deploy an NVIDIA DGX SuperPOD with DGX H100 systems, which will help build solutions that support fraud detection and other use cases.

And now, systems integrators, software vendors and cloud service providers can integrate the new NVIDIA blueprint for fraud detection to boost their financial services applications and help keep customers’ money, identities and digital accounts safe.

Explore the NVIDIA AI Blueprint for financial fraud detection and read the NVIDIA Technical Blog on supercharging fraud detection with GNNs. Learn more about AI for fraud detection by visiting the AI Summit at Money20/20, running this week in Amsterdam. See notice regarding software product information.
  • Mistral AI launches code embedding model, claims edge over OpenAI and Cohere

    French startup Mistral AI on Wednesday unveiled Codestral Embed, its first code-specific embedding model, claiming it outperforms rival offerings from OpenAI, Cohere, and Voyage.

    The company said the model supports configurable embedding outputs with varying dimensions and precision levels, allowing users to manage trade-offs between retrieval performance and storage requirements.

    “Codestral Embed with dimension 256 and int8 precision still performs better than any model from our competitors,” Mistral AI said in a statement.

Codestral Embed is designed for use cases such as code completion, editing, or explanation tasks. It can also be applied in semantic search, duplicate detection, and repository-level analytics across large-scale codebases, the company said. “Codestral Embed supports unsupervised grouping of code based on functionality or structure,” Mistral AI added. “This is useful for analyzing repository composition, identifying emergent architecture patterns, or feeding into automated documentation and categorization systems.”

The model is available through Mistral’s API under the name codestral-embed-2505, priced at $0.15 per million tokens. A batch API version is offered at a 50 percent discount, and on-premise deployments are available through direct consultation with the company’s applied AI team.

The launch follows Mistral’s recent introduction of the Agents API, which the company said complements its Chat Completion API and is intended to simplify the development of agent-based applications.
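
As a hedged sketch of what the configurable outputs look like in practice, the snippet below calls the model through the official `mistralai` Python client. The `output_dimension` and `output_dtype` parameter names follow Mistral’s published documentation at the time of writing but should be verified against the current API reference; the code snippets themselves are arbitrary examples.

```python
# Hedged sketch: requesting compact code embeddings from Codestral Embed via
# the `mistralai` Python client. The dimension/precision parameter names are
# assumptions based on Mistral's docs; verify against the current reference.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

snippets = [
    "def add(a, b):\n    return a + b",
    "fn add(a: i32, b: i32) -> i32 { a + b }",
]

resp = client.embeddings.create(
    model="codestral-embed-2505",
    inputs=snippets,
    output_dimension=256,   # smaller vectors trade retrieval quality for storage
    output_dtype="int8",    # quantized precision, per the claim quoted above
)

vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # 2 snippets, 256 dimensions each
```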

    Enterprise interest in embeddings

    Advanced code embedding models are gaining traction as key tools in enterprise software development, offering improvements in productivity, code quality, and risk management across the software lifecycle.

    “Models like Mistral’s Codestral Embed enable precise semantic code search and similarity detection, allowing enterprises to quickly identify reusable code and near-duplicates across large repositories,” said Prabhu Ram, VP of the industry research group at Cybermedia Research. “By facilitating rapid retrieval of relevant code snippets for bug fixes, feature enhancements, or onboarding, these embeddings significantly improve maintenance workflows.”
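
The near-duplicate detection Ram describes reduces to a simple pattern once embeddings exist: compare snippet vectors by cosine similarity and flag high-scoring pairs. The sketch below uses random placeholder vectors where a real pipeline would plug in embeddings from any code embedding model (for example, the Codestral Embed call sketched earlier); the 0.95 threshold is an illustrative assumption.

```python
# Minimal sketch of embedding-based near-duplicate detection: cosine
# similarity over code-snippet vectors. The random vectors stand in for real
# snippet embeddings; the threshold is an illustrative choice.
import numpy as np

def cosine_sim_matrix(vectors: np.ndarray) -> np.ndarray:
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed @ normed.T

def find_near_duplicates(vectors: np.ndarray, threshold: float = 0.95):
    sims = cosine_sim_matrix(vectors)
    n = len(vectors)
    return [(i, j, float(sims[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if sims[i, j] >= threshold]

vectors = np.random.default_rng(1).normal(size=(100, 256))
for i, j, score in find_near_duplicates(vectors):
    print(f"snippets {i} and {j} look like near-duplicates (sim={score:.3f})")
```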

    However, despite promising early benchmarks, the long-term value of such models will depend on how well they perform in production environments.

    Factors such as ease of integration, scalability across enterprise systems, and consistency under real-world coding conditions will play a crucial role in determining their adoption.

    “Codestral Embed’s strong technical foundation and flexible deployment options make it a compelling solution for AI-driven software development, though its real-world impact will require validation beyond initial benchmark results,” Ram added.

    Further reading

    Vector Institute aims to clear up confusion about AI model performance

    When LLMs become influencers

    Researchers reveal flaws in AI agent benchmarking

    What misleading Meta Llama 4 benchmark scores show enterprise leaders about evaluating AI performance claims

    How CIOs navigate generative AI in the enterprise
www.computerworld.com
  • Apple and Duke Researchers Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy

Long chain-of-thought (CoT) reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While reinforcement learning (RL) is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models to alternate between thinking and answering, but this remains a challenge. 
RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency. 
    Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU. 
    The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning. 
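
To make the reward design concrete, here is a minimal sketch of the rule-based scheme just described: a format check on the <think>/<answer> template, a final-accuracy reward, and an intermediate reward granted only conditionally. The specific weights and string-matching rules are illustrative assumptions, not the paper’s exact implementation.

```python
# Minimal sketch of the rule-based reward described above: a format reward for
# the <think>/<answer> template, a final-accuracy reward, and an intermediate
# reward granted only when the final answer is correct ("conditional").
# Weights and matching rules are illustrative assumptions.
import re

TAG_PATTERN = re.compile(r"<(think|answer)>(.*?)</\1>", re.DOTALL)

def interleaved_reward(response: str, gold_final: str,
                       gold_intermediates: list[str]) -> float:
    segments = TAG_PATTERN.findall(response)

    # Format reward: output must contain tagged segments and end with an answer.
    well_formed = bool(segments) and segments[-1][0] == "answer"
    format_reward = 0.2 if well_formed else 0.0

    answers = [text.strip() for tag, text in segments if tag == "answer"]
    final_correct = bool(answers) and answers[-1] == gold_final
    final_reward = 1.0 if final_correct else 0.0

    # Conditional intermediate reward: only counted when the final answer is
    # right, so the model cannot farm partial credit at the expense of
    # overall correctness.
    intermediate_reward = 0.0
    if final_correct and gold_intermediates:
        hits = sum(a in gold_intermediates for a in answers[:-1])
        intermediate_reward = 0.5 * hits / len(gold_intermediates)

    return format_reward + final_reward + intermediate_reward
```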
The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks. 

In conclusion, the study explores how interleaved reasoning, where models alternate between reasoning and generating intermediate answers, can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with proximal policy optimization (PPO) showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools. 

www.marktechpost.com
  • Unstructured Data Management Tips

John Edwards, Technology Journalist & Author | May 26, 2025 | 5 Min Read | Image: Luis Moreira via Alamy Stock Photo

Structured data, such as names and phone numbers, fits neatly into rows and columns. Unstructured data, however, has no fixed schema and may have a highly complex format, such as audio files or web pages. Unfortunately, there's no single best way to effectively manage unstructured data. On the bright side, there are several approaches that can be used to successfully tackle this critical, yet persistently elusive, challenge. Here are five tested ways to achieve effective unstructured data management, from experts who participated in online interviews.

Tip 1. Use AI-powered vector databases combined with retrieval-augmented generation

"One of the most effective methods I've seen is using AI-powered vector databases combined with retrieval-augmented generation," says Anbang Xu, founder of AI video generator firm Jogg.AI. A former senior software engineer at Google, Xu suggests that instead of forcing unstructured data into rigid schemas, using vector databases allows enterprises to store and retrieve data based on contextual meaning rather than exact keyword matches. "This is especially powerful for text, audio, video, and image data, where traditional search methods fall short," he notes.

For example, Xu says, organizations using AI-powered embeddings can organize and query vast amounts of unstructured data by meaning rather than syntax. "This is what powers advanced AI applications like intelligent search, chatbots, and recommendation systems," he explains. "At Jogg.AI, we've seen first-hand how AI-driven indexing and retrieval make it significantly easier to turn raw, unstructured data into actionable insights."

Tip 2. Take a schema-on-read approach

Another innovative approach to managing unstructured data is schema-on-read. "Unlike traditional databases, which define the schema -- the data's structure -- before it's stored, schema-on-read defers this process until the data is actually read or queried," says Kamal Hathi, senior vice president and general manager at Splunk, a Cisco company, which makes machine-generated data monitoring and analysis software.

This approach is particularly effective for unstructured and semi-structured data, where the schema is not predefined or rigid, Hathi says. "Traditional databases require a predefined schema, which makes working with unstructured data challenging and less flexible." The key advantage of schema-on-read is that it enables users to work with raw data without needing to apply traditional extract-transform-load (ETL) processes, Hathi states. "This, in turn, allows for working with the diversity typically seen in machine-generated data, such as system and application telemetry logs." A minimal code sketch of this pattern follows below.

Tip 3. Look to the cloud

Manage unstructured data by integrating it with structured data in a cloud environment using metadata tagging and AI-driven classifications, suggests Cam Ogden, a senior vice president at data integrity firm Precisely. "Traditionally, structured data -- like customer databases or financial records -- resides in well-organized systems such as relational databases or data warehouses," he says. However, to fully leverage all of their data, organizations need to break down the silos that separate structured data from other forms of data, including unstructured data such as text, images, or log files. This is where the cloud comes into play.
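
Picking up the schema-on-read idea from Tip 2, here is a minimal, hedged sketch using DuckDB: raw JSON event logs are stored as-is, and a schema is inferred only at query time. The file name and event fields are hypothetical.

```python
# Hedged sketch of schema-on-read with DuckDB: raw JSON events are landed
# without declaring a schema, and structure is inferred only when queried.
import json
import duckdb

# Land raw, heterogeneous events as-is; no upfront schema or ETL step.
events = [
    {"ts": "2025-05-26T10:00:00Z", "service": "api", "latency_ms": 42},
    {"ts": "2025-05-26T10:00:01Z", "service": "api", "error": "timeout"},
]
with open("events.jsonl", "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")

# The schema is inferred at read time; fields missing from some records
# simply come back as NULL.
rows = duckdb.sql("""
    SELECT service,
           avg(latency_ms) AS avg_latency,
           count(error)    AS errors
    FROM read_json_auto('events.jsonl')
    GROUP BY service
""").fetchall()
print(rows)
```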
Integrating structured and unstructured data in the cloud allows for more comprehensive analytics, enabling organizations to extract deeper insights from previously siloed information, Ogden says. AI-powered tools can classify and enrich both structured and unstructured data, making it easier to discover, analyze, and govern in a central platform, he notes. "The cloud offers the scalability and flexibility required to handle large volumes of data while supporting dynamic analytics workloads." Additionally, cloud platforms offer advanced data governance capabilities, ensuring that both structured and unstructured data remain secure, compliant, and aligned with business objectives. "This approach not only optimizes data management but also positions organizations to make more informed and effective data-driven decisions in real time."

Tip 4. Use AI-powered classification and indexing

One of the best ways to get a grip on unstructured data is to use AI-powered classification and indexing, says Adhiran Thirmal, a senior solutions engineer at cybersecurity firm Security Compass. "With machine learning (ML) and natural language processing (NLP), you can automatically sort, tag, and organize data based on its content and context," he explains. "Pairing this approach with a scalable data storage system, like a data lake or object storage, makes it easier to find and use information when you need it."

AI takes the manual work out of organizing data, Thirmal says. "No more wasting time digging through files or struggling to keep things in order," he states. "AI can quickly surface the information you need, reducing human error and improving efficiency. It's also excellent for compliance, ensuring sensitive data -- like personal or financial information -- is properly handled and protected."

Tip 5. Create a unified, sovereign data platform

An innovative approach to managing unstructured data goes beyond outdated data lake methods, says Benjamin Anderson, senior vice president of technology at database services provider EnterpriseDB. A unified, sovereign data platform integrates unstructured, semi-structured, and structured data in a single system, eliminating the need for separate solutions. "This approach delivers quality-of-service features previously available only for structured data," he explains. "With a hybrid control plane, organizations can centrally manage their data across multiple environments, including various cloud platforms and on-premises infrastructure."

When it comes to managing diverse forms of data, whether structured, unstructured, or semi-structured, the traditional approach required multiple databases and storage solutions, adding operational complexity, cost, and compliance risk, Anderson notes. "Consolidating structured and unstructured data into a single multi-model data platform will help accelerate transactional, analytical, and AI workloads."
www.informationweek.com
  • Abstracts: Zero-shot models in single-cell biology with Alex Lu

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. On today’s episode, I’m talking to Alex Lu, a senior researcher at Microsoft Research and co-author of a paper called “Assessing the Limits of Zero-Shot Foundation Models in Single-Cell Biology.” Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts! 

    ALEX LU: Yeah, I’m really excited to be joining you today. 
    HUIZINGA: So let’s start with a little background of your work. In just a few sentences, tell us about your study and more importantly, why it matters. 
LU: Absolutely. And before I dive in, I want to give a shout out to the MSR research intern who actually did this work. This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she’s the lead author on the study. But basically, in this research, we study single-cell foundation models, which have really recently rocked the world of biology, because they basically claim to be able to use AI to unlock understanding about single-cell biology. For a myriad of applications, everything from understanding how single cells differentiate into different kinds of cells to discovering new drugs for cancer, biologists will conduct experiments where they measure how much of every gene is expressed inside of just one single cell. So these experiments give us a powerful view into the cell’s internal state. But measurements from these experiments are incredibly complex. There are about 20,000 different human genes. So you get this really long chain of numbers that measures how much there is of each of those 20,000 genes, and deriving meaning from that chain of numbers is really difficult. And single-cell foundation models claim to be capable of unraveling deeper insights than ever before. So that’s the claim that these works have made. And in our recent paper, we showed that these models may actually not live up to these claims. Basically, we showed that single-cell foundation models perform worse, in settings that are fundamental to biological discovery, than much simpler machine learning and statistical methods that were used in the field before single-cell foundation models emerged and are the go-to standard for unpacking meaning from these complicated experiments. So in a nutshell, we should care about these results because they have implications for the toolkits that biologists use to understand their experiments. Our work suggests that single-cell foundation models may not be appropriate for practical use just yet, at least in the discovery applications that we cover. 
    HUIZINGA: Well, let’s go a little deeper there. Generative pre-trained transformer models, GPTs, are relatively new on the research scene in terms of how they’re being used in novel applications, which is what you’re interested in, like single-cell biology. So I’m curious, just sort of as a foundation, what other research has already been done in this area, and how does this study illuminate or build on it? 
    LU: Absolutely. Okay, so we were the first to notice and document this issue in single-cell foundation models specifically. And this is because we proposed evaluation methods that, while common in other areas of AI, had yet to be commonly used to evaluate single-cell foundation models. We performed something called zero-shot evaluation on these models. Prior to our work, most works evaluated single-cell foundation models with fine-tuning. And the way to understand this is that single-cell foundation models are trained in a way that tries to expose them to millions of single cells. But because you’re exposing them to such a large amount of data, you can’t really rely on this data being annotated or labeled in any particular fashion. So in order for them to actually do the specialized tasks that are useful for biologists, you typically have to add a second training phase. We call this the fine-tuning phase, where you have a smaller number of single cells, but now they are actually labeled for the specialized tasks that you want the model to perform. So most people typically evaluate the performance of single-cell models after they fine-tune them. However, what we noticed is that evaluating these fine-tuned models has several problems. First, it might not actually align with how these models are going to be used by biologists. A critical distinction in biology is that we’re not just trying to interact with an agent that has access to knowledge through its pre-training; we’re trying to extend these models to discover new biology beyond that sphere of knowledge. In many cases, the point of using these models, the point of the analysis, is to explore the data with the goal of potentially discovering something new about the single cells the biologists worked with that they weren’t aware of before. In these kinds of cases, it is really tough to fine-tune a model. There’s a bit of a chicken-and-egg problem going on: if you don’t know, for example, that there’s a new kind of cell in the data, you can’t really instruct the model to help us identify these kinds of new cells. So in other words, fine-tuning these models for those tasks essentially becomes impossible. The second issue is that evaluations of fine-tuned models can sometimes mislead us in our ability to understand how these models are working. For example, the claim behind single-cell foundation model papers is that these models learn a foundation of biological knowledge by being exposed to millions of single cells in their first training phase, right? But when you fine-tune a model, it may just be that any performance increase you see is simply because you’re using a massive model that is really sophisticated, really large. Even with hardly any exposure to cells at all, that model is going to do perfectly fine. So going back to our paper, what’s really different about it is that we propose zero-shot evaluation for these models. What that means is that we do not fine-tune the model at all; instead, we keep the model frozen during the analysis step. How we specialize it to a downstream task instead is that we extract the model’s internal embedding of single-cell data, which is essentially a numerical vector that contains the information the model is extracting and organizing from the input data. It’s essentially how the model perceives single-cell data and how it organizes it in its own internal state. So basically, this is the better way for us to test the claim that single-cell foundation models are learning foundational biological insights. Because if they actually are learning these insights, those insights should be present in the model’s embedding space even before we fine-tune the model. 
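    To make that protocol concrete, here is a minimal sketch of the frozen-model, zero-shot read-out Lu describes. The small encoder is a stand-in, not an actual single-cell foundation model, and all shapes are illustrative; the essential point is that the weights are never updated and everything downstream operates on the extracted embeddings alone.

        # Minimal sketch: freeze a model and extract per-cell embeddings.
        # The encoder below is a stand-in for a single-cell foundation model.
        import torch
        import torch.nn as nn

        n_genes, n_cells, emb_dim = 20_000, 128, 256

        encoder = nn.Sequential(nn.Linear(n_genes, 512), nn.ReLU(),
                                nn.Linear(512, emb_dim))

        encoder.eval()                  # inference mode: no training-time behavior
        for p in encoder.parameters():  # freeze: no fine-tuning anywhere
            p.requires_grad_(False)

        expression = torch.rand(n_cells, n_genes)  # cells x genes matrix
        with torch.no_grad():                      # zero-shot: forward pass only
            embeddings = encoder(expression)       # shape: (n_cells, emb_dim)

        print(embeddings.shape)  # torch.Size([128, 256])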
    HUIZINGA: Well, let’s talk about methodology on this particular study. You focused on assessing existing models in zero-shot learning for single-cell biology. How did you go about evaluating these models? 
    LU: Yes, so let’s dive deeper into how zero-shot evaluations are conducted, okay? The premise here is that if these models are truly learning foundational biological insights, then in the model’s internal representation, cells that are biologically similar should be close together, while cells that are biologically distinct should be further apart. And that is exactly what we tested in our study. We compared two popular single-cell foundation models, and importantly, we compared these models against older, reliable tools that biologists have used for exploratory analyses. These include simpler machine learning methods like scVI, statistical algorithms like Harmony, and even basic data pre-processing steps, like filtering your data down to a more robust subset of genes. So basically, we tested embeddings from our two single-cell foundation models against these baselines in a variety of settings, testing the hypothesis that biologically similar cells should be close together across these distinct methods and datasets. 
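    A sketch of the kind of test Lu describes: do cells that share a label sit closer together in an embedding than cells with different labels? Silhouette score is used below as one common proxy for that property, not necessarily the paper's exact metric suite, and random data stands in for both the baseline pipeline's input and the foundation model's output.

        # Sketch: score an embedding by how well same-type cells cluster,
        # comparing a simple HVG+PCA baseline against a stand-in model embedding.
        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.metrics import silhouette_score

        rng = np.random.default_rng(0)
        n_cells, n_genes = 300, 2_000
        expression = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
        cell_types = rng.integers(0, 3, size=n_cells)  # stand-in labels

        def embedding_quality(embedding, labels):
            """Higher score = same-type cells are closer together."""
            return silhouette_score(embedding, labels)

        # Baseline: keep the 500 most variable genes, log-transform, then PCA,
        # the kind of simple pipeline the study compared against.
        hvg = expression[:, np.argsort(expression.var(axis=0))[-500:]]
        baseline = PCA(n_components=50).fit_transform(np.log1p(hvg))

        # Embeddings from a frozen foundation model would be plugged in here;
        # random vectors stand in for them in this sketch.
        foundation = rng.normal(size=(n_cells, 256))

        print("baseline  :", embedding_quality(baseline, cell_types))
        print("foundation:", embedding_quality(foundation, cell_types))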
    HUIZINGA: Well, and as you did the testing, you obviously were aiming toward research findings, which is my favorite part of a research paper, so tell us what you did find and what you feel the most important takeaways of this paper are. 
    LU: Absolutely. So in a nutshell, we found that these two newly proposed single-cell foundation models substantially underperformed compared to the older methods. To contextualize why that is such a surprising result: there is a lot of hype around these methods. So basically, yeah, it’s a very surprising result, given how hyped these models are and how people were already adopting them. But our results caution that they shouldn’t really be adopted for these purposes yet. 
    HUIZINGA: Yeah, so there’s serious real-world impact here: if models are being adopted and adapted in these applications, how reliable are they, et cetera. So given that, who would you say benefits most from what you’ve discovered in this paper and why? 
    LU: Okay, so two ways, right? I think this has immediate implications for the way we do discovery in biology. And as I’ve discussed, these experiments are used for cases that have practical impact: drug discovery applications, investigations into basic biology. But let’s also talk about the impact for methodologists, the people who are trying to improve these single-cell foundation models. I think, at base, they’re really exciting proposals. If you look at what some of the prior, less sophisticated methods couldn’t do, they tended to be more bespoke. The excitement of single-cell foundation models is that you have one general-purpose model that can be used for everything, and while they’re not living up to that purpose just yet, I think it’s important that we continue to bank on that vision, right? Single-cell foundation models are a really new proposal, so it makes sense that we may not know how to fully evaluate them just yet. You can view our work as a step towards more rigorous evaluation of these models. Now that we’ve done this experiment, methodologists can use it as a signal for how to improve the models and whether they’re going in the right direction. And in fact, you are seeing more and more papers adopt zero-shot evaluations since we put out our paper. So this essentially helps future computer scientists working on single-cell foundation models know how to train better models. 
    HUIZINGA: That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot learning research in biology, and what foundation might this paper lay for future research agendas in the field? 
    LU: Yeah, absolutely. So now that we’ve shown single-cell foundation models don’t necessarily perform well, I think the natural question on everyone’s mind is how do we actually train single-cell foundation models that live up to that vision, that can perform in helping us discover new biology? In the short term, we’re actively investigating many hypotheses in this area. For example, my colleagues Lorin Crawford and Ava Amini, who were co-authors on the paper, recently put out a pre-print examining how training data composition impacts model performance. One of the surprising findings they had was that many of the training datasets people use to train single-cell foundation models are highly redundant, to the point that you can sample just a tiny fraction of the data and get basically the same performance. But you can also look forward to many other explorations in this area as we continue to develop this research. Zooming out to the bigger picture, I think one major takeaway from this paper is that developing AI methods for biology requires thought about the context of use, right? I mean, this is obvious for any AI method, but I think people have gotten too used to taking methods that work for natural vision or natural language, maybe in the consumer domain, and extrapolating those methods to biology, expecting they will work in the same way. So for example, one reason why zero-shot evaluation was not routine practice for single-cell foundation models prior to our work, and we were the first to fully establish it as a practice for the field, was, I think, that people working in AI for biology have been looking to these more mainstream AI domains to shape their work. And with single-cell foundation models, many of these models are adapted from large language models in natural language processing, recycling the exact same architecture, the exact same code, basically just recycling practices from that field. When you look at practices in more mainstream domains, zero-shot evaluation is definitely explored, but it’s more of a niche rather than being considered central to model understanding. So again, because biology is different from mainstream language processing, it’s a scientific discipline, zero-shot evaluation becomes much more important, and you have no choice but to use these models zero-shot. So in other words, I think we need to be thinking carefully about what it is that makes training a model for biology different from training a model, for example, for consumer purposes. 

    HUIZINGA: Alex Lu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read it on the Genome Biology website. See you next time on Abstracts! 