


AI/ML Research and Dev News Platform (1 million+ monthly traffic) | 50k+ ML subreddit | Contact: Asif@marktechpost.com
Recent Updates
-
A Coding Implementation of Extracting Structured Data Using LangSmith, Pydantic, LangChain, and Claude 3.7 Sonnet
www.marktechpost.com
Unlock the power of structured data extraction with LangChain and Claude 3.7 Sonnet, transforming raw text into actionable insights. This tutorial focuses on tracing LLM tool calling with LangSmith, enabling real-time debugging and performance monitoring of your extraction system. We use Pydantic schemas for precise data formatting and LangChain's flexible prompting to guide Claude, relying on example-driven refinement instead of complex training. The result is a glimpse into LangSmith's capabilities, showing how to build robust extraction pipelines for diverse applications, from document processing to automated data entry.

First, we need to install the necessary packages. We'll use langchain-core and langchain_anthropic to interface with the Claude model (langchain itself provides the init_chat_model helper used below).

!pip install --upgrade langchain langchain-core
!pip install langchain_anthropic

If you're using LangSmith for tracing and debugging, you can set up the environment variables:

LANGSMITH_TRACING=True
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY="Your API KEY"
LANGSMITH_PROJECT="extraction_api"

Next, we must define the schema for the information we want to extract. We'll use Pydantic models to create a structured representation of a person.

from typing import Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""
    name: Optional[str] = Field(default=None, description="The name of the person")
    hair_color: Optional[str] = Field(
        default=None, description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(
        default=None, description="Height measured in meters"
    )

Now, we'll define a prompt template that instructs Claude on how to perform the extraction task:

from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("human", "{text}"),
    ]
)

This template provides clear instructions to the model about its task and how to handle missing information.

Next, we'll initialize the Claude model that will perform our information extraction:

import getpass
import os

if not os.environ.get("ANTHROPIC_API_KEY"):
    os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Enter API key for Anthropic: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("claude-3-7-sonnet-20250219", model_provider="anthropic")

Now, we'll configure our LLM to return structured output according to our schema:

structured_llm = llm.with_structured_output(schema=Person)

This key step tells the model to format its responses according to our Person schema.

Let's test our extraction system with a simple example:

text = "Alan Smith is 6 feet tall and has blond hair."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)

Now, let's try a more complex example:

from typing import List

class Data(BaseModel):
    """Container for extracted information about people."""
    people: List[Person] = Field(default_factory=list, description="List of people mentioned in the text")

structured_llm = llm.with_structured_output(schema=Data)

text = "My name is Jeff, my hair is black and I am 6 feet tall. Anna has the same color hair as me."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)

# Next example
text = "The solar system is large, (it was discovered by Nicolaus Copernicus), but earth has only 1 moon."
prompt = prompt_template.invoke({"text": text})
result = structured_llm.invoke(prompt)
print(result)

In conclusion, this tutorial demonstrates how to build a structured information extraction system with LangChain and Claude that transforms unstructured text into organized data about people. The approach uses Pydantic schemas, custom prompts, and example-driven improvement without requiring specialized training pipelines. The system's power comes from its flexibility, domain adaptability, and use of advanced LLM reasoning capabilities. Here is the Colab Notebook.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
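As a small extension of the tutorial above, the extraction chain can be wrapped into a reusable helper for batch processing. This is a sketch, not part of the original post: extract_people is a hypothetical convenience function that assumes the prompt_template, Person, Data, and llm objects defined earlier, and model_dump() assumes Pydantic v2 (use .dict() with v1).

def extract_people(text: str) -> list:
    """Run the extraction chain on one document and return the extracted Person records."""
    structured_llm = llm.with_structured_output(schema=Data)   # reuse the Data container schema
    prompt = prompt_template.invoke({"text": text})            # fill the {text} slot of the prompt
    result = structured_llm.invoke(prompt)                     # Claude returns a Data instance
    return result.people                                       # possibly empty list of Person objects

# Hypothetical usage:
for person in extract_people("Maria is 1.7 meters tall and has red hair."):
    print(person.model_dump())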
-
Qwen Releases the Qwen2.5-VL-32B-Instruct: A 32B Parameter VLM that Surpasses Qwen2.5-VL-72B and Other Models like GPT-4o Miniwww.marktechpost.comIn the evolving field of artificial intelligence, vision-language models (VLMs) have become essential tools, enabling machines to interpret and generate insights from both visual and textual data. Despite advancements, challenges remain in balancing model performance with computational efficiency, especially when deploying large-scale models in resource-limited settings.Qwen has introduced the Qwen2.5-VL-32B-Instruct, a 32-billion-parameter VLM that surpasses its larger predecessor, the Qwen2.5-VL-72B, and other models like GPT-4o Mini, while being released under the Apache 2.0 license. This development reflects a commitment to open-source collaboration and addresses the need for high-performing yet computationally manageable models.Technically, the Qwen2.5-VL-32B-Instruct model offers several enhancements:Visual Understanding: The model excels in recognizing objects and analyzing texts, charts, icons, graphics, and layouts within images.Agent Capabilities: It functions as a dynamic visual agent capable of reasoning and directing tools for computer and phone interactions.Video Comprehension: The model can understand videos over an hour long and pinpoint relevant segments, demonstrating advanced temporal localization.Object Localization: It accurately identifies objects in images by generating bounding boxes or points, providing stable JSON outputs for coordinates and attributes.Structured Output Generation: The model supports structured outputs for data like invoices, forms, and tables, benefiting applications in finance and commerce.These features enhance the models applicability across various domains requiring nuanced multimodal understanding. Empirical evaluations highlight the models strengths:Vision Tasks: On the Massive Multitask Language Understanding (MMMU) benchmark, the model scored 70.0, surpassing the Qwen2-VL-72Bs 64.5. In MathVista, it achieved 74.7 compared to the previous 70.5. Notably, in OCRBenchV2, the model scored 57.2/59.1, a significant improvement over the prior 47.8/46.1. In Android Control tasks, it achieved 69.6/93.3, exceeding the previous 66.4/84.4.Text Tasks: The model demonstrated competitive performance with a score of 78.4 on MMLU, 82.2 on MATH, and an impressive 91.5 on HumanEval, outperforming models like GPT-4o Mini in certain areas.These results underscore the models balanced proficiency across diverse tasks. In conclusion, the Qwen2.5-VL-32B-Instruct represents a significant advancement in vision-language modeling, achieving a harmonious blend of performance and efficiency. Its open-source availability under the Apache 2.0 license encourages the global AI community to explore, adapt, and build upon this robust model, potentially accelerating innovation and application across various sectors.Check outthe Model Weights.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. 
With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
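For readers who want to try the model, the snippet below is a minimal inference sketch. It assumes the Hugging Face checkpoint name Qwen/Qwen2.5-VL-32B-Instruct, a recent transformers release with Qwen2.5-VL support, and the qwen-vl-utils helper package; the image URL, prompt, and generation settings are placeholders, not values from the release notes.

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"  # assumed checkpoint name
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/invoice.png"},  # placeholder image
        {"type": "text", "text": "Extract the line items from this invoice as JSON."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding the model's answer
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])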
-
This AI Paper from NVIDIA Introduces Cosmos-Reason1: A Multimodal Model for Physical Common Sense and Embodied Reasoningwww.marktechpost.comArtificial intelligence systems designed for physical settings require more than just perceptual abilitiesthey must also reason about objects, actions, and consequences in dynamic, real-world environments. These systems must understand spatial arrangements, cause-and-effect relationships, and the progression of events over time. In applications like robotics, self-driving vehicles, or assistive technologies, AI must comprehend its surroundings physical constraints and affordances to make intelligent and safe decisions. This fusion of perception with structured reasoning about physical dynamics forms the backbone of Physical AI.A core issue for such systems is their inability to conclude physical environments using integrated visual and contextual information. Although vision-language models have made significant progress, they still struggle to determine whether a task has been completed, what action should follow next, or whether a proposed action is feasible. The gap between perception and decision-making becomes especially critical when AI needs to operate independently and interpret tasks from complex visual scenarios. These systems remain unreliable in high-stakes or fast-changing environments without mechanisms to verify their reasoning.Existing models such as LLaVA, GPT-4o, and Gemini 2.0 Flash are proficient in handling text and visual data but underperform physically grounded reasoning. Tasks like identifying temporal order, spatial continuity, or object permanence are rarely handled effectively. Popular benchmarks often fail to evaluate such scenarios, offering limited insight into a models ability to reason about physical events or agent actions. Moreover, current systems usually rely on textual cues rather than making decisions based on visual evidence, leading to inconsistent or incorrect conclusions when applied to the physical world.Researchers from NVIDIA introduced Cosmos-Reason1, a family of vision-language models developed specifically for reasoning about physical environments. These models were released in two sizes: 8 billion and 56 billion parameters. The models were built with a structured approach that included defining ontologies for physical common sense, constructing specialized training data, and designing a comprehensive suite of evaluation benchmarks. These benchmarks test capabilities such as action prediction, task verification, and judgment of physical feasibility. The research team developed datasets including BridgeData V2, RoboVQA, RoboFail, AgiBot, HoloAssist, and AV to rigorously evaluate the models.Cosmos-Reason1 uses a hybrid Mamba-MLP-Transformer architecture that integrates both vision and language components. The training process was conducted in multiple phases. Initially, a vision encoder and language model were pretrained and fine-tuned using general supervised data. Then, a physical AI-specific supervised fine-tuning (SFT) phase introduced datasets focused on space, time, and object interactions. The final reinforcement learning (RL) phase applied rule-based rewards to improve performance in areas like arrow of time detection, spatial puzzles, and object permanence. The RL setup used a modular framework that leveraged distributed computing to scale training efficiently. The model responses were structured using tags, allowing reward systems to evaluate both correctness and reasoning structure. 
Each question had up to nine model-generated responses, and RL training continued for 500 iterations using a global batch size of 128 questions. Evaluation of Cosmos-Reason1 showed a substantial performance increase compared to other models. In the physical common sense benchmark, Cosmos-Reason1-56B achieved an average accuracy of 60.2%, outperforming OpenAI o1, which scored 59.9%. The 8B variant also improved, reaching 52.3%. Cosmos-Reason1-56B scored an average of 63.7% on embodied reasoning tasks, up from a 53.5% baseline. Benchmarks like RoboVQA and HoloAssist showed strong gains, with the 56B model scoring 80.0% and 57.8%, respectively. Cosmos-Reason1-8B improved to 68.7% on intuitive physics tasks, showing strong gains in object permanence and spatial puzzle reasoning. However, the model faced challenges on datasets like RoboFail due to a lack of sufficiently diverse training examples. In conclusion, this research introduces a targeted and layered strategy to advance AI systems that reason about physical interactions. The researchers at NVIDIA created a scalable training method combined with a comprehensive evaluation to tackle long-standing gaps in embodied reasoning. Cosmos-Reason1 demonstrates how structured fine-tuning and reinforcement learning can build AI systems more aligned with real-world physical logic and agent behavior. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
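To make the rule-based reward idea concrete, here is a toy scoring function in the spirit of the RL phase described above. It is an illustrative sketch, not NVIDIA's implementation: the <think>/<answer> tag names and the reward weights are assumptions.

import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score a model response on format (reasoning tags present?) and answer accuracy."""
    reward = 0.0

    # Format reward: exactly one reasoning block and one answer block.
    has_think = len(re.findall(r"<think>.*?</think>", response, flags=re.S)) == 1
    has_answer = len(re.findall(r"<answer>.*?</answer>", response, flags=re.S)) == 1
    if has_think and has_answer:
        reward += 0.2  # assumed weight for well-formed structure

    # Accuracy reward: compare the extracted answer with the reference label.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.S)
    if match and match.group(1).strip().lower() == reference_answer.strip().lower():
        reward += 1.0  # assumed weight for a correct final answer

    return reward

# Example: a well-formed, correct response about the arrow of time
resp = "<think>The cup shatters, so time runs forward.</think><answer>forward</answer>"
print(rule_based_reward(resp, "forward"))  # 1.2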
-
Lyra: A Computationally Efficient Subquadratic Architecture for Biological Sequence Modelingwww.marktechpost.comDeep learning architectures like CNNs and Transformers have significantly advanced biological sequence modeling by capturing local and long-range dependencies. However, their application in biological contexts is constrained by high computational demands and the need for large datasets. CNNs efficiently detect local sequence patterns with subquadratic scaling, whereas Transformers leverage self-attention to model global interactions but require quadratic scaling, making them computationally expensive. Hybrid models, such as Enformers, integrate CNNs and Transformers to balance local and international context modeling, but they still face scalability issues. Large-scale Transformer-based models, including AlphaFold2 and ESM3, have achieved breakthroughs in protein structure prediction and sequence-function modeling. Yet, their reliance on extensive parameter scaling limits their efficiency in biological systems where data availability is often restricted. This highlights the need for more computationally efficient approaches to model sequence-to-function relationships accurately.To overcome these challenges, epistasisthe interaction between mutations within a sequenceprovides a structured mathematical framework for biological sequence modeling. Multilinear polynomials can represent these interactions, offering a principled way to understand sequence-function relationships. State space models (SSMs) naturally align with this polynomial structure, using hidden dimensions to approximate epistatic effects. Unlike Transformers, SSMs utilize Fast Fourier Transform (FFT) convolutions to model global dependencies efficiently while maintaining subquadratic scaling. Additionally, integrating gated depthwise convolutions enhances local feature extraction and expressivity through adaptive feature selection. This hybrid approach balances computational efficiency with interpretability, making it a promising alternative to Transformer-based architectures for biological sequence modeling.Researchers from institutions, including MIT, Harvard, and Carnegie Mellon, introduce Lyra, a subquadratic sequence modeling architecture designed for biological applications. Lyra integrates SSMs to capture long-range dependencies with projected gated convolutions for local feature extraction, enabling efficient O(N log N) scaling. It effectively models epistatic interactions and achieves state-of-the-art performance across over 100 biological tasks, including protein fitness prediction, RNA function analysis, and CRISPR guide design. Lyra operates with significantly fewer parametersup to 120,000 times smaller than existing modelswhile being 64.18 times faster in inference, democratizing access to advanced biological sequence modeling.Lyra consists of two key components: Projected Gated Convolution (PGC) blocks and a state-space layer with depthwise convolution (S4D). With approximately 55,000 parameters, the model includes two PGC blocks for capturing local dependencies, followed by an S4D layer for modeling long-range interactions. PGC processes input sequences by projecting them to intermediate dimensions, applying depthwise 1D convolutions and linear projections, and recombining features through element-wise multiplication. 
S4D leverages diagonal state-space models to compute convolution kernels using matrices A, B, and C, efficiently capturing sequence-wide dependencies through weighted exponential terms and enhancing Lyras ability to model biological data effectively.Lyra is a sequence modeling architecture designed to capture local and long-range dependencies in biological sequences efficiently. It integrates PGCs for localized modeling and diagonalized S4D for global interactions. Lyra approximates complex epistatic interactions using polynomial expressivity, outperforming Transformer-based models in tasks like protein fitness landscape prediction and deep mutational scanning. It achieves state-of-the-art accuracy across various protein and nucleic acid modeling applications, including disorder prediction, mutation impact analysis, and RNA-dependent RNA polymerase detection, while maintaining a significantly smaller parameter count and lower computational cost than existing large-scale models.In conclusion, Lyra introduces a subquadratic architecture for biological sequence modeling, leveraging SSMs to approximate multilinear polynomial functions efficiently. This enables superior modeling of epistatic interactions while significantly reducing computational demands. By integrating PGCs for local feature extraction, Lyra achieves state-of-the-art performance across over 100 biological tasks, including protein fitness prediction, RNA analysis, and CRISPR guide design. It outperforms large foundation models with far fewer parameters and faster inference, requiring only one or two GPUs for training within hours. Lyras efficiency democratizes access to advanced biological modeling with therapeutics, pathogen surveillance, and biomanufacturing applications.Check outthe Paper.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/Fin-R1: A Specialized Large Language Model for Financial Reasoning and Decision-MakingSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Microsoft AI Releases RD-Agent: An AI-Driven Tool for Performing R&D with LLM-based AgentsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/KBLAM: Efficient Knowledge Base Augmentation for Large Language Models Without Retrieval OverheadSana Hassanhttps://www.marktechpost.com/author/sana-hassan/MemQ: Enhancing Knowledge Graph Question Answering with Memory-Augmented Query Reconstruction0 Comments ·0 Shares ·2 Views
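To visualize the PGC block described above, here is a compact PyTorch sketch reconstructed from the description (projection to an intermediate width, depthwise 1D convolution, linear projection, element-wise recombination). The hidden width and kernel size are assumptions, not the authors' configuration.

import torch
import torch.nn as nn

class PGCBlock(nn.Module):
    """Projected Gated Convolution: local feature extraction with element-wise gating."""
    def __init__(self, dim: int, hidden_dim: int = 64, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * hidden_dim)        # project to intermediate width
        self.dw_conv = nn.Conv1d(hidden_dim, hidden_dim,     # depthwise 1D convolution
                                 kernel_size, padding=kernel_size // 2, groups=hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, dim)           # back to model width

    def forward(self, x):                                    # x: (batch, length, dim)
        u, v = self.in_proj(x).chunk(2, dim=-1)
        v = self.dw_conv(v.transpose(1, 2)).transpose(1, 2)  # convolve along the sequence axis
        return self.out_proj(u * v)                          # gate via element-wise multiplication

x = torch.randn(2, 128, 32)                                  # e.g. embedded biological sequences
print(PGCBlock(dim=32)(x).shape)                             # torch.Size([2, 128, 32])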
-
This AI Paper from UC Berkeley Introduces TULIP: A Unified Contrastive Learning Model for High-Fidelity Vision and Language Understandingwww.marktechpost.comRecent advancements in artificial intelligence have significantly improved how machines learn to associate visual content with language. Contrastive learning models have been pivotal in this transformation, particularly those aligning images and text through a shared embedding space. These models are central to zero-shot classification, image-text retrieval, and multimodal reasoning. However, while these tools have pushed boundaries in aligning high-level concepts between modalities, they still face challenges in processing more nuanced, spatially precise, and detailed visual information.One of the major unresolved challenges lies in balancing semantic understanding with high-resolution visual recognition. Most existing contrastive models prioritize broad semantic alignment over spatial fidelity, causing them to underperform in tasks that require an understanding of object count, depth, fine-grained textures, or precise object locations. These limitations arise from how models are trainedoften on large-scale, loosely labeled datasetsand optimization strategies that favor global feature matching over detailed visual analysis. The absence of spatially-aware representations hampers performance in more granular vision tasks.Available models such as CLIP, ALIGN, and SigLIP have achieved strong performance on many classification and retrieval benchmarks. These models leverage large datasets to match image-text pairs in a contrastive manner, bringing semantically similar examples closer together in the embedding space. However, this focus often overlooks detailed representations crucial for specialized tasks. For instance, models trained with only image-text pairs may successfully describe what is present but struggle in tasks like counting distinct objects or distinguishing subtle variations between similar items. Vision-centric models like DINO or MAE offer strong feature extraction but lack language interpretability, making them less suitable for multimodal applications.Researchers from the University of California, Berkeley, introduced a new model called TULIP (Towards Unified Language-Image Pretraining) to address these limitations. Designed as an open-source, plug-in replacement for existing CLIP-like models, TULIP enhances the integration of semantic alignment with high-fidelity visual representation. The innovation combines several contrastive learning techniques with generative data augmentation and reconstruction-based regularization. It is designed to preserve high-level understanding and fine-grained details, bridging the gap between language comprehension and detailed visual analysis.TULIPs methodology integrates three contrastive learning strategies: image-image, image-text, and text-text contrastive learning. This unified framework is powered by a module called GeCo (Generative Contrastive view augmentation), which uses large generative models to create challenging augmentations of images and text. These include semantically identical or subtly altered variations, generating positive and negative contrastive pairs. The image encoder leverages a vision transformer architecture with a masked autoencoder reconstruction loss, while the text encoder utilizes language models to paraphrase the content. 
Regularization objectives encourage the model to retain essential details like texture, layout, and color alongside semantics.Performance benchmarks demonstrate that TULIP achieves notable improvements across various tasks. On ImageNet-1K zero-shot classification, TULIP reaches up to 89.6% accuracy, outperforming SigLIP by 2-3 percentage points across several datasets. In few-shot classification, it nearly doubles performance over SigLIP on RxRx1, increasing accuracy from 4.6% to 9.8%. On MMVP, a vision-language benchmark, TULIP improves performance over SigLIP by more than 3. It also outperforms competing models on the Winoground benchmark, becoming the first CIT model to achieve better-than-random results on group-based reasoning tasks. BLINK evaluations lead to tasks like spatial reasoning and object localization, rivaling or surpassing some GPT-4-based systems.This research provides a compelling solution to a fundamental multimodal learning tradeoff: achieving visual detail and semantic coherence. The research team has shown that introducing generative augmentations and multi-view contrastive techniques into pretraining significantly boosts the models capacity for complex visual and linguistic reasoning. TULIP sets a new direction for future vision-language systems that handle broad and fine-grained understanding in a unified model.Check outthe Paper, Project Page and GitHub Page.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. NikhilNikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.Nikhilhttps://www.marktechpost.com/author/nikhil0980/Meta AI Researchers Introduced SWEET-RL and CollaborativeAgentBench: A Step-Wise Reinforcement Learning Framework to Train Multi-Turn Language Agents for Realistic Human-AI Collaboration TasksNikhilhttps://www.marktechpost.com/author/nikhil0980/OpenAI Introduced Advanced Audio Models gpt-4o-mini-tts, gpt-4o-transcribe, and gpt-4o-mini-transcribe: Enhancing Real-Time Speech Synthesis and Transcription Capabilities for DevelopersNikhilhttps://www.marktechpost.com/author/nikhil0980/How to Use SQL Databases with Python: A Beginner-Friendly TutorialNikhilhttps://www.marktechpost.com/author/nikhil0980/Cloning, Forking, and Merging Repositories on GitHub: A Beginners Guide0 Comments ·0 Shares ·19 Views
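The multi-view contrastive objective underlying TULIP can be illustrated with a standard symmetric InfoNCE loss, applied alike to image-image, image-text, and text-text pairs. This is a generic sketch rather than the released TULIP code; the temperature value is an assumption.

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(emb_a, emb_b, temperature: float = 0.07):
    """InfoNCE over a batch of paired embeddings from two views or modalities."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random "image" and "text" embeddings
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(symmetric_contrastive_loss(img, txt).item())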
-
SuperBPE: Advancing Language Models with Cross-Word Tokenizationwww.marktechpost.comLanguage models (LMs) face a fundamental challenge in how to perceive textual data through tokenization. Current subword tokenizers segment text into vocabulary tokens that cannot bridge whitespace, adhering to an artificial constraint that treats space as a semantic boundary. This practice ignores the reality that meaning often exceeds individual words multi-word expressions like a lot of function as single semantic units, with English speakers mentally storing thousands of such phrases. Cross-linguistically, the same concepts may be expressed as single or multiple words, depending on the language. Notably, some languages like Chinese and Japanese use no whitespace, allowing tokens to span multiple words or sentences without apparent performance degradation.Previous research has explored several approaches beyond traditional subword tokenization. Some studies investigated processing text at multiple granularity levels or creating multi-word tokens through frequency-based n-gram identification. Other researchers have explored multi-token prediction (MTP), allowing language models to predict various tokens in a single step, which confirms models capability to process more than one subword simultaneously. However, these approaches require architectural modifications and fix the number of tokens predicted per step. Some researchers have pursued tokenizer-free approaches, modeling text directly as byte sequences. However, this significantly increases sequence lengths and computational requirements, leading to complex architectural solutions.Researchers from the University of Washington, NVIDIA, and the Allen Institute for AI have proposed SuperBPE, a tokenization algorithm that creates a vocabulary containing both traditional subword tokens and innovative superword tokens that span multiple words. This approach enhances the popular byte-pair encoding (BPE) algorithm by implementing a pretokenization curriculum by initially maintaining whitespace boundaries to learn subword tokens, then removing these constraints to allow for superword token formation. While standard BPE quickly reaches diminishing returns and begins using increasingly rare subwords as vocabulary size grows, SuperBPE continues discovering common multi-word sequences to encode as single tokens, improving encoding efficiency.SuperBPE operates through a two-stage training process that modifies the pretokenization step of traditional BPE, mentioned above. This approach intuitively builds semantic units and combines them into common sequences for greater efficiency. Setting t=T (t is transition point and T is target size) produces standard BPE, while t=0 creates a naive whitespace-free BPE. Training SuperBPE requires more computational resources than standard BPE because, without whitespace pretokenization, the training data consists of extremely long words with minimal deduplication. However, this increased training cost a few hours on 100 CPUs and occurs only once, which is negligible compared to the resources required for language model pretraining.SuperBPE shows impressive performance across 30 benchmarks spanning knowledge, reasoning, coding, reading comprehension, etc. All SuperBPE models outperform the BPE baseline, with the strongest 8B model achieving an average improvement of 4.0% and surpassing the baseline on 25 out of 30 individual tasks. Multiple-choice tasks show substantial gains, with a +9.7% improvement. 
The only statistically significant underperformance occurs in the LAMBADA task, where SuperBPE experiences a final accuracy drop from 75.8% to 70.6%. Moreover, all reasonable transition points yield stronger results than the baseline. The most encoding-efficient transition point delivers a +3.1% performance improvement while reducing inference compute by 35%. In conclusion, the researchers introduced SuperBPE, a more effective tokenization approach developed by enhancing the standard BPE algorithm to incorporate superword tokens. Despite tokenization serving as the fundamental interface between language models and text, tokenization algorithms have remained relatively static. SuperBPE challenges this status quo by recognizing that tokens can extend beyond traditional subword boundaries to include multi-word expressions. SuperBPE tokenizers enable language models to achieve superior performance across numerous downstream tasks while reducing inference computational costs. These advantages require no modifications to the underlying model architecture, making SuperBPE a seamless replacement for traditional BPE in modern language model development pipelines. Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
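The pretokenization curriculum is easy to demonstrate with a toy byte-pair-encoding loop: for the first few merges, candidate pairs that would cross whitespace are excluded; afterwards the constraint is lifted so superword tokens can form. This is a didactic sketch operating on characters, not the actual SuperBPE training code, and the corpus and merge counts are made up.

from collections import Counter

def train_superbpe(text: str, num_merges: int, transition: int):
    """Toy SuperBPE: the first `transition` merges respect whitespace; later merges may cross it."""
    seq = list(text)                      # start from characters, spaces included
    merges = []
    for step in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if step < transition:             # stage 1: subword tokens only
            pairs = Counter({p: c for p, c in pairs.items()
                             if " " not in p[0] and " " not in p[1]})
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0                 # apply the merge greedily left to right
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

corpus = "a lot of people say a lot of things a lot of the time "
merges, tokens = train_superbpe(corpus, num_merges=30, transition=10)
print(tokens)   # after the transition, multi-word tokens such as "a lot of " can appear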
-
TxAgent: An AI Agent that Delivers Evidence-Grounded Treatment Recommendations by Combining Multi-Step Reasoning with Real-Time Biomedical Tool Integrationwww.marktechpost.comPrecision therapy has emerged as a critical approach in healthcare, tailoring treatments to individual patient profiles to optimise outcomes while reducing risks. However, determining the appropriate medication involves a complex analysis of numerous factors: patient characteristics, comorbidities, potential drug interactions, contraindications, current clinical guidelines, drug mechanisms, and disease biology. While Large Language Models (LLMs) have demonstrated therapeutic task capabilities through pretraining and fine-tuning medical data, they face significant limitations. These models lack access to updated biomedical knowledge, frequently generate hallucinations, and struggle to reason reliably across multiple clinical variables. Also, retraining LLMs with new medical information proves computationally prohibitive due to catastrophic forgetting. The models also risk incorporating unverified or deliberately misleading medical content from their extensive training data, further compromising their reliability in clinical applications.Tool-augmented LLMs have been developed to address knowledge limitations through external retrieval mechanisms like retrieval-augmented generation (RAG). These systems attempt to overcome hallucination issues by fetching drug and disease information from external databases. However, they still fall short in executing the multi-step reasoning process essential for effective treatment selection. Precision therapy would benefit significantly from iterative reasoning capabilities where models could access verified information sources, systematically evaluate potential interactions, and dynamically refine treatment recommendations based on comprehensive clinical analysis.Researchers from Harvard Medical School, MIT Lincoln Laboratory, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Broad Institute of MIT and Harvard, and Harvard Data Science Initiative introduce TXAGENT, representing an innovative AI system delivering evidence-grounded treatment recommendations by integrating multi-step reasoning with real-time biomedical tools. The agent generates natural language responses while providing transparent reasoning traces that document its decision-making process. It employs goal-driven tool selection, accessing external databases and specialized machine learning models to ensure accuracy. Supporting this framework is TOOLUNIVERSE, a comprehensive biomedical toolbox containing 211 expert-curated tools covering drug mechanisms, interactions, clinical guidelines, and disease annotations. These tools incorporate trusted sources like openFDA, Open Targets, and the Human Phenotype Ontology. To optimize tool selection, TXAGENT implements TOOLRAG, an ML-based retrieval system that dynamically identifies the most relevant tools from TOOLUNIVERSE based on query context.TXAGENTs architecture integrates three core components: TOOLUNIVERSE, comprising 211 diverse biomedical tools; a specialized LLM fine-tuned for multi-step reasoning and tool execution; and the TOOLRAG model for adaptive tool retrieval. Tool compatibility is enabled through TOOLGEN, a multi-agent system that generates tools from API documentation. 
The agent undergoes fine-tuning with TXAGENT-INSTRUCT, an extensive dataset containing 378,027 instruction-tuning samples derived from 85,340 multi-step reasoning traces, encompassing 177,626 reasoning steps and 281,695 function calls. This dataset is generated by QUESTIONGEN and TRACEGEN, multi-agent systems that create diverse therapeutic queries and stepwise reasoning traces covering treatment information and drug data from FDA labels dating back to 1939.TXAGENT demonstrates exceptional capabilities in therapeutic reasoning through its multi-tool approach. The system utilizes numerous verified knowledge bases, including FDA-approved drug labels and Open Targets, to ensure accurate and reliable responses with transparent reasoning traces. It excels in four key areas: knowledge grounding using tool calls, retrieving verified information from trusted sources; goal-oriented tool selection through the TOOLRAG model; multi-step therapeutic reasoning for complex problems requiring multiple information sources; and real-time retrieval from continuously updated knowledge sources. Importantly, TXAGENT successfully identified indications for Bizengri, a drug approved in December 2024, well after its base models knowledge cutoff, by querying the openFDA API directly rather than relying on outdated internal knowledge.TXAGENT represents a significant advancement in AI-assisted precision medicine, addressing critical limitations of traditional LLMs through multi-step reasoning and targeted tool integration. By generating transparent reasoning trails alongside recommendations, the system provides interpretable decision-making processes for therapeutic problems. The integration of TOOLUNIVERSE enables real-time access to verified biomedical knowledge, allowing TXAGENT to make recommendations based on current data rather than static training information. This approach enables the system to stay current with newly approved medications, assess appropriate indications, and deliver evidence-based prescriptions. By grounding all responses in verified sources and providing traceable decision steps, TXAGENT establishes a new standard for trustworthy AI in clinical decision support.Check outthe Paper, Project Page and GitHub Page.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. Mohammad AsjadAsjad is an intern consultant at Marktechpost. He is persuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.Mohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Building a Retrieval-Augmented Generation (RAG) System with FAISS and Open-Source LLMsMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Meet PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PCMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Implementing Text-to-Speech TTS with BARK Using Hugging Faces Transformers library in a Google Colab environmentMohammad Asjadhttps://www.marktechpost.com/author/mohammad_asjad/Salesforce AI Releases Text2Data: A Training Framework for Low-Resource Data Generation0 Comments ·0 Shares ·42 Views
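The goal-driven tool selection that TOOLRAG performs can be approximated, for intuition, by ranking tool descriptions against the query. The sketch below uses TF-IDF similarity purely for illustration; the real TOOLRAG is a learned retriever over TOOLUNIVERSE, and the tool names and descriptions here are invented examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented stand-ins for entries in a biomedical toolbox
tools = {
    "get_drug_label": "Retrieve FDA drug label sections: indications, warnings, dosage.",
    "check_interactions": "Check drug-drug interaction risks for a list of medications.",
    "lookup_disease_targets": "Query Open Targets for genes associated with a disease.",
    "phenotype_annotations": "Fetch Human Phenotype Ontology annotations for a condition.",
}

def select_tools(query: str, top_k: int = 2):
    """Rank tools by similarity between the query and each tool description."""
    names, descriptions = zip(*tools.items())
    vectorizer = TfidfVectorizer().fit(descriptions + (query,))
    scores = cosine_similarity(vectorizer.transform([query]),
                               vectorizer.transform(descriptions))[0]
    ranked = sorted(zip(names, scores), key=lambda x: -x[1])
    return ranked[:top_k]

print(select_tools("What are the interaction risks if a patient takes these medications together?"))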
-
Meet LocAgent: Graph-Based AI Agents Transforming Code Localization for Scalable Software Maintenancewww.marktechpost.comSoftware maintenance is an integral part of the software development lifecycle, where developers frequently revisit existing codebases to fix bugs, implement new features, and optimize performance. A critical task in this phase is code localization, pinpointing specific locations in a codebase that must be modified. This process has gained significance with modern software projects increasing scale and complexity. The growing reliance on automation and AI-driven tools has led to integrating large language models (LLMs) in supporting tasks like bug detection, code search, and suggestion. However, despite the advancement of LLMs in language tasks, enabling these models to understand the semantics and structures of complex codebases remains a technical challenge researchers strive to overcome.Talking about the problems, one of the most persistent problems in software maintenance is accurately identifying the relevant parts of a codebase that need changes based on user-reported issues or feature requests. Often, issue descriptions in natural language mention symptoms but not the actual root cause in code. This disconnect makes it difficult for developers and automated tools to link descriptions to the exact code elements needing updates. Furthermore, traditional methods struggle with complex code dependencies, especially when the relevant code spans multiple files or requires hierarchical reasoning. Poor code localization contributes to inefficient bug resolution, incomplete patches, and longer development cycles.Prior methods for code localization mostly depend on dense retrieval models or agent-based approaches. Dense retrieval requires embedding the entire codebase into a searchable vector space, which is difficult to maintain and update for large repositories. These systems often perform poorly when issue descriptions lack direct references to relevant code. On the other hand, some recent approaches use agent-based models that simulate a human-like exploration of the codebase. However, they often rely on directory traversal and lack an understanding of deeper semantic links like inheritance or function invocation. This limits their ability to handle complex relationships between code elements not explicitly linked.A team of researchers from Yale University, University of Southern California, Stanford University, and All Hands AI developed LocAgent, a graph-guided agent framework to transform code localization. Rather than depending on lexical matching or static embeddings, LocAgent converts entire codebases into directed heterogeneous graphs. These graphs include nodes for directories, files, classes, and functions and edges to capture relationships like function invocation, file imports, and class inheritance. This structure allows the agent to reason across multiple levels of code abstraction. The system then applies tools like SearchEntity, TraverseGraph, and RetrieveEntity to allow LLMs to explore the system step-by-step. The use of sparse hierarchical indexing ensures rapid access to entities, and the graph design supports multi-hop traversal, which is essential for finding connections across distant parts of the codebase.LocAgent performs indexing within seconds and supports real-time usage, making it practical for developers and organizations. 
The researchers fine-tuned two open-source models, Qwen2.5-7B, and Qwen2.5-32B, on a curated set of successful localization trajectories. These models performed impressively on standard benchmarks. For instance, on the SWE-Bench-Lite dataset, LocAgent achieved 92.7% file-level accuracy using Qwen2.5-32B, compared to 86.13% with Claude-3.5 and lower scores from other models. On the newly introduced Loc-Bench dataset, which contains 660 examples across bug reports (282), feature requests (203), security issues (31), and performance problems (144), LocAgent again showed competitive results, achieving 84.59% Acc@5 and 87.06% Acc@10 at the file level. Even the smaller Qwen2.5-7B model delivered performance close to high-cost proprietary models while costing only $0.05 per example, a stark contrast to the $0.66 cost of Claude-3.5.The core mechanism relies on a detailed graph-based indexing process. Each node, whether representing a class or function, is uniquely identified by a fully qualified name and indexed using BM25 for flexible keyword search. The model enables agents to simulate a reasoning chain that begins with extracting issue-relevant keywords, proceeds through graph traversals, and concludes with code retrievals for specific nodes. These actions are scored using a confidence estimation approach based on prediction consistency over multiple iterations. Notably, when the researchers disabled tools like TraverseGraph or SearchEntity, performance dropped by up to 18%, highlighting their importance. Further, multi-hop reasoning was critical; fixing traversal hops to one led to a decline in function-level accuracy from 71.53% to 66.79%.When applied to downstream tasks like GitHub issue resolution, LocAgent increased the issue pass rate (Pass@10) from 33.58% in baseline Agentless systems to 37.59% with the fine-tuned Qwen2.5-32B model. The frameworks modularity and open-source nature make it a compelling solution for organizations looking for in-house alternatives to commercial LLMs. The introduction of Loc-Bench, with its broader representation of maintenance tasks, ensures fair evaluation without contamination from pre-training data.Some Key Takeaways from the Research on LocAgent include the following:LocAgent transforms codebases into heterogeneous graphs for multi-level code reasoning.It achieved up to 92.7% file-level accuracy on SWE-Bench-Lite with Qwen2.5-32B.Reduced code localization cost by approximately 86% compared to proprietary models. Introduced Loc-Bench dataset with 660 examples: 282 bugs, 203 features, 31 security, 144 performance.Fine-tuned models (Qwen2.5-7B, Qwen2.5-32B) performed comparably to Claude-3.5.Tools like TraverseGraph and SearchEntity proved essential, with accuracy drops when disabled.Demonstrated real-world utility by improving GitHub issue resolution rates.It offers a scalable, cost-efficient, and effective alternative to proprietary LLM solutions.Check outthe Paper and GitHub Page.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. Asif RazzaqWebsite| + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. 
His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
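To picture the heterogeneous code graph at the core of LocAgent, here is a miniature example built with networkx. The repository, entity names, and edge types are invented for illustration; LocAgent's real index spans directories, files, classes, and functions, backed by BM25 entity search and agent-driven traversal tools.

import networkx as nx

G = nx.DiGraph()

# Nodes: a file, a class, and functions from a made-up repository
G.add_node("src/payments.py", kind="file")
G.add_node("PaymentService", kind="class")
G.add_node("PaymentService.charge", kind="function")
G.add_node("utils.validate_card", kind="function")

# Typed edges: containment and invocation relationships
G.add_edge("src/payments.py", "PaymentService", relation="contains")
G.add_edge("PaymentService", "PaymentService.charge", relation="contains")
G.add_edge("PaymentService.charge", "utils.validate_card", relation="invokes")

def traverse(graph, start, hops=2):
    """Multi-hop traversal: collect entities reachable within `hops` edges of `start`."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {nbr for node in frontier for nbr in graph.successors(node)} - seen
        seen |= frontier
    return seen

# An issue mentioning card validation failures might start exploration from the class node:
print(traverse(G, "PaymentService"))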
-
Achieving Critical Reliability in Instruction-Following with LLMs: How to Achieve AI Customer Service Thats 100% Reliablewww.marktechpost.comEnsuring reliable instruction-following in LLMs remains a critical challenge. This is particularly important in customer-facing applications, where mistakes can be costly. Traditional prompt engineering techniques fail to deliver consistent results. A more structured and managed approach is necessary to improve adherence to business rules while maintaining flexibility.This article explores key innovations, including granular atomic guidelines, dynamic evaluation and filtering of instructions, and Attentive Reasoning Queries (ARQs), while acknowledging implementation limitations and trade-offs.The Challenge: Inconsistent AI Performance in Customer ServiceLLMs are already providing tangible business value when used as assistants to human representatives in customer service scenarios. However, their reliability as autonomous customer-facing agents remains a challenge.Traditional approaches to developing conversational LLM applications often fail in real-world use cases. The two most common approaches are:Iterative prompt engineering, which leads to inconsistent, unpredictable behavior.Flowchart-based processing, which sacrifices the real magic of LLM-powered interactions: dynamic, free-flowing, human-like interactions.In high-stakes customer-facing applications, such as banking, even minor errors can have serious consequences. For instance, an incorrectly executed API call (like transferring money) can lead to lawsuits and reputational damage. Conversely, mechanical interactions that lack naturalness and rapport hurt customer trust and engagement, limiting containment rates (cases resolved without human intervention).For LLMs to reach their full potential as dynamic, autonomous agents in real-world cases, we must make them follow business-specific instructions consistently and at scale, while maintaining the flexibility of natural, free-flowing interactions.How to Create a Reliable, Autonomous Customer Service Agent with LLMsTo address these gaps in LLMs and current approaches, and achieve a level of reliability and control that works well in real-world cases, we must question the approaches that failed.One of the first questions I had when I started working on Parlant (an open-source framework for customer-facing AI agents) was, If an AI agent is found to mishandle a particular customer scenario, what would be the optimal process for fixing it? Adding additional demands to an already-lengthy prompt, like Heres how you should approach scenario X would quickly become complicated to manage, and the results werent consistent anyhow. Besides that, adding those instructions unconditionally posed an alignment risk since LLMs are inherently biased by their input. It was therefore important that instructions for scenario X did not leak into other scenarios which potentially required a different approach.We thus realized that instructions needed to apply only in their intended context. This made sense because, in real-life, when we catch unsatisfactory behavior in real-time in a customer-service interaction, we usually know how to correct it: Were able to specify both what needs to improve as well as the context in which our feedback should apply. 
For example, Be concise and to the point when discussing premium-plan benefits, but Be willing to explain our offering at length when comparing it to other solutions.In addition to this contextualization of instructions, in training a highly capable agent that can handle many use cases, wed clearly need to tweak many instructions over time as we shaped our agents behavior to business needs and preferences. We needed a systematic approach.Stepping back and rethinking, from first principles, our ideal expectations from modern AI-based interactions and how to develop them, this is what we understood about how such interactions should feel to customers:Empathetic and coherent: Customers should feel in good hands when using AI.Fluid, like Instant Messaging (IM): Allowing customers to switch topics back and forth, express themselves using multiple messages, and ask about multiple topics at a time.Personalized: You should feel that the AI agent knows its speaking to you and understands your context.From a developer perspective, we also realized that:Crafting the right conversational UX is an evolutionary process. We should be able to confidently modify agent behavior in different contexts, quickly and easily, without worrying about breaking existing behavior.Instructions should be respected consistently. This is hard to do with LLMs, which are inherently unpredictable creatures. An innovative solution was required.Agent decisions should be transparent. The spectrum of possible issues related to natural language and behavior is too wide. Resolving issues in instruction-following without clear indications of how an agent interpreted our instructions in a given scenario would be highly impractical in production environments with deadlines.Implementing Parlants Design GoalsOur main challenge was how to control and adjust an AI agents behavior while ensuring that instructions are not spoken in vainthat the AI agent implements them accurately and consistently. This led to a strategic design decision: granular, atomic guidelines.1. Granular Atomic GuidelinesComplex prompts often overwhelm LLMs, leading to incomplete or inconsistent outputs with respect to the instructions they specify. We solved this in Parlant by dropping broad prompts for self-contained, atomic guidelines. Each guideline consists of:Condition: A natural-language query that determines when the instruction should apply (e.g., The customer inquires about a refund)Action: The specific instruction the LLM should follow (e.g., Confirm order details and offer an overview of the refund process.)By segmenting instructions into manageable units and systematically focusing their attention on each one at a time, we could get the LLM to evaluate and enforce them with higher accuracy.2. Filtering and Supervision MechanismLLMs are highly influenced by the content of their prompts, even if parts of the prompt are not directly relevant to the conversation at hand.Instead of presenting all guidelines at once, we made Parlant dynamically match and apply only the relevant set of instructions at each step of the conversation. 
This real-time matching can then be leveraged for:Reduced cognitive overload for the LLM: Wed avoid prompt leaks and increase the models focus on the right instructions, leading to higher consistency.Supervision: We added a mechanism to highlight each guidelines impact and enforce its application, increasing conformance across the board.Explainability: Every evaluation and decision generated by the system includes a rationale detailing how guidelines were interpreted and the reasoning behind skipping or activating them at each point in the conversation.Continuous improvement: By monitoring guideline effectiveness and agent interpretation, developers could easily refine their AIs behavior over time. Because guidelines are atomic and supervised, you could easily make structured changes without breaking fragile prompts.3. Attentive Reasoning Queries (ARQs)While Chain of Thought (CoT) prompting improves reasoning, it remains limited in its ability to maintain consistent, context-sensitive responses over time. Parlant introduces Attentive Reasoning Queries (ARQs)a technique weve devised to ensure that multi-step reasoning stays effective, accurate, and predictable, even across thousands of runs. You can find our research paper on ARQs vs. CoT on parlant.io and arxiv.org.ARQs work by directing the LLMs attention back to high-priority instructions at key points in the response generation process, getting the LLM to attend to those instructions and reason about them right before it needs to apply them. We found that localizing the reasoning around the part of the response where a specific instruction needs to be applied provided significantly greater accuracy and consistency than a preliminary, nonspecific reasoning process like CoT.Acknowledging LimitationsWhile these innovations improve instruction-following, there are challenges to consider:Computational overhead: Implementing filtering and reasoning mechanisms increases processing time. However, with hardware and LLMs improving by the day, we saw this as a possibly controversial, yet strategic design choice.Alternative approaches: In some low-risk applications, such as assistive AI co-pilots, simpler methods like prompt-tuning or workflow-based approaches often suffice.Why Consistency Is Crucial for Enterprise-Grade Conversational AIIn regulated industries like finance, healthcare, and legal services, even 99% accuracy poses significant risk. A bank handling millions of monthly conversations cannot afford thousands of potentially critical errors. Beyond accuracy, AI systems must be constrained such that errors, even when they occur, remain within strict, acceptable bounds.In response to the demand for greater accuracy in such applications, AI solution vendors often argue that humans also make mistakes. While this is true, the difference is that, with human employees, correcting them is usually straightforward. You can ask them why they handled a situation the way they did. You can provide direct feedback and monitor their results. But relying on best-effort prompt-engineering, while being blind to why an AI agent even made some decision in the first place, is an approach that simply doesnt scale beyond basic demos.This is why a structured feedback mechanism is so important. It allows you to pinpoint what changes need to be made, and how to make them while keeping existing functionality intact. 
It's this realization that put us on the right track with Parlant early on.

Handling Millions of Customer Interactions with Autonomous AI Agents

For enterprises to deploy AI at scale, consistency and transparency are non-negotiable. A financial chatbot providing unauthorized advice, a healthcare assistant misguiding patients, or an e-commerce agent misrepresenting products can all have severe consequences. Parlant redefines AI alignment by enabling:

Enhanced operational efficiency: reducing human intervention while ensuring high-quality AI interactions.
Consistent brand alignment: maintaining coherence with business values.
Regulatory compliance: adhering to industry standards and legal requirements.

This methodology represents a shift in how AI alignment is approached in the first place. Using modular guidelines with intelligent filtering instead of long, complex prompts, and adding explicit supervision and validation mechanisms to ensure things go as planned, these innovations mark a new standard for achieving reliability with LLMs. As AI-driven automation continues to expand in adoption, ensuring consistent instruction-following will become an accepted necessity, not an innovative luxury. If your company is looking to deploy robust AI-powered customer service or any other customer-facing application, you should look into Parlant, an agent framework for controlled, explainable, and enterprise-ready AI interactions.

Yam Marcovitz is Parlant's Tech Lead and CEO at Emcie. An experienced builder of mission-critical software and system architecture, Yam brings a distinctive approach to developing controllable, predictable, and aligned AI systems.
-
A Unified Acoustic-to-Speech-to-Language Embedding Space Captures the Neural Basis of Natural Language Processing in Everyday Conversations

www.marktechpost.com

Language processing in the brain presents a challenge due to its inherently complex, multidimensional, and context-dependent nature. Psycholinguists have attempted to construct well-defined symbolic features and processes for individual domains, such as phonemes for speech analysis and part-of-speech units for syntactic structures. Despite acknowledging some cross-domain interactions, research has focused on modeling each linguistic subfield in isolation through controlled experimental manipulations. This divide-and-conquer strategy shows limitations, as a significant gap has emerged between natural language processing and formal psycholinguistic theories. These models and theories struggle to capture the subtle, non-linear, context-dependent interactions occurring within and across levels of linguistic analysis.

Recent advances in LLMs have dramatically improved conversational language processing, summarization, and generation. These models excel in handling syntactic, semantic, and pragmatic properties of written text and in recognizing speech from acoustic recordings. Multimodal, end-to-end models represent a significant theoretical advancement over text-only models by providing a unified framework for transforming continuous auditory input into speech and word-level linguistic dimensions during natural conversations. Unlike traditional approaches, these deep acoustic-to-speech-to-language models shift to multidimensional vectorial representations, where all elements of speech and language are embedded into continuous vectors across a population of simple computing units by optimizing straightforward objectives.

Researchers from Hebrew University, Google Research, Princeton University, Maastricht University, Massachusetts General Hospital and Harvard Medical School, New York University School of Medicine, and Harvard University have presented a unified computational framework that connects acoustic, speech, and word-level linguistic structures to investigate the neural basis of everyday conversations in the human brain. They utilized electrocorticography to record neural signals across 100 hours of natural speech as participants engaged in open-ended, real-life conversations. The team extracted several types of embeddings (low-level acoustic, mid-level speech, and contextual word embeddings) from a multimodal speech-to-text model called Whisper. Their model predicts neural activity at each level of the language processing hierarchy across hours of previously unseen conversations.

The internal workings of the Whisper acoustic-to-speech-to-language model are examined to model and predict neural activity during daily conversations. Three types of embeddings are extracted from the model for every word patients speak or hear: acoustic embeddings from the auditory input layer, speech embeddings from the final speech encoder layer, and language embeddings from the decoder's final layers. For each embedding type, electrode-wise encoding models are constructed to map the embeddings to neural activity during speech production and comprehension.
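Encoding models of this kind are typically linear maps from model embeddings to per-electrode activity, fit with regularized regression and evaluated by the correlation between predicted and recorded responses. The sketch below shows that general recipe; it is an illustration of the approach, not the authors' actual pipeline, and the array names and shapes are assumptions.

# Illustrative electrode-wise encoding model: predict neural activity per electrode
# from word-aligned Whisper-derived embeddings. Generic sketch, not the paper's code.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

n_words, dim, n_electrodes = 5000, 512, 64            # assumed shapes for illustration
embeddings = np.random.randn(n_words, dim)            # e.g., speech or language embeddings per word
neural = np.random.randn(n_words, n_electrodes)       # neural response per word onset, per electrode

scores = np.zeros(n_electrodes)
cv = KFold(n_splits=5, shuffle=False)                 # preserve temporal order within folds
for train, test in cv.split(embeddings):
    model = RidgeCV(alphas=np.logspace(0, 6, 10))     # regularized linear encoding model
    model.fit(embeddings[train], neural[train])
    pred = model.predict(embeddings[test])
    for e in range(n_electrodes):
        # encoding performance = correlation between predicted and recorded activity
        scores[e] += np.corrcoef(pred[:, e], neural[test, e])[0, 1] / cv.get_n_splits()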
The encoding models show a remarkable alignment between human brain activity and the models internal population code, accurately predicting neural responses across hundreds of thousands of words in conversational data.The Whisper models acoustic, speech, and language embeddings show exceptional predictive accuracy for neural activity across hundreds of thousands of words during speech production and comprehension throughout the cortical language network. During speech production, a hierarchical processing is observed where articulatory areas (preCG, postCG, STG) are better predicted by speech embeddings, while higher-level language areas (IFG, pMTG, AG) align with language embeddings. The encoding models show temporal specificity, with performance peaking more than 300ms before word onset during production and 300ms after onset during comprehension, with speech embeddings better predicting activity in perceptual and articulatory areas and language embeddings excelling in high-order language areas.In summary, the acoustic-to-speech-to-language model offers a unified computational framework for investigating the neural basis of natural language processing. This integrated approach is a paradigm shift toward non-symbolic models based on statistical learning and high-dimensional embedding spaces. As these models evolve to process natural speech better, their alignment with cognitive processes may similarly improve. Some advanced models like GPT-4o incorporate visual modality alongside speech and text, while others integrate embodied articulation systems mimicking human speech production. The fast improvement of these models supports a shift to a unified linguistic paradigm that emphasizes the role of usage-based statistical learning in language acquisition as it is materialized in real-life contexts.Check outthe Paper, and Google Blog.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Emerging Trends in Modern Machine Translation Using Large Reasoning ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision ModelsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/SYMBOLIC-MOE: Mixture-of-Experts MoE Framework for Adaptive Instance-Level Mixing of Pre-Trained LLM ExpertsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Researchers from the University of Cambridge and Monash University Introduce ReasonGraph: A Web-based Platform to Visualize and Analyze LLM Reasoning Processes0 Comments ·0 Shares ·29 Views
-
Meta AI Researchers Introduced SWEET-RL and CollaborativeAgentBench: A Step-Wise Reinforcement Learning Framework to Train Multi-Turn Language Agents for Realistic Human-AI Collaboration Taskswww.marktechpost.comLarge language models (LLMs) are rapidly transforming into autonomous agents capable of performing complex tasks that require reasoning, decision-making, and adaptability. These agents are deployed in web navigation, personal assistance, and software development. To act effectively in real-world settings, these agents must handle multi-turn interactions that span several steps or decision points. This introduces the need for training methods beyond simple response generation and instead focuses on optimizing the entire trajectory of interactions. Reinforcement learning (RL) has emerged as a compelling approach to train such agents by refining their decision-making based on long-term rewards.Despite their potential, LLM-based agents struggle with multi-turn decision-making. A major challenge lies in assigning proper credit to actions taken at earlier stages of interaction, which influence later outcomes. Traditional training methods rely on next-token prediction or imitate high-probability actions, which do not account for long-term dependencies or cumulative goals. As a result, these methods fail to address the high variance and inefficiency of long-horizon tasks, particularly in collaborative scenarios where understanding human intent and reasoning across multiple steps is critical.Various reinforcement learning techniques have been adapted to fine-tune LLMs, especially from single-turn human feedback scenarios. Tools like PPO, RAFT, and DPO have been explored but exhibit significant limitations when applied to sequential interactions. These methods often fail at effective credit assignment across turns, making them less effective for multi-turn decision-making tasks. Benchmarks used to evaluate such tools lack the diversity and complexity required to assess performance in collaborative, real-world settings robustly. Value-based learning approaches are another alternative, but their need for custom heads and large amounts of task-specific fine-tuning data limit their generalization capabilities.FAIR at Meta and UC Berkeley researchers proposed a new reinforcement learning method called SWEET-RL (Step-WisE Evaluation from Training-time Information). They also introduced a benchmark known as CollaborativeAgentBench or ColBench. This benchmark is central to the study, providing over 10,000 training tasks and over 1,000 test cases across two domains: backend programming and frontend design. ColBench simulates real collaboration between an AI agent and a human partner, where agents must ask questions, refine their understanding, and provide iterative solutions. For programming, agents are required to write functions in Python by asking for clarifications to refine missing specifications. In front-end tasks, agents must generate HTML code that matches a visual target through feedback-based corrections. Each task is designed to stretch the reasoning ability of the agent and mimic real-world constraints like limited interactions, capped at 10 turns per session.SWEET-RL is built around an asymmetric actor-critic structure. The critic has access to additional information during training, such as the correct solution, which is not visible to the actor. This information allows the critic to evaluate each decision made by the agent with a much finer resolution. 
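Concretely, a critic of this kind can be trained with a pairwise, Bradley-Terry-style objective over turn-level advantage scores, as described next. The following is a minimal sketch of that general form under our assumptions; the toy critic and the function names are illustrative, not the paper's implementation.

# Sketch of a Bradley-Terry-style loss over turn-level advantages (assumed form).
import torch
import torch.nn.functional as F

class TurnAdvantage(torch.nn.Module):
    # Toy stand-in for an LLM-based critic head, featurizing (context, action) by length
    # only, just so the loss below runs end to end.
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Linear(2, 1)

    def forward(self, context, action):
        feats = torch.tensor([[float(len(context)), float(len(action))]])
        return self.w(feats).squeeze()

def bradley_terry_loss(advantage_model, chosen_turns, rejected_turns):
    # chosen_turns / rejected_turns: lists of (context, action) pairs for one task.
    a_chosen = torch.stack([advantage_model(c, a) for c, a in chosen_turns]).sum()
    a_rejected = torch.stack([advantage_model(c, a) for c, a in rejected_turns]).sum()
    # -log sigmoid(margin): push the preferred trajectory's advantages above the rejected one's.
    return -F.logsigmoid(a_chosen - a_rejected)

critic = TurnAdvantage()
chosen = [("user asks for a Python function", "ask what input format is expected")]
rejected = [("user asks for a Python function", "guess the spec and answer immediately")]
bradley_terry_loss(critic, chosen, rejected).backward()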
Instead of training a value function that estimates overall reward, SWEET-RL directly models an advantage function at each turn, using the Bradley-Terry optimization objective. The advantage function determines how much better or worse a particular action is compared to alternatives, helping the agent learn precise behaviors. For example, if an action aligns better with the human partners expectation, it receives a higher advantage score. This method simplifies credit assignment and aligns better with the pre-training architecture of LLMs, which rely on token-level prediction.SWEET-RL achieved a 6% absolute improvement over other multi-turn reinforcement learning methods across both programming and design tasks. On backend programming tasks, it passed 48.0% of tests and achieved a success rate of 34.4%, compared to 28.2% for Multi-Turn DPO and 22.4% for zero-shot performance. On frontend design tasks, it reached a cosine similarity score of 76.9% and a win rate of 40.4%, improving from 38.6% with DPO and 33.8% with fine-tuning. Even when evaluated against top proprietary models like GPT-4o and O1-Mini, SWEET-RL closed the performance gap significantly, enabling the open-source Llama-3.1-8B model to match or exceed GPT-4os frontend win rate of 40.4%.This research demonstrates that effective training of interactive agents hinges on precise, turn-by-turn feedback rather than generalized value estimations or broad supervision. SWEET-RL significantly improves credit assignment by leveraging training-time information and an architecture-aligned optimization approach. It enhances generalization, reduces training variance, and shows strong scalability, achieving better results with increased data. The algorithm also remains effective when applied to off-policy datasets, underlining its practicality in real-world scenarios with imperfect data. The research team created a meaningful evaluation framework by introducing ColBench as a benchmark tailored for realistic, multi-turn tasks. This combination with SWEET-RL provides a strong foundation for developing agents that can reason, adapt, and collaborate effectively over extended interactions.Several key takeaways from this research include:SWEET-RL improved backend programming success rates from 28.2% (DPO) to 34.4% and frontend win rates from 38.6% to 40.4%.It allowed Llama-3.1-8B to match the performance of GPT-4o, reducing dependency on proprietary models.The critic uses training-time information (e.g., correct solutions) that is invisible to the actor, creating an asymmetric training setup.Tasks in ColBench are capped at 10 rounds per session and include over 10,000 procedurally generated training examples.ColBench measures outcomes using unit test pass rates (for code) and cosine similarity (for web design), providing reliable evaluation.SWEET-RL directly learns a turn-wise advantage function, improving credit assignment without needing an intermediate value function.The model scales effectively with more data and performs well even on off-policy datasets from weaker models.Compared to traditional fine-tuning methods, SWEET-RL delivers higher performance with less overfitting and greater generalization.Check outthe Paper, GitHub Page and Dataset.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. NikhilNikhil is an intern consultant at Marktechpost. 
He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
-
Sea AI Lab Researchers Introduce Dr. GRPO: A Bias-Free Reinforcement Learning Method that Enhances Math Reasoning Accuracy in Large Language Models Without Inflating Responseswww.marktechpost.comA critical advancement in recent times has been exploring reinforcement learning (RL) techniques to improve LLMs beyond traditional supervised fine-tuning methods. RL allows models to learn optimal responses through reward signals, enhancing their reasoning and decision-making capabilities. RL introduces a feedback-driven training loop that better aligns with human-like learning processes, particularly in tasks involving step-by-step problem-solving or math reasoning. This intersection of LLMs and RL is becoming a prominent area for academic research and industry innovation.A central challenge in improving LLMs for complex reasoning tasks is ensuring these models develop better thinking skills rather than longer outputs. In reinforcement learning-based training of LLMs, a pattern has emerged where models begin generating excessively long responses without necessarily improving answer quality. This raises concerns about optimization biases in RL methods that may favor verbosity over correctness. Another complication arises from the base models themselves; some already show signs of reasoning capabilities, which makes it difficult to isolate the real impact of RL tuning. Therefore, understanding how training strategies and model foundations affect final performance becomes essential.Previously, reinforcement learning post-training for LLMs often relied on algorithms like Proximal Policy Optimization (PPO), commonly used in various open-source implementations. These implementations frequently included a response-length normalization step, which inadvertently introduced biases favoring longer or shorter outputs depending on the correctness of the response. In particular, Group Relative Policy Optimization (GRPO) was introduced as a variant to optimize policy updates at the group level. While effective, GRPO has been criticized for embedding subtle optimization biases that affect the length and quality of model responses. These existing techniques, though innovative, have shown limitations that obscure the actual gains from reinforcement learning.Researchers from Sea AI Lab, the National University of Singapore, and Singapore Management University introduced a new approach called Dr. GRPO (Group Relative Policy Optimization Done Right) to address these issues. This method removes the problematic normalization terms from the GRPO formulation. Specifically, it eliminates the response length and standard deviation scaling factors that caused imbalances in model updates. The revised algorithm computes gradients more fairly across different responses and question types. They applied this method to train Qwen2.5-Math-7B, an open-source base model and demonstrated its effectiveness on multiple benchmarks. The training process used 27 hours of computing on 8 A100 GPUs, a relatively modest setup considering the results achieved.The researchers tested their method on prominent math reasoning benchmarks, including AIME 2024, AMC, MATH500, Minerva Math, and OlympiadBench. The model trained with Dr. GRPO achieved 43.3% accuracy on AIME 2024, significantly outperforming SimpleRL-Zero-7B (36.0%), Prime-Zero-7B (27.6%), and OpenReasoner-Zero-7B (16.7%). It also demonstrated strong average performance across all tasks: 40.9% on MATH500, 45.8% on Minerva, and 62.7% on OlympiadBench. 
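To make the correction concrete, the schematic below contrasts standard GRPO scaling with the revised estimator described above, which drops the per-group standard-deviation division and the per-response length normalization. It is a simplified sketch of the two aggregation schemes (without clipping or KL terms), not the authors' training code, and the numeric values are illustrative.

# Schematic comparison of GRPO vs. Dr. GRPO scaling (not the authors' code).
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    # Standard GRPO: mean-center and divide by the group's reward std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards):
    # Dr. GRPO: mean-center only; the std division is removed.
    return rewards - rewards.mean()

def aggregate_loss(advantages, surrogates, lengths, length_normalize):
    # surrogates[i] is the summed policy-gradient surrogate over response i's tokens.
    per_response = advantages * surrogates
    if length_normalize:            # GRPO: divide each response's term by its own length
        per_response = per_response / lengths
    return -per_response.mean()     # Dr. GRPO skips the per-response length division

rewards = np.array([1.0, 0.0, 0.0, 1.0])              # e.g., correctness of four sampled answers
lengths = np.array([120, 400, 60, 150])
surrogates = np.array([-35.2, -80.1, -12.4, -40.0])   # illustrative summed token log-probs

grpo_loss = aggregate_loss(grpo_advantages(rewards), surrogates, lengths, length_normalize=True)
dr_grpo_loss = aggregate_loss(dr_grpo_advantages(rewards), surrogates, lengths, length_normalize=False)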
These results validate the effectiveness of the bias-free RL method. Importantly, the model performed better and showed more efficient token usage. Incorrect responses became shorter and more focused, a notable shift from previous training methods encouraging overextended answers regardless of correctness.Beyond the training algorithm, the team also examined the nature of base models used in R1-Zero-like RL settings. They found that some models, such as Qwen2.5, display advanced capabilities even before training, possibly due to pretraining on concatenated question-answer data. For example, the Qwen2.5-Math-7B model achieved 38.2% average accuracy without any RL fine-tuning, outperforming many models trained using traditional methods. This preexisting reasoning capacity complicates claims about the benefits of RL, as improvements may partly stem from prior training strategies rather than new learning through reinforcement. DeepSeek-V3-Base, another examined model, showed spontaneous Aha moments and instances of self-reflection before RL, further suggesting that some reasoning skills may already be embedded in base models.The performance dynamics were carefully tracked during training. Using Dr. GRPO, models avoided the tendency to inflate response lengths. The evaluation revealed that Dr. GRPO kept output lengths stable while increasing reward signals, suggesting a direct correlation between training and improved accuracy, not just verbosity. In contrast, traditional GRPO led to progressively longer incorrect responses, falsely indicating improvement. This observation aligns with findings that many open-source PPO implementations unwittingly introduce response-length bias, a flaw inherited from pretraining practices.The researchers also explored how different templates and question sets influence model behavior. The Qwen2.5-Math-1.5B base model performed best without prompt templates, scoring 61.6% on Minerva Math and 45.8% on MATH500. Surprisingly, using templates often decreased performance before RL recovered it. This highlights how mismatches between model pretraining and inference format can obscure true reasoning capabilities. Also, models trained on small, simple question sets like GSM-8K often outperformed those trained on larger datasets, challenging the assumption that broader coverage always leads to better reasoning.Several Key Takeaways from the Research include the following:DeepSeek-V3-Base and Qwen2.5 models exhibit reasoning capabilities even before RL, indicating strong pretraining effects.Dr. GRPO eliminates biases in GRPO by removing length and reward normalization terms, improving token efficiency.The Qwen2.5-Math-7B model, trained with Dr. GRPO, achieved:43.3% on AIME 202462.7% on OlympiadBench45.8% on Minerva Math40.9% on MATH500The average score across all benchmarks: 40.3%Incorrect responses were significantly shorter using Dr. GRPO, avoiding unnecessary verbosity seen in other methods.Qwen2.5 models perform better without prompt templates, suggesting they may be pretrained on Q&A formatted data.Smaller question sets like GSM-8K can perform better than larger ones, countering expectations.Open-source PPO implementations often contain unintended response-length biases that Dr. GRPO successfully removes.In conclusion, the study reveals critical insights into how RL affects large language model behavior. Researchers found that pretraining plays a substantial role in determining baseline capabilities. 
They also demonstrated that optimization biases in popular RL algorithms can mislead training and evaluation. The introduction of Dr. GRPO corrected these issues, leading to more interpretable and efficient model training. With only 27 hours of training, their model reached state-of-the-art results on major math reasoning benchmarks. These findings reshape how the community should evaluate RL-enhanced LLMs, focusing more on method transparency and base model characteristics than on mere performance metrics.Check outthe Paper and GitHub Page.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. Asif RazzaqWebsite| + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Implementation to Build a Conversational Research Assistant with FAISS, Langchain, Pypdf, and TinyLlama-1.1B-Chat-v1.0Asif Razzaqhttps://www.marktechpost.com/author/6flvq/Code Implementation of a Rapid Disaster Assessment Tool Using IBMs Open-Source ResNet-50 ModelAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About ImagesAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA AI Open Sources Dynamo: An Open-Source Inference Library for Accelerating and Scaling AI Reasoning Models in AI Factories0 Comments ·0 Shares ·42 Views
-
A Coding Implementation to Build a Conversational Research Assistant with FAISS, Langchain, Pypdf, and TinyLlama-1.1B-Chat-v1.0

www.marktechpost.com

RAG-powered conversational research assistants address the limitations of traditional language models by combining them with information retrieval systems. The system searches through specific knowledge bases, retrieves relevant information, and presents it conversationally with proper citations. This approach reduces hallucinations, handles domain-specific knowledge, and grounds responses in retrieved text. In this tutorial, we will demonstrate building such an assistant using the open-source model TinyLlama-1.1B-Chat-v1.0 from Hugging Face, FAISS from Meta, and the LangChain framework to answer questions about scientific papers.

First, let's install the necessary libraries:

!pip install langchain-community langchain pypdf sentence-transformers faiss-cpu transformers accelerate einops

Now, let's import the required libraries:

import os
import torch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
from IPython.display import display, Markdown

Next, we mount Google Drive so the downloaded paper can be saved in a later step:

from google.colab import drive
drive.mount('/content/drive')
print("Google Drive mounted")

For our knowledge base, we'll use PDF documents of scientific papers.
Let's create a function to load and process these documents:

def load_documents(pdf_folder_path):
    documents = []
    if not pdf_folder_path:
        print("Downloading a sample paper...")
        !wget -q https://arxiv.org/pdf/1706.03762.pdf -O attention.pdf
        pdf_docs = ["attention.pdf"]
    else:
        pdf_docs = [os.path.join(pdf_folder_path, f) for f in os.listdir(pdf_folder_path)
                    if f.endswith('.pdf')]
    print(f"Found {len(pdf_docs)} PDF documents")
    for pdf_path in pdf_docs:
        try:
            loader = PyPDFLoader(pdf_path)
            documents.extend(loader.load())
            print(f"Loaded: {pdf_path}")
        except Exception as e:
            print(f"Error loading {pdf_path}: {e}")
    return documents

documents = load_documents("")

Next, we need to split these documents into smaller chunks for efficient retrieval:

def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    return chunks

chunks = split_documents(documents)

We'll use sentence-transformers to create vector embeddings for our document chunks:

def create_vector_store(chunks):
    print("Loading embedding model...")
    embedding_model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
    )
    print("Creating vector store...")
    vector_store = FAISS.from_documents(chunks, embedding_model)
    print("Vector store created successfully!")
    return vector_store

vector_store = create_vector_store(chunks)

Now, let's load an open-source language model to generate responses. We'll use TinyLlama, which is small enough to run on Colab but still powerful enough for our task:

def load_language_model():
    print("Loading language model...")
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    try:
        import subprocess
        print("Installing/updating bitsandbytes...")
        subprocess.check_call(["pip", "install", "-U", "bitsandbytes"])
        print("Successfully installed/updated bitsandbytes")
    except:
        print("Could not update bitsandbytes, will proceed without 8-bit quantization")
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
    import torch
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if torch.cuda.is_available():
        try:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
                llm_int8_has_fp16_weight=False
            )
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                quantization_config=quantization_config
            )
            print("Model loaded with 8-bit quantization")
        except Exception as e:
            print(f"Error with quantization: {e}")
            print("Falling back to standard model loading without quantization")
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto"
            )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float32,
            device_map="auto"
        )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=2048,
        temperature=0.2,
        top_p=0.95,
        repetition_penalty=1.2,
        return_full_text=False
    )
    from langchain_community.llms import HuggingFacePipeline
    llm = HuggingFacePipeline(pipeline=pipe)
    print("Language model loaded successfully!")
    return llm

llm = load_language_model()

Now, let's build our assistant by combining the vector store and language model:
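The code below calls a create_research_assistant helper whose definition does not appear in this excerpt. A reasonable implementation, given the imports already made, wraps LangChain's ConversationalRetrievalChain around the FAISS retriever and returns each answer together with its source chunks. The following is an assumed sketch of that helper, not necessarily the original article's code:

def create_research_assistant(vector_store, llm, k=4):
    # Assumed implementation: a conversational RAG chain over the FAISS store
    # that also returns the retrieved source documents.
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vector_store.as_retriever(search_kwargs={"k": k}),
        return_source_documents=True,
    )
    chat_history = []

    def ask(query, return_sources=False):
        result = chain({"question": query, "chat_history": chat_history})
        chat_history.append((query, result["answer"]))
        if return_sources:
            return result["answer"], result["source_documents"]
        return result["answer"]

    return ask

With the chain in place, the remaining code formats each answer with its sources and runs a few test queries.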
import textwrap

def format_research_assistant_output(query, response, sources):
    output = f"\n{'=' * 50}\n"
    output += f"USER QUERY: {query}\n"
    output += f"{'-' * 50}\n\n"
    output += f"ASSISTANT RESPONSE:\n{response}\n\n"
    output += f"{'-' * 50}\n"
    output += f"SOURCES REFERENCED:\n\n"
    for i, doc in enumerate(sources):
        output += f"Source #{i+1}:\n"
        content_preview = doc.page_content[:200] + "..." if len(doc.page_content) > 200 else doc.page_content
        wrapped_content = textwrap.fill(content_preview, width=80)
        output += f"{wrapped_content}\n\n"
    output += f"{'=' * 50}\n"
    return output

research_assistant = create_research_assistant(vector_store, llm)

test_queries = [
    "What is the key idea behind the Transformer model?",
    "Explain self-attention mechanism in simple terms.",
    "Who are the authors of the paper?",
    "What are the main advantages of using attention mechanisms?"
]

for query in test_queries:
    response, sources = research_assistant(query, return_sources=True)
    formatted_output = format_research_assistant_output(query, response, sources)
    print(formatted_output)

In this tutorial, we built a conversational research assistant using Retrieval-Augmented Generation with open-source models. RAG enhances language models by integrating document retrieval, reducing hallucination, and ensuring domain-specific accuracy. The guide walks through setting up the environment, processing scientific papers, creating vector embeddings using FAISS and sentence transformers, and integrating an open-source language model like TinyLlama. The assistant retrieves relevant document chunks and generates responses with citations. This implementation allows users to query a knowledge base, making AI-powered research more reliable and efficient for answering domain-specific questions.

Here is the Colab Notebook. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 85k+ ML SubReddit.
-
Fin-R1: A Specialized Large Language Model for Financial Reasoning and Decision-Makingwww.marktechpost.comLLMs are advancing rapidly across multiple domains, yet their effectiveness in tackling complex financial problems remains an area of active investigation. The iterative development of LLMs has significantly driven the evolution of artificial intelligence toward artificial general intelligence (AGI). OpenAIs o1 series and similar models like QwQ and Marco-o1 have improved complex reasoning capabilities by extending chain-of-thought reasoning through an iterative exploration-reflection approach. In finance, models such as XuanYuan-FinX1-Preview and Fino1 have showcased the potential of LLMs in cognitive reasoning tasks. Meanwhile, DeepSeekR1 adopts a different strategy, relying solely on RL with multi-stage training to enhance reasoning and inference abilities. By combining thousands of unsupervised RL training steps with a small cold-start dataset, DeepSeekR1 demonstrates strong emergent reasoning performance and readability, highlighting the effectiveness of RL-based methodologies in improving large-scale language models.Despite these advancements, general-purpose LLMs struggle to adapt to specialized financial reasoning tasks. Financial decision-making requires interdisciplinary knowledge, including legal regulations, economic indicators, and mathematical modeling, while also demanding logical, step-by-step reasoning. Several challenges arise when deploying LLMs in financial applications. First, fragmented financial data complicates knowledge integration, leading to inconsistencies that hinder comprehensive understanding. Second, the black-box nature of LLMs makes their reasoning process difficult to interpret, conflicting with regulatory requirements for transparency and accountability. Finally, LLMs often struggle with generalization across financial scenarios, producing unreliable outputs in high-risk applications. These limitations pose significant barriers to their adoption in real-world financial systems, where accuracy and traceability are critical.Researchers from Shanghai University of Finance & Economics, Fudan University, and FinStep have developed Fin-R1, a specialized LLM for financial reasoning. With a compact 7-billion-parameter architecture, Fin-R1 reduces deployment costs while addressing key economic challenges: fragmented data, lack of reasoning control, and weak generalization. It is trained on Fin-R1-Data, a high-quality dataset containing 60,091 CoT sourced from authoritative financial data. A two-stage training approachSupervised Fine-Tuning (SFT) followed by RLFin-R1 enhances accuracy and interpretability. It performs well in financial benchmarks, excelling in financial compliance and robo-advisory applications.The study presents a two-stage framework for constructing Fin-R1. The data generation phase involves creating a high-quality financial reasoning dataset, Fin-R1-Data, through data distillation with DeepSeek-R1 and filtering using an LLM-as-judge approach. In the model training phase, Fin-R1 is fine-tuned on Qwen2.5-7B-Instruct using SFT and Group Relative Policy Optimization (GRPO) to enhance reasoning and output consistency. The dataset combines open-source and proprietary financial data, refined through rigorous filtering. 
Training integrates supervised learning and reinforcement learning, incorporating structured prompts and reward mechanisms to improve financial reasoning accuracy and standardization.The reasoning abilities of Fin-R1 in financial scenarios were evaluated through a comparative analysis against several state-of-the-art models, including DeepSeek-R1, Fin-R1-SFT, and various Qwen and Llama-based architectures. Despite its compact 7B parameter size, Fin-R1 achieved a notable average score of 75.2, ranking second overall. It outperformed all models of similar scale and exceeded DeepSeek-R1-Distill-Llama-70B by 8.7 points. Fin-R1 ranked highest in FinQA and ConvFinQA with scores of 76.0 and 85.0, respectively, demonstrating strong financial reasoning and cross-task generalization, particularly in benchmarks like Ant_Finance, TFNS, and Finance-Instruct-500K.In conclusion, Fin-R1 is a large financial reasoning language model designed to tackle key challenges in financial AI, including fragmented data, inconsistent reasoning logic, and limited business generalization. It delivers state-of-the-art performance by utilizing a two-stage training processSFT and RLon the high-quality Fin-R1-Data dataset. With a compact 7B parameter scale, it achieves scores of 85.0 in ConvFinQA and 76.0 in FinQA, outperforming larger models. Future work aims to enhance financial multimodal capabilities, strengthen regulatory compliance, and expand real-world applications, driving innovation in fintech while ensuring efficient and intelligent financial decision-making.Check outTwitterand dont forget to join our85k+ ML SubReddit. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/Microsoft AI Releases RD-Agent: An AI-Driven Tool for Performing R&D with LLM-based AgentsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/KBLAM: Efficient Knowledge Base Augmentation for Large Language Models Without Retrieval OverheadSana Hassanhttps://www.marktechpost.com/author/sana-hassan/MemQ: Enhancing Knowledge Graph Question Answering with Memory-Augmented Query ReconstructionSana Hassanhttps://www.marktechpost.com/author/sana-hassan/VisualWebInstruct: A Large-Scale Multimodal Reasoning Dataset for Enhancing Vision-Language Models0 Comments ·0 Shares ·31 Views
-
Microsoft AI Releases RD-Agent: An AI-Driven Tool for Performing R&D with LLM-based Agentswww.marktechpost.comResearch and development (R&D) is crucial in driving productivity, particularly in the AI era. However, conventional automation methods in R&D often lack the intelligence to handle complex research challenges and innovation-driven tasks, making them less effective than human experts. Conversely, researchers leverage deep domain knowledge to generate ideas, test hypotheses, and refine processes through iterative experimentation. The rise of LLMs offers a potential solution by introducing advanced reasoning and decision-making capabilities, allowing them to function as intelligent agents that enhance efficiency in data-driven R&D workflows.Despite their potential, LLMs must overcome key challenges to deliver meaningful industrial impact in R&D. A major limitation is their inability to evolve beyond their initial training, restricting their capacity to adapt to emerging developments. Additionally, while LLMs possess broad general knowledge, they often lack the depth required for specialized domains, limiting their effectiveness in solving industry-specific problems. To maximize their impact, LLMs must continuously acquire specialized knowledge through practical industry applications, ensuring they remain relevant and capable of addressing complex R&D challenges.Researchers at Microsoft Research Asia have developed RD-Agent, an AI-powered tool designed to automate R&D processes using LLMs. RD-Agent operates through an autonomous framework with two key components: Research, which generates and explores new ideas, and Development, which implements them. The system continuously improves through iterative refinement. RD-Agent functions as both a research assistant and a data-mining agent, automating tasks like reading papers, identifying financial and healthcare data patterns, and optimizing feature engineering. Now open-source on GitHub, RD-Agent is actively evolving to support more applications and enhance industry productivity.In R&D, two primary challenges must be addressed: enabling continuous learning and acquiring specialized knowledge. Traditional LLMs, once trained, struggle to expand their expertise, limiting their ability to tackle industry-specific problems. To overcome this, RD-Agent employs a dynamic learning framework that integrates real-world feedback, allowing it to refine hypotheses and accumulate domain knowledge over time. RD-Agent continuously proposes, tests, and improves ideas by automating the research process, linking scientific exploration with real-world validation. This iterative feedback loop ensures that knowledge is systematically acquired and applied like human experts refine their understanding through experience.In the development phase, RD-Agent enhances efficiency by prioritizing tasks and optimizing execution strategies through Co-STEER, a data-driven approach that evolves via continuous learning. This system begins with simple tasks and refines its development methods based on real-world feedback. To evaluate R&D capabilities, researchers have introduced RD2Bench, a benchmarking system that assesses LLM agents on model and data development tasks. Looking ahead, automating feedback comprehension, task scheduling, and cross-domain knowledge transfer remains a major challenge. 
By integrating research and development processes through continuous feedback, RD-Agent aims to revolutionize automated R&D, boosting innovation and efficiency across disciplines.In conclusion, RD-Agent is an open-source AI-driven framework designed to automate and enhance R&D processes. It integrates two core componentsResearch for idea generation and development for implementationto ensure continuous improvement through iterative feedback. By incorporating real-world data, RD-Agent evolves dynamically and acquires specialized knowledge. The system employs Co-STEER, a data-centric approach, and RD2Bench, a benchmarking tool, to refine development strategies and evaluate AI-driven R&D capabilities. This integrated approach enhances innovation, fosters cross-domain knowledge transfer, and improves efficiency, marking a significant step toward intelligent and automated research and development.Check outthe Paper and GitHub Page.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our85k+ ML SubReddit. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/KBLAM: Efficient Knowledge Base Augmentation for Large Language Models Without Retrieval OverheadSana Hassanhttps://www.marktechpost.com/author/sana-hassan/MemQ: Enhancing Knowledge Graph Question Answering with Memory-Augmented Query ReconstructionSana Hassanhttps://www.marktechpost.com/author/sana-hassan/VisualWebInstruct: A Large-Scale Multimodal Reasoning Dataset for Enhancing Vision-Language ModelsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Groundlight Research Team Released an Open-Source AI Framework that Makes It Easy to Build Visual Reasoning Agents (with GRPO)0 Comments ·0 Shares ·50 Views
-
Code Implementation of a Rapid Disaster Assessment Tool Using IBM's Open-Source ResNet-50 Model

www.marktechpost.com

In this tutorial, we explore an innovative and practical application of IBM's open-source ResNet-50 deep learning model, showcasing its capability to rapidly classify satellite imagery for disaster management. Leveraging pretrained convolutional neural networks (CNNs), this approach empowers users to swiftly analyze satellite images to identify and categorize disaster-affected areas, such as floods, wildfires, or earthquake damage. Using Google Colab, we'll walk through a step-by-step process to easily set up the environment, preprocess images, perform inference, and interpret results.

First, we install the essential libraries for PyTorch-based image processing and visualization tasks.

!pip install torch torchvision matplotlib pillow

We import the necessary libraries and load the pretrained IBM-supported ResNet-50 model from PyTorch, preparing it for inference tasks.

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

model = models.resnet50(pretrained=True)
model.eval()

Now, we define the standard preprocessing pipeline for images, resizing and cropping them, converting them into tensors, and normalizing them to match ResNet-50's input requirements.

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

Here, we retrieve a satellite image from a given URL, preprocess it, classify it using the pretrained ResNet-50 model, and visualize the image with its top prediction. The function also prints the top five predictions with associated probabilities.

def classify_satellite_image(url):
    response = requests.get(url)
    img = Image.open(BytesIO(response.content)).convert('RGB')
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0)
    with torch.no_grad():
        output = model(input_batch)
    labels_url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
    labels = requests.get(labels_url).text.split("\n")
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
    top5_prob, top5_catid = torch.topk(probabilities, 5)
    plt.imshow(img)
    plt.axis('off')
    plt.title("Top Prediction: {}".format(labels[top5_catid[0]]))
    plt.show()
    print("Top 5 Predictions:")
    for i in range(top5_prob.size(0)):
        print(labels[top5_catid[i]], top5_prob[i].item())

Finally, we download a wildfire-related satellite image, classify it using the pretrained ResNet-50 model, and display it along with its top five predictions.

image_url = "https://upload.wikimedia.org/wikipedia/commons/0/05/Burnout_ops_on_Mangum_Fire_McCall_Smokejumpers.jpg"
classify_satellite_image(image_url)

In conclusion, we've successfully harnessed IBM's open-source ResNet-50 model in Google Colab to efficiently classify satellite imagery, supporting critical disaster assessment and response tasks. The approach outlined demonstrates the practicality and accessibility of advanced machine learning models and emphasizes how pretrained CNNs can be creatively applied to real-world challenges. With minimal setup, we now have a powerful tool at our disposal.

Here is the Colab Notebook. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 85k+ ML SubReddit.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
-
Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Imageswww.marktechpost.comArtificial intelligence has made significant strides in recent years, yet integrating real-time speech interaction with visual content remains a complex challenge. Traditional systems often rely on separate components for voice activity detection, speech recognition, textual dialogue, and text-to-speech synthesis. This segmented approach can introduce delays and may not capture the nuances of human conversation, such as emotions or non-speech sounds. These limitations are particularly evident in applications designed to assist visually impaired individuals, where timely and accurate descriptions of visual scenes are essential.Addressing these challenges, Kyutai has introduced MoshiVis, an open-source Vision Speech Model (VSM) that enables natural, real-time speech interactions about images. Building upon their earlier work with Moshia speech-text foundation model designed for real-time dialogueMoshiVis extends these capabilities to include visual inputs. This enhancement allows users to engage in fluid conversations about visual content, marking a noteworthy advancement in AI development.Technically, MoshiVis augments Moshi by integrating lightweight cross-attention modules that infuse visual information from an existing visual encoder into Moshis speech token stream. This design ensures that Moshis original conversational abilities remain intact while introducing the capacity to process and discuss visual inputs. A gating mechanism within the cross-attention modules enables the model to selectively engage with visual data, maintaining efficiency and responsiveness. Notably, MoshiVis adds approximately 7 milliseconds of latency per inference step on consumer-grade devices, such as a Mac Mini with an M4 Pro Chip, resulting in a total of 55 milliseconds per inference step. This performance stays well below the 80-millisecond threshold for real-time latency, ensuring smooth and natural interactions.In practical applications, MoshiVis demonstrates its ability to provide detailed descriptions of visual scenes through natural speech. For instance, when presented with an image depicting green metal structures surrounded by trees and a building with a light brown exterior, MoshiVis articulates:I see two green metal structures with a mesh top, and theyre surrounded by large trees. In the background, you can see a building with a light brown exterior and a black roof, which appears to be made of stone.This capability opens new avenues for applications such as providing audio descriptions for the visually impaired, enhancing accessibility, and enabling more natural interactions with visual information. By releasing MoshiVis as an open-source project, Kyutai invites the research community and developers to explore and expand upon this technology, fostering innovation in vision-speech models. The availability of the model weights, inference code, and visual speech benchmarks further supports collaborative efforts to refine and diversify the applications of MoshiVis.In conclusion, MoshiVis represents a significant advancement in AI, merging visual understanding with real-time speech interaction. Its open-source nature encourages widespread adoption and development, paving the way for more accessible and natural interactions with technology. 
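The gated cross-attention described above follows a familiar pattern for grafting a new modality onto an existing sequence model: a learned gate that starts near zero so the base model's behavior is preserved until the adapter is trained. The block below is a generic PyTorch illustration of that pattern; it is not MoshiVis's actual module, and the layer names and shapes are assumptions.

# Generic gated cross-attention adapter (illustrative pattern, not MoshiVis's code).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: visual stream has no effect initially

    def forward(self, speech_tokens, visual_tokens):
        # speech_tokens: (B, T, dim) hidden states from the speech model
        # visual_tokens: (B, N, dim) projected features from the visual encoder
        attended, _ = self.attn(self.norm(speech_tokens), visual_tokens, visual_tokens)
        return speech_tokens + torch.tanh(self.gate) * attended  # gated residual injection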
As AI continues to evolve, innovations like MoshiVis bring us closer to seamless integration of multimodal understanding, enhancing user experiences across various domains.Check outthe Technical details and Try it here.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our80k+ ML SubReddit. Asif RazzaqWebsite| + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA AI Open Sources Dynamo: An Open-Source Inference Library for Accelerating and Scaling AI Reasoning Models in AI FactoriesAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Step-by-Step Guide to Building a Semantic Search Engine with Sentence Transformers, FAISS, and all-MiniLM-L6-v2Asif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA AI Just Open Sourced Canary 1B and 180M Flash Multilingual Speech Recognition and Translation ModelsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Microsoft AI Introduces Claimify: A Novel LLM-based Claim-Extraction Method that Outperforms Prior Solutions to Produce More Accurate, Comprehensive, and Substantiated Claims from LLM Outputs0 Comments ·0 Shares ·60 Views
-
NVIDIA AI Open Sources Dynamo: An Open-Source Inference Library for Accelerating and Scaling AI Reasoning Models in AI Factorieswww.marktechpost.comThe rapid advancement of artificial intelligence (AI) has led to the development of complex models capable of understanding and generating human-like text. Deploying these large language models (LLMs) in real-world applications presents significant challenges, particularly in optimizing performance and managing computational resources efficiently.Challenges in Scaling AI Reasoning ModelsAs AI models grow in complexity, their deployment demands increase, especially during the inference phasethe stage where models generate outputs based on new data. Key challenges include:Resource Allocation: Balancing computational loads across extensive GPU clusters to prevent bottlenecks and underutilization is complex.Latency Reduction: Ensuring rapid response times is critical for user satisfaction, necessitating low-latency inference processes.Cost Management: The substantial computational requirements of LLMs can lead to escalating operational costs, making cost-effective solutions essential.Introducing NVIDIA DynamoIn response to these challenges, NVIDIA has introduced Dynamo, an open-source inference library designed to accelerate and scale AI reasoning models efficiently and cost-effectively. As the successor to the NVIDIA Triton Inference Server, Dynamo offers a modular framework tailored for distributed environments, enabling seamless scaling of inference workloads across large GPU fleets. Technical Innovations and BenefitsDynamo incorporates several key innovations that collectively enhance inference performance:Disaggregated Serving: This approach separates the context (prefill) and generation (decode) phases of LLM inference, allocating them to distinct GPUs. By allowing each phase to be optimized independently, disaggregated serving improves resource utilization and increases the number of inference requests served per GPU. GPU Resource Planner: Dynamos planning engine dynamically adjusts GPU allocation in response to fluctuating user demand, preventing over- or under-provisioning and ensuring optimal performance. Smart Router: This component efficiently directs incoming inference requests across large GPU fleets, minimizing costly recomputations by leveraging knowledge from prior requests, known as KV cache. Low-Latency Communication Library (NIXL): NIXL accelerates data transfer between GPUs and across diverse memory and storage types, reducing inference response times and simplifying data exchange complexities. KV Cache Manager: By offloading less frequently accessed inference data to more cost-effective memory and storage devices, Dynamo reduces overall inference costs without impacting user experience. Performance InsightsDynamos impact on inference performance is substantial. When serving the open-source DeepSeek-R1 671B reasoning model on NVIDIA GB200 NVL72, Dynamo increased throughputmeasured in tokens per second per GPUby up to 30 times. Additionally, serving the Llama 70B model on NVIDIA Hopper resulted in more than a twofold increase in throughput. These enhancements enable AI service providers to serve more inference requests per GPU, accelerate response times, and reduce operational costs, thereby maximizing returns on their accelerated compute investments. 
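As a rough illustration of the routing idea, reusing prior KV cache by sending a request to the worker that has already processed the longest matching prompt prefix, here is a toy scheduler. It is conceptual only and bears no relation to Dynamo's actual implementation; the class and worker names are invented for the example.

# Toy KV-cache-aware router: send each request to the worker whose previously
# seen prompts share the longest prefix with the new prompt. Conceptual sketch only.
def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class ToyRouter:
    def __init__(self, workers):
        self.cache = {w: [] for w in workers}   # worker -> list of previously routed prompts

    def route(self, prompt: str) -> str:
        def best_overlap(worker):
            return max((shared_prefix_len(prompt, p) for p in self.cache[worker]), default=0)
        worker = max(self.cache, key=best_overlap)  # most reusable cached prefix wins
        self.cache[worker].append(prompt)
        return worker

router = ToyRouter(["gpu-0", "gpu-1"])
print(router.route("You are a helpful assistant. Summarize the following report: ..."))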
Conclusion
NVIDIA Dynamo represents a significant advancement in the deployment of AI reasoning models, addressing critical challenges in scaling, efficiency, and cost-effectiveness. Its open-source nature and compatibility with major AI inference backends, including PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, empower enterprises, startups, and researchers to optimize AI model serving across disaggregated inference environments. By leveraging Dynamo's innovative features, organizations can enhance their AI capabilities, delivering faster and more efficient AI services to meet the growing demands of modern applications. Check out the Technical details and GitHub Page. All credit for this research goes to the researchers of this project.
-
How to Use SQL Databases with Python: A Beginner-Friendly Tutorial
This tutorial will guide you through the process of using SQL databases with Python, focusing on MySQL as the database management system. You will learn how to set up your environment, connect to a database, and perform basic operations such as creating, reading, updating, and deleting records.
Prerequisites
Before you start, ensure you have the following installed:
Python: Make sure Python is installed on your machine. You can download it from python.org.
MySQL Server: You will need MySQL installed on your system so you can interact with it directly, run commands, and set up user permissions. To get it running: install MySQL (if it is not already installed), start the MySQL service, secure the installation (this sets up the root password and other settings), and log in to the MySQL shell to confirm access.
MySQL Connector for Python: Install the MySQL connector using pip from your command line.
Setting Up Your Python Environment
With the prerequisites in place, the workflow in Python is: import the required libraries, establish a connection to the database, create a database, create tables within it (for example, a simple teacher table), insert data into the tables, read data back, update existing records, delete records, and finally close the cursor and connection once you are done. A consolidated, runnable example covering each of these steps follows at the end of this tutorial.
Conclusion
This tutorial covers the basics of using SQL databases with Python. You learned how to set up your environment, create a database and tables, and perform basic CRUD (Create, Read, Update, Delete) operations. For more advanced topics, like using SQL with Pandas or exploring different SQL databases such as SQLite or PostgreSQL, consider checking out additional tutorials or courses. Feel free to experiment with more complex queries and database structures as you become more comfortable with SQL and Python!
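Since the individual code snippets for the steps above are not shown inline, here is a consolidated sketch of the full workflow using the mysql-connector-python package. The host, user, password, database name, and the teacher table schema are placeholder assumptions chosen for illustration; adapt them to your own setup.

import mysql.connector

# Connect to the MySQL server (credentials are placeholders)
conn = mysql.connector.connect(host="localhost", user="root", password="your_password")
cursor = conn.cursor()

# Create a database and switch to it
cursor.execute("CREATE DATABASE IF NOT EXISTS school")
cursor.execute("USE school")

# Create a simple teacher table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS teacher (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100),
        subject VARCHAR(100)
    )
""")

# Insert a record (parameterized queries avoid SQL injection)
cursor.execute("INSERT INTO teacher (name, subject) VALUES (%s, %s)", ("Alice", "Physics"))
conn.commit()

# Read records
cursor.execute("SELECT id, name, subject FROM teacher")
for row in cursor.fetchall():
    print(row)

# Update a record
cursor.execute("UPDATE teacher SET subject = %s WHERE name = %s", ("Mathematics", "Alice"))
conn.commit()

# Delete a record
cursor.execute("DELETE FROM teacher WHERE name = %s", ("Alice",))
conn.commit()

# Close the cursor and connection
cursor.close()
conn.close()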
-
KBLAM: Efficient Knowledge Base Augmentation for Large Language Models Without Retrieval Overhead
LLMs have demonstrated strong reasoning and knowledge capabilities, yet they often require external knowledge augmentation when their internal representations lack specific details. One method for incorporating new information is supervised fine-tuning, where models are trained on additional datasets to update their weights. However, this approach is inefficient as it requires retraining whenever new knowledge is introduced and may lead to catastrophic forgetting, degrading the model's performance on general tasks. To overcome these limitations, alternative techniques that preserve the model's weights have gained popularity. RAG is one approach that retrieves relevant knowledge from unstructured text and appends it to the input query before passing it through the model. By dynamically retrieving information, RAG enables LLMs to access large knowledge bases while maintaining a smaller context size. However, as long-context models such as GPT-4 and Gemini have emerged, researchers have explored in-context learning, where external knowledge is directly provided in the model's input. This eliminates the need for retrieval but comes with computational challenges, as processing long contexts requires significantly more memory and time.
Several advanced techniques have been developed to enhance LLMs' ability to integrate external knowledge more efficiently. Structured attention mechanisms improve memory efficiency by segmenting the context into independent sections, reducing the computational load of self-attention. Key-value (KV) caching optimizes response generation by storing precomputed embeddings at different layers, allowing the model to recall relevant information without recalculating it. This reduces the complexity from quadratic to linear concerning context length. Unlike traditional KV caching, which requires full recomputation when the input changes, newer methods allow selective updates, making external knowledge integration more flexible.
Researchers from Johns Hopkins University and Microsoft propose a Knowledge Base Augmented Language Model (KBLAM), a method for integrating external knowledge into LLMs. KBLAM converts structured knowledge base (KB) triples into key-value vector pairs, seamlessly embedding them within the LLM's attention layers. Unlike RAG, it eliminates external retrievers, and unlike in-context learning, it scales linearly with KB size. KBLAM enables efficient dynamic updates without retraining and enhances interpretability. Trained using instruction tuning on synthetic data, it improves reliability by refusing to answer when relevant knowledge is absent, reducing hallucinations and enhancing scalability.
KBLAM enhances LLMs by integrating a KB through two steps. First, each KB triple is converted into continuous key-value embeddings, termed knowledge tokens, using a pre-trained sentence encoder and linear adapters. These tokens are then incorporated into each attention layer via a rectangular attention structure, allowing efficient retrieval without altering the LLM's core parameters. This method ensures scalability, mitigates positional bias and maintains reasoning abilities. Additionally, instruction tuning optimizes knowledge token projection without modifying the LLM, using a synthetic KB to prevent memorization.
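To illustrate the first of these two steps, the sketch below converts a handful of KB triples into key-value "knowledge token" pairs using a pre-trained sentence encoder and two linear adapters. It is a simplified illustration of the idea rather than the authors' implementation; the encoder choice, adapter dimensions, and the way each triple is rendered as text are assumptions.

import torch
from sentence_transformers import SentenceTransformer

# Toy knowledge base of (subject, relation, object) triples
triples = [
    ("Marie Curie", "field", "physics and chemistry"),
    ("Mount Everest", "height_in_meters", "8849"),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dim sentence embeddings
d_model = 512                                        # assumed hidden size of the target LLM

# Linear adapters that map sentence embeddings into the LLM's key/value spaces
key_adapter = torch.nn.Linear(384, d_model)
value_adapter = torch.nn.Linear(384, d_model)

# Encode the "key" side (subject + relation) and the "value" side (object) of each triple
key_texts = [f"{s} {r}" for s, r, _ in triples]
value_texts = [o for _, _, o in triples]
key_emb = torch.tensor(encoder.encode(key_texts))
value_emb = torch.tensor(encoder.encode(value_texts))

# Knowledge tokens: one (key, value) vector pair per triple, ready to be attended over
knowledge_keys = key_adapter(key_emb)        # shape: (num_triples, d_model)
knowledge_values = value_adapter(value_emb)  # shape: (num_triples, d_model)
print(knowledge_keys.shape, knowledge_values.shape)

In the full method these pairs are appended to every attention layer's keys and values, so updating the KB only means re-encoding the changed triples rather than retraining the model.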
This two-step approach efficiently integrates large KBs while preserving the model's original capabilities.
The empirical evaluation of KBLAM demonstrates its effectiveness as a knowledge retrieval and reasoning model. After instruction tuning, its attention matrix exhibits interpretable patterns, allowing accurate retrieval. KBLAM achieves performance comparable to in-context learning while significantly reducing memory usage and maintaining scalability up to 10K triples. It can also refuse to answer when no relevant knowledge is found, with over-refusal occurring later than in-context learning. The model is trained on an instruction-tuned Llama3-8B and optimized using AdamW. Evaluation on synthetic and Enron datasets confirms KBLAM's strong retrieval accuracy, efficient knowledge integration, and ability to minimize hallucinations.
In conclusion, KBLAM is an approach for enhancing LLMs with external KBs. It encodes KB entries as continuous key-value vector pairs using pre-trained sentence encoders with linear adapters and integrates them into LLMs through a specialized attention mechanism. Unlike Retrieval-Augmented Generation, KBLAM removes external retrieval modules, and unlike in-context learning, it scales linearly with KB size. This enables efficient integration of over 10K triples into an 8B LLM within an 8K context window on a single A100 GPU. Experiments show its effectiveness in question-answering and reasoning tasks while maintaining interpretability and enabling dynamic knowledge updates. Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
-
A Step-by-Step Guide to Building a Semantic Search Engine with Sentence Transformers, FAISS, and all-MiniLM-L6-v2www.marktechpost.comSemantic search goes beyond traditional keyword matching by understanding the contextual meaning of search queries. Instead of simply matching exact words, semantic search systems capture the intent and contextual definition of the query and return relevant results even when they dont contain the same keywords.In this tutorial, well implement a semantic search system using Sentence Transformers, a powerful library built on top of Hugging Faces Transformers that provides pre-trained models specifically optimized for generating sentence embeddings. These embeddings are numerical representations of text that capture semantic meaning, allowing us to find similar content through vector similarity. Well create a practical application: a semantic search engine for a collection of scientific abstracts that can answer research queries with relevant papers, even when the terminology differs between the query and relevant documents.First, lets install the necessary libraries in our Colab notebook:!pip install sentence-transformers faiss-cpu numpy pandas matplotlib datasetsNow, lets import the libraries well need:import numpy as npimport pandas as pdimport matplotlib.pyplot as pltfrom sentence_transformers import SentenceTransformerimport faissfrom typing import List, Dict, Tupleimport timeimport reimport torchFor our demonstration, well use a collection of scientific paper abstracts. Lets create a small dataset of abstracts from various fields:abstracts = [ { "id": 1, "title": "Deep Learning for Natural Language Processing", "abstract": "This paper explores recent advances in deep learning models for natural language processing tasks. We review transformer architectures including BERT, GPT, and T5, and analyze their performance on various benchmarks including question answering, sentiment analysis, and text classification." }, { "id": 2, "title": "Climate Change Impact on Marine Ecosystems", "abstract": "Rising ocean temperatures and acidification are severely impacting coral reefs and marine biodiversity. This study presents data collected over a 10-year period, demonstrating accelerated decline in reef ecosystems and proposing conservation strategies to mitigate further damage." }, { "id": 3, "title": "Advancements in mRNA Vaccine Technology", "abstract": "The development of mRNA vaccines represents a breakthrough in immunization technology. This review discusses the mechanism of action, stability improvements, and clinical efficacy of mRNA platforms, with special attention to their rapid deployment during the COVID-19 pandemic." }, { "id": 4, "title": "Quantum Computing Algorithms for Optimization Problems", "abstract": "Quantum computing offers potential speedups for solving complex optimization problems. This paper presents quantum algorithms for combinatorial optimization and compares their theoretical performance with classical methods on problems including traveling salesman and maximum cut." }, { "id": 5, "title": "Sustainable Urban Planning Frameworks", "abstract": "This research proposes frameworks for sustainable urban development that integrate renewable energy systems, efficient public transportation networks, and green infrastructure. Case studies from five cities demonstrate reductions in carbon emissions and improvements in quality of life metrics." 
}, { "id": 6, "title": "Neural Networks for Computer Vision", "abstract": "Convolutional neural networks have revolutionized computer vision tasks. This paper examines recent architectural innovations including residual connections, attention mechanisms, and vision transformers, evaluating their performance on image classification, object detection, and segmentation benchmarks." }, { "id": 7, "title": "Blockchain Applications in Supply Chain Management", "abstract": "Blockchain technology enables transparent and secure tracking of goods throughout supply chains. This study analyzes implementations across food, pharmaceutical, and retail industries, quantifying improvements in traceability, reduction in counterfeit products, and enhanced consumer trust." }, { "id": 8, "title": "Genetic Factors in Autoimmune Disorders", "abstract": "This research identifies key genetic markers associated with increased susceptibility to autoimmune conditions. Through genome-wide association studies of 15,000 patients, we identified novel variants that influence immune system regulation and may serve as targets for personalized therapeutic approaches." }, { "id": 9, "title": "Reinforcement Learning for Robotic Control Systems", "abstract": "Deep reinforcement learning enables robots to learn complex manipulation tasks through trial and error. This paper presents a framework that combines model-based planning with policy gradient methods to achieve sample-efficient learning of dexterous manipulation skills." }, { "id": 10, "title": "Microplastic Pollution in Freshwater Systems", "abstract": "This study quantifies microplastic contamination across 30 freshwater lakes and rivers, identifying primary sources and transport mechanisms. Results indicate correlation between population density and contamination levels, with implications for water treatment policies and plastic waste management." }]papers_df = pd.DataFrame(abstracts)print(f"Dataset loaded with {len(papers_df)} scientific papers")papers_df[["id", "title"]]Now well load a pre-trained Sentence Transformer model from Hugging Face. Well use the all-MiniLM-L6-v2 model, which provides a good balance between performance and speed:model_name = 'all-MiniLM-L6-v2'model = SentenceTransformer(model_name)print(f"Loaded model: {model_name}")Next, well convert our text abstracts into dense vector embeddings:documents = papers_df['abstract'].tolist()document_embeddings = model.encode(documents, show_progress_bar=True)print(f"Generated {len(document_embeddings)} embeddings with dimension {document_embeddings.shape[1]}")FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. 
Well use it to index our document embeddings:dimension = document_embeddings.shape[1] index = faiss.IndexFlatL2(dimension)index.add(np.array(document_embeddings).astype('float32'))print(f"Created FAISS index with {index.ntotal} vectors")Now lets implement a function that takes a query, converts it to an embedding, and retrieves the most similar documents:def semantic_search(query: str, top_k: int = 3) -> List[Dict]: """ Search for documents similar to query Args: query: Text to search for top_k: Number of results to return Returns: List of dictionaries containing document info and similarity score """ query_embedding = model.encode([query]) distances, indices = index.search(np.array(query_embedding).astype('float32'), top_k) results = [] for i, idx in enumerate(indices[0]): results.append({ 'id': papers_df.iloc[idx]['id'], 'title': papers_df.iloc[idx]['title'], 'abstract': papers_df.iloc[idx]['abstract'], 'similarity_score': 1 - distances[0][i] / 2 }) return resultsLets test our semantic search with various queries that demonstrate its ability to understand meaning beyond exact keywords:test_queries = [ "How do transformers work in natural language processing?", "What are the effects of global warming on ocean life?", "Tell me about COVID vaccine development", "Latest algorithms in quantum computing", "How can cities reduce their carbon footprint?"]for query in test_queries: print("\n" + "="*80) print(f"Query: {query}") print("="*80) results = semantic_search(query, top_k=3) for i, result in enumerate(results): print(f"\nResult #{i+1} (Score: {result['similarity_score']:.4f}):") print(f"Title: {result['title']}") print(f"Abstract snippet: {result['abstract'][:150]}...")Lets visualize the document embeddings to see how they cluster by topic:from sklearn.decomposition import PCApca = PCA(n_components=2)reduced_embeddings = pca.fit_transform(document_embeddings)plt.figure(figsize=(12, 8))plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], s=100, alpha=0.7)for i, (x, y) in enumerate(reduced_embeddings): plt.annotate(papers_df.iloc[i]['title'][:20] + "...", (x, y), fontsize=9, alpha=0.8)plt.title('Document Embeddings Visualization (PCA)')plt.xlabel('Component 1')plt.ylabel('Component 2')plt.grid(True, linestyle='--', alpha=0.7)plt.tight_layout()plt.show()Lets create a more interactive search interface:from IPython.display import display, HTML, clear_outputimport ipywidgets as widgetsdef run_search(query_text): clear_output(wait=True) display(HTML(f"<h3>Query: {query_text}</h3>")) start_time = time.time() results = semantic_search(query_text, top_k=5) search_time = time.time() - start_time display(HTML(f"<p>Found {len(results)} results in {search_time:.4f} seconds</p>")) for i, result in enumerate(results): html = f""" <div style="margin-bottom: 20px; padding: 15px; border: 1px solid #ddd; border-radius: 5px;"> <h4>{i+1}. {result['title']} <span style="color: #007bff;">(Score: {result['similarity_score']:.4f})</span></h4> <p>{result['abstract']}</p> </div> """ display(HTML(html))search_box = widgets.Text( value='', placeholder='Type your search query here...', description='Search:', layout=widgets.Layout(width='70%'))search_button = widgets.Button( description='Search', button_style='primary', tooltip='Click to search')def on_button_clicked(b): run_search(search_box.value)search_button.on_click(on_button_clicked)display(widgets.HBox([search_box, search_button]))In this tutorial, weve built a complete semantic search system using Sentence Transformers. 
This system can understand the meaning behind user queries and return relevant documents even when there isn't an exact keyword match. We've seen how embedding-based search provides more intelligent results than traditional methods. Here is the Colab Notebook.
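As a small extension to the indexing step above: IndexFlatL2 ranks results by Euclidean distance, and the score returned by the search function is only a rough proxy for similarity. If you prefer true cosine similarity, one option, sketched below and reusing the model and documents variables from the tutorial, is to normalize the embeddings and use an inner-product index instead.

import numpy as np
import faiss

# Encode with L2-normalized embeddings so inner product equals cosine similarity
doc_embeddings = model.encode(documents, normalize_embeddings=True)

index_ip = faiss.IndexFlatIP(doc_embeddings.shape[1])
index_ip.add(np.array(doc_embeddings).astype("float32"))

query_emb = model.encode(["How do transformers work in NLP?"], normalize_embeddings=True)
scores, ids = index_ip.search(np.array(query_emb).astype("float32"), 3)
print(scores[0], ids[0])   # scores are cosine similarities in [-1, 1]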
-
Microsoft AI Introduces Claimify: A Novel LLM-based Claim-Extraction Method that Outperforms Prior Solutions to Produce More Accurate, Comprehensive, and Substantiated Claims from LLM Outputs
The widespread adoption of Large Language Models (LLMs) has significantly changed the landscape of content creation and consumption. However, it has also introduced critical challenges regarding accuracy and factual reliability. The content generated by LLMs often includes claims that lack proper verification, potentially leading to misinformation. Therefore, accurately extracting claims from these outputs for effective fact-checking has become essential, albeit challenging due to inherent ambiguities and context dependencies.
Microsoft AI Research has recently developed Claimify, an advanced claim-extraction method based on LLMs, specifically designed to enhance accuracy, comprehensiveness, and context-awareness in extracting claims from LLM outputs. Claimify addresses the limitations of existing methods by explicitly dealing with ambiguity. Unlike other approaches, it identifies sentences with multiple possible interpretations and only proceeds with claim extraction when the intended meaning is clearly determined within the given context. This careful approach ensures higher accuracy and reliability, particularly benefiting subsequent fact-checking efforts.
From a technical standpoint, Claimify employs a structured pipeline comprising three key stages: Selection, Disambiguation, and Decomposition. During the Selection stage, Claimify leverages LLMs to identify sentences that contain verifiable information, filtering out those without factual content. In the Disambiguation stage, it uniquely focuses on detecting and resolving ambiguities, such as unclear references or multiple plausible interpretations. Claims are extracted only if ambiguities can be confidently resolved. The final stage, Decomposition, involves converting each clarified sentence into precise, context-independent claims. This structured process enhances both the accuracy and completeness of the resulting claims.
In evaluations using the BingCheck dataset, which covers a broad range of topics and complex LLM-generated responses, Claimify demonstrated notable improvements over previous methods. It achieved a high entailment rate of 99%, indicating a strong consistency between the extracted claims and the original content. Regarding coverage, Claimify captured 87.6% of verifiable content while maintaining a high precision rate of 96.7%, outperforming comparable approaches. Its systematic approach to decontextualization also ensured that essential contextual details were retained, resulting in better-grounded claims compared to prior methods.
Overall, Claimify represents a meaningful advancement in the automated extraction of reliable claims from LLM-generated content. By methodically addressing ambiguity and contextuality through a structured and careful evaluation framework, Claimify establishes a new standard for accuracy and reliability. As reliance on LLM-produced content continues to grow, tools like Claimify will play an increasingly crucial role in ensuring the trustworthiness and factual integrity of this content. Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.
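To make the three-stage pipeline concrete, here is a minimal conceptual sketch of how Selection, Disambiguation, and Decomposition could be chained around a generic LLM call. The prompts, the CANNOT_RESOLVE marker, and the llm callable are placeholders for illustration, not Microsoft's actual prompts or code.

from typing import Callable, List

def extract_claims(sentences: List[str], context: str, llm: Callable[[str], str]) -> List[str]:
    # Selection -> Disambiguation -> Decomposition, one sentence at a time
    claims: List[str] = []
    for sentence in sentences:
        # Stage 1: Selection - keep only sentences with verifiable factual content
        verdict = llm(f"Does this sentence contain verifiable factual content? Answer yes/no.\n{sentence}")
        if verdict.strip().lower() != "yes":
            continue
        # Stage 2: Disambiguation - skip sentences whose meaning cannot be resolved from context
        resolved = llm(f"Context: {context}\nRewrite the sentence with all references resolved, "
                       f"or answer CANNOT_RESOLVE if it stays ambiguous.\n{sentence}")
        if "CANNOT_RESOLVE" in resolved:
            continue
        # Stage 3: Decomposition - split into self-contained, context-independent claims
        decomposed = llm(f"List each atomic, self-contained factual claim on its own line.\n{resolved}")
        claims.extend(line.strip() for line in decomposed.splitlines() if line.strip())
    return claims

# Example with a trivial stub standing in for a real model
stub = lambda p: "yes" if "yes/no" in p else "Paris is the capital of France."
print(extract_claims(["Paris is the capital of France."], "", stub))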
-
NVIDIA AI Just Open Sourced Canary 1B and 180M Flash Multilingual Speech Recognition and Translation Models
In the realm of artificial intelligence, multilingual speech recognition and translation have become essential tools for facilitating global communication. However, developing models that can accurately transcribe and translate multiple languages in real-time presents significant challenges. These challenges include managing diverse linguistic nuances, maintaining high accuracy, ensuring low latency, and deploying models efficiently across various devices.
To address these challenges, NVIDIA AI has open-sourced two models: Canary 1B Flash and Canary 180M Flash. These models are designed for multilingual speech recognition and translation, supporting languages such as English, German, French, and Spanish. Released under the permissive CC-BY-4.0 license, these models are available for commercial use, encouraging innovation within the AI community.
Technically, both models utilize an encoder-decoder architecture. The encoder is based on FastConformer, which efficiently processes audio features, while the Transformer Decoder handles text generation. Task-specific tokens, including <target language>, <task>, <toggle timestamps>, and <toggle PnC> (punctuation and capitalization), guide the model's output. The Canary 1B Flash model comprises 32 encoder layers and 4 decoder layers, totaling 883 million parameters, whereas the Canary 180M Flash model consists of 17 encoder layers and 4 decoder layers, amounting to 182 million parameters. This design ensures scalability and adaptability to various languages and tasks.
Performance metrics indicate that the Canary 1B Flash model achieves an inference speed exceeding 1000 RTFx on open ASR leaderboard datasets, enabling real-time processing. In English automatic speech recognition (ASR) tasks, it attains a word error rate (WER) of 1.48% on the Librispeech Clean dataset and 2.87% on the Librispeech Other dataset. For multilingual ASR, the model achieves WERs of 4.36% for German, 2.69% for Spanish, and 4.47% for French on the MLS test set. In automatic speech translation (AST) tasks, the model demonstrates robust performance with BLEU scores of 32.27 for English to German, 22.6 for English to Spanish, and 41.22 for English to French on the FLEURS test set. (Data as of March 20, 2025.)
The smaller Canary 180M Flash model also delivers impressive results, with an inference speed surpassing 1200 RTFx. It achieves a WER of 1.87% on the Librispeech Clean dataset and 3.83% on the Librispeech Other dataset for English ASR. For multilingual ASR, the model records WERs of 4.81% for German, 3.17% for Spanish, and 4.75% for French on the MLS test set. In AST tasks, it achieves BLEU scores of 28.18 for English to German, 20.47 for English to Spanish, and 36.66 for English to French on the FLEURS test set. Both models support word-level and segment-level timestamping, enhancing their utility in applications requiring precise alignment between audio and text. Their compact sizes make them suitable for on-device deployment, enabling offline processing and reducing dependency on cloud services. Moreover, their robustness leads to fewer hallucinations during translation tasks, ensuring more reliable outputs.
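The accuracy figures above are word error rates (WER) and BLEU scores. As a quick reference for how WER is computed in practice, the snippet below uses the jiwer package; the reference and hypothesis strings are invented for illustration.

import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")   # two substitutions over nine words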
The open-source release under the CC-BY-4.0 license encourages commercial utilization and further development by the community.
In conclusion, NVIDIA's open-sourcing of the Canary 1B and 180M Flash models represents a significant advancement in multilingual speech recognition and translation. Their high accuracy, real-time processing capabilities, and adaptability for on-device deployment address many existing challenges in the field. By making these models publicly available, NVIDIA not only demonstrates its commitment to advancing AI research but also empowers developers and organizations to build more inclusive and efficient communication tools.
-
A Coding Implementation to Build a Document Search Agent (DocSearchAgent) with Hugging Face, ChromaDB, and Langchainwww.marktechpost.comIn todays information-rich world, finding relevant documents quickly is crucial. Traditional keyword-based search systems often fall short when dealing with semantic meaning. This tutorial demonstrates how to build a powerful document search engine using:Hugging Faces embedding models to convert text into rich vector representationsChroma DB as our vector database for efficient similarity searchSentence transformers for high-quality text embeddingsThis implementation enables semantic search capabilities finding documents based on meaning rather than just keyword matching. By the end of this tutorial, youll have a working document search engine that can:Process and embed text documentsStore these embeddings efficientlyRetrieve the most semantically similar documents to any queryHandle a variety of document types and search needsPlease follow the detailed steps mentioned below in sequence to implement DocSearchAgent.First, we need to install the necessary libraries.!pip install chromadb sentence-transformers langchain datasetsLets start by importing the libraries well use:import osimport numpy as npimport pandas as pdfrom datasets import load_datasetimport chromadbfrom chromadb.utils import embedding_functionsfrom sentence_transformers import SentenceTransformerfrom langchain.text_splitter import RecursiveCharacterTextSplitterimport timeFor this tutorial, well use a subset of Wikipedia articles from the Hugging Face datasets library. This gives us a diverse set of documents to work with.dataset = load_dataset("wikipedia", "20220301.en", split="train[:1000]")print(f"Loaded {len(dataset)} Wikipedia articles")documents = []for i, article in enumerate(dataset): doc = { "id": f"doc_{i}", "title": article["title"], "text": article["text"], "url": article["url"] } documents.append(doc)df = pd.DataFrame(documents)df.head(3)Now, lets split our documents into smaller chunks for more granular searching:text_splitter = RecursiveCharacterTextSplitter( chunk_size=1000, chunk_overlap=200, length_function=len,)chunks = []chunk_ids = []chunk_sources = []for i, doc in enumerate(documents): doc_chunks = text_splitter.split_text(doc["text"]) chunks.extend(doc_chunks) chunk_ids.extend([f"chunk_{i}_{j}" for j in range(len(doc_chunks))]) chunk_sources.extend([doc["title"]] * len(doc_chunks))print(f"Created {len(chunks)} chunks from {len(documents)} documents")Well use a pre-trained sentence transformer model from Hugging Face to create our embeddings:model_name = "sentence-transformers/all-MiniLM-L6-v2"embedding_model = SentenceTransformer(model_name)sample_text = "This is a sample text to test our embedding model."sample_embedding = embedding_model.encode(sample_text)print(f"Embedding dimension: {len(sample_embedding)}")Now, lets set up Chroma DB, a lightweight vector database perfect for our search engine:chroma_client = chromadb.Client()embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)collection = chroma_client.create_collection( name="document_search", embedding_function=embedding_function)batch_size = 100for i in range(0, len(chunks), batch_size): end_idx = min(i + batch_size, len(chunks)) batch_ids = chunk_ids[i:end_idx] batch_chunks = chunks[i:end_idx] batch_sources = chunk_sources[i:end_idx] collection.add( ids=batch_ids, documents=batch_chunks, metadatas=[{"source": source} for source in batch_sources] ) print(f"Added 
batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1} to the collection")print(f"Total documents in collection: {collection.count()}")Now comes the exciting part searching through our documents:def search_documents(query, n_results=5): """ Search for documents similar to the query. Args: query (str): The search query n_results (int): Number of results to return Returns: dict: Search results """ start_time = time.time() results = collection.query( query_texts=[query], n_results=n_results ) end_time = time.time() search_time = end_time - start_time print(f"Search completed in {search_time:.4f} seconds") return resultsqueries = [ "What are the effects of climate change?", "History of artificial intelligence", "Space exploration missions"]for query in queries: print(f"\nQuery: {query}") results = search_documents(query) for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])): print(f"\nResult {i+1} from {metadata['source']}:") print(f"{doc[:200]}...") Lets create a simple function to provide a better user experience:def interactive_search(): """ Interactive search interface for the document search engine. """ while True: query = input("\nEnter your search query (or 'quit' to exit): ") if query.lower() == 'quit': print("Exiting search interface...") break n_results = int(input("How many results would you like? ")) results = search_documents(query, n_results) print(f"\nFound {len(results['documents'][0])} results for '{query}':") for i, (doc, metadata, distance) in enumerate(zip( results['documents'][0], results['metadatas'][0], results['distances'][0] )): relevance = 1 - distance print(f"\n--- Result {i+1} ---") print(f"Source: {metadata['source']}") print(f"Relevance: {relevance:.2f}") print(f"Excerpt: {doc[:300]}...") print("-" * 50)interactive_search()Lets add the ability to filter our search results by metadata:def filtered_search(query, filter_source=None, n_results=5): """ Search with optional filtering by source. Args: query (str): The search query filter_source (str): Optional source to filter by n_results (int): Number of results to return Returns: dict: Search results """ where_clause = {"source": filter_source} if filter_source else None results = collection.query( query_texts=[query], n_results=n_results, where=where_clause ) return resultsunique_sources = list(set(chunk_sources))print(f"Available sources for filtering: {len(unique_sources)}")print(unique_sources[:5]) if len(unique_sources) > 0: filter_source = unique_sources[0] query = "main concepts and principles" print(f"\nFiltered search for '{query}' in source '{filter_source}':") results = filtered_search(query, filter_source=filter_source) for i, doc in enumerate(results['documents'][0]): print(f"\nResult {i+1}:") print(f"{doc[:200]}...") In conclusion, we demonstrate how to build a semantic document search engine using Hugging Face embedding models and ChromaDB. The system retrieves documents based on meaning rather than just keywords by transforming text into vector representations. The implementation processes Wikipedia articles chunks them for granularity, embeds them using sentence transformers, and stores them in a vector database for efficient retrieval. The final product features interactive searching, metadata filtering, and relevance ranking.Here is the Colab Notebook. Also,dont forget to follow us onTwitterand join ourTelegram ChannelandLinkedIn Group. Dont Forget to join our80k+ ML SubReddit. Asif RazzaqWebsite| + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. 
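One practical note on the ChromaDB setup used in this tutorial: chromadb.Client() keeps the index in memory, so the embeddings are lost when the notebook restarts. In recent ChromaDB releases (0.4 and later) you can persist the collection to disk instead; the storage path below is an arbitrary choice.

import chromadb
from chromadb.utils import embedding_functions

# Persistent client writes the index to the given directory (assumed path)
client = chromadb.PersistentClient(path="./chroma_store")
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# get_or_create_collection lets you reopen the same collection on later runs
collection = client.get_or_create_collection(
    name="document_search", embedding_function=embedding_fn
)
print(collection.count())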
-
Cloning, Forking, and Merging Repositories on GitHub: A Beginner's Guide
This comprehensive guide walks you through the essential GitHub operations of cloning, forking, and merging repositories. Whether you're new to version control or looking to solidify your understanding of GitHub workflows, this tutorial will equip you with the fundamental skills needed to collaborate effectively on coding projects.
Understanding GitHub Repositories
GitHub repositories serve as central storage locations for projects, containing all files, folders, and the complete history of changes. Before diving into specific operations, it's important to understand the difference between remote repositories (hosted on GitHub) and local repositories (on your computer). Working with GitHub typically involves creating a local copy of a repository through cloning or forking, making changes, and then integrating those changes through merging.
Remote vs. Local Repositories
Repositories on GitHub are remote repositories. To work with them on your computer, you need to create local copies, which you can do by cloning or forking. The main differences are:
Remote repositories: hosted on GitHub's servers and accessible to collaborators.
Local repositories: exist on your computer, allowing you to work offline and test changes before sharing.
Cloning Repositories
Cloning creates a local copy of a repository on your computer. This is the most direct way to start working with an existing project.
What is Cloning?
When you clone a repository, you download a complete copy of the repository, including all files and commit history. This creates a connection to the original repository, allowing you to push changes back if you have write permissions.
How to Clone Using HTTPS
1. Find the repository to clone: navigate to the GitHub repository you want to clone, click the green Code button above the files list, and select the HTTPS option to get the repository URL.
2. Clone the repository using Git: open your terminal or command prompt, navigate to the directory where you want to store the repository, type the git clone command followed by the repository URL, and press Enter to begin cloning.
3. Authenticate if necessary: for private repositories, you'll need to authenticate. GitHub no longer accepts password authentication for HTTPS; use a Personal Access Token (PAT) instead, which you can generate under GitHub Settings > Developer settings > Personal access tokens.
4. Start working with the cloned repository: change into the cloned repository directory, and you can then view, edit, and work with the files. (The exact commands are summarized at the end of this guide.)
Cloning Using GitHub Desktop
If you prefer a graphical interface: in GitHub Desktop, click File > Clone Repository, select the repository source (choose from your GitHub repositories, enter a URL for any repository, or browse for a local repository), choose the local path where you want to store the repository, and click Clone to finalize the process.
Forking Repositories
Forking is creating a personal copy of someone else's repository in your GitHub account, which allows you to freely experiment with changes without affecting the original project. You should fork a repository when:
You don't have write access to the original repository.
You want to contribute to an open-source project.
You want to use someone's project as a starting point for your own work.
The Complete Forking Workflow
1. Fork the repository: navigate to the repository you want to fork, click the Fork button in the top-right corner, and wait a few seconds for GitHub to create the fork in your account.
2. Clone your forked repository: after forking, clone the repository to your local machine using the methods described earlier. This creates a local copy of your fork, not the original repository.
3. Make changes and push to your fork: make the desired changes to the local copy, commit your changes, and push the changes to your forked repository.
4. Create a pull request (optional): if you want to contribute back to the original project, create a pull request. This proposes your changes to the original repository's owner.
Understanding the Relationship
When you fork a repository, the original repository is called the upstream repository and your copy is the forked repository. These repositories are separate, allowing independent development, and you can sync changes from the upstream repository when needed.
Working with Your Repositories
After cloning or forking a repository, you'll need to make changes, commit them, and push them back to GitHub. The basic Git commands for daily work are: check the repository status, create a new branch for your changes, add your changed files (or add all changes), commit your changes, and push your changes to GitHub. (See the command summary at the end of this guide.)
Merging Repositories and Branches
Merging is Git's way of integrating changes from one branch or repository into another.
Understanding Git Merge
Git merge combines multiple sequences of commits into one unified history. In typical scenarios, merging is used to combine two branches. When merging, Git finds a common base commit between the branches and creates a new merge commit that combines the changes. This merge commit has two parent commits (unlike regular commits).
How to Merge Branches
1. Check out the target branch.
2. Ensure your branch is up to date.
3. Merge the source branch.
4. Handle any merge conflicts: if Git encounters conflicting changes, it will mark them in the affected files. Edit these files to resolve the conflicts; after resolving, add the files and commit the merge.
Creating and Managing Pull Requests
Pull requests are the primary way to contribute changes from a fork back to the original repository.
Creating a Pull Request
1. Push your changes to your fork.
2. Navigate to the original repository on GitHub.
3. Click Pull Requests and then New Pull Request.
4. Select the base repository/branch and your fork/branch.
5. Review your changes and create the pull request: add a title and description explaining what changes you've made and why.
Merging a Pull Request
If you own the repository or have write access:
1. Review the pull request: check the code changes, run tests if applicable, and consider feedback from other collaborators.
2. Merge the pull request: on GitHub, navigate to the pull request and click Merge pull request if everything looks good. For repositories with merge queues, you can click Merge when ready.
Best Practices and Tips
Workflow recommendations:
Always create branches for new features: keep the main branch clean and stable, and create feature branches for new development.
Pull before you push: always pull the latest changes before pushing your own; this reduces merge conflicts.
Write clear commit messages: use descriptive messages that explain why changes were made, following the convention of a short title and a longer description if needed.
Common pitfalls to avoid:
Working directly on the main branch: this can cause conflicts and confusion; always create feature branches for new work.
Not updating your fork regularly: your fork can become outdated if the original repository changes, so learn how to sync your fork with the upstream repository.
Pushing large binary files to Git: Git is not optimized for binary files; consider Git LFS (Large File Storage) for large binary files.
Conclusion
In this guide, we covered cloning, forking, and merging repositories on GitHub, essential for collaboration and version control.
Cloning creates a local copy, forking allows independent development, and merging integrates changes efficiently. Pull requests facilitate structured contributions. Best practices include using feature branches, keeping repositories updated, and writing clear commit messages. By following these workflows, developers can collaborate effectively, reduce conflicts, and manage code efficiently, ensuring smooth project development and contribution to open-source or team-based projects.
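The individual shell commands referenced throughout the steps above were not shown inline, so here is a typical end-to-end sequence. The repository URL, branch name, and upstream remote name are placeholders; substitute your own.

git clone https://github.com/your-username/example-repo.git    # clone over HTTPS
cd example-repo                                                 # enter the cloned repository
git status                                                      # check the repository status
git checkout -b feature/my-change                               # create a new branch for your changes
git add path/to/changed-file                                    # stage a specific file
git add .                                                       # or stage all changes
git commit -m "Describe what changed and why"                   # commit your changes
git push origin feature/my-change                               # push the branch to GitHub
git checkout main                                               # switch to the target branch before merging
git pull origin main                                            # make sure it is up to date
git merge feature/my-change                                     # merge the source branch
git remote add upstream https://github.com/original-owner/example-repo.git   # track the original repo of a fork
git fetch upstream                                              # fetch upstream changes
git merge upstream/main                                         # sync your fork with the upstream repository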
-
IBM and Hugging Face Researchers Release SmolDocling: A 256M Open-Source Vision Language Model for Complete Document OCRwww.marktechpost.comConverting complex documents into structured data has long posed significant challenges in the field of computer science. Traditional approaches, involving ensemble systems or very large foundational models, often encounter substantial hurdles such as difficulty in fine-tuning, generalization issues, hallucinations, and high computational costs. Ensemble systems, though efficient for specific tasks, frequently fail to generalize due to their dependency on handcrafted pipelines for each sub-task. On the other hand, multimodal foundational models, although powerful, often suffer from high computational costs and reliability issues like hallucinations.Researchers from IBM and Hugging Face have recently addressed these challenges by releasing SmolDocling, a 256M open-source vision-language model (VLM) designed explicitly for end-to-end multi-modal document conversion tasks. Unlike larger foundational models, SmolDocling provides a streamlined solution that processes entire pages through a single model, significantly reducing complexity and computational demands. Its ultra-compact nature, at just 256 million parameters, makes it notably lightweight and resource-efficient. The researchers also developed a universal markup format called DocTags, which precisely captures page elements, their structures, and spatial contexts in a highly compact and clear form.SmolDocling leverages Hugging Faces compact SmolVLM-256M as its architecture base, which features significant reductions in computational complexity through optimized tokenization and aggressive visual feature compression methods. Its main strength lies in the innovative DocTags format, providing structured markup that distinctly separates document layout, textual content, and visual information such as equations, tables, code snippets, and charts. SmolDocling utilizes curriculum learning for efficient training, which initially involves freezing its vision encoder and gradually fine-tuning it using enriched datasets that enhance visual-semantic alignment across different document elements. Additionally, the models efficiency allows it to process entire document pages at lightning-fast speeds, averaging just 0.35 seconds per page on a consumer GPU while consuming under 500MB of VRAM.The performance data clearly positions SmolDocling at the forefront of current technologies. In comprehensive benchmark tests involving various document conversion tasks, SmolDocling outperformed substantially larger competing models. For example, in full-page document OCR tasks, SmolDocling achieved significantly better accuracy metrics, such as a notably lower edit distance (0.48) and higher F1-score (0.80), compared to models like Qwen2.5 VL (7B parameters) and Nougat (350M parameters). It also excelled in equation transcription, achieving a 0.95 F1-score, matching state-of-the-art models like GOT. Furthermore, SmolDocling set a new benchmark in code snippet recognition, demonstrating high precision and recall scores of 0.94 and 0.91 respectively.What sets SmolDocling apart from other document OCR solutions is its capability to handle diverse elements within documents, including intricate items such as code, charts, equations, and varied layouts. Its capabilities extend beyond typical scientific papers to reliably handle patents, forms, and business documentation. 
By offering comprehensive structured metadata through DocTags, SmolDocling eliminates ambiguity inherent in formats like HTML or Markdown, enhancing the downstream usability of document conversions. Its compact size enables large-scale batch processing at remarkably low resource demands, facilitating cost-effective deployments at scale.
In conclusion, SmolDocling represents a significant breakthrough in document conversion technology, demonstrating that compact models can not only compete but substantially outperform larger foundational models in crucial tasks. The researchers have successfully demonstrated how targeted training, innovative data augmentation, and novel markup formats like DocTags can overcome traditional limitations associated with size and complexity. SmolDocling's release not only sets a new standard in efficiency and versatility for OCR technologies but also provides an invaluable resource for the community through openly available datasets and a highly efficient, compact model architecture. This marks a substantial advancement in document understanding and opens up exciting new possibilities for enterprise-level applications and broader accessibility.
-
NVIDIA Open-Sources cuOpt: An AI-Powered Decision Optimization Engine, Unlocking Real-Time Optimization at an Unprecedented Scale
Every day, organizations face complex logistical challenges, from optimizing delivery routes and managing supply chains to streamlining production schedules. These tasks typically involve massive datasets and numerous variables, making manual or traditional computational methods inefficient or impractical. The pressure for businesses to improve efficiency, reduce operational costs, and enhance customer satisfaction underscores the need for more powerful optimization tools. However, many existing optimization solutions either lack real-time capabilities or come at prohibitive costs, making them inaccessible to smaller companies and individual developers.
NVIDIA announces the open-source release of cuOpt, an AI-powered decision optimization engine, making the powerful software free for developers to unlock real-time optimization at an unprecedented scale. Initially available only as proprietary software, cuOpt combines GPU acceleration with advanced algorithms to rapidly solve complex optimization problems. Now, as an open-source tool, cuOpt allows broader access, enabling businesses and developers from diverse industries, ranging from logistics to healthcare, to integrate state-of-the-art optimization solutions directly into their workflows without incurring high licensing costs.
At its core, NVIDIA cuOpt leverages parallel processing capabilities of GPUs to accelerate computations, significantly surpassing traditional CPU-based optimization methods. The software uses algorithms designed specifically to exploit GPU architecture, solving complex combinatorial optimization problems such as vehicle routing, job scheduling, and resource allocation far faster and more efficiently. By utilizing advanced heuristics and metaheuristics, including evolutionary algorithms, tabu search, and simulated annealing, cuOpt achieves substantial reductions in compute times, empowering real-time decision-making capabilities that were previously unattainable. Additionally, cuOpt integrates seamlessly with popular AI and data science frameworks, such as Python and RAPIDS, facilitating ease of use and adoption.
Real-world performance insights underscore the transformative impact of cuOpt. According to NVIDIA, enterprises using cuOpt have reported dramatic improvements in their operational efficiencies. For instance, early adopters have experienced up to 20 times faster optimization compared to conventional CPU-driven solutions. This speed enables organizations to dynamically adjust routes and schedules based on real-time data, significantly reducing operational costs and improving service delivery. Moreover, cuOpt's scalability ensures consistent performance improvements even as problem sizes grow exponentially, allowing organizations to confidently tackle increasingly complex optimization scenarios without sacrificing speed or accuracy.
In conclusion, NVIDIA's decision to open-source cuOpt represents a major milestone in democratizing advanced optimization technologies. By making this powerful tool freely available, NVIDIA has opened new doors for innovation, enabling businesses of all sizes and individual developers to leverage cutting-edge optimization capabilities. The wide availability of cuOpt encourages collaboration and continuous improvement within the community, setting a new standard in real-time decision optimization and operational excellence.
Ultimately, organizations adopting cuOpt stand to significantly enhance their efficiency, responsiveness, and overall competitive advantage in an increasingly data-driven world.Check outthe Technical details and Project Page.All credit for this research goes to the researchers of this project. Also,feel free to follow us onTwitterand dont forget to join our80k+ ML SubReddit. Asif RazzaqWebsite| + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/IBM and Hugging Face Researchers Release SmolDocling: A 256M Open-Source Vision Language Model for Complete Document OCRAsif Razzaqhttps://www.marktechpost.com/author/6flvq/ByteDance Research Releases DAPO: A Fully Open-Sourced LLM Reinforcement Learning System at ScaleAsif Razzaqhttps://www.marktechpost.com/author/6flvq/A Coding Guide to Build an Optical Character Recognition (OCR) App in Google Colab Using OpenCV and Tesseract-OCRAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Cohere Released Command A: A 111B Parameter AI Model with 256K Context Length, 23-Language Support, and 50% Cost Reduction for Enterprises0 Comments ·0 Shares ·99 Views
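To ground the kind of combinatorial problem cuOpt accelerates, the following is a minimal, CPU-only Python sketch of a simulated-annealing heuristic for a toy single-vehicle routing (TSP-style) problem. It is purely illustrative and does not use the cuOpt API; cuOpt runs far more sophisticated, GPU-parallel versions of this class of search.

# Illustrative only: a tiny simulated-annealing tour optimizer for a toy
# delivery-routing problem. This is NOT the cuOpt API; it simply shows the
# kind of combinatorial problem (vehicle routing / TSP) that cuOpt accelerates.
import math
import random

def tour_length(points, order):
    """Total round-trip distance of visiting the points in the given order."""
    total = 0.0
    for i in range(len(order)):
        x1, y1 = points[order[i]]
        x2, y2 = points[order[(i + 1) % len(order)]]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

def simulated_annealing(points, iterations=20000, start_temp=1.0):
    order = list(range(len(points)))
    random.shuffle(order)
    best, best_len = order[:], tour_length(points, order)
    current_len = best_len
    for step in range(iterations):
        temp = start_temp * (1 - step / iterations) + 1e-9
        i, j = sorted(random.sample(range(len(points)), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]  # 2-opt move
        cand_len = tour_length(points, candidate)
        # Always accept improvements; accept worse tours with a temperature-
        # dependent probability so the search can escape local optima.
        if cand_len < current_len or random.random() < math.exp((current_len - cand_len) / temp):
            order, current_len = candidate, cand_len
            if current_len < best_len:
                best, best_len = order[:], current_len
    return best, best_len

# Example: 30 random delivery stops on a 100x100 grid
stops = [(random.random() * 100, random.random() * 100) for _ in range(30)]
route, dist = simulated_annealing(stops)
print(f"Best route length found: {dist:.1f}")

Even this toy version hints at why acceleration matters: each candidate move is cheap, but large real-world instances need millions of such evaluations, which is exactly the workload GPU parallelism targets.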
-
MemQ: Enhancing Knowledge Graph Question Answering with Memory-Augmented Query Reconstruction
www.marktechpost.com

LLMs have shown strong performance in Knowledge Graph Question Answering (KGQA) by leveraging planning and interactive strategies to query knowledge graphs. Many existing approaches rely on SPARQL-based tools to retrieve information, allowing models to generate accurate answers. Some methods enhance LLMs' reasoning abilities by constructing tool-based reasoning paths, while others employ decision-making frameworks that use environmental feedback to interact with knowledge graphs. Although these strategies have improved KGQA accuracy, they often blur the distinction between tool use and actual reasoning. This confusion reduces interpretability, diminishes readability, and increases the risk of hallucinated tool invocations, where models generate incorrect or irrelevant responses due to over-reliance on parametric knowledge.

To address these limitations, researchers have explored memory-augmented techniques that provide external knowledge storage to support complex reasoning. Prior work has integrated memory modules for long-term context retention, enabling more reliable decision-making. Early KGQA methods used key-value memory and graph neural networks to infer answers, while recent LLM-based approaches leverage large-scale models for enhanced reasoning. Some strategies employ supervised fine-tuning to improve understanding, while others use discriminative techniques to mitigate hallucinations. However, existing KGQA methods still struggle to separate reasoning from tool invocation, leading to a lack of focus on logical inference.

Researchers from the Harbin Institute of Technology propose Memory-augmented Query Reconstruction (MemQ), a framework that separates reasoning from tool invocation in LLM-based KGQA. MemQ establishes a structured query memory using LLM-generated descriptions of decomposed query statements, enabling independent reasoning. This approach enhances readability by generating explicit reasoning steps and retrieving relevant memory entries based on semantic similarity. By eliminating unnecessary reliance on tools, MemQ improves interpretability and reduces hallucinated tool use. Experimental results show that MemQ achieves state-of-the-art performance on the WebQSP and CWQ benchmarks, demonstrating its effectiveness in enhancing LLM-based KGQA reasoning.

MemQ separates reasoning from tool invocation through three key tasks: memory construction, knowledge reasoning, and query reconstruction. Memory construction stores query statements with corresponding natural-language descriptions for efficient retrieval. Knowledge reasoning generates structured multi-step reasoning plans, ensuring logical progression in answering queries. Query reconstruction then retrieves relevant query statements based on semantic similarity and assembles them into a final query (a minimal illustrative sketch of this retrieval step follows this article). MemQ enhances reasoning by fine-tuning LLMs with explanation-statement pairs and uses an adaptive memory-recall strategy, outperforming prior methods on the WebQSP and CWQ benchmarks.

The experiments assess MemQ's performance on knowledge graph question answering using the WebQSP and CWQ datasets, with Hits@1 and F1 scores as evaluation metrics and tool-based baselines such as RoG and ToG for comparison. MemQ, built on Llama2-7b, outperforms previous methods, showing improved reasoning via the memory-augmented approach. Analytical experiments highlight superior structural and edge accuracy, and ablation studies confirm MemQ's effectiveness in tool utilization and reasoning stability. Additional analyses explore reasoning errors, hallucinations, data efficiency, and model universality, demonstrating its adaptability across architectures. MemQ significantly enhances structured reasoning while reducing errors in multi-step queries.

In conclusion, the study introduces MemQ, a memory-augmented framework that separates LLM reasoning from tool invocation to reduce hallucinations in KGQA. MemQ improves query reconstruction and enhances reasoning clarity by incorporating a query memory module. The approach enables natural-language reasoning while mitigating errors in tool usage. Experiments on the WebQSP and CWQ benchmarks demonstrate that MemQ outperforms existing methods, achieving state-of-the-art results. By addressing the confusion between tool utilization and reasoning, MemQ enhances the readability and accuracy of LLM-generated responses, offering a more effective approach to KGQA.
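To make the retrieval step concrete, here is a minimal, hypothetical Python sketch of the general idea described above: keep a memory of query-statement fragments keyed by natural-language descriptions, retrieve fragments by semantic similarity to each reasoning step, and assemble them into a query. The memory entries, predicates, and reasoning steps are made up for illustration; this is not the authors' implementation.

# Illustrative sketch (not MemQ's code): retrieve query statements from a memory
# by semantic similarity to LLM-generated reasoning steps, then assemble a query.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical memory: natural-language description -> query statement fragment
memory = [
    ("find the entity that directed a film", "?film ns:film.film.directed_by ?x ."),
    ("find the spouse of an entity", "?x ns:people.person.spouse_s ?answer ."),
    ("find films released in a given year", "?film ns:film.film.initial_release_date ?year ."),
]
descriptions = [description for description, _ in memory]
memory_embeddings = encoder.encode(descriptions, convert_to_tensor=True)

def retrieve_statements(reasoning_steps, memory, memory_embeddings):
    """For each reasoning step, pull the semantically closest statement from memory."""
    retrieved = []
    for step in reasoning_steps:
        step_embedding = encoder.encode(step, convert_to_tensor=True)
        scores = util.cos_sim(step_embedding, memory_embeddings)[0]
        retrieved.append(memory[int(scores.argmax())][1])
    return retrieved

# The reasoning steps would come from the fine-tuned LLM; hard-coded here.
steps = ["First, find who directed the film.", "Then, find that person's spouse."]
statements = retrieve_statements(steps, memory, memory_embeddings)
query = "SELECT ?answer WHERE { " + " ".join(statements) + " }"
print(query)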
-
Building a Retrieval-Augmented Generation (RAG) System with FAISS and Open-Source LLMs
www.marktechpost.com

Retrieval-augmented generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models (LLMs). By combining LLMs' creative generation abilities with retrieval systems' factual accuracy, RAG offers a solution to one of LLMs' most persistent challenges: hallucination.

In this tutorial, we'll build a complete RAG system using:

- FAISS (Facebook AI Similarity Search) as our vector database
- Sentence Transformers for creating high-quality embeddings
- An open-source LLM from Hugging Face (we'll use a lightweight model compatible with CPU)
- A custom knowledge base that we'll create

By the end of this tutorial, you'll have a functioning RAG system that can answer questions based on your documents with improved accuracy and relevance. This approach is valuable for building domain-specific assistants, customer support systems, or any application where grounding LLM responses in specific documents is important. Let us get started.

Step 1: Setting Up Our Environment

First, we need to install all the required libraries. For this tutorial, we'll use Google Colab.

# Install required packages
!pip install -q transformers==4.34.0
!pip install -q sentence-transformers==2.2.2
!pip install -q faiss-cpu==1.7.4
!pip install -q accelerate==0.23.0
!pip install -q einops==0.7.0
!pip install -q langchain==0.0.312
!pip install -q langchain_community
!pip install -q pypdf==3.15.1

Let's also check if we have access to a GPU, which will speed up our model inference:

import torch

# Check if GPU is available
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
else:
    print("Running on CPU. We'll use a CPU-compatible model.")

Step 2: Creating Our Knowledge Base

For this tutorial, we'll create a simple knowledge base about AI concepts. In a real-world scenario, you could import PDF documents, web pages, or databases instead.

import os
import tempfile

# Create a temporary directory for our documents
docs_dir = tempfile.mkdtemp()
print(f"Created temporary directory at {docs_dir}")

# Create sample documents about AI concepts
documents = {
    "vector_databases.txt": """
    Vector databases are specialized database systems designed to store, manage, and search vector embeddings efficiently.
    They are crucial for machine learning applications, particularly those involving natural language processing and image recognition.

    Key features of vector databases include:
    1. Fast similarity search using algorithms like HNSW, IVF, or exact search
    2. Support for various distance metrics (cosine, euclidean, dot product)
    3. Scalability for handling billions of vectors
    4. Often support for metadata filtering alongside vector search

    Popular vector databases include FAISS (Facebook AI Similarity Search), Pinecone, Weaviate, Milvus, and Chroma.
    FAISS specifically was developed by Facebook AI Research and is an open-source library for efficient similarity search.
    """,
    "embeddings.txt": """
    Embeddings are dense vector representations of data in a continuous vector space.
    They capture semantic meaning and relationships between entities by positioning similar items closer together in the vector space.

    Types of embeddings include:
    1. Word embeddings (Word2Vec, GloVe)
    2. Sentence embeddings (Universal Sentence Encoder, SBERT)
    3. Document embeddings
    4. Image embeddings
    5. Audio embeddings

    Embeddings are created through various techniques, including neural networks trained on specific tasks.
    Modern embedding models like those from OpenAI, Cohere, or Sentence Transformers can capture nuanced semantic relationships.
    The dimensionality of embeddings typically ranges from 100 to 1536 dimensions, with higher dimensions often capturing more information but requiring more storage and computation.
    """,
    "rag_systems.txt": """
    Retrieval-Augmented Generation (RAG) is an AI architecture that combines information retrieval with text generation.

    The RAG process typically works as follows:
    1. User query is converted into an embedding vector
    2. Similar documents or passages are retrieved from a knowledge base using vector similarity
    3. Retrieved content is provided as context to the language model
    4. The language model generates a response informed by both its parameters and the retrieved information

    Benefits of RAG include:
    1. Reduced hallucination compared to pure generative approaches
    2. Up-to-date information without model retraining
    3. Attribution of information sources
    4. Lower computation costs than increasing model size

    RAG systems can be enhanced through techniques like reranking, query reformulation, and hybrid search approaches.
    """
}

# Write documents to files
for filename, content in documents.items():
    with open(os.path.join(docs_dir, filename), 'w') as f:
        f.write(content)

print(f"Created {len(documents)} documents in {docs_dir}")

Step 3: Loading and Processing Documents

Now, let's load these documents and process them for our RAG system:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize a list to store our documents
all_documents = []

# Load each text file
for filename in documents.keys():
    file_path = os.path.join(docs_dir, filename)
    loader = TextLoader(file_path)
    loaded_docs = loader.load()
    all_documents.extend(loaded_docs)

print(f"Loaded {len(all_documents)} documents")

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
document_chunks = text_splitter.split_documents(all_documents)
print(f"Created {len(document_chunks)} document chunks")

# Let's look at a sample chunk
print("\nSample chunk content:")
print(document_chunks[0].page_content)
print(f"Source: {document_chunks[0].metadata}")

Step 4: Creating Embeddings

Now, let's convert our document chunks into vector embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # A good balance of speed and quality
embedding_model = SentenceTransformer(model_name)
print(f"Loaded embedding model: {model_name}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Create embeddings for all document chunks
texts = [doc.page_content for doc in document_chunks]
embeddings = embedding_model.encode(texts)
print(f"Created {len(embeddings)} embeddings with shape {embeddings.shape}")

Step 5: Building the FAISS Index

Now we'll build our FAISS index with these embeddings:

import faiss

# Get the dimensionality of our embeddings
dimension = embeddings.shape[1]

# Create a FAISS index - we'll use a simple Flat L2 index for demonstration
# For larger datasets, consider using indexes like IVF or HNSW for better performance
index = faiss.IndexFlatL2(dimension)  # L2 is Euclidean distance

# Add our vectors to the index
index.add(embeddings.astype(np.float32))  # FAISS requires float32
print(f"Created FAISS index with {index.ntotal} vectors")

# Create a mapping from index position to document chunk for retrieval
index_to_doc_chunk = {i: doc for i, doc in enumerate(document_chunks)}

Step 6: Loading a Language Model

Now let's load an open-source language model from Hugging Face. We'll use a smaller model that works well on CPU:

from transformers import AutoTokenizer, AutoModelForCausalLM

# We'll use a smaller model that works on CPU
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # Use float32 for CPU compatibility
    device_map="auto"           # Will use CPU if GPU is not available
)
print(f"Successfully loaded {model_id}")

Step 7: Creating Our RAG Pipeline

Let's create a function that combines retrieval and generation:

def rag_response(query, index, embedding_model, llm_model, llm_tokenizer, index_to_doc_map, top_k=3):
    """
    Generate a response using the RAG pattern.

    Args:
        query: The user's question
        index: FAISS index
        embedding_model: Model to create embeddings
        llm_model: Language model for generation
        llm_tokenizer: Tokenizer for the language model
        index_to_doc_map: Mapping from index positions to document chunks
        top_k: Number of documents to retrieve

    Returns:
        response: The generated response
        sources: The source documents used
    """
    # Step 1: Convert query to embedding
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding.astype(np.float32)  # Convert to float32 for FAISS

    # Step 2: Search for similar documents
    distances, indices = index.search(query_embedding, top_k)

    # Step 3: Retrieve the actual document chunks
    retrieved_docs = [index_to_doc_map[idx] for idx in indices[0]]

    # Create context from retrieved documents
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Step 4: Create prompt for the LLM (TinyLlama format)
    prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{query}
<|assistant|>"""

    # Step 5: Generate response from the LLM
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)
    generation_config = {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }

    # Generate the output
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            **generation_config
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the assistant's response (remove the prompt)
    response = generated_text.split("<|assistant|>")[-1].strip()

    # Return both the response and the sources
    sources = [(doc.page_content, doc.metadata) for doc in retrieved_docs]
    return response, sources

Step 8: Testing Our RAG System

Let's test our system with some questions:

# Define some test questions
test_questions = [
    "What is FAISS and what is it used for?",
    "How do embeddings capture semantic meaning?",
    "What are the benefits of RAG systems?",
    "How does vector search work?"
]

# Test our RAG pipeline
for question in test_questions:
    print(f"\n\n{'='*50}")
    print(f"Question: {question}")
    print(f"{'='*50}\n")
    response, sources = rag_response(
        query=question,
        index=index,
        embedding_model=embedding_model,
        llm_model=model,
        llm_tokenizer=tokenizer,
        index_to_doc_map=index_to_doc_chunk,
        top_k=2  # Retrieve the top 2 most relevant chunks
    )
    print(f"Response: {response}\n")
    print("Sources:")
    for i, (content, metadata) in enumerate(sources):
        print(f"\nSource {i+1}:")
        print(f"Metadata: {metadata}")
        print(f"Content snippet: {content[:100]}...")

Step 9: Evaluating and Improving Our RAG System

Let's implement a simple evaluation function to assess the performance of our RAG system:

def evaluate_rag_response(question, response, retrieved_sources, ground_truth_sources=None):
    """
    Simple evaluation of RAG response quality.

    Args:
        question: The query
        response: Generated response
        retrieved_sources: Sources used for generation
        ground_truth_sources: (Optional) Known correct sources

    Returns:
        evaluation metrics
    """
    # Basic metrics
    response_length = len(response.split())
    num_sources = len(retrieved_sources)

    # Simple relevance score - we'd use better methods in production
    source_relevance = []
    for content, _ in retrieved_sources:
        # Count overlapping words between question and source
        q_words = set(question.lower().split())
        s_words = set(content.lower().split())
        overlap = len(q_words.intersection(s_words))
        source_relevance.append(overlap / len(q_words) if q_words else 0)

    avg_relevance = sum(source_relevance) / len(source_relevance) if source_relevance else 0

    return {
        "response_length": response_length,
        "num_sources": num_sources,
        "source_relevance_scores": source_relevance,
        "avg_relevance": avg_relevance
    }

# Evaluate one of our previous responses
question = test_questions[0]
response, sources = rag_response(
    query=question,
    index=index,
    embedding_model=embedding_model,
    llm_model=model,
    llm_tokenizer=tokenizer,
    index_to_doc_map=index_to_doc_chunk,
    top_k=2
)

# Run evaluation
eval_results = evaluate_rag_response(question, response, sources)
print(f"\nEvaluation results for question: '{question}'")
for metric, value in eval_results.items():
    print(f"{metric}: {value}")

Step 10: Advanced RAG Techniques - Query Expansion

Let's implement query expansion to improve retrieval:

# Here's the implementation of the expand_query function:
def expand_query(original_query, llm_model, llm_tokenizer):
    """
    Generate multiple search queries from an original query to improve retrieval.

    Args:
        original_query: The user's original question
        llm_model: The language model for generating variations
        llm_tokenizer: Tokenizer for the language model

    Returns:
        List of query variations including the original
    """
    # Create a prompt for query expansion
    prompt = f"""<|system|>
You are a helpful assistant. Generate two alternative versions of the given search query.
The goal is to create variations that might help retrieve relevant information.
Only list the alternative queries, one per line. Do not include any explanations, numbering, or other text.
<|user|>
Generate alternative versions of this search query: "{original_query}"
<|assistant|>"""

    # Generate variations
    input_ids = llm_tokenizer(prompt, return_tensors="pt").input_ids.to(llm_model.device)
    with torch.no_grad():
        output = llm_model.generate(
            input_ids=input_ids,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True
        )

    # Decode the output
    generated_text = llm_tokenizer.decode(output[0], skip_special_tokens=True)

    # Extract the generated variations
    response_part = generated_text.split("<|assistant|>")[-1].strip()

    # Split the response by lines to get individual variations
    variations = [line.strip() for line in response_part.split('\n') if line.strip()]

    # Ensure we have at least some variations
    if not variations:
        variations = [original_query]

    # Add the original query and return the list with duplicates removed
    all_queries = [original_query] + variations
    return list(dict.fromkeys(all_queries))  # Remove duplicates while preserving order

Step 11: Evaluating and Improving Our expand_query Function

Let's run the expand_query function and use its query variations in an expanded retrieval pass:

# Example usage of the expand_query function
test_query = "How does FAISS help with vector search?"

# Generate query variations
expanded_queries = expand_query(
    original_query=test_query,
    llm_model=model,
    llm_tokenizer=tokenizer
)

print(f"Original Query: {test_query}")
print(f"Expanded Queries:")
for i, query in enumerate(expanded_queries):
    print(f"  {i+1}. {query}")

# Enhanced RAG with query expansion
all_retrieved_docs = []
all_scores = {}

# Retrieve documents for each query variation
for query in expanded_queries:
    # Get the query embedding
    query_embedding = embedding_model.encode([query]).astype(np.float32)

    # Search in the FAISS index
    distances, indices = index.search(query_embedding, 3)

    # Track document scores across queries (using 1/(1+distance) as the score)
    for idx, dist in zip(indices[0], distances[0]):
        score = 1.0 / (1.0 + dist)
        if idx in all_scores:
            # Take the max score if a document is retrieved by multiple query variations
            all_scores[idx] = max(all_scores[idx], score)
        else:
            all_scores[idx] = score

# Get the top documents based on scores
top_indices = sorted(all_scores.keys(), key=lambda idx: all_scores[idx], reverse=True)[:3]
expanded_retrieved_docs = [index_to_doc_chunk[idx] for idx in top_indices]

print("\nRetrieved documents using query expansion:")
for i, doc in enumerate(expanded_retrieved_docs):
    print(f"\nResult {i+1}:")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content snippet: {doc.page_content[:150]}...")

# Now use these documents with the LLM to generate a response
context = "\n\n".join([doc.page_content for doc in expanded_retrieved_docs])

# Create the prompt for the LLM
prompt = f"""<|system|>
You are a helpful AI assistant. Answer the question based only on the provided context.
If you don't know the answer based on the context, say "I don't have enough information to answer this question."

Context:
{context}
<|user|>
{test_query}
<|assistant|>"""

# Generate the response
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.95,
        do_sample=True
    )

# Extract the response
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
response = generated_text.split("<|assistant|>")[-1].strip()

print("\nFinal RAG Response with Query Expansion:")
print(response)

Output:
FAISS can handle a wide range of vector types, including text, image, and audio, and can be integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and Sklearn.

Conclusion

In this tutorial, we built a complete RAG system using FAISS as our vector database and an open-source LLM. We implemented document processing, embedding generation, and vector indexing, and integrated these components with query expansion to improve retrieval quality.

Further improvements to consider:

- Implementing query reranking with cross-encoders
- Creating a web interface using Gradio or Streamlit
- Adding metadata filtering capabilities
- Experimenting with different embedding models
- Scaling the solution with more efficient FAISS indexes such as HNSW or IVF (see the sketch after this list)
- Fine-tuning the LLM on your domain-specific data
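As a pointer for the scaling item above, here is a hedged sketch of swapping the exact flat index for FAISS's approximate HNSW index. It reuses the embeddings array from Step 4; the parameter values are common starting points, not tuned recommendations, and the rest of the pipeline stays unchanged.

# Illustrative sketch: replace IndexFlatL2 with an approximate HNSW index.
# Parameter choices below are typical starting points, not tuned recommendations.
import faiss
import numpy as np

dimension = embeddings.shape[1]

hnsw_index = faiss.IndexHNSWFlat(dimension, 32)   # 32 = neighbors per node in the HNSW graph
hnsw_index.hnsw.efConstruction = 200              # build-time search depth (higher = better graph, slower build)
hnsw_index.hnsw.efSearch = 64                     # query-time search depth (higher = better recall, slower search)

hnsw_index.add(embeddings.astype(np.float32))
print(f"HNSW index contains {hnsw_index.ntotal} vectors")

# Drop-in usage: pass hnsw_index wherever the tutorial used `index`, e.g.
# distances, indices = hnsw_index.search(query_embedding, top_k)

On our tiny toy corpus the flat index is already instant; HNSW pays off once the collection grows to hundreds of thousands or millions of chunks, trading a little recall for much faster queries.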
-
Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual Interactions
www.marktechpost.com

At NVIDIA GTC25, Gnani.ai experts unveiled groundbreaking advancements in voice AI, focusing on the development and deployment of Speech-to-Speech Foundation Models. This innovative approach promises to overcome the limitations of traditional cascaded voice AI architectures, ushering in an era of seamless, multilingual, and emotionally aware voice interactions.

The Limitations of Cascaded Architectures

The current state-of-the-art architecture powering voice agents involves a three-stage pipeline: Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). While effective, this cascaded architecture suffers from significant drawbacks, primarily latency and error propagation. A cascaded architecture has multiple blocks in the pipeline, and each block adds its own latency. The cumulative latency across these stages can range from 2.5 to 3 seconds, leading to a poor user experience. Moreover, errors introduced in the STT stage propagate through the pipeline, compounding inaccuracies. This traditional architecture also loses critical paralinguistic features such as sentiment, emotion, and tone, resulting in monotonous and emotionally flat responses.

Introducing Speech-to-Speech Foundation Models

To address these limitations, Gnani.ai presented a novel Speech-to-Speech Foundation Model. This model directly processes and generates audio, eliminating the need for intermediate text representations. The key innovation lies in training a massive audio encoder on 1.5 million hours of labeled data across 14 languages, capturing nuances of emotion, empathy, and tonality. The model employs a nested XL encoder, retrained with comprehensive data, and an input audio projector layer that maps audio features into textual embeddings (a minimal illustrative sketch of such a projector appears at the end of this article). For real-time streaming, audio and text features are interleaved, while non-streaming use cases utilize an embedding merge layer. The LLM layer, initially based on Llama 8B, was expanded to cover 14 languages, necessitating the rebuilding of tokenizers. An output projector model generates mel spectrograms, enabling the creation of hyper-personalized voices.

Key Benefits and Technical Hurdles

The Speech-to-Speech model offers several significant benefits. Firstly, it substantially reduces latency, from around 2 seconds to approximately 850-900 milliseconds for the first token output. Secondly, it enhances accuracy by fusing ASR with the LLM layer, improving performance for both short and long speech. Thirdly, the model achieves emotional awareness by capturing and modeling tonality, stress, and rate of speech. Fourthly, it enables improved interruption handling through contextual awareness, facilitating more natural interactions. Finally, the model is designed to handle low-bandwidth audio effectively, which is crucial for telephony networks.

Building this model presented several challenges, notably the massive data requirements. The team created a crowd-sourced system with 4 million users to generate emotionally rich conversational data. They also leveraged foundation models for synthetic data generation and trained on 13.5 million hours of publicly available data. The final model comprises 9 billion parameters: roughly 636 million for the audio input, 8 billion for the LLM, and 300 million for the TTS system.

NVIDIA's Role in Development

The development of this model relied heavily on the NVIDIA stack. NVIDIA NeMo was used for training encoder-decoder models, and NeMo Curator facilitated synthetic text data generation. NVIDIA EVA was employed to generate audio pairs, combining proprietary information with synthetic data.

Use Cases

Gnani.ai showcased two primary use cases: real-time language translation and customer support. The real-time language translation demo featured an AI engine facilitating a conversation between an English-speaking agent and a French-speaking customer. The customer support demo highlighted the model's ability to handle cross-lingual conversations, interruptions, and emotional nuances.

Speech-to-Speech Foundation Model

The Speech-to-Speech Foundation Model represents a significant leap forward in voice AI. By eliminating the limitations of traditional architectures, this model enables more natural, efficient, and emotionally aware voice interactions. As the technology continues to evolve, it promises to transform various industries, from customer service to global communication.
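To make the "input audio projector" idea concrete, here is a minimal PyTorch-style sketch of a projector that maps audio-encoder frames into an LLM's embedding space so they can be interleaved with text-token embeddings. It is purely illustrative: the layer shapes, dimensions, and the simple concatenation used for interleaving are assumptions, not Gnani.ai's implementation.

# Illustrative sketch only (not Gnani.ai's code): an input audio projector that
# maps audio-encoder features into an LLM embedding space. Dimensions are made up.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    def __init__(self, audio_dim=1024, llm_dim=4096):  # dimensions are assumptions
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features):
        # audio_features: (batch, num_frames, audio_dim) from the audio encoder
        return self.proj(audio_features)  # (batch, num_frames, llm_dim)

projector = AudioProjector()
audio_features = torch.randn(1, 50, 1024)    # 50 dummy encoder frames
text_embeddings = torch.randn(1, 12, 4096)   # 12 dummy text-token embeddings
audio_embeddings = projector(audio_features)

# In a streaming system, audio and text embeddings would be interleaved before
# the LLM; a simple concatenation stands in for that here.
llm_input = torch.cat([audio_embeddings, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 62, 4096])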