AI/ML Research and Dev News Platform (1 million+ monthly traffic) | 50k+ ML subreddit | Contact: Asif@marktechpost.com
Recent Updates
NVIDIA AI Releases OpenMath-Nemotron-32B and 14B-Kaggle: Advanced AI Models for Mathematical Reasoning that Secured First Place in the AIMO-2 Competition and Set New Benchmark Records

Mathematical reasoning has long presented a formidable challenge for AI, demanding not only an understanding of abstract concepts but also the ability to perform multi-step logical deductions with precision. Traditional language models, while adept at generating fluent text, often struggle when tasked with solving complex mathematical problems that require both deep domain knowledge and structured reasoning. This gap has driven research toward specialized architectures and training regimens designed to imbue models with robust mathematical capabilities. By focusing on targeted datasets and fine-tuning strategies, AI developers aim to bridge the gap between natural language understanding and formal mathematical problem-solving.

NVIDIA has introduced OpenMath-Nemotron-32B and OpenMath-Nemotron-14B-Kaggle, each meticulously engineered to excel in mathematical reasoning tasks. Building on the success of the Qwen family of transformer models, these Nemotron variants utilize large-scale fine-tuning on an extensive corpus of mathematical problems, collectively known as the OpenMathReasoning dataset. The design philosophy underlying both releases centers on maximizing accuracy across competitive benchmarks while maintaining practical considerations for inference speed and resource efficiency. By offering multiple model sizes and configurations, NVIDIA provides researchers and practitioners with a flexible toolkit for integrating advanced math capabilities into diverse applications.

OpenMath-Nemotron-32B is the flagship of the series, featuring 32.8 billion parameters and leveraging BF16 tensor operations for efficient hardware utilization. It is built by fine-tuning Qwen2.5-32B on the OpenMathReasoning dataset, a curated collection that emphasizes challenging problems drawn from mathematical Olympiads and standardized exams. The model achieves state-of-the-art results on several rigorous benchmarks, including the American Invitational Mathematics Examination (AIME) 2024 and 2025, the Harvard-MIT Mathematics Tournament (HMMT) 2024-25, and the HLE-Math series. In its tool-integrated reasoning (TIR) configuration, OpenMath-Nemotron-32B achieves an average pass@1 score of 78.4 percent on AIME24, with a majority-voting accuracy of 93.3 percent, surpassing previous top-performing models by notable margins.

To accommodate different inference scenarios, OpenMath-Nemotron-32B supports three distinct modes: chain-of-thought (CoT), tool-integrated reasoning (TIR), and generative solution selection (GenSelect). In CoT mode, the model generates intermediate reasoning steps before presenting a final answer, achieving a pass@1 accuracy of 76.5% on AIME24. When augmented with GenSelect, which produces multiple candidate solutions and selects the most consistent answer, performance improves further to a remarkable 93.3% on the same benchmark. These configurations let users balance explanation richness against answer precision, catering to research environments that require transparency as well as production settings that prioritize speed and reliability.
Complementing the 32-billion-parameter variant, NVIDIA has also released OpenMath-Nemotron-14B-Kaggle, a 14.8-billion-parameter model fine-tuned on a strategically selected subset of the OpenMathReasoning dataset to optimize for competitive performance. This version served as the cornerstone of NVIDIA's first-place solution in the AIMO-2 Kaggle competition, a contest that focused on automated problem-solving techniques for advanced mathematical challenges. By calibrating the training data to emphasize problems reflective of the competition's format and difficulty, the 14B-Kaggle model demonstrated exceptional adaptability, outpacing rival approaches and securing the top leaderboard position.

Performance benchmarks for OpenMath-Nemotron-14B-Kaggle mirror those of its larger counterpart: the model achieves a pass@1 accuracy of 73.7% on AIME24 in CoT mode, improving to 86.7% under GenSelect. On AIME25, it achieves a pass@1 rate of 57.9 percent (73.3 percent with majority voting at 64, maj@64), and on HMMT 2024-25 it attains 50.5 percent (64.8 percent maj@64). These figures highlight the model's ability to deliver high-quality solutions even with a more compact parameter footprint, making it well suited to scenarios where resource constraints or inference latency are critical factors.

Both OpenMath-Nemotron models are accompanied by an open-source pipeline, enabling full reproducibility of data generation, training procedures, and evaluation protocols. NVIDIA has integrated these workflows into its NeMo-Skills framework, providing reference implementations for CoT, TIR, and GenSelect inference modes. With example code snippets that demonstrate how to instantiate a transformer pipeline, configure dtype and device mapping, and parse model outputs (a hedged example appears after the key takeaways below), developers can rapidly prototype applications that query these models for step-by-step solutions or streamlined final answers.

Under the hood, both models are optimized to run efficiently on NVIDIA GPU architectures, ranging from Ampere to Hopper, leveraging highly tuned CUDA libraries and TensorRT optimizations. For production deployments, users can serve the models via Triton Inference Server, enabling low-latency, high-throughput integrations in web services or batch processing pipelines. The adoption of the BF16 tensor format strikes a balance between numerical precision and memory footprint, enabling these large-scale models to fit within GPU memory constraints while maintaining robust performance across various hardware platforms.

Key takeaways from the release of OpenMath-Nemotron-32B and OpenMath-Nemotron-14B-Kaggle include:

- NVIDIA's OpenMath-Nemotron series addresses the longstanding challenge of equipping language models with robust mathematical reasoning through targeted fine-tuning on the OpenMathReasoning dataset.
- The 32B-parameter variant achieves state-of-the-art accuracy on benchmarks like AIME24/25 and HMMT, offering three inference modes (CoT, TIR, GenSelect) to balance explanation richness and precision.
- The 14B-parameter "Kaggle" model, fine-tuned on a competition-focused subset, secured first place in the AIMO-2 Kaggle competition while maintaining high pass@1 scores, demonstrating efficiency in a smaller footprint.
- Both models are fully reproducible via an open-source pipeline integrated into NVIDIA's NeMo-Skills framework, with reference implementations for all inference modes.
- Optimized for NVIDIA GPUs (Ampere and Hopper), the models leverage BF16 tensor operations, CUDA libraries, TensorRT, and Triton Inference Server for low-latency, high-throughput deployments.
- Potential applications include AI-driven tutoring systems, academic competition preparation tools, and integration into scientific computing workflows requiring formal or symbolic reasoning.
- Future directions may expand to advanced university-level mathematics, multimodal inputs (e.g., handwritten equations), and tighter integration with symbolic computation engines to verify and augment generated solutions.

Check out OpenMath-Nemotron-32B and OpenMath-Nemotron-14B-Kaggle.
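The release's reference implementations belong to NeMo-Skills; as a rough illustration of what instantiating a Hugging Face transformers pipeline for CoT-style inference might look like, consider the sketch below. The model ID is assumed from the release naming and the prompt wording is illustrative, so the official model card and NeMo-Skills examples remain the authoritative reference.

```python
# Minimal sketch, not NVIDIA's reference implementation. Assumptions: the
# Hugging Face model ID follows the release name, and a plain chat prompt is
# acceptable for CoT-style inference; check the model card for the exact
# chat template and for the TIR/GenSelect harnesses.
import torch
from transformers import pipeline

model_id = "nvidia/OpenMath-Nemotron-32B"  # assumed ID based on the release name

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,  # BF16, matching the precision described above
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": (
        "Solve the following problem. Reason step by step, then state the "
        "final answer.\n\nWhat is the remainder when 7^100 is divided by 13?"
    ),
}]

outputs = generator(messages, max_new_tokens=1024)
print(outputs[0]["generated_text"][-1]["content"])
```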
Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models

Integrating long-context capabilities with visual understanding significantly enhances the potential of VLMs, particularly in domains such as robotics, autonomous driving, and healthcare. Expanding the context size enables VLMs to process extended video and text sequences, thereby enhancing temporal resolution and performance in complex tasks, such as video comprehension. However, one major limitation is the quadratic complexity of attention mechanisms during the pre-fill phase, which results in high latency before autoregressive decoding begins. This delay, known as Time-to-First-Token, makes real-world deployment of long-context VLMs challenging. Various sparse attention methods, such as Sparse Transformer, Swin Transformer, and StreamingLLM, overlook the specific sparse patterns found in VLMs with mixed modalities, thereby limiting their efficiency and effectiveness.

Unlike text-only inputs, visual and video data in VLMs demonstrate unique spatiotemporal attention structures, forming grid-like patterns due to local correlations. In mixed-modality scenarios, clear boundaries exist between different modalities, leading to distinct attention behaviors that general sparse methods fail to capture. Recent advancements, such as MInference and dynamic sparse attention approaches, aim to improve inference efficiency by adapting attention patterns online, yet these techniques often fall short in handling the intricacies of mixed-modality inputs. While vision token compression and RNN-Transformer hybrids have been explored to reduce computational load, most of these methods focus on long-video and short-text pairings, neglecting the more complex dynamics of multiturn, mixed-modality interactions, which are increasingly important in practical applications.

Researchers from the University of Surrey and Microsoft have introduced MMInference, a dynamic sparse attention method designed to accelerate the pre-filling stage of long-context VLMs. By identifying grid-like sparsity patterns in video inputs and distinct modality boundaries, MMInference applies permutation-based strategies to optimize attention computation. It dynamically constructs sparse distributions for each input and utilizes custom GPU kernels for enhanced efficiency, all without requiring modifications to existing models. Tested on benchmarks like Video QA, Captioning, and Vision-NIAH, MMInference achieved up to 8.3x speedup at 1M tokens, outperforming previous methods while maintaining high accuracy across multiple state-of-the-art VLMs.

MMInference is a framework designed to speed up the pre-filling phase of long-context vision-language models by leveraging modality-aware sparse attention. It integrates three key components: (1) intra-modality sparse patterns like Grid, A-shape, and Vertical-Slash attention; (2) cross-modality patterns such as Q-Boundary and 2D-Boundary; and (3) a modality-aware sparse attention search algorithm. Instead of dense computation, it uses dynamic sparse attention with optimized GPU kernels and efficient tensor handling. The framework dynamically identifies attention patterns and permutes tensors based on modality, enabling efficient handling of multi-modal inputs and reducing computational overhead while maintaining strong performance.
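To make the idea of modality-aware sparsity more tangible, the toy sketch below builds an attention mask in which video tokens attend through a local window plus a strided grid while text tokens attend densely, with cross-modality attention confined to the boundary region. This is only an illustration of the concept; MMInference itself selects patterns per head with a search algorithm and executes them in optimized GPU kernels, none of which is reproduced here.

```python
# Toy illustration of a modality-aware sparse attention mask; this is not
# MMInference's implementation, just a sketch of the grid/boundary idea.
import torch

def toy_modality_sparse_mask(modalities, grid_stride=4, local_window=2):
    """Build a causal attention mask for tokens labeled 'video' or 'text'.

    Video-to-video attention uses a local window plus a strided grid of
    distant keys; text-to-text attention is dense; cross-modality attention
    is limited to a small window around the modality boundary.
    """
    n = len(modalities)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for q in range(n):
        for k in range(q + 1):  # causal: keys up to the query position
            dist = q - k
            if modalities[q] != modalities[k]:
                mask[q, k] = dist <= local_window           # boundary region only
            elif modalities[q] == "text":
                mask[q, k] = True                           # dense within text
            else:
                mask[q, k] = dist <= local_window or dist % grid_stride == 0
    return mask

tokens = ["video"] * 12 + ["text"] * 4
print(toy_modality_sparse_mask(tokens).int())
```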
The study evaluates MMInference's performance and efficiency on long-video tasks, including captioning, question answering, and retrieval in both unimodal and mixed-modality settings. Experiments were conducted using state-of-the-art models, such as Llava-Video and LongVILA, with comparisons against several sparse attention baselines. Results show that MMInference achieves near full-attention performance while being more computationally efficient. It performs particularly well in the newly introduced Mixed-Modality Needle in a Haystack (MM-NIAH) task by leveraging inter-modality sparse patterns. Additionally, MMInference demonstrates significant speedups in end-to-end latency and maintains robustness across varying context lengths and input types.

In conclusion, MMInference is a modality-aware sparse attention technique designed to accelerate long-context VLMs without compromising accuracy. It employs a permutation-based grid attention pattern tailored for the spatial-temporal locality of video inputs, along with specialized handling for mixed-modality boundaries. A search algorithm identifies optimal sparse patterns per attention head, dynamically adapting to the input. The method integrates directly into current VLM pipelines without requiring model changes or fine-tuning. With optimized GPU kernels, MMInference achieves up to 8.3x acceleration during the pre-filling stage at 1M tokens across various tasks, including video QA, captioning, and mixed-modality benchmarks, while retaining full-attention performance.

Check out the Paper and Code.
OpenAI Launches gpt-image-1 API: Bringing High-Quality Image Generation to Developers

OpenAI has officially announced the release of its image generation API, powered by the gpt-image-1 model. This launch brings the multimodal capabilities of ChatGPT into the hands of developers, enabling programmatic access to image generation, an essential step for building intelligent design tools, creative applications, and multimodal agent systems. The new API supports high-quality image synthesis from natural language prompts, marking a significant integration point for generative AI workflows in production environments. Available starting today, developers can now directly interact with the same image generation model that powers ChatGPT's image creation capabilities.

Expanding the Capabilities of ChatGPT to Developers

The gpt-image-1 model is now available through the OpenAI platform, allowing developers to generate photorealistic, artistic, or highly stylized images using plain text. This follows a phased rollout of image generation features in the ChatGPT product interface and marks a critical transition toward API-first deployment. The image generation endpoint supports parameters such as:

- Prompt: Natural language description of the desired image.
- Size: Standard resolution settings (e.g., 1024x1024).
- n: Number of images to generate per prompt.
- Response format: Choose between base64-encoded images or URLs.
- Style: Optionally specify image aesthetics (e.g., "vivid" or "natural").

The API follows a synchronous usage model, which means developers receive the generated image(s) in the same response, ideal for real-time interfaces like chatbots or design platforms.

Technical Overview of the API and gpt-image-1 Model

OpenAI has not yet released full architectural details about gpt-image-1, but based on public documentation, the model supports robust prompt adherence, detailed composition, and stylistic coherence across diverse image types. While it is distinct from DALL-E 3 in naming, the image quality and alignment suggest continuity in OpenAI's image generation research lineage. The API is designed to be stateless and easy to integrate:

```python
from openai import OpenAI
import base64

client = OpenAI()

prompt = """
A children's book drawing of a veterinarian using a stethoscope to listen to
the heartbeat of a baby otter.
"""

result = client.images.generate(
    model="gpt-image-1",
    prompt=prompt
)

image_base64 = result.data[0].b64_json
image_bytes = base64.b64decode(image_base64)

# Save the image to a file
with open("otter.png", "wb") as f:
    f.write(image_bytes)
```

Unlocking Developer Use Cases

By making this API available, OpenAI positions gpt-image-1 as a fundamental building block for multimodal AI development. Some key applications include:

- Generative Design Tools: Seamlessly integrate prompt-based image creation into design software for artists, marketers, and product teams.
- AI Assistants and Agents: Extend LLMs with visual generation capabilities to support richer user interaction and content composition.
- Prototyping for Games and XR: Rapidly generate environments, textures, or concept art for iterative development pipelines.
- Educational Visualizations: Generate scientific diagrams, historical reconstructions, or data illustrations on demand.

With image generation now programmable, these use cases can be scaled, personalized, and embedded directly into user-facing platforms.

Content Moderation and Responsible Use

Safety remains a core consideration.
OpenAI has implemented content filtering layers and safety classifiers around the gpt-image-1 model to mitigate risks of generating harmful, misleading, or policy-violating images. The model is subject to the same usage policies as OpenAI's text-based models, with automated moderation for prompts and generated content. Developers are encouraged to follow best practices for end-user input validation and maintain transparency in applications that include generative visual content.

Conclusion

The release of gpt-image-1 to the API marks a pivotal step in making generative vision models accessible, controllable, and production-ready. It is not just a model; it is an interface to imagination, grounded in structured, repeatable, and scalable computation. For developers building the next generation of creative software, autonomous agents, or visual storytelling tools, gpt-image-1 offers a robust foundation to bring language and imagery together in code.

Check out the Technical Details.
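Building on the snippet above, a hedged variation that exercises the batch and resolution parameters from the list earlier in the article might look like this; it uses only arguments known to exist in the OpenAI Python SDK's images.generate call (model, prompt, n, size), and the exact set of supported options should be confirmed against the current API reference.

```python
# Hedged sketch: request two 1024x1024 images and save each one.
# Assumes OPENAI_API_KEY is set in the environment, as in the snippet above.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="A watercolor illustration of a lighthouse at dawn",
    n=2,               # number of images per prompt
    size="1024x1024",  # standard square resolution
)

for i, item in enumerate(result.data):
    with open(f"lighthouse_{i}.png", "wb") as f:
        f.write(base64.b64decode(item.b64_json))
```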
Meta AI Releases Web-SSL: A Scalable and Language-Free Approach to Visual Representation Learning

In recent years, contrastive language-image models such as CLIP have established themselves as a default choice for learning vision representations, particularly in multimodal applications like Visual Question Answering (VQA) and document understanding. These models leverage large-scale image-text pairs to incorporate semantic grounding via language supervision. However, this reliance on text introduces both conceptual and practical challenges: the assumption that language is essential for multimodal performance, the complexity of acquiring aligned datasets, and the scalability limits imposed by data availability. In contrast, visual self-supervised learning (SSL), which operates without language, has historically demonstrated competitive results on classification and segmentation tasks, yet has been underutilized for multimodal reasoning due to performance gaps, especially in OCR and chart-based tasks.

To explore the capabilities of language-free visual learning at scale, Meta has released the Web-SSL family of DINO and Vision Transformer (ViT) models, ranging from 300 million to 7 billion parameters, now publicly available via Hugging Face. These models are trained exclusively on the image subset of the MetaCLIP dataset (MC-2B), a web-scale dataset comprising two billion images. This controlled setup enables a direct comparison between WebSSL and CLIP, both trained on identical data, isolating the effect of language supervision. The objective is not to replace CLIP, but to rigorously evaluate how far pure visual self-supervision can go when model and data scale are no longer limiting factors. This release represents a significant step toward understanding whether language supervision is necessary, or merely beneficial, for training high-capacity vision encoders.

Technical Architecture and Training Methodology

WebSSL encompasses two visual SSL paradigms: joint-embedding learning (via DINOv2) and masked modeling (via MAE). Each model follows a standardized training protocol using 224x224-resolution images and maintains a frozen vision encoder during downstream evaluation to ensure that observed differences are attributable solely to pretraining. Models are trained across five capacity tiers (ViT-1B to ViT-7B), using only unlabeled image data from MC-2B. Evaluation is conducted using Cambrian-1, a comprehensive 16-task VQA benchmark suite encompassing general vision understanding, knowledge-based reasoning, OCR, and chart-based interpretation. In addition, the models are natively supported in Hugging Face's transformers library, providing accessible checkpoints and seamless integration into research workflows.

Performance Insights and Scaling Behavior

Experimental results reveal several key findings:

- Scaling Model Size: WebSSL models demonstrate near log-linear improvements in VQA performance with increasing parameter count. In contrast, CLIP's performance plateaus beyond 3B parameters. WebSSL maintains competitive results across all VQA categories and shows pronounced gains in Vision-Centric and OCR & Chart tasks at larger scales.
- Data Composition Matters: By filtering the training data to include only the 1.3% of images that are text-rich, WebSSL outperforms CLIP on OCR & Chart tasks, achieving up to +13.6% gains in OCRBench and ChartQA. This suggests that the presence of visual text alone, not language labels, significantly enhances task-specific performance.
- High-Resolution Training: WebSSL models fine-tuned at 518px resolution further close the performance gap with high-resolution models like SigLIP, particularly for document-heavy tasks.
- LLM Alignment: Without any language supervision, WebSSL shows improved alignment with pretrained language models (e.g., LLaMA-3) as model size and training exposure increase. This emergent behavior implies that larger vision models implicitly learn features that correlate well with textual semantics.

Importantly, WebSSL maintains strong performance on traditional benchmarks (ImageNet-1k classification, ADE20K segmentation, NYUv2 depth estimation), and often outperforms MetaCLIP and even DINOv2 under equivalent settings.

Concluding Observations

Meta's Web-SSL study provides strong evidence that visual self-supervised learning, when scaled appropriately, is a viable alternative to language-supervised pretraining. These findings challenge the prevailing assumption that language supervision is essential for multimodal understanding. Instead, they highlight the importance of dataset composition, model scale, and careful evaluation across diverse benchmarks. The release of models ranging from 300M to 7B parameters enables broader research and downstream experimentation without the constraints of paired data or proprietary pipelines. As open-source foundations for future multimodal systems, WebSSL models represent a meaningful advancement in scalable, language-free vision learning.

Check out the Models on Hugging Face, the GitHub Page, and the Paper.
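Since the checkpoints are described as natively supported in the transformers library, a minimal loading sketch for frozen-feature extraction is shown below. The model ID is a placeholder rather than a verified checkpoint name, so substitute the actual Web-SSL identifier from Meta's Hugging Face release.

```python
# Minimal sketch for extracting frozen visual features with a Web-SSL ViT.
# The model ID below is a placeholder, not a verified checkpoint name.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/webssl-dino-vit-placeholder"  # placeholder; check the release

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frozen-encoder protocol: pool patch embeddings into a single feature vector.
features = outputs.last_hidden_state.mean(dim=1)
print(features.shape)
```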
Meet Rowboat: An Open-Source IDE for Building Complex Multi-Agent Systems

As multi-agent systems gain traction in real-world applications, from customer support automation to AI-native infrastructure, the need for a streamlined development interface has never been greater. Meet Rowboat, an open-source IDE designed to accelerate the construction, debugging, and deployment of multi-agent AI workflows. It is powered by the OpenAI Agents SDK, connects to MCP servers, and can integrate into your apps over HTTP or via the SDK. Backed by Y Combinator and tightly integrated with OpenAI's Agents SDK, Rowboat offers a unique combination of visual development, tool modularity, and real-time testing, making it a compelling platform for engineering agentic AI systems at scale.

Rethinking Multi-Agent Development

Developing multi-agent systems typically requires orchestrating interactions between multiple specialized agents, each responsible for a distinct task or capability. This often involves stitching together prompts, toolchains, and APIs, an effort that is not only tedious but error-prone. Rowboat abstracts away much of this complexity by introducing a visual, AI-assisted development environment that allows teams to define agent behavior using natural language, integrate modular toolsets, and evaluate systems through interactive testing. The IDE is built with developers and applied AI teams in mind, especially those working on domain-specific use cases in customer experience (CX), enterprise automation, and backend infrastructure.

Key Features and Architecture

1. Copilot: Natural Language-Based Agent Design

At the heart of Rowboat lies its AI-powered Copilot, a system that transforms natural language specifications into runnable multi-agent workflows. For example, users can describe, "Build an assistant for a telecom company to handle data plan upgrades and billing inquiries," and the Copilot scaffolds the entire system accordingly. This dramatically reduces the ramp-up time for teams new to multi-agent architectures.

2. Tool Integration via MCP Servers

Rowboat supports Model Context Protocol (MCP) servers, enabling seamless tool injection into agents. Developers can import tools defined in an external MCP server, assign them to individual agents within Rowboat, and trigger tool invocations through agent reasoning steps. This modular design ensures clear separation of responsibilities, enabling scalable and maintainable agent workflows.

3. Interactive Testing in the Playground

The built-in Playground offers a live testing environment where users can interact with their agents, observe system behavior, and debug tool calls. It supports step-by-step inspection of conversation history, function execution, and context propagation, critical capabilities when validating agent coordination or investigating unexpected behaviors.

4. Flexible Deployment via HTTP API and Python SDK

Rowboat isn't just a visual IDE: it ships with an HTTP API and a Python SDK, giving teams the flexibility to embed Rowboat agents into broader infrastructure. Whether you're running agents in a cloud-native microservice or embedding them in internal developer tools, the SDK provides both stateless and session-aware configurations.

Practical Use Cases

Rowboat is well suited for teams building production-grade assistant systems. Some real-world applications include:

- Financial Services: Automate credit card support, loan updates, and payment reminders using a team of domain-specific agents.
- Insurance: Assist users with claims processing, policy inquiries, and premium calculations.
- Travel & Hospitality: Handle flight updates, hotel bookings, itinerary changes, and multilingual support.
- Telecom: Support billing resolution, plan changes, SIM management, and device troubleshooting.

These scenarios benefit from decomposing tasks into specialized agents with focused tool access, exactly the design pattern that Rowboat enables.

Conclusion

Rowboat fills an important gap in the AI development ecosystem: a purpose-built environment for prototyping and managing multi-agent systems. Its intuitive design, natural language integration, and modular architecture make it more than just an IDE; it is a full development suite for agentic systems. Whether you're building a customer service assistant, a backend orchestration tool, or a custom LLM agent pipeline, Rowboat provides the foundation.

Check out the GitHub Page.
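As a purely illustrative sketch of the stateless HTTP integration path described above: the endpoint path, port, and payload shape below are assumptions, not Rowboat's documented interface, so consult the GitHub page and SDK docs for the actual API contract.

```python
# Illustrative only: the URL, route, and JSON shape are assumptions rather
# than Rowboat's documented API; adjust per the project's README.
import requests

ROWBOAT_URL = "http://localhost:3000/api/chat"  # hypothetical local endpoint

payload = {
    "workflow": "telecom-support",  # hypothetical workflow name
    "messages": [
        {"role": "user", "content": "I want to upgrade my data plan."},
    ],
}

response = requests.post(ROWBOAT_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```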
Sequential-NIAH: A Benchmark for Evaluating LLMs in Extracting Sequential Information from Long Texts

Evaluating how well LLMs handle long contexts is essential, especially for retrieving specific, relevant information embedded in lengthy inputs. Many recent LLMs, such as Gemini-1.5, GPT-4, Claude-3.5, Qwen-2.5, and others, have pushed the boundaries of context length while striving to maintain strong reasoning abilities. To assess such capabilities, benchmarks like ∞Bench, LongBench, and L-Eval have been developed. However, these often overlook the Needle-in-a-Haystack (NIAH) task, which challenges models to retrieve a few critical pieces of information from predominantly irrelevant content. Earlier benchmarks, such as RULER and Counting-Stars, offered synthetic and simplistic NIAH setups, utilizing items like passwords or symbols. NeedleBench improved this by including more realistic, semantically meaningful needles and logical reasoning questions. Yet it still lacks tasks involving the retrieval and correct ordering of sequential information, such as timestamps or procedural steps.

Efforts to enhance LLMs' long-context capabilities have employed methods like RoPE, ALiBi, and memory-based techniques, along with architectural changes seen in models like Mamba and FLASHBUTTERFLY. Modern LLMs now support extensive contexts; Gemini 1.5 and Kimi can process up to 1-2 million tokens. NIAH benchmarks assess how effectively models can extract relevant data from vast amounts of text, and NeedleBench further incorporates logical relationships to simulate real-world scenarios. Regarding evaluation, natural language generation (NLG) performance is typically assessed using metrics derived from LLMs, prompt-based evaluations, fine-tuned models, or human-LLM collaborations. While prompting alone often underperforms, fine-tuning and human-in-the-loop methods can greatly enhance evaluation accuracy and reliability.

Researchers from Tencent YouTu Lab have introduced Sequential-NIAH, a benchmark designed to assess how well LLMs retrieve sequential information, referred to as a needle, from long texts. The benchmark includes synthetic, real, and open-domain QA needles embedded in contexts ranging from 8K to 128K tokens, totaling 14,000 samples. A synthetic-data-trained evaluation model achieved 99.49% accuracy in judging the correctness and order of responses. However, tests on six popular LLMs showed the highest performance at just 63.15%, highlighting the difficulty of the task and the need for further advancement in long-context comprehension.

The Sequential-NIAH benchmark is designed to evaluate models on retrieving sequentially ordered information (needles) from long texts (haystacks). It uses three types of QA synthesis pipelines: synthetic (generated events in order), real (extracted from temporal knowledge graphs), and open-domain QA (logically ordered answers). These QA pairs are inserted into diverse, long texts sourced from the LongData Corpus, covering various domains. To construct samples, the long text is segmented, needles are randomly shuffled and embedded, and the task is framed using prompt templates. The final dataset comprises 14,000 samples, split across training, development, and test sets, in both English and Chinese. The evaluation model was tested against Claude-3.5, GPT-4o, and others on 1,960 samples, achieving a 99.49% accuracy. This outperforms GPT-4o (96.07%) and Claude-3.5 (87.09%) by significant margins.
In subsequent benchmark tests on 2,000 samples, Gemini-1.5 outperformed the other models with an accuracy of 63.15%, while GPT-4o-mini and GPT-4o performed poorly. Performance varied with text length, number of needles, QA synthesis pipeline, and language, with Gemini-1.5 maintaining stable results. A noise analysis revealed that minor perturbations had a negligible impact on accuracy, but larger shifts in needle positions reduced model consistency, particularly for Qwen-2.5 and LLaMA-3.3.

In conclusion, the Sequential-NIAH benchmark assesses LLMs on their ability to extract sequential information from lengthy texts (up to 128,000 tokens). It includes synthetic, real, and open-domain question-answering pipelines, with 14,000 samples for training, development, and testing. Despite testing popular models like Claude, GPT-4o, Gemini, LLaMA, and Qwen, none achieved high accuracy, with the best performing at 63.15%. A synthetic evaluation model achieved an accuracy of 99.49% on the test data. The benchmark also highlights the challenges of increasing context lengths and needle counts and is validated through noise-robustness tests, making it valuable for advancing LLM research.

Check out the Paper.
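To make the sample-construction procedure concrete, the sketch below embeds a few ordered "needle" events at shuffled positions in a distractor haystack and scores an answer on whether it recovers every needle in the original order. It mirrors the description above in spirit only and is not the benchmark's actual synthesis or evaluation pipeline.

```python
# Toy Sequential-NIAH-style sample: ordered needles are embedded at shuffled
# positions in a long distractor text; an answer is correct only if it lists
# all needles in their original order. Illustrative, not the real pipeline.
import random

needles = [
    "Step 1: The reactor was powered on at 09:00.",
    "Step 2: Coolant flow was increased at 09:30.",
    "Step 3: The system reached steady state at 10:15.",
]
distractors = [f"Unrelated filler sentence number {i}." for i in range(200)]

haystack = list(distractors)
positions = random.sample(range(len(haystack)), len(needles))
shuffled_needles = random.sample(needles, len(needles))
# Insert from the largest position down so earlier insertions don't shift later ones.
for pos, needle in sorted(zip(positions, shuffled_needles), reverse=True):
    haystack.insert(pos, needle)

prompt = (
    "Read the following text and list, in their correct order, every numbered "
    "step it contains:\n\n" + " ".join(haystack)
)

def answer_is_correct(answer: str, needles: list[str]) -> bool:
    """True only if every needle appears and in the original sequential order."""
    indices = [answer.find(n) for n in needles]
    return all(i >= 0 for i in indices) and indices == sorted(indices)

print(answer_is_correct(" ".join(needles), needles))  # True for a perfect answer
```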
A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows

In this tutorial, we demonstrate how to harness Crawl4AI, a modern, Python-based web crawling toolkit, to extract structured data from web pages directly within Google Colab. Leveraging the power of asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI's built-in AsyncHTTPCrawlerStrategy, we bypass the overhead of headless browsers while still parsing complex HTML via JsonCssExtractionStrategy. With just a few lines of code, you install dependencies (crawl4ai, httpx), configure HTTPCrawlerConfig to request only gzip/deflate (avoiding Brotli issues), define your CSS-to-JSON schema, and orchestrate the crawl through AsyncWebCrawler and CrawlerRunConfig. Finally, the extracted JSON data is loaded into pandas for immediate analysis or export.

What sets Crawl4AI apart is its unified API, which seamlessly switches between browser-based (Playwright) and HTTP-only strategies, its robust error-handling hooks, and its declarative extraction schemas. Unlike traditional headless-browser workflows, Crawl4AI allows you to choose the most lightweight and performant backend, making it ideal for scalable data pipelines, on-the-fly ETL in notebooks, or feeding LLMs and analytics tools with clean JSON/CSV outputs.

```python
!pip install -U crawl4ai httpx
```

First, we install (or upgrade) Crawl4AI, the core asynchronous crawling framework, alongside HTTPX. This high-performance HTTP client provides all the building blocks we need for lightweight, asynchronous web scraping directly in Colab.

```python
import asyncio, json, pandas as pd
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
```

We bring in Python's core async and data-handling modules (asyncio for concurrency, json for parsing, and pandas for tabular storage) alongside Crawl4AI's essentials: AsyncWebCrawler to drive the crawl, CrawlerRunConfig and HTTPCrawlerConfig to configure extraction and HTTP settings, AsyncHTTPCrawlerStrategy for a browser-free HTTP backend, and JsonCssExtractionStrategy to map CSS selectors into structured JSON.

```python
http_cfg = HTTPCrawlerConfig(
    method="GET",
    headers={
        "User-Agent": "crawl4ai-bot/1.0",
        "Accept-Encoding": "gzip, deflate"
    },
    follow_redirects=True,
    verify_ssl=True
)
crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)
```

Here, we instantiate an HTTPCrawlerConfig to define our HTTP crawler's behavior, using a GET request with a custom User-Agent, gzip/deflate encoding only, automatic redirects, and SSL verification. We then plug that into AsyncHTTPCrawlerStrategy, allowing Crawl4AI to drive the crawl via pure HTTP calls rather than a full browser.
```python
schema = {
    "name": "Quotes",
    "baseSelector": "div.quote",
    "fields": [
        {"name": "quote", "selector": "span.text", "type": "text"},
        {"name": "author", "selector": "small.author", "type": "text"},
        {"name": "tags", "selector": "div.tags a.tag", "type": "text"}
    ]
}

extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)
```

We define a JSON-CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), then initialize a JsonCssExtractionStrategy with that schema and wrap it in a CrawlerRunConfig so Crawl4AI knows exactly what structured data to pull on each request.

```python
async def crawl_quotes_http(max_pages=5):
    all_items = []
    async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
        for p in range(1, max_pages + 1):
            url = f"https://quotes.toscrape.com/page/{p}/"
            try:
                res = await crawler.arun(url=url, config=run_cfg)
            except Exception as e:
                print(f"❌ Page {p} failed outright: {e}")
                continue
            if not res.extracted_content:
                print(f"❌ Page {p} returned no content, skipping")
                continue
            try:
                items = json.loads(res.extracted_content)
            except Exception as e:
                print(f"❌ Page {p} JSON-parse error: {e}")
                continue
            print(f"✅ Page {p}: {len(items)} quotes")
            all_items.extend(items)
    return pd.DataFrame(all_items)
```

This asynchronous function orchestrates the HTTP-only crawl: it spins up an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, safely awaits crawler.arun(), handles any request or JSON-parsing errors, and collects the extracted quote records into a single pandas DataFrame for downstream analysis.

```python
df = asyncio.get_event_loop().run_until_complete(crawl_quotes_http(max_pages=3))
df.head()
```

Finally, we kick off the crawl_quotes_http coroutine on Colab's existing asyncio loop, fetching three pages of quotes, and then display the first few rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected.

In conclusion, by combining Google Colab's zero-config environment with Python's asynchronous ecosystem and Crawl4AI's flexible crawling strategies, we have now developed a fully automated pipeline for scraping and structuring web data in minutes. Whether you need to spin up a quick dataset of quotes, build a refreshable news-article archive, or power a RAG workflow, Crawl4AI's blend of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy delivers both simplicity and scalability. Beyond pure HTTP crawls, you can instantly pivot to Playwright-driven browser automation without rewriting your extraction logic, underscoring why Crawl4AI stands out as the go-to framework for modern, production-ready web data extraction.

Here is the Colab Notebook.
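As a small optional extension not in the original guide, the resulting DataFrame can be persisted for the downstream LLM or analytics pipelines mentioned above; the file names here are arbitrary.

```python
# Optional: persist the crawled quotes (file names are arbitrary examples).
df.to_csv("quotes.csv", index=False)           # flat CSV for spreadsheets/BI tools
df.to_json("quotes.json", orient="records")    # JSON records for LLM/RAG pipelines
```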
A New Citibank Report/Guide Shares How Agentic AI Will Reshape Finance with Autonomous Analysis and Intelligent Automation

In its latest 'Agentic AI Finance & the "Do It For Me" Economy' report, Citibank explores a significant paradigm shift underway in financial services: the rise of agentic AI. Unlike conventional AI systems that rely on prompts or rule-based instructions, agentic AI possesses autonomy, acting proactively, making decisions, and executing multi-step workflows without direct human intervention. As the industry enters what Citibank calls the "Do It For Me" (DIFM) economy, these intelligent agents could redefine every facet of finance, from compliance and risk modeling to personalized advisory services.

A New Operating System for Finance

Agentic AI is more than an evolution of generative models; it is an architectural overhaul. While generative AI creates content, agentic AI initiates and manages actions. Citibank positions this transformation as analogous to the shift from static websites to dynamic, cloud-native applications, except this time it is workflows that are becoming intelligent and adaptive. With advances in contextual memory, planning, and multi-agent coordination, banks now have the technical capability to deploy autonomous systems that not only respond but anticipate. These agents will increasingly inhabit every layer of financial operations, from client-facing digital advisors to internal compliance monitors.

Multi-Domain Applications Across Financial Services

The report outlines a detailed matrix of use cases across banking verticals:

- Retail & Wealth Management: AI agents deliver adaptive financial advice, dynamically rebalance portfolios, and automate retirement planning based on real-time economic signals and user behavior.
- Corporate Banking: Agents handle complex reconciliations, optimize loan structures, and detect anomalies in trade and payment data.
- Insurance: Autonomous systems underwrite policies based on real-time behavioral and environmental inputs, while automating claims assessments with contextual risk modeling.
- Investment Operations: Research synthesis, market surveillance, and portfolio hedging are increasingly offloaded to agents equipped with domain-specific large language models.

In every domain, agentic AI extends beyond efficiency; it creates new capabilities. For example, fraud detection systems can now leverage contextual inference rather than pattern-matching alone, significantly reducing false positives and detection latency.

A New Human-AI Collaboration Model

Citibank envisions a future where AI agents become digital colleagues, integrated into teams rather than siloed systems. These agents can handle repetitive, time-intensive tasks, freeing up human professionals to focus on higher-order reasoning and relationship management. However, this shift introduces new operational paradigms. IT departments will evolve to manage fleets of agents, ensuring that each one is properly configured, continuously monitored, and aligned with both policy and regulatory constraints. The role of compliance officers will expand from policy enforcement to supervising autonomous systems capable of interpreting and applying those policies in real time.

Governance, Risk, and the Path to Production

Despite the enthusiasm, Citibank's report does not understate the risks. Agentic AI introduces new governance challenges: Who is accountable when an autonomous agent makes a critical error? How should decisions made by AI be audited and contested?
The report emphasizes the necessity of human-in-the-loop systems, real-time oversight mechanisms, and formal agent authentication layers. It also warns that the attack surface expands considerably when AI agents are allowed to make financial decisions, interact with APIs, or hold cryptographic keys. Moreover, ethical considerations are paramount: AI agents must be transparent in how they reach decisions, especially in regulated contexts such as lending, underwriting, and portfolio management.

In its report, Citibank concludes that agentic AI will catalyze the next major transformation in finance, on par with the internet era. With nearly 37% of 2024's VC funding directed toward AI startups and a 17x increase in Big Tech references to "agentic AI," momentum is clearly building. However, wide-scale adoption will not be driven by novelty alone. It will depend on how effectively financial institutions can align these technologies with robust governance, operational readiness, and a deeper understanding of where autonomous systems can, and should, take the lead. As 2025 unfolds, agentic AI is no longer a concept confined to research labs. It is already shaping how financial institutions model risk, interact with clients, and build the next generation of intelligent infrastructure.

Check out the Full Report.
AWS Introduces SWE-PolyBench: A New Open-Source Multilingual Benchmark for Evaluating AI Coding Agents

Recent advancements in large language models (LLMs) have enabled the development of AI-based coding agents that can generate, modify, and understand software code. However, the evaluation of these systems remains limited, often constrained to synthetic or narrowly scoped benchmarks, primarily in Python. These benchmarks seldom reflect the structural and semantic diversity of real-world codebases, and as a result, many agents overfit to benchmark-specific patterns rather than demonstrating robust, transferable capabilities.

AWS Introduces SWE-PolyBench: A More Comprehensive Evaluation Framework

To address these challenges, AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. The benchmark spans 21 GitHub repositories across four widely used programming languages (Java, JavaScript, TypeScript, and Python), comprising 2,110 tasks that include bug fixes, feature implementations, and code refactorings. Unlike prior benchmarks, SWE-PolyBench incorporates real pull requests (PRs) that close actual issues and include associated test cases, allowing for verifiable evaluation. A smaller, stratified subset, SWE-PolyBench500, has also been released to support quicker experimentation while preserving task and language diversity.

Technical Structure and Evaluation Metrics

SWE-PolyBench adopts an execution-based evaluation pipeline. Each task includes a repository snapshot and a problem statement derived from a GitHub issue. The system applies the associated ground-truth patch in a containerized test environment configured for the respective language ecosystem (e.g., Maven for Java, npm for JS/TS). The benchmark then measures outcomes using two types of unit tests: fail-to-pass (F2P) and pass-to-pass (P2P). To provide a more granular assessment of coding agents, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics. These include both file-level and node-level retrieval scores, assessing the agent's ability to locate and modify relevant sections of the codebase. These metrics offer insights beyond binary pass/fail outcomes, especially for complex, multi-file modifications.

Empirical Evaluation and Observations

Three open-source coding agents (Aider, SWE-Agent, and Agentless) were adapted for SWE-PolyBench. All used Anthropic's Claude 3.5 as the underlying model and were modified to handle the multilingual, repository-level requirements of the benchmark. The evaluation revealed notable differences in performance across languages and task types. For instance, agents performed best on Python tasks (up to a 24.1% pass rate) but struggled with TypeScript (as low as 4.7%). Java, despite its higher complexity in terms of average node changes, achieved higher success rates than TypeScript, suggesting that pretraining exposure and syntax familiarity play a critical role in model performance. Performance also varied with task complexity: tasks limited to single-function or single-class changes yielded higher success rates (up to 40%), while those requiring mixed or multi-file changes saw a significant drop. Interestingly, high retrieval precision and recall, particularly for file and CST node identification, did not always translate to higher pass rates, indicating that code localization is necessary but insufficient for problem resolution.
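The CST-based node-level scores are not reproduced here, but the simpler file-level retrieval metrics follow the usual precision and recall definitions over the set of files an agent touched versus those changed by the ground-truth patch, as in the hedged sketch below (an illustration of the idea, not SWE-PolyBench's exact implementation).

```python
# Illustrative file-level retrieval metrics for a coding agent's patch versus
# the ground-truth patch; a sketch of the idea, not the benchmark's code.
def file_retrieval_scores(gold_files: set[str], predicted_files: set[str]) -> dict[str, float]:
    hits = len(gold_files & predicted_files)
    precision = hits / len(predicted_files) if predicted_files else 0.0
    recall = hits / len(gold_files) if gold_files else 0.0
    return {"precision": precision, "recall": recall}

gold = {"src/api/routes.ts", "src/api/validators.ts"}
pred = {"src/api/routes.ts", "src/utils/logger.ts"}
print(file_retrieval_scores(gold, pred))  # {'precision': 0.5, 'recall': 0.5}
```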
Conclusion: Toward Robust Evaluation of AI Coding Agents

SWE-PolyBench presents a robust and nuanced evaluation framework for coding agents, addressing key limitations in existing benchmarks. By supporting multiple programming languages, covering a wider range of task types, and incorporating syntax-aware metrics, it offers a more representative assessment of an agent's real-world applicability. The benchmark reveals that while AI agents exhibit promising capabilities, their performance remains inconsistent across languages and tasks. SWE-PolyBench provides a foundation for future research aimed at improving the generalizability, robustness, and reasoning capabilities of AI coding assistants.

Check out the AWS DevOps Blog, Hugging Face (SWE-PolyBench), and GitHub (SWE-PolyBench).
Meet Xata Agent: An Open Source Agent for Proactive PostgreSQL Monitoring, Automated Troubleshooting, and Seamless DevOps Integration

Xata Agent is an open-source AI assistant built to serve as a site reliability engineer for PostgreSQL databases. It constantly monitors logs and performance metrics, capturing signals such as slow queries, CPU and memory spikes, and abnormal connection counts, to detect emerging issues before they escalate into outages. Drawing on a curated collection of diagnostic playbooks and safe, read-only SQL routines, the agent provides concrete recommendations and can even automate routine tasks, such as vacuuming and indexing. By encapsulating years of operational expertise and pairing it with modern large language model (LLM) capabilities, Xata Agent reduces the burden on database administrators and empowers development teams to maintain high performance and availability without requiring deep Postgres specialization.

Under the hood, Xata Agent is implemented as a Next.js application utilizing the Vercel AI SDK and is written primarily in TypeScript. The repository is organized as a monorepo, with dedicated directories for the database agent frontend ('apps/dbagent'), shared libraries ('packages'), configuration files, and Docker assets. This layout streamlines the contribution process: after installing Node via the included '.nvmrc' file, a developer runs 'pnpm install' to pull dependencies, sets up a local PostgreSQL instance using Docker Compose, defines LLM credentials in a '.env.local' file, applies database migrations, and launches the development server. This turnkey developer experience makes it straightforward to iterate on both the user interface and the agent's diagnostic logic.

Deploying Xata Agent in production follows similar, straightforward steps. The team publishes Docker images for both the agent service and its companion PostgreSQL database, and provides a 'docker-compose.yml' example. Operators configure a small set of environment variables, such as the public URL and API keys for their chosen LLM provider, in an '.env.production' file. Then, a single command boots up the entire stack:

```bash
docker-compose up
```

After a brief startup phase, the agent's web interface appears at the specified address, guiding users through database onboarding, credential configuration, and initial health checks. This self-hosted model strikes a balance between autonomy and control, allowing teams to audit every component, integrate the agent with internal monitoring pipelines, and still benefit from community-driven enhancements.
Below is an illustrative snippet of a ‘docker-compose.yml’ configuration for self-hosting:

version: '3.8'
services:
  xata-agent:
    image: xataio/agent:latest
    environment:
      PUBLIC_URL: http://localhost:8080
      OPENAI_API_KEY: your_openai_api_key_here
      # Optional additional providers:
      # ANTHROPIC_API_KEY: your_anthropic_api_key_here
      # DEEPSEEK_API_KEY: your_deepseek_api_key_here
    ports:
      - "8080:8080"
  postgres:
    image: postgres:14
    environment:
      POSTGRES_USER: agent_user
      POSTGRES_PASSWORD: secure_password
      POSTGRES_DB: agent_db
    volumes:
      - db_data:/var/lib/postgresql/data
volumes:
  db_data:

For local development, the workflow looks like:

# Switch Node version
cd apps/dbagent
nvm use
# Install dependencies
pnpm install
# Copy example environment
cp .env.local.example .env.local
# Start development server
pnpm dev

In ‘.env.local’, developers supply the credentials for their LLMs and define where the frontend should connect:

OPENAI_API_KEY=sk-your-openai-key
ANTHROPIC_API_KEY=ak-your-anthropic-key
PUBLIC_URL=http://localhost:3000

A core design principle of Xata Agent is extensibility. The agent avoids hallucinations by adhering to a fixed set of human-written playbooks and non-destructive tools. Playbooks are plain English files that specify step-by-step instructions, whereas tools are TypeScript functions that encapsulate database queries or cloud-provider API calls. Integrations—such as Slack and AWS RDS—plug into the system via configuration and UI widgets, enabling the addition of new data sources and notification channels with minimal effort. Key functionalities of Xata Agent include:
- Proactive monitoring: Continuously watch logs and metrics, including CPU usage, memory pressure, and query latency, to flag anomalies early.
- Configuration tuning: Suggest adjustments to Postgres settings such as ‘shared_buffers’ and ‘work_mem’ based on workload characteristics.
- Performance troubleshooting: Investigate slow queries, identify missing indexes, and recommend indexing strategies.
- Safe diagnostics: Execute read-only SQL against system views (‘pg_stat_statements’, ‘pg_locks’) to gather context without risking data integrity.
- Cloud integration: Pull logs and metrics directly from managed services like RDS and Aurora via CloudWatch.
- Alerting and notifications: Send real-time alerts to Slack channels when critical thresholds are crossed.
- LLM flexibility: Support multiple inference engines, including OpenAI, Anthropic, and Deepseek, so organizations can optimize for security and cost.
- Playbook customization: Define new troubleshooting flows in plain English to capture proprietary best practices.
- MCP server capability: Act as a Model Context Protocol server, enabling other agents to call its tools over the network.
- Approval workflows and eval-testing: Plan to introduce governance controls for sensitive operations and automated validation of agent recommendations.
Developers can author new tools by exporting simple TypeScript functions.
For example, a tool to fetch the five slowest queries might look like:

// packages/db-tools/src/tools/checkSlowQueries.ts
import { Pool } from 'pg';
import { ToolResult } from 'xata-agent';

// Query pg_stat_statements for the five statements with the highest total execution time.
export async function checkSlowQueries(pool: Pool): Promise<ToolResult> {
  const result = await pool.query(`
    SELECT query, total_time, calls
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 5;
  `);
  return { rows: result.rows };
}

Then register it so the agent can call it:

// apps/dbagent/src/server/tools.ts
import { defineTool } from 'xata-agent';
import { checkSlowQueries } from 'db-tools';

defineTool('checkSlowQueries', {
  description: 'Retrieve the top five slowest queries from pg_stat_statements',
  execute: async ({ dbPool }) => {
    return await checkSlowQueries(dbPool);
  },
});

Playbooks tie together tools into a coherent diagnostic flow. Below is an excerpt from a YAML-style playbook for investigating slow queries:

# configs/playbooks/investigate_slow_queries.playbook.yaml
name: Investigate Slow Queries
description: Steps to identify and resolve performance bottlenecks caused by slow queries.
steps:
  - tool: getTablesAndInstanceInfo
    description: "Gather table sizes and database instance details."
  - tool: checkSlowQueries
    description: "List the top slow queries to pinpoint hotspots."
  - tool: suggestIndexes
    description: "Generate index recommendations for queries exceeding thresholds."
  - tool: evaluateVacuumStats
    description: "Check vacuum statistics to determine if table bloat is impacting performance."
  - tool: notifySlack
    description: "Alert the team in Slack if queries exceed critical latency."

To integrate with Slack, one can leverage the built-in Slack adapter:

// packages/integrations/src/slackAdapter.ts
import { SlackAdapter } from 'xata-agent/integrations';

const slack = new SlackAdapter({ webhookUrl: process.env.SLACK_WEBHOOK_URL });

export async function notifySlack({ message }: { message: string }) {
  await slack.send({
    channel: process.env.SLACK_CHANNEL,
    // Backticks are needed here so ${message} is interpolated.
    text: `Xata Agent Alert: ${message}`,
  });
}

This modular architecture, where tools, playbooks, and integrations are loosely coupled, ensures that extending the agent to support new workflows or platforms requires minimal boilerplate. For example, adding Google Cloud SQL support only involves implementing a new integration that fetches metrics via Google’s monitoring APIs and wiring it into the UI as a configuration step. Xata Agent’s roadmap reflects its commitment to evolving enterprise observability. Short-term plans include custom playbooks, which empower teams to encode domain-specific recovery procedures, and Model Context Protocol (MCP) support, allowing other agents to call Xata’s tools over the network. Mid-term enhancements include evaluation and testing harnesses to benchmark the accuracy of agent advice against historical incidents and approval workflows for potentially sensitive operations. A managed cloud edition is also in development, promising one-click integrations with popular monitoring stacks and simplified onboarding for teams without self-hosting infrastructure. A carefully engineered system prompt drives the orchestration layer that ties language models to these playbooks and tools. As highlighted in a recent commentary on AI-agent design, the agent is instructed to “Provide clear, concise, and accurate responses to questions.
Use the provided tools to get context from the PostgreSQL database to answer questions. When asked why a query is slow, call the explainQuery tool and also consider the table sizes. During the initial assessment, use the getTablesAndInstanceInfo, getPerformanceAndVacuumSettings, and getPostgresExtensions tools. When asked to run a playbook, use the getPlaybook tool to get the playbook contents. Then use the contents of the playbook as an action plan. Execute the plan step by step.” This prompt-driven architecture, which pairs LLM flexibility with deterministic tool use, demonstrates a novel “playbook” pattern for safe and reliable AI operations. By codifying best practices into reproducible playbooks, Xata Agent standardizes incident response and lowers the barrier for junior engineers to troubleshoot complex database issues. Teams leveraging the agent gain a single source of truth for operational procedures, reducing human error and enabling on-call rotations where less experienced staff can confidently handle alerts. Whether self-hosted or provided as a managed service, Xata Agent invites community contributions, peer review, and collaborative governance, ensuring that the collective expertise of the open source community continually enhances the agent’s capabilities. In conclusion, Xata Agent represents a significant advance in database observability and autonomous troubleshooting. Its combination of an extensible TypeScript monorepo, human-written playbooks, safe SQL tools, and flexible LLM integration positions it as a practical solution for modern DevOps teams. As organizations increasingly seek to automate complex infrastructure tasks, Xata Agent stands out by augmenting human expertise rather than attempting to replace it, providing clear, actionable insights and automations that help maintain PostgreSQL performance and reliability at scale. Check out the GitHub Page and Product Page.
-
WWW.MARKTECHPOST.COMNVIDIA AI Releases Describe Anything 3B: A Multimodal LLM for Fine-Grained Image and Video CaptioningChallenges in Localized Captioning for Vision-Language Models Describing specific regions within images or videos remains a persistent challenge in vision-language modeling. While general-purpose vision-language models (VLMs) perform well at generating global captions, they often fall short in producing detailed, region-specific descriptions. These limitations are amplified in video data, where models must account for temporal dynamics. Primary obstacles include a loss of fine-grained detail during visual feature extraction, insufficient annotated datasets tailored for regional description, and evaluation benchmarks that penalize accurate outputs due to incomplete reference captions. Describe Anything 3B—A Model Tailored for Localized Descriptions This AI work from NVIDIA presents Describe Anything 3B (DAM-3B), a multimodal large language model purpose-built for detailed, localized captioning across images and videos. Accompanied by DAM-3B-Video, the system accepts inputs specifying regions via points, bounding boxes, scribbles, or masks and generates contextually grounded, descriptive text. It is compatible with both static imagery and dynamic video inputs, and the models are publicly available via Hugging Face. Core Architectural Components and Model Design DAM-3B incorporates two principal innovations: a focal prompt and a localized vision backbone enhanced with gated cross-attention. The focal prompt fuses a full image with a high-resolution crop of the target region, retaining both regional detail and broader context. This dual-view input is processed by the localized vision backbone, which embeds the image and mask inputs and applies cross-attention to blend global and focal features before passing them to a large language model. These mechanisms are integrated without inflating token length, preserving computational efficiency. DAM-3B-Video extends this architecture to temporal sequences by encoding frame-wise region masks and integrating them across time. This allows region-specific descriptions to be generated for videos, even in the presence of occlusion or motion. Training Data Strategy and Evaluation Benchmarks To overcome data scarcity, NVIDIA develops the DLC-SDP pipeline—a semi-supervised data generation strategy. This two-stage process utilizes segmentation datasets and unlabeled web-scale images to curate a training corpus of 1.5 million localized examples. Region descriptions are refined using a self-training approach, producing high-quality captions. For evaluation, the team introduces DLC-Bench, which assesses description quality based on attribute-level correctness rather than rigid comparisons with reference captions. DAM-3B achieves leading performance across seven benchmarks, surpassing baselines like GPT-4o and VideoRefer. It demonstrates strong results in keyword-level (LVIS, PACO), phrase-level (Flickr30k Entities), and multi-sentence localized captioning (Ref-L4, HC-STVG). On DLC-Bench, DAM-3B achieves an average accuracy of 67.3%, outperforming other models in both detail and precision. Conclusion Describe Anything 3B addresses longstanding limitations in region-specific captioning by combining a context-aware architecture with a scalable, high-quality data pipeline. 
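To make the focal-prompt fusion concrete before the closing remarks, here is a minimal PyTorch sketch of a gated cross-attention block that blends features from a high-resolution regional crop into the global image tokens without changing the sequence length. This is an illustration only: the module name, dimensions, and the tanh-gated residual form are assumptions, not NVIDIA’s released DAM-3B code.

import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Illustrative sketch: blend focal-crop features into global image features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Gate initialized at zero so training starts from the plain global pathway.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, global_tokens: torch.Tensor, focal_tokens: torch.Tensor) -> torch.Tensor:
        # global_tokens: (B, N, dim) from the full image plus mask embedding
        # focal_tokens:  (B, M, dim) from the high-resolution crop of the target region
        q = self.norm_q(global_tokens)
        kv = self.norm_kv(focal_tokens)
        fused, _ = self.attn(q, kv, kv, need_weights=False)
        # Tanh-gated residual keeps the output token count equal to the global sequence,
        # so the downstream language model sees no extra tokens.
        return global_tokens + torch.tanh(self.gate) * fused

# Example shapes only
fusion = GatedCrossAttentionFusion(dim=1024)
out = fusion(torch.randn(2, 256, 1024), torch.randn(2, 256, 1024))
print(out.shape)  # torch.Size([2, 256, 1024])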
The model’s ability to describe localized content in both images and videos has broad applicability across domains such as accessibility tools, robotics, and video content analysis. With this release, NVIDIA provides a robust and reproducible benchmark for future research and sets a refined technical direction for the next generation of multimodal AI systems. Check out the Paper, Model on Hugging Face and Project Page.
-
WWW.MARKTECHPOST.COMLLMs Can Now Learn without Labels: Researchers from Tsinghua University and Shanghai AI Lab Introduce Test-Time Reinforcement Learning (TTRL) to Enable Self-Evolving Language Models Using Unlabeled DataDespite significant advances in reasoning capabilities through reinforcement learning (RL), most large language models (LLMs) remain fundamentally dependent on supervised data pipelines. RL frameworks such as RLHF have pushed model alignment and instruction-following performance but rely heavily on human feedback and labeled datasets. As LLMs are increasingly applied in dynamic environments—ranging from educational settings to scientific workflows—they are required to generalize beyond curated training data. However, existing models often exhibit performance gaps when confronted with distribution shifts or novel reasoning tasks. While techniques like Test-Time Scaling (TTS) and Test-Time Training (TTT) have been proposed to mitigate this, the absence of reliable reward signals during inference poses a core challenge for deploying RL in unsupervised settings. Test-Time Reinforcement Learning (TTRL): Leveraging Model Priors for Self-Adaptation Researchers from Tsinghua University and Shanghai AI Lab introduced Test-Time Reinforcement Learning (TTRL). TTRL is a training framework that applies RL during inference, using only unlabeled test data. It leverages the intrinsic priors of pre-trained language models to estimate pseudo-rewards through majority voting across sampled outputs. Instead of relying on explicit labels, TTRL constructs reward functions by aggregating multiple model-generated responses to a given query. A consensus answer, obtained via majority voting, is treated as a pseudo-label. Model responses that align with this pseudo-label are positively reinforced. This formulation transforms test-time inference into an adaptive, self-supervised learning process, allowing LLMs to improve over time without additional supervision. TTRL has a two-stage approach: Label Estimation via Majority Voting: For each prompt, the model samples multiple outputs. The most frequent prediction is treated as the estimated label. Reward Assignment and Policy Optimization: A binary reward is assigned based on whether each sampled response matches the estimated label. The model is updated using gradient-based RL algorithms (e.g., PPO or GRPO) to maximize agreement with the pseudo-labels. This approach is notable for its simplicity and compatibility with standard RL methods. The reward function, though approximate, provides sufficient learning signal when aggregated over multiple samples. Experimental setups used temperature-controlled sampling (typically temperature = 1.0), with 64 samples for voting and 16 subsampled responses for training updates. No ground-truth labels are involved at any stage. Empirical Findings across Mathematical Reasoning Tasks TTRL was evaluated on three mathematical benchmarks: AIME 2024, AMC, and MATH-500. The results are consistent across both smaller and larger models: For Qwen2.5-Math-7B, performance on AIME 2024 increased from 16.7% to 43.3% (pass@1), an improvement of 159.3% without any labeled data. On average, across the three benchmarks, the same model achieved a relative gain of 84.1%. Notably, even a smaller model, Qwen2.5-Math-1.5B, improved from 33.0% to 80.0% on MATH-500. These gains demonstrate that TTRL supports model improvement even in the absence of supervised training signals. 
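The two-stage procedure behind these gains is simple enough to sketch. The snippet below shows how majority voting over sampled completions yields a pseudo-label and how binary rewards are then assigned to a smaller training batch. It is an illustration rather than the authors’ code: sample_fn and extract_answer are hypothetical helpers, and the returned (completion, reward) pairs stand in for the inputs to a standard policy-gradient update such as PPO or GRPO.

from collections import Counter
from typing import Callable, List, Tuple

def ttrl_pseudo_rewards(
    prompt: str,
    sample_fn: Callable[[str, int], List[str]],   # hypothetical: returns N sampled completions
    extract_answer: Callable[[str], str],          # hypothetical: pulls the final answer out of a completion
    n_vote: int = 64,
    n_train: int = 16,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Estimate a pseudo-label by majority vote, then assign binary rewards."""
    # Stage 1: label estimation via majority voting over many samples.
    votes = [extract_answer(c) for c in sample_fn(prompt, n_vote)]
    pseudo_label, _ = Counter(votes).most_common(1)[0]

    # Stage 2: reward assignment for the batch used in the policy update.
    train_completions = sample_fn(prompt, n_train)
    rewarded = [(c, 1.0 if extract_answer(c) == pseudo_label else 0.0)
                for c in train_completions]
    return pseudo_label, rewarded

# Toy demo with a canned sampler; in practice the samples come from the model itself.
demo = ttrl_pseudo_rewards(
    "What is 2+2?",
    sample_fn=lambda p, n: ["The answer is 4."] * (n - 1) + ["The answer is 5."],
    extract_answer=lambda c: c.split()[-1].rstrip("."),
    n_vote=8, n_train=4,
)
print(demo)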
Moreover, TTRL often outperforms the upper bound implied by its own training signal—i.e., the accuracy of the majority-voted predictions. This suggests a self-reinforcing learning loop that can extract richer supervision from noisy consensus signals. Additional analyses showed that TTRL generalizes beyond the dataset it was applied to. When trained on one benchmark and evaluated on others, performance improvements persisted. This cross-task transfer indicates that TTRL does not lead to narrow overfitting but supports broader generalization. Conclusion: Toward Self-Adaptive and Label-Free Learning TTRL represents a novel shift in how reinforcement learning can be applied to LLMs in real-world settings. By reusing the model’s own generations as a proxy for supervision, it removes the need for expensive human annotations while enabling continual adaptation. The approach scales naturally with model size, is compatible with different RL algorithms, and shows promising robustness across tasks of varying difficulty. While this study focuses on mathematical reasoning, the underlying ideas—self-estimated supervision, test-time adaptation, and reinforcement learning without labels—may generalize to other domains. As language models increasingly encounter tasks beyond their pre-training distribution, frameworks like TTRL offer a scalable path forward. Further exploration is needed to understand the theoretical convergence properties of TTRL and to evaluate its applicability in interactive or multi-agent scenarios. Nonetheless, TTRL provides a technically sound and computationally efficient foundation for enabling LLMs to evolve continuously from their own outputs. Check out the Paper and GitHub Page.
-
WWW.MARKTECHPOST.COMMuon Optimizer Significantly Accelerates Grokking in Transformers: Microsoft Researchers Explore Optimizer Influence on Delayed GeneralizationRevisiting the Grokking Challenge In recent years, the phenomenon of grokking—where deep learning models exhibit a delayed yet sudden transition from memorization to generalization—has prompted renewed investigation into training dynamics. Initially observed in small algorithmic tasks like modular arithmetic, grokking reveals that models can reach near-perfect training accuracy while validation performance remains poor for a prolonged period. Eventually, and often abruptly, the model begins to generalize. Understanding what governs this transition is important not just for interpretability, but also for optimizing training efficiency in deep networks. Prior studies have highlighted the role of weight decay and regularization. However, the specific influence of optimizers on this process has been underexplored. Investigating Optimizer Effects on Grokking This AI paper from Microsoft examines the impact of optimizer choice on grokking behavior. Specifically, it contrasts the performance of the widely adopted AdamW optimizer with Muon, a newer optimization algorithm that incorporates spectral norm constraints and second-order information. The study investigates whether these features enable Muon to expedite the generalization phase. The experiments span seven algorithmic tasks—primarily modular arithmetic operations and parity classification—using a modern Transformer architecture. Each task is designed to reliably exhibit grokking under appropriate training conditions. The research also includes a comparative analysis of softmax variants (standard softmax, stablemax, and sparsemax) to evaluate whether output normalization plays a secondary role in modulating training dynamics. However, the core investigation centers on the optimizer. Architectural and Optimization Design The underlying model architecture adopts standard Transformer components, implemented in PyTorch. It includes multi-head self-attention, rotary positional embeddings (RoPE), RMS normalization, SiLU activations, and dropout-based regularization. Input tokens—numerical values or operators—are encoded through simple identity embeddings. The key distinction lies in the optimizer behavior: AdamW, a baseline in contemporary deep learning workflows, uses adaptive learning rates with decoupled weight decay. Muon, in contrast, applies orthogonalized gradients, enforces spectral norm constraints to stabilize training, and approximates second-order curvature for more informative updates. These mechanisms are intended to promote broader exploration during optimization, mitigate instability (e.g., “softmax collapse”), and synchronize learning progress across layers. Muon’s ability to regulate update magnitude in accordance with layer dimensions is particularly relevant in avoiding inefficient memorization pathways. Three softmax configurations—Softmax, Stablemax, and Sparsemax—are included to assess whether numerical stability or sparsity of the output distribution influences grokking. This helps ensure that the observed effects stem primarily from optimizer dynamics rather than output activation nuances. Empirical Evaluation and Results The study’s empirical protocol is methodically designed. Each optimizer-softmax-task combination is evaluated across multiple seeds to ensure statistical robustness. 
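Before turning to the results, the optimizer-side difference can be made concrete. Muon is commonly described as applying a momentum step and then orthogonalizing each weight matrix’s update, typically via a few Newton-Schulz iterations, followed by a shape-aware scale. The sketch below is illustrative only: the coefficients, scaling rule, and hyperparameters are assumptions and may differ from the configuration used in the Microsoft study.

import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a matrix to a nearby orthogonal-like matrix (illustrative)."""
    # Commonly cited quintic coefficients, treated here as an assumption.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)          # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def muon_like_step(weight: torch.Tensor, grad: torch.Tensor,
                   momentum: torch.Tensor, lr: float = 0.02, beta: float = 0.95) -> None:
    """One sketched update: momentum, orthogonalize, then scale by layer shape."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    # Shape-aware scaling keeps update magnitudes comparable across layers.
    scale = max(1.0, weight.shape[0] / weight.shape[1]) ** 0.5
    weight.add_(update, alpha=-lr * scale)

# Usage on a single weight matrix
w = torch.randn(256, 128)
muon_like_step(w, torch.randn_like(w), torch.zeros_like(w))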
Grokking is operationally defined as the first epoch where validation accuracy surpasses 95% following training accuracy stabilization. The results indicate a consistent and statistically significant advantage for Muon. On average, Muon reaches the grokking threshold in 102.89 epochs, compared to 153.09 epochs for AdamW. This difference is not only numerically large but also statistically rigorous (t = 5.0175, p ≈ 6.33e−8). Additionally, Muon demonstrates a tighter distribution of grokking epochs across all conditions, suggesting more predictable training trajectories. All tasks were conducted on NVIDIA H100 GPUs using a unified codebase and standardized configurations. Tasks include modular addition, multiplication, division, exponentiation, GCD, and a 10-bit parity task. Dataset sizes ranged from 1,024 to 9,409 examples, with training-validation splits adjusted per task to maintain consistency. Conclusion The findings provide strong evidence that optimizer geometry significantly influences the emergence of generalization in overparameterized models. By steering the optimization path through second-order-aware updates and spectral norm constraints, Muon appears to facilitate a more direct route toward discovering the underlying data structure, bypassing prolonged overfitting phases. This study underscores the broader need to consider optimization strategy as a first-class factor in neural training design. While prior work emphasized data and regularization, these results suggest that optimizer architecture itself can play a pivotal role in shaping training dynamics. Check out the Paper.
-
WWW.MARKTECHPOST.COMOpen-Source TTS Reaches New Heights: Nari Labs Releases Dia, a 1.6B Parameter Model for Real-Time Voice Cloning and Expressive Speech Synthesis on Consumer DeviceThe development of text-to-speech (TTS) systems has seen significant advancements in recent years, particularly with the rise of large-scale neural models. Yet, most high-fidelity systems remain locked behind proprietary APIs and commercial platforms. Addressing this gap, Nari Labs has released Dia, a 1.6 billion parameter TTS model under the Apache 2.0 license, providing a strong open-source alternative to closed systems such as ElevenLabs and Sesame. Technical Overview and Model Capabilities Dia is designed for high-fidelity speech synthesis, incorporating a transformer-based architecture that balances expressive prosody modeling with computational efficiency. The model supports zero-shot voice cloning, enabling it to replicate a speaker’s voice from a short reference audio clip. Unlike traditional systems that require fine-tuning for each new speaker, Dia generalizes effectively across voices without retraining. A notable technical feature of Dia is its ability to synthesize non-verbal vocalizations, such as coughing and laughter. These components are typically excluded from many standard TTS systems, yet they are critical for generating naturalistic and contextually rich audio. Dia models these sounds natively, contributing to more human-like speech output. The model also supports real-time synthesis, with optimized inference pipelines allowing it to operate on consumer-grade devices, including MacBooks. This performance characteristic is particularly valuable for developers seeking low-latency deployment without relying on cloud-based GPU servers. Deployment and Licensing Dia’s release under the Apache 2.0 license offers broad flexibility for both commercial and academic use. Developers can fine-tune the model, adapt its outputs, or integrate it into larger voice-based systems without licensing constraints. The training and inference pipeline is written in Python and integrates with standard audio processing libraries, lowering the barrier to adoption. The model weights are available directly via Hugging Face, and the repository provides a clear setup process for inference, including examples of input text-to-audio generation and voice cloning. The design favors modularity, making it easy to extend or customize components such as vocoders, acoustic models, or input preprocessing. Comparisons and Initial Reception While formal benchmarks have not been extensively published, preliminary evaluations and community tests suggest that Dia performs comparably—if not favorably—to existing commercial systems in areas such as speaker fidelity, audio clarity, and expressive variation. The inclusion of non-verbal sound support and open-source availability further distinguishes it from its proprietary counterparts. Since its release, Dia has gained significant attention within the open-source AI community, quickly reaching the top ranks on Hugging Face’s trending models. The community response highlights the growing demand for accessible, high-performance speech models that can be audited, modified, and deployed without platform dependencies. Broader Implications The release of Dia fits within a broader movement toward democratizing advanced speech technologies. 
As TTS applications expand—from accessibility tools and audiobooks to interactive agents and game development—the availability of open, high-quality voice models becomes increasingly important. By releasing Dia with an emphasis on usability, performance, and transparency, Nari Labs contributes meaningfully to the TTS research and development ecosystem. The model provides a strong baseline for future work in zero-shot voice modeling, multi-speaker synthesis, and real-time audio generation. Conclusion Dia represents a mature and technically sound contribution to the open-source TTS space. Its ability to synthesize expressive, high-quality speech—including non-verbal audio—combined with zero-shot cloning and local deployment capabilities, makes it a practical and adaptable tool for developers and researchers alike. As the field continues to evolve, models like Dia will play a central role in shaping more open, flexible, and efficient speech systems. Check out the Model on Hugging Face, GitHub Page and Demo.
-
WWW.MARKTECHPOST.COMDecoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder SharingDiffusion Transformers have demonstrated outstanding performance in image generation tasks, surpassing traditional models, including GANs and autoregressive architectures. They operate by gradually adding noise to images during a forward diffusion process and then learning to reverse this process through denoising, which helps the model approximate the underlying data distribution. Unlike the commonly used UNet-based diffusion models, Diffusion Transformers apply the transformer architecture, which has proven effective after sufficient training. However, their training process is slow and computationally intensive. A key limitation lies in their architecture: during each denoising step, the model must balance encoding low-frequency semantic information while simultaneously decoding high-frequency details using the same modules—this creates an optimization conflict between the two tasks. To address the slow training and performance bottlenecks, recent work has focused on improving the efficiency of Diffusion Transformers through various strategies. These include utilizing optimized attention mechanisms, such as linear and sparse attention, to reduce computational costs, and introducing more effective sampling techniques, including log-normal resampling and loss reweighting, to stabilize the learning process. Additionally, methods like REPA, RCG, and DoD incorporate domain-specific inductive biases, while masked modeling enforces structured feature learning, boosting the model’s reasoning capabilities. Models like DiT, SiT, SD3, Lumina, and PixArt have extended the diffusion transformer framework to advanced areas such as text-to-image and text-to-video generation. Researchers from Nanjing University and ByteDance Seed Vision introduce the Decoupled Diffusion Transformer (DDT), which separates the model into a dedicated condition encoder for semantic extraction and a velocity decoder for detailed generation. This decoupled design enables faster convergence and improved sample quality. On the ImageNet 256×256 and 512×512 benchmarks, their DDT-XL/2 model achieves state-of-the-art FID scores of 1.31 and 1.28, respectively, with up to 4× faster training. To further accelerate inference, they propose a statistical dynamic programming method that optimally shares encoder outputs across denoising steps with minimal impact on performance. The DDT introduces a condition encoder and a velocity decoder to handle low- and high-frequency components in image generation separately. The encoder extracts semantic features (zt) from noisy inputs, timesteps, and class labels, which are then used by the decoder to estimate the velocity field. To ensure consistency of zt across steps, representation alignment and decoder supervision are applied. During inference, a shared self-condition mechanism reduces computation by reusing zt at certain timesteps. A dynamic programming approach identifies the optimal timesteps for recomputing zt, minimizing performance loss while accelerating the sampling process. The researchers trained their models on 256×256 ImageNet using a batch size of 256 without gradient clipping or warm-up. Using VAE-ft-EMA and Euler sampling, they evaluated performance using FID, sFID, IS, Precision, and Recall. They built improved baselines with SwiGLU, RoPE, RMSNorm, and lognorm sampling. 
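The statistical dynamic programming step mentioned above, which decides at which denoising steps the encoder output zt is recomputed versus reused, can be illustrated with a small DP over timesteps. The cost matrix and interface below are assumptions made for illustration: cost[i][j] stands for an estimated penalty for reusing the state computed at step i through step j, which in DDT would be derived from measured statistics rather than the toy values shown here.

from functools import lru_cache
from typing import List, Tuple

def plan_encoder_recomputes(cost: List[List[float]], k: int) -> Tuple[float, List[int]]:
    """Pick up to k timesteps at which to recompute the encoder state so that
    the total reuse penalty over all denoising steps is minimal."""
    n = len(cost)

    @lru_cache(maxsize=None)
    def best(start: int, remaining: int) -> Tuple[float, Tuple[int, ...]]:
        if start == n:
            return 0.0, ()
        if remaining == 0:
            return float("inf"), ()
        best_val, best_plan = float("inf"), ()
        # Recompute at `start`, then reuse that state through step `end`.
        for end in range(start, n):
            tail_val, tail_plan = best(end + 1, remaining - 1)
            val = cost[start][end] + tail_val
            if val < best_val:
                best_val, best_plan = val, (start,) + tail_plan
        return best_val, best_plan

    total, plan = best(0, k)
    return total, list(plan)

# Toy example: 6 denoising steps, allow 3 encoder recomputations.
toy_cost = [[abs(j - i) * 0.1 for j in range(6)] for i in range(6)]
print(plan_encoder_recomputes(toy_cost, k=3))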
Their DDT models consistently outperformed prior baselines, particularly in larger sizes, and converged significantly faster than REPA. Further gains were achieved through encoder sharing strategies and careful tuning of the encoder-decoder ratio, resulting in state-of-the-art FID scores on both 256×256 and 512×512 ImageNet. In conclusion, the study presents the DDT, which addresses the optimization challenge in traditional diffusion transformers by separating semantic encoding and high-frequency decoding into distinct modules. By scaling encoder capacity relative to the decoder, DDT achieves notable performance gains, especially in larger models. The DDT-XL/2 model sets new benchmarks on ImageNet, achieving faster training convergence and lower FID scores for both 256×256 and 512×512 resolutions. Additionally, the decoupled design enables encoder sharing across denoising steps, significantly improving inference efficiency. A dynamic programming strategy further enhances this by determining optimal sharing points, maintaining image quality while reducing computational load. Check out the Paper.
-
WWW.MARKTECHPOST.COMMeet VoltAgent: A TypeScript AI Framework for Building and Orchestrating Scalable AI AgentsVoltAgent is an open-source TypeScript framework designed to streamline the creation of AI‑driven applications by offering modular building blocks and abstractions for autonomous agents. It addresses the complexity of directly working with large language models (LLMs), tool integrations, and state management by providing a core engine that handles these concerns out-of-the-box. Developers can define agents with specific roles, equip them with memory, and tie them to external tools without having to reinvent foundational code for each new project. Unlike DIY solutions that require extensive boilerplate and custom infrastructure, or no-code platforms that often impose vendor lock-in and limited extensibility, VoltAgent strikes a middle ground by giving developers full control over provider choice, prompt design, and workflow orchestration. It integrates seamlessly into existing Node.js environments, enabling teams to start small, build single assistants, and scale up to complex multi‑agent systems coordinated by supervisor agents. The Challenge of Building AI Agents Creating intelligent assistants typically involves three major pain points: Model Interaction Complexity: Managing calls to LLM APIs, handling retries, latency, and error states. Stateful Conversations: Persisting user context across sessions to achieve natural, coherent dialogues. External System Integration: Connecting to databases, APIs, and third‑party services to perform real‑world tasks. Traditional approaches either require you to write custom code for each of these layers, resulting in fragmented and hard-to-maintain repositories, or lock you into proprietary platforms that sacrifice flexibility. VoltAgent abstracts these layers into reusable packages, so developers can focus on crafting agent logic rather than plumbing. Core Architecture and Modular Packages At its core, VoltAgent consists of a Core Engine package (‘@voltagent/core’) responsible for agent lifecycle, message routing, and tool invocation. Around this core, a suite of extensible packages provides specialized features: Multi‑Agent Systems: Supervisor agents coordinate sub‑agents, delegating tasks based on custom logic and maintaining shared memory channels. Tooling & Integrations: ‘createTool’ utilities and type-safe tool definitions (via Zod schemas) enable agents to invoke HTTP APIs, database queries, or local scripts as if they were native LLM functions. Voice Interaction: The ‘@voltagent/voice’ package provides speech-to-text and text-to-speech support, enabling agents to speak and listen in real-time. Model Control Protocol (MCP): Standardized protocol support for inter‑process or HTTP‑based tool servers, facilitating vendor‑agnostic tool orchestration. Retrieval‑Augmented Generation (RAG): Integrate vector stores and retriever agents to fetch relevant context before generating responses. Memory Management: Pluggable memory providers (in-memory, LibSQL/Turso, Supabase) enable agents to retain past interactions, ensuring continuity of context. Observability & Debugging: A separate VoltAgent Console provides a visual interface for inspecting agent states, logs, and conversation flows in real-time. Getting Started: Automatic Setup VoltAgent includes a CLI tool, ‘create-voltagent-app’, to scaffold a fully configured project in seconds. 
This automatic setup prompts for your project name and preferred package manager, installs dependencies, and generates starter code, including a simple agent definition so that you can run your first AI assistant with a single command. # Using npm npm create voltagent-app@latest my-voltagent-app # Or with pnpm pnpm create voltagent-app my-voltagent-app cd my-voltagent-app npm run dev Code Source At this point, you can open the VoltAgent Console in your browser, locate your new agent, and start chatting directly in the built‑in UI. The CLI’s built‑in ‘tsx watch’ support means any code changes in ‘src/’ automatically restart the server. Manual Setup and Configuration For teams that prefer fine‑grained control over their project configuration, VoltAgent provides a manual setup path. After creating a new npm project and adding TypeScript support, developers install the core framework and any desired packages: // tsconfig.json { "compilerOptions": { "target": "ES2020", "module": "NodeNext", "outDir": "dist", "strict": true, "esModuleInterop": true }, "include": ["src"] } Code Source # Development deps npm install --save-dev typescript tsx @types/node @voltagent/cli # Framework deps npm install @voltagent/core @voltagent/vercel-ai @ai-sdk/openai zod Code Source A minimal ‘src/index.ts’ might look like this: import { VoltAgent, Agent } from "@voltagent/core"; import { VercelAIProvider } from "@voltagent/vercel-ai"; import { openai } from "@ai-sdk/openai"; // Define a simple agent const agent = new Agent({ name: "my-agent", description: "A helpful assistant that answers questions without using tools", llm: new VercelAIProvider(), model: openai("gpt-4o-mini"), }); // Initialize VoltAgent new VoltAgent({ agents: { agent }, }); Code Source Adding an ‘.env’ file with your ‘OPENAI_API_KEY’ and updating ‘package.json’ scripts to include ‘”dev”: “tsx watch –env-file=.env ./src”‘ completes the local development setup. Running ‘npm run dev’ launches the server and automatically connects to the developer console. Building Multi‑Agent Workflows Beyond single agents, VoltAgent truly shines when orchestrating complex workflows via Supervisor Agents. In this paradigm, specialized sub‑agents handle discrete tasks, such as fetching GitHub stars or contributors, while a supervisor orchestrates the sequence and aggregates results: import { Agent, VoltAgent } from "@voltagent/core"; import { VercelAIProvider } from "@voltagent/vercel-ai"; import { openai } from "@ai-sdk/openai"; const starsFetcher = new Agent({ name: "Stars Fetcher", description: "Fetches star count for a GitHub repo", llm: new VercelAIProvider(), model: openai("gpt-4o-mini"), tools: [fetchRepoStarsTool], }); const contributorsFetcher = new Agent({ name: "Contributors Fetcher", description: "Fetches contributors for a GitHub repo", llm: new VercelAIProvider(), model: openai("gpt-4o-mini"), tools: [fetchRepoContributorsTool], }); const supervisor = new Agent({ name: "Supervisor", description: "Coordinates data gathering and analysis", llm: new VercelAIProvider(), model: openai("gpt-4o-mini"), subAgents: [starsFetcher, contributorsFetcher], }); new VoltAgent({ agents: { supervisor } }); Code Source In this setup, when a user inputs a repository URL, the supervisor routes the request to each sub-agent in turn, gathers their outputs, and synthesizes a final report, demonstrating VoltAgent’s ability to structure multi-step AI pipelines with minimal boilerplate. 
Observability and Telemetry Integration Production‑grade AI systems require more than code; they demand visibility into runtime behavior, performance metrics, and error conditions. VoltAgent’s observability suite includes integrations with popular platforms like Langfuse, enabling automated export of telemetry data: import { VoltAgent } from "@voltagent/core"; import { LangfuseExporter } from "langfuse-vercel"; export const volt = new VoltAgent({ telemetry: { serviceName: "ai", enabled: true, export: { type: "custom", exporter: new LangfuseExporter({ publicKey: process.env.LANGFUSE_PUBLIC_KEY, secretKey: process.env.LANGFUSE_SECRET_KEY, baseUrl: process.env.LANGFUSE_BASEURL, }), }, }, }); Code Source This configuration wraps all agent interactions with metrics and traces, which are sent to Langfuse for real-time dashboards, alerting, and historical analysis, equipping teams to maintain service-level agreements (SLAs) and quickly diagnose issues in AI-driven workflows. VoltAgent’s versatility empowers a broad spectrum of applications: Customer Support Automation: Agents that retrieve order status, process returns, and escalate complex issues to human reps, all while maintaining conversational context. Intelligent Data Pipelines: Agents orchestrate data extraction from APIs, transform records, and push results to business intelligence dashboards, fully automated and monitored. DevOps Assistants: Agents that analyze CI/CD logs, suggest optimizations, and even trigger remediation scripts via secure tool calls. Voice‑Enabled Interfaces: Deploy agents in kiosks or mobile apps that listen to user queries and respond with synthesized speech, enhanced by memory for personalized experiences. RAG Systems: Agents that first retrieve domain‑specific documents (e.g., legal contracts, technical manuals) and then generate precise answers, blending vector search with LLM generation. Enterprise Integration: Workflow agents that coordinate across Slack, Salesforce, and internal databases, automating cross‑departmental processes with full audit trails. By abstracting common patterns, tool invocation, memory, multi‑agent coordination, and observability, VoltAgent reduces integration time from weeks to days, making it a powerful choice for teams seeking to infuse AI across products and services. In conclusion, VoltAgent reimagines AI agent development by offering a structured yet flexible framework that scales from single-agent prototypes to enterprise-level multi-agent systems. Its modular architecture, with a robust core, rich ecosystem packages, and observability tooling, allows developers to focus on domain logic rather than plumbing. Whether you’re building a chat assistant, automating complex workflows, or integrating AI into existing applications, VoltAgent provides the speed, maintainability, and control you need to bring sophisticated AI solutions to production quickly. By combining easy onboarding via ‘create-voltagent-app’, manual configuration options for power users, and deep extensibility through tools and memory providers, VoltAgent positions itself as the definitive TypeScript framework for AI agent orchestration, helping teams deliver intelligent applications with confidence and speed. Sources Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. 
With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
-
WWW.MARKTECHPOST.COMResearchers at Physical Intelligence Introduce π-0.5: A New AI Framework for Real-Time Adaptive Intelligence in Physical SystemsDesigning intelligent systems that function reliably in dynamic physical environments remains one of the more difficult frontiers in AI. While significant advances have been made in perception and planning within simulated or controlled contexts, the real world is noisy, unpredictable, and resistant to abstraction. Traditional AI systems often rely on high-level representations detached from their physical implementations, leading to inefficiencies in response time, brittleness to unexpected changes, and excessive power consumption. In contrast, humans and animals exhibit remarkable adaptability through tight sensorimotor feedback loops. Reproducing even a fraction of that adaptability in embodied systems is a substantial challenge. Physical Intelligence Introduces π-0.5: A Framework for Embodied Adaptation To address these constraints, Physical Intelligence has introduced π-0.5—a lightweight and modular framework designed to integrate perception, control, and learning directly within physical systems. As described in their recent blog post, π-0.5 serves as a foundational building block for what the team terms “physical intelligence”: systems that learn from and adapt to the physical world through constant interaction, not abstraction alone. Rather than isolating intelligence in a centralized digital core, π-0.5 distributes processing and control throughout the system in compact modules. Each module, termed a “π-node,” encapsulates sensor inputs, local actuation logic, and a small, trainable neural component. These nodes can be chained or scaled across various embodiments, from wearables to autonomous agents, and are designed to react locally before resorting to higher-level computation. This architecture reflects a core assumption of the Physical Intelligence team: cognition emerges from action—not apart from it. Technical Composition and Functional Characteristics π-0.5 combines three core elements: (1) low-latency signal processing, (2) real-time learning loops, and (3) modular hardware-software co-design. Signal processing at the π-node level is tailored to the physical embodiment—allowing for motion-specific or material-specific response strategies. Learning is handled through a minimal but effective reinforcement update rule, enabling nodes to adapt weights in response to performance signals over time. Importantly, this learning is localized: individual modules do not require centralized orchestration to evolve their behavior. A central advantage of this decentralized model is energy efficiency. By distributing computation and minimizing the need for global communication, the system reduces latency and energy draw—key factors for edge devices and embedded systems. Additionally, the modularity of π-0.5 makes it hardware-agnostic, capable of interfacing with a variety of microcontrollers, sensors, and actuators. Another technical innovation is the system’s support for tactile and kinesthetic feedback integration. π-0.5 is built to accommodate proprioceptive sensing, which enhances its capacity to maintain adaptive behavior in response to physical stress, deformation, or external forces—especially relevant for soft robotics and wearable interfaces. Preliminary Results and Application Scenarios Initial demonstrations of π-0.5 showcase its adaptability across a variety of scenarios. 
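Physical Intelligence has not released reference code for π-0.5, so the following is a purely conceptual sketch of what a single π-node could look like: a local sensor-to-actuator mapping with a small trainable weight matrix and a reward-driven local update, with no central orchestrator involved. Every name, shape, and update rule here is an assumption made for illustration, not the company’s implementation.

import numpy as np

class PiNode:
    """Conceptual sketch of a 'pi-node': local sensing, actuation, and learning."""
    def __init__(self, n_sensors: int, n_actuators: int, lr: float = 0.01, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(n_actuators, n_sensors))
        self.lr = lr
        self._last = None  # cache of (sensor_vec, action_vec) for the next update

    def act(self, sensors: np.ndarray) -> np.ndarray:
        """Map raw sensor readings to bounded actuator commands."""
        action = np.tanh(self.w @ sensors)
        self._last = (sensors, action)
        return action

    def learn(self, reward: float) -> None:
        """Local reinforcement-style update: strengthen the last sensor-action
        mapping in proportion to the performance signal; no global gradients."""
        if self._last is None:
            return
        sensors, action = self._last
        self.w += self.lr * reward * np.outer(action, sensors)

# A gripper-like node reacting to a 4-channel tactile reading.
node = PiNode(n_sensors=4, n_actuators=2)
command = node.act(np.array([0.2, 0.0, 0.8, 0.1]))
node.learn(reward=1.0)  # e.g., the grip held without slipping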
In a soft robotic gripper prototype, the inclusion of π-0.5 nodes enabled the system to self-correct grip force based on the texture and compliance of held objects—without relying on pre-programmed models or external computation. Compared to a traditional control loop, this approach yielded a 30% improvement in grip accuracy and a 25% reduction in power consumption under similar test conditions. In wearable prototypes, π-0.5 allowed for localized adaptation to different body movements, achieving smoother haptic feedback and better energy regulation during continuous use. These results highlight π-0.5’s potential not just in robotics but in augmentative human-machine interfaces, where context-sensitive responsiveness is critical. Conclusion π-0.5 marks a deliberate step away from monolithic AI architectures toward systems that closely couple intelligence with physical interaction. Rather than pursuing ever-larger centralized models, Physical Intelligence proposes a distributed, embodied approach grounded in modular design and real-time adaptation. This direction aligns with long-standing goals in cybernetics and biologically inspired computing—treating intelligence not as a product of abstraction, but as a property that emerges from constant physical engagement. As AI continues to move into real-world systems, from wearables to autonomous machines, the need for low-power, adaptive, and resilient architectures will grow. π-0.5 offers a compelling foundation for meeting these requirements, contributing to a more integrated and physically grounded conception of intelligent systems. Check out the Technical details.
-
WWW.MARKTECHPOST.COMA Coding Guide to Build an Agentic AI‑Powered Asynchronous Ticketing Assistant Using PydanticAI Agents, Pydantic v2, and SQLite DatabaseIn this tutorial, we’ll build an end‑to‑end ticketing assistant powered by Agentic AI using the PydanticAI library. We’ll define our data rules with Pydantic v2 models, store tickets in an in‑memory SQLite database, and generate unique identifiers with Python’s uuid module. Behind the scenes, two agents, one for creating tickets and one for checking status, leverage Google Gemini (via PydanticAI’s google-gla provider) to interpret your natural‑language prompts and call our custom database functions. The result is a clean, type‑safe workflow you can run immediately in Colab. !pip install --upgrade pip !pip install pydantic-ai First, these two commands update your pip installer to the latest version, bringing in new features and security patches, and then install PydanticAI. This library enables the definition of type-safe AI agents and the integration of Pydantic models with LLMs. import os from getpass import getpass if "GEMINI_API_KEY" not in os.environ: os.environ["GEMINI_API_KEY"] = getpass("Enter your Google Gemini API key: ") We check whether the GEMINI_API_KEY environment variable is already set. If not, we securely prompt you (without echoing) to enter your Google Gemini API key at runtime, then store it in os.environ so that your Agentic AI calls can authenticate automatically. !pip install nest_asyncio We install the nest_asyncio package, which lets you patch the existing asyncio event loop so that you can call async functions (or use .run_sync()) inside environments like Colab without running into “event loop already running” errors. import sqlite3 import uuid from dataclasses import dataclass from typing import Literal from pydantic import BaseModel, Field from pydantic_ai import Agent, RunContext We bring in Python’s sqlite3 for our in‑memory database and uuid to generate unique ticket IDs, use dataclass and Literal for clear dependency and type definitions, and load Pydantic’s BaseModel/Field for enforcing data schemas alongside Agent and RunContext from PydanticAI to wire up and run our conversational agents. conn = sqlite3.connect(":memory:") conn.execute(""" CREATE TABLE tickets ( ticket_id TEXT PRIMARY KEY, summary TEXT NOT NULL, severity TEXT NOT NULL, department TEXT NOT NULL, status TEXT NOT NULL ) """) conn.commit() We set up an in‑memory SQLite database and define a tickets table with columns for ticket_id, summary, severity, department, and status, then commit the schema so you have a lightweight, transient store for managing your ticket records. 
@dataclass class TicketingDependencies: """Carries our DB connection into system prompts and tools.""" db: sqlite3.Connection class CreateTicketOutput(BaseModel): ticket_id: str = Field(..., description="Unique ticket identifier") summary: str = Field(..., description="Text summary of the issue") severity: Literal["low","medium","high"] = Field(..., description="Urgency level") department: str = Field(..., description="Responsible department") status: Literal["open"] = Field("open", description="Initial ticket status") class TicketStatusOutput(BaseModel): ticket_id: str = Field(..., description="Unique ticket identifier") status: Literal["open","in_progress","resolved"] = Field(..., description="Current ticket status") Here, we define a simple TicketingDependencies dataclass to pass our SQLite connection into each agent call, and then declare two Pydantic models: CreateTicketOutput (with fields for ticket ID, summary, severity, department, and default status “open”) and TicketStatusOutput (with ticket ID and its current status). These models enforce a clear, validated structure on everything our agents return, ensuring you always receive well-formed data. create_agent = Agent( "google-gla:gemini-2.0-flash", deps_type=TicketingDependencies, output_type=CreateTicketOutput, system_prompt="You are a ticketing assistant. Use the `create_ticket` tool to log new issues." ) @create_agent.tool async def create_ticket( ctx: RunContext[TicketingDependencies], summary: str, severity: Literal["low","medium","high"], department: str ) -> CreateTicketOutput: """ Logs a new ticket in the database. """ tid = str(uuid.uuid4()) ctx.deps.db.execute( "INSERT INTO tickets VALUES (?,?,?,?,?)", (tid, summary, severity, department, "open") ) ctx.deps.db.commit() return CreateTicketOutput( ticket_id=tid, summary=summary, severity=severity, department=department, status="open" ) We create a PydanticAI Agent named’ create_agent’ that’s wired to Google Gemini and is aware of our SQLite connection (deps_type=TicketingDependencies) and output schema (CreateTicketOutput). The @create_agent.tool decorator then registers an async create_ticket function, which generates a UUID, inserts a new row into the tickets table, and returns a validated CreateTicketOutput object. status_agent = Agent( "google-gla:gemini-2.0-flash", deps_type=TicketingDependencies, output_type=TicketStatusOutput, system_prompt="You are a ticketing assistant. Use the `get_ticket_status` tool to retrieve current status." ) @status_agent.tool async def get_ticket_status( ctx: RunContext[TicketingDependencies], ticket_id: str ) -> TicketStatusOutput: """ Fetches the ticket status from the database. """ cur = ctx.deps.db.execute( "SELECT status FROM tickets WHERE ticket_id = ?", (ticket_id,) ) row = cur.fetchone() if not row: raise ValueError(f"No ticket found for ID {ticket_id!r}") return TicketStatusOutput(ticket_id=ticket_id, status=row[0]) We set up a second PydanticAI Agent, status_agent, also using the Google Gemini provider and our shared TicketingDependencies. It registers an async get_ticket_status tool that looks up a given ticket_id in the SQLite database and returns a validated TicketStatusOutput, or raises an error if the ticket isn’t found. 
deps = TicketingDependencies(db=conn) create_result = await create_agent.run( "My printer on 3rd floor shows a paper jam error.", deps=deps ) print("Created Ticket →") print(create_result.output.model_dump_json(indent=2)) tid = create_result.output.ticket_id status_result = await status_agent.run( f"What's the status of ticket {tid}?", deps=deps ) print("Ticket Status →") print(status_result.output.model_dump_json(indent=2)) Finally, we package your SQLite connection into deps, then ask the create_agent to log a new ticket via a natural‑language prompt, printing the validated ticket data as JSON. The script then takes the returned ticket_id, queries the status_agent for that ticket’s current state, and prints the status in JSON form. In conclusion, you have seen how Agentic AI and PydanticAI work together to automate a complete service process, from logging a new issue to retrieving its live status, all managed through conversational prompts. Our use of Pydantic v2 ensures every ticket matches the schema you define, while SQLite provides a lightweight backend that’s easy to replace with any database. With these tools in place, you can expand the assistant, adding new agent functions, integrating other AI models like openai:gpt-4o, or connecting real‑world APIs, confident that your data remains structured and reliable throughout. Here is the Colab Notebook.
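Since each agent above selects its backend from the model string, the conclusion's suggestion of integrating other AI models like openai:gpt-4o amounts to a small change. The following sketch is not part of the original notebook; it assumes the TicketingDependencies and CreateTicketOutput definitions above are in scope and that an OpenAI API key is available.

# A minimal sketch (not from the original tutorial): pointing the same ticketing
# agent at OpenAI instead of Gemini. Assumes OPENAI_API_KEY can be provided and
# that TicketingDependencies / CreateTicketOutput from above are in scope.
import os
from getpass import getpass
from pydantic_ai import Agent

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

openai_create_agent = Agent(
    "openai:gpt-4o",                       # only the model string changes
    deps_type=TicketingDependencies,
    output_type=CreateTicketOutput,
    system_prompt="You are a ticketing assistant. Use the `create_ticket` tool to log new issues.",
)

# Tools are registered per agent, so the create_ticket function would need to be
# registered on this agent as well, e.g. with the @openai_create_agent.tool decorator.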
-
WWW.MARKTECHPOST.COMAtla AI Introduces the Atla MCP Server: A Local Interface of Purpose-Built LLM Judges via Model Context Protocol (MCP)Reliable evaluation of large language model (LLM) outputs is a critical yet often complex aspect of AI system development. Integrating consistent and objective evaluation pipelines into existing workflows can introduce significant overhead. The Atla MCP Server addresses this by exposing Atla’s powerful LLM Judge models—designed for scoring and critique—through the Model Context Protocol (MCP). This local, standards-compliant interface enables developers to seamlessly incorporate LLM assessments into their tools and agent workflows. Model Context Protocol (MCP) as a Foundation The Model Context Protocol (MCP) is a structured interface that standardizes how LLMs interact with external tools. By abstracting tool usage behind a protocol, MCP decouples the logic of tool invocation from the model implementation itself. This design promotes interoperability: any model capable of MCP communication can use any tool that exposes an MCP-compatible interface. The Atla MCP Server builds on this protocol to expose evaluation capabilities in a way that is consistent, transparent, and easy to integrate into existing toolchains. Overview of the Atla MCP Server The Atla MCP Server is a locally hosted service that enables direct access to evaluation models designed specifically for assessing LLM outputs. Compatible with a range of development environments, it supports integration with tools such as: Claude Desktop: Enables evaluation within conversational contexts. Cursor: Allows in-editor scoring of code snippets against specified criteria. OpenAI Agents SDK: Facilitates programmatic evaluation prior to decision-making or output dispatch. By integrating the server into an existing workflow, developers can perform structured evaluations on model outputs using a reproducible and version-controlled process. Purpose-Built Evaluation Models Atla MCP Server’s core consists of two dedicated evaluation models: Selene 1: A full-capacity model trained explicitly on evaluation and critique tasks. Selene Mini: A resource-efficient variant designed for faster inference with reliable scoring capabilities. Which Selene model does the agent use? If you don’t want to leave model choice up to the agent, you can specify a model. Unlike general-purpose LLMs that simulate evaluation through prompted reasoning, Selene models are optimized to produce consistent, low-variance evaluations and detailed critiques. This reduces artifacts such as self-consistency bias or reinforcement of incorrect reasoning. The server exposes two primary MCP-compatible evaluation tools: evaluate_llm_response: Scores a single model response against a user-defined criterion. evaluate_llm_response_on_multiple_criteria: Enables multi-dimensional evaluation by scoring across several independent criteria. These tools support fine-grained feedback loops and can be used to implement self-correcting behavior in agentic systems or to validate outputs prior to user exposure. Demonstration: Feedback Loops in Practice Using Claude Desktop connected to the MCP Server, we asked the model to suggest a new, humorous name for the Pokémon Charizard. The generated name was then evaluated using Selene against two criteria: originality and humor. Based on the critiques, Claude revised the name accordingly. 
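To make the evaluation tools concrete, here is a hedged sketch of the JSON-RPC request an MCP client might send for the demonstration above. The tools/call method and the name/arguments structure follow the Model Context Protocol; the specific argument names used here (evaluation_criteria, llm_response) are assumptions for illustration, so consult the Atla MCP Server repository for the actual schema.

# Illustrative only: the shape of an MCP `tools/call` request that a client such
# as Claude Desktop or Cursor might send to the Atla MCP Server. The argument
# names below are assumptions; the real parameter schema is defined by the server.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "evaluate_llm_response",
        "arguments": {
            "evaluation_criteria": "Is the suggested name original and genuinely humorous?",
            "llm_response": "Char-broiled Lizard Wizard",
        },
    },
}

print(json.dumps(request, indent=2))  # the server would reply with a score and critique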
This simple loop shows how agents can improve outputs dynamically using structured, automated feedback—no manual intervention required. While this is a deliberately playful example, the same evaluation mechanism applies to more practical use cases. For instance: In customer support, agents can self-assess their responses for empathy, helpfulness, and policy alignment before submission. In code generation workflows, tools can score generated snippets for correctness, security, or style adherence. In enterprise content generation, teams can automate checks for clarity, factual accuracy, and brand consistency. These scenarios demonstrate the broader value of integrating Atla’s evaluation models into production systems, allowing for robust quality assurance across diverse LLM-driven applications. Setup and Configuration To begin using the Atla MCP Server: Obtain an API key from the Atla Dashboard. Clone the GitHub repository and follow the installation guide. Connect your MCP-compatible client (Claude, Cursor, etc.) to begin issuing evaluation requests. The server is built to support direct integration into agent runtimes and IDE workflows with minimal overhead. Development and Future Directions The Atla MCP Server was developed in collaboration with AI systems such as Claude to ensure compatibility and functional soundness in real-world applications. This iterative design approach enabled effective testing of evaluation tools within the same environments they are intended to serve. Future enhancements will focus on expanding the range of supported evaluation types and improving interoperability with additional clients and orchestration tools. To contribute or provide feedback, visit the Atla MCP Server GitHub. Developers are encouraged to experiment with the server, report issues, and explore use cases in the broader MCP ecosystem. Note: Thanks to the Atla AI team for the thought leadership and resources for this article; the Atla AI team supported this content.
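As a reference for the setup steps above, connecting a client such as Claude Desktop to the server typically means adding an entry to the client's MCP configuration. The sketch below only illustrates the standard mcpServers shape; the launch command, package name, and file location are placeholders, not the official installation instructions.

# A hedged sketch of registering the Atla MCP Server with an MCP client such as
# Claude Desktop. The top-level "mcpServers" structure is the standard client
# config shape; the command and args shown here are placeholders -- use the
# command given in the Atla MCP Server installation guide.
import json, os, pathlib

config = {
    "mcpServers": {
        "atla": {
            "command": "uvx",                      # assumption: server launched via uvx
            "args": ["atla-mcp-server"],           # assumption: package name
            "env": {"ATLA_API_KEY": os.environ.get("ATLA_API_KEY", "<your-key>")},
        }
    }
}

path = pathlib.Path("claude_desktop_config.json")  # actual location varies by OS and client
path.write_text(json.dumps(config, indent=2))
print(f"Wrote MCP client config to {path.resolve()}")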
-
WWW.MARKTECHPOST.COMLong-Context Multimodal Understanding No Longer Requires Massive Models: NVIDIA AI Introduces Eagle 2.5, a Generalist Vision-Language Model that Matches GPT-4o on Video Tasks Using Just 8B ParametersIn recent years, vision-language models (VLMs) have advanced significantly in bridging image, video, and textual modalities. Yet, a persistent limitation remains: the inability to effectively process long-context multimodal data such as high-resolution imagery or extended video sequences. Many existing VLMs are optimized for short-context scenarios and struggle with performance degradation, inefficient memory usage, or loss of semantic detail when scaled to handle longer inputs. Addressing these limitations requires not only architectural flexibility but also dedicated strategies for data sampling, training, and evaluation. Eagle 2.5: A Generalist Framework for Long-Context Learning NVIDIA introduces Eagle 2.5, a family of vision-language models designed for long-context multimodal learning. Unlike models that simply accommodate more input tokens, Eagle 2.5 demonstrates measurable and consistent performance improvements as input length increases. The system is developed with a focus on both video and image understanding at scale, targeting tasks where the richness of long-form content is critical. Eagle 2.5 operates with a relatively compact 8B parameter count and yet achieves strong results across established benchmarks. On Video-MME (with 512-frame input), the model scores 72.4%, approaching or matching results from significantly larger models such as Qwen2.5-VL-72B and InternVL2.5-78B. Notably, these gains are achieved without relying on task-specific compression modules, reflecting the model’s generalist design philosophy. Training Strategy: Context-Aware Optimization The effectiveness of Eagle 2.5 stems from two complementary training strategies: information-first sampling and progressive post-training. Information-First Sampling prioritizes retention of critical visual and semantic content. It introduces Image Area Preservation (IAP), a tiling scheme that maintains over 60% of the original image area while minimizing aspect ratio distortion. Additionally, Automatic Degradation Sampling (ADS) dynamically balances visual and textual inputs based on context length constraints, preserving full textual sequences and adaptively optimizing visual granularity. Progressive Post-Training incrementally increases the model’s context window—moving through 32K, 64K, and 128K token stages. This gradual exposure allows the model to develop consistent capabilities across input lengths. The method avoids overfitting to any single context range and helps maintain stable performance in diverse inference scenarios. These approaches are underpinned by an architecture based on SigLIP for vision encoding and MLP projection layers for alignment with the language model backbone. The system forgoes domain-specific compression components to retain flexibility across varied task types. Eagle-Video-110K: Structured Data for Extended Video Comprehension A key component of Eagle 2.5 is its training data pipeline, which integrates both open-source resources and a custom-curated dataset: Eagle-Video-110K. This dataset is constructed to support long-form video understanding and adopts a dual annotation scheme: A top-down approach introduces story-level segmentation using human-annotated chapter metadata and GPT-4-generated dense captions and question-answer pairs. 
A bottom-up method generates QA pairs for short clips using GPT-4o, augmented with time and textual context anchors to capture spatiotemporal detail. The dataset collection emphasizes diversity over redundancy. A cosine similarity-based selection process filters novel content from sources such as InternVid, Shot2Story, and VidChapters. This results in a corpus with both narrative coherence and granular annotations, enabling models to capture hierarchical information across time. Performance and Benchmarking Eagle 2.5-8B exhibits robust performance across multiple video and image understanding tasks. On video benchmarks, it scores 74.8 on MVBench, 77.6 on MLVU, and 66.4 on LongVideoBench. On image benchmarks, the model attains 94.1 on DocVQA, 87.5 on ChartQA, and 80.4 on InfoVQA, among others. Ablation studies confirm the importance of Eagle’s sampling strategies. Removal of IAP leads to performance degradation in high-resolution benchmarks, while omitting ADS reduces effectiveness in tasks requiring dense supervision. The model also benefits from progressive training: sequentially increasing context lengths provides more stable gains compared to one-shot long-context training. Importantly, the addition of Eagle-Video-110K notably enhances performance at higher frame counts (≥128 frames), underscoring the value of dedicated long-form datasets. Conclusion Eagle 2.5 presents a technically grounded approach to long-context vision-language modeling. Its emphasis on preserving contextual integrity, gradual training adaptation, and dataset diversity enables it to achieve strong performance while maintaining architectural generality. Without relying on model scaling alone, Eagle 2.5 demonstrates that careful training strategies and data design can yield competitive, efficient systems for complex multimodal understanding tasks. This positions Eagle 2.5 as a valuable step forward in building more context-aware AI systems suited for real-world multimedia applications. Check out the Paper, GitHub Page and Project Page.
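To illustrate the Image Area Preservation idea described earlier, here is a simplified sketch of how a tiling grid could be chosen so that at least 60% of the original image area is retained while aspect-ratio distortion stays small. It is an interpretation of the paper's description, not NVIDIA's implementation, and the tile size and tile budget are assumed values.

# A simplified, illustrative sketch of the IAP idea: pick a tiling grid whose
# aspect ratio stays close to the original image while the resized image keeps
# at least 60% of the original area.
from math import inf

def select_tiling(width: int, height: int, tile: int = 448, max_tiles: int = 12,
                  min_area_ratio: float = 0.6):
    orig_area = width * height
    orig_ratio = width / height
    best, best_distortion = None, inf
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            canvas_w, canvas_h = cols * tile, rows * tile
            # scale factor that fits the image inside the tile canvas (no upscaling)
            scale = min(canvas_w / width, canvas_h / height, 1.0)
            kept_area = (width * scale) * (height * scale)
            if kept_area / orig_area < min_area_ratio:
                continue  # too much downsampling, violates the area constraint
            distortion = abs((canvas_w / canvas_h) - orig_ratio)
            if distortion < best_distortion:
                best, best_distortion = (cols, rows), distortion
    return best

print(select_tiling(1920, 1080))  # e.g. a wide grid for a 16:9 screenshot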
-
WWW.MARKTECHPOST.COMLLMs Can Now Retain High Accuracy at 2-Bit Precision: Researchers from UNC Chapel Hill Introduce TACQ, a Task-Aware Quantization Approach that Preserves Critical Weight Circuits for Compression Without Performance LossLLMs show impressive capabilities across numerous applications, yet they face challenges due to computational demands and memory requirements. This challenge is acute in scenarios requiring local deployment for privacy concerns, such as processing sensitive patient records, or compute-constrained environments like real-time customer service systems and edge devices. Post-training quantization (PTQ) is a promising solution that allows efficient compression of pre-trained models, reducing memory consumption by 2-4 times. However, current processes have a bottleneck at 4-bit compression, with substantial performance degradation when attempting 2- or 3-bit precision. Most PTQ methods rely on small mini-batches of general-purpose pre-training data to account for activation changes resulting from quantization. Current methods for LLM compression primarily fall into three categories. Uniform quantization represents the most basic approach, where weights stored as 16-bit float tensors are compressed by treating each row independently, mapping floats to integers based on maximum and minimum values within each channel. GPTQ-based quantization techniques advance this concept by focusing on layerwise reconstruction, aiming to minimize reconstruction loss after quantization. Further, Mixed-precision quantization methods offer a more nuanced strategy, moving beyond fixed precision for all weights. These techniques assign bit-width based on weight importance to maintain performance, with some approaches preserving high-sensitivity “outlier” weights at higher precision. Researchers from UNC Chapel Hill have proposed a novel mixed-precision post-training quantization approach called TaskCircuit Quantization (TACQ). The method shows similarities to automated circuit discovery by directly conditioning the quantization process on specific weight circuits, defined as sets of weights associated with downstream task performance. TACQ compares unquantized model weights with uniformly quantized ones to estimate expected weight changes from quantization, then uses gradient information to predict impacts on task performance, enabling preservation of task-specific weights. TACQ consistently outperforms baselines with the same calibration data and lower weight budgets, and achieves significant improvements in the challenging 2-bit and 3-bit regimes. TACQ is defined by a saliency metric that identifies critical weights to preserve during quantization, building on concepts from model interpretability like automatic circuit discovery, knowledge localization, and input attribution. This metric uses two components: Quantization-aware Localization (QAL): Trace how model performance is affected by estimating expected weight changes due to quantization. Magnitude-sharpened Gradient (MSG): A generalized metric for absolute weight importance adapted from input attribution techniques. MSG helps stabilize TACQ and addresses biases from QAL’s estimations. These factors combine into a unified saliency metric that can be efficiently evaluated for every weight in a single backward pass, allowing preservation of the top p% highest-scoring weights at 16-bit precision. 
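The following PyTorch-style sketch illustrates the mechanism described above: estimate the per-weight change a uniform quantizer would introduce, combine it with gradient information (QAL) and a magnitude-sharpened gradient term (MSG), and keep the top p% of weights at 16-bit. The exact functional form, calibration data handling, and combination of the two terms follow the paper; this is only an illustration of the idea, not the authors' code.

# Schematic sketch of a TACQ-style per-weight saliency score and preservation mask.
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    # per-row uniform quantization to `bits`
    lo = w.min(dim=-1, keepdim=True).values
    hi = w.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / (2**bits - 1)
    return torch.round((w - lo) / scale) * scale + lo

def tacq_style_saliency(w: torch.Tensor, grad: torch.Tensor, bits: int = 2) -> torch.Tensor:
    delta_w = w - uniform_quantize(w, bits)          # expected change from quantization
    qal = (grad * delta_w).abs()                     # Quantization-aware Localization
    msg = grad.abs() * w.abs()                       # Magnitude-sharpened Gradient
    return qal + msg                                 # simple combination for illustration

def preserve_mask(saliency: torch.Tensor, p: float = 0.005) -> torch.Tensor:
    k = max(1, int(p * saliency.numel()))
    thresh = saliency.flatten().topk(k).values.min()
    return saliency >= thresh                        # True -> keep this weight at 16-bit

w = torch.randn(256, 512)
grad = torch.randn_like(w)                           # from a single backward pass on task data
mask = preserve_mask(tacq_style_saliency(w, grad))
print(f"kept {mask.float().mean().item():.3%} of weights at 16-bit")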
In the challenging 2-bit setting, TACQ outperforms SliM-LLM with absolute margin improvements of 16.0% (from 20.1% to 36.1%) on GSM8k, 14.1% (from 34.8% to 49.2%) on MMLU, and 21.9% (from 0% to 21.9%) on Spider. Other baseline methods like GPTQ, SqueezeLLM, and SPQR deteriorate to near-random performance at this compression level. At 3-bit precision, TACQ preserves approximately 91%, 96%, and 89% of the unquantized accuracy on GSM8k, MMLU, and Spider, respectively, while outperforming the strongest baseline, SliM-LLM, by 1-2% across most datasets. TACQ’s advantages become evident in generation tasks requiring sequential token outputs, where it is the only method capable of recovering non-negligible performance in the 2-bit setting for the Spider text-to-SQL task. In conclusion, researchers introduced TACQ, a significant advancement in task-aware post-training quantization. It improves model performance at ultra-low bit-widths (2- to 3-bits) where previous methods degrade to near-random outputs. TACQ aligns with automatic circuit discovery research by selectively preserving only a small fraction of salient weights at 16-bit precision, indicating that sparse weight “circuits” disproportionately influence specific tasks. Moreover, experiments on Spider show that TACQ better preserves model generation capabilities, making it suitable for program-prediction tasks. This also applies to situations involving agents, where models frequently generate many executable outputs, and where efficiency is a concern. Check out the Paper and GitHub Page. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit. Sajjad AnsariSajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.Sajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/ReTool: A Tool-Augmented Reinforcement Learning Framework for Optimizing LLM Reasoning with Computational ToolsSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Fourier Neural Operators Just Got a Turbo Boost: Researchers from UC Riverside Introduce TurboFNO, a Fully Fused FFT-GEMM-iFFT Kernel Achieving Up to 150% Speedup over PyTorchSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Model Compression Without Compromise: Loop-Residual Neural Networks Show Comparable Results to Larger GPT-2 Variants Using Iterative RefinementSajjad Ansarihttps://www.marktechpost.com/author/sajjadansari/Underdamped Diffusion Samplers Outperform Traditional Methods: Researchers from Karlsruhe Institute of Technology, NVIDIA, and Zuse Institute Berlin Introduce a New Framework for Efficient Sampling from Complex Distributions with Degenerate Noise0 Comments 0 Shares 71 Views
-
WWW.MARKTECHPOST.COMAnthropic Releases a Comprehensive Guide to Building Coding Agents with Claude CodeAnthropic has released a detailed best-practice guide for using Claude Code, a command-line interface designed for agentic software development workflows. Rather than offering a prescriptive agent framework, Claude Code provides a low-level, developer-centric interface to integrate the Claude language model into day-to-day programming tasks. The guide draws from practical experience within Anthropic and emphasizes patterns that enable productive, secure, and flexible coding workflows—making it especially relevant for engineers looking to incorporate AI into established development environments. Claude Code: A Minimalist Interface for Agentic Development Claude Code operates as a shell-native assistant with access to the developer’s environment. By design, it avoids prescribing workflows, instead offering tools for context-rich interaction. One of the key features is the use of CLAUDE.md files—custom documentation that Claude automatically reads when invoked. These files can capture shell commands, coding guidelines, test procedures, and project-specific instructions, allowing Claude to work with greater situational awareness. Engineers can place CLAUDE.md in root, child, or parent directories, or configure a global version. The contents can be tuned iteratively, similar to prompt engineering, to improve task alignment and output reliability. Claude Code can interact with existing shell tools, REST APIs, and Model Context Protocol (MCP) servers. It inherits the local shell environment, meaning it can use Unix utilities, version control systems, and language-specific tooling without additional configuration. Users can configure tool access using permission settings, CLI flags, or persistent configuration files. For GitHub-based development, installing the gh CLI allows Claude to manage issues, PRs, and comments directly. More advanced users can integrate MCP servers such as Puppeteer or Sentry to support visual testing, navigation tasks, or telemetry analysis. Structured Workflows and Planning-Oriented Interaction A central theme in the guide is the value of planning and decomposition. Rather than jumping directly to implementation, engineers are encouraged to have Claude read files, generate a plan, and then iteratively implement and verify solutions. For example, invoking keywords like “think hard” or “ultrathink” increases Claude’s internal reasoning time before proposing a solution. Engineers can then review the proposed plan, request changes, or generate documentation such as GitHub issues before initiating the implementation phase. Other structured workflows include test-driven development, where Claude first generates failing tests, commits them, and then writes implementation code to satisfy those tests. The system supports iterative refinement and encourages validation steps, including use of independent sub-agents to check outputs for overfitting. Claude Code can also be used with visual mocks. When paired with screenshot tools or MCP integrations, Claude can be instructed to align generated UI code with provided designs. Iterative screenshots and refinements are supported as part of this workflow. Claude Code supports non-interactive use via headless mode, allowing it to be invoked in CI pipelines, GitHub Actions, or pre-commit hooks. 
Headless prompts can be supplied using the -p flag, and results can be formatted as streaming JSON for integration into data workflows or monitoring systems. In these contexts, Claude can handle tasks such as subjective linting, issue triage, or static code analysis. Developers are encouraged to constrain permissions and use sandboxed environments when using automation features to mitigate potential security risks. Multi-Agent and Parallel Development Patterns The guide outlines several methods for using Claude in parallel. Engineers can launch multiple instances of Claude—each assigned a different role, such as implementation, review, or testing—across separate git worktrees or checkouts. This mirrors distributed team workflows and helps isolate concerns. Worktree-based setups allow engineers to manage multiple concurrent tasks in distinct working directories, reducing the overhead of context switching and allowing Claude to operate with focused intent. Conclusion The Claude Code guide represents a shift toward deeper integration of AI within software engineering workflows. Rather than offering a single agent to handle all tasks, Anthropic emphasizes composability, iteration, and developer control. The result is a tool that supports experienced developers in building reliable and maintainable systems—enhanced, but not constrained, by AI. Check out the Guide.
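To tie the headless-mode description to something concrete, here is a hedged sketch of invoking Claude Code from a CI step and consuming its streamed output in Python. The -p flag comes from the guide; the --output-format value and the shape of each streamed event are assumptions to verify against your installed CLI version and its permission flags.

# Hedged sketch: non-interactive Claude Code invocation inside a CI job or hook.
import json
import subprocess

prompt = "Review the staged diff for obvious bugs and style issues; reply with JSON findings."

proc = subprocess.run(
    ["claude", "-p", prompt, "--output-format", "stream-json"],  # flag value is an assumption
    capture_output=True,
    text=True,
    check=False,
)

# Each line of streaming output is assumed to be an individual JSON event.
for line in proc.stdout.splitlines():
    line = line.strip()
    if not line:
        continue
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-JSON diagnostics
    print(event.get("type", "event"), "->", str(event)[:120])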
-
WWW.MARKTECHPOST.COMA Code Implementation of a Real‑Time In‑Memory Sensor Alert Pipeline in Google Colab with FastStream, RabbitMQ, TestRabbitBroker, PydanticIn this notebook, we demonstrate how to build a fully in-memory “sensor alert” pipeline in Google Colab using FastStream, a high-performance, Python-native stream processing framework, and its integration with RabbitMQ. By leveraging faststream.rabbit’s RabbitBroker and TestRabbitBroker, we simulate a message broker without needing external infrastructure. We orchestrate four distinct stages: ingestion & validation, normalization, monitoring & alert generation, and archiving, each defined as Pydantic models (RawSensorData, NormalizedData, AlertData) to ensure data quality and type safety. Under the hood, Python’s asyncio powers asynchronous message flow, while nest_asyncio enables nested event loops in Colab. We also employ the standard logging module for traceable pipeline execution and pandas for final result inspection, making it easy to visualize archived alerts in a DataFrame. !pip install -q faststream[rabbit] nest_asyncio We install FastStream with its RabbitMQ integration, providing the core stream-processing framework and broker connectors, as well as the nest_asyncio package, which enables nested asyncio event loops in environments like Colab. All this is achieved while keeping the output minimal with the -q flag. import nest_asyncio, asyncio, logging nest_asyncio.apply() We import the nest_asyncio, asyncio, and logging modules, then apply nest_asyncio.apply() to patch Python’s event loop so that you can run nested asynchronous tasks inside environments like Colab or Jupyter notebooks without errors. The logging import readies you to instrument your pipeline with detailed runtime logs. logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s") logger = logging.getLogger("sensor_pipeline") We configure Python’s built‑in logging to emit INFO‑level (and above) messages prefixed with a timestamp and severity, then create a dedicated logger named “sensor_pipeline” for emitting structured logs within your streaming pipeline. from faststream import FastStream from faststream.rabbit import RabbitBroker, TestRabbitBroker from pydantic import BaseModel, Field, validator import pandas as pd from typing import List We bring in FastStream’s core FastStream class alongside its RabbitMQ connectors (RabbitBroker for real brokers and TestRabbitBroker for in‑memory testing), Pydantic’s BaseModel, Field, and validator for declarative data validation, pandas for tabular result inspection, and Python’s List type for annotating our in‑memory archives. broker = RabbitBroker("amqp://guest:guest@localhost:5672/") app = FastStream(broker) We instantiate a RabbitBroker pointed at a (local) RabbitMQ server using the AMQP URL, then create a FastStream application bound to that broker, setting up the messaging backbone for your pipeline stages. 
class RawSensorData(BaseModel): sensor_id: str = Field(..., examples=["sensor_1"]) reading_celsius: float = Field(..., ge=-50, le=150, examples=[23.5]) @validator("sensor_id") def must_start_with_sensor(cls, v): if not v.startswith("sensor_"): raise ValueError("sensor_id must start with 'sensor_'") return v class NormalizedData(BaseModel): sensor_id: str reading_kelvin: float class AlertData(BaseModel): sensor_id: str reading_kelvin: float alert: bool These Pydantic models define the schema for each stage: RawSensorData enforces input validity (e.g., reading range and a sensor_ prefix), NormalizedData converts Celsius to Kelvin, and AlertData encapsulates the final alert payload (including a boolean flag), ensuring a type-safe data flow throughout the pipeline. archive: List[AlertData] = [] @broker.subscriber("sensor_input") @broker.publisher("normalized_input") async def ingest_and_validate(raw: RawSensorData) -> dict: logger.info(f"Ingested raw data: {raw.json()}") return raw.dict() @broker.subscriber("normalized_input") @broker.publisher("sensor_alert") async def normalize(data: dict) -> dict: norm = NormalizedData( sensor_id=data["sensor_id"], reading_kelvin=data["reading_celsius"] + 273.15 ) logger.info(f"Normalized to Kelvin: {norm.json()}") return norm.dict() ALERT_THRESHOLD_K = 323.15 @broker.subscriber("sensor_alert") @broker.publisher("archive_topic") async def monitor(data: dict) -> dict: alert_flag = data["reading_kelvin"] > ALERT_THRESHOLD_K alert = AlertData( sensor_id=data["sensor_id"], reading_kelvin=data["reading_kelvin"], alert=alert_flag ) logger.info(f"Monitor result: {alert.json()}") return alert.dict() @broker.subscriber("archive_topic") async def archive_data(payload: dict): rec = AlertData(**payload) archive.append(rec) logger.info(f"Archived: {rec.json()}") An in-memory archive list collects all finalized alerts, while four asynchronous functions, wired via @broker.subscriber/@broker.publisher, form the pipeline stages. These functions ingest and validate raw sensor inputs, convert Celsius to Kelvin, check against an alert threshold, and finally archive each AlertData record, emitting logs at every step for full traceability. async def main(): readings = [ {"sensor_id": "sensor_1", "reading_celsius": 45.2}, {"sensor_id": "sensor_2", "reading_celsius": 75.1}, {"sensor_id": "sensor_3", "reading_celsius": 50.0}, ] async with TestRabbitBroker(broker) as tb: for r in readings: await tb.publish(r, "sensor_input") await asyncio.sleep(0.1) df = pd.DataFrame([a.dict() for a in archive]) print("\nFinal Archived Alerts:") display(df) asyncio.run(main()) Finally, the main coroutine publishes a set of sample sensor readings into the in-memory TestRabbitBroker, pauses briefly to allow each pipeline stage to run, and then collates the resulting AlertData records from the archive into a pandas DataFrame for easy display and verification of the end-to-end alert flow. At the end, asyncio.run(main()) kicks off the entire async demo in Colab. In conclusion, this tutorial demonstrates how FastStream, combined with RabbitMQ abstractions and in-memory testing via TestRabbitBroker, can accelerate the development of real-time data pipelines without the overhead of deploying external brokers. With Pydantic handling schema validation, asyncio managing concurrency, and pandas enabling quick data analysis, this pattern provides a robust foundation for sensor monitoring, ETL tasks, or event‑driven workflows. 
You can seamlessly transition from this in‑memory demo to production by swapping in a live broker URL (RabbitMQ, Kafka, NATS, or Redis) and running faststream run under uvicorn or your preferred ASGI server, unlocking scalable, maintainable stream processing in any Python environment. Here is the Colab Notebook.
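As a hedged sketch of that production swap, the snippet below points RabbitBroker at a live AMQP URL instead of the in-memory test broker; the hostname is hypothetical, the pipeline stages from the tutorial are assumed to be re-registered, and the app.run() entry point should be checked against your installed FastStream version.

# Minimal sketch: run the same pipeline against a live RabbitMQ instance.
import asyncio
from faststream import FastStream
from faststream.rabbit import RabbitBroker

broker = RabbitBroker("amqp://guest:guest@rabbitmq.internal:5672/")  # hypothetical host
app = FastStream(broker)

# ... re-register the @broker.subscriber / @broker.publisher stages from above here ...

if __name__ == "__main__":
    asyncio.run(app.run())  # or use the `faststream run module:app` CLI mentioned above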
-
WWW.MARKTECHPOST.COMLLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated ResponsesAs LLMs become more prominent in healthcare settings, ensuring that credible sources back their outputs is increasingly important. Although no LLMs are yet FDA-approved for clinical decision-making, top models such as GPT-4o, Claude, and MedPaLM have outperformed clinicians on standardized exams like the USMLE. These models are already being utilized in real-world scenarios, including mental health support and the diagnosis of rare diseases. However, their tendency to hallucinate—generating unverified or inaccurate statements—poses a serious risk, especially in medical contexts where misinformation can lead to harm. This issue has become a major concern for clinicians, with many citing a lack of trust and the inability to verify LLM responses as key barriers to adoption. Regulators, such as the FDA, have also emphasized the importance of transparency and accountability, underscoring the need for reliable source attribution in medical AI tools. Recent improvements, such as instruction fine-tuning and RAG, have enabled LLMs to generate sources when prompted. Yet, even when references are from legitimate websites, there is often little clarity on whether those sources truly support the model’s claims. Prior research has introduced datasets such as WebGPT, ExpertQA, and HAGRID to assess LLM source attribution; however, these rely heavily on manual evaluation, which is time-consuming and difficult to scale. Newer approaches utilize LLMs themselves to assess attribution quality, as demonstrated in works such as ALCE, AttributedQA, and FactScore. While tools like ChatGPT can assist in evaluating citation accuracy, studies reveal that such models still struggle to ensure reliable attribution in their outputs, highlighting the need for continued development in this area. Researchers from Stanford University and other institutions have developed SourceCheckup, an automated tool designed to evaluate the accuracy with which LLMs support their medical responses with relevant sources. Analyzing 800 questions and over 58,000 source-statement pairs, they found that 50%–90 % of LLM-generated answers were not fully supported by cited sources, with GPT-4 showing unsupported claims in about 30% of cases. Even LLMs with web access struggled to provide source-backed responses consistently. Validated by medical experts, SourceCheckup revealed significant gaps in the reliability of LLM-generated references, raising critical concerns about their readiness for use in clinical decision-making. The study evaluated the source attribution performance of several top-performing and open-source LLMs using a custom pipeline called SourceCheckup. The process involved generating 800 medical questions—half from Reddit’s r/AskDocs and half created by GPT-4o using MayoClinic texts—then assessing each LLM’s responses for factual accuracy and citation quality. Responses were broken down into verifiable statements, matched with cited sources, and scored using GPT-4 for support. The framework reported metrics, including URL validity and support, at both the statement and response levels. Medical experts validated all components, and the results were cross-verified using Claude Sonnet 3.5 to assess potential bias from GPT-4. The study presents a comprehensive evaluation of how well LLMs verify and cite medical sources, introducing a system called SourceCheckup. 
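The verification pipeline described above can be summarized in a short schematic. In the sketch below, split_into_statements, fetch_url_text, and llm_judge are hypothetical callables standing in for the GPT-4-based components in the paper; the sketch only conveys the statement-level matching and scoring flow, not the authors' code.

# Schematic sketch of a SourceCheckup-style statement-level verification loop.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SupportResult:
    statement: str
    url: str
    supported: bool

def source_checkup(response_text: str,
                   citations: Dict[str, List[str]],   # statement -> list of cited URLs
                   split_into_statements: Callable[[str], List[str]],
                   fetch_url_text: Callable[[str], str],
                   llm_judge: Callable[[str, str], bool]) -> List[SupportResult]:
    results = []
    for statement in split_into_statements(response_text):
        for url in citations.get(statement, []):
            source_text = fetch_url_text(url)            # retrieve the cited page
            supported = llm_judge(statement, source_text)  # does the source back the claim?
            results.append(SupportResult(statement, url, supported))
    return results

def support_rate(results: List[SupportResult]) -> float:
    # fraction of statement-source pairs judged as supported
    return sum(r.supported for r in results) / max(1, len(results))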
Human experts confirmed that the model-generated questions were relevant and answerable, and that parsed statements closely matched the original responses. In source verification, the model’s accuracy nearly matched that of expert doctors, with no statistically significant difference found between model and expert judgments. Claude Sonnet 3.5 and GPT-4o demonstrated comparable agreement with expert annotations, whereas open-source models such as Llama 2 and Meditron significantly underperformed, often failing to produce valid citation URLs. Even GPT-4o with RAG, though better than others due to its internet access, supported only 55% of its responses with reliable sources, with similar limitations observed across all models. The findings underscore persistent challenges in ensuring factual accuracy in LLM responses to open-ended medical queries. Many models, even those enhanced with retrieval, failed to consistently link claims to credible evidence, particularly for questions from community platforms like Reddit, which tend to be more ambiguous. Human evaluations and SourceCheckup assessments consistently revealed low response-level support rates, highlighting a gap between current model capabilities and the standards needed in clinical contexts. To improve trustworthiness, the study suggests models should be trained or fine-tuned explicitly for accurate citation and verification. Additionally, automated tools like SourceCleanup demonstrated promise in editing unsupported statements to improve factual grounding, offering a scalable path to enhance citation reliability in LLM outputs. Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit. Sana HassanSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.Sana Hassanhttps://www.marktechpost.com/author/sana-hassan/Stanford Researchers Propose FramePack: A Compression-based AI Framework to Tackle Drifting and Forgetting in Long-Sequence Video Generation Using Efficient Context Management and SamplingSana Hassanhttps://www.marktechpost.com/author/sana-hassan/LLMs Can Be Misled by Surprising Data: Google DeepMind Introduces New Techniques to Predict and Reduce Unintended Knowledge ContaminationSana Hassanhttps://www.marktechpost.com/author/sana-hassan/LLMs Can Now Learn to Try Again: Researchers from Menlo Introduce ReZero, a Reinforcement Learning Framework That Rewards Query Retrying to Improve Search-Based Reasoning in RAG SystemsSana Hassanhttps://www.marktechpost.com/author/sana-hassan/Model Context Protocol (MCP) vs Function Calling: A Deep Dive into AI Integration Architectures0 Comments 0 Shares 56 Views
-
WWW.MARKTECHPOST.COMServerless MCP Brings AI-Assisted Debugging to AWS Workflows Within Modern IDEsServerless computing has significantly streamlined how developers build and deploy applications on cloud platforms like AWS. However, debugging and managing complex architectures—comprising services such as Lambda, DynamoDB, API Gateway, and IAM—often requires developers to jump between logs, dashboards, and local tooling. To address these challenges, Serverless Inc. has introduced Serverless MCP (Model Context Protocol), a powerful new protocol that enables seamless, AI-assisted debugging directly inside intelligent IDEs like Cursor. The Serverless MCP builds upon a foundational idea: developers should be able to query, introspect, and resolve serverless application issues from where they write code—without the overhead of context switching or manually navigating AWS dashboards. This integration makes serverless development more accessible, especially for developers aiming to reduce the operational friction of cloud-native applications. Solving the Debugging Dilemma in Serverless Architectures Working with AWS serverless architectures involves interacting with various managed services. A typical application might use Lambda for compute, DynamoDB for storage, API Gateway to expose endpoints, and IAM for permissions. These services produce logs, metrics, and configuration data scattered across multiple consoles. The debugging experience for developers often includes: Manually finding CloudWatch logs tied to specific Lambda executions. Tracing failed API Gateway requests across multiple services. Tracking down misconfigured IAM roles and permissions. Cross-referencing AWS documentation with real-time code behavior. This fragmented experience is where Serverless MCP steps in. What is Serverless MCP? Serverless MCP (Model Context Protocol) is a developer-facing protocol that allows AI-assisted IDEs to communicate with AWS infrastructure resources via the Serverless Framework. Once installed and configured, MCP unlocks deep telemetry from deployed services and surfaces them directly in tools like Cursor and Windsurf. The protocol enables these IDEs to: Pull logs and metrics relevant to the current file or function. Highlight failed invocations and error traces contextually. Visualize service relationships (e.g., how a Lambda function connects to an API route or a DynamoDB table). Recommend fixes for common issues like IAM misconfigurations or timeout errors. The Serverless Framework CLI (v3.38+) now supports serverless dev, which activates the MCP interface. Once enabled, AI coding environments can query your infrastructure and assist in debugging without requiring manual log exploration or infrastructure navigation. How MCP Works with IDEs like Cursor and Windsurf In IDEs integrated with MCP, developers can hover over a line of code—say, an AWS Lambda function handler—and see the logs from its last execution, error messages, or even the duration and cold start metrics. This contextual debugging model reduces cognitive load and allows real-time understanding of production behavior. Cursor, for example, uses AI models that are MCP-aware. When a developer writes or edits code, the AI agent queries the MCP interface to fetch infrastructure state, recent logs, and performance metrics relevant to the code segment. It then suggests improvements, flags misconfigurations, or explains recent failures. This makes the MCP integration not just a log viewer, but an AI-assisted debugging assistant. 
Security and Operational Considerations Serverless MCP is designed with least-privilege principles in mind. The setup process involves creating a minimal set of IAM policies required for MCP access. This ensures that IDEs only fetch diagnostic data scoped to the developer’s workflow. Moreover, since all the debugging insights are surfaced locally in the IDE, there is no need to expose your cloud dashboard or give third-party plugins blanket access to your AWS environment. Conclusion With the release of Serverless MCP, the debugging workflow for AWS serverless applications gets a much-needed upgrade. By embedding operational intelligence into AI-driven IDEs, Serverless bridges the gap between code and cloud, offering a smoother and more intuitive development experience. As serverless architectures grow in complexity, tools like MCP will likely become foundational in modern DevOps pipelines—especially for teams seeking to minimize downtime and maximize iteration speed without diving deep into the AWS console. For developers already using the Serverless Framework, enabling MCP is a simple upgrade that promises significant productivity gains.
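As an illustration of the least-privilege setup described above, the sketch below prints a narrowly scoped, read-only policy limited to Lambda log data. The specific actions and resources Serverless MCP actually requires are defined by the Serverless Framework documentation; this example only shows how diagnostic access can be constrained.

# Illustrative only: a read-only IAM policy scoped to Lambda log groups.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadLambdaLogsForDebugging",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
            ],
            # example account ID; scope resources to your own account and log groups
            "Resource": "arn:aws:logs:*:123456789012:log-group:/aws/lambda/*",
        }
    ],
}

print(json.dumps(policy, indent=2))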
-
WWW.MARKTECHPOST.COMOpenAI Releases a Practical Guide to Identifying and Scaling AI Use Cases in Enterprise WorkflowsAs the deployment of artificial intelligence accelerates across industries, a recurring challenge for enterprises is determining how to operationalize AI in a way that generates measurable impact. To support this need, OpenAI has published a comprehensive, process-oriented guide titled “Identifying and Scaling AI Use Cases.” Drawing from over 300 implementation case studies and insights from more than two million enterprise users, the guide offers a systematic approach to identifying, evaluating, and deploying AI across organizational functions. A Structured Process for AI Integration The guide introduces a three-phase methodology: Identifying High-Leverage Opportunities – Recognize where AI can directly augment existing business processes. Teaching Six Foundational Use Case Primitives – Provide teams with a framework for experimentation and adoption. Prioritizing Initiatives for Scale – Use structured evaluation methods to focus efforts on use cases with favorable return-to-effort ratios. This framework is designed to support organizations at various stages of maturity, from early experimentation to scaled deployment. Phase 1: Identifying Opportunities for AI Impact The first phase emphasizes examining routine inefficiencies and cognitive bottlenecks across workflows. The guide highlights three categories where AI tends to be most effective: Repetitive, Low-Value Tasks: Automating tasks such as drafting summaries, monitoring KPIs, and creating reports allows teams to refocus on higher-level priorities. Skill Bottlenecks: AI can bridge knowledge gaps—enabling employees to work across domains without waiting for interdepartmental support. Ambiguous or Open-Ended Problems: AI can be used to generate ideas, suggest starting points, or interpret unstructured data in scenarios where human decision-making often stalls. These categories provide a lens for assessing workflows and initiating structured ideation, often in the form of use case workshops or cross-functional task forces. Phase 2: Teaching Core AI Use Case Primitives Based on analysis of over 600 real-world use cases, OpenAI outlines six foundational “primitives” that encapsulate common and scalable applications of AI: Content Creation: Drafting policy documents, product descriptions, and marketing copy with consistency in tone and structure. Research: Performing structured information retrieval and synthesis, often from long documents or web sources. Coding: Assisting in debugging, code translation, and first-draft generation across multiple programming languages. Data Analysis: Harmonizing and interpreting datasets from spreadsheets or dashboards to produce visualizations or trend summaries. Ideation and Strategy: Supporting brainstorming, plan formulation, and structured critique of proposals or documents. Automation: Designing repeatable workflows that handle inputs and generate outputs according to predefined rules or templates. Each primitive includes domain-specific examples that demonstrate its cross-functional utility. For instance, finance teams may automate executive reporting, while product managers use AI to prototype user interfaces or prepare documentation. Phase 3: Prioritization Through an Impact-Effort Framework To transition from ideation to implementation, OpenAI recommends an Impact/Effort matrix. 
This tool segments use cases into four categories: Quick Wins: High-impact, low-effort projects that can be deployed quickly. Self-Service: Use cases requiring minimal effort, often deployed individually or within small teams. Strategic Projects: High-effort, high-impact initiatives that may transform processes but require more planning and resourcing. Deferred Initiatives: Use cases that are complex and low value under current conditions, though they may become feasible as technology evolves. Several companies cited in the guide have applied this framework. Tinder enabled product teams to interface with their CLI using natural language, while Morgan Stanley deployed AI to summarize research reports for advisors. These examples demonstrate the diversity of applications that fit within the same prioritization structure. From Task Automation to Workflow-Level Integration The guide also addresses the shift from individual task augmentation to full workflow automation. OpenAI suggests mapping multi-step processes—for example, a marketing campaign lifecycle—from research and data analysis through to content generation and distribution. This systems-level view prepares organizations for more autonomous agentic workflows in the near future. Final Considerations OpenAI’s guide offers a structured and technically grounded approach to AI adoption. Rather than focusing on abstract potential, it emphasizes practical integration aligned with organizational needs and capacities. By promoting internal capability-building and prioritization discipline, it supports the development of scalable, sustainable AI infrastructure within the enterprise. For teams seeking to advance beyond isolated experiments, the guide functions as a blueprint for systematic rollout—anchored in real use cases and measurable impact. Check out the Guide. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit. Asif RazzaqWebsite | + postsBioAsif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.Asif Razzaqhttps://www.marktechpost.com/author/6flvq/An Advanced Coding Implementation: Mastering Browser‑Driven AI in Google Colab with Playwright, browser_use Agent & BrowserContext, LangChain, and GeminiAsif Razzaqhttps://www.marktechpost.com/author/6flvq/Meta AI Introduces Collaborative Reasoner (Coral): An AI Framework Specifically Designed to Evaluate and Enhance Collaborative Reasoning Skills in LLMsAsif Razzaqhttps://www.marktechpost.com/author/6flvq/NVIDIA Introduces CLIMB: A Framework for Iterative Data Mixture Optimization in Language Model PretrainingAsif Razzaqhttps://www.marktechpost.com/author/6flvq/OpenAI Releases a Technical Playbook for Enterprise AI Integration0 Comments 0 Shares 49 Views
-
WWW.MARKTECHPOST.COM
ByteDance Releases UI-TARS-1.5: An Open-Source Multimodal AI Agent Built upon a Powerful Vision-Language Model

ByteDance has released UI-TARS-1.5, an updated version of its multimodal agent framework focused on graphical user interface (GUI) interaction and game environments. Designed as a vision-language model capable of perceiving screen content and performing interactive tasks, UI-TARS-1.5 delivers consistent improvements across a range of GUI automation and game reasoning benchmarks. Notably, it surpasses several leading models, including OpenAI’s Operator and Anthropic’s Claude 3.7, in both accuracy and task completion across multiple environments. The release continues ByteDance’s research direction of building native agent models, aiming to unify perception, cognition, and action through an integrated architecture that supports direct engagement with GUI and visual content.

A Native Agent Approach to GUI Interaction

Unlike tool-augmented LLMs or function-calling architectures, UI-TARS-1.5 is trained end to end to perceive visual input (screenshots) and generate native, human-like control actions, such as mouse movement and keyboard input. This positions the model closer to how human users interact with digital systems. UI-TARS-1.5 builds on its predecessor by introducing several architectural and training enhancements:

Perception and Reasoning Integration: The model jointly encodes screen images and textual instructions, supporting complex task understanding and visual grounding. Reasoning is supported via a multi-step “think-then-act” mechanism, which separates high-level planning from low-level execution.
Unified Action Space: The action representation is designed to be platform-agnostic, enabling a consistent interface across desktop, mobile, and game environments.
Self-Evolution via Replay Traces: The training pipeline incorporates reflective online trace data, allowing the model to iteratively refine its behavior by analyzing previous interactions and reducing reliance on curated demonstrations.

These improvements collectively enable UI-TARS-1.5 to support long-horizon interaction, error recovery, and compositional task planning, which are important capabilities for realistic UI navigation and control.

Benchmarking and Evaluation

The model has been evaluated on several benchmark suites that assess agent behavior in both GUI and game-based tasks. These benchmarks offer a standard way to assess model performance across reasoning, grounding, and long-horizon execution (details: https://seed-tars.com/1.5/).

GUI Agent Tasks

OSWorld (100 steps): UI-TARS-1.5 achieves a success rate of 42.5%, outperforming OpenAI Operator (36.4%) and Claude 3.7 (28%). The benchmark evaluates long-context GUI tasks in a synthetic OS environment.
Windows Agent Arena (50 steps): Scoring 42.1%, the model significantly improves over prior baselines (e.g., 29.8%), demonstrating robust handling of desktop environments.
Android World: The model reaches a 64.2% success rate, suggesting generalizability to mobile operating systems.

Visual Grounding and Screen Understanding

ScreenSpot-V2: The model achieves 94.2% accuracy in locating GUI elements, outperforming Operator (87.9%) and Claude 3.7 (87.6%).
ScreenSpotPro: In a more complex grounding benchmark, UI-TARS-1.5 scores 61.6%, considerably ahead of Operator (23.4%) and Claude 3.7 (27.7%).

These results show consistent improvements in screen understanding and action grounding, which are critical for real-world GUI agents.
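Before turning to the game-environment results, the sketch below illustrates, in plain Python, the “think-then-act” separation and platform-agnostic action space described above. It is an assumption-laden illustration, not ByteDance’s implementation: the class, field, and method names (Action, Step, agent.plan, env.execute) are hypothetical.

```python
# Purely illustrative sketch of a "think-then-act" GUI agent loop with a
# unified, platform-agnostic action space. All names are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", "key"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class Step:
    thought: str       # high-level plan produced before acting
    actions: List[Action]

def run_episode(agent, env, max_steps: int = 50) -> None:
    """Generic perceive -> think -> act loop for a GUI agent."""
    screenshot = env.reset()
    for _ in range(max_steps):
        step: Step = agent.plan(screenshot)   # "think": reason over the screen
        for action in step.actions:           # "act": emit native control events
            screenshot, done = env.execute(action)
            if done:
                return
```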
Game Environments

Poki Games: UI-TARS-1.5 achieves a 100% task completion rate across 14 mini-games. These games vary in mechanics and context, requiring models to generalize across interactive dynamics.
Minecraft (MineRL): The model achieves 42% success on mining tasks and 31% on mob-killing tasks when using the “think-then-act” module, suggesting it can support high-level planning in open-ended environments.

UI-TARS-1.5 is open-sourced under the Apache 2.0 license and is available through several deployment options:

GitHub Repository: github.com/bytedance/UI-TARS
Pretrained Model: Available via Hugging Face at ByteDance-Seed/UI-TARS-1.5-7B
UI-TARS Desktop: A downloadable agent tool enabling natural language control over desktop environments

In addition to the model, the project offers detailed documentation, replay data, and evaluation tools to facilitate experimentation and reproducibility.

Conclusion

UI-TARS-1.5 is a technically sound progression in the field of multimodal AI agents, particularly those focused on GUI control and grounded visual reasoning. Through a combination of vision-language integration, memory mechanisms, and structured action planning, the model demonstrates strong performance across a diverse set of interactive environments. Rather than pursuing universal generality, the model is tuned for task-oriented multimodal reasoning, targeting the real-world challenge of interacting with software through visual understanding. Its open-source release provides a practical framework for researchers and developers interested in exploring native agent interfaces or automating interactive systems through language and vision.
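For experimentation, the Hugging Face checkpoint named above can plausibly be pulled with the transformers Auto classes. This is a hedged sketch: the exact model and processor classes depend on how the repository is registered, so treat the choices below as assumptions and consult the model card for the recommended loading path and prompt format.

```python
# Hedged sketch: load the published checkpoint with Hugging Face transformers.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ByteDance-Seed/UI-TARS-1.5-7B"  # repository named in the article
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype="auto",   # let transformers pick the checkpoint's dtype
    device_map="auto",    # place weights across available GPUs (needs accelerate)
)
# Some repositories additionally require trust_remote_code=True.
# Inputs pair a screenshot with a task instruction; the prompt template is
# model-specific, so follow the official documentation when building prompts.
```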
-
WWW.MARKTECHPOST.COM
ReTool: A Tool-Augmented Reinforcement Learning Framework for Optimizing LLM Reasoning with Computational Tools

Reinforcement learning (RL) is a powerful technique for enhancing the reasoning capabilities of LLMs, enabling them to develop and refine long chain-of-thought (CoT) reasoning. Models like OpenAI o1 and DeepSeek R1 have shown strong performance on text-based reasoning tasks; however, they face limitations on tasks that require precise numerical calculation or symbolic manipulation, such as geometric reasoning, complex computation, or equation solving. Recent research has explored prompting and supervised fine-tuning methods to equip LLMs with tool-use capabilities, but these approaches are constrained by their reliance on imitating curated data distributions. This often results in poor generalization beyond seen patterns and an inability to determine when and how to invoke external tools.

Recent advancements in LLMs show progress toward human-like metacognition through CoT prompting. Research has evolved from train-time scaling to test-time scaling, allocating additional computational resources during inference to generate intermediate reasoning steps. Techniques like stepwise preference optimization, Monte Carlo Tree Search, and RL have improved multi-step mathematical reasoning, as evidenced by models like OpenAI-o1 and DeepSeek-R1. In addition to CoT, Program-of-Thought reasoning integrates external computational tools such as Python interpreters to simplify complex reasoning steps. Tool-integrated reasoning was initially introduced to help LLMs solve computationally intensive problems through programming strategies.

Researchers from ByteDance Seed have proposed ReTool, a code-interpreter-powered (CI-powered) RL framework designed for math problem-solving tasks. It enhances long-form reasoning with tool-integrated learning through two key features. First, it enables dynamic interleaving of real-time code execution within natural language reasoning. Second, it implements an automated RL technique that allows policy rollouts with multi-turn real-time code execution, teaching the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework that begins with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models.

ReTool consists of two primary stages: cold-start supervised fine-tuning followed by RL with interleaved code-execution rollouts. The data pipeline begins by collecting high-quality mathematical reasoning data from diverse sources, including open-source datasets like OpenThoughts. A dual-verification approach combining human expert curation and DeepSeek-R1 evaluation filters out invalid data. From this foundation, code-integrated reasoning data is automatically constructed. The VeRL framework is employed with PPO as the RL algorithm for training. The maximum sequence length is set to 16,384 tokens, with a mini-batch size of 512 and a KL coefficient of 0.0, using Qwen2.5-32B-Instruct as the main backbone.

ReTool enables the LLM to use the code interpreter flexibly during the RL stage, leading to substantial performance improvements. ReTool (Qwen2.5-32B-Instruct) achieves accuracies of 67.0% on AIME2024 and 49.3% on AIME2025 with only 400 training steps. This outperforms the text-based RL baseline (Qwen2.5-32B-Instruct), which attains 40.0% and 36.7% on the respective benchmarks despite using over 1000 training steps.
Moreover, on AIME2024, ReTool (Qwen2.5-32B-Instruct) surpasses the competitive baseline s1-32B by 10.3%. Similarly, on AIME2025, it achieves an 11.4% gain over OpenAI’s o1-preview. When combined with a more advanced backbone, ReTool (DeepSeek-R1-Distill-Qwen-32B) further improves performance, with scores of 72.5% on AIME2024 and 54.3% on AIME2025.

In conclusion, the researchers introduced ReTool, a novel RL framework that empowers LLMs to self-enhance their mathematical reasoning capabilities through effective code interpreter use. Experiments on AIME2024 and AIME2025 show that ReTool achieves superior accuracy compared with conventional text-based RL approaches and converges with significantly fewer training steps. Through careful data curation and a specialized tool-use pipeline, ReTool enables models to develop complex computational intervention strategies, paving the way for more efficient and powerful tool-augmented reasoning in LLMs. The results demonstrate that tool-integrated RL is a promising direction for advancing mathematical reasoning in LLMs on tasks requiring precise computation and symbolic manipulation.

Check out the Paper.
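To illustrate the interleaved code-execution rollout that ReTool’s RL stage relies on, here is a minimal, assumption-laden sketch: the policy generates until it emits a code block, the block is executed, and the interpreter output is appended to the context before generation resumes, with an outcome reward driving PPO updates. The policy interface, tool-call delimiters, and reward function are placeholders, not parts of the ReTool release.

```python
# Hedged sketch of a multi-turn, code-interleaved rollout for tool-augmented RL.
import subprocess
import sys

MAX_SEQ_LEN = 16_384                         # maximum sequence length from the article
CODE_START, CODE_END = "<code>", "</code>"   # assumed tool-call delimiters

def run_code(code: str, timeout: int = 10) -> str:
    """Toy stand-in for the sandboxed interpreter: run code in a subprocess."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout or proc.stderr

def rollout(policy, problem: str, reward_fn, max_turns: int = 8):
    """Generate one solution trace with multi-turn real-time code execution."""
    trace = problem
    for _ in range(max_turns):
        # Generate until the model either finishes or requests code execution.
        chunk = policy.generate_until(trace, stop=[CODE_END], max_len=MAX_SEQ_LEN)
        trace += chunk
        if CODE_START not in chunk:
            break                                          # no tool call: answer complete
        code = chunk.split(CODE_START, 1)[1].split(CODE_END, 1)[0]
        trace += f"\n<interpreter>{run_code(code)}</interpreter>\n"
    return trace, reward_fn(trace)                         # outcome feedback for PPO
```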
-
WWW.MARKTECHPOST.COM
LLMs Can Be Misled by Surprising Data: Google DeepMind Introduces New Techniques to Predict and Reduce Unintended Knowledge Contamination

Large language models (LLMs) are continually evolving by ingesting vast quantities of text data, enabling them to become more accurate predictors, reasoners, and conversationalists. Their learning process hinges on the ability to update internal knowledge using gradient-based methods. This continuous training makes it essential to understand how the addition of new information affects previously acquired knowledge. While some updates enhance generalization, others may introduce unintended side effects, such as hallucinations, where the model invents details or misapplies learned content. Understanding how and why new data alters the internal workings of LLMs is crucial for making them more reliable and secure, especially in dynamic environments where data changes rapidly.

When a single piece of new information is introduced into an LLM, it can have a disproportionate impact through what the researchers describe as “priming”: a recently learned fact spills over into unrelated areas. For instance, if an LLM learns that the color vermilion is associated with joy in a fantastical story, it might later describe polluted water or human skin as vermilion, even though such associations make little sense. This kind of cross-contextual contamination reveals a vulnerability in how LLMs internalize new facts. Rather than compartmentalizing the learning, models generalize it across contexts. The severity of this priming effect depends on several factors, most notably the rarity or “surprise” of the keyword involved in the new information.

To understand and quantify these dynamics, researchers at Google DeepMind developed a new diagnostic dataset called “Outlandish.” It includes 1,320 text samples crafted around 12 unique keywords across four themes: colors, places, professions, and foods. Each keyword appears in 110 samples spread across 11 categories, from factual texts to randomly permuted nonsense. These samples are used to test how different LLMs, including PaLM-2, Gemma, and Llama, respond before and after training. Training involved replacing one sample in a minibatch of eight for 20 to 40 iterations. In total, the researchers conducted 1,320 experiments per model variant to isolate and evaluate the priming and memorization effects of each inserted sample.

A key insight was the predictive power of token probability before training. For all 1,320 Outlandish samples, the researchers measured keyword probabilities before training and compared them to the priming observed after training. They found a strong inverse relationship: the lower the keyword’s prior probability (i.e., the more surprising it was), the higher the likelihood of priming. This trend held across models, sizes, and training tasks. A clear threshold emerged around a probability of 10⁻³: keywords with prior probabilities below this value were far more likely to be inappropriately applied in unrelated contexts after training. This finding highlights the significant role that statistical surprise plays in shaping model behavior.

Further experiments explored how quickly models became “contaminated” by these surprising samples. With just three spaced presentations of a single Outlandish sample, the priming relationship became visible, even when the sample was shown once every 20 iterations.
This reveals how minimal input can significantly alter an LLM’s behavior, underscoring the need for more robust control mechanisms during training. Additional analysis showed that in PaLM-2, memorization and priming were strongly coupled: the more the model memorized a new piece of text, the more it primed unrelated outputs. This coupling did not hold as clearly for the Gemma and Llama models, indicating different learning dynamics.

The researchers also compared in-weight learning, where knowledge is embedded directly in the model’s parameters, with in-context learning, where knowledge is temporarily introduced during inference. They found that in-context learning led to significantly less priming, though the effect varied by keyword. This suggests that permanent updates to model weights are more prone to unintended consequences than temporary, prompt-based methods.

To address unwanted priming, two techniques were introduced. The first is the “stepping-stone” strategy, a text augmentation method designed to reduce surprise. It breaks down the surprise associated with a low-probability keyword by embedding it within a more elaborate and gradual context. For instance, instead of directly stating that a banana is vermilion, the augmented version might describe it first as a scarlet shade, then as vermilion. Testing this on the 48 most priming samples across the 12 keywords showed a median reduction in priming of 75% for PaLM-2 and 50% for Gemma-2b and Llama-7b, while preserving the integrity of memorization.

The second method, “ignore-topk,” is a gradient pruning strategy: during training, only the bottom 92% of parameter updates are retained, discarding the top 8%. This counterintuitive approach reduced priming by up to two orders of magnitude while maintaining the model’s ability to memorize the new sample. It supports findings in related work suggesting that the most influential parameter updates are not necessarily the most beneficial.

This comprehensive analysis demonstrates that new data can significantly affect model behavior, sometimes in undesirable ways. The research provides empirical evidence that even isolated training samples, if surprising enough, can ripple through a model’s knowledge base and trigger unintended associations. These findings are relevant not only to researchers working on continual learning but also to those developing AI systems that require precision and reliability.

Several key takeaways from the research include:

1,320 custom-crafted text samples were used to evaluate the impact of new information on LLMs.
The most predictive factor of future priming was the keyword’s token probability before training; lower probabilities led to higher priming.
A probability threshold of 10⁻³ was identified, below which priming effects became significantly pronounced.
Priming effects were measurable after just three training iterations, even with spacing between inputs.
PaLM-2 showed a strong correlation between memorization and priming, while Gemma and Llama exhibited different learning behaviors.
In-context learning produced less priming than weight-based updates, indicating safer temporary learning dynamics.
The “stepping-stone” strategy reduced priming by up to 75% without compromising learning.
The “ignore-topk” pruning method removed nearly two orders of magnitude of priming while maintaining memorization.

Check out the Paper.
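As an illustration of the “ignore-topk” idea described above, the sketch below drops the largest gradient entries before the optimizer step. It is a minimal PyTorch interpretation of the described strategy, not DeepMind’s code; applying the threshold per parameter tensor is an assumption.

```python
# Minimal sketch: zero out the top 8% of gradient entries (by magnitude) in each
# parameter tensor after backward(), so the optimizer only applies the rest.
import torch

def ignore_topk_gradients(model: torch.nn.Module, drop_fraction: float = 0.08) -> None:
    """Zero the top `drop_fraction` of gradient entries per parameter tensor."""
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad
        k = int(drop_fraction * grad.numel())
        if k == 0:
            continue
        # Magnitude threshold above which entries are discarded (kept entries = bottom 92%).
        threshold = grad.abs().flatten().kthvalue(grad.numel() - k).values
        grad.masked_fill_(grad.abs() > threshold, 0.0)

# Usage inside a standard training loop (model, optimizer, loss assumed defined):
#   loss.backward()
#   ignore_topk_gradients(model, drop_fraction=0.08)
#   optimizer.step()
```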
-
WWW.MARKTECHPOST.COM
LLMs Can Think While Idle: Researchers from Letta and UC Berkeley Introduce ‘Sleep-Time Compute’ to Slash Inference Costs and Boost Accuracy Without Sacrificing Latency

Large language models (LLMs) have gained prominence for their ability to handle complex reasoning tasks, transforming applications from chatbots to code-generation tools. These models benefit significantly from scaling their computation during inference, often producing higher accuracy by dedicating more resources to hard problems. However, this approach brings considerable drawbacks: longer processing times and higher compute costs make it difficult to scale such solutions in real-world settings, where responsiveness and affordability are crucial. As technology advances toward more intelligent systems, there is a growing need to explore how LLMs can become not only smarter but also more efficient, especially when operating within repetitive or familiar contexts.

One of the biggest inefficiencies in current LLM deployment occurs during query resolution. Typically, when a user poses a question, the model processes it together with the necessary background context. This test-time compute assumes that the context and question always arrive together. But in real scenarios, such as document Q&A or debugging code, the context is usually persistent and available well before a specific question is asked. Yet the model processes everything from scratch for each query, even if it has seen the context before. This redundancy results in increased computational costs and response delays, particularly when multiple queries target the same context.

Various methods have been developed to deal with this inefficiency. Sequential and parallel test-time computation are two major strategies: sequential approaches extend the model’s reasoning path, allowing it to consider more possibilities, while parallel approaches sample multiple outputs simultaneously, known as pass@k. Techniques like speculative decoding aim to cut latency by making early guesses, but their usefulness is limited when the model still has to reason from scratch. While helpful, these methods do not eliminate the need to process the context alongside every new question. They also typically require test-time conditions that are not always feasible, such as access to an oracle or an ideal verifier.

Researchers from Letta and the University of California, Berkeley, introduced a solution they call sleep-time compute. The method uses idle time between user interactions to do useful work. Instead of waiting for a user question, the model begins analyzing the context beforehand, anticipating possible future queries and building a new version of the context enriched with relevant inferences. When a user finally asks a question, the model can simply refer to this pre-processed context; since much of the thinking is already done, it requires less computational effort to produce accurate answers. The approach becomes even more effective when multiple questions relate to the same context, allowing inferences to be shared and the computational cost to be amortized.

The implementation of sleep-time compute relies on decomposing the traditional prompt into two parts: a static context and a dynamic query. During the sleep-time window, only the context is used to generate a pre-processed version.
This enriched context, called c′, is built using test-time compute techniques such as reasoning chains or summarization. Once this enriched version is stored, it replaces the raw context during real-time queries, and final answers are generated with far fewer resources. The system not only minimizes redundant reasoning but also paves the way for more proactive LLMs that think ahead and arrive better prepared.

To evaluate the effectiveness of sleep-time compute, the research team tested it on two specially designed benchmarks, Stateful GSM-Symbolic and Stateful AIME, both derived by splitting existing problem sets into separate contexts and questions. In experiments using models like GPT-4o and GPT-4o-mini, the researchers observed a 5× reduction in test-time compute for similar accuracy levels. Notably, accuracy improved by up to 13% on the GSM-Symbolic P2 dataset and by 18% on Stateful AIME when sleep-time compute was scaled. Multi-Query GSM-Symbolic, a new dataset introduced for this evaluation, helped demonstrate that the cost per query could be reduced by 2.5× when 10 queries shared the same context.

When pitted against popular strategies like pass@k, sleep-time compute consistently outperformed them. Unlike pass@k, which assumes access to a perfect evaluator, sleep-time compute works under more realistic conditions. Even at low test-time compute budgets, sleep-time compute produced comparable or better accuracy while consuming fewer tokens. For instance, GPT-4o-mini achieved higher accuracy with fewer than 200 test-time tokens using sleep-time compute, compared with the more than 500 tokens needed in the baseline. Similar improvements were observed for models like Claude Sonnet 3.7 and DeepSeek R1.

Scaling the amount of compute dedicated to sleep time further improved outcomes. By running five parallel generations during sleep time on complex tasks, the researchers pushed the Pareto curve further, although they noted diminishing returns beyond that point. Importantly, stronger models handling more difficult tasks benefited more from additional sleep-time compute. Amortizing sleep-time computation also became highly cost-effective when a context served multiple related queries: weighting test-time tokens as ten times more expensive than sleep-time tokens, in line with industry latency-cost ratios, the researchers confirmed a reduction of up to 2.5× in the average cost per query.

Another interesting finding was that sleep-time compute worked best when user queries were predictable. Using Llama2-70B, the researchers scored the predictability of each query given its context and found a strong correlation: the more predictable the query, the greater the benefit. In examples where the question logically followed from the given context, sleep-time computation yielded higher gains. Less predictable or abstract queries saw reduced effectiveness, although they still benefited compared to traditional test-time-only methods.

Altogether, this research presents a practical and scalable technique for improving the efficiency of LLMs without compromising accuracy. By leveraging otherwise idle time, sleep-time compute reduces the burden on real-time systems, lowers operational costs, and improves response time.
The clear quantitative improvements, such as a 5× reduction in compute, 13–18% accuracy gains, and a drop of up to 2.5× in cost per query, demonstrate that forward-thinking approaches like this could shape the next generation of intelligent, context-aware assistants.

Several key takeaways from the research are as follows:

Sleep-time compute allows models to anticipate queries by reasoning on the context before the query arrives.
Accuracy improved by 13% on GSM-Symbolic and 18% on AIME when sleep-time compute was scaled.
Test-time compute requirements decreased by approximately 5× for similar performance levels.
When 10 related queries shared a context, the average cost per query decreased by a factor of 2.5.
Sleep-time compute outperformed the pass@k strategy in parallel compute settings at equivalent budgets.
It was more effective on predictable queries, identified via log-probability scoring.
Diminishing returns were noted beyond five parallel generations of sleep-time computation.

Check out the Paper.
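To make the context/query decomposition concrete, here is a minimal sketch of the sleep-time workflow under stated assumptions: the llm callable, prompt wording, and function names are placeholders for illustration, not the authors’ implementation.

```python
# Minimal sketch of sleep-time compute: an idle-time pass turns the raw context c
# into an enriched context c' (pre-computed inferences), and later queries are
# answered against c' instead of re-reasoning over c from scratch.
from typing import Callable, List

LLM = Callable[[str], str]   # any text-in, text-out model interface (assumed)

def sleep_time_pass(llm: LLM, context: str) -> str:
    """Idle-time step: anticipate likely questions and pre-compute inferences (c')."""
    prompt = (
        "You will later answer questions about the context below.\n"
        "List the key facts, intermediate results, and likely questions with "
        "their answers, so future queries can be resolved quickly.\n\n"
        f"Context:\n{context}"
    )
    return llm(prompt)

def answer_with_cprime(llm: LLM, c_prime: str, query: str) -> str:
    """Test-time step: answer a query against the enriched context c'."""
    return llm(f"Pre-computed notes:\n{c_prime}\n\nQuestion: {query}\nAnswer:")

def serve(llm: LLM, context: str, queries: List[str]) -> List[str]:
    """Amortize one sleep-time pass across many queries on the same context."""
    c_prime = sleep_time_pass(llm, context)      # done once, during idle time
    return [answer_with_cprime(llm, c_prime, q) for q in queries]
```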