VENTUREBEAT.COM
Former DeepSeeker and collaborators release new method for training reliable AI agents: RAGEN

2025 was, by many expert accounts, supposed to be the year of AI agents — task-specific AI implementations powered by leading large language and multimodal models (LLMs) like those offered by OpenAI, Anthropic, Google, and DeepSeek. But so far, most AI agents remain stuck as experimental pilots in a kind of corporate purgatory, according to a recent poll conducted by VentureBeat on the social network X.

Help may be on the way: a collaborative team from Northwestern University, Microsoft, Stanford, and the University of Washington — including a former DeepSeek researcher named Zihan Wang, currently completing a computer science PhD at Northwestern — has introduced RAGEN, a new system for training and evaluating AI agents that they hope makes them more reliable and less brittle for real-world, enterprise-grade usage.

Unlike static tasks like math solving or code generation, RAGEN focuses on multi-turn, interactive settings where agents must adapt, remember, and reason in the face of uncertainty. Built on a custom RL framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), the system explores how LLMs can learn through experience rather than memorization. The focus is on entire decision-making trajectories, not just one-step responses.

StarPO operates in two interleaved phases: a rollout stage, where the LLM generates complete interaction sequences guided by reasoning, and an update stage, where the model is optimized using normalized cumulative rewards. This structure supports a more stable and interpretable learning loop than standard policy optimization approaches (a minimal sketch of this loop appears after the environment list below).

The authors implemented and tested the framework using fine-tuned variants of Alibaba's Qwen models, including Qwen 1.5 and Qwen 2.5. These models served as the base LLMs for all experiments and were chosen for their open weights and robust instruction-following capabilities, which enabled reproducibility and consistent baseline comparisons across symbolic tasks.

Here's how they did it and what they found.

Wang summarized the core challenge in a widely shared X thread: "Why does your RL training always collapse?" According to the team, LLM agents initially generate symbolic, well-reasoned responses. But over time, RL systems tend to reward shortcuts, leading to repetitive behaviors that degrade overall performance — a pattern the authors call the "Echo Trap." This regression is driven by feedback loops in which certain phrases or strategies earn high rewards early on, encouraging overuse and stifling exploration. Wang notes that the symptoms are measurable: reward variance cliffs, gradient spikes, and disappearing reasoning traces.

RAGEN test environments aren't exactly enterprise-grade

To study these behaviors in a controlled setting, RAGEN evaluates agents across three symbolic environments:

Bandit: a single-turn, stochastic task that tests symbolic risk-reward reasoning.
Sokoban: a multi-turn, deterministic puzzle involving irreversible decisions.
Frozen Lake: a stochastic, multi-turn task requiring adaptive planning.

Each environment is designed to minimize real-world priors and focus solely on decision-making strategies developed during training.
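To make the two-phase structure described above concrete, here is a minimal, illustrative sketch of a StarPO-style training loop in Python. The environment, model, and optimizer interfaces (env.reset, model.generate_turn, model.trajectory_log_prob) are hypothetical placeholders, not the actual RAGEN API; the real implementation handles batching, reasoning tokens, and much more.

```python
# Illustrative sketch of a StarPO-style loop: rollout, then update on
# normalized cumulative rewards. All interfaces here are hypothetical
# placeholders, not the actual RAGEN/StarPO API.
import torch

def starpo_step(model, env, optimizer, num_rollouts=16, max_turns=8):
    trajectories = []

    # Phase 1: rollout. The LLM generates complete multi-turn
    # interaction sequences, reasoning before each action.
    for _ in range(num_rollouts):
        state, turns, total_reward = env.reset(), [], 0.0
        for _ in range(max_turns):
            thought, action = model.generate_turn(state)  # reason, then act
            state, reward, done = env.step(action)
            turns.append((thought, action, reward))
            total_reward += reward
            if done:
                break
        trajectories.append((turns, total_reward))

    # Phase 2: update. Optimize on rewards normalized across the batch,
    # so whole trajectories (not single steps) carry the learning signal.
    rewards = torch.tensor([r for _, r in trajectories])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    loss = 0.0
    for (turns, _), adv in zip(trajectories, advantages):
        # Sum of log-probs of every generated token in the trajectory,
        # weighted by the trajectory-level advantage.
        loss = loss - adv * model.trajectory_log_prob(turns)
    loss = loss / num_rollouts

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```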
In the Bandit environment, for instance, agents are told that "Dragon" and "Phoenix" arms represent different reward distributions. Rather than being told the probabilities directly, they must reason symbolically — e.g., interpreting Dragon as "strength" and Phoenix as "hope" — to predict outcomes. This kind of setup pressures the model to generate explainable, analogical reasoning.

Stabilizing reinforcement learning with StarPO-S

To address training collapse, the researchers introduced StarPO-S, a stabilized version of the original framework. StarPO-S incorporates three key interventions (two of them are sketched in code further below):

Uncertainty-based rollout filtering: prioritizing rollouts where the agent shows outcome uncertainty.
KL penalty removal: allowing the model to deviate more freely from its original policy and explore new behaviors.
Asymmetric PPO clipping: amplifying high-reward trajectories more than low-reward ones to boost learning.

These changes delay or eliminate training collapse and improve performance across all three tasks. As Wang put it: "StarPO-S… works across all 3 tasks. Relieves collapse. Better reward."

What makes for a good agentic AI model?

The success of RL training hinges not just on architecture but on the quality of the data generated by the agents themselves. The team identified three dimensions that significantly impact training:

Task diversity: exposing the model to a wide range of initial scenarios improves generalization.
Interaction granularity: allowing multiple actions per turn enables more meaningful planning.
Rollout freshness: keeping training data aligned with the current model policy avoids outdated learning signals.

Together, these factors make the training process more stable and effective. An interactive demo site published by the researchers on GitHub makes this explicit, visualizing agent rollouts as full dialogue turns — including not just actions, but the step-by-step thought process that preceded them. For example, in solving a math problem, an agent may first "think" about isolating a variable, then submit an answer like "x = 5". These intermediate thoughts are visible and traceable, adding transparency into how agents arrive at decisions.

When reasoning runs out

While explicit reasoning improves performance in simple, single-turn tasks like Bandit, it tends to decay during multi-turn training. Despite the use of structured prompts and tokens, reasoning traces often shrink or vanish unless directly rewarded. This points to a limitation in how rewards are typically designed: focusing on task completion may neglect the quality of the process behind it. The team experimented with format-based penalties to encourage better-structured reasoning, but acknowledges that more refined reward shaping is likely needed.

RAGEN, along with its StarPO and StarPO-S frameworks, is now available as an open-source project at https://github.com/RAGEN-AI/RAGEN. However, no explicit license is listed in the GitHub repository at the time of writing, which may limit use or redistribution by others. The system provides a valuable foundation for those interested in developing AI agents that do more than complete tasks — agents that think, plan, and evolve. As AI continues to move toward autonomy, projects like RAGEN help illuminate what it takes to train models that learn not just from data, but from the consequences of their own actions.
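As a rough illustration of two of the StarPO-S interventions described above, the sketch below shows uncertainty-based rollout filtering and an asymmetric PPO-style clipped objective. This is a hand-written approximation, not code from the RAGEN repository; the keep fraction, clip bounds, and data layout (rollout groups as dicts with a "rewards" tensor) are invented for illustration.

```python
# Rough illustration of two StarPO-S ideas; not code from the RAGEN repo.
import torch

def filter_uncertain_rollouts(rollouts, keep_fraction=0.5):
    """Uncertainty-based filtering: keep the rollout groups whose reward
    variance is highest, i.e. where the agent's outcomes are least certain."""
    scored = sorted(rollouts, key=lambda g: float(g["rewards"].std()),
                    reverse=True)
    return scored[: max(1, int(len(scored) * keep_fraction))]

def asymmetric_ppo_loss(log_probs, old_log_probs, advantages,
                        clip_low=0.2, clip_high=0.28):
    """PPO-style clipped objective with a wider upper clip bound, so
    high-advantage (high-reward) trajectories are amplified more than
    low-advantage ones are suppressed. The bounds here are illustrative."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```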
Outstanding Questions for Real-World Adoption

While the RAGEN paper offers a detailed technical roadmap, several practical questions remain for those looking to apply these methods in enterprise settings. For example, how transferable is RAGEN's approach beyond stylized, symbolic tasks? Would businesses need to design entirely new environments and reward functions to use this system in workflows like invoice processing or customer support?

Another critical area is scalability. Even with the enhancements provided by StarPO-S, the paper acknowledges that training still eventually collapses over longer horizons. This raises the question: is there a theoretical or practical path to sustaining reasoning over open-ended or continuously evolving task sequences?

As noted above, no explicit license is listed in the RAGEN GitHub repository or documentation at the time of writing, leaving open questions about usage rights. To explore these and other questions — including how non-technical decision-makers should interpret RAGEN's implications — I reached out to co-author Wang for further insight. A response is pending; any comments that arrive will be included in a follow-up to this article or integrated as an update.

RAGEN stands out not just as a technical contribution but as a conceptual step toward more autonomous, reasoning-capable AI agents. Whether it becomes part of the enterprise AI stack remains to be seen, but its insights into agent learning dynamics are already helping redefine the frontier of LLM training.
-
VENTUREBEAT.COM
The Political Machine 2024 update includes tariffs, new demographics and more
The Political Machine is one of those games that never has to end, given that its fuel is interest in U.S. presidential politics.
-
VENTUREBEAT.COM
Google adds more AI tools to its Workspace productivity apps
Google expanded Gemini's features, adding the popular podcast-style feature Audio Overviews to the platform.
-
WWW.THEVERGE.COM
Nvidia's AI assistant on Windows now has plugins for Spotify, Twitch, and more

Nvidia is updating its G-Assist AI assistant on Windows to take it beyond optimizing game and system settings. G-Assist originally launched last month as a chatbot primarily focused on improving PC gaming, but it's now getting plugin support, so you can extend the AI assistant to control Spotify, check if a streamer is live on Twitch, and look at stock or weather updates.

The new ChatGPT-based G-Assist plugin builder lets developers and enthusiasts create custom functionality for Nvidia's AI assistant. G-Assist will be able to connect to external tools and use APIs to expand the capabilities of what Nvidia offers right now. Nvidia has published sample plugins on GitHub that can be compiled and used by G-Assist:

Spotify — hands-free music and volume control
Google Gemini — allows G-Assist to invoke Gemini for cloud-based complex conversations
Twitch — check whether a streamer is live with voice commands like "Hey, Twitch, is [streamer] live?"
Peripheral Controls — change RGB lighting or fan speed on Logitech G, Corsair, MSI, and Nanoleaf devices
Stock Checker — provides real-time stock prices
Weather Updates — provides current weather conditions in any city

These plugins all run locally using a small language model on Nvidia's RTX GPUs, and developers will also be able to share their own custom plugins through GitHub. G-Assist uses a local small language model that requires nearly 10GB of space for the assistant functions and voice capabilities. The AI assistant works on a variety of RTX 30-, 40-, and 50-series desktop GPUs, but you'll need a card with at least 12GB of VRAM. If you're interested in trying out G-Assist or building a plugin, the app is available as an optional part of Nvidia's main app for Windows.
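To give a sense of the kind of logic such a plugin wraps, here is a small, hypothetical Python sketch that answers "is [streamer] live?" using Twitch's public Helix API. This is not Nvidia's G-Assist plugin interface, which the article does not detail; the client ID, token, and function name are placeholders.

```python
# Hypothetical sketch of an "is this streamer live?" check using Twitch's
# public Helix API. This is NOT Nvidia's G-Assist plugin interface; the
# credentials below are placeholders you would obtain from Twitch.
import requests

TWITCH_CLIENT_ID = "your-client-id"     # placeholder
TWITCH_OAUTH_TOKEN = "your-app-token"   # placeholder

def is_streamer_live(user_login: str) -> bool:
    """Return True if the given Twitch channel is currently streaming."""
    resp = requests.get(
        "https://api.twitch.tv/helix/streams",
        params={"user_login": user_login},
        headers={
            "Client-Id": TWITCH_CLIENT_ID,
            "Authorization": f"Bearer {TWITCH_OAUTH_TOKEN}",
        },
        timeout=10,
    )
    resp.raise_for_status()
    # Helix returns a non-empty "data" array only for live streams.
    return len(resp.json().get("data", [])) > 0

if __name__ == "__main__":
    print(is_streamer_live("somestreamer"))
```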
-
An Existential Crisis of a Veteran Researcher in the Age of Generative AI

I was a researcher fifteen years ago, a PhD candidate doing research for long days. I was swamped with articles, annotations, emails, bookmarks, and more. When I found a citation manager, Mendeley, I felt so relaxed; it was as if I had control over the process again. When I found a bookmark manager, XBookmark, I felt so productive (I still have the bookmarks). They worked well for me at the time, and I finished my PhD program and got my degree.

Episode 1 — Facing the Reality

These days, I am having an existential crisis. I see how far AI research assistant tools have progressed. I was working with Scinito the other day, and I was shocked. I tried to convince myself that it is only a tool to help with the literature review. Those who did a PhD know how tough a solid literature review is. It is not a joke: you have to read over 100 articles, categorize them, understand them, and summarize them. If I say it took 3-6 months to do a solid literature review 15 years ago, I am not wrong. You heard me right: 3-6 months of your valuable lifetime.

At first, I tried to convince myself that Scinito and similar tools offer only marginal value to researchers. Sadly, or happily, I was wrong. Not only can they do a literature review for you in a minute (I am sorry, my fellows, you heard it right), but they can also peer review your articles. I will never forget how long I had to wait for my mentors and advisors to review my articles, and how many back-and-forth emails we exchanged until things reached an acceptable quality. Even after all that effort, you get extensive feedback from peer reviewers before a journal considers your article for publication. Or you get rejected after 3 or 6 months simply for choosing the wrong journal. These AI research assistant tools can enhance all of these steps: reviewing your articles and selecting the most relevant journal for you. Amazing. It is indeed amazing for researchers in this era, but I feel sad to see how much time I spent on things that could have been done much more easily and much faster.

The interesting part is that this is not the end of the season; it is just the beginning. This challenge is not just for researchers; it is also relevant to software developers. Tools like the Cursor IDE have dramatically transformed how we build software. After my PhD, I started a career in engineering, so I did lots of coding, testing, and so on. Today, I don't need to read Stack Overflow to debug my code. I don't need to spend time writing tests for my code. I no longer need to be an expert in React, HTML, or CSS to build a website. How much time did I spend building sites in the past? I don't want to think about it!

Episode 2 — Embracing the Reality

Let me share the glass-half-full experience as well. This is super cool. I can ask AI research assistant tools to perform a semantic search within an extensive database, something we couldn't do before; it was just keyword matching. I can get up to date on any topic or research question in hours by reading the literature review that AI generates in seconds. I can write LaTeX easily. I can reformat my paper to any guideline in minutes. I am happy for researchers of our time: they can spend more time on creativity, problem solving, and, of course, their valuable lives rather than on unnecessary, time-consuming tasks.

I am also happy for myself. I can write code in any programming language I want. I can build websites without getting locked into Wix or WordPress. I can write any Python code I need, optimize it, and write a series of tests for it. Wow! It is super cool. The landscapes of coding, designing, researching, and everything else are evolving fast. No matter how much people or organizations resist, technology will find its way.

There is a catch here. The promise of building a website with one (and only one) prompt is not correct. I am saying this based on very recent experience. These days, I am working on a new website with a colleague; both of us are experts in software and AI. We didn't even think about Wix or WordPress this time. We started using Cursor and its Agent experience with Claude 3.7 Sonnet. Cursor's agent can generate the website structure in a second, but it fails when it comes to details. For example, when you want to align two different texts with each other, especially when one is static and one is dynamic, the AI can't do it right. Basically, AI can build the website structure in a second, but it can't handle the required details (the details you want to apply as a human on top of the prebuilt structure) as well as a UI design expert can. That means that even though we don't need to be experts in React or CSS, we must know the basics to intervene in the codebase when needed. Plus, we must know the concepts well enough to elaborate on them. If you can't say it, AI can't make it!

I am not shocked by this weakness of AI models. They are built on the "wisdom of crowds" principle: they aggregate what's most common rather than emulating the intuition of an individual expert. This is rooted in their fundamentals. They are amazing in generality but suffer in specificity. In this short podcast, I explained a similar concept from a different angle: "The Erosion of Specificity."

Last Words

I have been lucky to be part of the AI community. I am an AI architect with a solid plan for embracing this technology shift. But I am concerned about the many other folks who can't manage this change. This is not easy at all. If you have had an existential moment in your career due to AI, let me know; I may know something that helps. If I could share one tip here, it would be this: learn the fundamentals, deeply. You can (and should) leave the repetitive, high-level, generic tasks to AI, and spend your human creativity and expertise on the details that make your work or product shine.

Follow me on YouTube if you want to hear more stories from the perspective of an AI architect: youtube.com/@AIUnorthodoxLessons/
-
TOWARDSDATASCIENCE.COM
Why Most Cyber Risk Models Fail Before They Begin

Cybersecurity leaders are being asked impossible questions. "What's the likelihood of a breach this year?" "How much would it cost?" And "how much should we spend to stop it?" Yet most risk models used today are still built on guesswork, gut instinct, and colorful heatmaps, not data. In fact, PwC's 2025 Global Digital Trust Insights Survey found that only 15% of organizations are using quantitative risk modeling to a significant extent. This article explores why traditional cyber risk models fall short and how applying some light statistical tools, such as probabilistic modeling, offers a better way forward.

The Two Schools of Cyber Risk Modeling

Information security professionals primarily use two different approaches to modeling risk during the risk assessment process: qualitative and quantitative.

Qualitative Risk Modeling

Imagine two teams assess the same risk. One assigns it a score of 4/5 for likelihood and 5/5 for impact. The other, 3/5 and 4/5. Both plot it on a matrix. But neither can answer the CFO's question: "How likely is this to actually happen, and how much would it cost us?"

A qualitative approach assigns subjective risk values derived primarily from the intuition of the assessor. It generally classifies the likelihood and impact of the risk on an ordinal scale, such as 1-5. The risks are then plotted in a risk matrix to show where they fall on this scale.

[Figure: risk matrix. Source: Securemetrics Risk Register]

Often, the two ordinal scales are multiplied together to help prioritize the most important risks based on probability and impact. At a glance, this seems reasonable, as the commonly used definition for risk in information security is:

\[\text{Risk} = \text{Likelihood} \times \text{Impact}\]

From a statistical standpoint, however, qualitative risk modeling has some important pitfalls. The first is the use of ordinal scales. While assigning numbers to an ordinal scale gives the appearance of mathematical backing, this is an illusion. Ordinal scale values are simply labels; there is no defined distance between them. The distance between a risk with an impact of "2" and an impact of "3" is not quantifiable, and changing the labels to "A", "B", "C", "D", and "E" makes no difference. This in turn means our formula for risk is flawed under qualitative modeling: a likelihood of "B" multiplied by an impact of "C" is impossible to compute.

The other key pitfall is modeling uncertainty. When we model cyber risks, we are modeling future events that are not certain; there is a range of outcomes that could occur. Distilling cyber risks into single-point estimates (such as "20/25" or "High") fails to express the important distinction between "a most likely annual loss of $1 million" and "a 5% chance of a $10 million or greater loss."

Quantitative Risk Modeling

Imagine a team assessing a risk. They estimate a range of outcomes, from $100K to $10M. Running a Monte Carlo simulation, they derive a 10% chance of exceeding $1M in annual losses and an expected loss of $480K. Now when the CFO asks, "How likely is this to happen, and what would it cost?", the team can respond with data, not just intuition. This approach shifts the conversation from vague risk labels to probabilities and potential financial impact, a language executives understand.
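As a rough illustration of the kind of Monte Carlo analysis just described, the sketch below simulates annual losses for a single risk and reports the expected loss and the probability of exceeding $1M. The event probability and lognormal severity parameters are illustrative assumptions chosen for demonstration, not figures from the article.

```python
# Illustrative Monte Carlo sketch of annual loss for one cyber risk.
# The event probability and lognormal parameters are assumptions chosen
# for demonstration, not values taken from the article.
import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

# Assume the loss event occurs in a given year with 30% probability,
# and that severity (given an event) is lognormal, spanning roughly
# $100K to $10M across its plausible range.
event_occurs = rng.random(n_trials) < 0.30
severity = rng.lognormal(mean=np.log(500_000), sigma=1.0, size=n_trials)
annual_loss = np.where(event_occurs, severity, 0.0)

expected_loss = annual_loss.mean()
p_exceed_1m = (annual_loss > 1_000_000).mean()

print(f"Expected annual loss: ${expected_loss:,.0f}")
print(f"P(annual loss > $1M): {p_exceed_1m:.1%}")
```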
If you have a background in statistics, one concept in particular should stand out here: likelihood. Cyber risk modeling is, at its core, an attempt to quantify the likelihood of certain events occurring and the impact if they do. This opens the door to a variety of statistical tools, such as Monte Carlo simulation, that can model uncertainty far more effectively than ordinal scales ever could. Quantitative risk modeling uses statistical models to assign dollar values to loss and to model the likelihood of these loss events occurring, capturing the future uncertainty. While qualitative analysis might occasionally approximate the most likely outcome, it fails to capture the full range of uncertainty, including rare but impactful events known as "long tail risk."

[Figure: loss exceedance curve. Source: Securemetrics Cyber Risk Quantification]

The loss exceedance curve plots the likelihood of exceeding a certain annual loss amount on the y-axis and the various loss amounts on the x-axis, resulting in a downward-sloping line. Pulling different percentiles off the loss exceedance curve, such as the 5th percentile, mean, and 95th percentile, provides a view of the possible annual losses for a risk with 90% confidence. While the single-point estimate of qualitative analysis may get close to the most likely loss (depending on the accuracy of the assessor's judgment), quantitative analysis captures the full uncertainty of outcomes, even those long-tail events that are rare but still possible. A small code sketch of deriving this curve from simulated losses follows the takeaways table below.

Looking Outside Cyber Risk

To improve our risk models in information security, we only need to look outward at the techniques used in other domains. Risk modeling has matured in a variety of applications, such as finance, insurance, aerospace safety, and supply chain management. Financial teams model and manage portfolio risk using Bayesian statistics and related techniques. Insurance teams model risk with mature actuarial models. The aerospace industry models the risk of system failures using likelihood modeling. And supply chain teams model risk using probabilistic simulations. The tools exist. The math is well understood. Other industries have paved the way. Now it's cybersecurity's turn to embrace quantitative risk modeling to drive better decisions.

Key Takeaways

Qualitative                    | Quantitative
Ordinal scales (1-5)           | Probabilistic modeling
Subjective intuition           | Statistical rigor
Single-point scores            | Risk distributions
Heatmaps & color codes         | Loss exceedance curves
Ignores rare but severe events | Captures long-tail risk
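Continuing the hypothetical simulation from the earlier sketch, here is one way a loss exceedance curve and its headline percentiles might be derived from an array of simulated annual losses. The sample is regenerated here so the snippet runs on its own; all parameters remain illustrative.

```python
# Derive a loss exceedance curve from simulated annual losses,
# continuing the illustrative example from the earlier sketch.
import numpy as np

# Recreate the illustrative annual-loss sample from the previous sketch.
rng = np.random.default_rng(42)
event = rng.random(100_000) < 0.30
annual_loss = np.where(event, rng.lognormal(np.log(500_000), 1.0, 100_000), 0.0)

def loss_exceedance_curve(losses, thresholds):
    """For each loss threshold, return P(annual loss > threshold)."""
    return np.array([(losses > t).mean() for t in thresholds])

thresholds = np.linspace(0, 5_000_000, 21)
exceedance = loss_exceedance_curve(annual_loss, thresholds)

for t, p in zip(thresholds, exceedance):
    print(f"P(loss > ${t:>12,.0f}) = {p:6.1%}")

# Headline figures: the 5th and 95th percentiles bound the plausible
# annual loss with 90% confidence; the mean is the expected loss.
p5, p95 = np.percentile(annual_loss, [5, 95])
print(f"90% interval: ${p5:,.0f} to ${p95:,.0f}, mean ${annual_loss.mean():,.0f}")
```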
-
WWW.GAMESPOT.COM
All Overwatch 2 Dokiwatch Skins, Name Cards, And Cosmetics

Overwatch 2 Season 16 is now live, adding the new damage hero Freja and the new competitive mode Stadium. To kick off the season, there is a new collection of cosmetic items to get: the Dokiwatch skins. These skins are designed to evoke the imagery of magical girls, a genre of anime and manga exemplified by Sailor Moon, but instead of tying in an official license, Blizzard has made its own. There are nine skins total, featuring different magical girls and a few adjacent designs, but collecting all of them will require a few different purchases, unfortunately.

The main Dokiwatch Mega Bundle on the store costs 5,900 coins (about $60) and includes five skins: Dream Sheep Orisa, Glitter Mei, Heart of Grace Widowmaker, Heart of Passion Kiriko, and Nocturna D.Va. The bundle also includes name cards and player icons, and these five skins can be purchased individually for 1,900 coins each directly in the hero gallery. However, there are a few more Dokiwatch skins to get elsewhere in Overwatch 2.

An additional two skins are available in the battle pass: the Heart of Strength Brigitte skin at tier 20 in the premium pass and the Hero of Heart Tracer skin at tier 40 for free. The Juno Heart of Hope Mythic skin is in the Mythic shop with a base cost of 50 Prisms, with additional tiers of customization available for more Mythic Prisms, which can be earned via the battle pass or purchased for real money.

Lastly, there is one final Dokiwatch skin, the Heart of Courage Freja skin, which is available exclusively in the Ultimate Battle Pass Bundle; it costs $40 and includes the premium battle pass, the Demon Rocker Illari skin, 2,000 coins, and 20 tier skips. This means that in order to get all of the skins, you would need to purchase the Ultimate Battle Pass Bundle and the Mega Bundle, for a grand total of $100.

Below you can see every Dokiwatch skin, along with other cosmetics like icons and sprays:

Heart of Courage - Freja skin
Heart of Hope - Juno skin
Heart of Strength - Brigitte skin
Hero of Heart - Tracer skin
Dream Sheep - Orisa skin
Glitter - Mei skin
Heart of Grace - Widowmaker skin
Heart of Passion - Kiriko skin
Nocturna - D.Va skin
Winged Heart - weapon charm
Glitter Mei - name card
Heart of Grace Widowmaker - name card
Heart of Passion Kiriko - name card
Nocturna D.Va - name card
Dream Sheep Orisa - player icon
Glitter Mei - player icon
Heart of Grace Widowmaker - player icon
Heart of Passion Kiriko - player icon
Hallow Hearts Persephone - player icon
Ariadne's Heart - spray
Dream Sheep Orisa - spray
Glitter Mei - spray
Heart of Passion Kiriko - spray
Nocturna D.Va - spray
-
WWW.GAMESPOT.COM
MLB The Show 25 Gets First Big Discount For PS5, Xbox, And Switch

MLB The Show 25 is on sale for a big discount at Amazon just one month after launch. Sports fans can save $20 on Sony San Diego Studio's annual baseball sim for PS5, Xbox Series X, and Nintendo Switch. The PlayStation and Xbox editions are up for grabs for $50 (was $70), while the Switch version is down to $40 (was $60). Since it just released on March 18, this is unsurprisingly the best deal so far on MLB The Show 25.

MLB The Show 25 Deals at Amazon: