WWW.MARKTECHPOST.COM

LLM Reasoning Benchmarks Are Statistically Fragile: New Study Shows Reinforcement Learning (RL) Gains Often Fall within Random Variance

Reasoning capabilities have become central to advances in large language models (LLMs) and to the leading AI systems developed by major research labs. Despite a surge in research on understanding and improving LLM reasoning, significant methodological challenges persist in evaluating these capabilities accurately. Concerns about evaluation rigor are growing, because non-reproducible or inconclusive assessments risk distorting scientific understanding, misguiding adoption decisions, and skewing future research priorities. In a field where quick publication cycles and benchmarking competitions are commonplace, methodological shortcuts can silently undermine genuine progress. Reproducibility issues in LLM evaluations have been documented before, but their continued presence, particularly in reasoning tasks, demands closer scrutiny and stricter evaluation standards, so that reported advances reflect genuine capabilities rather than artifacts of flawed assessment.

Numerous approaches have emerged to improve reasoning in language models, with supervised fine-tuning (SFT) and reinforcement learning (RL) being the primary methods of interest. Recent work has extended the DeepSeek-R1 recipe with new RL algorithms such as LCPO, REINFORCE++, DAPO, and VinePPO. Researchers have also conducted empirical studies of RL design spaces, data scaling trends, curricula, and reward mechanisms. Despite these advances, the field faces significant evaluation challenges. Machine learning progress often lacks rigorous assessment, with many reported gains failing to hold up when tested against well-tuned baselines. RL algorithms are particularly sensitive to implementation details, including random seeds, which raises concerns about the reliability of benchmarking practices.

Motivated by inconsistent claims in reasoning research, this study by researchers from the Tübingen AI Center, the University of Tübingen, and the University of Cambridge conducts a rigorous investigation into mathematical reasoning benchmarks and finds that many recent empirical conclusions fail under careful re-evaluation. The analysis identifies surprising sensitivity in LLM reasoning pipelines to minor design choices, including decoding parameters, prompt formatting, random seeds, and hardware configurations. Small benchmark sizes contribute significantly to this instability: a single question can shift Pass@1 scores by roughly 3 percentage points on datasets like AIME’24 and AMC’23. This leads to double-digit performance variations across seeds, undermining published results. The study systematically analyzes these sources of instability and proposes best practices for improving reproducibility and rigor in reasoning evaluations, providing a standardized framework for re-evaluating recent techniques under more controlled conditions.

The study explores design factors affecting reasoning performance through a standardized experimental framework. Nine widely used models in the 1.5B and 7B parameter classes were evaluated, including DeepSeek-R1-Distill variants, DeepScaleR-1.5B, II-1.5B-Preview, OpenRS models, S1.1-7B, and OpenThinker-7B. Using consistent hardware (A100 GPU, AMD CPU) and software configurations, the models were benchmarked on the AIME’24, AMC’23, and MATH500 datasets using Pass@1. The analysis revealed significant performance variance across random seeds, with standard deviations ranging from 5 to 15 percentage points. This instability is particularly pronounced on the smaller datasets, where a single question can shift performance by 2.5 to 3.3 percentage points, making single-seed evaluations unreliable.

Based on these standardized evaluations, the study reports several key findings about current reasoning methods. Most RL-trained variants of the DeepSeek R1-Distill model fail to deliver meaningful performance improvements, with only DeepScaleR demonstrating robust, significant gains across benchmarks. While RL training can substantially improve base model performance when applied to models like Qwen2.5, instruction tuning generally remains superior, with Open Reasoner-Zero-7B being the notable exception. In contrast, SFT consistently outperforms instruction-tuned baselines across all benchmarks and generalizes well to new datasets like AIME’25, highlighting its robustness as a training paradigm. RL-trained models show pronounced performance drops between AIME’24 and the more challenging AIME’25, indicating problematic overfitting to training distributions. The study also investigates the correlation between response length and accuracy, with longer responses consistently showing higher error rates across all model types.

Overall, the analysis shows that apparent progress in LLM-based reasoning has been built on unstable foundations, with performance metrics susceptible to minor variations in evaluation protocols. RL approaches yield modest improvements at best and frequently overfit to specific benchmarks, while SFT consistently delivers robust, generalizable gains. To establish more reliable assessment standards, the study calls for standardized evaluation frameworks with Dockerized environments, seed-averaged metrics, and transparent protocols. These findings highlight the need for methodological rigor over leaderboard competition, so that claimed advances in reasoning reflect genuine progress rather than artifacts of inconsistent evaluation practices.

Mohammad Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.
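The benchmark-size arithmetic in the article can be made concrete with a short sketch. It assumes AIME’24 has 30 questions and AMC’23 has 40; the article does not state these sizes, but they are consistent with its 2.5 to 3.3 point figures. All function names are hypothetical, and the per-seed scores are made up for illustration rather than taken from the study. The sketch shows how far one question moves Pass@1, how large binomial sampling noise is at that benchmark size, and how a seed-averaged score with its spread could be reported instead of a single run.

```python
import math
import statistics


def pass_at_1(correct_flags):
    """Pass@1 in percent: share of questions answered correctly on the first try."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


def single_question_shift(n_questions):
    """Percentage points by which Pass@1 moves when one answer flips."""
    return 100.0 / n_questions


def binomial_std_points(accuracy, n_questions):
    """Standard deviation (in points) of Pass@1 from sampling noise alone.

    Simplifying assumption: each question is an independent coin flip
    that comes up correct with probability `accuracy`.
    """
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def seed_averaged(scores):
    """Mean and sample standard deviation of Pass@1 across several seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


# Assumed benchmark sizes (not given in the article).
BENCHMARK_SIZES = {"AIME'24": 30, "AMC'23": 40, "MATH500": 500}

if __name__ == "__main__":
    for name, n in BENCHMARK_SIZES.items():
        print(
            f"{name}: one question = {single_question_shift(n):.2f} points, "
            f"noise at 30% accuracy = {binomial_std_points(0.3, n):.1f} points"
        )

    # Made-up per-seed scores for illustration only.
    illustrative_scores = [26.7, 33.3, 30.0, 23.3, 36.7]
    mean, spread = seed_averaged(illustrative_scores)
    print(f"seed-averaged Pass@1 = {mean:.1f} +/- {spread:.1f}")
```

Under the independence assumption, a 30-question benchmark at 30% accuracy already carries roughly 8 points of sampling noise, which sits inside the 5 to 15 point range the study reports across seeds. That is one way to see why a gain of a few points from a single run is hard to distinguish from chance.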
-
WWW.IGN.COM

Stan Lee's Daughter J.C. Lee Denies Elder Abuse Allegations

J.C. Lee, the daughter of legendary Marvel creative Stan Lee, has spoken out for the first time in an interview with Business Insider to deny past allegations that she committed elder abuse toward both her father and her mother, Joan.

The accusations against J.C. Lee first began circulating in 2017 following the death of her mother, with the most damning portrait appearing in a 2018 piece in The Hollywood Reporter. The report alleges that, under pressure from three other individuals connected to the Lees, J.C. Lee frequently pressured her parents for money and control of assets. It claims that she engaged in screaming matches with Stan Lee and had what the article describes as a "powder-keg" relationship with him, with claims including verbal arguments and at least one alleged physical altercation. THR was provided photos of a bruise on Joan Lee’s arm and included detailed accusations that J.C. Lee has denied.

Now, in the Business Insider interview, J.C. Lee says that the accusations were all "a lie." She says she did not make a public denial at the time of the initial THR piece on the advice of people around her. "You think I haven't regretted it to this day?" she said. "They are all lies. That photo is insane. I never did it."

J.C. Lee acknowledged that she did often get into screaming fights with her parents over money, but says it never became physical. "I never ever touched my parents," she said.

Stan Lee died of a heart attack in 2018 at the age of 95.

The full interview with J.C. Lee in Business Insider recounts her struggles growing up as the child of the famous Stan Lee and her ongoing troubles with money, being manipulated by others, loneliness, creativity, and living with her father's legacy.

Rebekah Valentine is a senior reporter for IGN. You can find her posting on BlueSky @duckvalentine.bsky.social. Got a story tip? Send it to rvalentine@ign.com.

Blogroll image credit: Albert L. Ortega/Getty Images
-
NEWS.XBOX.COM

Unleash Chaos with Five Tips to Become a Hot Rod Mayhem Champion

Summary
Hot Rod Mayhem is available now on Xbox Series X|S & Xbox One.
Learn more about the delightfully chaotic racer, powerups and track modes.
Hit the tracks and become a Hot Rod Champion solo or with friends.

Hello soon-to-be Hot Rod Champion! Hot Rod Mayhem has drifted onto Xbox Series X|S & Xbox One! Jump in starting today as toy-sized racers on huge tracks.

Enjoy Hot Rod Mayhem with two distinct modes. In Race mode, you can play solo, grab a friend or play with CPU rivals in a more casual race setting. In Championship mode, you can get competitive and rise to the top: you’ll face classic races on circuits to earn points, win championships, and unlock new races, trials, tracks and outfits. With three difficulty settings to customize your play style, ten unlockable tracks and an arsenal of powerups at your disposal, things can and will get delightfully chaotic!

Now, before you customize your Hot Rod Champion and hit that gas pedal to the floor, I want to make sure you’re ready to become a champion.

Choose your Hot Rod Wisely
Each vehicle has different strengths. Choose and customize your car with colors and designs, but also keep in mind its stats. Weight, Speed, Handling and Boost will all play important roles on the tracks. Understand the strengths and choose the car that suits your driving style to get the most out of your races.

Get Familiar with the Tracks
Get to know the tracks so you can anticipate what’s going to happen. Hot Rod Mayhem’s lively circuits are full of surprises, living creatures and elements that will drive mayhem all over the race. Learn early how those creatures and stray objects behave to have a winning card in your deck!

Drift Like the Trophy Depends on it! (It Does)
Master the drift. Make an art of it and do it as much as possible, everywhere! Practice snaking or even drifting mid-air, anytime you can. Drifting will not only give you a quick edge but also fill the turbo bar, which will give you extra speed all around.

Turbo Turbo Turbo!
Don’t fall asleep: go fast by using turbo energy, and fill your turbo bar while drifting. The faster you go, the quicker your turbo bar will reload, giving you more speed. This also applies to all the game elements that grant you turbo energy when you collide with them. Find these soon!

The Power of Power-ups
While many other kart games have power-ups, the key in Hot Rod Mayhem is to use them strategically. Hold onto your power-ups until you are sure it’s the best time to use them. Crystal Marble will roll over the competition, but be careful not to get rolled over yourself when it bounces. Hot Missile makes a beeline for first place, so try not to shoot it when you’re in the lead. These are just a few examples of the tactical power-ups available, so be sure to use them smartly!

Hot Rod Mayhem is a big project for our studio, Casual Brothers, which has been making games for more than ten years, mostly working for other publishers. We are very happy to have the chance to publish something on our own this time. Our team worked hard to bring joy to players all over the world. We hope to connect with people by giving our unique take on racing games! Feel free to share the love and feedback on social media or join us and other Hot Rod Champions on our official Casual Brothers Games Discord.

I hope these tips help you rise to the top of the leaderboards and avoid as many rockets as possible. Check out Hot Rod Mayhem today on Xbox Series X|S & Xbox One.

From our whole team at Casual Brothers Games, thank you for reading and good luck out there on the track!

Hot Rod Mayhem
Casual Brothers Ltd.
$19.99
Start your engines, gearheads: it's time for Hot Rod Mayhem! Prove yourself as the ultimate little racer by putting your speed and swerving skills to the test in wild races and one-of-a-kind trials! Use perilous pick-ups, like the tricky Marble or the Homing Dart, to leave your opponents in the dust!

The post Unleash Chaos with Five Tips to Become a Hot Rod Mayhem Champion appeared first on Xbox Wire.
-
9TO5MAC.COM

Apple’s Messages app shows Meta is not a monopoly, says Meta

In an extremely high-profile legal case, Meta is currently trying to fend off antitrust claims so the FTC doesn’t break it up. And today as part of its prepared defense, the company sought to use Apple’s Messages app as evidence that it’s not a monopoly.

Meta’s opening statement slides are available for viewing in full, per The Verge. And as highlighted by Wes Davis, even some of the slides with redacted info are easy to uncover. One slide, for example, compares weekly device usage of Apple’s Messages app to Meta’s competing offerings on iOS. Here are the numbers:

Apple Messages: 88.39% device use
Instagram: 48.19%
Facebook Messenger: 37.55%
WhatsApp: 36.76%

This data is presented in the context of Meta seeking to refute the FTC’s apparent perspective that standard “Messaging” is different from “Personal Social Networking.” There’s also an accompanying quote from Apple Director of Product Marketing, Ronak Shah, saying: A “core use case” of iMessage is “to allow users to communicate with the people that are in their life that they know.”

In other words, Meta can’t have a monopoly when Apple’s built-in Messages app is more popular, at least on iOS. Of course, Apple isn’t the only company Meta highlights in its defense. The tech giant also points to competition from TikTok, YouTube, Snapchat, and more.

What do you think of Meta using Apple’s Messages app in its defense? Let us know in the comments.
-
FUTURISM.COM

Zillionaire Girlbosses Astonished by Backlash to Their Frivolous Trip to Space

Earlier this week, a crew of six women (including pop star Katy Perry, CBS broadcast journalist Gayle King, and Blue Origin CEO Jeff Bezos' fiancée Lauren Sánchez) launched to the edge of space as part of an 11-minute thrill ride organized by the Amazon cofounder's space company.

The vacuous publicity stunt, which claimed to make the crew of mostly uber-wealthy media personalities "astronauts" after a mere two days of basic safety training, drew plenty of criticism.

After all, apart from spending an obscene amount of money and rattling off cringeworthy statements about "making space for future women," the crew had little to contribute to science, discourse, or meaningful feminism.

Put simply, the collective eye-rolls the stunt induced could've been visible from space.

Yet the widespread backlash came as a surprise to crew members, who had allegedly been inundated by messages from inspired fans.

"Anybody that’s criticizing doesn’t really understand what is happening here," King said during an interview following the launch, as quoted by People magazine. "We can all speak to the response we’re getting from young women, from young girls, about what this represents."

Bezos' multimillionaire fiancée also said that the criticism got her "fired up," arguing that Blue Origin employees had "put their heart and soul into this vehicle," while she lay down on a padded, reclining seat to rocket into space.

Several other high-profile celebrities took a swipe at the publicity stunt.

"Billion dollars bought some good memes I guess," actor Olivia Wilde wrote in a Monday Instagram post, as quoted by People.

"Space exploration was to further our knowledge and to help mankind," she argued, while hosting an NBC daytime TV show earlier this month. "What are they gonna do up there that has made it better for us down here?"

Comedian Amy Schumer also skewered the trip in a video.

"Guys, last second, they added me to space, and I’m going to space," she said sarcastically.

Model Emily Ratajkowski had an even stronger reaction, noting that she was "literally disgusted" by the "beyond parody" stunt. In a TikTok video, she pointed out that while the optics of women of color going to "space" looked great on paper, the stunt had little to do with actual progress.

"Instead it just speaks to the fact that we are living in an oligarchy where there's a very small group of people who are interested in going into space for the sake of getting a new lease on life, while the rest of the population... are worried about paying rent or [providing] dinner for their kids," Ratajkowski said.

Other onlookers also noted the baffling demonstration of privilege by the ultra-rich.

"If Jeff Bezos can send Katy Perry into space, he can pay a wealth tax so every American has debt-free healthcare," educator and activist Nina Turner wrote in a post on Bluesky.

However, the widespread criticism appeared to have fallen on deaf ears.

"This is a freaking journey," a defensive King said during a post-launch interview. "It was not a joyride."

"I’m not going to let you steal our joy," she added while addressing her "haters."
-
SCREENCRUSH.COM

One of America's Oldest Taco Bell Locations Is Closing Forever

Taco Bell is an institution in America, but we’re seeing the end of an era: The restaurant chain’s longest-standing brick and mortar location is going to close forever.

The Scottsdale, Ariz. Taco Bell is one of the oldest standing mission-style locations, which is basically a nod to the throwback Taco Bell cantina-style ones from way back in the day.

It was also running 24 hours a day when it closed up shop.

The New York Post reports that the location, better known to Taco Bell associates as “Store No. 31,” opened in the 1960s and served its last taco on April 12, 2025.

That’s right about the age of retirement here in America, so it only makes sense to let this tired horse rest. The store is closing to make way for a bigger, newer and better Taco Bell, directly across the street from where this one sits.

This location is so old school that it only had three tables to sit at inside if you wanted to dine in, and the drive-thru was quite a tight squeeze for cars. That must have been difficult back in the ’60s and ’70s, as cars were the size of small boats back then.

We headed over to Reddit, where Taco Bell has a strong presence among members.

One person says, “I decided to go through the drive-thru again a month ago for the first time in at least a decade after hearing it would be closing.”

Fans are also suggesting the building should be brought to the Smithsonian Museum to be forever cherished.

“Stopped by to eat in one last time today,” one more fan writes. “Couldn’t see this place close down without one last visit.”

R.I.P. Store No. 31.
-
WEWORKREMOTELY.COM

Techsource: Senior Customer Success Manager

We're a fast-growing startup on a mission to help companies save money, streamline vendor relationships, and optimize their technology spend. Our clients range from high-growth startups to mid-market companies, and we specialize in strategic negotiations, software license optimization, and vendor consolidation.

We are hiring our first Senior Customer Success Manager (CSM) to help scale customer relationships, drive value, and continuously improve internal processes. If you thrive in a startup environment, are open to wearing many hats, and enjoy working hands-on with a proactive team, we want to hear from you!

The Role
As the first Senior Customer Success Manager, you'll be at the heart of client delivery and vendor strategy. You’ll be responsible for managing and growing relationships with existing customers, ensuring they achieve measurable outcomes from our services. You'll also collaborate closely with vendors and internal teams to ensure our recommendations are not only cost-effective but strategically aligned with client goals. This is a high-impact, high-visibility role where you'll shape the customer experience, influence vendor strategy, and help build our CSM function from the ground up.

Key Responsibilities
Own and nurture post-sale relationships with a portfolio of clients
Serve as a strategic advisor, helping clients implement recommendations and capture savings
Track key performance indicators (KPIs), savings outcomes, and renewal milestones
Build and maintain relationships with key software vendors
Identify vendor trends and escalate risks or opportunities internally
Work with clients to audit, rationalize, and streamline their software tools
Provide insight into usage, license tiers, overlap, and consolidation opportunities
Stay current on SaaS trends and emerging technologies relevant to our client base
Partner with the internal team to deliver a seamless client experience
Help establish CSM playbooks, client reporting frameworks, and communication cadences
Contribute to the continuous improvement of our client success strategy

Requirements
5+ years of experience in customer success, account management, procurement consulting, or SaaS strategy
Proven track record of managing and growing client relationships with mid-market or enterprise customers
Deep understanding of software procurement and SaaS licensing
Exceptional communication skills
Analytical mindset with a knack for identifying savings opportunities and translating data into action
Comfortable in a startup or high-growth environment; self-starter with a builder mentality
Bonus: Experience working in procurement, finance, IT strategy, or vendor management

Benefits
Work directly with the founder and play a key role in shaping the future of the company
Help companies make smarter decisions and save meaningful dollars on tech spend
Flexible remote work environment with a tight-knit, mission-driven team
Competitive compensation and performance-based bonuses
Room to grow: this role can evolve into a leadership position as we scale
-
WWW.CNET.COM

Making Sense of Phone Tariffs: See How Much iPhone Prices Could Rise

Prices are likely to go up, even if the exact amount is unclear. But don't panic-buy if it means going into debt, experts say.
-
WWW.SCIENTIFICAMERICAN.COM

A Colossal Squid Has Been Filmed in the Deep Sea for the First Time

April 15, 2025

This Is the First Colossal Squid Filmed in the Deep Sea, and It’s a Baby!

A colossal squid was filmed for the first time in its natural habitat near the South Sandwich Islands during a recent expedition, and it turned out to be a baby

By Ashley Balzer Vigil, edited by Andrea Thompson

Image caption: This is the first confirmed live observation of the colossal squid (Mesonychoteuthis hamiltoni) in its natural habitat, taken by the remotely operated vehicle SuBastian on March 9 during an Ocean Census flagship expedition in the remote South Sandwich Islands in the South Atlantic Ocean. This squid is a baby about one foot long. Credit: ROV SuBastian / Schmidt Ocean Institute

A faintly fluttering specter, at first hardly visible among bits of marine snow falling in slow motion, emerged from the deep-sapphire void. The pilot of the underwater robot brought the creature to the center of the frame, giving scientists on a ship at the ocean’s surface a good view of the strange life-form. Its mostly transparent, speckled dome was topped with fins that busily flapped like tiny wings, and its tentacles were drawn up underneath it, toward its glowing red undercarriage.

There was little fanfare, just a few minutes of quiet, almost reverent observation. But the encounter, 100 years in the making, marked the first time a colossal squid (Mesonychoteuthis hamiltoni) had ever been caught on film in its natural habitat.

“This is one of the planet’s true giants, living in one of our most pristine marine ecosystems,” says Kat Bolstad, an associate professor at the Auckland University of Technology in New Zealand, who helped independently identify the creature from the footage. “It’s a source of fascination and wonder, and it also plays a huge role in Antarctic food webs.”

Scientists onboard the Schmidt Ocean Institute’s research vessel Falkor (too) saw the colossal squid about 2,000 feet beneath the surface near the remote, uninhabited South Sandwich Islands in the South Atlantic Ocean. These leviathans can grow to about 23 feet long and can weigh more than 1,000 pounds. That makes them shorter, but much stockier, than the giant squid (Architeuthis dux), which grows to about 43 feet long and 600 pounds. The colossal squid that was captured on film was just a baby, however, measuring only about a foot in length.

“We filmed it because it was beautiful and unusual, and then we kind of descended back all the way down to the seafloor to do the exploration that the rest of that dive was focused on,” the expedition’s chief scientist, Michelle Taylor of the University of Essex in England, said during a press conference. It wasn’t until a few days later, after the team heard from some glass squid experts, that the researchers fully realized the observation’s significance.

Though people have known about the existence of colossal squids for a century, the animals had mainly been found among the stomach contents of whales and seabirds, successfully evading human eyes in their natural habitat.

“Much of our scientific and filming gear is noisy and bright, so squid will be aware of our equipment long before we know they’re there, and they will stay well away,” Bolstad says. “The deep sea is a vast 3D space, and looking for specific animals there is tricky, especially when they are probably actively trying to avoid us!” Scientists still don’t know much about what these reticent creatures feed on, how long they live or what their reproductive traits are.

“To get footage of a juvenile is so wonderful,” said Aaron Evans, an independent glass squid expert, at the press conference. Scientists know colossal squid are born tiny, and some adult specimens are preserved in collections, but their time between those stages isn’t well understood. “So for us to see this kind of midrange size, in between a hatchling and an adult, is really exciting because it gives us the opportunity to fill in some of those missing puzzle pieces to the life history of this very mysterious and enigmatic animal.”

The sighting came alongside many other strange encounters, including a grenadier fish with parasitic pigtails, a piñatalike anemone, a carnivorous sponge that resembled a dandelion and Seussian corals. In a region so remote and underexplored, it’s no surprise that weird and wonderful discoveries emerge, and each new find offers valuable insight into a world that science is still just beginning to understand.

“Observing the colossal squid gives us the chance both to learn about this remote place,” Bolstad says, “and to share the excitement of such discoveries with people who may not think about the deep sea very often, even though it makes up 95 percent of the living space on Earth and plays an enormous role in regulating our climate.”

And hopefully one day the researchers will catch sight of a grown colossal squid. “Eventually, when we see the adults, we will get footage of very large ones,” Bolstad said at the press conference. “They will have impressive hooks; they’ll be big and muscly. There will be lots of monster hype about them. But in this case, we get to introduce the live colossal squid to the world as this beautiful, little, delicate animal that highlights the magnificence of a lot of deep-sea creatures.”