• Executives from Meta, OpenAI, and Palantir Commissioned Into The US Army Reserve

    Meta's CTO, Palantir's CTO, and OpenAI's chief product officer (along with OpenAI's former chief revenue officer) are being appointed as lieutenant colonels in America's Army Reserve, reports The Register.

    They've all signed up for Detachment 201: Executive Innovation Corps, "an effort to recruit senior tech executives to serve part-time in the Army Reserve as senior advisors," according to the official statement. "In this role they will work on targeted projects to help guide rapid and scalable tech solutions to complex problems..."

    "Our primary role will be to serve as technical experts advising the Army's modernization efforts,"said on X...
    As for OpenAI's involvement, the company has been building its ties with the military-technology complex for some years now. Like Meta, OpenAI is working with Anduril on military ideas and last year scandalized some by watering down its past commitment to developing non-military products only. The Army wasn't answering questions on Friday, but an article referenced by [OpenAI Chief Product Officer Kevin] Weil indicated that the four will have to serve a minimum of 120 hours a year, can work remotely, and won't have to pass basic training...

    "America wins when we unite the dynamism of American innovation with the military's vital missions,"Sankar said on X. "This was the key to our triumphs in the 20th century. It can help us win again. I'm humbled by this new opportunity to serve my country, my home, America."

    Read more of this story at Slashdot.
  • CIOs baffled by ‘buzzwords, hype and confusion’ around AI

    Technology leaders are baffled by a “cacophony” of “buzzwords, hype and confusion” over the benefits of artificial intelligence, according to the founder and CEO of technology company Pegasystems.
    Alan Trefler, who is known for his prowess at chess and ping pong, as well as running a $1.5bn turnover tech company, spends much of his time meeting clients, CIOs and business leaders.
    “I think CIOs are struggling to understand all of the buzzwords, hype and confusion that exists,” he said.
    “The words AI and agentic are being thrown around in this great cacophony and they don’t know what it means. I hear that constantly.”
    CIOs are under pressure from their CEOs, who are convinced AI will offer something valuable.
    “CIOs are really hungry for pragmatic and practical solutions, and in the absence of those, many of them are doing a lot of experimentation,” said Trefler.
    Companies are looking at large language models to summarise documents, or to help stimulate ideas for knowledge workers, or generate first drafts of reports – all of which will save time and make people more productive.

    But Trefler said companies are wary of letting AI loose on critical business applications, because it’s just too unpredictable and prone to hallucinations.
    “There is a lot of fear over handing things over to something that no one understands exactly how it works, and that is the absolute state of play when it comes to general AI models,” he said.
    Trefler is scathing about big tech companies that are pushing AI agents and large language models for business-critical applications. “I think they have taken an expedient but short-sighted path,” he said.
    “I believe the idea that you will turn over critical business operations to an agent, when those operations have to be predictable, reliable, precise and fair to clients … is something that is full of issues, not just in the short term, but structurally.”
    One of the problems is that generative AI models are extraordinarily sensitive to the data they are trained on and the construction of the prompts used to instruct them. A slight change in a prompt or in the training data can lead to a very different outcome.
    For example, a business banking application might learn its customer is a bit richer or a bit poorer than expected.
    “You could easily imagine the prompt deciding to change the interest rate charged, whether that was what the institution wanted or whether it would be legal according to the various regulations that lenders must comply with,” said Trefler.

    Trefler said Pega has taken a different approach to some other technology suppliers in the way it adds AI into business applications.
    Rather than using AI agents to solve problems in real time, Pega's AI agents do their thinking in advance.
    Business experts can use them to help co-design business processes covering anything from assessing a loan application to making an offer to a valued customer or sending out an invoice.
    Companies can still deploy AI chatbots and bots capable of answering queries on the phone. Their job is not to work out the solution from scratch for every enquiry, but to decide which is the right pre-written process to follow.
    As Trefler put it, design agents can create “dozens and dozens” of workflows to handle all the actions a company needs to take care of its customers.
    “You just use the natural language model for semantics to be able to handle the miracle of getting the language right, but tie that language to workflows, so that you have reliable, predictable, regulatory-approved ways to execute,” he said.
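    To make that pattern concrete, here is a minimal sketch of the approach Trefler describes: the language model handles only the semantics of the incoming request, while execution is routed to one of a set of pre-designed, deterministic workflows. The classifier is stubbed with keyword matching, and every name here (classify_intent, WORKFLOWS, the handlers) is illustrative rather than a Pega API.

```python
# Illustrative sketch only: route a request to a pre-designed workflow.
# None of these names are Pega APIs.
from typing import Callable

def assess_loan(request: dict) -> str:
    # Deterministic, auditable business logic lives here, not in the model.
    return "loan: approved" if request.get("income", 0) >= 50_000 else "loan: referred to underwriter"

def send_invoice(request: dict) -> str:
    return f"invoice sent to customer {request.get('customer_id')}"

# Workflows designed and approved in advance, at design time.
WORKFLOWS: dict[str, Callable[[dict], str]] = {
    "loan_application": assess_loan,
    "invoice_request": send_invoice,
}

def classify_intent(message: str) -> str:
    """Stand-in for the semantic step an LLM would perform at run time."""
    return "loan_application" if "loan" in message.lower() else "invoice_request"

def handle(message: str, request: dict) -> str:
    intent = classify_intent(message)   # language understanding only
    return WORKFLOWS[intent](request)   # predictable, pre-written execution

print(handle("I'd like to apply for a loan", {"income": 72_000}))
```

    The point of the design is that the only job given to the model at run time is choosing a workflow; the business rules themselves stay fixed, reviewable and cheap to execute, which is the distinction Trefler draws between design-time and run-time reasoning.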

    Large language models (LLMs) are not always the right solution. Trefler demonstrated how ChatGPT 4.0 tried and failed to solve a chess puzzle. The LLM repeatedly suggested impossible or illegal moves, despite Trefler’s corrections. On the other hand, another AI tool, Stockfish, a dedicated chess engine, solved the problem instantly.
    The other drawback with LLMs is that they consume vast amounts of energy. That means if AI agents are reasoning during “run time”, they are going to consume hundreds of times more electricity than an AI agent that simply selects from pre-determined workflows, said Trefler.
    “ChatGPT is inherently, enormously consumptive … as it’s answering your question, it’s firing literally hundreds of millions to trillions of nodes,” he said. “All of that takes [large quantities of] electricity.”
    Using an employee pay claim as an example, Trefler said a better alternative is to generate, say, 30 alternative workflows to cover the major variations found in a pay claim.
    That gives you “real specificity and real efficiency”, he said. “And it’s a very different approach to turning a process over to a machine with a prompt and letting the machine reason it through every single time.”
    “If you go down the philosophy of using a graphics processing unit [GPU] to do the creation of a workflow and a workflow engine to execute the workflow, the workflow engine takes a 200th of the electricity because there is no reasoning,” said Trefler.
    He is clear that the growing use of AI will have a profound effect on the jobs market, and that whole categories of jobs will disappear.
    The need for translators, for example, is likely to dry up by 2027 as AI systems become better at translating spoken and written language. Google’s real-time translator is already “frighteningly good” and improving.
    Pega now plans to work more closely with its network of system integrators, including Accenture and Cognizant, to deliver AI services to businesses.

    An initiative launched last week will allow system integrators to incorporate their own best practices and tools into Pega’s rapid workflow development tools. The move will mean Pega’s technology reaches a wider range of businesses.
    Under the programme, known as Powered by Pega Blueprint, system integrators will be able to deploy customised versions of Blueprint.
    They can use the tool to reverse-engineer ageing applications and replace them with modern AI workflows that can run on Pega’s cloud-based platform.
    “The idea is that we are looking to make this Blueprint Agent design approach available not just through us, but through a bunch of major partners supplemented with their own intellectual property,” said Trefler.
    That represents a major expansion for Pega, which has largely concentrated on supplying technology to several hundred clients, representing the top Fortune 500 companies.
    “We have never done something like this before, and I think that is going to lead to a massive shift in how this technology can go out to market,” he added.

    When AI agents behave in unexpected ways
    Iris is incredibly smart, diligent and a delight to work with. If you ask her, she will tell you she is an intern at Pegasystems, and that she lives in a lighthouse on the island of Texel, north of the Netherlands. She is, of course, an AI agent.
    When one executive at Pega emailed Iris and asked her to write a proposal for a financial services company based on his notes and internet research, Iris got to work.
    Some time later, the executive received a phone call from the company. “‘Listen, we got a proposal from Pega,’” recalled Rob Walker, vice-president at Pega, speaking at the Pegaworld conference last week. “‘It’s a good proposal, but it seems to be signed by one of your interns, and in her signature, it says she lives in a lighthouse.’ That taught us early on that agents like Iris need a safety harness.”
    The developers banned Iris from sending an email to anyone other than the person who sent the original request.
    Then Pega’s ethics department sent Iris a potentially abusive email from a Pega employee to test her response.
    Iris reasoned that the email was either a joke, abusive, or that the employee was under distress, said Walker.
    She considered forwarding the email to the employee’s manager or to HR. But both of these options were now blocked by her developers. “So what does she do? She sent an out of office,” he said. “Conflict avoidance, right? So human, but very creative.”
  • Meta’s $15 Billion Scale AI Deal Could Leave Gig Workers Behind

    Meta is reportedly set to invest $15 billion to acquire a 49% stake in Scale AI, in a deal that would make Scale CEO Alexandr Wang head of the tech giant’s new AI unit dedicated to pursuing “superintelligence.” Scale AI, founded in 2016, is a leading data annotation firm that hires workers around the world to label or create the data that is used to train AI systems.

    The deal is expected to greatly enrich Wang and many of his colleagues with equity in Scale AI; Wang, already a billionaire, would see his wealth grow even further. For Meta, it would breathe new life into the company’s flagging attempts to compete at the “frontier” of AI against OpenAI, Google, and Anthropic.

    However, Scale’s contract workers, many of whom earn just dollars per day via a subsidiary called RemoTasks, are unlikely to benefit at all from the deal, according to sociologists who study the sector. Typically data workers are not formally employed, and are instead paid for the tasks they complete. Those tasks can include labeling the contents of images, answering questions, or rating which of two chatbots’ answers are better, in order to teach AI systems to better comply with human preferences. (TIME has a content partnership with Scale AI.)

    “I expect few if any Scale annotators will see any upside at all,” says Callum Cant, a senior lecturer at the University of Essex, U.K., who studies gig work platforms. “It would be very surprising to see some kind of feed-through. Most of these people don’t have a stake in ownership of the company.”

    Many of those workers already suffer from low pay and poor working conditions. In a recent report by Oxford University’s Internet Institute, the Scale subsidiary RemoTasks failed to meet basic standards for fair pay, fair contracts, fair management, and fair worker representation.

    “A key part of Scale’s value lies in its data work services performed by hundreds of thousands of underpaid and poorly protected workers,” says Jonas Valente, an Oxford researcher who worked on the report. “The company remains far from safeguarding basic standards of fair work, despite limited efforts to improve its practices.”

    The Meta deal is unlikely to change that. “Unfortunately, the increasing profits of many digital labor platforms and their primary companies, such as the case of Scale, do not translate into better conditions for [workers],” Valente says.

    A Scale AI spokesperson declined to comment for this story. “We're proud of the flexible earning opportunities offered through our platforms,” the company said in a statement to TechCrunch in May.

    Meta’s investment also calls into question whether Scale AI will continue supplying data to OpenAI and Google, two of its major clients. In the increasingly competitive AI landscape, observers say Meta may see value in cutting off its rivals from annotated data — an essential means of making AI systems smarter.

    “By buying up access to Scale AI, could Meta deny access to that platform and that avenue for data annotation by other competitors?” says Cant. “It depends entirely on Meta’s strategy.” If that were to happen, Cant says, it could put downward pressure on the wages and tasks available to workers, many of whom already struggle to make ends meet with data work.

    A Meta spokesperson declined to comment on this story.
  • Do these nine things to protect yourself against hackers and scammers

    Scammers are using AI tools to create increasingly convincing ways to trick victims into sending money, and to access the personal information needed to commit identity theft. Deepfakes mean they can impersonate the voice of a friend or family member, and even fake a video call with them!
    The result can be criminals taking out thousands of dollars worth of loans or credit card debt in your name. Fortunately there are steps you can take to protect yourself against even the most sophisticated scams. Here are the security and privacy checks to run to ensure you are safe …

    9to5Mac is brought to you by Incogni: Protect your personal info from prying eyes. With Incogni, you can scrub your deeply sensitive information from data brokers across the web, including people search sites. Incogni keeps your phone number, address, email, SSN, and more from circulating. Fight back against unwanted data brokers with a 30-day money back guarantee.

    Use a password manager
    At one time, the advice might have read “use strong, unique passwords for each website and app you use” – but these days we all use so many that this is only possible if we use a password manager.
    This is a super-easy step to take, thanks to the Passwords app on Apple devices. Each time you register for a new service, use the Passwords app (or your preferred password manager) to set and store the password.
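    For illustration, the kind of strong, unique password a manager creates for each account can be generated in a few lines of Python using the standard library's secrets module (a sketch only; the Passwords app and other managers do this for you):

```python
# Sketch: generate a strong, random password, as a password manager would.
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 20) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(generate_password())  # a unique 20-character password per account
```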
    Replace older passwords
    You probably created some accounts back in the days when password rules were much less strict, meaning you now have some weak passwords that are vulnerable to attack. If you’ve been online since before the days of password managers, you probably even have some passwords you’ve used on more than one website. This is a huge risk, as it means your security is only as good as the least-secure website you use.
    What happens is attackers break into a poorly-secured website, grab all the logins, then they use automated software to try those same logins on hundreds of different websites. If you’ve re-used a password, they now have access to your accounts on all the sites where you used it.
    Use the password change feature to update your older passwords, starting with the most important ones – the ones that would put you most at risk if your account were compromised. As an absolute minimum, ensure you have strong, unique passwords for all financial services, as well as other critical ones like Apple, Google, and Amazon accounts.
    Make sure you include any accounts which have already been compromised! You can identify these by putting your email address into Have I Been Pwned.
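    Have I Been Pwned also offers a Pwned Passwords range endpoint that lets you check an individual password against known breaches without sending it (or even its full hash) anywhere: you send only the first five characters of the SHA-1 hash and compare the returned suffixes locally. A minimal Python sketch, assuming the publicly documented endpoint:

```python
# Sketch: query the Have I Been Pwned "Pwned Passwords" range endpoint.
# Only the first five hex characters of the SHA-1 hash are sent; the matching
# suffix (if any) is found locally.
import hashlib
import urllib.request

def pwned_count(password: str) -> int:
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    url = f"https://api.pwnedpasswords.com/range/{prefix}"
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
    return 0

print(pwned_count("password123"))  # any non-zero count means: replace it now
```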
    Use passkeys where possible
    Passwords are gradually being replaced by passkeys. While the difference might seem small in terms of how you login, there’s a huge difference in the security they provide.
    With a passkey, a website or app doesn’t ask for a password, it instead asks your device to verify your identity. Your device uses Face ID or Touch ID to do so, then confirms that you are who you claim to be. Crucially, it doesn’t send a password back to the service, so there’s no way for this to be hacked – all the service sees is confirmation that you successfully passed biometric authentication on your device.
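    Conceptually (this is a simplified illustration, not the actual WebAuthn/passkey protocol), the exchange is a public-key signature over a fresh challenge: the device keeps the private key, the service stores only the public key, and nothing reusable ever crosses the wire. A sketch using the third-party cryptography package:

```python
# Conceptual sketch of the public-key idea behind passkeys (NOT the real
# WebAuthn protocol). Requires the third-party 'cryptography' package.
import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Registration: the device creates a key pair; the service stores only the public key.
device_private_key = Ed25519PrivateKey.generate()
service_public_key = device_private_key.public_key()

# Login: the service sends a fresh random challenge; after Face ID / Touch ID,
# the device signs it. No password or reusable secret is transmitted.
challenge = os.urandom(32)
signature = device_private_key.sign(challenge)

try:
    service_public_key.verify(signature, challenge)
    print("authenticated")
except InvalidSignature:
    print("rejected")
```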
    Use two-factor authentication
    A growing number of accounts allow you to use two-factor authentication (2FA). This means that even if an attacker got your login details, they still wouldn’t be able to access your account.
    2FA works by demanding a rolling code whenever you login. These can be sent by text message, but we strongly advise against this, as it leaves you vulnerable to SIM-swap attacks, which are becoming increasingly common. In particular, never use text-based 2FA for financial services accounts.
    Instead, select the option to use an authenticator app. A QR code will be displayed which you scan in the app, adding that service to your device. Next time you login, you just open the app to see a 6-digit rolling code which you’ll need to enter to login. This feature is built into the Passwords app, or you can use a separate one like Google Authenticator.
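    Those six-digit rolling codes are standard TOTP values (RFC 6238): an HMAC over the current 30-second time step, truncated to six digits. A self-contained sketch using only the Python standard library, with a made-up base32 secret standing in for the one encoded in the QR code:

```python
# Sketch: compute a 6-digit TOTP code (RFC 6238) with the standard library,
# the same kind of rolling code an authenticator app shows every 30 seconds.
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, period: int = 30, digits: int = 6) -> str:
    key = base64.b32decode(secret_b32.upper())
    counter = int(time.time()) // period              # current 30-second step
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                        # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))  # made-up example secret, not a real account
```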
    Check last-login details
    Some services, like banking apps, will display the date and time of your last successful login. Get into the habit of checking this each time you login, as it can provide a warning that your account has been compromised.
    Use a VPN service for public Wi-Fi hotspots
    Anytime you use a public Wi-Fi hotspot, you are at risk from what’s known as a Man-in-the-Middle (MitM) attack. This is where someone sets up a small device that broadcasts the same name as a public Wi-Fi hotspot so that people connect to it. Once you do, they can monitor your internet traffic.
    Almost all modern websites use HTTPS, which provides an encrypted connection that makes MitM attacks less dangerous than they used to be. All the same, the exploit can expose you to a number of security and privacy risks, so using a VPN is still highly advisable. Always choose a respected VPN company, ideally one which keeps no logs and subjects itself to independent audits. I use NordVPN for this reason.
    Don’t disclose personal info to AI chatbots
    AI chatbots typically use their conversations with users as training material, meaning anything you say or type could end up in their database, and could potentially be regurgitated when answering another user’s question. Never reveal any personal information you wouldn’t want on the internet.
    Consider data removal
    It’s likely that much of your personal information has already been collected by data brokers. Your email address and phone number can be used for spam, which is annoying enough, but they can also be used by scammers. For this reason, you might want to scrub your data from as many broker services as possible. You can do this yourself, or use a service like Incogni to do it for you.
    Triple-check requests for money
    Finally, if anyone asks you to send them money, be immediately on the alert. Even if it seems to be a friend, family member, or your boss, never take it on trust. Always contact them via a different, known communication channel. If they emailed you, phone them. If they phoned you, message or email them. Some people go as far as agreeing codewords with family members to use if they ever really do need emergency help.
    If anyone asks you to buy gift cards and send the numbers to them, it’s a scam 100% of the time. Requests to use money transfer services are also generally scams unless it’s something you arranged in advance.
    Even if you are expecting to send someone money, be alert for claims that they have changed their bank account. This is almost always a scam. Again, contact them via a different, known comms channel.
    Photo by Christina @ wocintechchat.com on Unsplash

    Add 9to5Mac to your Google News feed. 

    FTC: We use income earning auto affiliate links. More.You’re reading 9to5Mac — experts who break news about Apple and its surrounding ecosystem, day after day. Be sure to check out our homepage for all the latest news, and follow 9to5Mac on Twitter, Facebook, and LinkedIn to stay in the loop. Don’t know where to start? Check out our exclusive stories, reviews, how-tos, and subscribe to our YouTube channel
    #these #nine #things #protect #yourself
    Do these nine things to protect yourself against hackers and scammers
    Scammers are using AI tools to create increasingly convincing ways to trick victims into sending money, and to access the personal information needed to commit identity theft. Deepfakes mean they can impersonate the voice of a friend or family member, and even fake a video call with them! The result can be criminals taking out thousands of dollars worth of loans or credit card debt in your name. Fortunately there are steps you can take to protect yourself against even the most sophisticated scams. Here are the security and privacy checks to run to ensure you are safe … 9to5Mac is brought to by Incogni: Protect your personal info from prying eyes. With Incogni, you can scrub your deeply sensitive information from data brokers across the web, including people search sites. Incogni limits your phone number, address, email, SSN, and more from circulating. Fight back against unwanted data brokers with a 30-day money back guarantee. Use a password manager At one time, the advice might have read “use strong, unique passwords for each website and app you use” – but these days we all use so many that this is only possible if we use a password manager. This is a super-easy step to take, thanks to the Passwords app on Apple devices. Each time you register for a new service, use the Passwords appto set and store the password. Replace older passwords You probably created some accounts back in the days when password rules were much less strict, meaning you now have some weak passwords that are vulnerable to attack. If you’ve been online since before the days of password managers, you probably even some passwords you’ve used on more than one website. This is a huge risk, as it means your security is only as good as the least-secure website you use. What happens is attackers break into a poorly-secured website, grab all the logins, then they use automated software to try those same logins on hundreds of different websites. If you’ve re-used a password, they now have access to your accounts on all the sites where you used it. Use the password change feature to update your older passwords, starting with the most important ones – the ones that would put you most at risk if your account where compromised. As an absolute minimum, ensure you have strong, unique passwords for all financial services, as well as other critical ones like Apple, Google, and Amazon accounts. Make sure you include any accounts which have already been compromised! You can identify these by putting your email address into Have I Been Pwned. Use passkeys where possible Passwords are gradually being replaced by passkeys. While the difference might seem small in terms of how you login, there’s a huge difference in the security they provide. With a passkey, a website or app doesn’t ask for a password, it instead asks your device to verify your identity. Your device uses Face ID or Touch ID to do so, then confirms that you are who you claim to be. Crucially, it doesn’t send a password back to the service, so there’s no way for this to be hacked – all the service sees is confirmation that you successfully passed biometric authentication on your device. Use two-factor authentication A growing number of accounts allow you to use two-factor authentication. This means that even if an attacker got your login details, they still wouldn’t be able to access your account. 2FA works by demanding a rolling code whenever you login. 
These can be sent by text message, but we strongly advise against this, as it leaves you vulnerable to SIM-swap attacks, which are becoming increasingly common. In particular, never use text-based 2FA for financial services accounts. Instead, select the option to use an authenticator app. A QR code will be displayed which you scan in the app, adding that service to your device. Next time you login, you just open the app to see a 6-digit rolling code which you’ll need to enter to login. This feature is built into the Passwords app, or you can use a separate one like Google Authenticator. Check last-login details Some services, like banking apps, will display the date and time of your last successful login. Get into the habit of checking this each time you login, as it can provide a warning that your account has been compromised. Use a VPN service for public Wi-Fi hotspots Anytime you use a public Wi-Fi hotspot, you are at risk from what’s known as a Man-in-the-Middleattack. This is where someone uses a small device which uses the same name as a public Wi-Fi hotspot so that people connect to it. Once you do, they can monitor your internet traffic. Almost all modern websites use HTTPS, which provides an encrypted connection that makes MitM attacks less dangerous than they used to be. All the same, the exploit can expose you to a number of security and privacy risks, so using a VPN is still highly advisable. Always choose a respected VPN company, ideally one which keeps no logs and subjects itself to independent audits. I use NordVPN for this reason. Don’t disclose personal info to AI chatbots AI chatbots typically use their conversations with users as training material, meaning anything you say or type could end up in their database, and could potentially be regurgitated when answering another user’s question. Never reveal any personal information you wouldn’t want on the internet. Consider data removal It’s likely that much of your personal information has already been collected by data brokers. Your email address and phone number can be used for spam, which is annoying enough, but they can also be used by scammers. For this reason, you might want to scrub your data from as many broker services as possible. You can do this yourself, or use a service like Incogni to do it for you. Triple-check requests for money Finally, if anyone asks you to send them money, be immediately on the alert. Even if seems to be a friend, family member, or your boss, never take it on trust. Always contact them via a different, known communication channel. If they emailed you, phone them. If they phoned you, message or email them. Some people go as far as agreeing codewords with family members to use if they ever really do need emergency help. If anyone asks you to buy gift cards and send the numbers to them, it’s a scam 100% of the time. Requests to use money transfer services are also generally scams unless it’s something you arranged in advance. Even if you are expecting to send someone money, be alert for claims that they have changed their bank account. This is almost always a scam. Again, contact them via a different, known comms channel. Photo by Christina @ wocintechchat.com on Unsplash Add 9to5Mac to your Google News feed.  FTC: We use income earning auto affiliate links. More.You’re reading 9to5Mac — experts who break news about Apple and its surrounding ecosystem, day after day. 
Be sure to check out our homepage for all the latest news, and follow 9to5Mac on Twitter, Facebook, and LinkedIn to stay in the loop. Don’t know where to start? Check out our exclusive stories, reviews, how-tos, and subscribe to our YouTube channel #these #nine #things #protect #yourself
    9TO5MAC.COM
    Do these nine things to protect yourself against hackers and scammers
    Scammers are using AI tools to create increasingly convincing ways to trick victims into sending money, and to access the personal information needed to commit identity theft. Deepfakes mean they can impersonate the voice of a friend or family member, and even fake a video call with them! The result can be criminals taking out thousands of dollars worth of loans or credit card debt in your name. Fortunately there are steps you can take to protect yourself against even the most sophisticated scams. Here are the security and privacy checks to run to ensure you are safe … 9to5Mac is brought to by Incogni: Protect your personal info from prying eyes. With Incogni, you can scrub your deeply sensitive information from data brokers across the web, including people search sites. Incogni limits your phone number, address, email, SSN, and more from circulating. Fight back against unwanted data brokers with a 30-day money back guarantee. Use a password manager At one time, the advice might have read “use strong, unique passwords for each website and app you use” – but these days we all use so many that this is only possible if we use a password manager. This is a super-easy step to take, thanks to the Passwords app on Apple devices. Each time you register for a new service, use the Passwords app (or your own preferred password manager) to set and store the password. Replace older passwords You probably created some accounts back in the days when password rules were much less strict, meaning you now have some weak passwords that are vulnerable to attack. If you’ve been online since before the days of password managers, you probably even some passwords you’ve used on more than one website. This is a huge risk, as it means your security is only as good as the least-secure website you use. What happens is attackers break into a poorly-secured website, grab all the logins, then they use automated software to try those same logins on hundreds of different websites. If you’ve re-used a password, they now have access to your accounts on all the sites where you used it. Use the password change feature to update your older passwords, starting with the most important ones – the ones that would put you most at risk if your account where compromised. As an absolute minimum, ensure you have strong, unique passwords for all financial services, as well as other critical ones like Apple, Google, and Amazon accounts. Make sure you include any accounts which have already been compromised! You can identify these by putting your email address into Have I Been Pwned. Use passkeys where possible Passwords are gradually being replaced by passkeys. While the difference might seem small in terms of how you login, there’s a huge difference in the security they provide. With a passkey, a website or app doesn’t ask for a password, it instead asks your device to verify your identity. Your device uses Face ID or Touch ID to do so, then confirms that you are who you claim to be. Crucially, it doesn’t send a password back to the service, so there’s no way for this to be hacked – all the service sees is confirmation that you successfully passed biometric authentication on your device. Use two-factor authentication A growing number of accounts allow you to use two-factor authentication (2FA). This means that even if an attacker got your login details, they still wouldn’t be able to access your account. 2FA works by demanding a rolling code whenever you login. 
Check last-login details
    Some services, like banking apps, will display the date and time of your last successful login. Get into the habit of checking this each time you log in, as it can provide a warning that your account has been compromised.
    Use a VPN service for public Wi-Fi hotspots
    Any time you use a public Wi-Fi hotspot, you are at risk from what's known as a man-in-the-middle (MitM) attack. This is where someone uses a small device which broadcasts the same name as a public Wi-Fi hotspot so that people connect to it. Once you do, they can monitor your internet traffic. Almost all modern websites use HTTPS, which provides an encrypted connection that makes MitM attacks less dangerous than they used to be. All the same, the exploit can expose you to a number of security and privacy risks, so using a VPN is still highly advisable. Always choose a respected VPN company, ideally one which keeps no logs and subjects itself to independent audits. I use NordVPN for this reason.
    Don't disclose personal info to AI chatbots
    AI chatbots typically use their conversations with users as training material, meaning anything you say or type could end up in their database, and could potentially be regurgitated when answering another user's question. Never reveal any personal information you wouldn't want on the internet.
    Consider data removal
    It's likely that much of your personal information has already been collected by data brokers. Your email address and phone number can be used for spam, which is annoying enough, but they can also be used by scammers. For this reason, you might want to scrub your data from as many broker services as possible. You can do this yourself, or use a service like Incogni to do it for you.
    Triple-check requests for money
    Finally, if anyone asks you to send them money, be immediately on the alert. Even if it seems to be a friend, family member, or your boss, never take it on trust. Always contact them via a different, known communication channel. If they emailed you, phone them. If they phoned you, message or email them. Some people go as far as agreeing codewords with family members to use if they ever really do need emergency help. If anyone asks you to buy gift cards and send the numbers to them, it's a scam 100% of the time. Requests to use money transfer services are also generally scams unless it's something you arranged in advance. Even if you are expecting to send someone money, be alert for claims that they have changed their bank account. This is almost always a scam. Again, contact them via a different, known comms channel.
  • OThink-R1: A Dual-Mode Reasoning Framework to Cut Redundant Computation in LLMs

    The Inefficiency of Static Chain-of-Thought Reasoning in LRMs
Recent large reasoning models (LRMs) achieve top performance by using detailed chain-of-thought (CoT) reasoning to solve complex tasks. However, many of the simple tasks they handle could be solved by smaller models with fewer tokens, making such elaborate reasoning unnecessary. This echoes human thinking: we use fast, intuitive responses for easy problems and slower, analytical thinking for complex ones. While LRMs mimic slow, logical reasoning, they generate significantly longer outputs, thereby increasing computational cost. Current methods for reducing reasoning steps lack flexibility, limiting models to a single fixed reasoning style. There is a growing need for adaptive reasoning that adjusts effort according to task difficulty.
    Limitations of Existing Training-Based and Training-Free Approaches
    Recent research on improving reasoning efficiency in LRMs can be categorized into two main areas: training-based and training-free methods. Training strategies often use reinforcement learning or fine-tuning to limit token usage or adjust reasoning depth, but they tend to follow fixed patterns without flexibility. Training-free approaches utilize prompt engineering or pattern detection to shorten outputs during inference; however, they also lack adaptability. More recent work focuses on variable-length reasoning, where models adjust reasoning depth based on task complexity. Others study “overthinking,” where models over-reason unnecessarily. However, few methods enable dynamic switching between quick and thorough reasoning—something this paper addresses directly. 
    Introducing OThink-R1: Dynamic Fast/Slow Reasoning Framework
Researchers from Zhejiang University and OPPO have developed OThink-R1, a new approach that enables LRMs to switch between fast and slow thinking, much as humans do. By analyzing reasoning patterns, they identified which steps are essential and which are redundant. With the help of another model acting as a judge, they trained LRMs to adapt their reasoning style based on task complexity. Their method reduces unnecessary reasoning by over 23% without losing accuracy. Using a dual-reference loss function and curated fine-tuning datasets, OThink-R1 outperforms previous models in both efficiency and performance on various math and question-answering tasks.
    System Architecture: Reasoning Pruning and Dual-Reference Optimization
    The OThink-R1 framework helps LRMs dynamically switch between fast and slow thinking. First, it identifies when LRMs include unnecessary reasoning, like overexplaining or double-checking, versus when detailed steps are truly essential. Using this, it builds a curated training dataset by pruning redundant reasoning and retaining valuable logic. Then, during fine-tuning, a special loss function balances both reasoning styles. This dual-reference loss compares the model’s outputs with both fast and slow thinking variants, encouraging flexibility. As a result, OThink-R1 can adaptively choose the most efficient reasoning path for each problem while preserving accuracy and logical depth. 
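    To make the dual-reference idea concrete, here is a minimal PyTorch-style sketch of how such a loss could be assembled. It is an illustration based on the description above, not the authors' implementation; the function name, the mixing weight alpha, and the way the fast and slow references are obtained are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_reference_kl(policy_logits: torch.Tensor,
                      fast_ref_logits: torch.Tensor,
                      slow_ref_logits: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Pull the fine-tuned model toward both a fast (pruned) and a slow (full-CoT)
    reference distribution over next tokens. Shapes: (batch, seq_len, vocab)."""
    log_p = F.log_softmax(policy_logits, dim=-1)    # model being trained
    q_fast = F.softmax(fast_ref_logits, dim=-1)     # fast-thinking reference
    q_slow = F.softmax(slow_ref_logits, dim=-1)     # slow-thinking reference
    kl_fast = F.kl_div(log_p, q_fast, reduction="batchmean")
    kl_slow = F.kl_div(log_p, q_slow, reduction="batchmean")
    return alpha * kl_fast + (1.0 - alpha) * kl_slow

# During fine-tuning on the pruned dataset, a term like this would be added to the
# usual cross-entropy objective, e.g. loss = ce + beta * dual_reference_kl(...)
```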

    Empirical Evaluation and Comparative Performance
    The OThink-R1 model was tested on simpler QA and math tasks to evaluate its ability to switch between fast and slow reasoning. Using datasets like OpenBookQA, CommonsenseQA, ASDIV, and GSM8K, the model demonstrated strong performance, generating fewer tokens while maintaining or improving accuracy. Compared to baselines such as NoThinking and DualFormer, OThink-R1 demonstrated a better balance between efficiency and effectiveness. Ablation studies confirmed the importance of pruning, KL constraints, and LLM-Judge in achieving optimal results. A case study illustrated that unnecessary reasoning can lead to overthinking and reduced accuracy, highlighting OThink-R1’s strength in adaptive reasoning. 

    Conclusion: Towards Scalable and Efficient Hybrid Reasoning Systems
    In conclusion, OThink-R1 is a large reasoning model that adaptively switches between fast and slow thinking modes to improve both efficiency and performance. It addresses the issue of unnecessarily complex reasoning in large models by analyzing and classifying reasoning steps as either essential or redundant. By pruning the redundant ones while maintaining logical accuracy, OThink-R1 reduces unnecessary computation. It also introduces a dual-reference KL-divergence loss to strengthen hybrid reasoning. Tested on math and QA tasks, it cuts down reasoning redundancy by 23% without sacrificing accuracy, showing promise for building more adaptive, scalable, and efficient AI reasoning systems in the future. 

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
    #othinkr1 #dualmode #reasoning #framework #cut
  • Watch Out for Malicious Unsubscribe Links

In addition to the flood of spam texts you receive on a daily basis, your email inbox is likely filled with newsletters, promotions, and other messages that you don't care to read and perhaps don't know why you receive. But you shouldn't just start clicking unsubscribe links, which may open you up to certain cybersecurity risks.
    Email unsubscribe links may be malicious
    While email unsubscribe links may seem innocuous, especially if you generally trust the sender, security experts say there are a number of ways in which threat actors can leverage these links for malicious purposes. Like responding to a spam text or answering a spam call, clicking "unsubscribe" confirms that your email address is active, giving cybercriminals an incentive to keep targeting you. In some cases, unsubscribe links can be hijacked to send users to phishing websites, where you are asked to enter your login credentials to complete the process. According to the folks at DNSFilter, one in every 644 clicks of email unsubscribe links can land you on a malicious website. While you do have to confirm your email address in some legitimate cases, you shouldn't enter a password; a request to do so is likely a scam. Bottom line: if you don't trust the sender, you certainly shouldn't trust any links contained within the email.
    How to safely unsubscribe from emails
    Even if unsubscribe links are safe, it's a pain to go through the multi-step process of clicking through individual emails and opening new browser windows to confirm. To minimize hassle and avoid the risk of malicious links in individual emails, you can use the unsubscribe features built into your email client, which are less likely to be compromised by threat actors because they aren't tied to the email itself. In Gmail, tap More > Manage subscriptions in your left-hand navigation bar (Menu > Manage subscriptions on mobile) and scroll to the sender. Click Unsubscribe to the right of the number of emails sent recently. You can also unsubscribe from individual emails by opening the message and clicking Unsubscribe next to the sender's name. In some cases, you may be directed to the sender's website to complete the process. (Note that Gmail may not consider all email campaigns eligible for one-click unsubscribe.) You can also mark the message as spam or block the sender. In Outlook, go to Settings > Mail > Subscriptions > Your current subscriptions and select Unsubscribe, then tap OK. Alternatively, you can block the sender by clicking the three dots and selecting Block > OK. You can also filter unwanted emails to a different folder (including spam), so while you'll still receive them, they won't clog up your main inbox. In Gmail, open the message, then click More > Filter messages like these to set up filter criteria, whether that's sending to another folder, deleting it, or marking it as spam. You can create similar rules in Outlook by right-clicking the message in your message list and going to Rules > Create rule. A final option is to use a disposable email alias to subscribe to newsletters and promotional emails or when signing up for accounts, which makes it easy to filter messages or delete the address entirely without affecting your main inbox.
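    The client-level unsubscribe options described above work because legitimate senders advertise an unsubscribe mechanism in the message headers (List-Unsubscribe, RFC 2369, plus the one-click variant in RFC 8058), which the mail provider can act on without you touching any link in the message body. A small Python sketch, using a made-up message, shows what those headers look like:

```python
from email import message_from_string
from email.policy import default

# A made-up newsletter message; only the headers matter here.
raw = """From: news@example.com
To: you@example.com
Subject: Weekly deals
List-Unsubscribe: <https://example.com/unsub?id=123>, <mailto:unsub@example.com>
List-Unsubscribe-Post: List-Unsubscribe=One-Click

(body omitted)
"""

msg = message_from_string(raw, policy=default)
targets = [t.strip(" <>") for t in msg.get("List-Unsubscribe", "").split(",")]
print(targets)
# ['https://example.com/unsub?id=123', 'mailto:unsub@example.com']
```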
    #watch #out #malicious #unsubscribe #links
  • BenchmarkQED: Automated benchmarking of RAG systems

    One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics. 
    To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub. It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.  
    BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks. 
    In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.
    In the paper, we distinguish between local queries, where answers are found in a small number of text regions, and sometimes even a single region, and global queries, which require reasoning over large portions of or even the entire dataset. 
    Conventional vector-based RAG excels at local queries because the regions containing the answer to the query resemble the query itself and can be retrieved as the nearest neighbor in the vector space of text embeddings. However, it struggles with global questions, such as, “What are the main themes of the dataset?” which require understanding dataset qualities not explicitly stated in the text.  
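    As a concrete illustration of why local queries suit vector-based RAG, here is a minimal nearest-neighbor retrieval sketch. It is illustrative only: the embedding function is a placeholder (random vectors make the ranking arbitrary here; a real embedding model makes nearest neighbors semantically meaningful), and the corpus is made up.

```python
import numpy as np

def embed(texts):
    # Placeholder for a real text-embedding model; deterministic random
    # vectors are enough to show the retrieval mechanics.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

corpus = [
    "Article chunk about hospital staffing shortages.",
    "Article chunk about a new vaccine rollout.",
    "Article chunk about insurance premium increases.",
]
doc_vecs = embed(corpus)

def retrieve(query, k=2):
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [corpus[i] for i in np.argsort(-sims)[:k]]  # top-k by cosine similarity

print(retrieve("What is happening with vaccines?"))
```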
    AutoQ: Automated query synthesis
This limitation motivated the development of GraphRAG, a system designed to answer global queries. GraphRAG's evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.
    AutoQ extends this approach by generating synthetic queries across the spectrum of queries, from local to global. It defines four distinct classes based on the source and scope of the query (Figure 1, top), forming a logical progression along the spectrum (Figure 1, bottom).
    Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum. 
    AutoQ can be configured to generate any number and distribution of synthetic queries along these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.
    Figure 2. Synthesis process and example query for each of the four AutoQ query classes. 
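    The 2×2 design space can be read as a small enumeration: query source (data-driven vs. activity-driven) crossed with query scope (local vs. global). The sketch below shows how a target query mix might be specified. Note the assumptions: only DataLocal and ActivityLocal are named later in this post, so the other two class names are inferred from the design, and the counts and dictionary format are purely illustrative, not BenchmarkQED's actual configuration.

```python
from itertools import product

sources = ["Data", "Activity"]   # where the query originates
scopes = ["Local", "Global"]     # how much of the dataset it spans
query_classes = [f"{src}{scope}" for src, scope in product(sources, scopes)]
# ['DataLocal', 'DataGlobal', 'ActivityLocal', 'ActivityGlobal']

# Example target mix: 50 synthetic queries per class, matching the evaluation below.
target_mix = {cls: 50 for cls in query_classes}
print(query_classes, sum(target_mix.values()))  # 200 queries total
```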

    AutoE: Automated evaluation framework 
    Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation:

    Comprehensiveness: Does the answer address all relevant aspects of the question? 
    Diversity: Does it present varied perspectives or insights? 
    Empowerment: Does it help the reader understand and make informed judgments? 
    Relevance: Does it address what the question is specifically asking?  

    The AutoE component scales evaluation of these qualities using the LLM-as-a-Judge method. It presents pairs of answers to an LLM, along with the query and target metric, in counterbalanced order. The model determines whether the first answer wins, loses, or ties with the second. Over a set of queries, whether from AutoQ or elsewhere, this produces win rates between competing methods. When ground truth is available, AutoE can also score answers on correctness, completeness, and related metrics.
An illustrative evaluation is shown in Figure 3. Using a dataset of 1,397 AP News articles on health and healthcare, AutoQ generated 50 queries per class (200 total). AutoE then compared LazyGraphRAG to a competing RAG method, running six trials per query across four metrics, using GPT-4.1 as a judge.
    These trial-level results were aggregated using metric-based win rates, where each trial is scored 1 for a win, 0.5 for a tie, and 0 for a loss, and then averaged to calculate the overall win rate for each RAG method.
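    A small sketch of that aggregation (an illustration of the scoring scheme described above, not BenchmarkQED's code):

```python
def win_rate(trial_outcomes):
    """trial_outcomes: list of 'win', 'tie', or 'loss' for method A versus method B."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[o] for o in trial_outcomes) / len(trial_outcomes)

# Six trials of one query on one metric: A wins 4, ties 1, loses 1 -> 0.75
print(win_rate(["win", "win", "win", "win", "tie", "loss"]))
```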
Figure 3. Win rates of four LazyGraphRAG (LGR) configurations across methods, broken down by the AutoQ query class and averaged across AutoE's four metrics: comprehensiveness, diversity, empowerment, and relevance. LazyGraphRAG outperforms comparison conditions where the bar is above 50%.
    The four LazyGraphRAG conditions (LGR_b200_c200, LGR_b50_c200, LGR_b50_c600, LGR_b200_c200_mini) differ by query budget (b50, b200) and chunk size (c200, c600). All used GPT-4o mini for relevance tests and GPT-4o for query expansion (to five subqueries) and answer generation, except for LGR_b200_c200_mini, which used GPT-4o mini throughout.
    Comparison systems were GraphRAG (Local, Global, and Drift Search), Vector RAG with 8k- and 120k-token windows, and three published methods: LightRAG, RAPTOR, and TREX. All methods were limited to the same 8k tokens for answer generation. GraphRAG Global Search used level 2 of the community hierarchy.
    LazyGraphRAG outperformed every comparison condition using the same generative model (GPT-4o), winning all 96 comparisons, with all but one reaching statistical significance. The best overall performance came from the larger-budget, smaller-chunk configuration (LGR_b200_c200). For DataLocal queries, the smaller budget (LGR_b50_c200) performed slightly better, likely because fewer chunks were relevant. For ActivityLocal queries, the larger chunk size (LGR_b50_c600) had a slight edge, likely because longer chunks provide a more coherent context.
Competing methods performed relatively better on the query classes for which they were designed: GraphRAG Global for global queries and Vector RAG for local queries. GraphRAG Drift Search, which combines both strategies, posed the strongest challenge overall.
    Increasing Vector RAG's context window from 8k to 120k tokens did not improve its performance compared to LazyGraphRAG. This raised the question of how LazyGraphRAG would perform against Vector RAG with a 1M-token context window containing most of the dataset.
    Figure 4 shows the follow-up experiment comparing LazyGraphRAG to Vector RAG using GPT-4.1, whose 1M-token context window enabled this comparison. Even against the 1M-token window, LazyGraphRAG achieved higher win rates across all comparisons, failing to reach significance only for the relevance of answers to DataLocal queries. These queries tend to benefit most from Vector RAG's ranking of directly relevant chunks, making it hard for LazyGraphRAG to generate answers that have greater relevance to the query, even though these answers may be dramatically more comprehensive, diverse, and empowering overall.
Figure 4. Win rates of LazyGraphRAG (LGR) over Vector RAG across different context window sizes, broken down by the four AutoQ query classes and four AutoE metrics: comprehensiveness, diversity, empowerment, and relevance. Bars above 50% indicate that LazyGraphRAG outperformed the comparison condition.
    AutoD: Automated data sampling and summarization
    Text datasets have an underlying topical structure, but the depth, breadth, and connectivity of that structure can vary widely. This variability makes it difficult to evaluate RAG systems consistently, as results may reflect the idiosyncrasies of the dataset rather than the system’s general capabilities.
The AutoD component addresses this by sampling datasets to meet a target specification, defined by the number of topic clusters (breadth) and the number of samples per cluster (depth). This creates consistency across datasets, enabling more meaningful comparisons, as structurally aligned datasets lead to comparable AutoQ queries, which in turn support consistent AutoE evaluations.
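    A minimal sketch of sampling to such a breadth-and-depth specification, assuming documents have already been assigned to topic clusters (the clustering step itself, and AutoD's actual interface, are out of scope here):

```python
import random
from collections import defaultdict

def sample_to_spec(docs_with_clusters, n_clusters, samples_per_cluster, seed=0):
    """docs_with_clusters: iterable of (doc_id, cluster_id) pairs."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for doc_id, cluster_id in docs_with_clusters:
        by_cluster[cluster_id].append(doc_id)
    # Keep the n_clusters largest clusters (breadth), then sample evenly within
    # each one (depth), so differently shaped datasets end up structurally aligned.
    chosen = sorted(by_cluster, key=lambda c: len(by_cluster[c]), reverse=True)[:n_clusters]
    return {c: rng.sample(by_cluster[c], min(samples_per_cluster, len(by_cluster[c])))
            for c in chosen}
```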
    AutoD also includes tools for summarizing input or output datasets in a way that reflects their topical coverage. These summaries play an important role in the AutoQ query synthesis process, but they can also be used more broadly, such as in prompts where context space is limited.
    Since the release of the GraphRAG paper, we’ve received many requests to share the dataset of the Behind the Tech podcast transcripts we used in our evaluation. An updated version of this dataset is now available in the BenchmarkQED repository, alongside the AP News dataset containing 1,397 health-related articles, licensed for open release.  
    We hope these datasets, together with the BenchmarkQED tools, help accelerate benchmark-driven development of RAG systems and AI question-answering. We invite the community to try them on GitHub. 
    #benchmarkqedautomatedbenchmarking #ofrag #systems
  • Alphabet CEO Sundar Pichai dismisses AI job fears, emphasizes expansion plans

    In a Bloomberg interview Wednesday night in downtown San Francisco, Alphabet CEO Sundar Pichai pushed back against concerns that AI could eventually make half the company’s 180,000-person workforce redundant. Instead, Pichai stressed the company’s commitment to growth through at least next year.
    “I expect we will grow from our current engineering phase even into next year, because it allows us to do more,” Pichai said, adding that AI is making engineers more productive by eliminating tedious tasks and enabling them to focus on more impactful work. Rather than replacing workers, he referred to AI as “an accelerator” that will drive new product development, thereby creating demand for more employees.
    Alphabet has staged numerous layoffs in recent years, though so far, cuts in 2025 appear to be more targeted than in previous years. It reportedly parted ways with fewer than 100 people in Google’s cloud division earlier this year and, more recently, hundreds more in its platforms and devices unit. The cuts in 2023 and 2024 were far more severe: 12,000 people were let go in 2023, and at least another 1,000 employees were laid off last year.
    Looking forward, Pichai pointed to Alphabet’s expanding ventures like Waymo autonomous vehicles, quantum computing initiatives, and YouTube’s explosive growth as evidence of innovation opportunities that continually bubble up. He noted YouTube’s scale in India alone, with 100 million channels and 15,000 channels boasting over one million subscribers.
    At one point, Pichai said trying to think too far ahead is “pointless.” But he also acknowledged the legitimacy of fears about job displacement. Asked about Anthropic CEO Dario Amodei’s recent comments that AI could erode half of entry-level white-collar jobs within five years, he said, “I respect that... I think it’s important to voice those concerns and debate them.”
    As the interview wrapped up, Pichai was asked about the limits of AI, and whether it’s possible that the world might never achieve artificial general intelligence, meaning AI that’s as smart as humans at everything. He paused briefly before answering. “There’s a lot of forward progress ahead with the paths we are on, not only the set of ideas we are working on today, [but] some of the newer ideas we are experimenting with,” he said.
    “I’m very optimistic on seeing a lot of progress. But you know,” he added, “you’ve always had these technology curves where you may hit a temporary plateau. So are we currently on an absolute path to AGI? I don’t think anyone can say for sure.”

    #alphabet #ceo #sundar #pichai #dismisses