Claude 3.7 Sonnet debuts with extended thinking to tackle complex problems
arstechnica.com
Ponder me this

Anthropic's first simulated reasoning model is a beast at coding tasks.

Benj Edwards | Feb 24, 2025 5:23 pm

[Image credit: Anthropic]

On Monday, Anthropic announced Claude 3.7 Sonnet, a new AI language model with a simulated reasoning (SR) capability called "extended thinking," allowing the system to work through problems step by step. The company also revealed Claude Code, a command line AI agent for developers currently available as a limited research preview.

Anthropic calls Claude 3.7 the first "hybrid reasoning model" on the market, giving users the option to choose between quick responses and extended, visible chain-of-thought processing similar to OpenAI's o1 and o3 series models, Google's Gemini 2.0 Flash Thinking, and DeepSeek's R1. When using Claude 3.7's API, developers can specify exactly how many tokens the model should use for thinking, up to its 128,000 token output limit.

The new model is available across all Claude subscription plans, and the extended thinking mode feature is available on all plans except the free tier. API pricing remains unchanged at $3 per million input tokens and $15 per million output tokens, with thinking tokens included in the output pricing since they are part of the context considered by the model.

In another interesting development (since Claude 3.5 Sonnet was known as something of a Goody Two-shoes in the AI world), Anthropic said that it had reduced unnecessary refusals in 3.7 Sonnet by 45 percent.
In other words, 3.7 Sonnet is more likely to do what you ask without complaining about ethical boundaries, which can otherwise pop up in innocent situations when interpreted incorrectly by the neural network running under Claude's hood.

In benchmarks, Anthropic's latest model seems to hold its own, and even excels in at least one category in particular: coding. 3.7's predecessor, Claude 3.5 Sonnet, was excellent at programming tasks compared to other AI models in our experience, and according to Anthropic, early testing indicates strong performance in that area. The company claims Claude 3.7 Sonnet achieved top scores on SWE-bench Verified, which evaluates how AI models handle real-world software issues, and also in TAU-bench, which tests AI agents on complex tasks with user and tool interactions.

[Image: A chart showing self-reported Claude 3.7 Sonnet benchmark results. Credit: Anthropic]

Aiming at software developers, Anthropic has also expanded its GitHub integration to all Claude plans, allowing devs to connect code repositories directly to Claude for bug fixes, feature development, and documentation work.

In our personal experience creating hobby programs with Claude 3.5 Sonnet over the past six months, the tool proved valuable for quickly prototyping projects, but we often ran up against usage limits. So far, Anthropic has not announced a subscription plan beyond the existing "Claude Pro" ($20/month) that might extend them, though we suspect developers who come to rely on 3.7 are soon going to need a plan more along the lines of OpenAI's ChatGPT Pro that features vastly expanded usage options for $200 a month. As an aside, our subjective experience with o1 and o3 in coding aligns with the benchmarks in the chart above; they have not been as good as Sonnet at coding.

And speaking of upgrades, we might as well talk about the name.
Claude 3.5 Sonnet launched in June 2024, but it received an update in October with a nearly identical name (sometimes referred to as "Claude 3.5 Sonnet (new)" or "Claude 3.5 Sonnet (October 2024)") that some users criticized as confusing. As a result, some users began unofficially calling that version "Claude 3.6 Sonnet" instead. Apparently, Anthropic got the message on the desire for clearer naming practices, writing "Lesson learned on naming" in a footnote on the Claude 3.7 release page.

Taking extended reasoning for a spin

Like other SR models, Claude 3.7 with extended thinking tries to work through more complex problems by throwing more tokens at them through an ingrained simulated reasoning process. Just like o1, o3, and DeepSeek R1, you can see the "thinking" process going through Claude 3.7's simulated mind while it works out an ideal answer.

To test it out briefly, we gave it a couple of simple tasks, including our time-honored (and now likely compromised as part of training datasets scraped from the web) test of asking it about the origin of the "magenta" color name.

[Image: An example of Claude 3.7 Sonnet with extended thinking being asked, "Would the color be called 'magenta' if the town of Magenta didn't exist?" Credit: Benj Edwards]

Interestingly, xAI's Grok 3 with "thinking" (its SR mode) enabled was the first model that definitively gave us a "no" and not an "it's not likely" to the magenta question. Claude 3.7 Sonnet with extended thinking also impressed us with our second-ever firm "no," followed by an explanation.

In another informal test, we asked 3.7 Sonnet with extended thinking to compose five original dad jokes. We've found in the past that our old prompt, "write 5 original dad jokes," was not specific enough and always resulted in canned dad jokes pulled directly from training data, so we asked, "Compose 5 original dad jokes that are not found anywhere in the world."
[Image: An example of Claude 3.7 Sonnet with extended thinking being asked, "Compose 5 original dad jokes that are not found anywhere in the world." Credit: Benj Edwards]

Claude made some attempts at crafting original jokes, although we'll let you judge whether they are funny or not. We will likely put 3.7 Sonnet's SR capabilities to the test more exhaustively in a future article.

Anthropic's first agent: Claude Code

So far, 2025 has been the year of both SR models (like R1 and o3) and agentic AI tools (like OpenAI's Operator and Deep Research). Not to be left out, Anthropic has announced its first agentic tool, Claude Code.

Claude Code is an autonomous coding assistant that operates directly from a console terminal. It allows Claude to search through codebases, read and edit files, write and run tests, commit and push code to GitHub repositories, and execute command line tools while keeping developers informed throughout the process.

[Video: Introducing Claude Code.]

Anthropic also aims for Claude Code to be used as an assistant for debugging and refactoring tasks. The company claims that during internal testing, Claude Code completed tasks in a single session that would typically require 45-plus minutes of manual work.

Claude Code is currently available only as a "limited research preview," with Anthropic stating it plans to improve the tool based on user feedback over time. Meanwhile, Claude 3.7 Sonnet is now available through the Claude website, the Claude app, Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

Benj Edwards, Senior AI Reporter

Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
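As a rough aid for the API pricing quoted earlier in this article ($3 per million input tokens, $15 per million output tokens, thinking tokens billed as output), here is a minimal sketch in Python of what a single request might cost. The function name and the example token counts are hypothetical, and the check that thinking and answer tokens together stay under the 128,000-token output limit is our assumption about how the limit applies, not something Anthropic's announcement spells out.

```python
# Prices quoted in the article, in US dollars per million tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00

# Output limit quoted in the article. Assumption: thinking tokens and
# answer tokens both count toward this single limit.
MAX_OUTPUT_TOKENS = 128_000


def estimate_cost_usd(input_tokens: int, thinking_tokens: int, answer_tokens: int) -> float:
    """Estimate the cost of one request, billing thinking tokens as output."""
    if min(input_tokens, thinking_tokens, answer_tokens) < 0:
        raise ValueError("token counts must be non-negative")

    output_tokens = thinking_tokens + answer_tokens
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("thinking plus answer tokens exceed the output limit")

    input_cost = input_tokens * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 2,000 input tokens, a 16,000-token thinking
    # budget that is fully used, and a 1,000-token final answer.
    print(round(estimate_cost_usd(2_000, 16_000, 1_000), 3))
```

In this hypothetical request, nearly all of the cost comes from the thinking tokens, which is why the ability to cap the thinking budget matters for developers watching their bills.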