The second wave of AI coding is here
www.technologyreview.com
Ask people building generative AI what generative AI is good for right now, what they're really fired up about, and many will tell you: coding.
"That's something that's been very exciting for developers," Jared Kaplan, chief scientist at Anthropic, told MIT Technology Review this month: "It's really understanding what's wrong with code, debugging it."
Copilot, a tool built on top of OpenAI's large language models and launched by Microsoft-backed GitHub in 2022, is now used by millions of developers around the world. Millions more turn to general-purpose chatbots like Anthropic's Claude, OpenAI's ChatGPT, and Google DeepMind's Gemini for everyday help.
"Today, more than a quarter of all new code at Google is generated by AI, then reviewed and accepted by engineers," Alphabet CEO Sundar Pichai claimed on an earnings call in October: "This helps our engineers do more and move faster." Expect other tech companies to catch up, if they haven't already.
It's not just the big beasts rolling out AI coding tools. A bunch of new startups have entered this buzzy market too. Newcomers such as Zencoder, Merly, Cosine, Tessl (valued at $750 million within months of being set up), and Poolside (valued at $3 billion before it even released a product) are all jostling for their slice of the pie. "It actually looks like developers are willing to pay for copilots," says Nathan Benaich, an analyst at investment firm Air Street Capital: "And so code is one of the easiest ways to monetize AI."
Such companies promise to take generative coding assistants to the next level. Instead of providing developers with a kind of supercharged autocomplete, like most existing tools, this next generation can prototype, test, and debug code for you. The upshot is that developers could essentially turn into managers, who may spend more time reviewing and correcting code written by a model than writing it from scratch themselves.
But there's more.
Many of the people building generative coding assistants think that they could be a fast track to artificial general intelligence (AGI), the hypothetical superhuman technology that a number of top firms claim to have in their sights.
"The first time we will see a massively economically valuable activity to have reached human-level capabilities will be in software development," says Eiso Kant, CEO and cofounder of Poolside. (OpenAI has already boasted that its latest o3 model beat the company's own chief scientist in a competitive coding challenge.)
Welcome to the second wave of AI coding.
Correct code
Software engineers talk about two types of correctness. There's the sense in which a program's syntax (its grammar) is correct, meaning all the words, numbers, and mathematical operators are in the right place. This matters a lot more than grammatical correctness in natural language. Get one tiny thing wrong in thousands of lines of code and none of it will run.
The first generation of coding assistants is now pretty good at producing code that's correct in this sense. Trained on billions of pieces of code, they have assimilated the surface-level structures of many types of programs.
But there's also the sense in which a program's function is correct: Sure, it runs, but does it actually do what you wanted it to? It's that second level of correctness that the new wave of generative coding assistants is aiming for, and this is what will really change the way software is made.
"Large language models can write code that compiles, but they may not always write the program that you wanted," says Alistair Pullen, a cofounder of Cosine. "To do that, you need to re-create the thought processes that a human coder would have gone through to get that end result."
The problem is that the data most coding assistants have been trained on (the billions of pieces of code taken from online repositories) doesn't capture those thought processes.
It represents a finished product, not what went into making it. "There's a lot of code out there," says Kant. "But that data doesn't represent software development."
What Pullen, Kant, and others are finding is that to build a model that does a lot more than autocomplete (one that can come up with useful programs, test them, and fix bugs) you need to show it a lot more than just code. You need to show it how that code was put together.
In short, companies like Cosine and Poolside are building models that don't just mimic what good code looks like, whether it works well or not, but mimic the process that produces such code in the first place. Get it right and the models will come up with far better code and far better bug fixes.
Breadcrumbs
But you first need a data set that captures that process: the steps that a human developer might take when writing code. Think of these steps as a breadcrumb trail that a machine could follow to produce a similar piece of code itself.
Part of that is working out what materials to draw from: Which sections of the existing codebase are needed for a given programming task? "Context is critical," says Zencoder founder Andrew Filev. "The first generation of tools did a very poor job on the context, they would basically just look at your open tabs. But your repo [code repository] might have 5,000 files and they'd miss most of it."
Zencoder has hired a bunch of search engine veterans to help it build a tool that can analyze large codebases and figure out what is and isn't relevant. This detailed context reduces hallucinations and improves the quality of code that large language models can produce, says Filev: "We call it repo grokking."
Cosine also thinks context is key. But it draws on that context to create a new kind of data set. The company has asked dozens of coders to record what they were doing as they worked through hundreds of different programming tasks. "We asked them to write down everything," says Pullen: "Why did you open that file?
"Why did you scroll halfway through? Why did you close it?" They also asked coders to annotate finished pieces of code, marking up sections that would have required knowledge of other pieces of code or specific documentation to write.
Cosine then takes all that information and generates a large synthetic data set that maps the typical steps coders take, and the sources of information they draw on, to finished pieces of code. They use this data set to train a model to figure out what breadcrumb trail it might need to follow to produce a particular program, and then how to follow it.
Poolside, based in San Francisco, is also creating a synthetic data set that captures the process of coding, but it leans more on a technique called RLCE (reinforcement learning from code execution). (Cosine uses this too, but to a lesser degree.)
RLCE is analogous to the technique used to make chatbots like ChatGPT slick conversationalists, known as RLHF (reinforcement learning from human feedback). With RLHF, a model is trained to produce text that's more like the kind human testers say they favor. With RLCE, a model is trained to produce code that's more like the kind that does what it is supposed to do when it is run (or executed).
Gaming the system
Cosine and Poolside both say they are inspired by the approach DeepMind took with its game-playing model AlphaZero. AlphaZero was given the steps it could take (the moves in a game) and then left to play against itself over and over again, figuring out via trial and error which sequences of moves were winning ones and which were not.
"They let it explore moves at every possible turn, simulate as many games as you can throw compute at; that led all the way to beating Lee Sedol," says Pengming Wang, a founding scientist at Poolside, referring to the Korean Go grandmaster whom AlphaGo, AlphaZero's predecessor, beat in 2016.
Before Poolside, Wang worked at Google DeepMind on applications of AlphaZero beyond board games, including FunSearch, a version trained to solve advanced math problems.
When that AlphaZero approach is applied to coding, the steps involved in producing a piece of code (the breadcrumbs) become the available moves in a game, and a correct program becomes winning that game. Left to play by itself, a model can improve far faster than a human could. "A human coder tries and fails one failure at a time," says Kant. "Models can try things 100 times at once."
A key difference between Cosine and Poolside is that Cosine is using a custom version of GPT-4o provided by OpenAI, which makes it possible to train on a larger data set than the base model can cope with, whereas Poolside is building its own large language model from scratch.
Poolside's Kant thinks that training a model on code from the start will give better results than adapting an existing model that has sucked up not only billions of pieces of code but most of the internet. "I'm perfectly fine with our model forgetting about butterfly anatomy," he says.
Cosine claims that its generative coding assistant, called Genie, tops the leaderboard on SWE-Bench, a standard set of tests for coding models. Poolside is still building its model but claims that what it has so far already matches the performance of GitHub's Copilot.
"I personally have a very strong belief that large language models will get us all the way to being as capable as a software developer," says Kant. Not everyone takes that view, however.
Illogical LLMs
To Justin Gottschlich, the CEO and founder of Merly, large language models are the wrong tool for the job, period. He invokes his dog: "No amount of training for my dog will ever get him to be able to code, it just won't happen," he says. "He can do all kinds of other things, but he's just incapable of that deep level of cognition."
Having worked on code generation for more than a decade, Gottschlich has a similar sticking point with large language models. Programming requires the ability to work through logical puzzles with unwavering precision. No matter how well large language models may learn to mimic what human programmers do, at their core they are still essentially statistical slot machines, he says: "I can't train an illogical system to become logical."
Instead of training a large language model to generate code by feeding it lots of examples, Merly does not show its system human-written code at all. That's because to really build a model that can generate code, Gottschlich argues, you need to work at the level of the underlying logic that code represents, not the code itself. Merly's system is therefore trained on an intermediate representation, something like the machine-readable notation that most programming languages get translated into before they are run.
Gottschlich won't say exactly what this looks like or how the process works. But he throws out an analogy: There's this idea in mathematics that the only numbers that have to exist are prime numbers, because you can calculate all other numbers using just the primes. "Take that concept and apply it to code," he says.
Not only does this approach get straight to the logic of programming; it's also fast, because millions of lines of code are reduced to a few thousand lines of intermediate language before the system analyzes them.
Shifting mindsets
What you think of these rival approaches may depend on what you want generative coding assistants to be.
In November, Cosine banned its engineers from using tools other than its own products. It is now seeing the impact of Genie on its own engineers, who often find themselves watching the tool as it comes up with code for them. "You now give the model the outcome you would like, and it goes ahead and worries about the implementation for you," says Yang Li, another Cosine cofounder.
Pullen admits that it can be baffling, requiring a switch of mindset. "We have engineers doing multiple tasks at once, flitting between windows," he says. "While Genie is running code in one, they might be prompting it to do something else in another."
These tools also make it possible to prototype multiple versions of a system at once. Say you're developing software that needs a payment system built in. You can get a coding assistant to simultaneously try out several different options (Stripe, Mango, Checkout) instead of having to code them by hand one at a time.
Genie can be left to fix bugs around the clock. Most software teams use bug-reporting tools that let people upload descriptions of errors they have encountered. Genie can read these descriptions and come up with fixes. Then a human just needs to review them before updating the code base.
No single human understands the trillions of lines of code in today's biggest software systems, says Li, and as more and more software gets written by other software, the amount of code will only get bigger.
This will make coding assistants that maintain that code for us essential. "The bottleneck will become how fast humans can review the machine-generated code," says Li.
How do Cosine's engineers feel about all this? According to Pullen, at least, just fine. "If I give you a hard problem, you're still going to think about how you want to describe that problem to the model," he says. "Instead of writing the code, you have to write it in natural language. But there's still a lot of thinking that goes into that, so you're not really taking the joy of engineering away. The itch is still scratched."
Some may adapt faster than others. Cosine likes to invite potential hires to spend a few days coding with its team. A couple of months ago it asked one such candidate to build a widget that would let employees share cool bits of software they were working on to social media.
The task wasn't straightforward, requiring working knowledge of multiple sections of Cosine's millions of lines of code. But the candidate got it done in a matter of hours. "This person who had never seen our code base turned up on Monday and by Tuesday afternoon he'd shipped something," says Li. "We thought it would take him all week." (They hired him.)
But there's another angle too. Many companies will use this technology to cut down on the number of programmers they hire. Li thinks we will soon see tiers of software engineers. At one end there will be elite developers with million-dollar salaries who can diagnose problems when the AI goes wrong. At the other end, smaller teams of 10 to 20 people will do a job that once required hundreds of coders. "It will be like how ATMs transformed banking," says Li.
"Anything you want to do will be determined by compute and not head count," he says. "I think it's generally accepted that the era of adding another few thousand engineers to your organization is over."
Warp drives
Indeed, for Gottschlich, machines that can code better than humans are going to be essential. For him, that's the only way we will build the vast, complex software systems that he thinks we will eventually need. Like many in Silicon Valley, he anticipates a future in which humans move to other planets. That's only going to be possible if we get AI to build the software required, he says: "Merly's real goal is to get us to Mars."
Gottschlich prefers to talk about "machine programming" rather than "coding assistants," because he thinks that term frames the problem the wrong way. "I don't think that these systems should be assisting humans; I think humans should be assisting them," he says. "They can move at the speed of AI. Why restrict their potential?"
"There's this cartoon called The Flintstones where they have these cars, but they only move when the drivers use their feet," says Gottschlich. "This is sort of how I feel most people are doing AI for software systems.
"But what Merly's building is, essentially, spaceships," he adds. He's not joking. "And I don't think spaceships should be powered by humans on a bicycle. Spaceships should be powered by a warp engine."
If that sounds wild, it is. But there's a serious point to be made about what the people building this technology think the end goal really is.
Gottschlich is not an outlier with his galaxy-brained take. Despite their focus on products that developers will want to use today, most of these companies have their sights on a far bigger payoff. Visit Cosine's website and the company introduces itself as a "Human Reasoning Lab." It sees coding as just the first step toward a more general-purpose model that can mimic human problem-solving in a number of domains.
Poolside has similar goals: The company states upfront that it is building AGI. "Code is a way of formalizing reasoning," says Kant.
Wang invokes agents. Imagine a system that can spin up its own software to do any task on the fly, he says. "If you get to a point where your agent can really solve any computational task that you want through the means of software, that is a display of AGI, essentially."
Down here on Earth, such systems may remain a pipe dream. And yet software engineering is changing faster than many at the cutting edge expected.
"We're not at a point where everything's just done by machines, but we're definitely stepping away from the usual role of a software engineer," says Cosine's Pullen. "We're seeing the sparks of that new workflow: what it means to be a software engineer going into the future."