Taming the Oracle: Key Principles That Bring Our LLM Agents to Production
November 15, 2024 | Author(s): Nate Liebmann | Originally published on Towards AI.

A Tame Oracle. Generated with Microsoft Designer

With the second anniversary of the ChatGPT earthquake right around the corner, the rush to build useful applications based on large language models (LLMs) of its kind seems to be in full force. But despite the aura of magic surrounding demos of LLM agents or involved conversations, I am sure many can relate to my own experience developing LLM-based applications: you start with some example that seems to work great, but buyer's remorse soon follows. Other variations of the task can simply fail miserably, without a clear differentiator; and agentic flows can reveal their tendency to diverge once they stray from the original prototyping happy path.

If not for the title, you might have thought at this point that I was a generative AI Luddite, which could not be further from the truth. The journey my team at Torq and I have been on over the past two years, developing LLM-based software features that enhance the no-code automation building experience on our platform, has taught me a lot about the great power LLMs bring if handled correctly.

From here on I will discuss three core principles that guide our development and allow our agents to reach successful production deployment and customer utility. I believe they are just as relevant to other LLM-based applications.

The least freedom principle

LLMs interact through free text, but that is not always how our users will interact with our LLM-based application. In many cases, even if the input is indeed a textual description provided by the user, the output is much more structured, and could be used to take actions in the application automatically. In such a setting, the LLM's great ability to solve tasks that would otherwise require massive and complex deterministic logic or human intervention can turn into a problem: the more leeway we give the LLM, the more prone our application is to hallucinations and diverging agentic flows. Therefore, à la the principle of least privilege in security, I believe it is important to constrain the LLM as much as possible.

Fig. 1: The unconstrained, multi-step agentic flow

Consider an agent that takes a snapshot of a hand-written grocery list, extracts the text via OCR, locates the most relevant items in stock, and prepares an order. It may sound tempting to opt for a flexible multi-step agentic flow in which the agent can call methods such as search_product and add_to_order (see fig. 1 above). However, this process could turn out to be very slow, consist of superfluous steps, and might even get stuck in a loop if some function call returns an error the model struggles to recover from. An alternative approach constrains the flow to two steps: first, a batch search that returns a filtered product tree object; second, generating the order based on it, referencing appropriate products from the partial product tree returned by the search call (see fig. 2 below). Apart from the clear performance benefit, we can be much more confident the agent will remain on track and complete the task.

Fig. 2: A structured agentic flow with deterministic auto-fixing

When dealing with problems in the generated output, I believe it is best to do as much of the correction as possible deterministically, without involving the LLM again. This is because, contrary to intuition, sending an error back to an LLM agent and asking it to correct it does not always get it back on track, and might even increase the likelihood of further errors, as some evidence has shown. Circling back to the grocery shopping agent, it is very likely that in some cases invalid JSON paths will be produced to refer to products (e.g., food.cheeses.goats[0] instead of food.dairy.cheeses.goat[0]). Since we have the entire stock at hand, we can apply a simple heuristic to fix the incorrect path deterministically, for example by using an edit distance algorithm to find the valid path in the product tree closest to the generated one. Even then, some invalid paths might be too far from any valid one; in such a case, we might want to simply retry the LLM request rather than adding the error to the context and asking the model to fix it.
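To make the auto-fixing step in fig. 2 concrete, here is a minimal sketch under a few assumptions of my own (not taken from our codebase): the stock is a nested dict serving as the product tree, generated references are dotted paths, and difflib's similarity ratio from the Python standard library stands in for a dedicated edit distance implementation.

```python
# Deterministic auto-fixing of generated product paths (illustrative sketch).
import difflib


def enumerate_paths(tree, prefix=""):
    """Flatten the product tree into every valid dotted leaf path."""
    if isinstance(tree, dict):
        return [p for key, sub in tree.items()
                for p in enumerate_paths(sub, f"{prefix}.{key}" if prefix else key)]
    if isinstance(tree, list):
        return [p for i, sub in enumerate(tree)
                for p in enumerate_paths(sub, f"{prefix}[{i}]")]
    return [prefix]  # a leaf: the accumulated prefix is a valid product reference


def fix_product_path(generated_path, product_tree, cutoff=0.8):
    """Snap a generated path to the closest valid one; None means 'too far, retry the request'."""
    valid_paths = enumerate_paths(product_tree)
    if generated_path in valid_paths:
        return generated_path
    # difflib's ratio is a cheap stand-in for edit distance; swap in Levenshtein if preferred.
    closest = difflib.get_close_matches(generated_path, valid_paths, n=1, cutoff=cutoff)
    return closest[0] if closest else None


stock = {"food": {"dairy": {"cheeses": {"goat": ["chevre log"], "cow": ["aged cheddar"]}}}}
print(fix_product_path("food.cheeses.goats[0]", stock))  # -> "food.dairy.cheeses.goat[0]"
print(fix_product_path("garden.hoses[2]", stock))        # -> None, so we retry the request
```

A None result is the signal to retry the LLM request rather than round-trip the error through the model's context.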
Automated empirical evaluation

Unlike traditional third-party APIs, calling an LLM with the exact same input can produce different results each time, even with the temperature hyper-parameter set to zero. This is in direct conflict with fundamental principles of good software engineering, which is supposed to give users a predictable and consistent experience. The key to resolving this conflict is automated empirical evaluation, which I consider the LLM edition of test-driven development.

The evaluation suite can be implemented as a regular test suite, which has the benefit of natural integration into the development cycle and CI/CD pipelines. Crucially, however, the LLMs must actually be called, not mocked. Each evaluation case consists of user inputs and an initial system state, as well as a grading function for the generated output or modified state. Unlike traditional test cases, a binary PASS or FAIL is insufficient here, because the evaluation suite plays an important role in guiding improvements and enhancements, as well as catching unintended degradations. The grading function should therefore return a fitness score for the output or state modifications our agent produces.

How do we actually implement the grading function? Think, for example, of a simple LLM task for generating small Python utility functions. An evaluation case could prompt it to write a function that computes the nth element of the Fibonacci sequence. The model's implementation might take either the iterative or the recursive path, both valid (though suboptimal, because there is a closed-form expression), so we cannot make assertions about the specifics of the function's code. The grading function in this case could, however, take a handful of test values for the Fibonacci function's argument, spin up an isolated environment, run the generated function on those values, and verify the results. This black-box grading of the produced output makes no unnecessary assumptions, while strictly validating it in a fully deterministic fashion.
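As an illustration of that black-box approach, here is a sketch of a possible grading function for the Fibonacci case. It assumes the prompt pins down a 1-indexed entry point named fibonacci, and it uses a separate Python process with a timeout as a crude stand-in for a properly isolated environment.

```python
# Black-box grading of a generated Fibonacci function (illustrative sketch).
import json
import subprocess
import sys

EXPECTED = {1: 1, 2: 1, 3: 2, 7: 13, 10: 55}  # n -> nth element of the Fibonacci sequence


def build_harness(generated_code: str) -> str:
    """Wrap the candidate code in a tiny script that prints its results as JSON."""
    return (
        "import json, sys\n"
        + generated_code
        + "\ninputs = json.loads(sys.argv[1])\n"
        + "print(json.dumps({str(n): fibonacci(n) for n in inputs}))\n"
    )


def grade_fibonacci(generated_code: str, timeout_s: float = 5.0) -> float:
    """Run the candidate in a separate process and return the fraction of correct results."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", build_harness(generated_code), json.dumps(list(EXPECTED))],
            capture_output=True, text=True, timeout=timeout_s,
        )
        results = json.loads(proc.stdout)
    except (subprocess.TimeoutExpired, json.JSONDecodeError):
        return 0.0  # the candidate crashed, hung or printed garbage: worst possible fitness
    return sum(results.get(str(n)) == expected for n, expected in EXPECTED.items()) / len(EXPECTED)


# Example: a valid iterative implementation the model might produce scores 1.0.
candidate = """
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""
print(grade_fibonacci(candidate))  # -> 1.0
```

Returning a fraction rather than a pass/fail verdict keeps the score useful for tracking gradual improvements and regressions.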
While I believe that should be the preferred approach, it is not suitable for all applications. There are cases where we cannot fully validate the result, but we can still make assertions about some of its properties. For example, consider an agent that generates short summaries of system logs. Some properties of its outputs, like length, are easy to check deterministically. Other, semantic ones, not so much. If the otherwise business-as-usual logs serving as input for an evaluation case contain a single record about a kernel panic, we want to make sure the summary mentions it.

A naive grading function in this case would involve an LLM task that directly produces a fitness score for the summary based on the log records. This approach risks locking our evaluation in a sort of LLM complacency loop, with none of the guarantees provided by deterministic checks. A more nuanced approach could still use an LLM for grading, but craft the task differently: given only the summary, the model is instructed to answer multiple-choice factual questions (e.g., "Has there been a major incident in the covered period? (a) No (b) Yes, a kernel panic (c) Yes, a network connectivity loss"). We can be much more confident that the LLM would simply not be able to consistently answer such questions correctly if the key information is missing from the summary, making the score far more reliable (a sketch of such a grading task appears below).

Finally, due to non-determinism, each evaluation case must be run several times, with the results aggregated to form a final evaluation report. I have found it very useful to implement the evaluation suite early and use it to guide development. Once the application has reached some maturity, it can make sense to fail the integration pipeline if the aggregate score for its evaluation suite drops below a set threshold, to prevent catastrophic degradations (the second sketch below shows one way to wire this into a regular test).
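Here is a minimal sketch of such a multiple-choice grading task. The call_llm helper, the question and the answer key are illustrative placeholders rather than an actual implementation.

```python
# LLM-assisted grading via multiple-choice factual questions (illustrative sketch).
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Hypothetical helper that sends a prompt to your model of choice and returns its reply."""
    raise NotImplementedError


@dataclass
class Question:
    text: str
    options: dict[str, str]  # option letter -> option text
    correct: str             # the letter a faithful summary should make answerable


QUESTIONS = [
    Question(
        text="Has there been a major incident in the covered period?",
        options={"a": "No",
                 "b": "Yes, a kernel panic",
                 "c": "Yes, a network connectivity loss"},
        correct="b",
    ),
]


def grade_summary(summary: str, questions=QUESTIONS) -> float:
    """Fraction of factual questions the grader model answers correctly from the summary alone."""
    correct = 0
    for q in questions:
        options = "\n".join(f"({letter}) {text}" for letter, text in q.options.items())
        prompt = (
            "Answer the question using only the summary below. "
            "Reply with a single option letter.\n\n"
            f"Summary:\n{summary}\n\nQuestion: {q.text}\n{options}"
        )
        answer = call_llm(prompt).strip().strip("()").lower()[:1]
        correct += answer == q.correct
    return correct / len(questions)
```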
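And a sketch of the repetition-and-threshold gate as a regular pytest-style test; run_case stands for whatever invokes the agent on a case and grades its output (for example, the grading functions sketched above), and the cases and threshold are arbitrary illustrative values.

```python
# Running each evaluation case several times and gating CI on the aggregate (illustrative sketch).
import statistics

EVAL_CASES = {
    "business_as_usual_logs": {"logs": "..."},
    "logs_with_kernel_panic": {"logs": "..."},
}
REPETITIONS = 5            # each case is non-deterministic, so run it several times
AGGREGATE_THRESHOLD = 0.85


def run_case(case: dict) -> float:
    """Placeholder: invoke the agent on the case's inputs and grade the output (e.g. grade_summary)."""
    raise NotImplementedError


def test_evaluation_suite():
    report = {
        name: statistics.mean(run_case(case) for _ in range(REPETITIONS))
        for name, case in EVAL_CASES.items()
    }
    aggregate = statistics.mean(report.values())
    # Failing the integration pipeline on a drop below the threshold prevents catastrophic degradations.
    assert aggregate >= AGGREGATE_THRESHOLD, f"aggregate evaluation score {aggregate:.2f}: {report}"
```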
Not letting the tail wag the dog

Good LLM-based software is, first and foremost, good software. The magic factor we see in LLMs (which is telling of human nature and of the role language plays in our perception of other intelligent beings, a topic I will not cover here) might tempt us to treat LLM-based software as a whole new field, requiring novel tools, frameworks and development processes. As discussed above, the non-deterministic nature of commercial LLMs, as well as their unstructured API, does call for dedicated handling. But I would argue that instead of looking at an LLM-based application as a whole new creature that might here and there utilise familiar coding patterns, we should treat it as any other application, except where it is not. The power of this approach lies in the fact that we do not let external abstractions hide away the low-level LLM handling, which is crucial for truly understanding the model's capabilities and limitations in the scope of our application. Abstractions can and should be adopted where they save time and reduce boilerplate code, but never at the cost of losing control over the most important part of your application: the intricate touchpoints between the LLM and your deterministic code, which should be tailored to your specific use case.

Wrapping up, LLMs can be viewed as powerful oracles that enable previously unfeasible applications. My experience developing LLM-based agents has taught me several principles that have correlated with successful production deployment and utility. Firstly, agents should be given the least possible freedom: flows should be structured, and whatever can be done deterministically should be. Secondly, automated empirical evaluation of the LLM task and the surrounding logic should be a cornerstone of the development process, relying as much as possible on deterministic scoring. Thirdly, abstractions provided by libraries and frameworks should not be adopted where they hide essential details of the integration between the LLM and our code, the core of LLM-based applications.

Feel free to reach out to discuss this matter further and tell me what you think!