How enterprise IT can protect itself from genAI unreliability
Mesmerized by the scalability, efficiency and flexibility claims from generative AI (genAI) vendors, enterprise execs have been all but tripping over themselves trying to push the technology to its limits. The fear of flawed deliverables based on a combination of hallucinations, imperfect training data and a model that can disregard query specifics and ignore guardrails is usually minimized.

But the Mayo Clinic is trying to push back on all those problematic answers. In an interview with VentureBeat, Matthew Callstrom, Mayo's medical director, explained: Mayo paired what's known as the clustering using representatives (CURE) algorithm with LLMs and vector databases to double-check data retrieval. The algorithm has the ability to detect outliers, or data points that don't match the others.

Combining CURE with a reverse RAG approach, Mayo's [large language model] split the summaries it generated into individual facts, then matched those back to source documents. A second LLM then scored how well the facts aligned with those sources, specifically whether there was a causal relationship between the two.

(Computerworld reached out directly to Callstrom for an interview, but he was not available.)
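To make that kind of check concrete, here is a minimal, hypothetical sketch of a reverse-RAG verification pass, not Mayo's actual implementation: split a generated summary into individual statements, match each one back to its closest source passage, and have a second model score the alignment. The `embed` and `score_alignment` callables are placeholders for whatever embedding model and judging LLM an enterprise actually uses.

```python
# Hypothetical sketch of a "reverse RAG" verification pass (illustration only):
# split a generated summary into statements, match each back to its closest
# source passage, then have a second model score how well they align.
from typing import Callable, List, Tuple
import math
import re

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify_summary(
    summary: str,
    source_passages: List[str],
    embed: Callable[[str], List[float]],           # placeholder embedding model
    score_alignment: Callable[[str, str], float],  # placeholder second-LLM judge, 0.0-1.0
    min_score: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Return (statement, best-matching passage, score) for every statement
    whose support falls below min_score, i.e. the facts that need re-checking."""
    passage_vectors = [(p, embed(p)) for p in source_passages]
    flagged = []
    # Naive split of the summary into individual statements.
    for statement in re.split(r"(?<=[.!?])\s+", summary.strip()):
        if not statement:
            continue
        vec = embed(statement)
        best_passage = max(passage_vectors, key=lambda pv: cosine(vec, pv[1]))[0]
        score = score_alignment(statement, best_passage)
        if score < min_score:
            flagged.append((statement, best_passage, score))
    return flagged
```

In this framing, anything that falls below the threshold is routed to a human reviewer or dropped, rather than shipping as a verified fact.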
There are, broadly speaking, two categories for reducing genAI's lack of reliability: humans in the loop (usually, an awful lot of humans in the loop) or some version of AI watching AI.

The idea of having more humans monitoring what these tools deliver is typically seen as the safer approach, but it undercuts the key value of genAI: massive efficiencies. Those efficiencies, the argument goes, should allow workers to be redeployed to more strategic work or, as the argument becomes a whisper, to sharply reduce that workforce. But at the scale of a typical enterprise, genAI efficiencies could replace the work of thousands of people, while adding human oversight might only require dozens. It still makes mathematical sense.

The AI-watching-AI approach is scarier, although a lot of enterprises are giving it a go. Some are looking to push any liability down the road by partnering with others to do their genAI calculations for them. Still others are looking to pay third parties to come in and try to improve their genAI accuracy. The phrase "throwing good money after bad" immediately comes to mind.

The lack of effective ways to improve genAI reliability internally is a key factor in why so many proof-of-concept trials got approved quickly but never moved into production.

Some version of throwing more humans into the mix to keep an eye on genAI outputs seems to be winning the argument, for now. "You have to have a human babysitter on it. AI watching AI is guaranteed to fail," said Missy Cummings, a George Mason University professor and director of Mason's Autonomy and Robotics Center (MARC). "People are going to do it because they want to believe in the (technology's) promises."

"People can be taken in by the self-confidence of a genAI system," she said, comparing it to the experience of driving autonomous vehicles (AVs). "When driving an AV, the AI is pretty good and it can work. But if you quit paying attention for a quick second, disaster can strike," Cummings said. "The bigger problem is that people develop an unhealthy complacency."

Rowan Curran, a Forrester senior analyst, said Mayo's approach might have some merit. "Look at the input and look at the output and see how close it adheres," Curran said.

Curran argued that identifying the objective truth of a response is important, but it's also important to simply see whether the model is even attempting to directly answer the query posed, including all of the query's components. If the system concludes that the answer is non-responsive, it can be ignored on that basis.

Another genAI expert is Rex Booth, CISO for identity vendor Sailpoint. Booth said that simply forcing LLMs to explain more about their own limitations would be a major help in making outputs more reliable. For example, many if not most hallucinations happen when the model can't find an answer in its massive database. If the system were set up to simply say, "I don't know," or even the more face-saving, "The data I was trained on doesn't cover that," confidence in outputs would likely rise.

Booth also focused on how current the data is. If a question asks about something that happened in April 2025 and the model knows its training data was last updated in December 2024, it should simply say that rather than making something up. "It won't even flag that its data is so limited," he said.

He also said that the concept of agents checking agents can work well, provided each agent is assigned a discrete task. But IT decision-makers should never assume those tasks and that separation will be respected. "You can't rely on the effective establishment of rules," Booth said. "Whether human or AI agents, everything steps outside the rules. You have to be able to detect that once it happens."

Another popular concept for making genAI more reliable is to force senior management, and especially the board of directors, to agree on a risk tolerance level, put it in writing and publish it. This would ideally push senior managers and execs to ask the tough questions about what can go wrong with these tools and how much damage they could cause.

Reece Hayden, principal analyst with ABI Research, is skeptical about how much senior management truly understands genAI risks. "They see the benefits and they understand the 10% inaccuracy, but they see it as though they are human-like errors: small mistakes, recoverable mistakes," Hayden said. "But when algorithms go off track, they can make errors light years more serious than humans."

For example, humans often spot-check their work. But spot-checking genAI doesn't work, Hayden said. In no way does the accuracy of one answer indicate the accuracy of other answers.

It's possible the reliability issues won't be fixed until enterprise environments adapt to become more technologically hospitable to genAI systems.

"The deeper problem lies in how most enterprises treat the model like a magic box, expecting it to behave perfectly in a messy, incomplete and outdated system," said Soumendra Mohanty, chief strategy officer at AI vendor Tredence. "GenAI models hallucinate not just because they're flawed, but because they're being used in environments that were never built for machine decision-making. To move past this, CIOs need to stop managing the model and start managing the system around the model. This means rethinking how data flows, how AI is embedded in business processes, and how decisions are made, checked and improved."

Mohanty offered an example: A contract summarizer should not just generate a summary; it should validate which clauses to flag, highlight missing sections and pull definitions from approved sources. This is decision engineering: defining the path, limits, and rules for AI output, not just the prompt.
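As a loose illustration of that decision-engineering framing (a sketch, not Tredence's actual approach), the hypothetical pipeline below wraps a placeholder summarization call with explicit rules: required sections must be present, risky clause types are flagged from a pre-approved checklist, and definitions come only from an approved glossary. The section names, clause list and `summarize` callable are all assumptions made for the example.

```python
# Hypothetical decision-engineering wrapper around a contract summarizer:
# deterministic rules around the model, not just the prompt, decide what is returned.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

REQUIRED_SECTIONS = ["term", "termination", "liability", "indemnification"]  # assumed checklist
FLAGGED_CLAUSES = ["auto-renewal", "unlimited liability", "unilateral amendment"]

@dataclass
class SummaryResult:
    summary: str
    missing_sections: List[str] = field(default_factory=list)
    flagged_clauses: List[str] = field(default_factory=list)
    definitions: Dict[str, str] = field(default_factory=dict)
    approved: bool = False

def summarize_contract(
    contract_text: str,
    summarize: Callable[[str], str],     # placeholder LLM summarization call
    approved_glossary: Dict[str, str],   # definitions from approved sources only
) -> SummaryResult:
    text = contract_text.lower()
    result = SummaryResult(summary=summarize(contract_text))
    # Rule 1: highlight required sections the contract never covers.
    result.missing_sections = [s for s in REQUIRED_SECTIONS if s not in text]
    # Rule 2: flag clause types from the pre-approved checklist.
    result.flagged_clauses = [c for c in FLAGGED_CLAUSES if c in text]
    # Rule 3: pull definitions from the approved glossary, never from the model.
    result.definitions = {
        term: definition
        for term, definition in approved_glossary.items()
        if term in text
    }
    # Only approve when nothing required is missing; otherwise route to a human reviewer.
    result.approved = not result.missing_sections
    return result
```

The point of the sketch is that the approval logic lives outside the model: the LLM drafts, but explicit rules and approved sources decide what is surfaced.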
There is a psychological reason execs tend to resist facing this issue: licensing genAI models is stunningly expensive, and after making a massive investment in the technology, there's natural resistance to pouring even more money into it to make outputs reliable.

And yet, the whole genAI game has to be focused on delivering the goods. That means not only looking at what works, but dealing with what doesn't. There's going to be a substantial cost to fixing things when these erroneous answers or flawed actions are discovered.

It's galling, yes; it is also necessary. The same people who will be praised effusively for the benefits of genAI will be the ones blamed for errors that materialize later. It's your career; choose wisely.