CIOs grapple with subpar global genAI models
www.computerworld.com
With the number of generative AI trials soaring in the enterprise, it is typical for the CIO to purchase numerous large language models from various model makers, tweaked for different geographies and languages. But CIOs are discovering that non-English models are faring far more poorly than English ones, even when purchased from the same vendor.

There is nothing nefarious about that fact. It is simply because there is a lot less data available to train non-English models.

"It is almost guaranteed that all LLM implementations in languages other than English will perform with less accuracy and less relevance than implementations in English because of the vast disparity in training sample size," said Akhil Seth, head of AI business development at consulting firm UST.

Less data delivers less comprehensiveness, less accuracy, and much more frequent hallucinations. (Hallucinations typically happen when the model has no information to answer the query, so it makes something up. Proud algorithms these LLMs can be.)

Nefarious or not, IT leaders at global companies need to deal with this situation or suffer subpar results for customers and employees who speak languages other than English.

The major model makers (OpenAI, Microsoft, Amazon/AWS, IBM, Google, Anthropic, and Perplexity, among others) do not typically divulge the volume of data each model is trained on, and certainly not the quality or nature of that data. Enterprises usually deal with this lack of transparency about training data via extensive testing, but that testing is often focused on the English-language model, not those in other languages.

"There are concerns that this [imbalance of training data] would put applications leveraging non-English languages at an informational and computational disadvantage," said Flavio Villanustre, global chief information security officer of LexisNexis Risk Solutions.

"The volume, richness, and variability in the underlying training data is key to obtaining high-quality runtime performance of the model. Inquiries in languages that are underrepresented in the training data are likely to yield poor performance," he said.

The size difference can be extreme

How much smaller are the datasets used in non-English models? That varies widely depending on the language. It's not so much a matter of the number of people who speak that language as it is the volume of data in that language available for training.

Vasi Philomin, the VP and general manager for generative AI at Amazon Web Services (AWS), one of the leading AI-as-a-service vendors, estimated that the training datasets for non-English models are roughly 10 to 100 times smaller than their English counterparts.

Although there is no precise way to predetermine how much data is available for training in a given language, Hans Florian, a distinguished research scientist for multilingual natural language processing at IBM, has a trick. "You can look at the number of Wikipedia pages in that language. That correlates quite well with the amount of data available in that language," he said.
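Florian's heuristic is easy to try. The following is a minimal sketch (not from the article) that pulls per-language article counts from Wikipedia's standard MediaWiki statistics endpoint; treat the results only as a rough proxy for how much training data exists in a language.

```python
import requests

def wikipedia_article_count(lang: str) -> int:
    """Return the article count for a language's Wikipedia, via the
    standard MediaWiki siteinfo/statistics endpoint."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "meta": "siteinfo",
            "siprop": "statistics",
            "format": "json",
        },
        headers={"User-Agent": "lang-data-probe/0.1"},  # Wikimedia asks for a descriptive UA
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["query"]["statistics"]["articles"]

# Compare a high-resource language against lower-resource ones.
for lang in ("en", "fr", "ja", "cr"):  # "cr" is the Cree Wikipedia
    print(f"{lang}: {wikipedia_article_count(lang):,} articles")
```

The gap between "en" and a low-resource language is typically several orders of magnitude, which tracks with Philomin's 10x-to-100x estimate above.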
Training data availability also varies by industry, topic, and use case.

"If you want your language model to be multilingual, the best thing you can do is have parallel data in the languages you want to support," said Mary Osborne, the senior product manager of AI and natural language processing at SAS. "That's an easy proposition in places like Quebec, for example, where all their government data is created in both English and French."

"If you wanted to have an LLM that did a great job of answering questions about the Canadian government in both English and French, you'd have a good supply of data to pull that off," Osborne said. "But if you wanted to add an obscure indigenous language like Cree or Micmac, those languages would be vastly underrepresented in the sample. They would yield poor results compared to English and French, because the model wouldn't have seen enough data in those indigenous languages to do well."

Although dataset size is extremely important in a genAI model, data quality is also critical. Even though there are no objective benchmarks for assessing data quality, experts in various topics have a rough sense of what good and bad content looks like. In healthcare, for example, it might be the difference between using the New England Journal of Medicine or The Lancet versus scraping the personal website of a chiropractor in Milwaukee.

Like dataset size, data quality often varies by geography, according to Jürgen Bross, a senior research scientist and manager for multilingual NLP at IBM. In Japan, for example, IBM needed to apply its own quality filtering, partly because so many quality websites in Japan are behind strict paywalls. That meant that, on average, the available Japanese data was of lower quality. "Fewer newspapers and more product pages," Bross said.

Quick fixes bring limited success

UST's Seth said the dataset challenges with non-English genAI models are not going to be easy to overcome. Some of the more obvious mechanisms to address the smaller training datasets for non-English models (including automated translation and more aggressive fine-tuning) come with their own negatives.

"Putting a [software] translator somewhere in the inference pipeline is an obvious quick fix, but it will no doubt introduce idiomatic inconsistencies in the generated output and potentially even in the interpretation of the input. Even multilingual models suffer from this," Seth said.
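For illustration, here is a rough sketch (not from the article) of the quick fix Seth describes: wrapping an English-centric model with machine translation on both sides. The translate and query_llm functions are hypothetical placeholders for whatever translation and model APIs an enterprise actually uses; the idiomatic drift Seth warns about can creep in at both translation hops.

```python
# Hedged sketch of the "translator in the inference pipeline" quick fix.
# translate() and query_llm() are hypothetical stand-ins for real
# translation and LLM provider APIs.

def translate(text: str, source: str, target: str) -> str:
    """Placeholder for a machine-translation call (e.g., a cloud MT API)."""
    raise NotImplementedError("wire up your translation provider here")

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an English-centric LLM."""
    raise NotImplementedError("wire up your model provider here")

def answer_in_language(user_prompt: str, lang: str) -> str:
    # Step 1: translate the user's prompt into English, where the
    # model has seen the most training data.
    english_prompt = translate(user_prompt, source=lang, target="en")

    # Step 2: run inference against the stronger English model.
    english_answer = query_llm(english_prompt)

    # Step 3: translate the answer back into the user's language.
    # This hop is where idiomatic inconsistencies most often appear.
    return translate(english_answer, source="en", target=lang)
```

Notably, this is the automated version of the manual workflow that AI engineer Vincent Schmalbach describes later in this piece.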
Another popular countermeasure for non-English genAI models is using synthetic data to supplement the actual data. Synthetic data is typically generated by machine learning, which extrapolates patterns from real data to create likely data. The problem is that if the original data has even a hint of bias (which is common), synthetic data is likely to perpetuate and magnify that bias. Forgive the cliché, but it's the genAI version of three steps forward, two steps back.

Indeed, LexisNexis' Villanustre worries that this problem could get worse, hurting the accuracy and credibility of genAI-produced global analysis. "There is an increasing portion of unstructured content on the internet that is currently created by generative AI models. If not careful, future models could be increasingly trained on output from other models, potentially amplifying biases and inaccuracies," Villanustre said.

Practical (and sometimes expensive) approaches

So how can tech leaders better address the problem? It starts during the procurement process. Although IT operations folks typically ask excellent questions about LLMs before they purchase, they tend to be overwhelmingly focused on the English version. It doesn't occur to them that the quality delivered in the non-English models may be dramatically lower.

Jason Andersen, a VP and principal analyst with Moor Insights & Strategy, said CIOs need to do everything they can to get model makers to share more information about training data for every model being purchased or licensed. "There has to be much more transparency of data provenance," he said.

Alternatively, CIOs can consider sourcing their non-English models from regional or local genAI firms that are native to that language. Although that approach might solve the problem for many geographies, it is going to meet strong resistance from many enterprise CIOs, said Rowan Curran, a senior analyst for genAI strategies at Forrester.

"Most enterprises are far more interested in sourcing their foundation models from their trusted providers, which are generally the major hyperscalers," Curran said. "Enterprises really want to acquire those [model training] capabilities via their deployments on AWS, Google, or Microsoft. That gives [CIOs] a higher comfort level. They are hesitant to work with a startup."

AWS's Philomin said his team is trying to split the difference for IT customers by using a genAI marketplace approach, borrowing the technique from the AWS Marketplace, which in turn had borrowed the concept from AWS's parent company, Amazon. Amazon's retail approach allows users to purchase from small merchants through Amazon, with Amazon taking a cut.

Amazon's genAI marketplace, called Bedrock, does something similar, providing access to a large number of genAI model makers globally. Although it certainly doesn't mitigate all of the downsides of using a little-known provider in various geographies, Philomin argues that it addresses some of them.

"We are removing some of the risks, [such as] the resilience of the service and the support," Philomin said. But he also stressed that those smaller players are the seller of record, not AWS. That caveat raises the question of how much help the AWS reseller role will be if something later blows up.
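For teams weighing that route, Bedrock's catalog can be inspected programmatically. Below is a minimal sketch using boto3's bedrock client and its list_foundation_models call; the region choice is an illustrative assumption.

```python
import boto3

# Minimal sketch: enumerate the foundation models available through
# Amazon Bedrock in a given region. The region is illustrative.
client = boto3.client("bedrock", region_name="us-east-1")

response = client.list_foundation_models()
for model in response["modelSummaries"]:
    # providerName identifies the model maker, which, per Philomin,
    # remains the seller of record rather than AWS itself.
    print(model["providerName"], model["modelId"])
```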
Another approach to address the training data disparity? Bypass the non-English models (for now) by employing bilingual humans who can comfortably interact with the English model.

"As a German native who works primarily in English, I've found that while LLMs are competent in German, they don't quite reach native-level proficiency," said Vincent Schmalbach, an independent AI engineer in Munich. "For critical German-language content, I've developed a practical workflow. I interact with the LLM in English to get the highest-quality output, then translate the final result to German. This approach consistently produces better results than working directly in German."

The tactic that most genAI specialists agree on is that CIOs need to budget more money to test and fine-tune every non-English model they want to use. That money also needs to cover the additional processing and verification needed for non-English models.

That said, fine-tuning can only help so much. The training data is the heart of the genAI brain. If that is inadequate, more fine-tuning can be akin to trying to save a salad with rotting spinach by pouring on more salad dressing.

And allocating additional budget to fine-tuning models can be difficult, because the number of variables (such as the specific languages, topics, and industries in question) is too great to offer any realistic guidance. But IBM's Florian does offer a tiny bit of optimism: "You don't need a permanent budget increase. It's just a one-time budget increase, a one-time expense that you take." In other words, once the non-English model is fully integrated and supplemented, little to no funding is needed beyond whatever the English model needs.

Looking ahead

There's reason to hope that the disparity in the quality of output from models in various languages may be lessened or even negated in the coming years. That's because a model based on a smaller dataset may not suffer from lower accuracy if the underlying data is of a higher quality.

One factor now coming into play lies in the difference between public and private data. An executive at one of the largest model makers, who asked not to be identified by name or employer, said the major LLM makers have pretty much captured as much of the data on the public internet as they can. They are continuing to harvest new data from the internet every day, of course, but those firms are shifting much of their data-gathering efforts to private sources such as corporations and universities.

"We have found a lot of super high-quality data, but we cannot get access to it because it's not on the internet. We need to get agreements with the owners of this data to get access," he said.

Tapping into private sources of information, including those in various countries around the world, will potentially improve the data quality for some topics and industries, and at the same time increase the amount of good training data available for non-English models. As the total universe of training data expands, the imbalance in the amount of training data across languages may matter less and less. However, this shift is also likely to raise prices, as the model makers cut deals with third parties to license their private information.

Another factor that could minimize the dataset-size problem in the next few years is an anticipated increase in unstructured data. Indeed, highly unstructured data, such as that collected by video drones watching businesses and their customers, could potentially sidestep language issues entirely, as the video analysis could be captured directly and saved in many different languages.

Until the volume of high-quality data for non-English languages gets much stronger (something that might slowly happen with more unstructured, private, and language-agnostic data in the next few years), CIOs need to demand better answers from model vendors on the training data for all non-English models.

Let's say a global CIO is buying 118 models from an LLM vendor, in a wide range of languages. The CIO pays maybe $2 billion for the package. The vendor doesn't tell the CIO how little training was done on all of those non-English models, and certainly not where that training data came from. If the vendors were fully transparent on both of those points, CIOs would push back on pricing for everything other than the English model. In response, the model makers would likely not charge CIOs less for the non-English models, but instead ramp up their efforts to find more training data to improve the accuracy of those models.

Given the massive amount of money enterprises are spending on genAI, the carrot is obvious. The stick? Maybe CIOs need to get out of their comfort zone and start buying their non-English models from regional vendors in every language they need. If that starts to happen on a large scale, the major model makers may suddenly see the value of data-training transparency.
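In the meantime, teams don't have to wait for vendor disclosures to see the gap for themselves. Here is a hedged sketch of the kind of per-language spot check the testing advice above implies: run the same translated questions with known answers through a model in each language and compare accuracy. query_llm is the same hypothetical model call used in the earlier sketch, and the test items are illustrative assumptions to be replaced with a curated, domain-specific evaluation set.

```python
# Hedged sketch of a per-language accuracy spot check. query_llm() is a
# hypothetical stand-in for a real model API; the test items below are
# illustrative and should be replaced with a curated evaluation set.

def query_llm(prompt: str, lang: str) -> str:
    raise NotImplementedError("wire up your model provider here")

# The same factual question, professionally translated per language,
# paired with a substring the answer is expected to contain.
TEST_ITEMS = {
    "en": [("What is the capital of Canada?", "Ottawa")],
    "fr": [("Quelle est la capitale du Canada ?", "Ottawa")],
    "de": [("Was ist die Hauptstadt von Kanada?", "Ottawa")],
}

def accuracy_by_language() -> dict[str, float]:
    scores = {}
    for lang, items in TEST_ITEMS.items():
        hits = sum(
            1 for prompt, expected in items
            if expected.lower() in query_llm(prompt, lang).lower()
        )
        scores[lang] = hits / len(items)
    return scores
```

Even a small harness like this, scaled up with real domain questions, gives CIOs hard numbers to bring to the pricing conversation described above.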