Will the non-English genAI problem lead to data transparency and lower costs?
www.computerworld.com
It's become increasingly clear that quality plunges when moving from English to non-English-based large language models (LLMs). They're less accurate, and there's a serious lack of transparency around training data, both in terms of data volume and data quality. The latter has long been a problem for generative AI (genAI) tools and platforms.

But enterprises aren't paying less for less-productive models, even though the value they offer is diminished. So why aren't CIOs getting a price break for non-English models? Because without any data transparency, they rarely know they're paying more for less.

There are a variety of reasons why model makers don't disclose their data training particulars. (Let's not even get into the issue of whether they have legal rights to do whatever training they did, though it's tempting to do so, if only to explore the hypocrisy of OpenAI complaining about DeepSeek not getting permission before training on much of its data.)

Speaking of DeepSeek, don't read too much into the lower cost of its underlying models. Yes, its builders cleverly leveraged open source to find efficiencies and lower pricing, but there's been little disclosure of how much the Chinese government helped with DeepSeek's funding, either directly or indirectly.

That said, if DeepSeek is the cudgel that puts downward pressure on genAI pricing, I'm all for it, and IT execs should be, too. But until we see evidence of meaningful price cuts, they should use the lack of data transparency in non-English models to try to get model makers' price tags out of the stratosphere.

The non-English issue isn't really about the language, per se.
It's more about the training data that is available within that language. (By some estimates, the training datasets for non-English models could be just 1/10 or even 1/100 the size of their English counterparts.)

Hans Florian, whose title is distinguished research scientist for multilingual natural language processing at IBM, said he uses a trick to guesstimate how much data is available in various languages: "You can look at the number of Wikipedia pages in that language. That correlates quite well with the amount of data available in that language," he said.

To further complicate the issue, sometimes it's not about the language or the available data in that language. It can, logically enough, be about data related to activities in the region where a particular language is dominant.

If model makers start seeing meaningful pricing pushback from a lot of enterprises concerned about model quality, they have only a couple of options. They can selectively and secretly negotiate lower prices for non-English models for some of their customers, or they can get serious about data transparency.

Because LLM makers have invested billions of dollars in genAI, they aren't going to like the idea of lower pricing. That leads to that second option: deliver full transparency to all customers about all models, both in terms of quantity and quality, and price their wares accordingly.

Given that quality is almost impossible to represent numerically, that will mean disclosing all training data details so each customer can make their own determination of quality for the topics, verticals, and geographies they care about.

The pricing disparity between what a model can deliver and what an enterprise is forced to pay is at the heart of why CIOs are still struggling to deliver genAI ROI. Obviously, lower pricing would be the best way to improve the ROI for genAI investments.
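Florian's heuristic is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not his method as implemented at IBM: the article counts are hypothetical placeholders (real figures could be pulled from each wiki's own statistics pages), and the language codes are chosen only for the example.

```python
# Sketch of the Wikipedia-page-count heuristic: treat a language's
# article count as a rough proxy for how much training data exists
# in that language, relative to English.
# NOTE: these counts are illustrative placeholders, not real figures.
ARTICLE_COUNTS = {
    "en": 6_900_000,   # hypothetical English article count
    "de": 2_900_000,   # hypothetical German article count
    "sw": 80_000,      # hypothetical Swahili article count
}

def relative_data_availability(lang: str, baseline: str = "en") -> float:
    """Estimate a language's data availability as a fraction of the baseline."""
    return ARTICLE_COUNTS[lang] / ARTICLE_COUNTS[baseline]

for lang in ("de", "sw"):
    ratio = relative_data_availability(lang)
    print(f"{lang}: roughly 1/{round(1 / ratio)} the data of English")
```

With placeholder numbers like these, the heuristic reproduces exactly the kind of 1/10-to-1/100 gap the estimates above describe for smaller languages.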
But if that's not going to happen anytime soon, full data transparency is the next best thing.

There is a catch: model makers almost certainly realize that full data-training transparency would likely force them to lower prices, since it would showcase how low-quality their data is.

Note: I say that their data is low-quality as if it's a given; it is absolutely a given. If model makers believed they were using lots of high-quality data, far from resisting transparency, they would embrace it. It would be a selling point. It might even be useful for propping up prices. High quality usually sells itself.

Their refusal to deliver any kind of data-training transparency tells you everything you need to know about their quality beliefs, and about the state of the market at the moment.