The startup trying to turn the web into a database

@MITT shared a link

2024-12-04 01:01:32

www.technologyreview.com

A startup called Exa is pitching a new spin on generative search. It uses the tech behind large language models to return lists of results that it claims are more on point than those from its rivals, including Google and OpenAI. The aim is to turn the internet's chaotic tangle of web pages into a kind of directory, with results that are specific and precise.

Exa already provides its search engine as a back-end service to companies that want to build their own applications on top of it. Today it is launching the first consumer version of that search engine, called Websets.

"The web is a collection of data, but it's a mess," says Exa cofounder and CEO Will Bryk. "There's a Joe Rogan video over here, an Atlantic article over there. There's no organization. But the dream is for the web to feel like a database."

Websets is aimed at power users who need to look for things that other search engines aren't great at finding, such as types of people or companies. Ask it for "startups making futuristic hardware" and you get a list of specific companies hundreds long rather than hit-or-miss links to web pages that mention those terms. Google can't do that, says Bryk: "There's a lot of valuable use cases for investors or recruiters or really anyone who wants any sort of data set from the web."

Things have moved fast since MIT Technology Review broke the news in 2021 that Google researchers were exploring the use of large language models in a new kind of search engine. The idea soon attracted fierce critics. But tech companies took little notice. Three years on, giants like Google and Microsoft jostle with a raft of buzzy newcomers like Perplexity and OpenAI, which launched ChatGPT Search in October, for a piece of this hot new trend.

Exa isn't (yet) trying to outdo any of those companies. Instead, it's proposing something new. Most other search firms wrap large language models around existing search engines, using the models to analyze a user's query and then summarize the results. But the search engines themselves haven't changed much. Perplexity still directs its queries to Google Search or Bing, for example. Think of today's AI search engines as a sandwich with fresh bread but stale filling.

More than keywords

Exa provides users with familiar lists of links but uses the tech behind large language models to reinvent how search itself is done. Here's the basic idea: Google works by crawling the web and building a vast index of keywords that then get matched to users' queries. Exa crawls the web and encodes the contents of web pages into a format known as embeddings, which can be processed by large language models.

Embeddings turn words into numbers in such a way that words with similar meanings become numbers with similar values. In effect, this lets Exa capture the meaning of text on web pages, not just the keywords.

A screenshot of Websets showing results for the search: "companies; startups; US-based; healthcare focus; technical co-founder"

Large language models use embeddings to predict the next words in a sentence. Exa's search engine predicts the next link. Type "startups making futuristic hardware" and the model will come up with (real) links that might follow that phrase.

Exa's approach comes at a cost, however. Encoding pages rather than indexing keywords is slow and expensive. Exa has encoded about a billion web pages, says Bryk. That's tiny next to Google, which has indexed around a trillion. But Bryk doesn't see this as a problem: "You don't have to embed the whole web to be useful," he says. (Fun fact: "exa" means a 1 followed by 18 0s and "googol" means a 1 followed by 100 0s.)

Websets is very slow at returning results. A search can sometimes take several minutes. But Bryk claims it's worth it. "A lot of our customers started to ask for, like, thousands of results, or tens of thousands," he says. "And they were okay with going to get a cup of coffee and coming back to a huge list."

"I find Exa most useful when I don't know exactly what I'm looking for," says Andrew Gao, a computer science student at Stanford University who has used the search engine. "For instance, the query 'an interesting blog post on LLMs in finance' works better on Exa than Perplexity." But they're good at different things, he says: "I use both for different purposes."

"I think embeddings are a great way to represent entities like real-world people, places, and things," says Mike Tung, CEO of Diffbot, a company using knowledge graphs to build yet another kind of search engine. But he notes that you lose a lot of information if you try to embed whole sentences or pages of text: "Representing War and Peace as a single embedding would lose nearly all of the specific events that happened in that story, leaving just a general sense of its genre and period."

Bryk acknowledges that Exa is a work in progress. He points to other limitations, too. Exa is not as good as rival search engines if you just want to look up a single piece of information, such as the name of Taylor Swift's boyfriend or who Will Bryk is: "It'll give a lot of Polish-sounding people, because my last name is Polish and embeddings are bad at matching exact keywords," he says.

For now Exa gets around this by throwing keywords back into the mix when they're needed. But Bryk is bullish: "We're covering up the gaps in the embedding method until the embedding method gets so good that we don't need to cover up the gaps."
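For readers who want to see the difference between keyword matching and embeddings in concrete terms, here is a minimal Python sketch. It is a toy illustration, not Exa's implementation: the pages, the URLs, the three-number vectors, and the `keyword_bonus` parameter are all invented for this example, and the article does not say how Exa actually mixes keywords back in. Real embeddings come from a trained model and have far more dimensions.

```python
import math

# Hypothetical pages with hand-made 3-number "embeddings" (illustration only).
PAGES = [
    {"url": "https://example.com/robot-maker",
     "text": "a young company building robots",
     "vector": [0.85, 0.15, 0.05]},
    {"url": "https://example.com/polish-surnames",
     "text": "a list of common polish surnames",
     "vector": [0.10, 0.95, 0.20]},
    {"url": "https://example.com/will-bryk",
     "text": "will bryk is the ceo of exa",
     "vector": [0.20, 0.90, 0.10]},
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (close to 1.0 = similar meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def has_word(word, page):
    """True if the page text contains the exact word."""
    return word.lower() in page["text"].lower().split()


def keyword_search(word, pages):
    """Return URLs of pages that contain the exact word."""
    return [p["url"] for p in pages if has_word(word, p)]


def embedding_search(query_vector, pages):
    """Return URLs ranked by similarity of meaning, most similar first."""
    ranked = sorted(
        pages,
        key=lambda p: cosine_similarity(query_vector, p["vector"]),
        reverse=True,
    )
    return [p["url"] for p in ranked]


def hybrid_search(word, query_vector, pages, keyword_bonus=1.0):
    """Rank by similarity plus a fixed bonus for exact keyword hits.

    The additive bonus is an assumption made for this sketch.
    """
    def score(p):
        bonus = keyword_bonus if has_word(word, p) else 0.0
        return cosine_similarity(query_vector, p["vector"]) + bonus

    return [p["url"] for p in sorted(pages, key=score, reverse=True)]


# Made-up query vectors standing in for what a model might produce.
STARTUP_QUERY = [0.90, 0.10, 0.00]  # the word "startup"
BRYK_QUERY = [0.10, 0.95, 0.20]     # the name "bryk", landing near Polish surnames

print(keyword_search("startup", PAGES))             # exact-word lookup
print(embedding_search(STARTUP_QUERY, PAGES)[0])    # meaning-based lookup
print(embedding_search(BRYK_QUERY, PAGES)[0])       # embeddings alone
print(hybrid_search("bryk", BRYK_QUERY, PAGES)[0])  # embeddings plus keywords
```

In this toy data, no page contains the word "startup", so the keyword lookup finds nothing, while the embedding lookup still surfaces the robot company because its vector points in a similar direction. The "bryk" query shows the reverse weakness Bryk describes: embeddings alone favor the page about Polish surnames, and adding a keyword bonus pulls the exact-match page to the top.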
