When ChatGPT's Web Search Fails: Lessons for Data-Driven Decision Making
January 13, 2025. Last updated on January 14, 2025 by the Editorial Team.
Author: Ori Abramovsky. Originally published on Towards AI.

[Cover photo by Omar Ramadan]

ChatGPT recently gained a game-changing capability: web search. This feature allows the model to incorporate up-to-date data from the web into its responses, unlocking an incredible range of possibilities. Tasks like validating leads, conducting market analysis, or investigating recent events have become dramatically easier. Previously, a significant limitation of language models was their reliance on static, pre-trained knowledge bases. This not only made them ineffective for recent or niche topics but also made them prone to hallucination: fabricating plausible-sounding but incorrect information when asked about events or details beyond their training cutoff. With web search, this risk is significantly reduced. ChatGPT can now focus on its strength, analyzing vast amounts of information, while outsourcing data gathering to the web.

While this sounds like a perfect solution, the new capability comes with its own set of challenges. The most significant risk arises when users place blind trust in the model's outputs without verifying the underlying data. Given the immense promise of this feature, many may adopt it uncritically. To illustrate why caution is crucial, I'll share two real examples where ChatGPT's web search failed: estimating event probabilities and analyzing stocks. These examples highlight common pitfalls and demonstrate how simple adjustments can mitigate them. But before diving into the examples, let's examine the inherent weaknesses of the search ecosystem as a whole.

Why Do Search-Based LLM Responses Fail?

Search-based queries with ChatGPT follow a systematic process: a user asks a question, and ChatGPT determines whether answering it requires a web search, either by its own logic or because the user explicitly requests it. ChatGPT then generates a search query, sends it to Bing, retrieves the top results, processes them, and synthesizes a final answer based on these resources. While this workflow appears straightforward, several inherent weaknesses can lead to failure at different stages.
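To make the failure points concrete, here is a minimal sketch of that pipeline in Python. The `call_llm` and `bing_search` helpers are hypothetical stand-ins for the real model and search-engine calls (not actual OpenAI or Bing APIs), and the prompts are illustrative only.

```python
# A minimal sketch of the search-augmented pipeline described above.
# `call_llm` and `bing_search` are hypothetical placeholders, NOT the
# actual OpenAI or Bing APIs.

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call to the model."""
    raise NotImplementedError("wire up your LLM client here")

def bing_search(query: str, top_k: int = 5) -> list[dict]:
    """Stand-in for a web search; returns [{'title', 'url', 'snippet'}, ...]."""
    raise NotImplementedError("wire up your search client here")

def answer_with_search(user_question: str) -> str:
    # Step 1: decide whether the question needs fresh data from the web.
    needs_search = call_llm(
        f"Does answering this require a web search? Reply yes or no.\n{user_question}"
    ).strip().lower().startswith("yes")
    if not needs_search:
        return call_llm(user_question)

    # Step 2: generate a single search query from the user's prompt.
    query = call_llm(f"Write one web search query for: {user_question}")

    # Step 3: retrieve only the top few results, trusting the engine's ranking.
    results = bing_search(query, top_k=5)

    # Step 4: synthesize a final answer from those results alone.
    context = "\n".join(f"{r['title']}: {r['snippet']}" for r in results)
    return call_llm(
        f"Answer using only these sources:\n{context}\n\nQuestion: {user_question}"
    )
```

Each numbered step is a potential failure point: the model's query wording in step 2, the top-k cutoff in step 3, and the single-pass synthesis in step 4 all map directly to the pitfalls below.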
Let's explore these critical pitfalls:

Unoptimized User Input. The quality of an LLM's response is directly tied to how well the user structures their question. Ambiguous or poorly worded prompts can significantly degrade the output. Users also often forget that LLMs process information differently than humans: while humans might prefer a bullet-pointed list for clarity, LLMs can misinterpret this format, giving undue weight to the first items in the list. To address this, avoid long, list-heavy queries and instead break complex requests into smaller, more targeted questions.

Suboptimal Search Query Generation. Once ChatGPT determines that a web search is needed, it generates a search query based on its interpretation of the prompt. Query optimization is a complex domain, however, and these automatically generated queries often fall short; users frequently discover better phrasing through manual iteration and experimentation. To address this, a practical approach is to review the search results, refine the query offline, and then instruct ChatGPT to use the optimized version in subsequent searches.

Limited Context from Search Results. ChatGPT typically processes only the first few results of a search, assuming they are sorted by relevance. This becomes problematic when the most useful source lies outside the top results. Humans often scan a broader range to identify the best sources, understanding that relevance rankings are not always perfect. To address this, manually examine the search results and indicate which sources ChatGPT should prioritize for analysis.

Restricted Search Engine Choice. Currently, ChatGPT relies exclusively on Bing, which may not always provide the most relevant results for a specific query. Certain tasks might benefit from alternative engines, for instance Google Scholar for academic research. To address this, explicitly direct ChatGPT to prioritize specific sites or search engines relevant to your needs.

Best Practices and a Need for Caution

Many of these limitations could be mitigated by adopting a more iterative approach, involving either human review or LLM agents that break the query into smaller, distinct steps: first refine the search query, then determine which results to consider, and finally synthesize an answer (see the sketch below). While this step-by-step method aligns with LLM best practices, it requires additional effort that most users are unlikely to invest. Consequently, relying on ChatGPT's search feature as-is can lead to suboptimal or even erroneous results.
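For illustration, here is one hedged sketch of such a decomposed loop, reusing the hypothetical `call_llm` and `bing_search` stubs from the previous snippet. It is one possible shape for the idea, not ChatGPT's actual implementation.

```python
# A hedged sketch of an iterative, decomposed search loop: refine the query,
# pick sources, then synthesize. Reuses the hypothetical stubs defined above.

def iterative_answer(user_question: str, max_rounds: int = 3) -> str:
    query = call_llm(f"Write one web search query for: {user_question}")
    results = []

    # Step 1: refine the query until the model judges the results relevant.
    for _ in range(max_rounds):
        results = bing_search(query, top_k=10)
        titles = "\n".join(r["title"] for r in results)
        verdict = call_llm(
            f"Question: {user_question}\nResult titles:\n{titles}\n"
            "If these look relevant, reply OK; otherwise reply with a better query."
        )
        if verdict.strip().upper().startswith("OK"):
            break
        query = verdict.strip()  # try the model's improved query

    # Step 2: let the model pick which results to read, instead of
    # blindly taking the top few.
    snippets = "\n".join(
        f"[{i}] {r['title']}: {r['snippet']}" for i, r in enumerate(results)
    )
    picks = call_llm(
        f"Question: {user_question}\nSources:\n{snippets}\n"
        "Reply with the indices of the most trustworthy, relevant sources."
    )
    idxs = {int(tok) for tok in picks.replace(",", " ").split() if tok.isdigit()}
    chosen = [r for i, r in enumerate(results) if i in idxs]

    # Step 3: synthesize an answer only from the chosen sources.
    context = "\n".join(f"{r['title']}: {r['snippet']}" for r in chosen)
    return call_llm(
        f"Answer using only these sources:\n{context}\n\nQuestion: {user_question}"
    )
```

The point is not the specific prompts but the structure: each stage produces an inspectable artifact (a query, a shortlist of sources) that a human or a checking step can review before the final answer is written.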
This doesn't mean we need to micromanage every aspect of the search process, as that would defeat the purpose of using an LLM. Instead, users should be aware of these pitfalls and intervene when necessary, whether by refining a query, highlighting relevant sources, or adjusting search engine preferences. To illustrate the point, let's explore two real examples of ChatGPT's (GPT-4o) search failures.

Case Study 1: Estimating Movie Earnings (The Case of Kraven the Hunter)

[Figure: Kraven the Hunter estimation graph as visualized on Polymarket.com]

Polymarket, a leading platform for prediction markets, allows users to buy and sell shares tied to the likelihood of specific events, ranging from sports outcomes to a movie's box office performance. A recent example involved Kraven the Hunter, a movie whose opening weekend domestic earnings participants aimed to forecast.

A common approach to generating such estimates is to conduct a web search and synthesize insights from various sources. With ChatGPT's web search capability, one might assume this process could be streamlined, leveraging its ability to gather, analyze, and summarize data from multiple perspectives. Curious about its utility, I turned to ChatGPT for an estimate of Kraven the Hunter's opening weekend performance.

ChatGPT consulted around five online sources and provided an estimated range of $20–25 million. Based on this prediction and the market's available ranges (<$16M, $16–19M, $19–22M, $22–25M, >$25M), a logical conclusion would be to invest in the option predicting earnings won't be less than $16 million. This strategy appeared sound, even foolproof.

[Figure: ChatGPT's estimate of the movie's earnings]

However, a closer inspection of the prediction market's commentary revealed a crucial oversight. Variety, a reputable entertainment source, had projected a significantly lower range of $13–15 million, contradicting the optimistic consensus that ChatGPT had synthesized. Interestingly, while the Variety estimate appeared in Bing's search results, ChatGPT did not incorporate it into its final analysis, likely because it was ranked lower on the results list. Bing itself prioritized more optimistic sources in its summary, which influenced ChatGPT's conclusions.

[Figure: Bing's first-week estimate, not taking Variety's projection into account]

When I directly asked ChatGPT why it had disregarded Variety's estimate, it revisited the web and acknowledged the conflicting projection. While it recognized Variety as a credible source, it still chose to prioritize the higher estimates, deeming both viewpoints valid. Ultimately, the movie underperformed, grossing approximately $13 million, in line with the lower end of Variety's prediction.

[Figure: ChatGPT factoring in Variety's estimate only when explicitly asked to]

Movie Earnings Estimation: Key Takeaways

This case highlights the importance of critically assessing ChatGPT's information-gathering process. While ChatGPT excels at analyzing sources and synthesizing conclusions, its methodology for selecting which sources to consider can be suboptimal. A simple intervention, explicitly instructing ChatGPT to factor in Variety's perspective, could have mitigated the issue.

The challenge lies in knowing which sources are essential. Each domain has its own authoritative voices, and relying on users to manually highlight these sources is neither scalable nor practical. A middle-ground approach is to briefly review ChatGPT's search results to ensure no critical source has been overlooked (a small sketch of such a check follows below). While this solution is far from automated, it is feasible for occasional queries in daily tasks.

Future iterations of LLMs could autonomously verify key sources or reconcile conflicting data more effectively. Until then, users must remain vigilant, ensuring that ChatGPT's outputs are supplemented with thoughtful oversight and double-checking.
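As a minimal sketch of that middle-ground check, the snippet below compares the domains ChatGPT cited against a user-maintained list of must-check sources for the topic. The domain list and URLs are illustrative assumptions, not data from the original experiment.

```python
# A minimal sketch of the "middle-ground" review suggested above: before
# accepting a synthesized answer, verify that the domains you consider
# authoritative for the topic appear among the cited sources.
# MUST_CHECK_DOMAINS and the example URLs are illustrative assumptions.

from urllib.parse import urlparse

# Sources this (hypothetical) user trusts for box-office forecasts.
MUST_CHECK_DOMAINS = {"variety.com", "boxofficemojo.com", "deadline.com"}

def missing_key_sources(cited_urls: list[str]) -> set[str]:
    """Return the trusted domains that the answer never cited."""
    cited_domains = {urlparse(u).netloc.removeprefix("www.") for u in cited_urls}
    return MUST_CHECK_DOMAINS - cited_domains

# Example: the answer cited optimistic sources but skipped Variety.
cited = [
    "https://www.boxofficemojo.com/some-report",   # hypothetical URLs
    "https://www.deadline.com/kraven-tracking",
]
for domain in sorted(missing_key_sources(cited)):
    print(f"Warning: no citation from {domain}; ask ChatGPT to factor it in.")
```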
Case Study 2: Stock Recommendations (A Cautionary Tale)

One of the most appealing uses of ChatGPT's search capability is stock recommendations. Imagine wanting to invest your funds and needing to decide which stocks to prioritize. Traditionally, this would involve conducting market research, reviewing investment blogs, and analyzing market trend projections. ChatGPT's ability to search the web seems like a perfect way to streamline this process: delegate the research, analysis, and final recommendations to the LLM.

To simplify the case, I posed a straightforward task to ChatGPT: given a predefined list of stocks (generated by an external oracle), recommend which ones were worth investing in. On the surface, the query seemed simple enough: "Please recommend X stocks from the provided list." What could possibly go wrong?

Interestingly, the results varied significantly based on how the input was formatted. When I provided the stocks as a numbered list or as a line-separated list, ChatGPT disproportionately recommended stocks from the beginning of the list; the only way to make it consider stocks further down was to explicitly ask it to do so. However, when I presented the stocks as a single line separated by commas, ChatGPT included stocks from different positions on the list, but these recommendations turned out to be flawed.

[Figure: ChatGPT search on list input (left, considering only the first stock) vs. single-line input (right, searching several stocks)]

Upon closer examination, some of the stocks it recommended in the single-line format were based on hallucinations. For instance, ChatGPT advised investing in stock X while citing articles and data that were entirely about stock Y.

[Figure: ChatGPT relying on an irrelevant source to recommend a stock]

Stock Recommendations: Key Takeaways

This behavior highlights several critical issues:

Input Formatting Bias. Lists, whether numbered or line-separated, usually carry no intended ordering, yet inspecting the actual search queries revealed that ChatGPT gave undue weight to the initial stocks on the list; only those stocks were thoroughly considered. Humans often use lists for readability, but LLMs interpret them differently. In general, LLMs perform better when inputs are stripped of unnecessary formatting, such as lists or complex layouts, and focus on one item at a time.

Misleading References. The references ChatGPT provides often appear credible, leading users to trust its recommendations. Closer investigation, however, revealed irrelevant or tangential sources. Even diligent users would need to cross-check each source against the conclusions drawn from it.

Oversimplified Search Methodology. Unlike humans, who approach research iteratively by breaking a question into smaller, targeted searches, ChatGPT typically attempts to answer a complex question with a single search iteration. This approach often falls short on nuanced or multi-faceted research tasks, leading to incomplete or incorrect conclusions.

To avoid these pitfalls, minimize unnecessary formatting in inputs, verify that the generated search query aligns with your intended question, and cross-check that the sources match the LLM's conclusions, for example by ensuring that referenced article titles align with the claims drawn from them (a sketch of these checks follows below). Without such precautions, relying on ChatGPT for stock recommendations is no better than a lottery. This example underscores the need for critical oversight when using LLMs for decision-making in high-stakes scenarios like investments.
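The sketch below combines two of those precautions: query one stock at a time (avoiding list-position bias) and keep only sources whose titles actually mention the stock they are used to justify. `call_llm` and `bing_search` are the same hypothetical stubs as before, and the tickers are illustrative.

```python
# A hedged sketch: per-stock searches plus a crude title cross-check, so a
# recommendation for stock X can't silently lean on an article about stock Y.
# `call_llm` / `bing_search` are the hypothetical stubs from earlier snippets.

def research_stock(ticker: str) -> dict:
    """Run a dedicated search per stock instead of one search for a whole list."""
    results = bing_search(f"{ticker} stock outlook analysis", top_k=5)
    # Keep only sources whose titles mention the ticker. This is a crude
    # heuristic (many articles use the company name instead), so treat it
    # as a starting filter, not a guarantee of relevance.
    relevant = [r for r in results if ticker.lower() in r["title"].lower()]
    sources = "\n".join(f"{r['title']}: {r['snippet']}" for r in relevant)
    verdict = call_llm(
        f"Based only on these sources, is {ticker} worth investing in?\n{sources}"
    )
    return {"ticker": ticker, "sources": relevant, "verdict": verdict}

portfolio_candidates = ["AAPL", "MSFT", "NVDA"]  # hypothetical input list
reports = [research_stock(t) for t in portfolio_candidates]
```

The trade-off is cost: one search per stock is slower than a single batched query, but it gives every item equal attention and leaves a per-stock source trail that can be audited.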
Conclusion

ChatGPT and its LLM counterparts represent one of the most groundbreaking innovations of our time, with immense potential to transform how we approach information and problem-solving. However, their effectiveness is only as strong as the data sources and instructions we provide. Poorly optimized inputs directly degrade the quality of results, and even with ChatGPT's powerful capabilities, its web search process is constrained by inherent limitations, such as evaluating only a limited number of search results and relying on a single search iteration.

While LLMs give the impression of seamlessly handling complex tasks and vast amounts of input, their processes involve bottlenecks that can undermine accuracy. In the future, advancements like multi-function querying and stepwise task decomposition may allow LLMs to better understand user intent, plan their approach, and break tasks into manageable steps for improved results.

For now, it remains essential to verify the answers we receive and avoid taking them at face value. By understanding an LLM's potential weak points, we can help it navigate these challenges, whether by refining our inputs, clarifying tasks, or critically evaluating the generated outputs.

To fully harness the incredible capabilities of LLMs without falling prey to their limitations, we must strike a balance between trust and caution. Ignorance is no longer bliss; it's a liability. Being informed and proactive is the key to making the most of this remarkable technology.