Hallucinations in Healthcare LLMs: Why They Happen and How to Prevent Them
Author(s): Marie
Originally published on Towards AI.
Hallucinations in Healthcare LLMs: Why They Happen and How to Prevent Them
Building Trustworthy Healthcare LLM Systems — Part 1
Image generated by the author using ChatGPT
TL;DR
LLM hallucinations: AI-generated outputs that sound convincing but contain factual errors or fabricated information — posing serious safety risks in healthcare settings.
Three main types of hallucinations: factual errors (recommending antibiotics for viral infections), fabrications (inventing non-existent studies or guidelines), misinterpretations (drawing incorrect conclusions from real data).
Root causes of hallucinations: probabilistic generation, training that rewards fluency over factual accuracy, lack of real-time verification, and stale or biased data.
Mitigation approaches: Retrieval-Augmented Generation (RAG), domain-specific fine-tuning, advanced prompting, guardrails.
This series: build a hallucination-resistant pipeline for infectious disease knowledge, starting with a PubMed Central corpus.
Hallucinations in medical LLMs aren’t just bugs — they’re safety risks.
This series walks through how to ground healthcare language models in real evidence, starting with infectious diseases.
Introduction
Large language models (LLMs) are changing how we interact with medical knowledge — summarizing research, answering clinical questions, even offering second opinions.
But they still hallucinate — and in medicine that’s a safety risk, not a quirk.
In medical domains, trust is non-negotiable.
A hallucinated answer about infectious disease management (e.g., wrong antibiotic, incorrect diagnostic criteria) can directly impact patient safety, so grounding models in verifiable evidence is mandatory.
That’s why this blog series exists.
This four-part series will show you how to build a hallucination-resistant workflow, step-by-step:
Part 1 (this post): What hallucinations are, why they happen, and how to build a domain-specific corpus from open-access medical literature
Part 2: Turn that corpus into a RAG pipeline
Part 3: Add hallucination detection metrics
Part 4: Put it all together and build a transparent interface to show users the evidence behind the LLM’s responses
What Are Hallucinations in LLMs?
Hallucinations are model-generated outputs that sound coherent and convincing but are factually wrong, unverifiable, or entirely made up.
Why They Matter in Healthcare
These errors can have serious implications in clinical settings, where they might lead to improper treatment recommendations.
A wrong recommendation could have life-or-death consequences, which is why it is critical to mitigate hallucinations by building transparent, evidence-based systems.
Image generated by the author using ChatGPT
Main Types of Hallucinations
1. Factual Errors
Factual errors happen when LLMs make incorrect claims about verifiable facts.
Using our infectious disease example, recommending antibiotics for influenza would be a type of factual error.
2. Fabrications
Fabrications involve LLMs inventing non-existent entities or information.
In the context of healthcare, for example, these could be fictional research studies, medical guidelines that don’t exist or made-up technical concepts.
3. Misinterpretations
Misinterpretation happens when an LLM takes real information but misrepresents or mis-contextualizes it.
For example, a model might reference a study that exists but draw the wrong conclusions from it.
Why LLMs Hallucinate
Large language models hallucinate because they don’t truly understand facts the way humans do; they simply predict which words should come next based on patterns observed in their training data.
When these AI systems encounter unfamiliar topics or ambiguous questions, they don’t have the ability to say “I don’t know” and instead generate confident-sounding but potentially incorrect responses.
This tendency stems from several factors:
Their training prioritizes fluent, human-like text over factual caution.
They lack real-time access to verified information sources.
They have no inherent understanding of truth versus fiction.
Conflicting information in training data can push the model to average contradictory sources.
The problem is compounded by limitations in training data that may contain outdated, biased, or inaccurate information, as well as the fundamental auto-regressive nature of how these models generate text one piece at a time.
How Can We Address Hallucinations?
There are various methods to mitigate or detect hallucinations.
Mitigation Strategies
Fine-tuning with Domain-Specific Data: A major cause of hallucination is knowledge gaps in the model’s training data.
Fine-tuning helps by introducing domain-specific knowledge and can be very effective at producing models that better understand specialized medical terminology and the nuances of clinical text.
Retrieval-Augmented Generation (RAG): This method allows the integration of external knowledge sources by retrieving relevant information before generating the answer.
It helps by grounding the model outputs in verified external sources instead of relying only on the model’s training data.
This is the method we will focus on in this series.
Other noteworthy strategies: advanced prompting methods such as Chain-of-Thought or few-shot prompting can help mitigate hallucinations by guiding the model’s answer in the right direction.
Rule-based guardrails that screen outputs before they reach users add another safety layer (a minimal sketch follows below).
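To make the guardrail idea concrete, here is a minimal, illustrative sketch: a rule that withholds answers stating drug doses without any supporting citation. The regex pattern and the policy are assumptions for this example, not part of the pipeline built in this series.

import re

# Illustrative rule (assumption for this sketch): flag dosing claims
# that arrive without citations before they reach the user.
DOSE_PATTERN = re.compile(r"\b\d+(\.\d+)?\s*(mg|mcg|g|ml)\b", re.IGNORECASE)

def guardrail_check(answer: str, has_citations: bool) -> str:
    """Return 'ok' or a reason for withholding the answer from the user."""
    if DOSE_PATTERN.search(answer) and not has_citations:
        return "blocked: dosing claim without a supporting source"
    return "ok"

print(guardrail_check("Take 500 mg amoxicillin three times daily.", has_citations=False))
# blocked: dosing claim without a supporting source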
Hallucination Detection
Source-attribution scoring: This method compares the LLM’s answer to the retrieved documents to measure how much of the answer is grounded in the sources.
Beyond flagging hallucinations, it also makes it possible to highlight the sources behind the answer, which helps build trust and transparency (see the sketch after this list).
Semantic Entropy Measurement: This method measures uncertainty about the meaning of generated responses and was developed specifically to address hallucination risk in safety-critical areas such as patient care.
Consistency-Based Methods: These rely on a self-consistency check: hallucinations can be detected by prompting the model multiple times with the same query and comparing the outputs for consistency.
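As a rough illustration of source-attribution scoring and the self-consistency check (not the metrics we will build in Part 3), here is a minimal sketch. The generate callable passed to self_consistency is an assumption for this example: any function mapping a prompt string to an answer string will do.

import re
from collections import Counter

def grounding_score(answer: str, sources: list[str]) -> float:
    """Crude source-attribution proxy: fraction of answer words that also appear in the retrieved sources."""
    words = re.findall(r"[a-z]+", answer.lower())
    if not words:
        return 0.0
    source_text = " ".join(sources).lower()
    return sum(1 for w in words if w in source_text) / len(words)

def self_consistency(generate, query: str, n: int = 5) -> float:
    """Prompt the model n times with the same query and return the share of answers
    that match the most common one (exact string match, deliberately crude)."""
    answers = [generate(query).strip().lower() for _ in range(n)]
    return Counter(answers).most_common(1)[0][1] / n

Part 3 replaces these crude proxies with proper hallucination detection metrics.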
If you’re interested in reading recent research on this topic, here are a few open-access papers worth a look:
Code Walkthrough: Downloading Medical Research from PubMed Central
To reduce hallucinations in healthcare LLMs, grounding them in reliable medical literature is critical.
Let’s start by building a corpus from one of the best sources available: PubMed Central (PMC).
This script helps you automate the retrieval of open-access medical papers, making it easy to bootstrap a dataset tailored to your task (e.g., infectious diseases).
Here’s how it works:
1. Setup and Environment
import requests
import xml.etree.ElementTree as ET
import json
import os, re, time
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv("NCBI_API_KEY")
email = os.getenv("EMAIL")
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
You’ll need to set your NCBI API key and email in a .env file.
You can still call the NCBI API without an API key, but using one unlocks higher rate limits, and it is free.
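For reference, the .env file only needs the two variables the script reads with os.getenv; the values below are placeholders.

NCBI_API_KEY=your_api_key_here
EMAIL=your.email@example.com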
2. Search PMC
Because we are interested in full texts to build our knowledge base, we should only download articles that are open access.
To do so, we first search PMC for matching articles:
# 1. Search PMC
search_url = f"{base_url}esearch.fcgi"
search_params = {
    "db": "pmc",
    "term": query,
    "retmax": max_results,
    "retmode": "json",
    "api_key": api_key,
    "email": email
}
print("Searching PMC...")
search_resp = requests.get(search_url, params=search_params)
search_resp.raise_for_status()
ids = search_resp.json()["esearchresult"]["idlist"]
This code queries PMC with your search terms (for example “infectious diseases”) and returns a list of document identifiers (PMCIDs).
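If you want to confirm the search worked before fetching anything, a quick check like this is enough (purely illustrative; the IDs will vary with your query):

print(f"Found {len(ids)} PMCIDs")
print(ids[:3])  # first few identifiers returned by esearch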
3. Fetch and Parse Articles
Now we can fetch the full texts using the PMCIDs:
# 2. Batch fetch
fetch_url = f"{base_url}efetch.fcgi"
for i in range(0, len(ids), batch_size):
    batch_ids = ids[i : i + batch_size]
    fetch_params = {
        "db": "pmc",
        "id": ",".join(batch_ids),
        "retmode": "xml",
        "api_key": api_key,
        "email": email,
    }
    time.sleep(delay)
    r = requests.get(fetch_url, params=fetch_params)
    r.raise_for_status()
Our response is an XML object, so the final step is to parse it and create a dictionary with the relevant sections: pmcid, title, abstract, full_text, publication_date, authors:
root = ET.fromstring(r.content)
for idx, article in enumerate(root.findall(".//article")):
    # Extract article details
    article_data = {
        "pmcid": f"PMC{batch_ids[idx]}",
        "title": "",
        "abstract": "",
        "full_text": "",
        "publication_date": "",
        "authors": [],
    }

    # Extract title
    title_elem = article.find(".//article-title")
    if title_elem is not None:
        article_data["title"] = "".join(title_elem.itertext()).strip()

    # Extract abstract
    abstract_parts = article.findall(".//abstract//p")
    if abstract_parts:
        article_data["abstract"] = " ".join(
            "".join(p.itertext()).strip() for p in abstract_parts
        )

    # Extract publication date
    pub_date = article.find(".//pub-date")
    if pub_date is not None:
        year = pub_date.find("year")
        month = pub_date.find("month")
        day = pub_date.find("day")
        date_parts = []
        if year is not None:
            date_parts.append(year.text)
        if month is not None:
            date_parts.append(month.text)
        if day is not None:
            date_parts.append(day.text)
        article_data["publication_date"] = "-".join(date_parts)

    # Extract authors
    author_elems = article.findall(".//contrib[@contrib-type='author']")
    for author_elem in author_elems:
        surname = author_elem.find(".//surname")
        given_names = author_elem.find(".//given-names")
        author = {}
        if surname is not None:
            author["surname"] = surname.text
        if given_names is not None:
            author["given_names"] = given_names.text
        if author:
            article_data["authors"].append(author)

    # Extract full text (combining all paragraphs)
    body = article.find(".//body")
    if body is not None:
        paragraphs = body.findall(".//p")
        article_data["full_text"] = " ".join(
            "".join(p.itertext()).strip() for p in paragraphs
        )
The data can then be saved to a JSONL file that will be used in our next step — building our RAG system.
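For reference, writing each parsed article as one JSON line is as simple as the snippet below; the full function later in this post does exactly this inside the fetch loop.

with open("pmc_articles.jsonl", "a") as f:
    f.write(json.dumps(article_data) + "\n")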
Let’s be mindful of licensing restrictions: while open access means anyone can read the content, it doesn’t mean the authors agreed to redistribution of their work.
This blog post and its code are intended for personal and educational use; if you use this function to build a dataset that will be redistributed or commercialized, make sure you comply with each article’s license agreement.
To do so, let’s define a function that will help us pull the license data from the downloaded article:
def detect_cc_license(lic_elem):
    """
    Inspect <license> … </license> for Creative Commons URLs or keywords
    and return a normalised string such as 'cc-by', 'cc-by-nc', 'cc0', or 'other'.
    """
    if lic_elem is None:
        return "other"

    # 1) gather candidate strings: any ext-link href + full text
    candidates: list[str] = []
    for link in lic_elem.findall(".//ext-link[@ext-link-type='uri']"):
        href = link.get("{http://www.w3.org/1999/xlink}href") or link.get("href")
        if href:
            candidates.append(href.lower())
    candidates.append("".join(lic_elem.itertext()).lower())

    # 2) search for CC patterns
    for text in candidates:
        if "creativecommons.org" not in text and "publicdomain" not in text:
            continue
        # order matters (most restrictive first)
        if re.search(r"by[-_]nc[-_]nd", text):
            return "cc-by-nc-nd"
        if re.search(r"by[-_]nc[-_]sa", text):
            return "cc-by-nc-sa"
        if re.search(r"by[-_]nc", text):
            return "cc-by-nc"
        if re.search(r"by[-_]sa", text):
            return "cc-by-sa"
        if "/by/" in text:
            return "cc-by"
        if "publicdomain/zero" in text or "cc0" in text or "public domain" in text:
            return "cc0"

    return "other"
Here’s a short breakdown of what the licenses mean:
Image uploaded by author
Here’s the full function for PubMed download:
def download_pmc_articles(
    query,
    max_results=100,
    batch_size=20,
    delay=0.2,
    allowed_licenses={"cc-by", "cc-by-sa", "cc0"},
    out_file="pmc_articles.jsonl",
):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

    # 1. Search PMC
    search_url = f"{base_url}esearch.fcgi"
    search_params = {
        "db": "pmc",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
        "api_key": api_key,
        "email": email
    }
    print("Searching PMC...")
    search_resp = requests.get(search_url, params=search_params)
    search_resp.raise_for_status()
    ids = search_resp.json()["esearchresult"]["idlist"]

    # 2. Batch fetch
    fetch_url = f"{base_url}efetch.fcgi"
    skipped, saved = 0, 0
    with open(out_file, "w") as f:
        for i in range(0, len(ids), batch_size):
            batch_ids = ids[i:i+batch_size]
            fetch_params = {
                "db": "pmc",
                "id": ",".join(batch_ids),
                "retmode": "xml",
                "api_key": api_key,
                "email": email
            }
            time.sleep(delay)
            r = requests.get(fetch_url, params=fetch_params)
            r.raise_for_status()
            root = ET.fromstring(r.content)

            for idx, article in enumerate(root.findall(".//article")):
                # Check license
                license = detect_cc_license(article.find(".//license"))
                if license not in allowed_licenses:
                    skipped += 1
                    continue  # skip disallowed license

                # Extract article details
                article_data = {
                    "pmcid": f"PMC{batch_ids[idx]}",
                    "title": "",
                    "abstract": "",
                    "full_text": "",
                    "publication_date": "",
                    "authors": []
                }

                # Extract title
                title_elem = article.find(".//article-title")
                if title_elem is not None:
                    article_data["title"] = "".join(title_elem.itertext()).strip()

                # Extract abstract
                abstract_parts = article.findall(".//abstract//p")
                if abstract_parts:
                    article_data["abstract"] = " ".join("".join(p.itertext()).strip() for p in abstract_parts)

                # Extract publication date
                pub_date = article.find(".//pub-date")
                if pub_date is not None:
                    year = pub_date.find("year")
                    month = pub_date.find("month")
                    day = pub_date.find("day")
                    date_parts = []
                    if year is not None:
                        date_parts.append(year.text)
                    if month is not None:
                        date_parts.append(month.text)
                    if day is not None:
                        date_parts.append(day.text)
                    article_data["publication_date"] = "-".join(date_parts)

                # Extract authors
                author_elems = article.findall(".//contrib[@contrib-type='author']")
                for author_elem in author_elems:
                    surname = author_elem.find(".//surname")
                    given_names = author_elem.find(".//given-names")
                    author = {}
                    if surname is not None:
                        author["surname"] = surname.text
                    if given_names is not None:
                        author["given_names"] = given_names.text
                    if author:
                        article_data["authors"].append(author)

                # Extract full text (combining all paragraphs)
                body = article.find(".//body")
                if body is not None:
                    paragraphs = body.findall(".//p")
                    article_data["full_text"] = " ".join("".join(p.itertext()).strip() for p in paragraphs)

                f.write(json.dumps(article_data) + "\n")
                saved += 1

            print(f"Saved batch {i//batch_size + 1}")

    print(f"Downloaded {saved} articles to {out_file}, {skipped} articles removed by license filter")
Now you can call your function with your query to create your corpus.
For example:
# Install packages if needed:
# pip install python-dotenv requests

query = 'bacterial pneumonia treatment'
max_results = 500
batch_size = 50
download_pmc_articles(query, max_results, batch_size)
And that’s it! All your articles are now saved in a JSONL file, ready to be processed for RAG.
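As a quick sanity check before Part 2, you can load the corpus back in. This is a minimal sketch, assuming the default output file name used above:

import json

articles = []
with open("pmc_articles.jsonl") as f:
    for line in f:
        articles.append(json.loads(line))

print(f"Loaded {len(articles)} articles")
if articles:
    print(articles[0]["title"])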
What’s Next: Preparing the Data for RAG
In Part 2, we’ll take the domain-specific corpus you just built and use it to power a Retrieval-Augmented Generation (RAG) system — grounding your LLM in real evidence to reduce hallucinations and improve trust.
Published via Towards AI
Source:
https://towardsai.net/p/artificial-intelligence/hallucinations-in-healthcare-llms-why-they-happen-and-how-to-prevent-them