Building Your AI Q&A Bot for Webpages Using Open Source AI Models
In today's information-rich digital landscape, navigating extensive web content can be overwhelming. Whether you're researching for a project, studying complex material, or trying to extract specific information from lengthy articles, the process can be time-consuming and inefficient. This is where an AI-powered Question-Answering (Q&A) bot becomes invaluable.

This tutorial will guide you through building a practical AI Q&A system that can analyze webpage content and answer specific questions. Instead of relying on expensive API services, we'll use open-source models from Hugging Face to create a solution that is:

- Completely free to use
- Runs in Google Colab (no local setup required)
- Customizable to your specific needs
- Built on cutting-edge NLP technology

By the end of this tutorial, you'll have a functional web Q&A system that can help you extract insights from online content more efficiently.

What We'll Build

We'll create a system that:

- Takes a URL as input
- Extracts and processes the webpage content
- Accepts natural language questions about the content
- Provides accurate, contextual answers based on the webpage

Prerequisites

- A Google account to access Google Colab
- Basic understanding of Python
- No prior machine learning knowledge required

Step 1: Setting Up the Environment

First, let's create a new Google Colab notebook. Go to Google Colab and create a new notebook.

Let's start by installing the necessary libraries:

```python
# Install required packages
!pip install transformers torch beautifulsoup4 requests
```

This installs:

- transformers: Hugging Face's library for state-of-the-art NLP models
- torch: the PyTorch deep learning framework
- beautifulsoup4: for parsing HTML and extracting web content
- requests: for making HTTP requests to webpages

Step 2: Import Libraries and Set Up Basic Functions

Now let's import all the necessary libraries and define some helper functions:

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import requests
from bs4 import BeautifulSoup
import re
import textwrap

# Check if a GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Function to extract readable text from a webpage
def extract_text_from_url(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')

        # Drop non-content elements
        for script_or_style in soup(['script', 'style', 'header', 'footer', 'nav']):
            script_or_style.decompose()

        text = soup.get_text()

        # Collapse blank lines and stray whitespace
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text = re.sub(r'\s+', ' ', text).strip()

        return text
    except Exception as e:
        print(f"Error extracting text from URL: {e}")
        return None
```

This code:

- Imports all necessary libraries
- Sets up our device (GPU if available, otherwise CPU)
- Creates a function to extract readable text content from a webpage URL

Step 3: Load the Question-Answering Model

Now let's load a pre-trained question-answering model from Hugging Face:

```python
# Load the pre-trained model and tokenizer
model_name = "deepset/roberta-base-squad2"
print(f"Loading model: {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
print("Model loaded successfully!")
```

We're using deepset/roberta-base-squad2, which is:

- Based on the RoBERTa architecture (a robustly optimized BERT approach)
- Fine-tuned on SQuAD 2.0 (the Stanford Question Answering Dataset)
- A good balance between accuracy and speed for our task
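As an optional sanity check before we wire the model into our own chunking logic in the next step, you can query it through Hugging Face's high-level pipeline API, which wraps the same model and tokenizer we just loaded. This is a minimal sketch (the toy context string is made up purely for illustration):

```python
from transformers import pipeline

# Build a QA pipeline from the model and tokenizer loaded above
qa = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
)

# A toy context, just to confirm the model answers sensibly
result = qa(
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris.",
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'Paris'}
```

The hand-rolled answer_question function we build next does essentially the same thing, but gives us explicit control over chunking long pages and scoring candidate answers.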
Step 4: Implement the Question-Answering Function

Now let's implement the core functionality: the ability to answer questions based on the extracted webpage content:

```python
def answer_question(question, context, max_length=512):
    # Leave room in each chunk for the question tokens and special tokens.
    # Note: chunks are sliced by character count while max_length is in
    # tokens, so this is a rough approximation; truncation handles overflow.
    max_chunk_size = max_length - len(tokenizer.encode(question)) - 5
    all_answers = []

    # Process the context in chunks so long pages fit the model's limit
    for i in range(0, len(context), max_chunk_size):
        chunk = context[i:i + max_chunk_size]

        inputs = tokenizer(
            question,
            chunk,
            add_special_tokens=True,
            return_tensors="pt",
            max_length=max_length,
            truncation=True
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        # Most likely start and end positions of the answer span
        answer_start = torch.argmax(outputs.start_logits)
        answer_end = torch.argmax(outputs.end_logits)

        start_score = outputs.start_logits[0][answer_start].item()
        end_score = outputs.end_logits[0][answer_end].item()
        score = start_score + end_score

        input_ids = inputs.input_ids.tolist()[0]
        tokens = tokenizer.convert_ids_to_tokens(input_ids)
        answer = tokenizer.convert_tokens_to_string(tokens[answer_start:answer_end+1])

        # Strip special tokens that may leak into the answer
        # (RoBERTa uses <s> and </s> rather than [CLS]/[SEP])
        answer = answer.replace("[CLS]", "").replace("[SEP]", "")
        answer = answer.replace("<s>", "").replace("</s>", "").strip()

        if answer and len(answer) > 2:
            all_answers.append((answer, score))

    if all_answers:
        # Return the answer with the highest confidence score
        all_answers.sort(key=lambda x: x[1], reverse=True)
        return all_answers[0][0]
    else:
        return "I couldn't find an answer in the provided content."
```

This function:

- Takes a question and the webpage content as input
- Handles long content by processing it in chunks
- Uses the model to predict the answer span (start and end positions)
- Processes multiple chunks and returns the answer with the highest confidence score

Step 5: Testing and Examples

Let's test our system with some examples. Here's the complete code:

```python
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
webpage_text = extract_text_from_url(url)

print("Sample of extracted text:")
print(webpage_text[:500] + "...")

questions = [
    "When was the term artificial intelligence first used?",
    "What are the main goals of AI research?",
    "What ethical concerns are associated with AI?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    answer = answer_question(question, webpage_text)
    print(f"Answer: {answer}")
```

This will demonstrate how the system works with real examples: the output shows a sample of the extracted text followed by the model's answer to each question.

Limitations and Future Improvements

Our current implementation has some limitations:

- It can struggle with very long webpages due to context length limitations
- The model may not understand complex or ambiguous questions
- It works best with factual content rather than opinions or subjective material

Future improvements could include:

- Implementing semantic search to better handle long documents (see the sketch after this list)
- Adding document summarization capabilities
- Supporting multiple languages
- Implementing memory of previous questions and answers
- Fine-tuning the model on specific domains (e.g., medical, legal, technical)
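To illustrate the first improvement, here is a minimal sketch of semantic search over chunks using the sentence-transformers library. This is an extra dependency not installed above, and the embedding model name (all-MiniLM-L6-v2), chunk size, and helper name are assumptions for illustration, not part of the original tutorial. Instead of running the QA model over every chunk, we embed the question and all chunks, then run QA only on the most similar ones:

```python
# Assumes: !pip install sentence-transformers (extra dependency)
from sentence_transformers import SentenceTransformer, util

# A small general-purpose embedding model; the choice is an assumption
embedder = SentenceTransformer('all-MiniLM-L6-v2')

def answer_with_semantic_search(question, context, chunk_size=500, top_k=3):
    # Split the page into fixed-size character chunks (a simple heuristic)
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]

    # Embed the question and all chunks, then rank chunks by cosine similarity
    question_emb = embedder.encode(question, convert_to_tensor=True)
    chunk_embs = embedder.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(question_emb, chunk_embs)[0]
    top_idx = scores.topk(min(top_k, len(chunks))).indices.tolist()

    # Run the extractive QA model only over the most relevant chunks,
    # reusing the answer_question function defined in Step 4
    best_context = " ".join(chunks[i] for i in top_idx)
    return answer_question(question, best_context)
```

This keeps the QA model's input small even for very long pages, at the cost of one extra embedding pass over the document.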
Conclusion

You've now successfully built an AI-powered Q&A system for webpages using open-source models. This tool can help you:

- Extract specific information from lengthy articles
- Research more efficiently
- Get quick answers from complex documents

By utilizing Hugging Face's powerful models and the flexibility of Google Colab, you've created a practical application that demonstrates the capabilities of modern NLP. Feel free to customize and extend this project to meet your specific needs.

Useful Resources

Here is the Colab Notebook.