A Step-by-Step Guide to Building a Trend Finder Tool with Python: Web Scraping, NLP (Sentiment Analysis & Topic Modeling), and Word Cloud Visualization
Monitoring and extracting trends from web content has become essential for market research, content creation, and staying ahead in your field. In this tutorial, we provide a practical guide to building your own trend-finding tool with Python. Without needing external APIs or complex setups, you'll learn how to scrape publicly accessible websites, apply powerful NLP (Natural Language Processing) techniques like sentiment analysis and topic modeling, and visualize emerging trends using dynamic word clouds.
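Before starting, it helps to install the third-party libraries the tutorial relies on. A minimal setup cell for a notebook environment such as Colab; the package names are inferred from the imports used below (only textblob is installed explicitly in the original code, and most of the others come preinstalled on Colab):

```python
# Install the libraries used in this tutorial (names inferred from the
# imports below; skip any that are already available in your environment).
!pip install requests beautifulsoup4 nltk textblob scikit-learn wordcloud matplotlib
```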
```python
import requests
from bs4 import BeautifulSoup

# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]

collected_texts = []  # to store text from each page

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract all paragraph text
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")
```

First, with the above code snippet, we demonstrate a straightforward way to scrape textual data from publicly accessible websites using Python's requests and BeautifulSoup. It fetches content from the specified URLs, extracts paragraphs from the HTML, and prepares them for further NLP analysis by combining the text data into structured strings.

```python
import re
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

cleaned_texts = []
for text in collected_texts:
    # Remove non-alphabetical characters and lower the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stopwords
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))
```

Then, we clean the scraped text by converting it to lowercase, removing punctuation and special characters, and filtering out common English stopwords using NLTK. This preprocessing ensures the text data is clean, focused, and ready for meaningful NLP analysis.

```python
from collections import Counter

# Combine all texts into one if analyzing overall trends:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10)  # top 10 frequent words
print("Top 10 keywords:", common_words)
```

Now, we calculate word frequencies from the cleaned textual data, identifying the top 10 most frequent keywords. This helps highlight dominant trends and recurring themes across the collected documents, providing immediate insight into popular or significant topics within the scraped content.

```python
!pip install textblob
from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive"
    elif polarity < -0.1:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    print(f"Document {i} Sentiment: {sentiment} (polarity={polarity:.2f})")
```

We perform sentiment analysis on each cleaned text document using TextBlob, a Python library built on top of NLTK. It evaluates the overall emotional tone of each document (positive, negative, or neutral) and prints the sentiment along with a numerical polarity score, providing a quick indication of the general mood or attitude within the text data.
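If you want a single corpus-level reading rather than per-document labels, one possible extension (not part of the original walkthrough) is to average the polarity scores across all documents:

```python
# Hypothetical extension: average per-document polarity into one
# corpus-level mood score (TextBlob polarity ranges from -1.0 to 1.0).
from textblob import TextBlob

avg_polarity = sum(TextBlob(t).sentiment.polarity for t in cleaned_texts) / len(cleaned_texts)
print(f"Average corpus polarity: {avg_polarity:.2f}")
```

Averaging is a blunt aggregate; for trend tracking you would typically compute it once per scrape run and compare the values across runs over time.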
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Adjust these parameters
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)

# Fit LDA to find topics (for instance, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx + 1}: ",
          [feature_names[i] for i in topic.argsort()[:-11:-1]])
```

Then, we apply Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, to discover underlying topics in the text corpus. It first transforms the cleaned texts into a numerical document-term matrix using scikit-learn's CountVectorizer, then fits an LDA model to identify the primary themes. The output lists the top keywords for each discovered topic, concisely summarizing key concepts in the collected data.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Preprocess and clean the text:
cleaned_texts = []
for text in collected_texts:
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

# Generate combined text
combined_text = " ".join(cleaned_texts)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white',
                      colormap='viridis').generate(combined_text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()
```

Finally, we generate a word cloud visualization displaying prominent keywords from the combined and cleaned text data. By visually emphasizing the most frequent and relevant terms, this approach allows for intuitive exploration of the main trends and themes in the collected web content. (The cleaning steps are repeated here so the snippet can run on its own.)

[Figure: Word Cloud Output from the Scraped Site]

In conclusion, we've successfully built a robust and interactive trend-finding tool. This exercise equipped you with hands-on experience in web scraping, NLP analysis, topic modeling, and intuitive visualization using word clouds. With this powerful yet straightforward approach, you can continuously track industry trends, gain valuable insights from social and blog content, and make informed decisions based on real-time data.

Here is the Colab Notebook.