
Website Feature Engineering at Scale: PySpark, Python & Snowflake
Introduction and Problem

Imagine you're staring at a database containing thousands of merchants across multiple countries, each with its own website. Your goal? Identify the top candidates to partner with in a new business proposal. Manually browsing each site is impossible at scale, so you need an automated way to gauge "how good" each merchant's online presence is. Enter the website quality score: a numeric feature (0-10) that captures key aspects of a site's professionalism, content depth, navigability, and visible product listings with prices. By integrating this score into your machine learning pipeline, you gain a powerful signal that helps your model distinguish the highest-quality merchants and dramatically improve selection accuracy.

Table of Contents
- Introduction and Problem
- Technical Implementation
- Legal & Ethical Considerations
- Getting Started
- Fetch HTML Script in Python
- Assign a Quality Score Script in PySpark
- Conclusion
- Disclaimer

Technical Implementation

Legal & Ethical Considerations

Be a good citizen of the web. This scraper only counts words, links, images, scripts, and simple "contact/about/price" flags; it does not extract or store any private or sensitive data.

- Throttle responsibly: use modest concurrency (e.g. CONCURRENT_REQUESTS ≤ 10), insert small pauses between batches, and avoid hammering the same domain (a minimal sketch of this follows below).
- Retention policy: once you've computed your features or scores, purge raw HTML within a reasonable window (e.g. after 7-14 days).
- For very large runs, or if you plan to share extracted HTML, consider reaching out to site owners for permission or notifying them of your usage.
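To make the throttling guidance concrete, here is a minimal sketch of polite, bounded concurrency with asyncio and aiohttp. It is not the repository's actual p1_fetch_html_from_websites.py: the constant values and the helper names (fetch_one, fetch_batch, fetch_all) are illustrative assumptions.

```python
# Minimal throttling sketch -- constants and helper names are illustrative,
# not taken from p1_fetch_html_from_websites.py.
import asyncio
import time

import aiohttp

CONCURRENT_REQUESTS = 10      # modest cap on in-flight requests
BATCH_SIZE = 50               # process URLs in recoverable chunks
BATCH_PAUSE_SECONDS = 2.0     # small pause between batches
TIMEOUT = aiohttp.ClientTimeout(total=15)

async def fetch_one(session: aiohttp.ClientSession,
                    sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore guarantees at most CONCURRENT_REQUESTS run concurrently.
    async with sem:
        try:
            async with session.get(url) as resp:
                return await resp.text()
        except Exception:
            return ""  # swallow failures; empty HTML simply scores zero later

async def fetch_batch(urls: list) -> list:
    sem = asyncio.Semaphore(CONCURRENT_REQUESTS)
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        return await asyncio.gather(*(fetch_one(session, sem, u) for u in urls))

def fetch_all(urls: list) -> list:
    pages = []
    for i in range(0, len(urls), BATCH_SIZE):
        # asyncio.run creates a fresh event loop per batch, keeping batches isolated.
        pages.extend(asyncio.run(fetch_batch(urls[i:i + BATCH_SIZE])))
        time.sleep(BATCH_PAUSE_SECONDS)  # be polite between batches
    return pages
```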
Getting Started

Here's your folder structure once you clone the repository https://github.com/lucasbraga461/feat-eng-websites/ :

Code block 1. GitHub repository folder structure

```
├── src
│   ├── helpers
│   │   └── snowflake_data_fetch.py
│   ├── p1_fetch_html_from_websites.py
│   └── process_data
│       ├── s1_gather_initial_table.sql
│       └── s2_create_table_with_website_feature.sql
├── notebooks
│   └── ps_website_quality_score.ipynb
├── data
│   └── websites_initial_table.csv
├── README.md
├── requirements.txt
├── venv
├── .gitignore
└── .env
```

Your dataset should ideally live in Snowflake. If it has to be assembled from several tables, src/process_data/s1_gather_initial_table.sql gives an idea of how to prepare it; here's a snippet:

Code block 2. s1_gather_initial_table.sql

```sql
CREATE OR REPLACE TABLE DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE AS
(
    SELECT DISTINCT COUNTRY, WEBSITE_URL
    FROM DATABASE.SCHEMA.COUNTRY_ARG_DATASET
    WHERE WEBSITE_URL IS NOT NULL
)
UNION ALL
(
    SELECT DISTINCT COUNTRY, WEBSITE_URL
    FROM DATABASE.SCHEMA.COUNTRY_BRA_DATASET
    WHERE WEBSITE_URL IS NOT NULL
)
UNION ALL
(
    [...]
    SELECT DISTINCT COUNTRY, WEBSITE_URL
    FROM DATABASE.SCHEMA.COUNTRY_JAM_DATASET
    WHERE WEBSITE_URL IS NOT NULL
)
;
```

Here's what this initial table should look like:

Figure 1. Initial table

Fetch HTML Script in Python

With the data ready, this is how you call the script. Let's say your data is in Snowflake:

Code block 3. p1_fetch_html_from_websites.py using Snowflake dataset

```bash
cd ~/Document/GitHub/feat-eng-websites
python3 src/p1_fetch_html_from_websites.py -c BRA --use_snowflake
```

The Python script expects the Snowflake table to be DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE, which you can adjust for your use case in the code itself. Running it opens a window in your browser asking you to authenticate to Snowflake. Once you authenticate, it pulls the data from the designated table and proceeds to fetch the website content.

If you choose to pull the data from a CSV file instead, drop the --use_snowflake flag and call it this way:

Code block 4. p1_fetch_html_from_websites.py using CSV dataset

```bash
cd ~/Document/GitHub/feat-eng-websites
python3 src/p1_fetch_html_from_websites.py -c BRA
```

GIF 1. Running p1_fetch_html_from_websites.py

Here's why this script is more powerful at fetching website content than a basic approach (see Table 1; a short illustrative sketch follows the table):

Table 1. Advantages of this Fetch HTML script compared with a basic implementation

| Technique | Basic approach | This script (p1_fetch_html_from_websites.py) |
| --- | --- | --- |
| HTTP fetching | Blocking requests.get() calls one-by-one | Async I/O with asyncio + aiohttp to issue many requests in parallel and overlap network waits |
| User-Agent | Single default UA header for all requests | Rotate through a list of real browser UA strings to evade basic bot-detection and throttling |
| Batching | Load & process the entire URL list in one go | Split into chunks via BATCH_SIZE so you can checkpoint, limit memory use, and recover mid-run |
| Retries & timeouts | Rely on library defaults or crash on slow/unresponsive servers | Explicit MAX_RETRIES and TIMEOUT settings to retry transient failures and bound per-request wait times |
| Concurrency limit | Sequential or unbounded parallel calls (risking overload) | CONCURRENT_REQUESTS + aiohttp.TCPConnector + asyncio.Semaphore to throttle max in-flight connections |
| Event loop | Single loop reused; can hit "bound to different loop" errors when restarting | Create a fresh asyncio event loop per batch to avoid loop/semaphore binding errors and ensure isolation |
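As a rough illustration of the User-Agent rotation and bounded-retry behaviour described in Table 1, here is a hedged sketch. The UA strings, MAX_RETRIES value, backoff policy, and helper name are assumptions for illustration, not the repository's exact code.

```python
# Illustrative sketch of UA rotation + bounded retries -- values and names are
# assumptions, not copied from the repository.
import asyncio
import random

import aiohttp

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]
MAX_RETRIES = 3
TIMEOUT = aiohttp.ClientTimeout(total=15)

async def fetch_with_retries(session: aiohttp.ClientSession, url: str) -> str:
    for attempt in range(1, MAX_RETRIES + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate UA per attempt
        try:
            async with session.get(url, headers=headers) as resp:
                resp.raise_for_status()
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_RETRIES:
                return ""                      # give up gracefully after the last retry
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    return ""

async def main() -> None:
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        html = await fetch_with_retries(session, "https://example.com")
        print(len(html), "characters fetched")

if __name__ == "__main__":
    asyncio.run(main())
```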
It's generally better to store raw HTML in a proper database (Snowflake, BigQuery, Redshift, Postgres, etc.) rather than in CSV files. A single page's HTML can easily exceed spreadsheet limits (e.g. Google Sheets caps at 50,000 characters per cell), and managing hundreds of pages would bloat and slow down CSVs. While we include a CSV option here for quick demos or minimal setups, large-scale scraping and feature engineering are far more reliable and performant when run in a scalable data warehouse like Snowflake.

Once you run it for, say, BRA, ARG and JAM, this is how your data folder will look:

Code block 5. Folder structure once you ran it for ARG, BRA and JAM

```
├── data
│   ├── website_scraped_data_ARG.csv
│   ├── website_scraped_data_BRA.csv
│   ├── website_scraped_data_JAM.csv
│   └── websites_initial_table.csv
```

Refer to Figure 2 to visualize the output of the first script, i.e. the table website_scraped_data_BRA. Note that one of the columns is html_content, a very large field since it holds the entire HTML content of the website.

Figure 2. Example of table website_scraped_data_BRA generated with first python script

Assign a Quality Score Script in PySpark

Because each page's HTML can be massive, and you'll have hundreds or thousands of pages, you can't efficiently process or store all that raw text in flat files. Instead, we hand off to Snowpark, Snowflake's PySpark-style DataFrame engine, for scalable feature extraction. See notebooks/ps_website_quality_score.ipynb for a ready-to-run example: just select the Python kernel in Snowflake and import the built-in Snowpark libraries to spin up your session (see Code block 6).

Code block 6. Snowpark session setup and imports

```python
import pandas as pd
from bs4 import BeautifulSoup
import re
from tqdm import tqdm

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, lit, udf
from snowflake.snowpark.context import get_active_session

session = get_active_session()
```

Each market speaks its own language and follows different conventions, so we bundle all those rules into a simple country-specific config. For each country we define the contact/about keywords and price-pattern regexes that signal a "good" merchant site, then point the script at the corresponding Snowflake input and output tables. This makes the feature extractor fully data-driven, reusing the same code for every region with just a change of config.

Code block 7. Config file

```python
country_configs = {
    "ARG": {
        "name": "Argentina",
        "contact_keywords": ["contacto", "contáctenos", "observaciones"],
        "about_keywords": ["acerca de", "sobre nosotros", "quiénes somos"],
        "price_patterns": [r'ARS\s?\d+', r'\$\s?\d+', r'\d+\.\d{2}\s?\$'],
        "input_table": "DATABASE.SCHEMA.WEBSITE_SCRAPED_DATA_ARG",
        "output_table": "DATABASE.SCHEMA.WEBSITE_QUALITY_SCORES_ARG"
    },
    "BRA": {
        "name": "Brazil",
        "contact_keywords": ["contato", "fale conosco", "entre em contato"],
        "about_keywords": ["sobre", "quem somos", "informações"],
        "price_patterns": [r'R\$\s?\d+', r'\d+\.\d{2}\s?R\$'],
        "input_table": "DATABASE.SCHEMA.WEBSITE_SCRAPED_DATA_BRA",
        "output_table": "DATABASE.SCHEMA.WEBSITE_QUALITY_SCORES_BRA"
    },
    [...]
}
```
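To show how this config can plug into the rest of the notebook, here is a small hedged sketch: it picks one country, joins its keyword lists into the alternation regex strings that re.search in the UDF expects, and resolves the input/output tables. The helper name build_country_patterns, and the exact way df and OUTPUT_TABLE are defined, are assumptions for illustration; country_configs and session come from the previous code blocks.

```python
# Hedged usage sketch -- helper name and exact wiring are assumptions.
import re

def build_country_patterns(cfg: dict) -> dict:
    """Turn keyword lists into the regex strings the UDF matches with re.search."""
    return {
        "contact_pattern": "|".join(re.escape(k) for k in cfg["contact_keywords"]),
        "about_pattern": "|".join(re.escape(k) for k in cfg["about_keywords"]),
        "price_patterns": cfg["price_patterns"],  # already regexes, pass through
    }

country_code = "BRA"
cfg = country_configs[country_code]
patterns = build_country_patterns(cfg)

# patterns["contact_pattern"] -> "contato|fale conosco|entre em contato"
assert re.search(patterns["price_patterns"][0], "Por apenas R$ 120", re.I)

df = session.table(cfg["input_table"])   # raw HTML scraped for this country
OUTPUT_TABLE = cfg["output_table"]       # where the quality scores will be written
```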
Before we can register and use our Python scraper logic inside Snowflake, we first create a stage, a persistent storage area, by running the DDL in Code block 8. This creates a named location @STAGE_WEBSITES under your DATABASE.SCHEMA, where we'll upload the UDF package (including dependencies like BeautifulSoup and lxml). Once the stage exists, we deploy the extract_features_udf there, making it available to any Snowflake session for HTML parsing and feature extraction. Finally, we set the country_code variable to kick off the pipeline for a specific country, before looping through other country codes as needed.

Code block 8. Create a stage folder to keep the UDFs created

```python
-- CREATE STAGE DATABASE.SCHEMA.STAGE_WEBSITES;

country_code = "BRA"
```

Now, in the part of the code shown in Code block 9, we define the UDF extract_features_udf that extracts information from the HTML content. Here's what this part of the code does:

- Defines the Snowpark UDF called extract_features_udf, which lives in the Snowflake stage folder created previously
- Parses the raw HTML with the well-known BeautifulSoup library
- Extracts text features: total word count, page title length, and flags for 'contact' and 'about' pages
- Extracts structural features: number of <a> links, number of <img> images, and number of <script> tags
- Detects product listings by looking for any price pattern in the text
- Returns a dictionary of all these counts/flags, or zeros if the HTML was empty or any error occurred

Code block 9. Function to extract HTML content

```python
@udf(name="extract_features_udf",
     is_permanent=True,
     replace=True,
     stage_location="@STAGE_WEBSITES",
     packages=["beautifulsoup4", "lxml"])
def extract_features(html_content: str, CONTACT_KEYWORDS: str, ABOUT_KEYWORDS: str, PRICE_PATTERNS: list) -> dict:
    """Extracts text, structure, and product-related features from HTML."""
    if not html_content:
        return {"word_count": 0, "title_length": 0, "has_contact_page": 0,
                "has_about_page": 0, "num_links": 0, "num_images": 0,
                "num_scripts": 0, "has_price_listings": 0}
    try:
        soup = BeautifulSoup(html_content, 'lxml')
        # soup = BeautifulSoup(html_content[:MAX_HTML_SIZE], 'lxml')

        # Text features
        text = soup.get_text(" ", strip=True)
        word_count = len(text.split())
        title = soup.title.string.strip() if soup.title and soup.title.string else ""
        has_contact = bool(re.search(CONTACT_KEYWORDS, text, re.I))
        has_about = bool(re.search(ABOUT_KEYWORDS, text, re.I))

        # Structural features
        num_links = len(soup.find_all("a"))
        num_images = len(soup.find_all("img"))
        num_scripts = len(soup.find_all("script"))

        # Product listings detection
        # price_patterns = [r'€\s?\d+', r'\d+\.\d{2}\s?€', r'\$\s?\d+', r'\d+\.\d{2}\s?\$']
        has_price = any(re.search(pattern, text, re.I) for pattern in PRICE_PATTERNS)

        return {"word_count": word_count, "title_length": len(title),
                "has_contact_page": int(has_contact), "has_about_page": int(has_about),
                "num_links": num_links, "num_images": num_images,
                "num_scripts": num_scripts, "has_price_listings": int(has_price)}
    except Exception:
        return {"word_count": 0, "title_length": 0, "has_contact_page": 0,
                "has_about_page": 0, "num_links": 0, "num_images": 0,
                "num_scripts": 0, "has_price_listings": 0}
```

The final part of the PySpark notebook, Process and generate output table, does four main things.

First, it applies the UDF extract_features_udf to the raw HTML, producing a single features column that holds a small dict of counts/flags for each page, see Code block 10.

Code block 10. Applies UDF to each row

```python
df_processed = df.with_column(
    "features",
    extract_features(col("HTML_CONTENT"))
)
```

Second, it turns each key in the features dict into its own column in the DataFrame (so you get separate word_count, num_links, etc.), see Code block 11.

Code block 11. Explode that dictionary into real columns

```python
df_final = df_processed.select(
    col("WEBSITE"),
    col("features")["word_count"].alias("word_count"),
    ...,
    col("features")["has_price_listings"].alias("has_price_listings")
)
```

Third, based on business rules I defined, it builds a 0-10 score by assigning points for each feature (e.g. word-count thresholds, presence of contact/about pages, product listings), see Code block 12.

Code block 12. Compute a single quality score

```python
df_final = df_final.with_column(
    "quality_score",
    (
        (col("word_count") > 300).cast("int") * 2 +
        (col("word_count") > 100).cast("int") +
        (col("title_length") > 10).cast("int") +
        col("has_contact_page") +
        col("has_about_page") +
        (col("num_links") > 10).cast("int") +
        (col("num_images") > 5).cast("int") +
        col("has_price_listings") * 3
    )
)
```

And finally, it writes the final table back into Snowflake (replacing any existing table) so you can query or join these quality scores later.

Code block 13. Write the output table to Snowflake

```python
df_final.write.mode("overwrite").save_as_table(OUTPUT_TABLE)
```

Figure 3. Final table, output of DATABASE.SCHEMA.WEBSITE_QUALITY_SCORES_BRA
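Since the notebook is meant to be re-run for other country codes after BRA, here is a hedged sketch of what a per-country loop might look like, stitching Code blocks 10-13 together. Passing the keyword patterns to the UDF as lit() literals, and the inline regex building, are assumptions about the wiring, not a claim about the notebook's exact code; country_configs, session, and extract_features are assumed from the blocks above.

```python
# Hedged sketch of running the whole pipeline for every configured country.
# The lit()-based argument passing is an assumption for illustration.
import re

from snowflake.snowpark.functions import col, lit

FEATURE_COLS = ["word_count", "title_length", "has_contact_page", "has_about_page",
                "num_links", "num_images", "num_scripts", "has_price_listings"]

for country_code, cfg in country_configs.items():
    contact_re = "|".join(re.escape(k) for k in cfg["contact_keywords"])
    about_re = "|".join(re.escape(k) for k in cfg["about_keywords"])

    df = session.table(cfg["input_table"])  # raw scraped HTML for this country

    # Apply the UDF, then explode the returned dict into real columns.
    df_processed = df.with_column(
        "features",
        extract_features(
            col("HTML_CONTENT"),
            lit(contact_re),
            lit(about_re),
            lit(cfg["price_patterns"]),
        ),
    )
    df_final = df_processed.select(
        col("WEBSITE"),
        *[col("features")[f].alias(f) for f in FEATURE_COLS],
    ).with_column(
        "quality_score",
        (col("word_count") > 300).cast("int") * 2
        + (col("word_count") > 100).cast("int")
        + (col("title_length") > 10).cast("int")
        + col("has_contact_page")
        + col("has_about_page")
        + (col("num_links") > 10).cast("int")
        + (col("num_images") > 5).cast("int")
        + col("has_price_listings") * 3,
    )

    # Overwrite the per-country output table, e.g. WEBSITE_QUALITY_SCORES_BRA.
    df_final.write.mode("overwrite").save_as_table(cfg["output_table"])
```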
Conclusion

Once computed and stored, the website quality score becomes a straightforward input to virtually any predictive model, whether you're training a logistic regression, random forest, or deep neural network. As one of your strongest features, it quantifies a merchant's online maturity and reliability, complementing other data like sales volume or customer reviews. By combining this web-derived signal with your existing metrics, you'll be able to rank, filter, and recommend partners far more effectively, and ultimately drive better business outcomes.
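As a hedged illustration of that point, the sketch below feeds the score into a simple classifier. The merchants DataFrame, its columns, and the good_partner label are hypothetical toy data invented for the example; the article does not prescribe a specific model or dataset.

```python
# Hedged illustration only: the merchants DataFrame, its columns, and the label
# are hypothetical; the article does not prescribe a specific model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assume quality scores were pulled from the WEBSITE_QUALITY_SCORES_* tables and
# joined with other merchant metrics into one pandas DataFrame.
merchants = pd.DataFrame({
    "quality_score": [8, 2, 6, 1, 9, 4, 7, 3],
    "sales_volume":  [120, 15, 80, 10, 200, 40, 95, 22],
    "review_count":  [34, 2, 18, 1, 60, 9, 25, 4],
    "good_partner":  [1, 0, 1, 0, 1, 0, 1, 0],   # hypothetical label
})

X = merchants[["quality_score", "sales_volume", "review_count"]]
y = merchants["good_partner"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The web-derived quality_score is just another column alongside existing metrics.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```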
GitHub repository implementation: https://github.com/lucasbraga461/feat-eng-websites/

Disclaimer

The screenshots and figures in this article (e.g. of Snowflake query results) have been created by the author. None of the numbers are drawn from real business data but were manually generated for illustrative purposes. Likewise, all SQL scripts are handcrafted examples; they are not extracted from any live environment but are designed to closely resemble what a company using Snowflake might encounter.