<- home

Web Extraction Using LLMs 🤖

The past couple of years have seen an explosion in applying LLMs to a variety of use-cases, and they have made a lot of processes simpler. One of the most obvious use-cases has been making web extraction more contextual and easier.

Recently, I've been exploring a few spaces to build in, and one of them required contextual web extraction at scale. I went looking for existing options and found a few exciting companies like Firecrawl. I explored it and found it pretty neat: they provide clean no-code options to scrape, crawl, map and extract data from the web. However, their pricing seemed absurdly high.

I run a task of scraping products and their details as a litmus test to evaluate whether a scraping tool gets you the unstructured information an LLM needs to produce structured data. I tried scraping price, title, description and image URL data from Bluorng, an Indian Instagram brand, and Firecrawl did very well. All was good until I found their pricing plan: Firecrawl charges $99/month for 1.5 million tokens of structured web extraction. I instantly presumed Firecrawl uses GPT-4o from OpenAI, which costs $10/M tokens. However, after a bit of digging into their codebase, I found that Firecrawl employs gpt-4o-mini for structured extraction, which prices out to $0.6/M tokens. Even accounting for the heftiest of hefty cloud function and message queue bills they may have to pay, $99/month for 1.5 million tokens is a pricey bill that my Marwari roots wouldn't accept.

This is in no way to demean Firecrawl - their product is really well-designed and I really like Caleb Peffer from what I've seen of him online.

So I set out to make my own open-source alternative. What follows are my findings, along with a finished pip library and a lightweight but high-utility open-source repository for you to customise to your own needs. Overall, my minification process saw a 3-4x reduction in input tokens.

Preprocessing and Minification

One of the most important steps when scraping any site, and especially when extracting structured data from webpages, is to get rid of unneeded HTML elements. Elements like header and footer, and elements with class names like .social-media or .ad, are pretty much useless: they just eat up tokens while also putting the LLM at risk of getting confused.

I juiced up the cleaning with a little bit of regex action. It's very easy to find any elements or attributes that contain a certain value using wildcard entries such as *advertisements*, or simply to find elements by tag name like footer, and remove all such entries. The code block below, using BeautifulSoup, does exactly that:

import re
from typing import List

def process_exclude_tags(self, exclude_tags: List[str]):
    for tag in exclude_tags:
        # Wildcard entries like *advertisement* are treated as regex patterns
        # matched against element names and attribute values.
        if tag.startswith("*") and tag.endswith("*"):
            pattern = re.compile(tag[1:-1], re.I)
            elements = self.soup.find_all(
                lambda elem: elem.name and (
                    pattern.search(elem.name) or
                    any(pattern.search(f'{attr}="{value}"')
                        for attr, value in elem.attrs.items())
                )
            )
            for element in elements:
                element.decompose()
        # Everything else is treated as a CSS selector (e.g. footer, .social-media).
        else:
            for element in self.soup.select(tag):
                element.decompose()
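
For reference, here's a tiny standalone demonstration of the same wildcard idea on a made-up snippet of HTML; the markup and pattern below are illustrative only, and the real implementation lives inside the library's cleaner class:

import re
from bs4 import BeautifulSoup

# Illustrative only: a made-up fragment with an ad banner and a footer.
html = '<div class="advertisement-banner">ad</div><footer>links</footer><p>keep me</p>'
soup = BeautifulSoup(html, "html.parser")

# Wildcard-style removal: drop any element whose attributes match the pattern.
pattern = re.compile("advertisement", re.I)
for elem in soup.find_all(lambda e: e.name and any(
        pattern.search(f'{attr}="{value}"') for attr, value in e.attrs.items())):
    elem.decompose()

# Plain tag removal: drop footers entirely.
for elem in soup.select("footer"):
    elem.decompose()

print(soup)  # -> <p>keep me</p>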

A second way I found useful to reduce the number of tokens passed to the LLM was to remove attributes from the elements that remain after the above operation. Some attributes, like aria attributes, don't add useful context and can be removed. Below is the set of attributes I chose to strip in my implementation:

REMOVE_ATTRIBUTES: Set[str] = {
    'aria-labelledby', 'aria-controls', 'data-section-type',
    'data-slick-index', 'data-aspectratio', 'data-section-id',
    'aria-expanded', 'data-index', 'data-product-id', 'aria-label',
    'aria-hidden', 'data-handle', 'data-alpha', 'data-position',
    'tabindex', 'role', 'aria-disabled', 'aria-atomic', 'aria-live',
    'data-autoplay', 'data-speed', 'data-aos', 'data-aos-delay',
    'data-testid', 'data-qa', 'handle-editor-events', 'dir', 'lang',
    'width', 'height', 'type', 'fallback'
}
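
Stripping them is then a simple pass over whatever is left of the tree. Below is a minimal sketch of how that can be done with BeautifulSoup, reusing the REMOVE_ATTRIBUTES set above; the function name and parser choice are placeholders of mine rather than the library's exact internals:

from bs4 import BeautifulSoup

def strip_attributes(soup: BeautifulSoup) -> BeautifulSoup:
    # Visit every remaining tag and drop any attribute listed in
    # REMOVE_ATTRIBUTES (defined above). list() copies the keys so we can
    # delete entries while iterating.
    for element in soup.find_all(True):
        for attr in list(element.attrs):
            if attr in REMOVE_ATTRIBUTES:
                del element.attrs[attr]
    return soup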

LLM Extraction

After running through the processes above, the cleaned HTML is ready for extraction. I found not being able to choose your own models or inference providers on Firecrawl pretty frustrating, because the best part about the model race is that token costs keep dropping as the big AI labs compete with each other.

So along with the option to use OpenAI's gpt-4o models, I decided to add support for OpenRouter, which basically acts as an aggregator for all the top models and inference providers as they get released. All the user has to do is import the library and mention what model they'd like to use from OpenRouter, like so:

import asyncio
from scrapeneatly import scrape_product

async def main():
    result = await scrape_product(
        url="https://example.com/product",
        fields_to_extract=fields,  # the fields you want extracted
        provider="openai",  # or "openrouter"
        api_key="your-api-key",
        model="google/gemini-2.0-flash-001"  # any model name as listed on OpenRouter
    )
    print(result)

asyncio.run(main())
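
Under the hood, routing to OpenRouter doesn't need a separate SDK: OpenRouter exposes an OpenAI-compatible endpoint, so the same client can simply be pointed at a different base URL. Here's a rough sketch of what such a call can look like; the prompt, variable names and model are illustrative rather than the library's exact internals:

from openai import OpenAI

# OpenRouter speaks the OpenAI chat-completions protocol, so the standard
# client works once base_url points at OpenRouter.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-openrouter-api-key",
)

cleaned_html = "<div>...</div>"  # the minified HTML produced by the steps above

response = client.chat.completions.create(
    model="google/gemini-2.0-flash-001",
    messages=[
        {"role": "system", "content": "Extract the requested fields and return them as JSON."},
        {"role": "user", "content": cleaned_html},
    ],
)
print(response.choices[0].message.content)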

Model Performance Comparison

I also ran some experiments to see how different models do on different tasks. Here are the tasks and different models' responses:

Task 1: Link to an eBay Item

Prompted the model to find the retail price, shipping price and a description of the item.

OpenAI's gpt-4o's Performance:

Ebay - Submariner
description: The Rolex Submariner Mens 40mm Date 16610 SS Vintage Running Automatic Wrist Watch is a pre-owned, vintage timepiece dating back to 1999. It features a mechanical automatic movement within a 40mm stainless steel case. The black dial, sapphire crystal, and bezel show minimal to no use, while the case back has minor scuffing. The Rolex Oyster bracelet exhibits slight user wear with little overall stretch. The watch is powered by the Rolex Caliber 3135 and is well-preserved for its age.
retail_price: $7,600.00
shipping_price: $61.63

API Usage: {'prompt_tokens': 33283, 'completion_tokens': 131, 'total_tokens': 33414}

Gemini 2.0 Flash's Performance:

Data:
description: Black dial shows minimal to no use. Sapphire crystal shows minimal to no use. Bezel and case surround show minimal use overall. Case back shows minor scuffing. Bracelet shows minimal user wear with little stretch overall.
shipping_price: $61.63
retail_price: $7,600.00

Task 2: Grammy Awards Wikipedia Page

Find the band and the artist with the most awards.

OpenAI's gpt-4o's Performance:

Data:
band_with_most_awards: U2
artist_with_most_awards: Beyoncé

API Usage: {'prompt_tokens': 110602, 'completion_tokens': 23, 'total_tokens': 110625}

Gemini 2.0 Flash's Performance:

most_grammys_band: {'name': 'U2', 'grammys': '22'}
most_grammys_artist: {'name': 'Beyoncé', 'grammys': '35'}
API Usage: {'prompt_tokens': 127607, 'completion_tokens': 69, 'total_tokens': 127676}

Task 3: KNNCalcutta (Indian clothing store)

Find the product image URLs, title and description for an item.

OpenAI's gpt-4o's Performance:


Data:
description: The Bluetooth Hoodie features an oversized fit made from a blend of cotton and polyester, weighing 400 GSM. It includes overdye with part screen and part puff printed designs, ideal for those who appreciate unique streetwear style.
title: Bluetooth Hoodie
product_image_urls: https://www.knncalcutta.com/cdn/shop/files/bluetooth_hoodie_front_4000x.png?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/bluetooth_hoodie_back_4000x.png?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2225_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2223_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2216_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2190_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2198_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2193_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2164_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2115_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2161_4000x.jpg?v=1734211250, https://www.knncalcutta.com/cdn/shop/files/MG_2158_4000x.jpg?v=1734211250
            
API Usage: {'prompt_tokens': 14431, 'completion_tokens': 409, 'total_tokens': 14840}

Gemini 2.0 Flash's Performance:


title: bluetooth hoodie
description: 
  • <li>Fit : Oversized</li>
  • <li>a fine blend of cotton and polyester</li>
  • <li>400 GSM</li>
  • <li>overdye hoodie with part screen part puff printed<br/></li>
  • <li>for the ones who push the same ink</li>
image_urls: ['https://www.knncalcutta.com/cdn/shop/files/bluetooth_hoodie_front_4000x.png?v=1734211250', 'https://www.knncalcutta.com/cdn/shop/files/bluetooth_hoodie_back_4000x.png?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2225_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2223_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2216_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2190_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2198_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2193_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2164_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2115_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2161_{width}x.jpg?v=1734211250', '//www.knncalcutta.com/cdn/shop/files/MG_2158_{width}x.jpg?v=1734211250']

API Usage: {'prompt_tokens': 19251, 'completion_tokens': 613, 'total_tokens': 19864}
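
Since every run reports its token usage, it's easy to put a rough price tag on each extraction. Here's a back-of-the-envelope sketch using the Task 2 usage numbers above; the per-million-token rates are placeholder assumptions for illustration, so check the providers' current pricing pages before trusting the figures:

def estimate_cost(usage: dict, input_rate: float, output_rate: float) -> float:
    # Rates are USD per million tokens; the values passed below are
    # assumptions, not quoted prices.
    return (usage["prompt_tokens"] * input_rate
            + usage["completion_tokens"] * output_rate) / 1_000_000

gpt4o_usage = {"prompt_tokens": 110602, "completion_tokens": 23}
flash_usage = {"prompt_tokens": 127607, "completion_tokens": 69}

print(estimate_cost(gpt4o_usage, input_rate=2.50, output_rate=10.00))  # placeholder gpt-4o rates
print(estimate_cost(flash_usage, input_rate=0.10, output_rate=0.40))   # placeholder Flash rates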

Conclusion

While Flash and other small models seemed to do well, the last task was the most interesting one: large models, without any prompting, prefixed the protocol-relative URLs they found with https, while Flash failed to do so even with extensive prompting. Flash also left the templated {width} placeholder (some kind of resolution or size parameter) in the URLs instead of substituting a concrete value, and this didn't resolve with explicit prompting either. Flash-level models can also make mistakes like including raw HTML elements in a field, as in the description above.
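
A small deterministic post-processing pass can patch both URL issues after extraction. Here's a minimal sketch; the function name and the 4000px width are my own assumptions, chosen to match the gpt-4o output above:

def normalize_image_url(url: str, width: int = 4000) -> str:
    # Prefix protocol-relative URLs with https and pin the templated
    # {width} segment to a concrete size.
    if url.startswith("//"):
        url = "https:" + url
    return url.replace("{width}", str(width))

print(normalize_image_url("//www.knncalcutta.com/cdn/shop/files/MG_2225_{width}x.jpg?v=1734211250"))
# -> https://www.knncalcutta.com/cdn/shop/files/MG_2225_4000x.jpg?v=1734211250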

To conclude, I designed the scraper to be pretty lightweight, and it's doing a good bit of heavy lifting for an internal process in a research area I'm exploring. I'd love for you to try it and reach out with suggestions, improvements or flaws you find.

Cheers!