Intro to web scraping

Web scraping is used for many purposes, including archiving, trend analysis, lead harvesting, competitor insight, market research, and more.

This is going to be a simple introduction to basic web scraping and extracting data from HTML. When doing this kind of thing, it’s generally best not to abuse the servers you’re pulling data from; otherwise you could get rate limited (HTTP status 429) or blocked entirely.
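Here is a rough sketch of what backing off on a 429 could look like with curl. The helper name polite_fetch, the 5 second starting delay, and the doubling are my own assumptions for illustration, not anything a particular site requires.

```shell
# polite_fetch URL OUTFILE [MAX_TRIES]
# Hypothetical helper: saves the page to OUTFILE, prints the final HTTP
# status, and waits longer after every 429 before trying again.
polite_fetch() {
    local url="$1" out="$2" max_tries="${3:-4}"
    local delay=5 try=1 status
    while [ "$try" -le "$max_tries" ]; do
        # -s: quiet, -o: write the body to a file, -w: print only the status code
        status=$(curl -s -o "$out" -w '%{http_code}' "$url")
        if [ "$status" != "429" ]; then
            echo "$status"
            [ "$status" = "200" ]   # succeed only on a plain 200
            return
        fi
        if [ "$try" -lt "$max_tries" ]; then
            echo "Got 429, waiting ${delay}s before trying again" >&2
            sleep "$delay"
            delay=$((delay * 2))    # assumed policy: double the wait each time
        fi
        try=$((try + 1))
    done
    echo "429"
    return 1
}
```

Something like polite_fetch https://example.com/ page.html would then either leave the page in page.html or give up after a few increasingly long pauses.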

I’m going to be scraping a recipe website because it has great food by a great chef. I won’t mention it though to prevent abuse, but you can still follow along and apply this same technique anywhere as long as you can easily grab the html of the site. Also there’s a bonus at the end if you make it that far.

Getting started

I will be using common unix tools here for a simple demonstration of one way to approach this kind of task. Python and friends would be overkill for this example, and nearly everything I need is already built in on almost any (li|u)nix platform.

I like to start by seeing if I can curl the website and get the HTML I’m looking for:

curl -s https://example.com/

These days, many sites don’t return their full HTML to a plain curl request. Getting it then requires changing the user agent, headers, or cookies, or even switching to an entirely different toolset and approach. In this case, the recipe website responded with all the HTML and content of the page.
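For sites that are pickier, a sketch like the one below is usually my first attempt. The helper name fetch_like_browser, the User-Agent string, the header values, and the cookies.txt file name are all placeholder assumptions; swap in whatever your own browser sends.

```shell
# fetch_like_browser URL
# Hypothetical helper: a curl request dressed up with browser-like headers.
fetch_like_browser() {
    local url="$1"
    # -L: follow redirects, -A: User-Agent, -H: extra request header,
    # -b/-c: read cookies from and save cookies to cookies.txt
    curl -s -L \
        -A 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0' \
        -H 'Accept: text/html,application/xhtml+xml' \
        -H 'Accept-Language: en-US,en;q=0.9' \
        -b cookies.txt -c cookies.txt \
        "$url"
}
```

If this still comes back empty, the page is probably built with JavaScript and needs a different approach than curl.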

Upon closer inspection I noticed it’s a WordPress site. A giveaway is things like wp-content in the page source, which can be useful information.
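That check is easy to script. This is only a heuristic sketch: the helper name looks_like_wordpress is made up, and I am assuming that a wp-content/ or wp-includes/ path in the source is a good enough hint.

```shell
# looks_like_wordpress [FILE...]
# Hypothetical helper: succeeds if the HTML (from files or stdin) contains
# the typical WordPress asset paths.
looks_like_wordpress() {
    grep -Eq 'wp-(content|includes)/' "$@"
}
```

Used as curl -s https://example.com/ | looks_like_wordpress && echo "WordPress", it prints the message only when one of those paths shows up.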

Sitemaps

Now there are hundreds of links and recipes on this website. How do I gather them all quickly? This is where sitemaps come into play.

Sitemaps are a listing of all the pages on a website, along with metadata like last update, page priority, etc. They’re often in XML format, help search engines find their way around your site, and play a big role in SEO (search engine optimization).

Some common sitemap locations are:

  • example.com/sitemap.xml
  • example.com/sitemap_index.xml
  • example.com/sitemapindex.xml
  • example.com/sitemap-index.xml
  • example.com/sitemap/
  • example.com/sitemap1.xml
  • example.com/post-sitemap
  • example.com/post-sitemap.xml
  • example.com/page-sitemap
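Checking those locations by hand gets old fast, so here is a sketch that probes a few of them. The helper name find_sitemap, the short candidate list, and the 1 second pause are my own assumptions. It also looks at robots.txt first, since sites can declare their sitemap there on a line starting with "Sitemap:".

```shell
# find_sitemap BASE_URL
# Hypothetical helper: prints the first sitemap URL it can find.
find_sitemap() {
    local base="${1%/}" path status declared
    # robots.txt may list sitemaps on lines like "Sitemap: https://..."
    declared=$(curl -s "$base/robots.txt" | grep -i '^sitemap:' |
        sed 's/^[^:]*:[[:space:]]*//' | tr -d '\r')
    if [ -n "$declared" ]; then
        echo "$declared"
        return 0
    fi
    # otherwise try a few of the common locations, one request at a time
    for path in sitemap.xml sitemap_index.xml sitemap-index.xml post-sitemap.xml; do
        status=$(curl -s -o /dev/null -w '%{http_code}' "$base/$path")
        if [ "$status" = "200" ]; then
            echo "$base/$path"
            return 0
        fi
        sleep 1   # small pause between probes
    done
    return 1
}
```

Keep in mind that some sites answer 200 for pages that do not exist, so it is worth opening whatever this finds before trusting it.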

After saving the file and opening sitemap.xml locally I get something like this:

<url>
        <loc>https://example.com/recipes/chicken-tenders/</loc>
        <lastmod>2021-08-26T14:27:04+00:00</lastmod>
        <image:image>
                <image:loc>https://example.com/wp-content/Chicken-Tenders.jpg</image:loc>
        </image:image>
</url>
<url>
        <loc>https://example.com/recipes/how-to-cook-steak/</loc>
        <lastmod>2021-08-22T17:47:23+00:00</lastmod>
        <image:image>
                <image:loc>https://example.com/wp-content/How-to-Cook-Steak.jpg</image:loc>
        </image:image>
</url>

This is just a sample but really there are hundreds of page urls and even more image urls. What I really want is just the page urls:

grep '<loc' sitemap.xml

Which would output:

<loc>https://example.com/recipes/chicken-tenders/</loc>
<loc>https://example.com/recipes/how-to-cook-steak/</loc>

Now let’s extract the url itself:

grep -Po '(?<=<loc>).*?(?=</loc>)' sitemap.xml

Here we’re using grep’s Perl-compatible regex support (the -P flag, which GNU grep has but the BSD grep on macOS does not) to extract what’s inside the XML tags. And yes, I already know we’re not supposed to use regex to parse HTML. Oh no! I’m getting the job done.
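If your grep has no -P, the same extraction can be sketched with sed. The helper name extract_locs is made up, and it assumes each <loc> sits on its own line like in the sample above.

```shell
# extract_locs [FILE...]
# Hypothetical helper: prints the text between <loc> and </loc>.
# Lines with <image:loc> are skipped because they never contain "<loc>".
extract_locs() {
    sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' "$@"
}
```

Running extract_locs sitemap.xml should give the same list of page urls as the grep version.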

Here is the html parser version (using hq):

cat sitemap.xml | hq loc text

Done! Now we have all the urls of all the recipes on the site and we’re ready to grab the contents of every page.

Downloading every recipe

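A simple one-line for loop will do the trick (I split it up here to look pretty). We’re using wget, which is a network downloader, with a 5 second pause between downloads. wget’s -w and --random-wait flags only apply between retrievals inside a single wget run, so with one url per run the pause has to come from sleep instead. The output goes to the basename of the url, which is the title of the recipe, with an html extension.

for url in $(cat sitemap.xml | hq loc text)
do
        echo "Downloading $url..."
        # one url per wget run, so pause between runs with sleep
        wget "$url" -O "$(basename "$url").html"
        sleep 5
done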
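With hundreds of pages, the loop will probably get interrupted at some point. Here is a sketch of a version that can be restarted without downloading everything again. The helper name download_all and the skip-if-file-exists rule are my own assumptions; it reads one url per line on stdin.

```shell
# download_all [DELAY_SECONDS]
# Hypothetical helper: downloads every url read from stdin, skipping pages
# that already exist locally as a non-empty file.
download_all() {
    local delay="${1:-5}" url file
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        file="$(basename "$url").html"
        if [ -s "$file" ]; then
            echo "Skipping $url (already have $file)"
            continue
        fi
        echo "Downloading $url"
        wget -q "$url" -O "$file" || echo "Failed: $url" >&2
        sleep "$delay"   # pause between requests to go easy on the server
    done
}
```

It plugs into the same pipeline as before: cat sitemap.xml | hq loc text | download_all 5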

I’m in no rush, I can wait. In fact, I’m making a grilled cheese and caramelized onions while writing this blog post.

Grabbing the ingredients list

<div class="wprm-recipe-ingredients-container wprm-recipe-ingredients-no-images wprm-recipe-22966-ingredients-container wprm-block-text-normal wprm-ingredient-style-regular wprm-recipe-images-before" data-recipe="22966" data-servings="4"><h3 class="wprm-recipe-header wprm-recipe-ingredients-header wprm-block-text-uppercase wprm-align-left wprm-header-decoration-line wprm-header-has-actions wprm-header-has-actions" style=""><span class="ez-toc-section" id="Recipe_Ingredients_US_CustomaryMetric_1x2x3x"></span>Recipe Ingredients<div class="wprm-decoration-line" style="border-color: #e0e0e0"></div>&nbsp;<div class="wprm-unit-conversion-container wprm-unit-conversion-container-22966 wprm-unit-conversion-container-buttons wprm-block-text-normal" style="background-color: #ffffff;border-color: #9f1033;color: #9f1033;border-radius: 3px;"><button href="#" class="wprm-unit-conversion wprmpuc-active" data-system="1" data-recipe="22966" style="background-color: #9f1033;color: #ffffff;" aria-label="Change unit system to US Customary">US Customary</button><button href="#" class="wprm-unit-conversion" data-system="2" data-recipe="22966" style="background-color: #9f1033;color: #ffffff;border-left: 1px solid #9f1033;" aria-label="Change unit system to Metric">Metric</button></div>&nbsp;<div class="wprm-recipe-adjustable-servings-container wprm-recipe-adjustable-servings-22966-container wprm-toggle-container wprm-block-text-normal" style="background-color: #ffffff;border-color: #9f1033;color: #9f1033;border-radius: 3px;"><button href="#" class="wprm-recipe-adjustable-servings wprm-toggle wprm-toggle-active" data-multiplier="1" data-servings="4" data-recipe="22966" style="background-color: #9f1033;color: #ffffff;" aria-label="Adjust servings by 1x">1x</button><button href="#" class="wprm-recipe-adjustable-servings wprm-toggle" data-multiplier="2" data-servings="4" data-recipe="22966" style="background-color: #9f1033;color: #ffffff;border-left: 1px solid #9f1033;" aria-label="Adjust 
servings by 2x">2x</button><button href="#" class="wprm-recipe-adjustable-servings wprm-toggle" data-multiplier="3" data-servings="4" data-recipe="22966" style="background-color: #9f1033;color: #ffffff;border-left: 1px solid #9f1033;" aria-label="Adjust servings by 3x">3x</button></div><span class="ez-toc-section-end"></span></h3><div class="wprm-recipe-ingredient-group"><h4 class="wprm-recipe-group-name wprm-recipe-ingredient-group-name wprm-block-text-faded">For the Marinade / Sauce: (½ of this for the Marinade and ½ for the Sauce)</h4><ul class="wprm-recipe-ingredients"><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="1"><span class="wprm-recipe-ingredient-amount">1</span>&#32;<span class="wprm-recipe-ingredient-unit">cup</span>&#32;<span class="wprm-recipe-ingredient-name">Garlic Olive Oil</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="2"><span class="wprm-recipe-ingredient-amount">2</span>&#32;<span class="wprm-recipe-ingredient-unit">tablespoons</span>&#32;<span class="wprm-recipe-ingredient-name">Dijon Mustard</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="3"><span class="wprm-recipe-ingredient-amount">8</span>&#32;<span class="wprm-recipe-ingredient-name">Garlic Cloves</span>&#32;<span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-faded">chopped</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="4"><span class="wprm-recipe-ingredient-amount">2</span>&#32;<span class="wprm-recipe-ingredient-unit">teaspoons</span>&#32;<span class="wprm-recipe-ingredient-name">Red Chili Flakes</span>&#32;<span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-faded"> (carefully measured 😊)</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="5"><span class="wprm-recipe-ingredient-amount">2</span>&#32;<span class="wprm-recipe-ingredient-unit">tablespoons</span>&#32;<span 
class="wprm-recipe-ingredient-name">Dried Herbes de Provence</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="6"><span class="wprm-recipe-ingredient-amount">2</span>&#32;<span class="wprm-recipe-ingredient-unit">teaspoons</span>&#32;<span class="wprm-recipe-ingredient-name">Dried Oregano</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="7"><span class="wprm-recipe-ingredient-amount">2</span>&#32;<span class="wprm-recipe-ingredient-unit">teaspoons</span>&#32;<span class="wprm-recipe-ingredient-name">Salt</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="8"><span class="wprm-recipe-ingredient-amount">1</span>&#32;<span class="wprm-recipe-ingredient-unit">teaspoons</span>&#32;<span class="wprm-recipe-ingredient-name">Black Pepper</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="9"><span class="wprm-recipe-ingredient-name">Juice of 2 Lemons</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="10"><span class="wprm-recipe-ingredient-amount">2</span>&#32;<span class="wprm-recipe-ingredient-unit">tablespoons</span>&#32;<span class="wprm-recipe-ingredient-name">White Balsamic Vinegar</span>&#32;<span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-faded"> Chef used Sicilian Lemon White Balsamic</span></li></ul></div><div class="wprm-recipe-ingredient-group"><h4 class="wprm-recipe-group-name wprm-recipe-ingredient-group-name wprm-block-text-faded">For the Chicken and Vegetables:</h4><ul class="wprm-recipe-ingredients"><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="12"><span class="wprm-recipe-ingredient-amount">8</span>&#32;<span class="wprm-recipe-ingredient-name">Chicken Thighs, boneless and skinless</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="13"><span class="wprm-recipe-ingredient-amount">4 to 
6</span>&#32;<span class="wprm-recipe-ingredient-unit">large </span>&#32;<span class="wprm-recipe-ingredient-name">Carrots peeled cut into slices about 3/16 “(6mm) thick max</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="14"><span class="wprm-recipe-ingredient-amount">1</span>&#32;<span class="wprm-recipe-ingredient-unit">pound</span>&#32;<span class="wprm-recipe-ingredient-name">Russet Potatoes, peeled and cut into 6 to 8 wedges</span></li><li class="wprm-recipe-ingredient" style="list-style-type: disc;" data-uid="15"><span class="wprm-recipe-ingredient-name">Fresh Thyme a few sprigs (optional)</span></li></ul></div></div><div class="wprm-spacer"></div><div id="shop-with-instacart-v1" data-affiliate_id="#5065" data-affiliate_platform="recipe_widget"></div>

That right there is the single minified line of HTML that holds the ingredients list on every page. I found it by simply grepping for the first class name.

There are multiple ways to extract the ingredients, but let’s go with some of the easiest.

Using hq

grep "wprm-recipe-ingredients-container" file.html | hq span text

Although the output is not so clean:

1
cup
Garlic Olive Oil
2
tablespoons
Dijon Mustard
8
Garlic Cloves
chopped
2
teaspoons
Red Chili Flakes
(carefully measured 😊)
2
tablespoons
Dried Herbes de Provence
2
teaspoons
Dried Oregano
2
teaspoons
Salt
1
teaspoons
Black Pepper
Juice of 2 Lemons
2
tablespoons
White Balsamic Vinegar
Chef used Sicilian Lemon White Balsamic
8
Chicken Thighs, boneless and skinless
4 to 6
large
Carrots peeled cut into slices about 3/16 “(6mm) thick max
1
pound
Russet Potatoes, peeled and cut into 6 to 8 wedges
Fresh Thyme a few sprigs (optional)
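To get one ingredient per line with nothing but grep, awk, and sed, something like this sketch could work. The helper name ingredients_per_line is made up, and it assumes every ingredient is an li element whose first attribute is class="wprm-recipe-ingredient", as in the markup above.

```shell
# ingredients_per_line FILE
# Hypothetical helper: prints one "amount unit name notes" line per ingredient.
ingredients_per_line() {
    grep 'wprm-recipe-ingredients-container' "$@" |
        # put every <li ...>...</li> on its own line
        awk '{ gsub(/<li /, "\n<li "); gsub(/<\/li>/, "</li>\n"); print }' |
        # keep only the ingredient items
        grep '^<li class="wprm-recipe-ingredient"' |
        # turn &#32; into a space, strip the tags, then tidy up the spacing
        sed -e 's/&#32;/ /g' -e 's/<[^>]*>//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}
```

On the page above, the first line would come out as "1 cup Garlic Olive Oil" instead of three separate lines. The group headings are dropped, since they are h4 elements rather than list items.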

Using lynx

Lynx is a text-based web browser available on just about every (li|u)nix distro. It can render the HTML properly and dump the result to stdout.

grep "wprm-recipe-ingredients-container" file.html | lynx -localhost -stdin -dump

Recipe Ingredients

For the Marinade / Sauce: (½ of this for the Marinade and ½ for the Sauce)

     * 1 cup Garlic Olive Oil
     * 2 tablespoons Dijon Mustard
     * 8 Garlic Cloves chopped
     * 2 teaspoons Red Chili Flakes (carefully measured )
     * 2 tablespoons Dried Herbes de Provence
     * 2 teaspoons Dried Oregano
     * 2 teaspoons Salt
     * 1 teaspoons Black Pepper
     * Juice of 2 Lemons
     * 2 tablespoons White Balsamic Vinegar Chef used Sicilian Lemon White
       Balsamic

For the Chicken and Vegetables:

     * 8 Chicken Thighs, boneless and skinless
     * 4 to 6 large Carrots peeled cut into slices about 3/16 (6mm) thick
       max
     * 1 pound Russet Potatoes, peeled and cut into 6 to 8 wedges
     * Fresh Thyme a few sprigs (optional)
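To run that over every page downloaded earlier, a small loop is enough. This is a sketch: the helper name dump_all_ingredients and the choice to write a .txt file next to each .html file are my own assumptions.

```shell
# dump_all_ingredients FILE.html...
# Hypothetical helper: writes the lynx-formatted ingredients of each page
# to a matching .txt file and skips pages without an ingredients block.
dump_all_ingredients() {
    local file
    for file in "$@"; do
        if ! grep -q 'wprm-recipe-ingredients-container' "$file"; then
            echo "No ingredients block in $file, skipping" >&2
            continue
        fi
        grep 'wprm-recipe-ingredients-container' "$file" |
            lynx -localhost -stdin -dump > "${file%.html}.txt"
    done
}
```

Calling dump_all_ingredients *.html in the download directory leaves one text file per recipe.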

Look at that! All done with simple bash scripting and common unix tools that have been around for decades. The funny thing is that when I first started out, as a kid looking for information on scraping, people wrote posts about doing this with C# (in Visual Studio 🤮), and I think even C++. Talk about overkill. I guess the real hackers were busy managing infrastructure at Fortune 500 companies.

Final thoughts

As promised, here’s the bonus. I said I wouldn’t mention the site to prevent abuse, and there’s a scraping sandbox where you can practice scraping instead. But I will say the recipes from chef Jean Pierre are really great. Check out his YouTube channel. He’s a hilarious French-Italian chef and actually shows you how to cook properly.