Blog Details

image

Web Scraping with Cheerio: A Comprehensive Guide

Web scraping is basically an automated process of extracting data from different websites using a bot or software. Once fetched, the data is stored in a database – like an Excel sheet, JSON, or XML.

It's an outstanding method for researchers and business people who need to gather large amounts of data quickly and efficiently. For instance, an e-commerce website's owner can use web scraping to keep an eye on the pricing information of a competitor's website.

What is Cheerio?

Now, what is Cheerio all about? Well, Cheerio is JavaScript technology used for web scraping in server-side implementations, and it's designed explicitly for Node.js. It's a lightweight library that allows you to crawl web pages and extract data using CSS-style selectors. Cheerio can load HTML as a string and returns an object to extract data using its built-in methods.

One of the best things about Cheerio is that it runs directly in Node.js without a browser, making it faster and more efficient than other web scraping tools. It was first released in 2010 by Matt Mueller and has gained significant popularity among developers due to its versatility and ease of use.

Why use Cheerio for web scraping?

Let's explore the reasons to use Cheerio:

  • Its jQuery-like syntax makes HTML manipulation and data extraction easier for developers familiar with jQuery.
  • It's super easy to set up, even for beginner developers.
  • Its modularity allows us to extend it with Node modules and customize it according to our needs.
  • The best thing about Cheerio is it doesn't require any browser for data extraction. Because it's designed explicitly for Node.js, we can use it on the server side without any worries.

That's enough for an introduction to web scraping and Cheerio. Let's get our hands dirty and jump into the environment.

Getting started with Cheerio and web scraping

Prerequisites

Before diving into the code, there are a few prerequisites for learning Cheerio:

  • An IDE installed
  • Basic knowledge of JavaScript and Node.js
  • Basic knowledge of Devtools

Setting up the Cheerio environment

To set up your development environment, you'll need to install Node.js on your local machine. For that, head to the Node.js website and download the appropriate installer for your operating system.

Using the NPM package

Once Node.js is installed, you'll use a package manager called npm to install and manage all the third-party libraries in Node.js. Luckily, npm comes bundled with Node.js, so you don't need to install anything extra.

Now that you've set up Node js and npm, you can create a new Node js project using the command line interface. From there, you can add Cheerio js as a dependency and start using it to extract data and manipulate HTML easily.


                      mkdir webscraper
                      cd webscraper
                      npm init -y
                      npm install cheerio
                      

Shell commands

Use simple commands to ensure you have Node.js and Cheerio installed correctly.


                      node -v
                      npm list cheerio
                      

You can review the installation process and fix the issue if anything goes wrong, such as downloading the wrong Node.js installer, network connectivity, missing dependencies, etc.

Next, you need to specify the module format. We'll use ES modules in this tutorial because it allows us to use modern JavaScript features, such as await, that will be used in the later section of this tutorial. To specify the format, we'll go to the package.json and add the following field:

"type": "module",

Let's do that:


                      {
                        "name": "webscraper",
                        "version": "1.0.0",
                        "description": "",
                        "main": "index.js",
                        "type": "module",
                        "scripts": {
                          "test": "echo \"Error: no test specified\" && exit 1"
                        },
                        "author": "",
                        "license": "ISC",
                        "dependencies": {
                          "cheerio": "^1.0.0-rc.10"
                        }
                      }
                      

We've completed our installations and made the environment ready. Now let's learn about the Cheerio API.

The Cheerio API

The Cheerio API refers specifically to the set of methods and functions provided by the Cheerio library. These methods help you to extract and modify HTML very quickly.

The selector API in Cheerio can be accessed through the $() method, which has the following structure:


                      $(selector, [context], [root])
                      

It takes three arguments: the first is compulsory, and the other two are optional.

  • The selector argument specifies the elements you want to select from the HTML document. It can be a string, DOM element, array of elements, or Cheerio objects.
  • The optional context argument is actually the scope of where to begin looking for the target elements. It can be a string, DOM element, array of elements, or Cheerio objects. If no context is specified, it searches the entire document.
  • The optional root argument is usually the markup string you want to traverse or manipulate. It can be used to specify a different root element for the selected elements.

Once the elements have been selected, you can use other functions to manipulate them. For example, use the attr() function to get or set the value of an attribute or the text() function to get or set the text content of an element.

Cheerio Scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using a Node.js code. Supports both recursive crawling and lists of URLs.

How to scrape web pages with Cheerio

Now it's time to dive into some practical examples of using Cheerio for web scraping and HTML parsing. For that, create an index.js file in your directory or by using the command line.

touch index.js

You'll need to load the HTML or XML document you want to parse using the load function.

Load the HTML

You can load the HTML using the load function. This function takes a string containing the HTML as an argument and returns an object.


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                        <body>
                          <h1>Hello from Cheerio</h1>
                        </body>
                      `);
                      

Note: The resulting object is assigned to the variable $, which is a common convention used to refer to jQuery objects in JavaScript.

After executing this line of code, you can manipulate the HTML by calling methods on the $ object provided by Cheerio.

Cheerio selectors

Cheerio makes it easy to select elements using CSS-style selectors. It allows you to select elements based on tag, class, and attribute values.

Select elements with a tag name


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                          <body>
                             <h3>I'm learning Cheerio</h3>
                             <h3>It's super easy</h3>
                          </body>
                      `);
                      const $divItems = $('h3');
                      console.log($divItems.text());   //Output: I'm learning Cheerio It's super easy   
                      

Note: The .text() method in Cheerio is used to extract the text content of an HTML element. It's used on a Cheerio object representing a single element or a collection of elements.

Select elements with a class name


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(` 
                          <body>
                              <h3 class="classSelector"> Learning platforms: </h3>
                              <ul>
                                  <li class="classSelector">Apify</li>
                                  <li>Coursera</li>
                                  <li class="classSelector">Udemy</li>
                                  <li>Freecodecamp</li>
                              </ul>
                          </body>
                      `);
                      const $selection = $('.classSelector');    //Select the classSelector class
                      console.log($selection.text());  //Output: Learning platforms: Apify classSelector
                      

Select elements with an attribute value


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                          <body>
                            <h3>Terms and Conditions: </h3>
                            <form> 
                              <button name="Accept" type="submit">Accept</button>
                              <button name="Reject" type="submit">Reject</button>      
                            </form>
                          </body>
                      `);
                      const $selectedElements = $('[name]'); //Selects both the buttons
                      console.log($selectedElements.text()); 
                      

Note: To select the first element, we can specify the value of the attribute like $('[name=Accept]').

The attribute selectors can be used with any HTML attribute, not just data-* attributes. For example, $('img[alt="example"]') will select all img elements with an alt attribute having a value of "example".

Combining Cheerio selectors

Combining Cheerio selectors is an effective way to select specific elements from an HTML. Here are a few examples:

  • Selecting an element with a particular tag and class: $('h3 .class')
  • Selecting all the elements that match a selector: $('ul > li')
  • Selecting the next sibling element: $('p + ul')
  • Selecting all elements that match multiple selectors: $('h1, h2, h3')
  • Selecting elements based on their position in the document: $('li:nth-child(odd)') (This selects all odd-numbered li elements in the document)

Traversing the DOM

Cheerio provides methods for navigating in any direction of the selected elements. For example, find the child or parent of any selected element. Let's discuss them one by one.

find


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                        <div class="parent">
                          <p>Hello, world!</p>
                          <span>How are you?</span>
                          <p class="day">Have a nice day!</p>
                        </div>
                      `);
                      const welcome = $('.parent').find('.day');
                      console.log(welcome.text()) //Output : Have a nice day!
                      

children


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                        <div class="parent">
                          <p>Hello, world!</p>
                          <span>How are you?</span>
                          <p class="day">Have a nice day!</p>
                          <div>
                            <span class="day">
                                Granchildren of ".parent" class
                              </span>
                            </div>
                        </div>
                      `);
                      const welcome = $('.parent').children('span');
                      console.log(welcome.text()) //Output: How are you?
                      

Note: find searches for matching descendant elements at any level below the selected element, whereas children only looks for direct child elements of the selected element.

Traversing functions

There are many other methods, but we'll not go into all the details. Let's just have a quick look at these:

  • contents: This method works just like children. Additionally, it also selects the comments from the HTML string.
  • parent: It gives us the parent of the selected element.
  • parents: It gives us all the parents of the element till the root element.
  • parentsUntil: We can specify the limit of the parents using this method and how far we want to go upwards.
  • closest: It allows us to select the nearest parent element of a specific type that matches a given Cheerio selector. For example, $('.child').closest('div').
  • next and prev: The next method allows us to select the next element, and the prev method gives us the previous element.

How to loop over elements

If you recall, JavaScript methods like each, map, etc., facilitate looping over elements to perform specific operations. For example, let's look at each function and see how they work.


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                        <body>
                            <h3 class="classSelector"> Learning platforms: </h3>
                            <ul>
                                <li class="classSelector">Apify</li>
                                <li>Coursera</li>
                                <li class="classSelector">Udemy</li>
                                <li>Freecodecamp</li>
                            </ul>
                        </body>`);
                      const listItems = $('li')
                      listItems.each((index, element) => {
                         console.log($(element).text());
                       });
                      

The each method takes a callback function as an argument. It has two parameters: the first one is the index starting from zero, and the second is the current element.

Selecting elements using regular expressions

To use regular expressions with Cheerio, we can use the .match() function provided by JavaScript. The following example finds all the email addresses that have @ in an HTML document:


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                          <body>
                            <ul class="email">
                              <li>Apify@gmail.com</li>
                              <li>ondra@gmail.com</li>
                              <li>Chris</li>
                              <li>amanda@gmail.com</li>
                              <li>notAnEmail</li>
                              <li>notAnEmailAtAll</li>
                              <li>Queue</li>
                              <li>jamesbond@gmail.com</li>
                            </ul>
                          </body>`);
                      const emails = []
                      const userNames = $('.email li');  //Get the emails
                      userNames.each((index,el) => {    // Iterating over the emails
                        const regex = /@/;             //  Expression to be matched
                        const email = $(el).text().match(regex); // Match the text of each li item with the expression
                        if(email){
                          emails.push(email.input)       //Push, if the return value is not NULL
                        }
                      });
                      console.log(emails);
                      

Note: The .match() function returns an object if it matches with an expression or returns null if it doesn't. That's the reason we're using email.input to get the text from the object.

Filtering elements​

With filtering, you can select only the specific elements you want and ignore the rest. Cheerio provides several methods for filtering elements within a selection, like filter, not, has, etc. Let's go through them one by one.

filter

The filter method in Cheerio works just like the filter method of JavaScript. It lets you cherry-pick elements based on a specific selector. It's super handy when you're dealing with large amounts of data and want to narrow down the focus.


                      import cheerio from 'cheerio';
                      var $ = cheerio.load(`
                      <ul class="birds">
                        <li class="parrot">
                          <span class="sharp">Parrots</span> are beautiful birds with <span class="beaks">sharp beaks</span>
                        </li>
                    
                        <li class="sharp">
                          They are superfast.
                        </li>
                        
                        <li class="crow">
                          Crows are smart
                        </li>
                    
                      </ul>
                      `);
                      var $selectedElements = $('li .parrot');  // Get a 'li' with 'parrot' class
                      var $parrot = $('span').filter('.sharp');  //. Filter the span elements with a className 'sharp' 
                      console.log($parrot.text());
                      

not

The not method is opposite to the filter method. With this clever tool, you can easily exclude elements we don't want and focus on the important ones. It can save us much time and effort, especially when dealing with extensive data.


                      import cheerio from 'cheerio';
                      var $ = cheerio.load(`
                      <ul class="birds">
                        <li class="parrot">
                          <span class="sharp">Parrots</span> are beautiful birds with <span class="beaks">sharp beaks</span>
                        </li>
                    
                        <li class="sharp">
                          They are superfast.
                        </li>
                        
                        <li class="crow">
                          Crows are smart
                        </li>
                    
                      </ul>
                      `);
                      var $selectedElements = $('li .parrot') // Get a 'li' with 'parrot' class
                      var $parrot = $('span').not('.sharp');   // Exclude elements with a className 'sharp'
                      console.log($parrot.text());          // Output: sharp beaks
                      

has

While web scraping with Cheerio, you might want to find elements that contain a specific child element, like a span or an image. The has method works the same way. It takes a Cheerio selector as an argument and returns a child element that matches the selector.


                      import cheerio from 'cheerio';
                      const script = `
                      <div class="product">
                        <h2 class="name">Product 1</h2>
                        <span class="price">$10</span>
                        <span class="discount">20% off</span>
                      </div>
                      <div class="product">
                        <h2 class="name">Product 2</h2>
                        <span class="price">$20</span>
                      </div>
                      <div class="product">
                        <h2 class="name">Product 3</h2>
                        <span class="price">$30</span>
                        <span class="discount">10% off</span>
                      </div>`;
                      const $ = cheerio.load(script);
                      const discountedProducts =        // search for a parent element that has a child with a 
                      $('.product').has('.discount');  // class name discount
                                                     
                      console.log(discountedProducts.text())
                      //Output : Product 1
                      //          $10
                      //          20% off
                    
                      //        Product 3
                      //        $30
                      //        10% off
                      

This is handy when web scraping e-commerce websites and looking for discounted products. It allows you to filter out the products that have a discount quickly.

eq

The eq works just like an array indexing from the selected elements. It allows us to select an element from a specific index.


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                          <ul>
                            <li>Item 1</li>
                            <li>Item 2</li>
                          </ul>
                      `);
                        const secondItem = $('li').eq(1);
                        console.log(secondItem.text());    //Output:  Item 2 
                      

Note: It doesn't throw an error if we provide the index that's out of bounds; rather it simply returns nothing. For example, if we provide 4 to the .eq function, it will return nothing.

first and last

The methods first and last work the same way as accessing elements from the array. The first method returns the first element from the selection, and the last method returns the last element.


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(
                          `<ol>
                            <li>Item 1</li>
                            <li>Item 2</li>
                            <li>Item 3</li>
                          </ol>`
                        );
                        const firstItem = $('li').first();
                        const lastItem = $('li').last();
                        console.log(firstItem.text(),"  ",lastItem.text()) //Output: Item 1  Item 3
                      

What would happen if the selection contained just one element? The first and last elements would be the same.

How to extract data from HTML tags using Cheerio

We've learned to extract data from HTML documents in a simple way. However, what if the data you need to extract is large and needs to be structured? That's where Cheerio's objects come in. They allow us to extract data from the HTML in a structured way.

You can specify what data you want to extract by passing the keys and values to the object. The keys in the map object represent the names of the properties you want to create, while the values are the Cheerio selectors you'll use to extract the data.


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                          <div class="book-info">
                              <h2 class="book-title">The Great Gatsby</h2>
                              <p class="author">By F. Scott Fitzgerald</p>
                              <p class="release-date">Released: April 10, 1925</p>
                              <p class="price">Price: $12.99</p>
                          </div>
                      `);
                      const structuredData = {
                          Book_Title: $('.book-title').text(),
                          Author: $('.author').text(),
                          Release_Data:$('.release-date').text(),
                          Price: $('.price').text()
                      };
                      console.log(structuredData);
                      

We're creating a JavaScript object structuredData with properties Book_Title, Author, Release_Data, and Price. We're using the class selectors to select all the elements. The output will be an object with the extracted data:


                      {
                        Book_Title: 'The Great Gatsby',
                        Author: 'By F. Scott Fitzgerald',
                        Release_Data: 'Released: April 10, 1925',
                        Price: 'Price: $12.99'
                      }
                      

What if we have multiple books and want to get all of them from the document? Let's look into that.


                      import cheerio from 'cheerio';
                      const $ = cheerio.load(`
                          <div class="book-info">
                              <h2 class="book-title">The Great Gatsby</h2>
                              <p class="author">By F. Scott Fitzgerald</p>
                              <p class="release-date">Released: April 10, 1925</p>
                              <p class="price">Price: $12.99</p>
                          </div>
                          <div class="book-info">
                              <h2 class="book-title">To Kill a Mockingbird</h2>
                              <p class="author">By Harper Lee</p>
                              <p class="release-date">Released: July 11, 1960</p>
                              <p class="price">Price: $10.99</p>
                          </div>
                          <div class="book-info">
                              <h2 class="book-title">1984</h2>
                              <p class="author">By George Orwell</p>
                              <p class="release-date">Released: June 8, 1949</p>
                              <p class="price">Price: $9.99</p>
                          </div>
                      `);
                      const books = [];
                      const books1= $('.book-info')
                      books1.each((index,book)=>{
                          const structuredData = {
                              Book_Title: $(book).find('.book-title').text(),
                              Author: $(book).find('.author').text(),
                              Release_Data:$(book).find('.release-date').text(),
                              Price: $(book).find('.price').text()
                          };
                          books.push(structuredData)
                      }) 
                      console.log(books);
                      

The output of the code is an array of objects:


                      [
                       {
                        Book_Title: 'The Great Gatsby',
                        Author: 'By F. Scott Fitzgerald',
                        Release_Data: 'Released: April 10, 1925',
                        Price: 'Price: $12.99'
                      },
                      {
                        Book_Title: 'To Kill a Mockingbird',
                        Author: 'By Harper Lee',
                        Release_Data: 'Released: July 11, 1960',
                        Price: 'Price: $10.99'
                      },
                      {
                        Book_Title: '1984',
                        Author: 'By George Orwell',
                        Release_Data: 'Released: June 8, 1949',
                        Price: 'Price: $9.99'
                      }
                      ]
                      

This code uses the each method to iterate over a collection of selected .book-info elements. For each element, the code creates an object structuredData that contains the extracted data. The data is extracted using the find method to search for child elements with the suitable classes.

Finally, the structuredData object is pushed to the books array using the push method. This result is an array of objects, each representing a book and containing the extracted data.

Extracted data: JSON, CSV, XML

How to write extracted data in a file

What if we want to store the scraped data in our local storage? We'll use the example above to store the data in a JSON file. We can use the built-in fs module in Node. It allows us to interact with the file system on a computer. Here's how you can modify the code to write the books array to a JSON file:


                      import cheerio from 'cheerio';
                      import fs from 'fs';           // Import the "fs" module
                  
                      const $ = cheerio.load(`
                          <div class="book-info">
                              <h2 class="book-title">The Great Gatsby</h2>
                              <p class="author">By F. Scott Fitzgerald</p>
                              <p class="release-date">Released: April 10, 1925</p>
                              <p class="price">Price: $12.99</p>
                          </div>
                          <div class="book-info">
                              <h2 class="book-title">To Kill a Mockingbird</h2>
                              <p class="author">By Harper Lee</p>
                              <p class="release-date">Released: July 11, 1960</p>
                              <p class="price">Price: $10.99</p>
                          </div>
                          <div class="book-info">
                              <h2 class="book-title">1984</h2>
                              <p class="author">By George Orwell</p>
                              <p class="release-date">Released: June 8, 1949</p>
                              <p class="price">Price: $9.99</p>
                          </div>
                      `);
                      const books = [];
                      const books1= $('.book-info')
                      books1.each((index,book)=>{
                          const structuredData = {
                              Book_Title: $(book).find('.book-title').text(),
                              Author: $(book).find('.author').text(),
                              Release_Data:$(book).find('.release-date').text(),
                              Price: $(book).find('.price').text()
                          };
                          books.push(structuredData)    //push the object to the array
                      })
                      const jsonData = JSON.stringify(books); //Convert the array to JSON format
                  
                      fs.writeFile('books.json', jsonData, () => {     //Using the fs.writeFile , it's used to write data in a file
                        console.log('Data written to file'); //Display "Data written to file" in the call back function.
                      });
                      

Note: The fs.writeFile takes three arguments:

  • books.json: The file in which we will write data. If the file is already there, it will just write data in it; if it's not, it will first create the file and then write data in it.
  • jsonData: Data to be written in the file.
  • ()=>: A callback function that will be called after the data has been written to the file.

How to handle errors and exceptions with Cheerio

As a student of web scraping with Cheerio, you may encounter errors and exceptions as you work with the code. Don't worry: this is a common experience for most beginners. We're exploring how to handle these errors and exceptions so that you can become a more confident and successful web scraper developer.

To handle errors, you can use try-catch blocks in your code. A try block allows you to try a block of code when you're not sure if the code will execute or not and catch any errors that might occur. If an error occurs, the catch block will execute, allowing you to handle the error appropriately.


                      try {
                        const $ = cheerio.load(html);
                      } catch (err) {
                        console.log(err);
                      }
                      

This code uses a try-catch block to handle errors that might occur while loading an HTML document. Overall, this code ensures that the program does not crash if an error occurs during the loading of the HTML document and provides a way to handle the error.

How to use Axios with Cheerio

Have you ever wondered how to get the HTML document of a web page? Axios allows you to make HTTP requests and get the HTML document of a website.

To use Axios in your scraping project, you need to install it in your environment by entering the following command:

npm install axios

We'll use the Axios library in the following example.

Scraping multiple pages using Cheerio

It's worth noting that there's a difference between pagination and scraping multiple pages. Pagination refers to dividing content into separate pages, often with navigation links to move between them. On the other hand, scraping multiple pages involves extracting data from many pages of a website, whether they're paginated or not.

Why does this matter? When scraping paginated content, you can use the navigation links to move between pages and extract data from each page. However, when scraping multiple pages that are not paginated, you'll require a different approach to identify and extract the data you need.

Let's see how we can scrape multiple pages with Cheerio.

Scraping multiple pages with Cheerio: IMDb - Michael J. Fox page

You can see in the link that the website is using pagination, and each Actor is at a different link. You can use Cheerio to extract different Actors from different pages using a loop. Let's do this.


                      import axios from 'axios';
                      import cheerio from 'cheerio';
                      const Actors = [];    //Declare an empty actors array to store actors' data.
                      for (let i = 0; i < 5; i++) { // It means we will change the last digit of the url to get the data of another actor
                          const url = `https://www.imdb.com/name/nm000015${i}`;  //Change the last digit of the data
                          const response = await axios.get(url, { headers });   // Add headers to axios request
                          const $ = cheerio.load(response.data);
                          const name = $('h1 .sc-afe43def-1').text();        // Get the name of the actor
                          const Date_of_Birth = $('.sc-dec7a8b-2').eq(1).text()          // Get the Date of Birth
                          Actors.push({ name, Date_of_Birth });
                      }
                      console.log(Actors);
                      

But when we try to access the website to fetch data, we get a 403 error because the server has detected that the request is coming from an automated script. It's a very common anti-scraping technique used by most modern websites. So, we need to find a way to tackle this problem. One way is to use headers in our Axios request to act like an actual browser request. In this example, we'll use a user agent that indicates that the request is made from an actual browser rather than an automated script.

Let's see how we can do this:


                      import axios from 'axios';
                      import cheerio from 'cheerio';
                      const Actors = [];    //Declare an empty actors array to store actors' data.
                      for (let i = 0; i < 5; i++) { // It means we will change the last digit of the url to get the data of another actor
                          const url = `https://www.imdb.com/name/nm000015${i}`;  //Change the last digit of the data
                          const response = await axios.get(url, { headers });   // Add headers to axios request
                          const $ = cheerio.load(response.data);
                          const name = $('h1 .sc-afe43def-1').text();        // Get the name of the actor
                          const Date_of_Birth = $('.sc-dec7a8b-2').eq(1).text()          // Get the Date of Birth
                          Actors.push({ name, Date_of_Birth });
                      }
                      console.log(Actors);
                      

Challenges of web scraping with Cheerio

Like all web scraping libraries, Cheerio has its pros and cons. We have already discussed the pros of using Cheerio. Now let's discuss the challenges you might face while working with Cheerio.

Dynamic websites and JavaScript

Let's start with one of the biggest challenges of web scraping with Cheerio: dealing with dynamic websites and JavaScript. Nowadays, modern websites use dynamic content and JavaScript to update or modify the page content without needing an entire page to reload. That can make it tough to scrape the website with Cheerio alone. The content may not be fully loaded or loaded asynchronously after the initial page load.

When Cheerio loads an HTML page, it only sees the static HTML content. Any JavaScript or dynamic content that's loaded later is hidden from Cheerio. That can be a letdown if the website you're trying to scrape uses JavaScript to load or modify content.

Anti-scraping measures

Do you know why some scrapers get blocked and don't work as efficiently as they should? There are various anti-scraping techniques that block web scrapers. Let's discuss a few of these:

  • CAPTCHAs: CAPTCHAs are tests that are designed to differentiate between human users and bots. They require users to complete a task, such as typing in a sequence of letters or identifying objects in an image.
  • IP blocking: Websites block IP addresses that are associated with web scraping. This means that if a particular IP address is identified as a scraper, the website will block all requests from that IP address.
  • User-agent detection: User-agent detection is a technique that identifies the type of browser or device that is being used to access the website. Websites can use this information to identify scrapers, as many scrapers use non-standard user agents.
  • Dynamic web pages: Some websites use dynamic web pages that are generated using JavaScript. These pages can be more difficult to scrape, as the content is generated on the fly and may not be present in the page source.

As a web scraper developer, it's important to be aware of these anti-scraping measures and to take steps to avoid them. This may include using rotating proxies or user agents, cookies, implementing delays between requests, and much more. You can learn about these measures here.

Performance issues

When working with large amounts of data or complex HTML structures, using intermediate results to optimize performance is common. That means storing the results of certain operations and reusing them later rather than re-calculating them each time. jQuery can do that to optimize performance because it's designed to run in a browser with more memory.

However, Cheerio is designed to run in Node.js, which has a more limited memory than the browser. As a result, Cheerio has a hard time saving intermediate results. It can quickly run into memory issues when working with large datasets.

Without the ability to save intermediate results, Cheerio has to parse the entire HTML document to perform each operation. It can be slow and resource-intensive, especially when working with large documents.

Advanced web scraping with Cheerio

Scraping websites with dynamic content

Advanced websites load data at runtime, making it difficult to extract data using Cheerio. That's where other libraries come in to help out. Let's take a look at some of these libraries.

Puppeteer

Before we start learning Puppeteer, let's discuss where Cheerio fails. While Cheerio is an excellent tool, it fails when it comes to parsing websites that load data at runtime and need a browser to open them. Now, what Puppeteer does for us is that it helps us open the website in a browser known as a headless browser that acts like a browser, but in reality, it's not. Once the website is loaded using Puppeteer, we can use Cheerio to load its HTML and do whatever we want. Pretty amazing, right?

Here are some of the core functions provided by Puppeteer:

  • puppeteer.launch([options]): This function launches an instance of Chrome or Chromium with a set of options specified in an object. It returns a Promise that resolves to an instance of the Browser class.
  • browser.newPage(): This function creates a new blank page in the browser instance. It returns a Promise that resolves to an instance of the Page class.
  • page.goto(URL, [options]): This function navigates to the specified URL. It returns a Promise that resolves when the page is loaded.
  • page.content(): This function returns the HTML content of the current page. It returns a Promise that resolves to a string.
  • browser.close() is a function in Puppeteer that is used to close the browser instance that was opened with puppeteer.launch(). It will terminate the browser process and all of its pages.

These are just some of the functions provided by Puppeteer. There are other functions available as well.

To install Puppeteer in your environment, run the following command:

npm install puppeteer

Let's take a closer look at how this works with an example. We want to extract the Top 250 Movies from the IMDb website in this example.

Extracting the top 250 movies from the IMDb website

As we can see from the image above, all the movies are in one body tag <tbody> having a class of lister-list, and each movie is the <tr>. Now, let's start using puppeteer with Cheerio.


                      import puppeteer from 'puppeteer';
                      import cheerio from 'cheerio';
                       (async () => {
                         const browser = await puppeteer.launch(); //Launch the browser  
                         const page = await browser.newPage();    //Open a new page 
                        await page.goto('https://www.imdb.com/chart/top');   //The opened page goes to the link provided in the .goto() function
                        const html = await page.content();       //Get the content of the page using the .content() function
                        const $ = cheerio.load(html);
                        const movies = [];
                        const moviesTr = $('tbody.lister-list tr') //Select all the table rows using the Cheerio selectors
                        moviesTr.each((i, movie) => {             //Iterate over all the movies one by one
                         const title = $(movie).find('.titleColumn a').text()
                         const year = $(movie).find('.titleColumn .secondaryInfo').text()
                         const rating = $(movie).find('.ratingColumn strong').text()
                         movies.push({title, year, rating});   //Push the extracted data
                         });
                      console.log(movies);
                        await browser.close();
                      })();
                      

Playwright

Playwright is also an open-source Node.js library that was developed by the same team that developed Puppeteer. It is a powerful and versatile alternative to Puppeteer. The Puppeteer team needed a tool that could automate not just the Chromium-based browsers but other browsers like Firefox and Safari. So, they developed Playwright, which supports other browsers as well.

To install Playwright in your environment, run the following command:

npm install playwright

Let's implement the above code with Playwright.


                      import { chromium } from 'playwright'; //import chromium from playwright
                      import cheerio from 'cheerio';
                      
                      (async () => {
                      const browser = await chromium.launch(); // Launch the browser
                      const context = await browser.newContext(); // Create a new context
                      const page = await context.newPage(); // Create a new page
                      await page.goto('https://www.imdb.com/chart/top'); // Navigate to the provided URL
                      const html = await page.content(); // Get the page content
                      const $ = cheerio.load(html);
                      const movies = [];
                      const moviesTr = $('tbody.lister-list tr'); // Select all the table rows using the Cheerio selectors
                      moviesTr.each((i, movie) => { // Iterate over all the movies one by one
                      const title = $(movie).find('.titleColumn a').text();
                      const year = $(movie).find('.titleColumn .secondaryInfo').text();
                      const rating = $(movie).find('.ratingColumn strong').text();
                      movies.push({title, year, rating}); // Push the extracted data
                      });
                      console.log(movies);
                      await browser.close(); // Close the browser
                      })();
                      

Playwright is a more flexible and powerful automation framework than Puppeteer, which makes it an excellent alternative for advanced web automation.

How to implement authentication

What if a website requires authentication? That's another challenge, but don't worry: we've got you covered. It's not possible to extract data from such websites without logging in. Therefore, we need to find a way to authenticate ourselves through our code to open the page from where we can extract the data.

For this purpose, we need the help of the Puppeteer library, which will do the authentication task. And obviously, our well-known tool, Cheerio, will extract the data. With Puppeteer and Cheerio, we can automate the authentication process and extract the data we need.

Let's look at an example to understand how to implement authentication with Cheerio and Puppeteer. We'll try to authenticate the newsapi website, which looks like this:

Using Puppeteer and Cheerio to authenticate the newsapi website

Let's dive into the code and see how this magic happens:


                      import puppeteer from 'puppeteer';
                      import cheerio from 'cheerio';
                      (async () => {
                          const browser = await puppeteer.launch(); //Launch the browser using .launch() function
                          const page = await browser.newPage(); //Open a new page using .newPage() function
                          
                          await page.goto('https://newsapi.org/login'); //The opened page goes to the link provided in the .goto() function
                          
                          await page.type('#Email', 'your_email'); //Enter the email and password in the respective input fields
                          await page.type('#Password', 'you_password');   //The await waits for the website to load these fields.
                          
                          await page.click('button[type=submit]'); //Click the submit button
                          
                          await page.waitForNavigation(); //Wait for the website to finish the login process
                          await page.goto('https://newsapi.org/account'); //Go to the home page
                          //Now we can manipulate the website according to our needs
                          const html = await page.content();
                          const $ = cheerio.load(html);
                          console.log($('.mb2').text()); 
                          await browser.close();
                        })();
                      

How to handle asynchronous requests, errors, and retries

When web scraping with Cheerio, it's important to consider errors and retries to ensure our code is reliable and robust. As we saw earlier, error handling allows us to prevent and detect possible errors and whether the error occurs while fetching the website. It would require retries, and that's what we'll cover now. We'll see how things go from requesting a website, handling errors, and trying again. We've already seen the axios library for making HTTP requests. That same axios library allows us to retry the HTTP requests as well.

To use axios-retry, first we need to install it as a dependency and import it into our file.

npm install axios-retry

Let's see an example.


                      const axios = require('axios');
                      const axiosRetry = require('axios-retry');
                      axiosRetry(axios, {
                        retries: 3,
                        retryDelay: (retryCount) => {
                          return retryCount * 2000; // multiple the retry time with 2000 miliseconds
                        },
                      });
                      async function scrapeWebsite() {
                        try {
                          const response = await axios.get('link-to-the-website');
                        } catch (error) {
                          console.error(error);
                        }
                      }
                      scrapeWebsite();
                      

First, scrapeWebsite will be called, and it will try to fetch the website provided in the axios.get function. If the request fails, then the axiosRetry function will be called. It will try to fetch the data three times, and the time between each try increases to a multiple of 2,000 milliseconds. The function will return an error if it gets nothing after three attempts.

How to use a testing framework with Cheerio

It's important to test your code thoroughly before it goes into production. By testing with various inputs and outputs, we can be confident that the code works correctly and meets our requirements. Deploying code without proper testing can lead to unexpected problems, which can cause delays and additional costs to the project. Therefore, testing the code is a good practice to ensure the quality and reliability of the software.

Let's see how we can use a framework to test our Cheerio codes.

Test code with Jest

Jest is a popular framework developed to test JavaScript applications. It was developed by Facebook, and one of its key features is that it can run tests in parallel, which makes it super fast and efficient. It's easy to set up and test the codes.

Let's see how we can add Jest to our environment.

Open the terminal and enter the command below:

npm install --save-dev jest

We have passed this --save-dev optional argument to make that this dependency is only needed in the development phase.

Open the package.json and replace:

"test": ""

with the following:

"test": "node --experimental-vm-modules node_modules/.bin/jest"

Write the tests in the .test file and enter the command below to see the results:

npm test

How to write a test with Jest

Let's use some code that we wrote earlier in this Cheerio tutorial.


                      function getBookInfo(htmlString) {
                        const $ = cheerio.load(htmlString);
                        const structuredData = {
                          bookTitle: $('.book-title').text().trim(),
                          author: $('.author').text().trim(),
                          releaseDate: $('.release-date').text().trim(),
                          price: $('.price').text().trim()
                        };
                        return structuredData;
                      }
                      export default getBookInfo;
                      

We have a function named getBookInfo that accepts an HTML string as an input and returns an object named structuredData. In the end, we're exporting the function so that we can test it in our test.js file.

Let's see how well this code is working. We'll open our IDE and create a file name structureData.test.js and write the following code.


                      import getBookInfo from './index'; //Import the function getBookInfo
                      test('getBookInfo returns the correct book information', () => {
                        const htmlString = `
                          <div class="book-info"> //HTML string that will be used for testing
                            <h2 class="book-title">The Great Gatsby</h2>
                            <p class="author">By F. Scott Fitzgerald</p>
                            <p class="release-date">Released: April 10, 1925</p>
                            <p class="price">Price: $12.99</p>
                          </div>`;
                      //Declare a test having two arguments. The first argument is the description.
                      //The second argument is an anonymous function that will return if the test fails or not.
                        const bookInfo = getBookInfo(htmlString);
                        expect(bookInfo).toEqual({
                          bookTitle: 'The Great Gatsby',
                          author: 'By F. Scott Fitzgerald',
                          releaseDate: 'Released: April 10, 1925',
                          price: 'Price: $12.99'
                        });
                      });
                      //The .expect() function takes the results as an argument
                      //.toEqual() compares the results with the desired outcome.
                      

Like this, we can write as many tests using Jest as we want for our code, like missing values, extra spaces, etc.

Finding and fixing bugs in your Cheerio code

While creating scraping scripts and manipulating the DOM, we might encounter some challenges, but we have tools and strategies to make the process easier.

  • Unit tests: Testing the code before deployment and ensuring it works correctly can save us from major problems. Unit testing ensures that we get the expected results. Different technologies like Jest and Mocha are used for this.
  • Console.log: The console.log is often used for debugging programs by writing it at various positions in the code. The same is used to troubleshoot Cheerio programs.

Debugging using the console.log


                      console.log('Debugging Cheerio code');
                      

Browser developer tools

The browser developer tools can be used to inspect the DOM and spot problems in the code. Such tools include Chrome DevTools.

Best practices for web scraping with Cheerio

The following approaches can be used to optimize our code and improve performance.

  • Handling dynamic content: As we mentioned earlier, dynamic content creates a lot of issues while extracting data, so it's a good practice to always keep in mind that Cheerio may not help us in this scenario. We need to use other libraries to load the HTML and perform operations using Cheerio.
  • Handling complex selectors: It can be tricky to work on websites that use complex selectors, i.e., nested selectors. It's recommended to break down the selectors and select elements very carefully.
  • Handling version compatibility issues: Cheerio has different versions that may not be compatible with specific versions of Node.js or other libraries. Check compatibility before using Cheerio, and update to the latest version if necessary.

You can visit this great blog post to learn more about scraping websites more efficiently using Cheerio.

Alternatives to Cheerio

Let's explore some of the alternatives to web scraping with Cheerio. We'll look at the pros and cons of these libraries as well.

LibraryAdvantagesDisadvantagesMaintenance/Up-to-date?
CheerioLightweight, fast, and easy to useLimited functionality for dynamic web pagesMaintained and up-to-date
PuppeteerFull-fledged automation tool with Chrome DevTools integrationRequires more setup than Cheerio, slower, resource-intensiveMaintained and up-to-date
JSDOMLightweight, allows for easy DOM manipulationLimited functionality for dynamic web pagesMaintained and up-to-date
NightmareJSHigh-level API, supports multiple browsersSlower than Puppeteer, outdated and not maintainedOutdated and not maintained
PhantomJSLightweight, supports multiple browsersOutdated and not maintainedOutdated and not maintained
PlaywrightMulti-browser support, faster than PuppeteerRequires more setup than Cheerio, resource-intensiveMaintained and up-to-date
node-html-parserSupports parsing of dynamic HTML, easy to useLimited functionality for web automationMaintained and up-to-date
got-scrapingLightweight, fast, and easy to useDeveloped specifically to address drawbacks of modern scraping toolsMaintained and up-to-date

Conclusion

Cheerio is a robust and adaptable framework with an easy-to-use API for parsing and manipulating HTML. With its jQuery-like syntax, extracting data from web pages and manipulating the HTML to meet your needs is simple.

Although Cheerio is an excellent scraping tool in many cases, it does have its challenges, such as anti-scraping measures, dynamic websites, and performance issues. Yet, refined approaches and tools can assist you in overcoming these challenges and achieving your web scraping objectives.

As you may have observed, this blog post by no means tries to pitch Cheerio as the ultimate scraping tool. Cheerio has its fair share of shortcomings and areas to improve. Having said that, Cheerio still looks pretty promising. Also, it's always easy to switch to alternative tools after mastering it.

As the saying goes, Perpetuam Uitae Doctrina (lifelong learning). Complimenting this tutorial with practice is essential. Also, a good tutorial is neither perfect nor the only source of information on the subject. So, if you have any feedback for improvement, please let us know. Happy learning!