Run `node <YourFileNameHere>.js`. After opening up your browser and navigating to `localhost:3000`, you will see some text saying "Hello World". Node.js is ideal for applications that are I/O intensive.

The request library can be installed with `npm install request`.
Axios is a similar, promise-based HTTP client. With Axios, the request logic can live in an asynchronous function such as `getForum()`. You can find the Axios library on GitHub, and installing Axios is as simple as `npm install axios`.
SuperAgent can be installed in much the same way, by running `npm install superagent`.
`.match()` usually returns an array with everything that matches the regular expression. In the second element (at index 1), you will find the `textContent` or the `innerHTML` of the `<label>` tag, which is what we want. But this result contains some unwanted text ("Username: ") that has to be removed.

To follow along with the next example, install Cheerio and Axios by running `npm install cheerio axios`.
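To make that concrete, here is a minimal sketch; the `<label>` markup is a made-up example:

```javascript
// A hypothetical snippet of page HTML containing the label we want.
const html = "<p><label>Username: johndoe</label></p>";

// match() returns the full match at index 0 and the captured group at index 1.
const result = html.match(/<label>(.+)<\/label>/);
console.log(result[1]); // "Username: johndoe"

// Strip the unwanted "Username: " prefix.
const username = result[1].replace('Username: ', '');
console.log(username); // "johndoe"
```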
Next, create a new file called `crawler.js` for the crawler code. `getPostTitles()` is an asynchronous function that will crawl Reddit's old r/programming forum. First, the HTML of the website is obtained using a simple HTTP GET request with the axios HTTP client library. Then the HTML data is fed into Cheerio using the `cheerio.load()` function.

The selector `$('div > p.title > a')` is probably familiar: it gets all the post links. Since you only want the title of each post individually, you have to loop through each post. This is done with the help of the `each()` function, whose callback receives each matching element (`el` refers to the current element). Then, calling `text()` on each element will give you the text. To run the script, type `node crawler.js` into your terminal.
You'll then see an array of about 25 or 26 post titles (it will be quite long). While this is a simple use case, it demonstrates the simple nature of the API provided by Cheerio.

Next, install JSDOM and Axios by running `npm install jsdom axios`.
Then create a new `crawler.js` for this version of the script. `upvoteFirstPost()` is an asynchronous function that will obtain the first post in r/programming and upvote it. To do this, axios sends an HTTP GET request to fetch the HTML of the URL specified. Then a new DOM is created by feeding in the HTML that was fetched earlier. By default, JSDOM runs the page in a sandboxed `window` object, thus preventing any script from being executed on the inside; scripts can still be injected from the outside via a `<script>` tag (e.g., the jQuery library fetched from a CDN). To verify that the upvote worked, the button's `classList` is checked for a class called `upmod`. If this class exists in `classList`, a message is returned. Run the script with `node crawler.js` in your terminal.
You'll then see a neat string telling you whether the post has been upvoted. While this example use case is trivial, you could build on top of it to create something powerful (for example, a bot that goes around upvoting a particular user's posts).

Next, install Puppeteer by running `npm install puppeteer`.
Create a new `crawler.js` for the Puppeteer version. `getVisual()` is an asynchronous function that will take a screenshot and a PDF of the page at the value assigned to the `URL` variable. To start, an instance of the browser is created by running `puppeteer.launch()`. Then a new page is created; this page can be thought of as a tab in a regular browser. By calling `page.goto()` with the `URL` as the parameter, the page that was created earlier is directed to that URL. The screenshot and the PDF are then produced with `page.screenshot()` and `page.pdf()` respectively; you could also listen for the JavaScript load event and only then perform these actions, which is highly recommended at the production level. Finally, the browser instance is destroyed along with the page. To run the code, type `node crawler.js` into your terminal.
After a few seconds, you will notice that two files, named `screenshot.jpg` and `page.pdf`, have been created.

Finally, install Nightmare by running `npm install nightmare`.
Once installed, create a `crawler.js` file for the Nightmare script, which works as follows: the Nightmare instance is pointed at the search engine with `goto()` once it has loaded. The search box is fetched using its selector, and the value of the search box (an input tag) is changed to "ScrapingBee". After the search is submitted, the link is extracted from the `href` attribute of the anchor tag that contains it. To run it, type `node crawler.js` into your terminal.

Now for a different problem. Suppose the values we want are inside a table with `id='data'`
that is in the page. How do we get to those? One approach is to search the page text for `<td>` and extract everything from there until the following `</td>`. But this is fragile: as soon as `<td>` becomes `<td align='left'>`, our search finds nothing.
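A quick demonstration of that fragility, with a hypothetical table row:

```javascript
// Naive extraction: take whatever sits between a literal <td> and </td>.
function naiveCell(html) {
  const match = html.match(/<td>(.*?)<\/td>/);
  return match ? match[1] : null;
}

console.log(naiveCell('<tr><td>John</td></tr>'));              // "John"
console.log(naiveCell("<tr><td align='left'>John</td></tr>")); // null
```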
A much better option is a CSS selector passed to jQuery's `$` function, `#data .name` in this example. This says that we want to locate all the elements that have CSS class `name` and are children of the element with id `data`. Note that we are not saying anything about the data being in a table in this case; CSS selectors give you great flexibility in how you specify search terms for elements, and you can be as specific or as vague as you want. The `each` function will call the function given as an argument for every element that matches the selector, with the `this` context set to the matching element. If we were to run this in the browser, we would see an alert box with the name 'John', and then another one with the name 'Susan'.
For the Node.js version of the scraper we will use two packages: `request` and `cheerio`. The `request` package is used to download web pages, while `cheerio` generates a DOM tree and provides a subset of the jQuery function set to manipulate it. To install Node.js packages we use a package manager called `npm` that is installed with Node.js. It is the equivalent of Ruby's `gem` or Python's `easy_install` and `pip`: it simplifies the download and installation of packages. The two packages can be installed together with `npm install request cheerio`.

The packages will be installed in a `scraping/node_modules` subdirectory and will only be accessible to scripts that are in the `scraping` directory. It is also possible to install Node.js packages globally, but I prefer to keep things organized by installing modules locally.

Now let's write a first scraping script using `cheerio`
. Let's call this script `example.js`. The script begins by importing the `cheerio` package. The `require` statement is similar to `#include` in C/C++, `require` in Ruby, or `import` in Python.

The HTML document is then parsed by passing it to `cheerio.load()`
. The return value is the constructed DOM, which we store in a variable called $
to match how the DOM is accessed in the browser when using jQuery.

We then use the `each` iterator to find all the occurrences of the data we want to extract. In the callback function we use the `console.log` function to write the extracted data. In Node.js, `console.log` writes to the console, so it is handy for dumping data to the screen.
In the schedule URLs, the `id` argument is what selects which pool to show a schedule for. I took the effort to open all the schedules manually to take note of the name of each pool and its corresponding `id`, since we will need those in the script. We will also use an array with the names of the days of the week. We could scrape these names from the web pages, but since this is information that will never change, we can simplify the script by incorporating the data as constants. Let's write the script; call it `thprd.js`:
The download of each page is started with a call to the `request` function. This is an asynchronous function that takes a callback as its second argument. If you are not very familiar with JavaScript this may seem odd, but in this language asynchronous functions are very common. The `request()` function returns immediately, so it is likely that the eight `request()` calls will be issued almost simultaneously and will be processed concurrently in the background.

When each download completes, its callback feeds the returned HTML into `cheerio`
to create a DOM from it. When we reach this point we are ready to start scraping.

But there is a catch: remember that the `request()` function is asynchronous? The `for` loop will do its eight iterations, spawning a background job in each. The loop then ends, leaving the loop variable set to the pool name that was used in the last iteration. When the callback functions are invoked a few seconds later, they will all see this value and print it. The fix is to capture the pool name at the time we call the `request()` function, because that is when the `pool` variable has the correct value.

You may wonder how the callback can see the loop variable even though the callback runs outside of the context of the main script. This is because the scope of JavaScript functions is defined at the time the function is created. When we created the callback function, the loop variable was in scope, so the variable is accessible to the callback. The `url` variable is also in the scope, so the callback can make use of it as well if necessary, though the same problem will exist with it: its last value will be seen by all callbacks.
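The problem can be reproduced in a tiny script (a reconstruction of the example being discussed, returning the value instead of logging it so the result is easy to check):

```javascript
var a = 1;
var f = function() {
  return a; // a is read when f is invoked, not when f is created
};
a = 2;
console.log(f()); // prints 2
```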
This script prints `2`, because that's the value of variable `a` at the time the function stored in variable `f` is invoked. To make `f` see the value `a` had at the time `f` was created, we need to insert the current value of `a` into the scope:
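A reconstruction of the fixed version:

```javascript
var a = 1;
var f = (function(a) {
  // This outer function executes immediately, so it receives the
  // current value of a (1) and keeps it in its own scope.
  return function() {
    return a; // sees the inner a, not the outer one
  };
})(a);
a = 2;
console.log(f()); // prints 1
```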
Let's examine the expression assigned to `f` one step at a time. The outer function is self-executing and takes `a` as an argument. This is not a callback function that will execute later; it executes right away, so the current value of `a` that is passed into the function is 1. The parameter of the self-executing function is also called `a`, but we could have used a different name. Since later in the script we want to invoke `f`, `f` should be a function, so the return value of our self-executing function must be the function that will get assigned to `f`. That inner function now has a parent function that received `a` as an argument. That `a` is a level closer than the original `a` in the scope of `f`, so that is the `a` that the scope of `f` sees. When you run the modified script you will get a 1 as output.

In the scraping script we apply the same trick, using the `pool`
variable instead of `a`
. Running this script again gives us the expected result, with each schedule appearing under its correct pool name.
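Stripped of the networking, the capture trick looks like this (plain stored callbacks stand in for the `request()` calls):

```javascript
const pools = { 'Pool A': 1, 'Pool B': 2 }; // hypothetical pools
const callbacks = [];

for (var pool in pools) {
  (function(pool) {
    // pool is now frozen to this iteration's value.
    callbacks.push(function() { return pool; });
  })(pool);
}

// Each callback sees its own pool, not the last one.
console.log(callbacks.map((cb) => cb())); // [ 'Pool A', 'Pool B' ]
```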
Within the `<td>` elements that hold the daily schedules there is a `<div>` wrapper around each scheduled event. In simplified form, the `<td>` element for a day contains a link at the top that we are not interested in, then a sequence of `<div>` elements, each containing the information for one event. The way to reach these event `<div>`
elements is with a nested search: first we locate the `<td>` element that defines a day, then we search for `<div>` elements within it. The `each()` iterator receives the index number of the found element as a first argument. This is handy because for our outer search it tells us which day we are in. We do not need an index number in the inner search, so there we do not need to use an argument in our function. Each inner match is an event `<div>`
, which has the information that we want. The `text()` function applied to any element of the DOM returns the contained text, filtering out any HTML elements, so this gets rid of the `<strong>` and `<br>` elements that exist there and just returns the filtered text.

But the text in the event `<div>`
element has a lot of whitespace in it. There is whitespace at the start and end of the text, and also in between the event time and the event description. We can eliminate the leading and trailing whitespace with `trim()`, and then collapse the inner runs of whitespace to single spaces with `replace()` and a small regular expression.