Run `node <YourFileNameHere>.js`. After opening up your browser and navigating to `localhost:3000`, you will see some text saying "Hello World". Node.js is ideal for applications that are I/O intensive.

The request library can be installed with `npm install request`.
Axios is a similar, promise-based HTTP client. With Axios, the request logic can live in an asynchronous function such as `getForum()`. You can find the Axios library on GitHub, and installing Axios is as simple as `npm install axios`.
SuperAgent can be installed in much the same way, by running `npm install superagent`.
`.match()` usually returns an array with everything that matches the regular expression. In the second element (at index 1), you will find the `textContent` or the `innerHTML` of the `<label>` tag, which is what we want. But this result contains some unwanted text ("Username: ") that has to be removed.

To follow along with the next example, install Cheerio and Axios by running `npm install cheerio axios`.
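To make that concrete, here is a minimal sketch; the `<label>` markup is a made-up example:

```javascript
// A hypothetical snippet of page HTML containing the label we want.
const html = "<p><label>Username: johndoe</label></p>";

// match() returns the full match at index 0 and the captured group at index 1.
const result = html.match(/<label>(.+)<\/label>/);
console.log(result[1]); // "Username: johndoe"

// Strip the unwanted "Username: " prefix.
const username = result[1].replace('Username: ', '');
console.log(username); // "johndoe"
```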
Next, create a new file called `crawler.js` for the crawler code. `getPostTitles()` is an asynchronous function that will crawl Reddit's old r/programming forum. First, the HTML of the website is obtained using a simple HTTP GET request with the axios HTTP client library. Then the HTML data is fed into Cheerio using the `cheerio.load()` function.

The selector `$('div > p.title > a')` is probably familiar: it gets all the post links. Since you only want the title of each post individually, you have to loop through each post. This is done with the help of the `each()` function, whose callback receives each matching element (`el` refers to the current element). Then, calling `text()` on each element will give you the text. To run the script, type `node crawler.js` into your terminal.
You'll then see an array of about 25 or 26 post titles (it will be quite long). While this is a simple use case, it demonstrates the simple nature of the API provided by Cheerio.

Next, install JSDOM and Axios by running `npm install jsdom axios`.
Then create a new `crawler.js` for this version of the script. `upvoteFirstPost()` is an asynchronous function that will obtain the first post in r/programming and upvote it. To do this, axios sends an HTTP GET request to fetch the HTML of the URL specified. Then a new DOM is created by feeding in the HTML that was fetched earlier. By default, JSDOM runs the page in a sandboxed `window` object, thus preventing any script from being executed on the inside; scripts can still be injected from the outside via a `<script>` tag (e.g., the jQuery library fetched from a CDN). To verify that the upvote worked, the button's `classList` is checked for a class called `upmod`. If this class exists in `classList`, a message is returned. Run the script with `node crawler.js` in your terminal.
You'll then see a neat string telling you whether the post has been upvoted. While this example use case is trivial, you could build on top of it to create something powerful (for example, a bot that goes around upvoting a particular user's posts).

Next, install Puppeteer by running `npm install puppeteer`.
Create a new `crawler.js` for the Puppeteer version. `getVisual()` is an asynchronous function that will take a screenshot and a PDF of the page at the value assigned to the `URL` variable. To start, an instance of the browser is created by running `puppeteer.launch()`. Then a new page is created; this page can be thought of as a tab in a regular browser. By calling `page.goto()` with the `URL` as the parameter, the page that was created earlier is directed to that URL. The screenshot and the PDF are then produced with `page.screenshot()` and `page.pdf()` respectively; you could also listen for the JavaScript load event and only then perform these actions, which is highly recommended at the production level. Finally, the browser instance is destroyed along with the page. To run the code, type `node crawler.js` into your terminal.
After a few seconds, you will notice that two files, named `screenshot.jpg` and `page.pdf`, have been created.

Finally, install Nightmare by running `npm install nightmare`.
Once installed, create a `crawler.js` file for the Nightmare script, which works as follows: the Nightmare instance is pointed at the search engine with `goto()` once it has loaded. The search box is fetched using its selector, and the value of the search box (an input tag) is changed to "ScrapingBee". After the search is submitted, the link is extracted from the `href` attribute of the anchor tag that contains it. To run it, type `node crawler.js` into your terminal.

Now for a different problem. Suppose the values we want are inside a table with `id='data'`
that is in the page. How do we get to those? One approach is to search the page text for `<td>` and extract everything from there until the following `</td>`. But this is fragile: as soon as `<td>` becomes `<td align='left'>`, our search finds nothing.
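A quick demonstration of that fragility, with a hypothetical table row:

```javascript
// Naive extraction: take whatever sits between a literal <td> and </td>.
function naiveCell(html) {
  const match = html.match(/<td>(.*?)<\/td>/);
  return match ? match[1] : null;
}

console.log(naiveCell('<tr><td>John</td></tr>'));              // "John"
console.log(naiveCell("<tr><td align='left'>John</td></tr>")); // null
```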
A much better option is a CSS selector passed to jQuery's `$` function, `#data .name` in this example. This says that we want to locate all the elements that have CSS class `name` and are children of the element with id `data`. Note that we are not saying anything about the data being in a table in this case; CSS selectors give you great flexibility in how you specify search terms for elements, and you can be as specific or as vague as you want. The `each` function will call the function given as an argument for every element that matches the selector, with the `this` context set to the matching element. If we were to run this in the browser, we would see an alert box with the name 'John', and then another one with the name 'Susan'.
For the Node.js version of the scraper we will use two packages: `request` and `cheerio`. The `request` package is used to download web pages, while `cheerio` generates a DOM tree and provides a subset of the jQuery function set to manipulate it. To install Node.js packages we use a package manager called `npm` that is installed with Node.js. It is the equivalent of Ruby's `gem` or Python's `easy_install` and `pip`: it simplifies the download and installation of packages. The two packages can be installed together with `npm install request cheerio`.

The packages will be installed in a `scraping/node_modules` subdirectory and will only be accessible to scripts that are in the `scraping` directory. It is also possible to install Node.js packages globally, but I prefer to keep things organized by installing modules locally.

Now let's write a first scraping script using `cheerio`
. Let's call this script `example.js`. The script begins by importing the `cheerio` package. The `require` statement is similar to `#include` in C/C++, `require` in Ruby, or `import` in Python.

The HTML document is then parsed by passing it to `cheerio.load()`
. The return value is the constructed DOM, which we store in a variable called $
to match how the DOM is accessed in the browser when using jQuery.

We then use the `each` iterator to find all the occurrences of the data we want to extract. In the callback function we use the `console.log` function to write the extracted data. In Node.js, `console.log` writes to the console, so it is handy for dumping data to the screen.
In the schedule URLs, the `id` argument is what selects which pool to show a schedule for. I took the effort to open all the schedules manually to take note of the name of each pool and its corresponding `id`, since we will need those in the script. We will also use an array with the names of the days of the week. We could scrape these names from the web pages, but since this is information that will never change, we can simplify the script by incorporating the data as constants. Let's write the script; call it `thprd.js`:
The download of each page is started with a call to the `request` function. This is an asynchronous function that takes a callback as its second argument. If you are not very familiar with JavaScript this may seem odd, but in this language asynchronous functions are very common. The `request()` function returns immediately, so it is likely that the eight `request()` calls will be issued almost simultaneously and will be processed concurrently in the background.

When each download completes, its callback feeds the returned HTML into `cheerio`
to create a DOM from it. When we reach this point we are ready to start scraping.

But there is a catch: remember that the `request()` function is asynchronous? The `for` loop will do its eight iterations, spawning a background job in each. The loop then ends, leaving the loop variable set to the pool name that was used in the last iteration. When the callback functions are invoked a few seconds later, they will all see this value and print it. The fix is to capture the pool name at the time we call the `request()` function, because that is when the `pool` variable has the correct value.

You may wonder how the callback can see the loop variable even though the callback runs outside of the context of the main script. This is because the scope of JavaScript functions is defined at the time the function is created. When we created the callback function, the loop variable was in scope, so the variable is accessible to the callback. The `url` variable is also in the scope, so the callback can make use of it as well if necessary, though the same problem will exist with it: its last value will be seen by all callbacks.
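The problem can be reproduced in a tiny script (a reconstruction of the example being discussed, returning the value instead of logging it so the result is easy to check):

```javascript
var a = 1;
var f = function() {
  return a; // a is read when f is invoked, not when f is created
};
a = 2;
console.log(f()); // prints 2
```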
This script prints `2`, because that's the value of variable `a` at the time the function stored in variable `f` is invoked. To make `f` see the value `a` had at the time `f` was created, we need to insert the current value of `a` into the scope:
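A reconstruction of the fixed version:

```javascript
var a = 1;
var f = (function(a) {
  // This outer function executes immediately, so it receives the
  // current value of a (1) and keeps it in its own scope.
  return function() {
    return a; // sees the inner a, not the outer one
  };
})(a);
a = 2;
console.log(f()); // prints 1
```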
Let's examine the expression assigned to `f` one step at a time. The outer function is self-executing and takes `a` as an argument. This is not a callback function that will execute later; it executes right away, so the current value of `a` that is passed into the function is 1. The parameter of the self-executing function is also called `a`, but we could have used a different name. Since later in the script we want to invoke `f`, `f` should be a function, so the return value of our self-executing function must be the function that will get assigned to `f`. That inner function now has a parent function that received `a` as an argument. That `a` is a level closer than the original `a` in the scope of `f`, so that is the `a` that the scope of `f` sees. When you run the modified script you will get a 1 as output.

In the scraping script we apply the same trick, using the `pool`
variable instead of `a`
. Running this script again gives us the expected result, with each schedule appearing under its correct pool name.
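Stripped of the networking, the capture trick looks like this (plain stored callbacks stand in for the `request()` calls):

```javascript
const pools = { 'Pool A': 1, 'Pool B': 2 }; // hypothetical pools
const callbacks = [];

for (var pool in pools) {
  (function(pool) {
    // pool is now frozen to this iteration's value.
    callbacks.push(function() { return pool; });
  })(pool);
}

// Each callback sees its own pool, not the last one.
console.log(callbacks.map((cb) => cb())); // [ 'Pool A', 'Pool B' ]
```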
Within the `<td>` elements that hold the daily schedules there is a `<div>` wrapper around each scheduled event. In simplified form, the `<td>` element for a day contains a link at the top that we are not interested in, then a sequence of `<div>` elements, each containing the information for one event. The way to reach these event `<div>`
elements is with a nested search: first we locate the `<td>` element that defines a day, then we search for `<div>` elements within it. The `each()` iterator receives the index number of the found element as a first argument. This is handy because for our outer search it tells us which day we are in. We do not need an index number in the inner search, so there we do not need to use an argument in our function. Each inner match is an event `<div>`
, which has the information that we want. The `text()` function applied to any element of the DOM returns the contained text, filtering out any HTML elements, so this gets rid of the `<strong>` and `<br>` elements that exist there and just returns the filtered text.

But the text in the event `<div>`
element has a lot of whitespace in it. There is whitespace at the start and end of the text, and also in between the event time and the event description. We can eliminate the leading and trailing whitespace with `trim()`, and then collapse the inner runs of whitespace to single spaces with `replace()` and a small regular expression.