Web Scraping in Node.js with Multiple Examples

March 08 2017

Web scraping is a way to extract data from websites that don't provide an API. It can be used for things like collecting emails, building a news feed reader, comparing product prices across multiple e-commerce sites and mining data from search engines. Whenever a site does offer an API, use it instead: it doesn't involve parsing the whole page and is less time-consuming. Also, don't perform any kind of illegal scraping that may harm the website owner.

Before I write some examples, I want to explain how scraping works fundamentally. A simple scraper sends a GET request to the page, receives the data in HTML/XML format and then uses a parser to extract the data in whatever format you want. Wget is one utility for fetching pages from the terminal, and there are various free and paid tools available on the web. In this post, I'll use the osmosis package, written for Node.js, which comes packed with a css3/xpath selector engine and a lightweight HTTP wrapper. There are also higher-level web automation frameworks such as webdriver.io and casperjs, but for most cases osmosis suits my needs. The rest of this post walks through multiple examples to get you started with web scraping.
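To make the fetch-then-parse flow concrete, here is a minimal sketch that uses only Node's built-in https module. The helper names fetchHtml and extractTitle are made up for this illustration, and the regular expression is a naive stand-in for a real HTML parser, so treat it as a demonstration of the idea rather than something to scrape with.

```javascript
const https = require('https');

// Hypothetical helper: send a GET request and resolve with the response body.
// It does not follow redirects or handle compression; it only shows the idea.
function fetchHtml(url) {
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            let body = '';
            res.setEncoding('utf8');
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => resolve(body));
        }).on('error', reject);
    });
}

// Hypothetical helper: pull the text of the first <title> element.
// A regex is a naive stand-in for a real parser with css3/xpath selectors.
function extractTitle(html) {
    const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    return match ? match[1].trim() : null;
}
```

Calling fetchHtml(url).then((html) => console.log(extractTitle(html))) would log the page title, provided the request succeeds without a redirect.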

Setting up the project

Install Node.js, which comes with the npm package manager.
Create a new folder, say webscrap, and cd into it.
Run npm init from the terminal to create the package.json file.
Finally, run npm i osmosis --save to install the web scraping package. It already depends on a lightweight HTTP wrapper and an XML parser, so you don't need any other dependency.

Now open package.json and add a start script for the npm start command. The final package.json will look something like this:

{
    "name": "webscrap",
    "version": "1.0.0",
    "main": "index.js",
    "scripts": {
        "start": "node index"
    },
    "dependencies": {
        "osmosis": "^1.1.2"
    }
}

Also, create a new index.js file where we'll do all our work.

Note that at the time of writing, the selectors used to pick out each block of data are in working condition. In the future these selectors may stop matching because the pages get updated, but the logic remains the same unless the library itself changes.

Scraping Google Title Tag

This is the most basic example. It introduces the osmosis package and gets your first Node script up and running. Put the code below inside the index.js file and run npm start from the terminal. It will output the title of the webpage.

const osmosis = require('osmosis');
osmosis
    .get('www.google.com')
    .set({'Title': 'title'})   // or alternate: `.find('title').set('Title')`
    .data(console.log)  // will output {'Title': 'Google'}

Let's see what these methods do. First, the get method fetches the webpage in compressed format. Next, the set method selects the title element, given as a css3 selector, and assigns its text to the Title property of the output object. Finally, the data method, with console.log as its callback, prints the output. The set method also accepts a string as an argument, as in the alternate form shown in the comment. Although I'll cover most of the methods in this tutorial, you may want to read the concise official documentation as well.

Getting Related Searches From Google

Suppose we want to get the related searches for the keyword analytics from Google. We can do the following:

osmosis
    .get('https://www.google.co.in/search?q=analytics')
    .find('#botstuff')
    .set({'related': ['.card-section .brs_col p a']})
    .data(function(data) {
        console.log(data);
    })

That's it. Pretty simple. It extracts all the related keywords from the first page of search results, stores them in an array and logs them in the terminal. The logic behind it is easy: we first analyse the web page through the developer tools and find the block where the data is present (in this case the #botstuff div). We then collect the keywords into an array with the .card-section .brs_col p a selector, which matches every related keyword present on that page.
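The array that arrives in the data callback may contain stray whitespace or repeated entries. As a small sketch, the hypothetical helper cleanKeywords below tidies an object shaped like the one above. The helper name and the sample values are made up for illustration and are not part of osmosis.

```javascript
// Hypothetical helper: tidy the array collected under the `related` key.
function cleanKeywords(data) {
    const seen = new Set();
    const cleaned = [];
    for (const raw of data.related || []) {
        // Collapse runs of whitespace and trim both ends.
        const keyword = String(raw).replace(/\s+/g, ' ').trim();
        const key = keyword.toLowerCase();
        // Skip empty strings and case-insensitive duplicates.
        if (keyword && !seen.has(key)) {
            seen.add(key);
            cleaned.push(keyword);
        }
    }
    return cleaned;
}

// Sample object shaped like the one passed to .data(); the values are made up.
const sample = { related: ['google analytics', ' Google  Analytics ', 'web analytics', ''] };
console.log(cleanKeywords(sample)); // [ 'google analytics', 'web analytics' ]
```

It could be called from inside the data callback, for example console.log(cleanKeywords(data)) instead of console.log(data).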

Combining Pagination and Related Searches

This is also easy with this library: we just add one more chainable method, paginate, which finds the href of the anchor (<a>) tag that leads to the next page. We also limit the pagination to 5 so that Google doesn't detect us as a bot. If you need a time interval after every page scrape, you can attach the .delay(ms) method after .paginate().

osmosis
   .get('https://www.google.co.in/search?q=analytics')
   .paginate('#navcnt table tr > td a[href]', 5)
   .find('#botstuff')
   .set({'related': ['.card-section .brs_col p a']})
   .data(console.log)
   .log(console.log) // enable logging
   .error(console.error) // in case there is an error found.
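The idea behind a page limit combined with a delay can be sketched without any library. The code below is not how osmosis implements paginate. The names sleep, crawlPages and fetchPage are hypothetical, and maxPages here means the total number of pages visited, which is this sketch's own definition.

```javascript
// Hypothetical helper: resolve after `ms` milliseconds.
function sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: visit pages one at a time by following "next" links.
// `fetchPage(url)` is supplied by the caller and must resolve to
// { items: [...], next: urlOrNull }.
async function crawlPages(fetchPage, startUrl, maxPages, delayMs) {
    const items = [];
    let url = startUrl;
    let visited = 0;
    while (url && visited < maxPages) {
        // Pause between requests, but not before the first one.
        if (visited > 0) await sleep(delayMs);
        const page = await fetchPage(url);
        visited += 1;
        items.push(...page.items);
        url = page.next; // null or undefined ends the loop
    }
    return items;
}
```

Keeping the limit and the delay in one loop makes it easy to see that the scraper never sends two requests back to back and never goes past the page cap.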

Scraping Emails From Shopify

This example shows how to combine a listing page with the content of the individual pages it links to. In this case, we will collect the app name and email of every app, visiting each one in turn with the .follow method and then applying the appropriate selectors found through the developer console. You can also combine the code below with the .paginate method to scrape all the content, provided they don't block you.

osmosis
   .get('http://apps.shopify.com/categories/sales')
   .find('.resourcescontent ul.app-card-grid')
   .follow('li a[href]')
   .find('.resourcescontent')
   .set({
       'appname': '.app-header__details h1',
       'email': '#AppInfo table tbody tr:nth-child(2) td > a'
   })
   .log(console.log)   // enable logging to see what it does.
   .data(console.log)
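Following a link means turning each href found on the listing page into an absolute URL before requesting it. The sketch below is not osmosis code. It only illustrates that step with Node's built-in URL class, and the helper name resolveLinks and the example hrefs are made up.

```javascript
// Hypothetical helper: turn the href values found on a listing page into
// absolute URLs, which has to happen before a link can be requested.
function resolveLinks(baseUrl, hrefs) {
    return hrefs.map((href) => new URL(href, baseUrl).href);
}

// The hrefs below are made-up examples, not real app pages.
const links = resolveLinks('http://apps.shopify.com/categories/sales', [
    '/example-app',
    'http://apps.shopify.com/another-app',
]);
console.log(links);
// [ 'http://apps.shopify.com/example-app', 'http://apps.shopify.com/another-app' ]
```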

You will probably want to save this data to a file, which can be done like this. The example below is the code above modified to save the data in JSON format.

const fs = require('fs');
let savedData = [];
osmosis
   .get(..).find(..).follow(..).find(..)
   .set(..)
   .log(console.log)
   .data(function(data) {
      console.log(data);
      savedData.push(data);
   })
   .done(function() {
      fs.writeFile('data.json', JSON.stringify( savedData, null, 4), function(err) {
        if(err) console.error(err);
        else console.log('Data Saved to data.json file');
      })
   });