I hate writing web scrapers
I have been writing web scrapers for some time now. They are fairly simple things,
but extremely tedious to write.
This is a simple scraper that pulls the front page of Hacker News:
var _ = require( 'underscore' ),
    q = require( 'q' ),
    cheerio = require( 'cheerio' );

var get = function( url ) {
    // ... returns promise that has response to GET url
};

// q.async turns a generator into a promise-returning function,
// so we can yield on each request instead of chaining .then()s
q
    .async( function *() {
        var page_html = yield get( 'http://news.ycombinator.com' );
        var $ = cheerio.load( page_html );

        // Yes, their css class is actually "athing"
        var results = $( '.athing' ).find( '.title>a' );

        return _
            .map( results, ( elm ) => {
                return {
                    text: $( elm ).text(),
                    href: $( elm ).attr( 'href' )
                };
            } );
    } )()
    .then( ( results ) => {
        console.log( results );
        process.exit( 0 );
    } )
    .catch( ( err ) => {
        console.error( err );
        process.exit( 2 );
    } );
Output:
[ { text: 'Introducing React Storybook',
href: 'https://voice.kadira.io/introducing-react-storybook-ec27f28de1e2' },
{ text: 'How to Build Your Own Rogue GSM BTS for Fun and Profit',
href: 'https://www.evilsocket.net/2016/03/31/how-to-build-your-own-rogue-gsm-bts-for-fun-and-profit/' },
// ... More posts
]
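The get helper is only stubbed out above. A minimal version (just a sketch, assuming the request package; the linked full code may do this differently) wraps a callback-style HTTP GET in a Q promise:

var q = require( 'q' ),
    request = require( 'request' );

// Resolves with the response body for a GET of `url`,
// rejects if the request fails.
var get = function( url ) {
    var deferred = q.defer();

    request( url, function( err, response, body ) {
        if ( err ) {
            return deferred.reject( err );
        }

        deferred.resolve( body );
    } );

    return deferred.promise;
};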
Full code can be found here.
This scraper gets the job done, but it sucks:
- Code duplication: If you count the unique lines in that snippet, only 3 of them are actually specific to the page I am scraping: the two CSS selectors and the two $( elm ) calls. The other 17 lines of code should have been abstracted and made reusable (see the sketch after this list).
- Unscalable: This code runs fine if all we want is the first page. But it starts getting complicated and unwieldy the moment you try to page through HN, pull the comments from each individual post, or visit the linked pages and pull body content.
- Single Threaded: No, I don't mean NodeJS is single threaded. I mean we are only downloading and processing one result at a time. When our small scraping project grows from one small page to thousands of pages across an entire website, it would be nice for our script to be easily horizontally scalable, so that we can add more processes and increase our speed.
- No robots.txt support: This one pretty much goes without saying. Sites don't like it when your web scraper doesn't follow their rules, and ignoring them is a quick way to get IP banned.
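To make that duplication concrete, here is a rough sketch of what such an abstraction could look like (hypothetical, not from the linked code; get is the same stubbed helper as above), where only the URL, the selector, and the extraction logic are page-specific:

var _ = require( 'underscore' ),
    q = require( 'q' ),
    cheerio = require( 'cheerio' );

// Generic part: fetch a page, run a selector, map each match through
// an extractor. get( url ) is the same promise-returning helper as above.
var scrapeList = function( url, selector, extract ) {
    return q.async( function *() {
        var $ = cheerio.load( yield get( url ) );
        return _.map( $( selector ), ( elm ) => extract( $( elm ) ) );
    } )();
};

// Page-specific part: the only lines that mention Hacker News at all.
scrapeList( 'http://news.ycombinator.com', '.athing .title>a', ( $elm ) => {
    return { text: $elm.text(), href: $elm.attr( 'href' ) };
} )
    .then( ( posts ) => console.log( posts ) )
    .catch( ( err ) => console.error( err ) );

Notice that what remains is essentially a page definition: a URL, some selectors, and what to extract.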
I love writing a list of complaints like the one above, because it translates directly into a list of requirements. And all successful projects start from a list of product requirements.
Requirements:
- Concise code: Writing a web scraper/crawler should only require you to write lines of code that couple directly to the site you are scraping.
- Scalable: The scraper should be as easy to use for a single page as it is for a site containing many pages of varying types.
- Multi-threaded: The scraper should allow you to "throw more PCs at the problem". Workers should get their jobs from a queue (local or remote) and perform their tasks independently of each other.
- Automatic robots.txt support: The developer shouldn't even have to know what a robots.txt file is to write a compliant web crawler.
Proposition:
A scraper/crawler that runs on definition files.
A definition file for Hacker News would look something like this:
page:post_listing
    dom:.athing
        dom:.title>a
            eval:this.text()
                result:posts.title
            eval:this.attr( 'href' )
                result:posts.href
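To make the semantics concrete, here is a rough sketch (purely hypothetical, since the interpreter doesn't exist yet; emit is a placeholder callback) of how a worker could execute this definition's dom/eval/result directives using the same cheerio calls as the snippet above:

// Hypothetical interpretation of the definition above.
// dom: runs a selector, eval: runs an expression with `this` bound to
// the wrapped element, result: stores the value under table.column.
var cheerio = require( 'cheerio' );

var runDefinition = function( html, emit ) {
    var $ = cheerio.load( html );

    // dom:.athing -> dom:.title>a
    $( '.athing' ).find( '.title>a' ).each( function( i, elm ) {
        var $elm = $( elm );

        emit( 'posts', {
            // eval:this.text()          result:posts.title
            title: $elm.text(),
            // eval:this.attr( 'href' )  result:posts.href
            href: $elm.attr( 'href' )
        } );
    } );
};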
The results of this definition file would look exactly like the ones from our code snippet above. This may look more complicated now, but let me show you how easy it is to follow the "More" link on the page, and get every post on Hacker News:
page:post_listing
    dom:.athing
        dom:.title>a
            eval:this.text()
                result:posts.title
            eval:this.attr( 'href' )
                result:posts.href
    dom:a[text=More]
        eval:this.attr( 'href' )
            page:post_listing
This would tell the scraper that any link whose text matches "More" should be added to the post_listing queue, and processed the way defined in the post_listing definition. If you're still not convinced, it's super easy for us to pull the comments of every post too:
page:post_listing
    dom:.athing
        dom:.title>a
            eval:this.text()
                result:posts.title
            eval:this.attr( 'href' )
                result:posts.href
                result:comments.post_href
        dom:.subtext>a:last-child
            eval:this.attr( 'href' )
                page:comments_page
    dom:a[text=More]
        eval:this.attr( 'href' )
            page:post_listing

page:comments_page
    dom:.athing
        dom:.comhead a:first-child
            eval:this.text()
                result:comments.username
        dom:.comment .c00
            eval:this.text()
                result:comments.body
Since there are now two types of data (posts and comments), the scraper would output its results into two separate files/locations. It would follow a standard relational format, where comments reference posts through the comments.post_href -> posts.href foreign key.
This shows how the scraper can be used just as easily for a single page as for a large website with many types of pages.
If the results were output to CSV, they would look something like:
posts.csv
title,href
Introducing React Storybook,https://voice.kadira.io/introducing-react-storybook-ec27f28de1e2
How to Build Your Own Rogue GSM BTS for Fun and Profit,https://www.evilsocket.net/2016/03/31/how-to-build-your-own-rogue-gsm-bts-for-fun-and-profit/
// ... the rest of the posts
comments.csv
post_href,username,body
https://voice.kadira.io/introducing-react-storybook-ec27f28de1e2,exampleuser,This is so cool!
// ... the rest of the comments on all of the posts
Given the correct adapter, this data could just as easily be output to a relational database, or any other datastore.
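For illustration (purely hypothetical; no adapter API exists yet), an adapter only needs to know how to write one row to a named table, so a naive CSV adapter could be as small as:

var fs = require( 'fs' );

// Hypothetical adapter interface: the scraper calls write( table, row )
// for every entry it streams out. A database adapter would implement the
// same method with INSERTs instead of appending CSV lines.
var csvAdapter = {
    write: function( table, row ) {
        // Naive CSV escaping: strips commas rather than quoting properly.
        var line = Object.keys( row ).map( function( key ) {
            return String( row[ key ] ).replace( /,/g, ' ' );
        } ).join( ',' );

        fs.appendFileSync( table + '.csv', line + '\n' );
    }
};

// e.g. csvAdapter.write( 'posts', { title: '...', href: '...' } );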
This syntax is interpreted and run by the scraper. Internally, the scraper keeps a queue of pages it has yet to visit. Each page is worked on by a worker, and the results are output to a results tree. This results tree is processed, and individual data entries are streamed out.
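A minimal sketch of that worker loop (again hypothetical; queue, get, and runPage are placeholder names, not an existing API) might look like this:

var q = require( 'q' );

// Hypothetical worker: pop a job, download the page, run the matching
// page definition, and emit rows / enqueue newly discovered pages.
var worker = q.async( function *( queue, definitions, adapter ) {
    var job;

    // queue.pop() resolves with the next job, or null when the queue is empty.
    while ( ( job = yield queue.pop() ) ) {
        var html = yield get( job.url );

        // runPage() would be a generalized version of the runDefinition
        // sketch shown earlier: it walks the dom/eval/result directives.
        runPage( definitions[ job.page ], html, {
            emit: ( table, row ) => adapter.write( table, row ),
            enqueue: ( page, url ) => queue.push( { page: page, url: url } )
        } );
    }
} );

The queue could be in-memory for a single process, or backed by something remote when several workers need to share it.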
Because these workers are separate, independent processes, it is possible to run them in parallel, making the scraper easily horizontally scalable.
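To give a sense of what "throwing more PCs at the problem" could look like (again just a sketch, reusing the placeholder worker from above and Node's built-in cluster module):

var cluster = require( 'cluster' ),
    os = require( 'os' );

// Hypothetical single-machine scale-out: fork one worker process per CPU.
// Scaling across machines would just mean starting more of these processes
// pointed at the same remote queue.
if ( cluster.isMaster ) {
    os.cpus().forEach( function() {
        cluster.fork();
    } );
} else {
    // sharedQueue, definitions and csvAdapter are the placeholder pieces
    // from the sketches above.
    worker( sharedQueue, definitions, csvAdapter )
        .catch( function( err ) {
            console.error( err );
            process.exit( 2 );
        } );
}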