
I hate writing web scrapers

I have been writing web scrapers for some time now. They are fairly simple things, but extremely tedious to write.

This is a simple scraper that pulls the front page of Hacker News:
var _ = require( 'underscore' ),
    q = require( 'q' ),
    cheerio = require( 'cheerio' );

var get = function( url ) {
    // ... returns a promise that resolves with the response to GET url
};

q.async( function *() {
    var page_html = yield get( 'https://news.ycombinator.com/' );
    var $ = cheerio.load( page_html );

    // Yes, their css class is actually "athing"
    var results = $( '.athing' ).find( '.title>a' );

    return _.map( results, ( elm ) => {
        return {
            text: $( elm ).text(),
            href: $( elm ).attr( 'href' )
        };
    } );
} )()
    .then( ( results ) => {
        console.log( results );
        process.exit( 0 );
    } )
    .catch( ( err ) => {
        console.error( err );
        process.exit( 2 );
    } );
[ { text: 'Introducing React Storybook',
    href: '' },
  { text: 'How to Build Your Own Rogue GSM BTS for Fun and Profit',
    href: '' },
// ... More posts
Full code can be found here

This code gets the job done, but it sucks:
  • Code duplication If you count the unique lines in that snippet, only 3 of them are actually specific to the page I am scraping: the two query strings and the two $( elm ) calls. That means the other 17 lines of code should have been abstracted and made reusable.
  • Unscalable This code runs fine if it's just the first page we want, but it starts getting complicated and unwieldy the moment you try to page through HN, pull the comments from each individual post, or visit the linked pages and pull body content.
  • Single threaded No, I don't mean NodeJS is single threaded. I mean we are only downloading and processing one result at a time. When our small scraping project grows from one small page to thousands of pages across an entire website, it would be nice for the script to be easily horizontally scalable, so that we can add more processes and increase our speed.
  • No robots.txt support This one pretty much goes without explaining. Sites don't like it when your web scraper ignores their rules; that is a quick way to get your IP banned.
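For a sense of what compliance involves, a stripped-down robots.txt check can be written in a few lines. This sketch handles only "User-agent: *" groups and plain Disallow path prefixes; no wildcards, Allow rules, or crawl delays:

```javascript
// Simplified robots.txt check: honors only "User-agent: *" groups and
// plain Disallow path prefixes. Real parsers (per RFC 9309) also need
// Allow rules, wildcards, and longest-match precedence.
var isAllowed = function( robotsTxt, path ) {
    var applies = false;
    var disallowed = [];

    robotsTxt.split( '\n' ).forEach( function( line ) {
        var parts = line.split( ':' );
        var field = parts[ 0 ].trim().toLowerCase();
        var value = parts.slice( 1 ).join( ':' ).trim();

        if ( field === 'user-agent' ) {
            applies = ( value === '*' );
        } else if ( field === 'disallow' && applies && value !== '' ) {
            disallowed.push( value );
        }
    } );

    return !disallowed.some( function( prefix ) {
        return path.indexOf( prefix ) === 0;
    } );
};
```

The point is that the crawler should run this check before every fetch, so the developer never has to.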

I love writing a list of complaints like this, because it is directly translatable into a list of requirements. And all successful projects start from a list of product requirements.

  • Concise code Writing a web scraper/crawler should only require you to write the lines of code that couple directly to the site you are scraping
  • Scalable The scraper should be as easy to use for a single page as it is for a site containing many pages of varying types.
  • Multi threaded The scraper should allow you to "throw more PCs at the problem". Workers should get their jobs from a queue (local or remote) and perform their tasks independently of each other
  • Automatic robots.txt support The developer shouldn't even have to know what a robots.txt file is to write a compliant web crawler

A scraper/crawler that runs on definition files.

A definition file for Hacker News would look something like this:
            eval:this.attr( 'href' )
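Only that single eval: line of the definition file survived the HTML extraction of this post. A reconstruction of its general shape might look like the following, where dom: holds a CSS selector and eval: a JavaScript expression run against each matched element; every key name here is a guess, not the scraper's actual syntax:

```yaml
# Hypothetical reconstruction -- key names are illustrative guesses
post_listing:
  url: https://news.ycombinator.com/
  data:
    posts:
      dom: .athing .title>a
      fields:
        text:
          eval: this.text()
        href:
          eval: this.attr( 'href' )
```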

The results of this config file would look exactly like the ones from the code snippet above. This may look more complicated now, but let me show you how easy it is to follow the "More" link on the page and get every post on Hacker News.

            eval:this.attr( 'href' )
        eval:this.attr( 'href' )
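Again, only the two eval: lines of this snippet survive. Going by the surrounding prose, the addition over the first definition would be a follow rule that queues matching links back onto post_listing; a guessed sketch:

```yaml
# Hypothetical: a follow rule added to the post_listing definition
follow:
  more:
    dom: a:contains('More')
    href:
      eval: this.attr( 'href' )
    page: post_listing
```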

This would tell the scraper that any link whose text matches "More" should be added to the post_listing queue and processed the way the post_listing definition specifies. If you're still not convinced, it's super easy to pull the comments of every post too.

            eval:this.attr( 'href' )
            eval:this.attr( 'href' )
                page: comments_page
        eval:this.attr( 'href' )
        dom:.comhead a:first-child
        dom:.comment .c00
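Of this third snippet, the page: comments_page line and the two dom: selectors survive. Fitting them into the same guessed shape, the comments definition might be (the a:last-child selector for the comment link is an assumption):

```yaml
# Hypothetical reconstruction -- only the page: line and the two
# dom: selectors under comments_page are from the original post
follow:
  comments:
    dom: .subtext a:last-child
    href:
      eval: this.attr( 'href' )
    page: comments_page

comments_page:
  data:
    comments:
      fields:
        username:
          dom: .comhead a:first-child
        body:
          dom: .comment .c00
```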

Since there are now two types of data (posts and comments), the scraper would output its results into two separate files/locations. It would follow a standard relational format, where comments reference posts using the comments.post_href -> posts.href foreign key.
This shows how the scraper can easily be used for single sites, or large websites with many types of pages.

If the results were output to CSV, they would look something like:

text,href
Introducing React Storybook,
How to Build Your Own Rogue GSM BTS for Fun and Profit,
// ... the rest of the posts

post_href,username,body
,exampleuser,This is so cool!
// ... the rest of the comments on all of the posts

Given the correct adapter this data could easily also be output to a relational database, or any datastore.
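An "adapter" here would just be anything that accepts a stream of records. As a toy illustration (the write/flush interface is hypothetical, and no CSV quoting or escaping is done), an in-memory CSV adapter could look like:

```javascript
// Toy output adapter: collects records and renders them as CSV lines.
// The write/flush interface is hypothetical; values containing commas
// would need quoting, which this sketch omits.
var csvAdapter = function() {
    var lines = [];
    return {
        write: function( record ) {
            lines.push( Object.values( record ).join( ',' ) );
        },
        flush: function() {
            return lines.join( '\n' );
        }
    };
};
```

A database adapter would expose the same write interface, but issue inserts instead of buffering lines.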

This syntax is interpreted and run by the scraper. Internally the scraper keeps a queue of pages it has yet to visit. Each page is worked on by a worker, and the results are output to a results tree. This results tree is processed, and individual data entries are streamed out.
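That queue-driven loop can be sketched in a few lines. Everything below is illustrative (the fetch function, the job shape, and the definitions map are all assumptions), but it shows the core mechanic: each visited page yields records plus follow-up jobs that go back on the queue:

```javascript
// Sketch of the internal crawl loop: a FIFO queue of { url, type }
// jobs. "definitions" maps a page type to a handler returning the
// extracted records plus any follow-up jobs. All names are
// illustrative, not the real scraper's API.
var crawl = function( startJob, definitions, fetch ) {
    var queue = [ startJob ];
    var seen = {};
    var results = [];

    while ( queue.length > 0 ) {
        var job = queue.shift();
        if ( seen[ job.url ] ) { continue; }
        seen[ job.url ] = true;

        var page = fetch( job.url );                  // download the page
        var out = definitions[ job.type ]( page );    // apply the definition

        results = results.concat( out.records );
        queue = queue.concat( out.follow );
    }

    return results;
};

// Usage against a fake two-page site:
var pages = {
    '/p1': { posts: [ 'a', 'b' ], next: '/p2' },
    '/p2': { posts: [ 'c' ], next: null }
};
var defs = {
    post_listing: function( page ) {
        return {
            records: page.posts,
            follow: page.next ? [ { url: page.next, type: 'post_listing' } ] : []
        };
    }
};
crawl( { url: '/p1', type: 'post_listing' }, defs, function( url ) {
    return pages[ url ];
} );
// → [ 'a', 'b', 'c' ]
```

The seen map doubles as deduplication, so following "More" links can never loop forever.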

Because each worker is a separate, individual process, they can run in parallel, making the scraper easily horizontally scalable.
