Let’s be honest, the documentation for Apache Nutch is scarce. Doing anything more complicated than a single-configuration crawl requires hours of prowling Stack Overflow and a plethora of sick Google-fu moves. Thankfully, I’ve already suffered for you!
A recent project involved configuring Nutch to crawl 50+ different sites, all in different states of web-standards conformance and all with different configuration settings. Sites had to be added dynamically, and their configurations had to be able to change over time. In the following few posts, I’ll share the steps we took to accomplish this.
What is Nutch?
Apache Nutch 2.x is an open-source, mature, scalable, production-ready web crawler based on Apache Hadoop (for data structures) and Apache Gora (for storage abstraction). In these examples, we will be using MongoDB for storage and Elasticsearch for indexing; however, this guide should still be useful to those using different storage and indexing backends.
Basic Nutch Setup
The standard way of using Nutch is to set up a single configuration and then run the crawl steps from the command line. There are two primary files to set up: nutch-site.xml and regex-urlfilter.txt. There are several more files you can utilize (and we’ll discuss a few of them later), but for the most basic implementation, those two are all you need.
The nutch-site.xml file is where you set all of your configuration options. A mostly complete list of configuration options can be found in nutch-default.xml; just copy the options you want to set into nutch-site.xml and change them accordingly. There are a few that we’ll need for our project:
- http.agent.name – the name of your crawler. This is a required setting for every Nutch setup. It’s good to have all of the http.agent settings filled in, but this is the only required one.
- storage.data.store.class – we’ll be setting this to org.apache.gora.mongodb.store.MongoStore for MongoDB.
- Either elastic.host and elastic.port, or elastic.cluster – this points Nutch at our Elasticsearch instance.
There are other settings we will consider later, but these are the basics.
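To make that concrete, here’s a minimal sketch of what a nutch-site.xml using the properties above might look like. The agent name, host, and port are placeholder values for illustration (9300 assumes Elasticsearch’s default transport port):

<?xml version="1.0"?>
<configuration>
  <!-- required: identifies your crawler to the sites it visits (placeholder name) -->
  <property>
    <name>http.agent.name</name>
    <value>my-nutch-crawler</value>
  </property>
  <!-- store crawl data in MongoDB via Apache Gora -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
  </property>
  <!-- point the indexer at our Elasticsearch instance (placeholder host/port) -->
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value>
  </property>
</configuration>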
The next important file is regex-urlfilter.txt. This is where you configure the crawler to include and/or exclude specific URLs from your crawl. To include URLs matching a regex pattern, prepend the regex with a +. To exclude them, prepend it with a -. We’re going to take a slightly more complicated approach to this, but more on that later.
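For now, here’s a quick illustration: a minimal regex-urlfilter.txt that restricts the crawl to a single (placeholder) domain might look like this:

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept anything on our placeholder domain
+^https?://([a-z0-9-]+\.)*example\.com/
# exclude everything else
-.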
The Crawl Cycle
Nutch’s crawl cycle is divided into 6 steps: Inject, Generate, Fetch, Parse, Updatedb, and Index. Nutch takes the injected URLs, stores them in the CrawlDB, and uses those links to go out to the web and scrape each URL. Then, it parses the scraped data into various fields and pushes any scraped hyperlinks back into the CrawlDB. Lastly, Nutch takes those parsed fields, translates them, and injects them into the indexing backend of your choice.
How To Run A Nutch Crawl
Inject
For the inject step, we’ll need to create a seeds.txt file containing seed URLs. These URLs act as a starting place for Nutch to begin crawling. We then run:
$ nutch inject /path/to/file/seeds.txt
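The seeds.txt file itself is just plain text with one URL per line; the URLs below are placeholders:

http://example.com/
http://blog.example.com/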
Generate
In the generate step, Nutch extracts the URLs from pages it has parsed. On the first run, generate only queues the URLs from the seed file for crawling; after the first crawl, it will also use hyperlinks from the parsed pages. It has a few relevant arguments:
- -topN will allow you to determine the number of URLs crawled with each execution.
- -noFilter and -noNorm will disable the filtering and normalization plugins, respectively.
In its most basic form, running generate is simple:
$ nutch generate -topN 10
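If you wanted to queue more URLs per round and skip the URL filters, you might combine the arguments above like this (the 1000 is an arbitrary example value):

$ nutch generate -topN 1000 -noFilter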
Fetch
This is where the magic happens. During the fetch step, Nutch crawls the URLs selected in the generate step. The most important argument is -threads: it sets the number of fetcher threads per task. Increasing it makes crawling faster, but setting it too high can overwhelm a site, which may then block your crawler, and it can also eat up too much memory on your machine. Run it like this:
$ nutch fetch -threads 50
Parse
Parsing is where Nutch organizes the data scraped by the fetcher. It has two useful arguments:
- -all will check and parse pages from all crawl jobs.
- -force will force the parser to re-parse all pages.
The parser reads content, organizes it into fields, scores the content, and figures out links for the generator. To run it, simply:
$ nutch parse -all
Updatedb
The Updatedb step takes the output from the fetcher and parser and updates the database accordingly; this is also where URLs are marked for future generate steps. Nutch 2.x supports several storage backends (MySQL, MongoDB, HBase) because it abstracts storage through Apache Gora. No matter your storage backend, however, running it is the same:
$ nutch updatedb -all
Index
Indexing takes all of that hard work from Nutch and puts it into a searchable interface. Nutch 2.x supports several indexing backends, such as Solr and Elasticsearch. While we will be using Elasticsearch, the command is the same no matter which indexer you use:
$ nutch index -all
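Putting it all together, a single crawl round looks something like the sketch below; the seed path and the -topN and -threads values are the placeholder examples from above, and you would repeat the generate-through-index steps to crawl deeper:

$ nutch inject /path/to/file/seeds.txt
$ nutch generate -topN 10
$ nutch fetch -threads 50
$ nutch parse -all
$ nutch updatedb -all
$ nutch index -all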
Congrats, you’ve completed your first crawl! However, we’re not going to stop here, oh no. Our implementation has far more moving parts than a simple command-line crawl can handle, so in the next post we’ll use Nutch 2.3’s RESTful API to add crawl jobs and change configurations dynamically. Stay tuned!