Working With Nutch 2.x – The API, Part 2 – Crawling Dynamically

Welcome to part 2 of our exploration of the Nutch API!

In our last post, we created infrastructure for injecting custom configurations into Nutch via nutchserver. In this post, we will be creating the script that controls crawling those configurations. If you haven’t done so yet, make sure you start the nutchserver:

$ nutch nutchserver

Dynamic Crawling

We’re going to break this up into two files again, one for cron to run and the other that holds a class that does the actual interaction with nutchserver. The class file will be Nutch.py and the executor file will be Crawler.py. We’ll start by setting up the structure of our class in Nutch.py:

import time
import requests
from random import randint

class Nutch(object):
    def __init__(self, configId, batchId=None):
        pass
    def runCrawlJob(self, jobType):
        pass


We’ll need the requests module again ($ pip install requests on the command line) to post and get from nutchserver. We’ll use time and randint to generate a batch ID later. The runCrawlJob method is what we’ll call to kick off each step of the crawl.

Next, we’ll get Crawler.py set up.

We’re going to use argparse again to give Crawler.py some options. The file should start like this:

# Import contrib
import requests
import argparse
import random

# Import custom
import Nutch as nutch

parser = argparse.ArgumentParser(description="Runs nutch crawls.")
parser.add_argument("--configId", help="Define a config ID if you just want to run one specific crawl.")
parser.add_argument("--batchId", help="Define a batch ID if you want to keep track of a particular crawl. Only works in conjunction with --configId, since batches are configuration specific.")
args = parser.parse_args()

We’re offering two optional arguments for this script. Setting --configId runs one specific configuration, and setting --batchId lets us track a specific crawl for testing or otherwise. Note: with our setup, you must set --configId if you set --batchId.
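For example, assuming the script is saved as Crawler.py, invocations look like this (the IDs shown are placeholders for a config UUID from Part 1 and a batch ID):

$ python Crawler.py
$ python Crawler.py --configId 3f2a9c1e-example
$ python Crawler.py --configId 3f2a9c1e-example --batchId 1419282465-8375

The first form crawls every injected configuration; the other two target a single configuration, optionally under a batch ID you supply.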

We’ll need two more things: a function to make calling the crawler easy and logic for calling the function.

We’ll tackle the logic first (note that in the file itself, the crawler function we define below has to appear above this block so it already exists when it’s called):

if args.configId:
    if args.batchId:
        nutchJob = nutch.Nutch(args.configId, args.batchId)
        crawler(nutchJob)
    else:
        nutchJob = nutch.Nutch(args.configId)
        crawler(nutchJob)
else:
    configIds = requests.get("http://localhost:8081/config")
    cids = configIds.json()
    random.shuffle(cids)
    for configId in cids:
        if configId != "default":
            nutchJob = nutch.Nutch(configId)
            crawler(nutchJob)

If a configId is given, we capture it and initialize our Nutch class (from Nutch.py) with that id. If a batchId is also specified, we’ll initialize the class with both. In both cases, we run our crawler function (shown below).

If neither configId nor batchId is specified, we will crawl all of the injected configurations. First, we get all of the config IDs that we injected earlier (see Part 1!). Then, we randomize them. This step is optional, but we found that we tend to get more diverse results when initially running crawls if Nutch is not running them in a static order. Last, for each config ID, we run our crawl function:

def crawler(nutch):
    inject = nutch.runCrawlJob("INJECT")
    generate = nutch.runCrawlJob("GENERATE")
    fetch = nutch.runCrawlJob("FETCH")
    parse = nutch.runCrawlJob("PARSE")
    updatedb = nutch.runCrawlJob("UPDATEDB")
    index = nutch.runCrawlJob("INDEX")

You might wonder why we’ve split up the crawl process here. This is because later, if we wish, we can use the response from the Nutch job to keep track of metadata about crawl jobs. We will also be splitting up the crawl process in Nutch.py.
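For example, here is a minimal sketch of how crawler could later be extended to collect those responses; the results dictionary and the print are illustrative additions, not part of the script we’re building:

def crawler(nutch):
    # Run each step in order and keep whatever each job returns
    # (a job ID on success, an error message on failure).
    results = {}
    for step in ["INJECT", "GENERATE", "FETCH", "PARSE", "UPDATEDB", "INDEX"]:
        results[step] = nutch.runCrawlJob(step)
    print(results)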

That takes care of Crawler.py. Let’s now fill out our class that actually controls Nutch, Nutch.py. We’ll start by filling out our __init__ constructor:

def __init__(self, configId, batchId=None):
    # Take in arguments
    self.configId = configId
    if batchId:
        self.batchId = batchId
    else:
        randomInt = randint(0, 9999)
        self.currentTime = time.time()
        self.batchId = str(self.currentTime) + "-" + str(randomInt)

    # Job metadata
    config = self._getCrawlConfiguration()
    self.crawlId = "Nutch-Crawl-" + self.configId
    self.seedFile = config["meta.config.seedFile"]

We first take in the arguments and create a batch ID if there is not one.

The batch ID is essential as it links the various steps of the process together. URLs generated under one batch ID must be fetched under the same ID, for example, or they will get lost. The syntax is simple: [Current Unixtime]-[Random 4-digit integer].
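For example, building one by hand (the exact numbers will differ every time, since they come from the clock and the random generator):

import time
from random import randint

print(str(time.time()) + "-" + str(randint(0, 9999)))
# e.g. 1419282465.72-8375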

We next get some of the important parts of the current configuration that we are crawling and set them for future use.

We’ll query the nutchserver for the current config and extract the seed file name. We also generate a crawlId for the various jobs we’ll run.

Next, we’ll need a series of functions for interacting with nutchserver.

Specifically, we’ll need one to get the crawl configurations, one to create jobs, and one to check the status of a job. The basics of how to interact with the Job API can be found at https://wiki.apache.org/nutch/NutchRESTAPI, though be aware that this page’s documentation is incomplete. Since we referenced it above, we’ll start with getting crawl configurations:

def _getCrawlConfiguration(self):
    r = requests.get('http://localhost:8081/config/' + self.configId)
    return r.json()


This is pretty simple: we make a request to the server at /config/[configID] and it returns all of the config options.
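What comes back depends entirely on the configuration you injected in Part 1, but it is a flat JSON object mapping property names to values. A minimal illustration, showing only the custom key our constructor reads later (the path is a placeholder):

{
    "meta.config.seedFile": "/path/to/seed/file"
}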

Next, we’ll get the job status:

def _getJobStatus(self, jobId):
    job = requests.get('http://localhost:8081/job/' + jobId)
    return job.json()

This one is also simple: we make a request to the server at /job/[jobId] and it returns all the info on the job. We’ll need this later to poll the server for the status of a job. We’ll pass it the job ID we get from our create request, shown below:

def _createJob(self, jobType, args):
    job = {'crawlId': self.crawlId, 'type': jobType, 'confId': self.configId, 'args': args}
    r = requests.post('http://localhost:8081/job/create', json=job)
    return r

Same deal as above: the main thing we are doing is making a request to /job/create, passing it some JSON as the body. The requests module has a nice built-in feature here: pass a Python dictionary to the json= parameter and it will convert it to a JSON string and set it as the body of the request for you.
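If it helps to see what json= is doing for us, the call above is roughly equivalent to serializing the dictionary yourself and setting the content-type header:

import json
import requests

r = requests.post('http://localhost:8081/job/create',
                  data=json.dumps(job),
                  headers={'Content-Type': 'application/json'})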

The dict we are passing has a standard set of parameters for all jobs. We need the crawlId set above; the jobType, which is the crawl step we pass into this function when we call it; the configId (sent as confId), which is the UUID we made earlier; and last, any job-specific arguments, which we’ll also pass in when we call the function.

The last thing we need is the logic for setting up, keeping track of, and resolving job creation:

def runCrawlJob(self, jobType):
    # Build the arguments for this step of the crawl.
    args = {}
    if jobType == 'INJECT':
        args = {'seedDir': self.seedFile}
    elif jobType == "GENERATE":
        args = {"normalize": True,
                "filter": True,
                "crawlId": self.crawlId,
                "batch": self.batchId
                }
    elif jobType in ("FETCH", "PARSE", "UPDATEDB", "INDEX"):
        args = {"crawlId": self.crawlId,
                "batch": self.batchId
                }
    r = self._createJob(jobType, args)
    # Give the asynchronous create call a moment, then poll until the job
    # either fails or stops running.
    time.sleep(1)
    job = self._getJobStatus(r.text)
    if job["state"] == "FAILED":
        return job["msg"]
    else:
        while job["state"] == "RUNNING":
            time.sleep(5)
            job = self._getJobStatus(r.text)
            if job["state"] == "FAILED":
                return job["msg"]
    return r.text

First, we’ll create the arguments we’ll pass to job creation.

All of the job types except Inject require a crawlId and batchId. Inject is special in that the only argument it needs is the path to the seed file. Generate has two special options that allow you to enable or disable use of the normalize and regex URL filters. We’re setting them both on by default.

After we build args, we’ll fire off the create job.

Before we begin checking the status of the job, we’ll sleep the script to give the asynchronous call a second to come back. Then we make a while loop to continuously check the job state. When it finishes without failure, we end by returning the ID.
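For reference, the polling loop relies on just two fields of the status response, state and msg; a sketch of the shape it expects (other fields omitted, values illustrative):

{
    "state": "RUNNING",  # we keep polling while this is RUNNING
    "msg": "OK"          # returned to the caller if state becomes FAILED
}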

And we’re finished! There are a few more things of note that I want to mention here. An important aspect of the way Nutch was designed is that it is impossible to know how long a given crawl will take. On the one hand, this means that your scripts could be running for several hours at a time; on the other, a crawl could be done in a few minutes. I mention this because when you first start crawling, and also after you have crawled for a long time, you might see Nutch fetching very few links. In the first case, this is because, as I mentioned earlier, Nutch only crawls the links in the seed file at first, and if there are not many hyperlinks on those first pages, it might take two or three crawl cycles before you start seeing a lot of links being fetched. In the latter case, after Nutch finishes crawling all the pages that match your configuration, it will only recrawl those pages after a set interval. You can modify how this process works, but it means that after a while you will see crawls that only fetch a handful of links.

Another helpful note: the Nutch log at /path/to/nutch/runtime/local/logs/hadoop.log is great for following the progress of a crawl. You can set the logging verbosity for most parts of the Nutch process in /path/to/nutch/conf/log4j.properties (if you change this, you will have to rebuild Nutch by running ant runtime at the Nutch root).
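A convenient way to watch a crawl in real time is to tail that log while Crawler.py runs:

$ tail -f /path/to/nutch/runtime/local/logs/hadoop.log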