Now that we know the basics of Nutch, we can dive into our use case. We write scripts that do two things:
- Ingest the various configurations
- Execute and control crawls
This post will tackle ingesting the configs. I will specifically be using Python for the examples in this post, but the principles should apply to any language.
Dynamic Configuration Ingestion
In our project, we had 50+ sites we wanted to crawl, all with different configuration needs. We organized these configurations into a JSON API that we ingest. In our examples, we will be using Python’s Requests library for HTTP requests. We’ll also need a way to create a UUID for each configuration, so we’ll use Python’s uuid module. The uuid module ships with the standard library, so only Requests needs to be installed with the package installer pip:
$ pip install requests
We’re going to use a class to handle all of the processing for injection. We’ll create a file for this, call it configInjector.py. The beginning of the file should look something like this:
import os
import uuid
import requests
from shutil import copy2

class ConfigInjector(object):
    def __init__(self):
        pass
We’re importing os and copy2 so we can create, edit, and copy files that we need. Next, we’re going to want to get the config itself, as well as an ID from the configuration node itself. We’ll make a new file for this, call it inject.py. This will be the script we actually run from cron for injection. It begins something like this:
import urllib2
import json
import argparse
import configInjector
parser = argparse.ArgumentParser(description="Ingests configs.")
parser.add_argument("configUrl", help="URL of the JSON config endpoint.")
args = parser.parse_args()
For our imports, we’ll use urllib2 to download our remote JSON, json to parse it, and argparse to give our script an argument for where to download the JSON. (urllib2 exists only in Python 2; on Python 3 the equivalent is urllib.request.) We’re also importing our own configInjector class file.
The argparse module allows us to pass command line arguments to the Python script. In the code above, we instantiate the argument parser, add our argument (configUrl), and assign the parsed arguments to args. This allows us to pass in a url for the location of our JSON endpoint.
Now that we have the foundation set up, let’s get the data. We’ll use urllib2 to grab the JSON and json.load() to add it to a variable:
response = urllib2.urlopen(args.configUrl)
configs = json.load(response)
We’ll then loop through it and call our class for each config in the JSON:
for configId in configs:
    configInjector.ConfigInjector(configId, configs[configId])
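For reference, here is a minimal sketch of that loop run against an inline JSON string instead of a live endpoint. The payload shape is an assumption pieced together from the keys used later in this post (configTitle, allowExternalDomains, seedUrls, matchPatterns, notMatchPatterns). The IDs, titles, and URLs are made up, the summarize helper is hypothetical, and your endpoint may well look different.

```python
import json

# Hypothetical payload: a JSON object keyed by config ID.
SAMPLE_JSON = """
{
  "101": {
    "configTitle": "Example Site",
    "allowExternalDomains": false,
    "seedUrls": ["http://example.com/"],
    "matchPatterns": ["http://example.com/blog/"],
    "notMatchPatterns": ["http://example.com/blog/tag/"]
  },
  "102": {
    "configTitle": "Example Docs",
    "allowExternalDomains": true,
    "seedUrls": ["http://docs.example.com/", "http://docs.example.com/api/"],
    "matchPatterns": [],
    "notMatchPatterns": []
  }
}
"""

configs = json.loads(SAMPLE_JSON)


def summarize(configs):
    """Return (configId, title, number of seed urls) for each config, sorted by ID."""
    rows = []
    for configId in sorted(configs):
        config = configs[configId]
        rows.append((configId, config["configTitle"], len(config["seedUrls"])))
    return rows


for row in summarize(configs):
    print(row)
```

Iterating over the parsed dictionary yields its keys, which is why the loop in inject.py can pass both the ID and the matching config body to the class.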
Now that we are getting the configs, let’s fill out our class and process them. We’ll use the __init__ constructor to do the majority of our data transformations. The two major things we want to do are to process and inject Nutch config settings and to create a regex-urlfilter.txt for each config.
First, we’ll do our transformations. We want to get our config options in order to plug into Nutch, so we’ll just set them as variables in the class:
class ConfigInjector(object):
    def __init__(self, configId, config):
        self.config = config
        self.configId = configId
        # Config transformations
        self.configTitle = self.config["configTitle"]
        self.allowExternalDomains = self.config["allowExternalDomains"]
        self.uuid = str(uuid.uuid3(uuid.NAMESPACE_DNS, str(self.configId)))
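We’re setting three things in this example: a config title and UUID for reference, and a value for the Nutch setting db.ignore.external.links. Note that the two names point in opposite directions: setting db.ignore.external.links to true tells Nutch to skip links to other hosts, so check that allowExternalDomains in your JSON carries the sense you intend (or invert it) before passing it through. We’re using the static configId to generate the UUID so that the same UUID is always used by each individual configuration.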
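To see why a name-based UUID suits this, here is a small sketch. The helper name config_uuid and the IDs 101 and 102 are made up for illustration; the uuid3 call is the same one used in the class.

```python
import uuid


def config_uuid(configId):
    """Derive a name-based (version 3) UUID from a config ID."""
    return str(uuid.uuid3(uuid.NAMESPACE_DNS, str(configId)))


first = config_uuid(101)
second = config_uuid("101")   # same ID, passed as a string this time
other = config_uuid(102)

print(first == second)  # the same ID always maps to the same UUID
print(first == other)   # a different ID maps to a different UUID
```

Because uuid3 hashes the namespace and the name, rerunning the ingestion script never mints a new UUID for an existing config, unlike the random uuid4.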
Next, we’ll need to create some files for our seed urls and match patterns. We’re going to create two files, seed-XXXXXX.txt and regex-urlfilter-XXXXXX.txt, where XXXXXX is the configId. For the seed files, we’ll create our own directory (called seeds), but for the regex files, we must store them in $NUTCH_HOME/runtime/local/conf in order for Nutch to find them (this is due to Nutch’s configuration of the Java CLASSPATH). First, we’ll set the filenames based upon configId (this goes in the __init__ function):
        self.regexFileName = 'regex-urlfilter-' + str(self.configId) + '.txt'
        self.seedFileName = 'seed-' + str(self.configId) + '.txt'
We also want to call the functions we are about to write here, so that when we call the class, we immediately run all the necessary functions to inject the config (again, in the __init__ function):
        # Run processes
        self._makeConfigDirectories()
        self._configureSeedUrlFile()
        self._copyRegexUrlfilter()
        self._configureRegexUrlfilter()
        self._prepInjection()
Next, we’ll set up the directories (the underscore at the beginning of the function name is a Python convention marking the method as internal to the class; it is not enforced by the language):
    def _makeConfigDirectories(self):
        if not os.path.exists('/path/to/nutch/runtime/local/conf/'):
            os.makedirs('/path/to/nutch/runtime/local/conf/')
        if not os.path.exists('/path/to/nutch/seeds/'):
            os.makedirs('/path/to/nutch/seeds/')
This simply checks to make sure the directories are there and makes them if they aren’t. Next, we’ll create the seed files:
    def _configureSeedUrlFile(self):
        # "w" creates the file or truncates an existing one; the with block closes it
        with open('/path/to/nutch/seeds/' + self.seedFileName, "w") as furl:
            for url in self.config["seedUrls"]:
                furl.write(url + "\n")
Basically, we are opening a file (or creating one if it doesn’t exist–this is how “w” functions) and writing each url from the JSON config to each line. We must end each url with a newline (\n) for Nutch to understand the file.
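If you want to try the seed file step without touching a real Nutch install, the sketch below does the same write into a temporary directory. The function name write_seed_file, the config ID, and the URLs are placeholders of my own, not part of the class above.

```python
import os
import tempfile


def write_seed_file(directory, configId, seedUrls):
    """Write one seed url per line and return the path of the file."""
    path = os.path.join(directory, 'seed-' + str(configId) + '.txt')
    with open(path, "w") as furl:
        for url in seedUrls:
            furl.write(url + "\n")
    return path


seed_dir = tempfile.mkdtemp()  # stands in for /path/to/nutch/seeds/
seed_path = write_seed_file(
    seed_dir, "101", ["http://example.com/", "http://example.com/blog/"]
)

# Read the file back to confirm there is exactly one url per line.
with open(seed_path) as f:
    seed_lines = f.read().splitlines()
print(seed_lines)
```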
Now we’ll make the regex file. We’ll do it in two steps so that we can take advantage of what Nutch has pre-built. We’re going to copy Nutch’s built-in regex-urlfilter.txt so that we can use all of its defaults and add any defaults we would like to all configs. Before we do that, we have an important edit to make to regex-urlfilter.txt: remove the +. rule from the end of the file in both /path/to/nutch/conf and /path/to/nutch/runtime/local/conf. We’ll add it back in the file ourselves, but if we leave it there, the filters won’t work at all because Nutch uses the first match when determining whether to fetch a url, and +. means “accept anything”. For our use, we’re going to add this back on the end of the file after we write our regex to it.
We’ll copy regex-urlfilter.txt in this function:
    def _copyRegexUrlfilter(self):
        frurl = '/path/to/nutch/conf/regex-urlfilter.txt'
        fwurl = '/path/to/nutch/runtime/local/conf/' + self.regexFileName
        copy2(frurl, fwurl)
Then, we write our filters from the config to it:
    def _configureRegexUrlfilter(self):
        notMatchPatterns = self.config["notMatchPatterns"]
        matchPatterns = self.config["matchPatterns"]
        regexUrlfilter = open('/path/to/nutch/runtime/local/conf/' + self.regexFileName, "a")
        # Exclusions go first because Nutch stops at the first matching rule
        if notMatchPatterns:
            for url in notMatchPatterns:
                regexUrlfilter.write("-^" + url + "\n")
        if matchPatterns:
            for url in matchPatterns:
                regexUrlfilter.write("+^" + url + "\n")
        # Accept anything that no earlier rule matched
        regexUrlfilter.write("+.\n")
        regexUrlfilter.close()
A few things are going on here: we are opening and appending to the file we just copied (that’s how “a” works) and then, for each “do not match” pattern we have, we are adding it to the file, followed by the match patterns. This is because, as we said before, Nutch will use the first regex match it gets, so exclusion needs to go first to avoid conflicts. We then write +. so that Nutch accepts anything else; you can leave it off if you would prefer Nutch exclude anything not matched, which is its default behavior.
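The ordering rule is easier to see with a toy model. What follows is a rough Python imitation of first-match-wins filtering, not Nutch’s actual Java implementation. The helpers build_rules and is_accepted and the example patterns are made up, and Python’s re module is only an approximation of Java regular expressions.

```python
import re


def build_rules(notMatchPatterns, matchPatterns, accept_rest=True):
    """Build rule lines in the order the post writes them to the file."""
    rules = ["-^" + p for p in notMatchPatterns]
    rules += ["+^" + p for p in matchPatterns]
    if accept_rest:
        rules.append("+.")
    return rules


def is_accepted(rules, url):
    """Apply rules top to bottom; the first pattern found in the url decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the url is rejected


exclude = [r"http://example\.com/blog/tag/"]
include = [r"http://example\.com/blog/"]
rules = build_rules(exclude, include)

tag_url = "http://example.com/blog/tag/python"
post_url = "http://example.com/blog/post-1"

print(is_accepted(rules, tag_url))   # excluded by the first rule
print(is_accepted(rules, post_url))  # accepted by the second rule

# With the include rule first, the tag page slips through.
wrong_order = ["+^" + include[0], "-^" + exclude[0]]
print(is_accepted(wrong_order, tag_url))
```

In this model, dropping the final +. rule (accept_rest=False) rejects any url that matches none of the patterns.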
As a quick side note, it is important to mention that designing it this way means that each time we inject our configuration into Nutch, we will be wiping out and recreating these files. This is the easiest pathway we found for implementation, and it affords no disadvantages except that you cannot manually manipulate these files in any permanent way. Just be aware.
Now that we have our files in place, the last thing we have to do is inject the configuration into Nutch itself. This will be our first use of the Nutchserver API. If you have not already, open a console on the server that hosts Nutch and run:
$ nutch nutchserver
Optionally, you can add a --port argument to specify the port, but we’ll use the default: 8081. Then we’ll prep the data for injection into the API:
    def _prepInjection(self):
        config = {}
        # Custom config values
        config["meta.config.configId"] = self.configId
        config["meta.config.configTitle"] = self.configTitle
        config["meta.config.seedFile"] = '/path/to/nutch/seeds/' + self.seedFileName
        # Crawl metadata
        config["nutch.conf.uuid"] = self.uuid
        # Crawl Config
        config["urlfilter.regex.file"] = self.regexFileName
        config["db.ignore.external.links"] = self.allowExternalDomains
        self._injectConfig(config)
Note that we are creating both our own custom variables for later use (we named them “meta.config.X”) and setting actual Nutch configuration settings. Another note: urlfilter.regex.file takes a string with the filename only. You CANNOT specify a path for this setting, which is why we store the regex files in /path/to/nutch/runtime/local/conf, where the CLASSPATH already points.
Lastly, we’ll do the actual injection. The self._injectConfig(config) at the end of the _prepInjection function starts injection:
    def _injectConfig(self, config):
        job = {"configId": self.uuid, "force": "true", "params": config}
        r = requests.post('http://localhost:8081/config/' + self.uuid, json=job)
        return r
All we do here is set up the JSON to push to the API and then inject. Every configuration we send to the API must have a UUID as its configId (which we will reference later when creating crawl jobs). We set force to true so that configurations will get overwritten when they change upstream, and then we pass in our configuration parameters.
We then use the requests Python module to make the actual injection. This is significantly easier than using something like cURL. We post to a url containing the uuid and have the JSON as the body (requests has a handy json argument that converts Python dictionaries to JSON before adding it to the body). Lastly, we return the post response for later use if needed.
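To inspect the request body without a running nutchserver, you can build the same dictionary and serialize it yourself. This sketch uses only the standard library; the build_job helper and the parameter values are hypothetical stand-ins for what _prepInjection assembles.

```python
import json
import uuid


def build_job(configId, params):
    """Assemble the body that the post sends as JSON."""
    config_uuid = str(uuid.uuid3(uuid.NAMESPACE_DNS, str(configId)))
    return {"configId": config_uuid, "force": "true", "params": params}


# Made-up values; the keys mirror the ones set in _prepInjection.
params = {
    "meta.config.configId": "101",
    "meta.config.configTitle": "Example Site",
    "urlfilter.regex.file": "regex-urlfilter-101.txt",
    "db.ignore.external.links": "true",
}

job = build_job("101", params)
body = json.dumps(job, indent=2, sort_keys=True)
print(body)
```

Printing the body like this is a cheap way to confirm that urlfilter.regex.file holds a bare filename and not a path before anything is sent to the server.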
And that’s it! We have successfully posted our dynamic custom configuration to nutchserver and created the relevant files. In the next post, we’ll show you how to crawl a site using these configurations.