As I regularly do, I was looking around for new datasets to explore and process with Surfiki, and I came across the following:
The Open Code – Digital Copy of DC’s Laws
As the author Tom MacWright mentions on his site:
“I couldn’t be happier to write that the project to bring DC’s laws into the digital era and put them in everyone’s hands made a big breakthrough: today you can download an unofficial copy of the Code (current through December 11, 2012) from the DC Council’s website. Not only that, but the licensing for this copy is officially CC0, a Creative Commons license that aims to be a globally-effective Public Domain designation.”
That sounds like a GREAT invitation. From my reading of his post, it seems this data was difficult to acquire: he mentions many people, much communication, and a lot of time all coming together to make it available to the public. He goes on to mention:
“What else is there to build? A great smartphone interface. Topic-specific bookmarks. Text analysis. Great, instant search. Mirrored archives to everywhere. Printable copies for DIY and for print-on-demand services. And lots more.
We’re not waiting till then to start though: dc-decoded is a project I started today to finish the long-awaited task of bringing DC’s laws into The State Decoded. The openlawdc organization on GitHub is dedicated to working on these problems, and it’s open: you can and should get access if you want to contribute.”
As Intridea is a DC-based firm, it made perfect sense for us to run this data through our own data intelligence processing engine, Surfiki. It is also a perfect opportunity to introduce the Surfiki Developers API, which we are making publicly available as of RIGHT NOW. However, we are manually approving users and apps as we move forward; this helps us plan future scaling and gives us better insight into the bandwidth required for concurrent and frequent developer operations. I encourage anyone and everyone to create an account. Eventually all requests will be granted, on a first-come, first-served basis.
I think it is best that I first explain how we processed the data from The Open Code – Digital Copy of DC’s Laws, followed by a general introduction to Surfiki and the Surfiki Developers API.
The initial distribution of The Open Code – Digital Copy of DC’s Laws was a set of Microsoft Word documents. This may have been updated to a more ingestion-friendly format by now, although I am not sure. The total number of documents was 51, ranging in size from 317K to 140MB. You may think, “Hey, that’s NOT a lot of data”… Sure, that’s true, but I don’t think it matters much for this project. From what I gather, it was important to just get the data out there and available, regardless of size. Besides, Surfiki doesn’t throw fits over small data or big data anyway.
The first order of business was converting these Microsoft Word documents to text. While Surfiki can indeed read Microsoft Word documents directly, it generally takes a little longer, so any preprocessing is a good thing to do. Here is a small Python script that will convert the documents.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Convert the .doc files in docs/ to plain text files in textdocs/ using catdoc.
import glob, re, os

f = glob.glob('docs/*.doc') + glob.glob('docs/*.DOC')
outDir = 'textdocs'
if not os.path.exists(outDir):
    os.makedirs(outDir)

for i in f:
    # catdoc -w emits the document as unwrapped plain text; the output file
    # keeps the original name with a .txt extension.
    os.system("catdoc -w '%s' > '%s'" % (i, outDir + '/' + re.sub(r'.*/([^.]+)\.doc', r'\1.txt', i, flags=re.IGNORECASE)))
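Note that this script shells out to the catdoc command-line utility (the -w flag disables line wrapping), so catdoc needs to be installed and on your PATH before running it.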
With that completed, we now have them as text files. I decided to take a peek into the text files and noticed that there are a lot of “END OF DOCUMENT” lines, which I assume mark individual documents within the larger containing document. (I know, I know… genius assumption.)
This “END OF DOCUMENT” looks like the following:
For legislative history of D.C. Law 4-171, see Historical and Statutory Notes following § 3-501. DC CODE § 3-502 Current through December 11, 2012 END OF DOCUMENT
From my initial count script, there are about 19K lines that read “END OF DOCUMENT”, so I want to split these up into individual files. The reason is that I want Surfiki to process them as distinct documents for search and trending purposes. With the following Python script, I split them into individual documents and also cleaned out the ‘§’ character. Note: Surfiki uses both structured and unstructured storage for all data, for business reasons as well as for redundancy. On the business side, structured storage allows us to connect with common enterprise offerings such as SQL Server, Oracle, etc. for data consumption and propagation. As for redundancy, since we persist all data concurrently in both mediums for a period of time, if a process fails we can recover within seconds and the workflow can resume unimpeded.
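As for the initial count script mentioned above, it isn’t included here; a minimal stand-in could look something like this (assuming the converted files live in the textdocs directory created earlier):

#!/usr/bin/env python
# Rough stand-in for the "initial count script" mentioned above (not the
# original script): counts lines reading "END OF DOCUMENT" across the
# converted text files.
import glob

count = 0
for path in glob.glob('textdocs/*.txt'):
    with open(path) as docfile:
        for line in docfile:
            if line.startswith("END OF DOCUMENT"):
                count += 1
print(count)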
Note: docnames.txt is just a static list of the initial text documents converted from Microsoft Word documents. I chose that method rather than walking the path.
#!/usr/bin/env python
# -*- coding: utf8 -*-

# Split each converted text file into individual documents, breaking on
# "END OF DOCUMENT" lines and stripping the '§' character along the way.

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

with open('docnames.txt') as f:
    set_count = 0
    for lines in f:
        filename = str(lines.rstrip())
        with open(filename, mode="r") as docfile:
            file_count = 0
            smallfile_prefix = "File_"
            smallfile = open(smallfile_prefix + str(file_count) + "_" + str(set_count) + '.txt', 'w')
            for line in docfile:
                reps = {'§': ''}
                line = replace_all(line, reps)
                if line.startswith("END OF DOCUMENT"):
                    # Close the current document and start the next one.
                    smallfile.close()
                    file_count += 1
                    smallfile = open(smallfile_prefix + str(file_count) + "_" + str(set_count) + '.txt', 'w')
                else:
                    smallfile.write(line)
            smallfile.close()
        set_count += 1
After the above processing (a few seconds), I now have over 19K files, ranging from 600B to 600K. PERFECT! I am ready to push these to Surfiki.
It’s important to understand that Surfiki works with all types of data and all kinds of data locations: web data (pages, feeds, posts, comments, Facebook, Twitter, etc.), static data locations such as file systems and storage buckets, as well as streams and databases. In this case, we are working with static data: text documents in a cloud storage bucket. Without getting too detailed about the mechanism I use to push these files, on a basic level they are pushed to the bucket, an agent is watching it, and once files start arriving the processing begins.
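Purely as an illustration (the actual push mechanism isn’t detailed here), a minimal sketch of copying the split files into a storage bucket might look like the following; the use of boto3/S3 and the bucket name are my own assumptions, not necessarily how Surfiki ingests data:

#!/usr/bin/env python
# Hypothetical illustration only: push the split text files to a cloud
# storage bucket via boto3/S3. The bucket name and the choice of S3 are
# assumptions, not the actual Surfiki ingestion mechanism.
import glob
import os
import boto3

s3 = boto3.client('s3')
bucket = 'example-dc-code-bucket'  # hypothetical bucket name

for path in glob.glob('File_*.txt'):
    s3.upload_file(path, bucket, os.path.basename(path))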
Since this is textual information, the type of processing is important. In this case I want to use Surfiki’s standard set of NLP text processing, rather than any customized algorithms such as specific topic-based classifiers, statistical classifiers, etc. The following is what will be processed within Surfiki for this data (a rough sketch of how a couple of these metrics can be computed follows the list):
- Sentiment – Positive, Negative and Neutral
- Perspective – Objective or Subjective
- Gunning Fog Index
- Reading Ease
- Lexical Density
- Counts including: words/sentence, syllables/sentence, syllables/word
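To make a couple of the metrics above more concrete, here is a rough sketch of the Gunning Fog index and the per-sentence/per-word counts. The sentence splitting and syllable counting are naive approximations for illustration only; this is not Surfiki’s internal implementation.

# Rough sketch of two of the metrics listed above; not Surfiki's internal
# implementation. Sentence splitting and syllable counting are naive
# approximations for illustration only.
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def text_stats(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = float(len(words)) / len(sentences)
    # Gunning Fog: 0.4 * (words per sentence + 100 * complex words / words)
    fog = 0.4 * (words_per_sentence + 100.0 * len(complex_words) / len(words))
    return {
        'words_per_sentence': words_per_sentence,
        'syllables_per_word': float(syllables) / len(words),
        'gunning_fog': fog,
    }

print(text_stats("The quick brown fox jumps over the lazy dog. It was not amused."))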
As well, we provide the following for all data (a small n-gram counting sketch follows the list):
- Keywords (Literal) – Literal extraction of keywords
- Keywords (Semantic) – Semantic conceptual generation of keywords
- Trends – n-grams – Uni, Bi and Tri
- Trends Aggregate – n-gram weighted distribution
- Graph – n-gram relationships/time (Available likely on Monday)
- Time – Insert and document time extraction
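For a sense of what uni/bi/tri-gram counting looks like at its simplest, here is a small illustration; the real Surfiki trending pipeline aggregates and weights these over time, which this sketch does not attempt.

# Rough illustration of uni/bi/tri-gram counting; not Surfiki's internal
# trending pipeline.
from collections import Counter
import re

def ngrams(tokens, n):
    # Slide an n-sized window over the token list.
    return zip(*[tokens[i:] for i in range(n)])

text = "the council of the district of columbia the council"
tokens = re.findall(r"[a-z']+", text.lower())

for n in (1, 2, 3):
    counts = Counter(ngrams(tokens, n))
    print("%d-grams: %s" % (n, counts.most_common(3)))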
These will all be available in the Surfiki Developers API.
Once you go over to Surfiki Developers API and read through the documentation, you will find how simple it is to use. We are adding datasets on a regular basis so please check back often. As well, our near real-time Surface Web data is always available, as are previously processed data sets. If you have ideas or even data sets we should look at, please just let us know by submitting a comment on this post.
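Just to give a flavor of what calling such a REST API typically looks like, here is a hypothetical request; the endpoint URL, parameters, and response shape below are placeholders made up for illustration and are not the actual Surfiki Developers API, so consult the documentation for the real details.

# Hypothetical illustration only: the endpoint URL, parameters, and response
# fields below are placeholders, NOT the actual Surfiki Developers API.
# See the Surfiki Developers API documentation for the real endpoints.
import requests

API_KEY = "your-api-key"  # issued once your account/app is approved
resp = requests.get(
    "https://api.example.com/datasets/dc-code/trends",  # placeholder URL
    params={"ngram": "bi", "limit": 10},
    headers={"Authorization": "Bearer " + API_KEY},
)
print(resp.json())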
If you want to contact us about using Surfiki within your organization, that would be great as well. We can put it behind your firewall, or operate it in the cloud. It is available for most architectural implementations.