On Friday I attended the opening event of the "PDF Liberation Hackathon." Occurring simultaneously in several cities across the United States, this was an effort to write code that would extract useful information from PDFs without requiring a tremendous amount of manual processing.
I have long looked askance at the ubiquitous PDF, for philosophical more than technical reasons. PDFs are replicas of printed paper documents, and as such they reinforce notions of printing and publication that existed prior to the Web. Although we can be confident that a PDF will print cleanly and read easily, there is nothing innovative or interactive about the format. For this reason, the idea of "liberation" is exciting. I am not a coder and was not planning to stay for the entire weekend, but I wanted to learn what the coders were thinking and to promote my idea for more interactive PDFs.
A gentleman named Marc Joffe, who runs an organization devoted to analyzing the risks of government bonds, organized the event. He began by going over a helpful slide deck that is posted to Slideshare.
From this presentation I learned that there are three main ways to analyze PDFs:
- Optical character recognition (OCR) for PDFs of documents that were originally printed on paper. If paper is the original source, there is no embedded metadata to process.
- Metadata extraction from born-digital PDFs (a minimal sketch of this appears after the list).
- Transforming unstructured text and numbers into a structured form that can be analyzed. A related IT concept here is extract-transform-load.
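To make the second category concrete for myself, here is what a minimal sketch of pulling the embedded metadata and raw text out of a born-digital PDF might look like. It assumes the open-source pypdf library and a filename of my own invention; it is only an illustration, not anything the hackathon teams actually built.

```python
# Minimal sketch: metadata and text extraction from a born-digital PDF.
# Assumes the open-source pypdf library (pip install pypdf) and a local
# file named "report.pdf" -- both are illustrative choices.
from pypdf import PdfReader

reader = PdfReader("report.pdf")

# Embedded metadata (title, author, creation date, etc.), when present.
info = reader.metadata
print("Title:", info.title if info else None)
print("Author:", info.author if info else None)

# Raw text from each page -- the unstructured input that an
# extract-transform-load step would then try to give structure to.
for page in reader.pages:
    print(page.extract_text())
```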
The last category, turning unstructured data into something meaningful, is very intriguing. One of the promises of large-scale digitization efforts like Google Books is text mining on a scale never before possible. This would enable discerning trends and patterns in word usage over time, in a way impossible for even the most dedicated scholar to detect.
From my understanding of Joffe's overview, mining that corpus would require OCR first, since the publications Google scanned were originally printed. After that, the resulting text could be analyzed using "extract-transform-load" techniques.
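As a rough illustration of that first OCR step, the sketch below renders a scanned PDF's pages to images and runs them through an OCR engine. It assumes the pdf2image and pytesseract packages (which in turn wrap the poppler and tesseract command-line tools); the filename is hypothetical.

```python
# Rough sketch of the OCR step for a PDF whose original source was paper.
# Assumes pdf2image and pytesseract, which require the poppler and
# tesseract tools to be installed; "scanned_book.pdf" is hypothetical.
from pdf2image import convert_from_path
import pytesseract

# Render each PDF page to an image, then run OCR on that image.
pages = convert_from_path("scanned_book.pdf", dpi=300)
for i, page_image in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page_image)
    print(f"--- page {i} ---")
    print(text)
```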
Through this overview I gained a better appreciation of the techniques that may underpin many "big data" analyses. The data source at the Hackathon was PDFs, but there is also a tremendous amount of unstructured text that is published straight to the Web. Extract-transform-load could be a useful way to work with this material as well.
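A toy version of such a pipeline might look like the sketch below: extract the raw HTML of a page, transform it into plain-text paragraphs, and load those rows into a small database where they can be queried. The URL, the table layout, and the choice of requests, BeautifulSoup, and SQLite are all illustrative assumptions on my part.

```python
# Toy extract-transform-load pass over text published straight to the Web.
# Assumes the requests and beautifulsoup4 packages; the URL and the
# "paragraphs" table are hypothetical examples.
import sqlite3

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/some-report"  # illustrative only

# Extract: pull down the raw, unstructured HTML.
html = requests.get(URL, timeout=30).text

# Transform: strip the markup down to plain-text paragraphs.
soup = BeautifulSoup(html, "html.parser")
rows = [(URL, i, p.get_text(strip=True))
        for i, p in enumerate(soup.find_all("p"))
        if p.get_text(strip=True)]

# Load: store the structured rows somewhere they can be queried later.
conn = sqlite3.connect("corpus.db")
conn.execute("CREATE TABLE IF NOT EXISTS paragraphs (url TEXT, n INTEGER, text TEXT)")
conn.executemany("INSERT INTO paragraphs VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()
```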
I am enough of a humanist to be skeptical of over-reaching claims about the explanatory power of big data. There's a seductive aspect to thinking that data mining--of which text mining is one form--can answer the biggest questions for us. The truth is that we are still responsible for knowing ourselves and growing into better people. Sure, big data will be useful for commercial applications...but less so for the more spiritual or philosophical aspects of life.
But I now have a more open mind about this. After Joffe completed his slide presentation, talk turned to the possibility of conducting "sentiment analysis" on the content in PDFs. In my conception, this means understanding the reason why something was written or an action was taken. A few weeks ago I wrote about the ambiguous reasons behind the "likes" on my Facebook pictures--in essence, I was yearning for sentiment analysis!
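In practice, the most common form of sentiment analysis scores the emotional tone of a passage rather than the reasons behind it, but a minimal sketch gives a flavor of what the hackathon conversation was pointing at. This one assumes NLTK's VADER lexicon and a couple of invented sentences.

```python
# Minimal sentiment-analysis sketch using NLTK's VADER lexicon, which
# scores emotional tone (negative/neutral/positive) rather than the
# underlying reasons a passage was written. The sentences are invented.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the scoring lexicon
sia = SentimentIntensityAnalyzer()

for sentence in [
    "The bond issuer met every obligation ahead of schedule.",
    "The audit found troubling gaps in the financial statements.",
]:
    scores = sia.polarity_scores(sentence)
    # "compound" runs from -1 (most negative) to +1 (most positive).
    print(sentence, scores["compound"])
```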
So perhaps this is one way to get at the deeper truths lying in big data. Sentiment analysis could be a useful filter for an otherwise unapproachable amount of content, a curator to help determine where to focus one's attention. There is no way to shortcut the hard work of growing into a fuller person, but there's no harm in taking more efficient paths to what will help you get there.
----
For those interested, winners of the hackathon are posted here.