Data comes in many formats, some more useful than others. Many researchers -- particularly those who work in archives -- have to convert images or PDFs of text into usable, editable text using optical character recognition (OCR). Adobe Acrobat has very basic OCR functionality, which may be sufficient for clearly typed text in English or a small handful of other languages. If your document has a complex layout (e.g. tables or columns) or text formatting, or uses languages not supported by Adobe Acrobat, ABBYY FineReader may be a better option.

ABBYY FineReader is currently available for use on the D-Lab Collaboratory computers. If you need access to it outside of the D-Lab, please sign up to be a test user for the OCR experimental virtual research desktop, offered through Research IT’s Analytics Environments on Demand (AEoD) Service. This experimental virtual research desktop, available through March 2017, can be accessed from your own laptop, anywhere with an internet connection.
 
When you sign up to be a test user, you’ll get access to the virtual research desktop, the sign-up calendar, and the documentation for how to use it. Your feedback will play an important role in determining whether and how the OCR desktop will be supported beyond March 2017, so please email Quinn (quinnd@berkeley.edu) with any comments.

ABBYY FineReader supports 190 languages, 48 of them with dictionary support to improve recognition. A few caveats: it does not recognize handwriting, and recognition tends to be poor when text is embedded in images (e.g. map labels).

Author: Quinn Dombrowski

Quinn Dombrowski is the Digital Humanities Coordinator in the Research IT group. She is the director of the DiRT Directory of digital research tools, and is writing a book on Drupal for humanists.