Thursday, February 3, 2011

ABBYY

If I could recount my first ABBYY OCR experience in the tone of a survival log, it would go something like this:


Day 1: Today I awoke, confused and not without a vague fear, on some alien beach. Though, I survived the crash. I guess I should count myself fortunate. While I am destitute and lost, I am ALIVE. Though what kind of life awaits?
Day 2: Upon initial survey of my beach (as I have come to call it) and its close environs, my spirits are much improved. This place is a veritable Eden and if I am to be a lone Adam, at least I will be just as well nourished. Tomorrow I will build a shelter and stake my claim on this place in earnest.
Day 3: Disaster! The fruit I collected is poisonous. I am doubly wrecked. On top of these misfortunes, my shelter is slow in coming and a storm is on the horizon.
Day 4:  ...Why me!? All is lost...

Maybe that was a bit dramatic. Anyways, here's my evaluation of my first experience OCRing about 70 typed pages. It started off pretty well: I ran through the pages I had previously scanned, using the spell checker to review and adjust the low-confidence characters. Then I realized that ABBYY assigns different classes to the features of a scanned object. So, ABBYY will sometimes see a header or a title and assign it a "title text" value and then assign the body a "body text" value. Fine. Except that text considered "title" is ascribed a value that makes it UNMOVABLE. I am sure that there is a perfectly good reason for this...no, no, it's really stupid. Why would anyone, ever, want static text on a computer? Plus, it's inconsistent: on some pages it will identify a title text box, and on others it will treat the entire document as body text.

Words are hardly adequate to express my frustration after I completed OCRing a 37-page document only to discover that the "title" text was completely useless. When converting the text to a Word document, ABBYY passes the "title" text's opaque and untouchable (to me) properties over. This has dire effects on page and text position. Sometimes body text would be superimposed over the title text. So I had to go back and repeat about an hour of work to hack the document into editable form. I was also unsuccessful at finding a way to default all character types to "body" to automate and expedite the process.

I know most of my gripes grow out of my ignorance of and inexperience with the software. I will say, though, that even when I did find a good routine for processing the text, I still felt like I was hacking the software--fighting with it to make it do what I wanted. Not intuitive. Of course, when I say hacking, I mean it in the traditional sense of the word that implies curious, healthy exploration of technology, not the sense that people who drink deeply of the Fox News Kool-Aid would understand. I love to hack, but not when I have a pile of menial labor to grind through. I felt like an operator removed, like I was driving ABBYY from the back seat with broomsticks and mirrors.

Not fun...

The transcripts, however... What wonderful tales of oil conquest, labor issues, drunk dogs, Howard Hughes and snakes. Reading the transcripts, I felt a distant connection with my departed paternal grandparents. They were not involved in the oil industry, but lived in Amarillo. Their reason, sensibility, practicality and ingenuity were detectable in the rhythm and spirit of speech of the various transcribed conversations. The experience brought me face to face with the importance of this work--albeit for a personal reason. So, ABBYY be damned, I will soldier on.
