A little code can prevent a lot of OCR typos


I’m Ben Denckla and this page advertises my ebook consulting services. My services center on writing custom software to create ebooks from sources, that is, from the files that were used to create the paper book.

If you are considering OCR because it is not clear how to turn your sources into an ebook, I may be able to help.

Below I review some of your options, including my services.

OCR is far from perfect

Typically, even when sources are available, an ebook is created by running OCR (optical character recognition) software on a scan of a paper book. OCR works amazingly well, but it is far from perfect. As a result, most ebooks created this way contain typos introduced by OCR.

If you are interested in the details, here is a link to a page I am working on that classifies, discusses, and gives examples of many types of OCR errors.

Is OCR good enough?

Below are some questions to consider in deciding whether OCR is good enough for a particular book.

Adding proofreading and spell-checking to OCR

Most books are proofread and spell-checked before they are printed on paper. But the economics of ebooks seem to make it too expensive, in most cases, to do P&S (proofreading and spell-checking) a second time for an ebook created using OCR.

Replacing OCR with conversion from sources

To avoid OCR entirely, I write custom software to create ebooks from sources. By “sources” I mean whatever computer files were used to create the paper book.

For a quote, write to me (Ben Denckla) at EbooksFromSources@yahoo.com.

Can you use generic software?

In many cases, you don’t need custom software to create an ebook from sources. For example, recent versions of Adobe InDesign have features to create ebooks. So if your paper book was created using InDesign, it makes sense to try to use InDesign to create your ebook.

What I specialize in is “middle-aged” books: those produced within the digital era, but not recently enough to use off-the-shelf software that can create ebooks.

Won’t it cost even more than OCR + P&S?

The first time you do it, custom software can be more expensive than OCR + P&S. But if you want to convert several books that share the same source file format, it can be cheaper, because the software is written once and reused for every book. So, whether or not custom conversion software beats OCR + P&S depends on the answers to the following questions.

In addition, even if you have only one book to convert, I may have already developed software to convert its format or something close to it. Or, even if I’ve never seen something like your format, I may choose to absorb the fixed cost for you if I think I can use the code in the future.

Finally, keep in mind that OCR + P&S is a lot closer to perfect than OCR alone, but it is still not perfect. To guarantee that certain types of errors will not appear, you must create the ebook from sources. So, directly comparing the cost of OCR + P&S to the cost of custom conversion from sources is not totally fair, since the benefits of custom conversion are greater.
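
The cost trade-off in this section can be sketched as a simple break-even calculation. This is only an illustration under assumed simplifications: a one-time fixed cost for writing the conversion software, a constant per-book cost for each approach, and made-up dollar figures. The function name and its parameters are hypothetical, not part of any quote.

```python
import math


def break_even_books(fixed_cost, custom_per_book, ocr_ps_per_book):
    """Smallest number of books at which custom conversion costs no more
    than OCR + P&S, or None if it never catches up.

    Assumed cost model (a simplification, not a real price list):
      custom conversion: fixed_cost + n * custom_per_book
      OCR + P&S:         n * ocr_ps_per_book
    """
    saving_per_book = ocr_ps_per_book - custom_per_book
    if saving_per_book <= 0:
        # Custom conversion saves nothing per book, so the fixed
        # cost is never recovered.
        return None
    # At least one book is needed even if the fixed cost is zero.
    return max(1, math.ceil(fixed_cost / saving_per_book))


# Made-up figures, for illustration only.
print(break_even_books(fixed_cost=5000, custom_per_book=200, ocr_ps_per_book=1200))
# If the fixed cost is absorbed (set to 0), one book is enough.
print(break_even_books(fixed_cost=0, custom_per_book=200, ocr_ps_per_book=1200))
```

Note that this compares cost only. It leaves out the quality difference described above, which favors conversion from sources.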

Sources are only close to perfect

Even using sources guarantees only that certain types of errors will not appear. Some types of errors are latent within the sources, even though they do not appear in the paper book. These errors mainly have to do with line breaks. Only P&S can catch these errors, and, since P&S is a human endeavor, it is of course not guaranteed to catch them all.
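
As an illustration only, here is one kind of line-break problem that software can flag but not resolve. It is an assumed example: it supposes a plain-text source with hard line breaks, in which a hyphen at the end of a line might be either a real hyphen (“well-known”) or just a break inside a word (“conversion”). The pattern and function names are hypothetical.

```python
import re

# Letters, a hyphen, a line break, then more letters: a word that was
# split (or a hyphenated compound that happened to break) at a line end.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)")


def flag_line_end_hyphens(text):
    """Return (first_half, second_half) pairs for a human to review.

    The code cannot decide whether the hyphen should be kept or
    dropped when the halves are joined; it only lists the candidates.
    """
    return [(m.group(1), m.group(2)) for m in LINE_END_HYPHEN.finditer(text)]


sample = "The well-\nknown author used conver-\nsion software.\n"
for first, second in flag_line_end_hyphens(sample):
    print(f"{first}-{second} or {first}{second}?")
```

A list like this narrows down where a proofreader needs to look, but the decision for each pair is still a human one.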

My areas of expertise

Though I’m game to consider any job, here are some particular areas of expertise I have.


