OCR binarization and image pre-processing for searching historical documents

Bibliographic Details
Published in:	Pattern Recognition, Vol. 40, no. 2, pp. 389-397
Main Authors:	Gupta, Maya R., Jacobson, Nathaniel P., Garcia, Eric K.
Format:	Journal Article
Language:	English
Published:	Oxford: Elsevier Ltd / Elsevier Science, 01-02-2007
Subjects:	Applied sciences; Binarization; Binary image; Despeckling; Detection, estimation, filtering, equalization, prediction; Dithering; Error diffusion; Exact sciences and technology; Filtering; Halftoning; Image processing; Implementation; Information, signal and communications theory; Keyword; Miscellaneous; Multiresolution analysis; Multiresolutional OCR; Noise reduction; Optical character recognition; Pattern recognition; Performance evaluation; Printed document; Signal and communications theory; Signal processing; Signal, noise; Telecommunications and information theory

Description
Summary:	We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK is used as a black box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average.
ISSN:	0031-3203 1873-5142
DOI:	10.1016/j.patcog.2006.04.043
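
For readers unfamiliar with the "classic Otsu method" named in the summary, the following is a minimal sketch of global Otsu thresholding in Python with NumPy. It is an illustration only: the record does not describe the authors' implementation, and this is not their multiresolutional variant, their error diffusion binarization, or their despeckling step. The function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be an 8-bit grayscale page image with dark text on a lighter background.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level t (0..255) that maximizes between-class variance.

    Assumes `gray` is a uint8 array; pixels <= t form one class, pixels > t the other.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()

    omega = np.cumsum(prob)                 # probability of the class "<= t"
    mu = np.cumsum(prob * np.arange(256))   # cumulative first moment up to t
    mu_total = mu[-1]                       # global mean gray level

    # Between-class variance for every candidate t; undefined where one class is empty.
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 1e-12
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]

    return int(np.argmax(sigma_b))


def binarize(gray: np.ndarray) -> np.ndarray:
    """Return a boolean image: True for background (light), False for ink (dark)."""
    return gray > otsu_threshold(gray)


if __name__ == "__main__":
    # Synthetic "page": dark text pixels (50) on a light background (200).
    page = np.full((8, 8), 200, dtype=np.uint8)
    page[2:6, 2:6] = 50
    print(otsu_threshold(page))
    print(binarize(page).astype(int))
```

A single global threshold like this one ignores local variation in paper tone and ink density, which is consistent with the summary's observation that performance varies widely from image to image.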