File Text Recognition and Management System Based on Tesseract-OCR

Bibliographic Details
Published in:	2021 3rd International Conference on Applied Machine Learning (ICAML), pp. 236-239
Main Authors:	Ma, Tao; Yue, Min; Yuan, Chao; Yuan, Haibo
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-07-2021
Subjects:	Archives; Elasticsearch; File systems; Full-Text Search; Gray-scale; Handwriting recognition; Image recognition; Machine learning; OpenCV; Tesseract; Text recognition; Training

Description
Summary:	Drawing on research into image preprocessing techniques, this paper designs and implements a web-based archive file recognition and management system built on the open-source Tesseract character recognition engine. The system first preprocesses each image by converting it to grayscale and binarizing it. Second, to improve recognition accuracy on handwritten content, we trained Tesseract's text recognition library. Finally, the characters are recognized and stored for later use. Archivists can use the system to convert paper documents into electronic documents, which can significantly improve the management and digitization efficiency of the file system.
DOI:	10.1109/ICAML54311.2021.00057
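
The preprocessing steps named in the summary (grayscale conversion, then binarization) can be illustrated with a minimal sketch. The record does not say which binarization method the paper uses, so Otsu's threshold is an assumption here, as are the BT.601 luma weights and all function names (`to_grayscale`, `otsu_threshold`, `binarize`, `ocr_image_file`), which are hypothetical. The first three functions are dependency-free so that the arithmetic is visible. The last shows how the same steps might be chained with OpenCV and pytesseract, assuming both packages and a Tesseract installation are available.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def to_grayscale(rgb_rows: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to gray levels using BT.601 luma weights (an assumption)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and intensity sum of the class <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray: List[List[int]], threshold: int) -> List[List[int]]:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if v > threshold else 0 for v in row] for row in gray]


def ocr_image_file(path: str, lang: str = "eng") -> str:
    """Hypothetical end-to-end helper: grayscale, Otsu binarization, then Tesseract.

    Imports are local so the functions above work without OpenCV or pytesseract.
    """
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang)
```

For a scanned page, `ocr_image_file("page.png")` would return the recognized text as a string, which could then be stored or indexed. The file name is illustrative only, and handwriting accuracy would depend on the trained language data passed through `lang`.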