Paper
8 December 2023
Data extraction from scanned invoice documents in multiple languages
Proceedings Volume 12943, International Workshop on Signal Processing and Machine Learning (WSPML 2023); 1294318 (2023) https://doi.org/10.1117/12.3019910
Event: International Workshop on Signal Processing and Machine Learning (WSPML 2023), 2023, Hangzhou, ZJ, China
Abstract
This work provides an open-source method for extracting relevant information from scanned documents, such as bills, bank account statements, and invoices. The solution supports documents in 10 different languages and can extract data from them irrespective of their template or structure. Existing solutions based on OpenCV and deep learning are available, but none provides a generic solution with high accuracy and support for multiple languages. The proposed method identifies the language of the input document using a pre-trained fastText model. The document is segmented into text regions using the Run Length Smoothing Algorithm (RLSA). The output of RLSA is passed through a custom pattern recognition algorithm that selects the regions likely to contain data relevant to invoices or account statements. The selected segments are passed through the Tesseract OCR module for raw text extraction. Based on the identified language of the document, the extracted raw text is mapped against language-specific entity libraries, and the final key-value pairs are stored in JSON or CSV files. Tested on more than 1000 documents, the proposed solution achieved an average accuracy of 90.27% across all supported languages.
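The segmentation step named in the abstract can be illustrated with a minimal sketch of the classic Run Length Smoothing Algorithm. The abstract does not state which RLSA variant or thresholds the authors use, so everything here is an assumption for illustration: the function names `smooth_runs` and `rlsa`, the parameters `h_gap` and `v_gap`, the convention that 1 marks ink and 0 marks background, and the simplification that only gaps bounded by ink on both sides are filled. The sketch smooths each row and each column separately, then combines the two results with a logical AND, as in the textbook formulation.

```python
import numpy as np


def smooth_runs(line: np.ndarray, max_gap: int) -> np.ndarray:
    """Fill background runs of length <= max_gap lying between two ink pixels."""
    out = line.copy()
    ink = np.flatnonzero(line)  # positions of ink pixels (non-zero values)
    for left, right in zip(ink[:-1], ink[1:]):
        gap = right - left - 1  # number of background pixels between neighbours
        if 0 < gap <= max_gap:
            out[left + 1:right] = 1
    return out


def rlsa(binary: np.ndarray, h_gap: int, v_gap: int) -> np.ndarray:
    """Horizontal and vertical run-length smoothing combined with logical AND.

    `binary` is a 2-D uint8 array with 1 for ink and 0 for background
    (hypothetical convention; a real page would first be thresholded).
    """
    horizontal = np.array([smooth_runs(row, h_gap) for row in binary])
    vertical = np.array([smooth_runs(col, v_gap) for col in binary.T]).T
    return horizontal & vertical


if __name__ == "__main__":
    # Toy "page": a ring of ink around a single background pixel.
    page = np.array(
        [[1, 1, 1],
         [1, 0, 1],
         [1, 1, 1]],
        dtype=np.uint8,
    )
    print(rlsa(page, h_gap=1, v_gap=1))
```

In a pipeline like the one described, connected regions of the smoothed image would then be treated as candidate text blocks before filtering and OCR; suitable gap thresholds depend on scan resolution and font size and would need tuning.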
(2023) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Nakul Aggarwal, Swarnalata Patra, Snehlata Sinha, Amardeep Jaiman, and Debasmita Ghosh "Data extraction from scanned invoice documents in multiple languages", Proc. SPIE 12943, International Workshop on Signal Processing and Machine Learning (WSPML 2023), 1294318 (8 December 2023); https://doi.org/10.1117/12.3019910