Russian Digital Libraries Journal

Data Extraction from Similarly Structured Scanned Documents

https://doi.org/10.26907/1562-5419-2021-24-4-667-688

Abstract


Most transmitted and stored data today is unstructured, and its volume grows rapidly every year, even though such data is hard to search, cannot be queried directly, and is not processed automatically. At the same time, electronic document management systems are becoming more widespread. This paper proposes a solution for extracting data from photographs of paper documents that takes their structure and layout into account. We examine several approaches, including neural networks and plain algorithmic methods, and present and discuss their results.

About the Authors

R. D. Saitgareev
Kazan (Volga region) Federal University
Russian Federation


B. R. Giniyatullin
Kazan (Volga region) Federal University
Russian Federation


V. Y. Toporov
Kazan (Volga region) Federal University
Russian Federation


A. A. Atnagulov
Kazan (Volga region) Federal University
Russian Federation


F. R. Aglyamov
Kazan (Volga region) Federal University
Russian Federation


For citations:


Saitgareev R.D., Giniyatullin B.R., Toporov V.Y., Atnagulov A.A., Aglyamov F.R. Data Extraction from Similarly Structured Scanned Documents. Russian Digital Libraries Journal. 2021;24(4):667-688. (In Russ.) https://doi.org/10.26907/1562-5419-2021-24-4-667-688


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1562-5419 (Online)