References

ellibs

Электронные библиотеки

Russian Digital Libraries Journal

1562-5419

Казанский (Приволжский) федеральный университет

10.26907/1562-5419-2021-24-4-667-688

ellibs-292

Research Article

Articles

Извлечение данных из сканированных документов со сходной структурой

Data Extraction from Similarly Structured Scanned Documents

Саитгареев

Р. Д.

Saitgareev

R. D.

srustem3@yandex.ru

Гиниятуллин

Б. Р.

Giniyatullin

B. R.

bulat.giniiatullin@gmail.com

Топоров

В. Ю.

Toporov

V. Y.

vladislavtoporov@gmail.com

Атнагулов

А. А.

Atnagulov

A. A.

i@atnartur.ru

Аглямов

Ф. Р.

Aglyamov

F. R.

aglyamov.fox@gmail.com

Казанский (Приволжский) Федеральный университет

Kazan (Volga region) Federal University

2021

28.08.2021

V. 24, No. 4. P. 667–688

2021

Саитгареев Р.Д., Гиниятуллин Б.Р., Топоров В.Ю., Атнагулов А.А., Аглямов Ф.Р.

Saitgareev R.D., Giniyatullin B.R., Toporov V.Y., Atnagulov A.A., Aglyamov F.R.

Данная работа распространяется под лицензией Creative Commons Attribution 4.0.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://ellibs.elpub.ru/jour/article/view/292

A large part of the data transmitted and stored today is unstructured, and its volume grows rapidly every year, even though such data is hard to search, cannot be queried, and is generally not processed automatically. At the same time, electronic document management systems are becoming more widespread. This paper proposes a tool for extracting data from photographs of paper documents that takes their structure and layout into account. We report the results of the approaches we tested, including neural networks and a plain algorithmic method, and analyze those results.

neural networks, machine learning, structure extraction, document structure extraction, unstructured data, text recognition, OCR

References

Развитие электронного документооборота в России. Статистика, факты, перспективы [Development of electronic document management in Russia: statistics, facts, prospects] // Taxcom. URL: https://taxcom.ru/baza-znaniy/elektronnyy-dokumentooborot/stati/razvitie-elektronnogo-dokumentooborota-v-rossii-statistika-fakty-perspektivy/ (accessed 24.02.2021).

СЭД (рынок России) [EDMS (Russian market)] // TAdviser. URL: https://www.tadviser.ru/index.php/Статья:СЭД_(рынок_России) (accessed 08.03.2021).

AI Unleashes the Power of Unstructured Data // CIO. URL: https://www.cio.com/article/3406806/ai-unleashes-the-power-of-unstructured-data.html (accessed 23.03.2021).

Structured vs. Unstructured Data // Datamation. URL: https://www.datamation.com/big-data/structured-vs-unstructured-data/ (accessed 23.03.2021).

Structured and Unstructured Documents: What are the Differences? // Optiform. URL: https://www.optiform.com/news/structured-unstructured-documents/ (accessed 23.03.2021).

McKendrick J. The Post-Relational Reality Sets in: 2011 Survey on Unstructured Data // Unisphere Research. 2011.

Rusu O. et al. Converting unstructured and semi-structured data into knowledge // 2013 11th RoEduNet International Conference. IEEE, 2013. P. 1–4.

Mori S., Suen C. Y., Yamamoto K. Historical review of OCR research and development // Proceedings of the IEEE. 1992. V. 80, No. 7. P. 1029–1058.

Memon J. et al. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR) // IEEE Access. 2020. V. 8. P. 142642–142668.

Kurama V. Table Detection, Information Extraction and Structuring using Deep Learning // Nanonets. URL: https://nanonets.com/blog/table-extraction-deep-learning/ (accessed 23.02.2021).

Hwang W. et al. Spatial Dependency Parsing for Semi-Structured Document Information Extraction // arXiv. 2020.

Xu Y. et al. LayoutLM: Pre-training of text and layout for document image understanding // Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020. P. 1192–1200.

Ye Y. et al. A unified scheme of text localization and structured data extraction for joint OCR and data mining // 2018 IEEE International Conference on Big Data (Big Data). IEEE. 2018. P. 2373–2382.

Luo S. et al. Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images // arXiv. 2021.

Haase F., Kirchhoff S. Taxy.io@FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning // Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation. 2020. P. 163–168.

Rahman M. M., Finin T. Unfolding the Structure of a Document using Deep Learning // arXiv. 2019.

Dos Santos J. E. B. Automatic content extraction on semi-structured documents // 2011 International Conference on Document Analysis and Recognition. IEEE. 2011. P. 1235–1239.

Jung A. Imgaug Documentation Release 0.4.0 // Readthedocs. URL: https://imgaug.readthedocs.io/en/latest/ (accessed 27.02.2021).

Visvalingam M., Whyatt J. D. The Douglas‐Peucker algorithm for line simplification: re‐evaluation through visualization // Computer Graphics Forum. Oxford, UK: Blackwell Publishing Ltd, 1990. V. 9, No. 3. P. 213–225.

Intersection over Union (IoU) for object detection // PyImageSearch. URL: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ (accessed 27.02.2021).

The authors declare that there are no conflicts of interest present.