<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.26907/1562-5419-2021-24-4-667-688</article-id><article-id custom-type="elpub" pub-id-type="custom">ellibs-292</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Извлечение данных из сканированных документов со сходной структурой</article-title><trans-title-group xml:lang="en"><trans-title>Data Extraction from Similarly Structured Scanned  Documents</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Саитгареев</surname><given-names>Р. Д.</given-names></name><name name-style="western" xml:lang="en"><surname>Saitgareev</surname><given-names>R. D.</given-names></name></name-alternatives><email xlink:type="simple">srustem3@yandex.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Гиниятуллин</surname><given-names>Б. Р.</given-names></name><name name-style="western" xml:lang="en"><surname>Giniyatullin</surname><given-names>B. R.</given-names></name></name-alternatives><email xlink:type="simple">bulat.giniiatullin@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Топоров</surname><given-names>В. Ю.</given-names></name><name name-style="western" xml:lang="en"><surname>Toporov</surname><given-names>V. Y.</given-names></name></name-alternatives><email xlink:type="simple">vladislavtoporov@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Атнагулов</surname><given-names>А. А.</given-names></name><name name-style="western" xml:lang="en"><surname>Atnagulov</surname><given-names>A. A.</given-names></name></name-alternatives><email xlink:type="simple">i@atnartur.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Аглямов</surname><given-names>Ф. Р.</given-names></name><name name-style="western" xml:lang="en"><surname>Aglyamov</surname><given-names>F. R.</given-names></name></name-alternatives><email xlink:type="simple">aglyamov.fox@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Казанский (Приволжский) Федеральный университет</institution></aff><aff xml:lang="en"><institution>Kazan (Volga region) Federal University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2021</year></pub-date><pub-date pub-type="epub"><day>28</day><month>08</month><year>2021</year></pub-date><volume>24</volume><issue>4</issue><fpage>667</fpage><lpage>688</lpage><permissions><copyright-statement>Copyright &amp;#x00A9; Саитгареев Р.Д., Гиниятуллин Б.Р., Топоров В.Ю., Атнагулов А.А., Аглямов Ф.Р., 2021</copyright-statement><copyright-year>2021</copyright-year><copyright-holder xml:lang="ru">Саитгареев Р.Д., Гиниятуллин Б.Р., Топоров В.Ю., Атнагулов А.А., Аглямов Ф.Р.</copyright-holder><copyright-holder xml:lang="en">Saitgareev R.D., Giniyatullin B.R., Toporov V.Y., Atnagulov A.A., Aglyamov F.R.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://ellibs.elpub.ru/jour/article/view/292">https://ellibs.elpub.ru/jour/article/view/292</self-uri><abstract><p>На текущий момент времени значительная часть передаваемых и хранимых данных не структурирована. Количество неструктурированных данных растет большими темпами каждый год, несмотря на то, что по таким данным трудно производить поиск, к ним нельзя совершать запросы и в целом их обработка не автоматизирована. В то же время наблюдается развитие систем электронного документооборота.


Настоящая работа предлагает инструмент для извлечения данных из фотографий бумажных документов, принимая во внимание их структуру и разметку. Представлены результаты разных испытанных подходов, включая нейронные сети и алгоритмический метод, а также проведен анализ полученных результатов.
</p></abstract><trans-abstract xml:lang="en"><p>Currently, the major part of transmitted and stored data is unstructured, and the amount of unstructured data is growing rapidly each year, although it is hardly searchable, unqueryable, and its processing is not automated. At the same time, there is a growth of electronic document management systems. This paper proposes a solution for extracting data from paper documents considering their structure and layout based on document photos. By examining different approaches, including neural networks and plain algorithmic methods, we present their results and discuss them.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>нейронные сети</kwd><kwd>машинное обучение</kwd><kwd>извлечение структуры</kwd><kwd>извлечение структуры документов</kwd><kwd>неструктурированные данные</kwd><kwd>распознавание текста</kwd></kwd-group><kwd-group xml:lang="en"><kwd>OCR</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Развитие электронного документооборота в России. Статистика, факты, перспективы // Taxcom. URL: https://taxcom.ru/baza-znaniy/ elektronnyy-dokumentooborot/stati/razvitie-elektronnogo-dokumentooborota-v-rossii-statistika-fakty-perspektivy/ (дата обращения 24.02.2021).</mixed-citation><mixed-citation xml:lang="en">Развитие электронного документооборота в России. Статистика, факты, перспективы // Taxcom. URL: https://taxcom.ru/baza-znaniy/ elektronnyy-dokumentooborot/stati/razvitie-elektronnogo-dokumentooborota-v-rossii-statistika-fakty-perspektivy/ (дата обращения 24.02.2021).</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">СЭД (рынок России) // TAdviser. URL: https://www.tadviser.ru/index.php/Статья:СЭД_(рынок_России) (дата обращения 08.03.2021).</mixed-citation><mixed-citation xml:lang="en">СЭД (рынок России) // TAdviser. URL: https://www.tadviser.ru/index.php/Статья:СЭД_(рынок_России) (дата обращения 08.03.2021).</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">AI Unleashes the Power of Unstructured Data // CIO.</mixed-citation><mixed-citation xml:lang="en">AI Unleashes the Power of Unstructured Data // CIO.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">URL: https://www.cio.com/article/3406806/ai-unleashes-the-power-of-unstructured-data.html (дата обращения 23.03.2021).</mixed-citation><mixed-citation xml:lang="en">URL: https://www.cio.com/article/3406806/ai-unleashes-the-power-of-unstructured-data.html (дата обращения 23.03.2021).</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Structured vs. Unstructured Data // Datamation. URL: https://www.datamation.com/big-data/structured-vs-unstructured-data/ (дата обращения 23.03.2021).</mixed-citation><mixed-citation xml:lang="en">Structured vs. Unstructured Data // Datamation. URL: https://www.datamation.com/big-data/structured-vs-unstructured-data/ (дата обращения 23.03.2021).</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Structured and Unstructured Documents: What are the Differences? // Optiform</mixed-citation><mixed-citation xml:lang="en">Structured and Unstructured Documents: What are the Differences? // Optiform</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">URL: https://www.optiform.com/news/structured-unstructured-documents/ (дата обращения 23.03.2021).</mixed-citation><mixed-citation xml:lang="en">URL: https://www.optiform.com/news/structured-unstructured-documents/ (дата обращения 23.03.2021).</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">McKendrick J. The Post-Relational Reality Sets in: 2011 Survey on Unstructured Data // Unisphere Research. 2011.</mixed-citation><mixed-citation xml:lang="en">McKendrick J. The Post-Relational Reality Sets in: 2011 Survey on Unstructured Data // Unisphere Research. 2011.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Rusu O. and al. Converting unstructured and semi-structured data into knowledge // 2013 11th RoEduNet International Conference. IEEE, 2013. P. 1–4.</mixed-citation><mixed-citation xml:lang="en">Rusu O. and al. Converting unstructured and semi-structured data into knowledge // 2013 11th RoEduNet International Conference. IEEE, 2013. P. 1–4.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Mori S., Suen C. Y., Yamamoto K. Historical review of OCR research and development // Proceedings of the IEEE. 1992. V. 80, No. 7. P. 1029–1058.</mixed-citation><mixed-citation xml:lang="en">Mori S., Suen C. Y., Yamamoto K. Historical review of OCR research and development // Proceedings of the IEEE. 1992. V. 80, No. 7. P. 1029–1058.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Memon J. and al. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR) // IEEE Access. 2020. V. 8. P. 142642–142668.</mixed-citation><mixed-citation xml:lang="en">Memon J. and al. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR) // IEEE Access. 2020. V. 8. P. 142642–142668.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Vihar Kurama. Table Detection, Information Extraction and Structuring using Deep Learning // Nanonets. URL: https://nanonets.com/blog/table-extraction-deep-learning/ (дата обращения 23.02.2021).</mixed-citation><mixed-citation xml:lang="en">Vihar Kurama. Table Detection, Information Extraction and Structuring using Deep Learning // Nanonets. URL: https://nanonets.com/blog/table-extraction-deep-learning/ (дата обращения 23.02.2021).</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Hwang W. and al. Spatial Dependency Parsing for Semi-Structured Document Information Extraction // arXiv. 2020.</mixed-citation><mixed-citation xml:lang="en">Hwang W. and al. Spatial Dependency Parsing for Semi-Structured Document Information Extraction // arXiv. 2020.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Xu Y. and al. Layoutlm: Pre-training of text and layout for document image understanding // Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining. 2020. P. 1192–1200.</mixed-citation><mixed-citation xml:lang="en">Xu Y. and al. Layoutlm: Pre-training of text and layout for document image understanding // Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining. 2020. P. 1192–1200.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Ye Y. and al. A unified scheme of text localization and structured data extraction for joint OCR and data mining // 2018 IEEE International Conference on Big Data (Big Data). IEEE. 2018. P. 2373–2382.</mixed-citation><mixed-citation xml:lang="en">Ye Y. and al. A unified scheme of text localization and structured data extraction for joint OCR and data mining // 2018 IEEE International Conference on Big Data (Big Data). IEEE. 2018. P. 2373–2382.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Luo S. and al. Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images // arXiv. 2021.</mixed-citation><mixed-citation xml:lang="en">Luo S. and al. Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images // arXiv. 2021.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Haase F., Kirchhoff S. Taxy. io@ FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning // Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation. 2020. P. 163–168.</mixed-citation><mixed-citation xml:lang="en">Haase F., Kirchhoff S. Taxy. io@ FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning // Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation. 2020. P. 163–168.</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Rahman M. M., Finin T. Unfolding the Structure of a Document using Deep Learning // arXiv. 2019.</mixed-citation><mixed-citation xml:lang="en">Rahman M. M., Finin T. Unfolding the Structure of a Document using Deep Learning // arXiv. 2019.</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Dos Santos J. E. B. Automatic content extraction on semi-structured documents //2011 International Conference on Document Analysis and Recognition. IEEE. 2011. P. 1235–1239.</mixed-citation><mixed-citation xml:lang="en">Dos Santos J. E. B. Automatic content extraction on semi-structured documents //2011 International Conference on Document Analysis and Recognition. IEEE. 2011. P. 1235–1239.</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">Alexander Jung. Imgaug Documentation Release 0.4.0 // Readthedocs. URL: https://imgaug.readthedocs.io/en/latest/ (дата обращения 02.27.2021).</mixed-citation><mixed-citation xml:lang="en">Alexander Jung. Imgaug Documentation Release 0.4.0 // Readthedocs. URL: https://imgaug.readthedocs.io/en/latest/ (дата обращения 02.27.2021).</mixed-citation></citation-alternatives></ref><ref id="cit21"><label>21</label><citation-alternatives><mixed-citation xml:lang="ru">Visvalingam M., Whyatt J. D. The Douglas‐Peucker algorithm for line simplification: re‐evaluation through visualization // Computer Graphics Forum. Oxford, UK: Blackwell Publishing Ltd, 1990. V. 9, No. 3. P. 213–225.</mixed-citation><mixed-citation xml:lang="en">Visvalingam M., Whyatt J. D. The Douglas‐Peucker algorithm for line simplification: re‐evaluation through visualization // Computer Graphics Forum. Oxford, UK: Blackwell Publishing Ltd, 1990. V. 9, No. 3. P. 213–225.</mixed-citation></citation-alternatives></ref><ref id="cit22"><label>22</label><citation-alternatives><mixed-citation xml:lang="ru">Intersection over Union (IoU) for object detection // PyImageSearch. URL: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ (дата обращения 27.02.2021).</mixed-citation><mixed-citation xml:lang="en">Intersection over Union (IoU) for object detection // PyImageSearch. URL: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/ (дата обращения 27.02.2021).</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
