
ellibs

Электронные библиотеки

Russian Digital Libraries Journal

1562-5419

Kazan (Volga Region) Federal University

10.26907/1562-5419-2025-28-3-654-681

ellibs-580

Research Article

Articles

Procedure for Comparing Text Recognition Software Solutions for Scientific Publications by the Quality of Metadata Extraction

Кузнецов

Илия Игоревич

Kuznetsov

Ilia Igorevich

iliya-kuznetsov@mail.ru

Новиков

Олег Пантелеевич

Novikov

Oleg Panteleevich

novikovop55@rambler.ru

Ильин

Дмитрий Юрьевич

Ilin

Dmitry Yurievich

i@dmitryilin.com

Российский государственный университет им. А.Н. Косыгина (Технологии. Дизайн. Искусство)

A. N. Kosygin Moscow State Textile University

МИРЭА – Российский технологический университет

MIREA – Russian Technological University

2025

23.06.2025

Vol. 28. No. 3. P. 654-680.

2025

Кузнецов И.И., Новиков О.П., Ильин Д.Ю.

Kuznetsov I.I., Novikov O.P., Ilin D.Y.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://ellibs.elpub.ru/jour/article/view/580

Metadata of scientific publications are used to build catalogs, measure how often publications are cited, and support other tasks. Automating metadata extraction from PDF files speeds up these tasks, and the quality of the extracted data determines whether it can be used further. Existing software solutions were analyzed, and three were selected: GROBID, CERMINE, and ScientificPdfParser. A procedure is proposed for comparing text recognition software for scientific publications by the quality of metadata extraction. Following the procedure, an experiment was conducted on extracting four types of metadata (title, abstract, publication date, and author names). The comparison used a dataset of 112,457 publications divided into 23 subject areas, built from Semantic Scholar data. An example is given of choosing an effective metadata extraction tool under specified priorities for subject areas and metadata types, using a weighted sum. In this example, CERMINE shows efficiency 10.5% higher than GROBID and 9.6% higher than ScientificPdfParser.
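The weighted-sum selection mentioned in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the tool names `tool_a` and `tool_b`, the subject areas, the priorities, and the quality scores are hypothetical, the weight of each (subject area, metadata type) pair is assumed to be the product of the two priorities, and "efficiency X% higher" is assumed to mean a relative difference between weighted scores.

```python
from typing import Dict, List, Tuple

# Quality scores in [0, 1], keyed by (subject area, metadata type).
Scores = Dict[Tuple[str, str], float]


def weighted_score(scores: Scores,
                   area_weights: Dict[str, float],
                   type_weights: Dict[str, float]) -> float:
    """Weighted sum of quality scores, normalized by the total weight used."""
    total = 0.0
    weight_sum = 0.0
    for (area, mtype), quality in scores.items():
        # Assumption: a cell's weight is the product of its two priorities.
        w = area_weights.get(area, 0.0) * type_weights.get(mtype, 0.0)
        total += w * quality
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("no score has a non-zero weight")
    return total / weight_sum


def rank_tools(all_scores: Dict[str, Scores],
               area_weights: Dict[str, float],
               type_weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (tool, weighted score) pairs, best tool first."""
    ranked = [(name, weighted_score(s, area_weights, type_weights))
              for name, s in all_scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def relative_gain_percent(best: float, other: float) -> float:
    """How much higher `best` is than `other`, in percent of `other`."""
    return (best - other) / other * 100.0


# Hypothetical priorities and scores, not results from the article.
area_weights = {"physics": 0.75, "medicine": 0.25}
type_weights = {"title": 0.5, "abstract": 0.5}
all_scores = {
    "tool_a": {("physics", "title"): 0.8, ("physics", "abstract"): 0.6,
               ("medicine", "title"): 0.4, ("medicine", "abstract"): 0.2},
    "tool_b": {("physics", "title"): 0.6, ("physics", "abstract"): 0.6,
               ("medicine", "title"): 0.8, ("medicine", "abstract"): 0.8},
}

ranking = rank_tools(all_scores, area_weights, type_weights)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
gain = relative_gain_percent(ranking[0][1], ranking[1][1])
print(f"{ranking[0][0]} is {gain:.1f}% above {ranking[1][0]}")
```

Changing the priorities can change the ranking, which is why such a comparison reports a choice for a given set of weights rather than a single overall winner.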

text recognition, scientific publications, metadata, data extraction quality, procedure

References

Qayyum F., Afzal M. T. Identification of important citations by exploiting research articles’ metadata and cue-terms from content // Scientometrics. 2019. Vol. 118. P. 21-43.

Liu X., Zhang J., Guo C. Full-text citation analysis: A new method to enhance scholarly networks // Journal of the American Society for Information Science and Technology. 2013. Vol. 64. No. 9. P. 1852-1863.

Saier T., Färber M. unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata // Scientometrics. 2020. Vol. 125. No. 3. P. 3085-3108.

Safder I. et al. Deep learning-based extraction of algorithmic metadata in full-text scholarly documents // Information processing & management. 2020. Vol. 57. No. 6. P. 102269.

O’Leary N. A. et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets // Scientific data. 2024. Vol. 11. No. 1. P. 732.

Safder I., Hassan S. U. Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications // Scientometrics. 2019. Vol. 119. P. 257-277.

Joshi B., Symeonidou A., Danish S.M., Hermsen F. An End-to-End Pipeline for Bibliography Extraction from Scientific Articles // Proceedings of the Second Workshop on Information Extraction from Scientific Publications. 2023. P. 101-106.

Ma A. et al. A deep-learning based citation count prediction model with paper metadata semantic features // Scientometrics. 2021. Vol. 126. No. 8. P. 6803-6823.

Lo K. et al. PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents // Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. P. 495-507.

Po D. K. Similarity based information retrieval using Levenshtein distance algorithm // International Journal of Advances in Scientific Research and Engineering. 2020. Vol. 6. No. 04. P. 06-10.

Nurcahyawati V., Mustaffa Z. Online Media as a Price Monitor: Text Analysis using Text Extraction Technique and Jaro-Winkler Similarity Algorithm // 2020 Emerging Technology in Computing, Communication and Electronics (ETCCE). IEEE, 2020. P. 1-6.

Foppiano L. et al. Automatic extraction of materials and properties from superconductors scientific literature // Science and Technology of Advanced Materials: Methods. 2023. Vol. 3. No. 1. P. 2153633.

Petersen T. et al. Geo-quantities: A framework for automatic extraction of measurements and spatial context from scientific documents // Proceedings of the 17th International Symposium on Spatial and Temporal Databases. 2021. P. 166-169.

Chraibi A. et al. Extraction of measurements from medical reports // 10ème conférence Francophone en Gestion et Ingénierie des Systèmes Hospitaliers, GISEH2020. 2020.

Haviana S. F. C., Subroto I. M. I. Obtaining Reference’s Topic Congruity in Indonesian Publications using Machine Learning Approach // 2019 6th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI). IEEE. 2019. P. 428-431.

Ermakova L., Bordignon F., Turenne N., Noel M. Is the Abstract a Mere Teaser? Evaluating generosity of article abstracts in the environmental sciences // Frontiers in Research Metrics and Analytics. 2018. Vol. 3. P. 16.

El-Ebshihy A. et al. A platform for argumentative zoning annotation and scientific summarization // Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022. P. 4843-4847.

Choi W. et al. Building an annotated corpus for automatic metadata extraction from multilingual journal article references // PloS one. 2023. Vol. 18. No. 1. P. E0280637.

Krause J. et al. Bootstrapping multilingual metadata extraction: a showcase in Cyrillic // Proceedings of the Second Workshop on Scholarly Document Processing. 2021. P. 66-72.

Shapiro I., Saier T., Färber M. Sequence Labeling for Citation Field Extraction from Cyrillic Script References // Proceedings of the Workshop on Scientific Document Understanding; co-located with 36th AAAI Conference on Artificial Intelligence (AAAI 2022). 2022.

Indrawati A., Yoganingrum A., Yuwono P. Evaluating the quality of the Indonesian scientific journal references using ParsCit, CERMINE and GROBID // Library Philosophy and Practice. 2019. P. 1-14.

Meuschke N. et al. A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents // International Conference on Information. Cham: Springer Nature Switzerland, 2023. P. 383-405.

Guo Z., Jin H. Reference metadata extraction from scientific papers // 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE. 2011. P. 45-49.

Beel J., Langer S., Genzmehr M., Muller C. Docear’s PDF inspector: title extraction from PDF files // Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries. New York, NY, USA: ACM, 2013. P. 443-444.

Jensen Z. et al. A machine learning approach to zeolite synthesis enabled by automatic literature data extraction // ACS central science. 2019. Vol. 5. No. 5. P. 892-899.

Färber M., Albers A., Schüber F. Identifying used methods and datasets in scientific publications // Proceedings of the Workshop on Scientific Document Understanding co-located with 35th AAAI Conference on Artificial Intelligence (AAAI 2021). 2021.

Suryawati E., Widyantoro D. H. Combination of heuristic, rule-based and machine learning for bibliography extraction // 2017 5th International Conference on Instrumentation, Communications, Information Technology, and Biomedical Engineering (ICICI-BME). IEEE. 2017. P. 276-281.

Tkaczyk D. et al. CERMINE: automatic extraction of structured metadata from scientific literature // International Journal on Document Analysis and Recognition (IJDAR). 2015. Vol. 18. P. 317-335.

Romary L., Lopez P. GROBID - information extraction from scientific publications // ERCIM News. 2015. Vol. 100.

Councill I. G., Giles C. L., Kan M. Y. ParsCit: an Open-source CRF Reference String Parsing Package // Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008. 2008. Vol. 8. P. 661-667.

Prasad A., Kaur M., Kan M. Y. Neural ParsCit: a deep learning-based reference string parser // International journal on digital libraries. 2018. Vol. 19. P. 323-337.

Constantin A., Pettifer S., Voronkov A. PDFX: fully-automated PDF-to-XML conversion of scientific literature // Proceedings of the 2013 ACM symposium on Document engineering. 2013. P. 177-180.

The authors declare that there are no conflicts of interest.