References

ellibs

Elektronnye biblioteki (Electronic Libraries)

Russian Digital Libraries Journal

1562-5419

Kazan (Volga Region) Federal University

ellibs-93

Research Article

Articles

Extracting titles from scientific PDF documents

Title extraction from English scientific books in PDF format

Filippov

D. S.

dmitriyfil1995@gmail.com

Kazan (Volga Region) Federal University, Russian Federation

2018

28.06.2018

Vol. 21, No 3-4, P. 392–411

2018

Filippov D. S.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://ellibs.elpub.ru/jour/article/view/93

This study is motivated by the limitations of existing approaches to extracting titles from PDF documents: earlier work relies either on machine learning or on simple heuristics. The aim of the study is to offer more thoroughly developed approaches to the general task of document title extraction and to propose a better algorithm for extracting titles from scientific documents. Our main method for choosing a solution was to consider as many title-formatting situations occurring in different documents as possible, propose a solution for each of them, and then generalize these solutions into a complete approach. The results show that the chosen approach is more effective than the methods of other researchers when the documents at hand vary in layout, structural organization, and formatting. The study shows that in-depth analysis of the task is a promising path toward better solutions and tools. The article will be useful to researchers and developers who regularly face title extraction as a subtask of document analysis.

The study is motivated by the weakness of the methods proposed by other researchers, which rely on simple heuristics or machine learning algorithms. The purpose of the article is to provide a better way to extract titles from scientific PDF documents and to offer a better, more reasoned approach to title selection in general. The main approach of the study is to consider as many of the cases and problems that arise during extraction as possible and to find an approach that solves all of them. The results show the efficiency of the chosen approach on a document set that exhibits all of the considered problems. The research highlights that deep analysis of the task at hand is a promising way to build better solutions and tools. The article may be useful to researchers and developers who often encounter document structural analysis or title detection as a secondary task within a larger program workflow.

text analysis, automatic document processing

PDF processing, title extraction, header extraction, strategy-based approach, title heuristic, structural analysis, style information, text analysis, document analysis, information extraction

References

Lipinski M., Yao K., Breitinger C., Beel J., Gipp B. Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents // 13th ACM/IEEE-CS Joint Conf. on Digital Libraries, Indianapolis, USA, 2013. ACM: 2013, P. 385–386.

Beel J., Langer S., Genzmehr M., Müller M. Docear's PDF Inspector: Title Extraction from PDF Files // 13th ACM/IEEE-CS Joint Conf. on Digital Libraries, Indianapolis, USA, 2013. ACM: 2013, P. 443–444.

Marinai S. Metadata Extraction from PDF Papers for Digital Library Ingest // 10th Int. Conf. on Document Analysis and Recognition (ICDAR). 2009, P. 251–255.

Vasiliev A., Samusev S., Shamina O., Kozlov D. Creating a digital library of Russian-language scientific articles // Collected works of participants in the competition of research projects on information retrieval, ed. by P. I. Braslavsky, Yekaterinburg, Russia, 2007. Ural University Press, 2007. P. 37–45. (In Russian.)

Beel J., Gipp B., Shaker A., Friedrich N. SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size) // Research and Advanced Technology for Digital Libraries. 2010. P. 413–416.

Hu Y., Li H., Cao Y., Teng L., Meyerzon D., Zheng Q. Automatic extraction of titles from general documents using machine learning // 5th ACM/IEEE-CS Joint Conf. on Digital Libraries, New York, USA, 2005. ACM: 2005, P. 145–154.

Elizarov A. M., Kirillovich A. V., Lipachev E. K., Nevzorova O. A., Solovyev V. D., Zhiltsov N. G. Mathematical knowledge representation: semantic models and formalisms // Lobachevskii Journal of Mathematics. 2014. No 4. P. 348–354.

Elizarov A. M., Lipachev E. K., Nevzorova O. A., Solovyev V. D. Methods and means for semantic structuring of electronic mathematical documents // Doklady Mathematics. 2014. No 1. P. 521–524.

Solovyev V. D., Zhiltsov N. G. Logical Structure Analysis of Scientific Publications in Mathematics // Int. Conf. on Web Intelligence, Mining and Semantics, Sogndal, Norway, 2011. ACM: 2011, P. 21:1–21:9.

Han H., Giles C.L., Manavoglu E., Zha H., Zhang Z., Fox E.A. Automatic document metadata extraction using support vector machines // 3rd ACM/IEEE-CS Joint Conf. on Digital Libraries, Houston, USA, 2003. ACM: 2003, P. 37–48.

Peng F., McCallum A. Information Extraction from Research Papers Using Conditional Random Fields // Inf. Process. Manage. 2006. No 4. P. 963–979.

Nakagawa K., Nomura A., Suzuki M. Extraction of logical structure from articles in mathematics // Int. Conf. on Mathematical Knowledge Management, 2004. Springer: 2004, P. 276–289.

Beel J., Gipp B., Langer S., Genzmehr M., Wilde E., Nürnberger A., Pitman J. Introducing Mr. DLib, a Machine-readable Digital Library // 11th Annual Int. ACM/IEEE Joint Conf. on Digital Libraries, Ottawa, Ontario, Canada, 2011. ACM: 2011, P. 463–464.

Granitzer M., Hristakeva M., Knight R., Jack K. A Comparison of Metadata Extraction Techniques for Crowdsourced Bibliographic Metadata Management // 27th Annual ACM Symposium on Applied Computing, Trento, Italy, 2012. ACM: 2012, P. 962–964.

Yilmazel O., Finneran C. M., Liddy E. D. MetaExtract: an NLP system to automatically assign metadata // 4th ACM/IEEE-CS Joint Conf. on Digital Libraries, Tucson, USA, 2004. ACM: 2004, P. 241–242.

Mayank S., Barnopriyo B., Priyank P., Manvi G., Sidhartha S. OCR++: A Robust Framework For Information Extraction from Scholarly Articles // arXiv preprint arXiv:1609.06423. 2016. P. 1–9.

The authors declare that there are no conflicts of interest present.