<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id custom-type="elpub" pub-id-type="custom">ellibs-93</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Извлечение заголовков из PDF-документов научной тематики</article-title><trans-title-group xml:lang="en"><trans-title>Title extraction from English scientific books in PDF format</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Филиппов</surname><given-names>Д. 
С.</given-names></name></name-alternatives><email xlink:type="simple">dmitriyfil1995@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff xml:lang="ru" id="aff-1"><institution>Казанский (Приволжский) федеральный университет</institution><country>Russian Federation</country></aff><pub-date pub-type="collection"><year>2018</year></pub-date><pub-date pub-type="epub"><day>28</day><month>06</month><year>2018</year></pub-date><volume>21</volume><issue>3-4</issue><fpage>392</fpage><lpage>411</lpage><permissions><copyright-statement>Copyright &#x00A9; Филиппов Д.С., 2018</copyright-statement><copyright-year>2018</copyright-year><copyright-holder xml:lang="ru">Филиппов Д.С.</copyright-holder><copyright-holder xml:lang="en">Филиппов Д.С.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://ellibs.elpub.ru/jour/article/view/93">https://ellibs.elpub.ru/jour/article/view/93</self-uri><abstract><p>Актуальность представленного исследования обусловлена бедностью существующих подходов к извлечению заголовков из PDF-документов, предложенных в более ранних исследованиях, которые используют либо машинное обучение, либо простые эвристики. Цель настоящего исследования – предоставить более проработанные подходы к общей задаче извлечения заголовка документа и предложить лучший алгоритм выделения его из документов научной тематики. 
Основная методика, использованная нами при выборе решения, – рассмотреть как можно большее количество различных ситуаций относительно форматирования заголовка, возникающих в разных документах, и предложить решение для каждой из них, а затем обобщить их в полноценный подход. Результаты выбранного подхода показали его эффективность по сравнению с методами других исследователей, если в нашем распоряжении находятся документы с различными вариациями оформления, структурной организации и форматирования. Данное исследование показало, что глубокое исследование задачи – перспективный путь для разработки лучших решений и инструментов. Статья будет полезна исследователям и разработчикам, которые часто встречаются с проблемой извлечения заголовков как одной из подзадач анализа документов.
</p></abstract><trans-abstract xml:lang="en"><p>The relevance of this study stems from the weakness of existing approaches to title extraction from PDF documents proposed in earlier research, which rely on either machine learning or simple heuristics. The purpose of the article is to provide more thoroughly developed approaches to the general task of document title extraction and to propose a better algorithm for extracting titles from scientific documents. The main method of the study is to consider as many title-formatting cases arising in different documents as possible, propose a solution for each of them, and then generalize these solutions into a complete approach. The results show that the chosen approach is effective compared with the methods of other researchers when the document set exhibits varied layout, structural organization, and formatting. The study shows that in-depth analysis of the task is a promising way to develop better solutions and tools. The article may be useful to researchers and developers who often encounter title extraction as a subtask of document analysis.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>анализ текстов</kwd><kwd>автоматическая обработка документов</kwd></kwd-group><kwd-group xml:lang="en"><kwd>PDF processing</kwd><kwd>title extraction</kwd><kwd>header extraction</kwd><kwd>strategy based approach</kwd><kwd>title heuristic</kwd><kwd>structural analysis</kwd><kwd>style information</kwd><kwd>text analysis</kwd><kwd>document analysis</kwd><kwd>information extraction</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Lipinski M., Yao K., Breitinger C., Beel J., Gipp B. Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents // 13th ACM/IEEE-CS Joint Conf. on Digital Libraries, Indianapolis, USA, 2013. ACM: 2013, P. 385–386.</mixed-citation><mixed-citation xml:lang="en">Lipinski M., Yao K., Breitinger C., Beel J., Gipp B. Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents // 13th ACM/IEEE-CS Joint Conf. on Digital Libraries, Indianapolis, USA, 2013. ACM: 2013, P. 385–386.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Beel J., Langer S., Genzmehr M., Müller M. Docear's PDF Inspector: Title Extraction from PDF Files // 13th ACM/IEEE-CS Joint Conf. on Digital Libraries, Indianapolis, USA, 2013. ACM: 2013, P. 443–444.</mixed-citation><mixed-citation xml:lang="en">Beel J., Langer S., Genzmehr M., Müller M. Docear's PDF Inspector: Title Extraction from PDF Files // 13th ACM/IEEE-CS Joint Conf. on Digital Libraries, Indianapolis, USA, 2013. ACM: 2013, P. 443–444.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Marinai S. Metadata Extraction from PDF Papers for Digital Library Ingest // 10th Int. Conf. on Document Analysis and Recognition (ICDAR). 2009, P. 
251–255.</mixed-citation><mixed-citation xml:lang="en">Marinai S. Metadata Extraction from PDF Papers for Digital Library Ingest // 10th Int. Conf. on Document Analysis and Recognition (ICDAR). 2009, P. 251–255.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Васильев А., Самусев С., Шамина О., Козлов Д. Создание электронной библиотеки русскоязычных научных статей // сб. работ участников конкурса науч. проектов по информ. поиску под ред. П. И. Браславский, Екатеринбург, Россия, 2007. Изд. Урал. ун-та, 2007. P. 37–45.</mixed-citation><mixed-citation xml:lang="en">Васильев А., Самусев С., Шамина О., Козлов Д. Создание электронной библиотеки русскоязычных научных статей // сб. работ участников конкурса науч. проектов по информ. поиску под ред. П. И. Браславский, Екатеринбург, Россия, 2007. Изд. Урал. ун-та, 2007. P. 37–45.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Beel J., Gipp B., Shaker A., Friedrich N. SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size) // Research and Advanced Technology for Digital Libraries. 2010. P. 413–416.</mixed-citation><mixed-citation xml:lang="en">Beel J., Gipp B., Shaker A., Friedrich N. SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size) // Research and Advanced Technology for Digital Libraries. 2010. P. 413–416.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Hu Y., Li H., Cao Y., Teng L., Meyerzon D., Zheng Q. Automatic extraction of titles from general documents using machine learning // 5th ACM/IEEE-CS Joint Conf. on Digital Libraries, New York, USA, 2005. ACM: 2005, P. 145–154.</mixed-citation><mixed-citation xml:lang="en">Hu Y., Li H., Cao Y., Teng L., Meyerzon D., Zheng Q. 
Automatic extraction of titles from general documents using machine learning // 5th ACM/IEEE-CS Joint Conf. on Digital Libraries, New York, USA, 2005. ACM: 2005, P. 145–154.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Elizarov A. M., Kirillovich A. V., Lipachev E. K., Nevzorova O. A., Solovyev V. D., Zhiltsov N. G. Mathematical knowledge representation: semantic models and formalisms // Lobachevskii Journal of Mathematics. 2014. No 4. P. 348–354.</mixed-citation><mixed-citation xml:lang="en">Elizarov A. M., Kirillovich A. V., Lipachev E. K., Nevzorova O. A., Solovyev V. D., Zhiltsov N. G. Mathematical knowledge representation: semantic models and formalisms // Lobachevskii Journal of Mathematics. 2014. No 4. P. 348–354.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Elizarov A. M., Lipachev E. K., Nevzorova O. A., Solovyev V. D. Methods and means for semantic structuring of electronic mathematical documents // Doklady Mathematics. 2014. № 1. P. 521-524.</mixed-citation><mixed-citation xml:lang="en">Elizarov A. M., Lipachev E. K., Nevzorova O. A., Solovyev V. D. Methods and means for semantic structuring of electronic mathematical documents // Doklady Mathematics. 2014. № 1. P. 521-524.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Solovyev V. D., Zhiltsov N. G. Logical Structure Analysis of Scientific Publications in Mathematics // Int. Conf. on Web Intelligence, Mining and Semantics, Sogndal, Norway, 2011. ACM: 2011, P. 21:1–21:9.</mixed-citation><mixed-citation xml:lang="en">Solovyev V. D., Zhiltsov N. G. Logical Structure Analysis of Scientific Publications in Mathematics // Int. Conf. on Web Intelligence, Mining and Semantics, Sogndal, Norway, 2011. ACM: 2011, P. 
21:1–21:9.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Han H., Giles C.L., Manavoglu E., Zha H., Zhang Z., Fox E.A. Automatic document metadata extraction using support vector machines // 3rd ACM/IEEE-CS Joint Conf. on Digital Libraries, Houston, USA, 2003. ACM: 2003, P. 37–48.</mixed-citation><mixed-citation xml:lang="en">Han H., Giles C.L., Manavoglu E., Zha H., Zhang Z., Fox E.A. Automatic document metadata extraction using support vector machines // 3rd ACM/IEEE-CS Joint Conf. on Digital Libraries, Houston, USA, 2003. ACM: 2003, P. 37–48.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Peng F., McCallum A. Information Extraction from Research Papers Using Conditional Random Fields // Inf. Process. Manage. 2006. No 4. P. 963–979.</mixed-citation><mixed-citation xml:lang="en">Peng F., McCallum A. Information Extraction from Research Papers Using Conditional Random Fields // Inf. Process. Manage. 2006. No 4. P. 963–979.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Nakagawa K., Nomura A., Suzuki M. Extraction of logical structure from articles in mathematics // Int. Conf. on Mathematical Knowledge Management, 2004. Springer: 2004, P. 276–289.</mixed-citation><mixed-citation xml:lang="en">Nakagawa K., Nomura A., Suzuki M. Extraction of logical structure from articles in mathematics // Int. Conf. on Mathematical Knowledge Management, 2004. Springer: 2004, P. 276–289.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Beel J., Gipp B., Langer S., Genzmehr M., Wilde E., Nürnberger A., Pitman J. Introducing Mr. DLib, a Machine-readable Digital Library // 11th Annual Int. ACM/IEEE Joint Conf. 
on Digital Libraries, Ottawa, Ontario, Canada, 2011. ACM: 2011, P. 463–464.</mixed-citation><mixed-citation xml:lang="en">Beel J., Gipp B., Langer S., Genzmehr M., Wilde E., Nürnberger A., Pitman J. Introducing Mr. DLib, a Machine-readable Digital Library // 11th Annual Int. ACM/IEEE Joint Conf. on Digital Libraries, Ottawa, Ontario, Canada, 2011. ACM: 2011, P. 463–464.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Granitzer M., Hristakeva M., Knight R. and Jack K. A Comparison of Metadata Extraction Techniques for Crowdsourced Bibliographic Metadata Management // 27th Annual ACM Symposium on Applied Computing, Trento, Italy, 2012. ACM: 2012, P. 962–964.</mixed-citation><mixed-citation xml:lang="en">Granitzer M., Hristakeva M., Knight R. and Jack K. A Comparison of Metadata Extraction Techniques for Crowdsourced Bibliographic Metadata Management // 27th Annual ACM Symposium on Applied Computing, Trento, Italy, 2012. ACM: 2012, P. 962–964.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Yilmazel O., Finneran C. M., Liddy E. D. MetaExtract: an NLP system to automatically assign metadata // 4th ACM/IEEE-CS Joint Conf. on Digital Libraries, Tucson, USA, 2004. ACM: 2004, P. 241–242.</mixed-citation><mixed-citation xml:lang="en">Yilmazel O., Finneran C. M., Liddy E. D. MetaExtract: an NLP system to automatically assign metadata // 4th ACM/IEEE-CS Joint Conf. on Digital Libraries, Tucson, USA, 2004. ACM: 2004, P. 241–242.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Mayank S., Barnopriyo B., Priyank P., Manvi G., Sidhartha S. OCR++: A Robust Framework For Information Extraction from Scholarly Articles // arXiv preprint arXiv:1609.06423. 2016. P. 
1–9.</mixed-citation><mixed-citation xml:lang="en">Mayank S., Barnopriyo B., Priyank P., Manvi G., Sidhartha S. OCR++: A Robust Framework For Information Extraction from Scholarly Articles // arXiv preprint arXiv:1609.06423. 2016. P. 1–9.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
