Title extraction from English scientific books in PDF format

Д. Филиппов

Full Text:

PDF (Rus)

Abstract

The problem is relevant because the methods proposed by other researchers, which rely on simple heuristics or machine learning algorithms, are weak. The purpose of the article is to provide a better way to extract titles from scientific PDF documents and to offer a more reasonable approach to title selection in general. The leading approach of the study is to consider as many of the cases and problems that arise during extraction as possible and to find an approach that solves all of them. The results show that the chosen approach is effective on a document set that exhibits all of the considered problems. The research highlights that deep analysis of the problems of the task at hand is a promising way to build the best solutions and tools. The article may be useful to researchers and developers who often encounter document structural analysis or title detection as a secondary task in the workflow of a main program.

Keywords

PDF processing, title extraction, header extraction, strategy-based approach, title heuristic, structural analysis, style information, text analysis, document analysis, information extraction

About the Author

Д. Филиппов

Kazan (Volga Region) Federal University
Russian Federation

For citations:

Title extraction from English scientific books in PDF format. Russian Digital Libraries Journal. 2018;21(3-4):392-411.

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 1562-5419 (Online)
