<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.26907/1562-5419-2025-28-6-1306-1323</article-id><article-id custom-type="elpub" pub-id-type="custom">ellibs-621</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Формирование структурированных представлений научных журналов для интеграции в граф знаний и семантического поиска</article-title><trans-title-group xml:lang="en"><trans-title>Formation of Structured Representations of Scientific Journals for Integration into a Knowledge Graph and Semantic Search</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Атаева</surname><given-names>Ольга Муратовна</given-names></name><name name-style="western" xml:lang="en"><surname>Ataeva</surname><given-names>Olga Muratovna</given-names></name></name-alternatives><email xlink:type="simple">oataeva@frccsc.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" 
xml:lang="ru"><surname>Кобук</surname><given-names>Михаил Геннадьевич</given-names></name><name name-style="western" xml:lang="en"><surname>Kobuk</surname><given-names>Mikhail Gennadievich</given-names></name></name-alternatives><email xlink:type="simple">mikhail.kobuk@mail.ru</email><xref ref-type="aff" rid="aff-2"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Федеральный исследовательский центр «Информатика и управление» Российской академии наук</institution></aff><aff xml:lang="en"><institution>Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences</institution></aff></aff-alternatives><aff-alternatives id="aff-2"><aff xml:lang="ru"><institution>Московский университет имени С.Ю. Витте</institution></aff><aff xml:lang="en"><institution>S. Witte University of Moscow</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>19</day><month>12</month><year>2025</year></pub-date><volume>28</volume><issue>6</issue><fpage>1306</fpage><lpage>1323</lpage><permissions><copyright-statement>Copyright &#x00A9; Атаева О.М., Кобук М.Г., 2025</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="ru">Атаева О.М., Кобук М.Г.</copyright-holder><copyright-holder xml:lang="en">Ataeva O.M., Kobuk M.G.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri 
xlink:href="https://ellibs.elpub.ru/jour/article/view/621">https://ellibs.elpub.ru/jour/article/view/621</self-uri><abstract><p>Работа посвящена проблеме развития библиотеки научных предметных областей SciLibRu, как продолжения семантического описания научных трудов проекта LibMeta. В основе этой библиотеки лежит концептуальная модель данных, структура и семантика которой сформированы на принципах онтологического моделирования. Такой подход обеспечивает строгое описание предметной области, формализацию взаимосвязей между сущностями и возможность дальнейшего автоматизированного анализа данных. Целью настоящего исследования были разработка и экспериментальное применение методов структуризации содержимого научных журналов в формате LaTeX для их интеграции в онтологию библиотеки и обеспечения семантического поиска.


Предложен алгоритм трансляции в формат XML данных, представленных множеством файлов, для интеграции в онтологию библиотеки. Реализован модуль векторного поиска, основанный на вычислении эмбеддингов с использованием языковых моделей. Выявлены закономерности распределения эмбеддингов и факторы, влияющие на точность ранжирования результатов поиска. Проведено тестирование двух названных компонентов.


Разработанный метод составляет основу для автоматического включения содержимого научных журналов в граф знаний SciLibRu и создания обучающих корпусов для языковых моделей, ограниченных рамками научных предметных областей. Полученные результаты способствуют развитию систем навигации по графу знаний журналов, а также рекомендательных механизмов и инструментов интеллектуального поиска по русскоязычным научным текстам.
</p></abstract><trans-abstract xml:lang="en"><p>This paper addresses the development of SciLibRu, a library of scientific subject areas that continues the semantic description of scientific works begun in the LibMeta project. The library is built on a conceptual data model whose structure and semantics follow the principles of ontological modeling. This approach provides a rigorous description of the subject area, formalizes the relationships between entities, and enables further automated data analysis. The goal of the study was to develop and experimentally apply methods for structuring the content of scientific journals in LaTeX format so that it can be integrated into the library ontology and support semantic search.


An algorithm is proposed for translating data spread across multiple files into XML for integration into the library ontology. A vector search module based on computing embeddings with language models is implemented. Patterns in the distribution of embeddings and factors that affect the accuracy of search result ranking are identified. Both components were tested.


The developed method forms the basis for automatically incorporating the content of scientific journals into the SciLibRu knowledge graph and for creating training corpora for language models restricted to scientific subject areas. The results contribute to the development of navigation systems for the journal knowledge graph, as well as recommendation mechanisms and intelligent search tools for Russian-language scientific texts.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>полуструктурированные данные</kwd><kwd>онтология текста</kwd><kwd>LaTeX</kwd><kwd>векторное представление текста</kwd><kwd>полнотекстовый поиск</kwd><kwd>семантический поиск</kwd></kwd-group><kwd-group xml:lang="en"><kwd>semi-structured data</kwd><kwd>text structuring</kwd><kwd>LaTeX</kwd><kwd>vector representations of text</kwd><kwd>full-text search</kwd><kwd>semantic search</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Hoftich M. TEX4ht: LATEX to Web Publishing // TUGboat. 2019. Vol. 40, No. 1. P. 76–81.</mixed-citation><mixed-citation xml:lang="en">Hoftich M. TEX4ht: LATEX to Web Publishing // TUGboat. 2019. Vol. 40, No. 1. P. 76–81.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Frankston C. et al. Using HTML Papers on arXiv: Why It’s Important, and How We Made It Happen // arXiv preprint 2024. https://doi.org/10.48550/arXiv.2402.08954</mixed-citation><mixed-citation xml:lang="en">Frankston C. et al. Using HTML Papers on arXiv: Why It’s Important, and How We Made It Happen // arXiv preprint 2024. https://doi.org/10.48550/arXiv.2402.08954</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Serebryakov V.A., Galochkin M.P., Gonchar D.R., Furugyan M.G. Theory and Implementation of Programming Languages. 2nd ed. Moscow: MZ-Press, 2006. 352 p. (In Russ.)</mixed-citation><mixed-citation xml:lang="en">Serebryakov V.A., Galochkin M.P., Gonchar D.R., Furugyan M.G. Theory and Implementation of Programming Languages. 2nd ed. Moscow: MZ-Press, 2006. 352 p. 
(In Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Hopcroft J., Motwani R., Ullman J. Introduction to Automata Theory, Languages, and Computation. Moscow: Williams, 2002. 528 p. (In Russ.)</mixed-citation><mixed-citation xml:lang="en">Hopcroft J., Motwani R., Ullman J. Introduction to Automata Theory, Languages, and Computation. Moscow: Williams, 2002. 528 p. (In Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Aho A.V., Lam M.S., Sethi R., Ullman J.D. Compilers: Principles, Techniques, and Tools. 2nd ed. Moscow: Williams, 2008. 1184 p. (In Russ.)</mixed-citation><mixed-citation xml:lang="en">Aho A.V., Lam M.S., Sethi R., Ullman J.D. Compilers: Principles, Techniques, and Tools. 2nd ed. Moscow: Williams, 2008. 1184 p. (In Russ.)</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed Representations of Words and Phrases and their Compositionality // Advances in Neural Information Processing Systems (NIPS 26). 2013. P. 3111–3119. URL: https://dl.acm.org/doi/10.5555/2999792.2999959 (date accessed: 08.11.2025)</mixed-citation><mixed-citation xml:lang="en">Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed Representations of Words and Phrases and their Compositionality // Advances in Neural Information Processing Systems (NIPS 26). 2013. P. 3111–3119. URL: https://dl.acm.org/doi/10.5555/2999792.2999959 (date accessed: 08.11.2025)</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Pennington J., Socher R., Manning C. GloVe: Global Vectors for Word Representation // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 
2014. P. 1532–1543. https://doi.org/10.3115/v1/D14-1162</mixed-citation><mixed-citation xml:lang="en">Pennington J., Socher R., Manning C. GloVe: Global Vectors for Word Representation // Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. P. 1532–1543. https://doi.org/10.3115/v1/D14-1162</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Joulin A., Grave E., Bojanowski P., Mikolov T. Bag of Tricks for Efficient Text Classification // Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain, April 2017. P. 427–431. https://doi.org/10.18653/v1/E17-2068</mixed-citation><mixed-citation xml:lang="en">Joulin A., Grave E., Bojanowski P., Mikolov T. Bag of Tricks for Efficient Text Classification // Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain, April 2017. P. 427–431. https://doi.org/10.18653/v1/E17-2068</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Feng F., Yang Y., Cer D., Arivazhagan N., Wang W. Language-agnostic BERT Sentence Embedding // Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). Dublin, Ireland, May 2022. P. 878–891. https://doi.org/10.18653/v1/2022.acl-long.62</mixed-citation><mixed-citation xml:lang="en">Feng F., Yang Y., Cer D., Arivazhagan N., Wang W. Language-agnostic BERT Sentence Embedding // Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). Dublin, Ireland, May 2022. P. 878–891. 
https://doi.org/10.18653/v1/2022.acl-long.62</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Zmitrovich D. et al. A Family of Pretrained Transformer Language Models for Russian // arXiv preprint 2023. https://doi.org/10.48550/arXiv.2309.10931</mixed-citation><mixed-citation xml:lang="en">Zmitrovich D. et al. A Family of Pretrained Transformer Language Models for Russian // arXiv preprint 2023. https://doi.org/10.48550/arXiv.2309.10931</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Kuratov Y., Arkhipov M. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language // arXiv preprint 2019. https://doi.org/10.48550/arXiv.1905.07213</mixed-citation><mixed-citation xml:lang="en">Kuratov Y., Arkhipov M. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language // arXiv preprint 2019. https://doi.org/10.48550/arXiv.1905.07213</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Nikolich A., Puchkova A. Fine-tuning GPT-3 for Russian Text Summarization // arXiv preprint 2021. https://doi.org/10.48550/arXiv.2108.03502</mixed-citation><mixed-citation xml:lang="en">Nikolich A., Puchkova A. Fine-tuning GPT-3 for Russian Text Summarization // arXiv preprint 2021. https://doi.org/10.48550/arXiv.2108.03502</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Kutuzov A., Kuzmenko E. WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models // In: Ignatov D. et al. (Eds.) Analysis of Images, Social Networks and Texts (AIST 2016). Communications in Computer and Information Science. Vol. 661. Springer, Cham, 2017. 
https://doi.org/10.1007/978-3-319-52920-2_15</mixed-citation><mixed-citation xml:lang="en">Kutuzov A., Kuzmenko E. WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models // In: Ignatov D. et al. (Eds.) Analysis of Images, Social Networks and Texts (AIST 2016). Communications in Computer and Information Science. Vol. 661. Springer, Cham, 2017. https://doi.org/10.1007/978-3-319-52920-2_15</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Kasenchak R.T. What is Semantic Search? and Why Is It Important? // Information Services and Use. 2019. Vol. 39. No. 3. P. 205–213. https://doi.org/10.3233/ISU-190045</mixed-citation><mixed-citation xml:lang="en">Kasenchak R.T. What is Semantic Search? and Why Is It Important? // Information Services and Use. 2019. Vol. 39. No. 3. P. 205–213. https://doi.org/10.3233/ISU-190045</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Shelke P. et al. A Systematic and Comparative Analysis of Semantic Search Algorithms // International Journal on Recent and Innovation Trends in Computing and Communication. 2023. Vol. 11, No. 11s. P. 222–229. https://doi.org/10.17762/ijritcc.v11i11s.8094</mixed-citation><mixed-citation xml:lang="en">Shelke P. et al. A Systematic and Comparative Analysis of Semantic Search Algorithms // International Journal on Recent and Innovation Trends in Computing and Communication. 2023. Vol. 11, No. 11s. P. 222–229. https://doi.org/10.17762/ijritcc.v11i11s.8094</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Weckmüller D., Dunkel A., Burghardt D. Embedding-Based Multilingual Semantic Search for Geo-Textual Data in Urban Studies // Journal of Geovisualization and Spatial Analysis. 2025. Vol. 9. No. 31. P. 1–18. 
https://doi.org/10.1007/s41651-025-00232-5</mixed-citation><mixed-citation xml:lang="en">Weckmüller D., Dunkel A., Burghardt D. Embedding-Based Multilingual Semantic Search for Geo-Textual Data in Urban Studies // Journal of Geovisualization and Spatial Analysis. 2025. Vol. 9. No. 31. P. 1–18. https://doi.org/10.1007/s41651-025-00232-5</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Siddharth Pratap Singh. Vector Search in the Era of Semantic Understanding: A Comprehensive Review of Applications and Implementations // International Journal of Computer Engineering and Technology. 2024. Vol. 15. No. 6. P. 1794–1805. https://doi.org/10.34218/IJCET_15_06_153</mixed-citation><mixed-citation xml:lang="en">Siddharth Pratap Singh. Vector Search in the Era of Semantic Understanding: A Comprehensive Review of Applications and Implementations // International Journal of Computer Engineering and Technology. 2024. Vol. 15. No. 6. P. 1794–1805. https://doi.org/10.34218/IJCET_15_06_153</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Zhou Y. et al. Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words // 2022. https://doi.org/10.48550/arXiv.2205.05092</mixed-citation><mixed-citation xml:lang="en">Zhou Y. et al. Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words // 2022. https://doi.org/10.48550/arXiv.2205.05092</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Healy J., McInnes L. Uniform manifold approximation and projection // Nature Reviews Methods Primers. 2024. Vol. 4. No. 82. P. 1–15. https://doi.org/10.1038/s43586-024-00363-x</mixed-citation><mixed-citation xml:lang="en">Healy J., McInnes L. 
Uniform manifold approximation and projection // Nature Reviews Methods Primers. 2024. Vol. 4. No. 82. P. 1–15. https://doi.org/10.1038/s43586-024-00363-x</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare no conflicts of interest.</p></fn></fn-group></back></article>
