<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.26907/1562-5419-2024-27-5-730-744</article-id><article-id custom-type="elpub" pub-id-type="custom">ellibs-569</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Автоматическое аннотирование html-документов по стандарту Microdata</article-title><trans-title-group xml:lang="en"><trans-title>Automatic Annotation of HTML Documents using the Microdata Standard</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Ибрагимов</surname><given-names>Тимур Фердинандович</given-names></name><name name-style="western" xml:lang="en"><surname>Ibragimov</surname><given-names>Timur Ferdinandovich</given-names></name></name-alternatives><email xlink:type="simple">i.timur0701@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Ференец</surname><given-names>Александр Андреевич</given-names></name><name name-style="western" 
xml:lang="en"><surname>Ferenets</surname><given-names>Alexander Andreevich</given-names></name></name-alternatives><email xlink:type="simple">ist.kazan@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Казанский (Приволжский) федеральный университет</institution></aff><aff xml:lang="en"><institution>Kazan (Volga region) Federal University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2024</year></pub-date><pub-date pub-type="epub"><day>28</day><month>05</month><year>2025</year></pub-date><volume>27</volume><issue>5</issue><fpage>730</fpage><lpage>744</lpage><permissions><copyright-statement>Copyright &#x00A9; Ибрагимов Т.Ф., Ференец А.А., 2025</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="ru">Ибрагимов Т.Ф., Ференец А.А.</copyright-holder><copyright-holder xml:lang="en">Ibragimov T.F., Ferenets A.A.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://ellibs.elpub.ru/jour/article/view/569">https://ellibs.elpub.ru/jour/article/view/569</self-uri><abstract><p>Описана разработка на основе методов машинного обучения приложения для автоматического аннотирования веб-страниц по стандарту Microdata с возможностью расширения для других стандартов и с внедрением данных в JSX-файлы. Собраны и подготовлены датасеты для обучения моделей Machine Learning (ML). Собраны и проанализированы метрики модели ML.
</p></abstract><trans-abstract xml:lang="en"><p>This article describes the development of a machine-learning-based application that automatically annotates web pages according to the Microdata standard, can be extended to other markup standards, and injects the data into JSX files. Datasets were collected and prepared for training the machine learning (ML) models, and the metrics of the ML model were collected and analyzed.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>Microdata</kwd><kwd>семантическая разметка</kwd><kwd>HTML5</kwd><kwd>поисковая оптимизация (SEO)</kwd><kwd>поисковые системы</kwd><kwd>машинное обучение</kwd><kwd>schema.org</kwd><kwd>семантический веб</kwd><kwd>стандарты разметки</kwd><kwd>автоматизация SEO</kwd></kwd-group><kwd-group xml:lang="en"><kwd>Microdata</kwd><kwd>semantic markup</kwd><kwd>HTML5</kwd><kwd>search engine optimization (SEO)</kwd><kwd>search engines</kwd><kwd>machine learning</kwd><kwd>schema.org</kwd><kwd>semantic web</kwd><kwd>markup standards</kwd><kwd>SEO automation</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">HTML5 (HyperText Markup Language). URL: https://html.spec.whatwg.org/multipage/introduction.html.</mixed-citation><mixed-citation xml:lang="en">HTML5 (HyperText Markup Language). URL: https://html.spec.whatwg.org/multipage/introduction.html.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">JSX. URL: https://www.typescriptlang.org/docs/handbook/jsx.html.</mixed-citation><mixed-citation xml:lang="en">JSX. URL: https://www.typescriptlang.org/docs/handbook/jsx.html.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Microdata. URL: https://html.spec.whatwg.org/multipage/microdata.html.</mixed-citation><mixed-citation xml:lang="en">Microdata. URL: https://html.spec.whatwg.org/multipage/microdata.html.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">JSON-LD. URL: https://json-ld.org.</mixed-citation><mixed-citation xml:lang="en">JSON-LD. 
URL: https://json-ld.org.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Brinkmann A., Primpeli A., Bizer Ch. The Web Data Commons Schema.org Data Set Series. URL: https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Brinkmann-etal-TheWDCSchemaorgDataSetSeries-WWW2023.pdf.</mixed-citation><mixed-citation xml:lang="en">Brinkmann A., Primpeli A., Bizer Ch. The Web Data Commons Schema.org Data Set Series. URL: https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Research/Web-based_Systems/pub/Brinkmann-etal-TheWDCSchemaorgDataSetSeries-WWW2023.pdf.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Schemas. URL: https://schema.org/docs/schemas.html.</mixed-citation><mixed-citation xml:lang="en">Schemas. URL: https://schema.org/docs/schemas.html.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">RDFa. URL: https://www.w3.org/TR/html-rdfa.</mixed-citation><mixed-citation xml:lang="en">RDFa. URL: https://www.w3.org/TR/html-rdfa.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Microformats. URL: https://microformats.org.</mixed-citation><mixed-citation xml:lang="en">Microformats. URL: https://microformats.org.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Local Business Schema Generator – MicroData &amp; JSON-LD. URL: https://microdatagenerator.org/localbusiness-microdata-generator.</mixed-citation><mixed-citation xml:lang="en">Local Business Schema Generator – MicroData &amp; JSON-LD. 
URL: https://microdatagenerator.org/localbusiness-microdata-generator.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Structured Data Markup Helper. URL: https://www.google.com/webmasters/markup-helper/u/0.</mixed-citation><mixed-citation xml:lang="en">Structured Data Markup Helper. URL: https://www.google.com/webmasters/markup-helper/u/0.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Entity SEO Tools. URL: https://inlinks.com/.</mixed-citation><mixed-citation xml:lang="en">Entity SEO Tools. URL: https://inlinks.com/.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Web-segment. URL: https://github.com/liaocyintl/web-segment.</mixed-citation><mixed-citation xml:lang="en">Web-segment. URL: https://github.com/liaocyintl/web-segment.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Utiu N., Ionescu V.-S. Learning Web Content Extraction with DOM Features. URL: http://dx.doi.org/10.1109/ICCP.2018.8516632.</mixed-citation><mixed-citation xml:lang="en">Utiu N., Ionescu V.-S. Learning Web Content Extraction with DOM Features. URL: http://dx.doi.org/10.1109/ICCP.2018.8516632.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Peters M. E., Lecocq D. Content extraction using diverse feature sets // WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, May 13–17, 2013. Association for Computing Machinery, New York, NY, United States: 2013, pages 89–90.</mixed-citation><mixed-citation xml:lang="en">Peters M. E., Lecocq D. 
Content extraction using diverse feature sets // WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, May 13–17, 2013. Association for Computing Machinery, New York, NY, United States: 2013, pages 89–90.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Wu G., Li L., Hu X., Wu X. Web news extraction via path ratios. URL: https://dl.acm.org/doi/abs/10.1145/2505515.2505558.</mixed-citation><mixed-citation xml:lang="en">Wu G., Li L., Hu X., Wu X. Web news extraction via path ratios. URL: https://dl.acm.org/doi/abs/10.1145/2505515.2505558.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Vadrevu S., Gelgi F., Davulcu H. Semantic partitioning of web pages // Web Information Systems Engineering–WISE 2005: 6th International Conference on Web Information Systems Engineering, New York, NY, USA, November 20–22, 2005. Proceedings 6. – Springer Berlin Heidelberg, 2005. P. 107–118.</mixed-citation><mixed-citation xml:lang="en">Vadrevu S., Gelgi F., Davulcu H. Semantic partitioning of web pages // Web Information Systems Engineering–WISE 2005: 6th International Conference on Web Information Systems Engineering, New York, NY, USA, November 20–22, 2005. Proceedings 6. – Springer Berlin Heidelberg, 2005. P. 107–118.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Extraction Results from the October 2022 Common Crawl Corpus. URL: https://webdatacommons.org/structureddata/#results-2022-1.</mixed-citation><mixed-citation xml:lang="en">Extraction Results from the October 2022 Common Crawl Corpus. 
URL: https://webdatacommons.org/structureddata/#results-2022-1.</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Common Crawl September/October 2022 Crawl Archive (CC-MAIN-2022-40). URL: https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-40/index.html.</mixed-citation><mixed-citation xml:lang="en">Common Crawl September/October 2022 Crawl Archive (CC-MAIN-2022-40). URL: https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-40/index.html.</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">SPARQL Query Language. URL: https://www.w3.org/TR/sparql11-query.</mixed-citation><mixed-citation xml:lang="en">SPARQL Query Language. URL: https://www.w3.org/TR/sparql11-query.</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">BERT. URL: https://arxiv.org/abs/1810.04805.</mixed-citation><mixed-citation xml:lang="en">BERT. URL: https://arxiv.org/abs/1810.04805.</mixed-citation></citation-alternatives></ref><ref id="cit21"><label>21</label><citation-alternatives><mixed-citation xml:lang="ru">Babel. URL: https://babeljs.io/.</mixed-citation><mixed-citation xml:lang="en">Babel. URL: https://babeljs.io/.</mixed-citation></citation-alternatives></ref><ref id="cit22"><label>22</label><citation-alternatives><mixed-citation xml:lang="ru">TypeScript. URL: https://www.typescriptlang.org/.</mixed-citation><mixed-citation xml:lang="en">TypeScript. URL: https://www.typescriptlang.org/.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that they have no conflicts of interest.</p></fn></fn-group></back></article>
