<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.26907/1562-5419-2025-28-4-931-942</article-id><article-id custom-type="elpub" pub-id-type="custom">ellibs-601</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Алгоритмический фреймворк для извлечения информационного ядра веб-страницы</article-title><trans-title-group xml:lang="en"><trans-title>An Algorithmic Framework for Accurately Extracting Main Content from News Websites</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Салем</surname><given-names>Хамза</given-names></name><name name-style="western" xml:lang="en"><surname>Salem</surname><given-names>Hamza</given-names></name></name-alternatives><email xlink:type="simple">h.salem@innopolis.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Тощев</surname><given-names>Александр Сергеевич</given-names></name><name name-style="western" 
xml:lang="en"><surname>Toschev</surname><given-names>Alexander Sergeevich</given-names></name></name-alternatives><email xlink:type="simple">atoschev@kpfu.ru</email><xref ref-type="aff" rid="aff-2"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Университет Иннополис</institution></aff><aff xml:lang="en"><institution>Innopolis University</institution></aff></aff-alternatives><aff-alternatives id="aff-2"><aff xml:lang="ru"><institution>Казанский (Приволжский) федеральный университет</institution></aff><aff xml:lang="en"><institution>Kazan (Volga region) Federal University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>19</day><month>12</month><year>2025</year></pub-date><volume>28</volume><issue>4</issue><fpage>931</fpage><lpage>942</lpage><permissions><copyright-statement>Copyright &#x00A9; Салем Х., Тощев А.С., 2025</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="ru">Салем Х., Тощев А.С.</copyright-holder><copyright-holder xml:lang="en">Salem H., Toschev A.S.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://ellibs.elpub.ru/jour/article/view/601">https://ellibs.elpub.ru/jour/article/view/601</self-uri><abstract><p>Представлен новый точный алгоритм MCE извлечения основного содержимого с новостных веб-сайтов. 
Предложенный алгоритм использует анализ структуры объектной модели документа (DOM) и метрики плотности контента 
для идентификации и извлечения информационного ядра веб-страницы. Реализованный подход объединяет три ключевые особенности: максимальное количество прямых дочерних элементов с текстом, максимальное текстовое содержимое без дочерних элементов, содержащих текст, и ближайшее расположение 
к средней глубине узла. Алгоритм продемонстрировал лучшую производительность по сравнению с существующими решениями, такими как Boilerpipe и Readability, достигая 99,96% точности, 99,69% полноты и 99,80% F1-меры на использованном комплексном наборе данных из 500 разнообразных веб-страниц. Языково-независимый дизайн делает алгоритм особенно эффективным для извлечения мультиязычного контента, включая языки со сложной структурой, такие, например, как арабский.
</p></abstract><trans-abstract xml:lang="en"><p>This paper presents MCE, a new high-precision algorithm for extracting the main content from news websites. The algorithm analyzes the structure of the Document Object Model (DOM) and uses content density metrics to identify and extract the informational core of a web page. The approach combines three key features: the largest number of direct child elements containing text, the largest amount of text content excluding child elements that contain text, and the depth closest to the average node depth. The algorithm outperformed existing solutions such as Boilerpipe and Readability, achieving 99.96% precision, 99.69% recall, and a 99.80% F1-score on a comprehensive dataset of 500 diverse web pages. Its language-independent design makes it particularly effective for extracting multilingual content, including languages with complex structures such as Arabic.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>NLP</kwd><kwd>извлечение данных</kwd><kwd>языково-независимый алгоритм</kwd><kwd>RAG (Retrieval-Augmented Generation)</kwd></kwd-group><kwd-group xml:lang="en"><kwd>NLP</kwd><kwd>Data Extraction</kwd><kwd>Language-Independent Algorithm</kwd><kwd>RAG (Retrieval-Augmented Generation)</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Jach T., Kaczmarek M., Kaczmarek T. Web content extraction: A survey of techniques and applications // Information Sciences. 2021. Vol. 570. P. 378–400.</mixed-citation><mixed-citation xml:lang="en">Jach T., Kaczmarek M., Kaczmarek T. Web content extraction: A survey of techniques and applications // Information Sciences. 2021. Vol. 570. P. 378–400.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">https://doi.org/10.1016/j.ins.2021.04.014</mixed-citation><mixed-citation xml:lang="en">https://doi.org/10.1016/j.ins.2021.04.014</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Brown K., Davis L. Content density metrics for web page analysis // Information Retrieval Journal. 2020. Vol. 23, No. 4. P. 512–530.</mixed-citation><mixed-citation xml:lang="en">Brown K., Davis L. Content density metrics for web page analysis // Information Retrieval Journal. 2020. Vol. 23, No. 4. P. 512–530.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">https://doi.org/10.1007/s10791-020-09380-4</mixed-citation><mixed-citation xml:lang="en">https://doi.org/10.1007/s10791-020-09380-4</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Gottron T. 
Content extraction from web pages // Proceedings of the 2008 ACM Symposium on Applied Computing. 2008. P. 1160–1164.</mixed-citation><mixed-citation xml:lang="en">Gottron T. Content extraction from web pages // Proceedings of the 2008 ACM Symposium on Applied Computing. 2008. P. 1160–1164.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">https://doi.org/10.1145/1363686.1363939</mixed-citation><mixed-citation xml:lang="en">https://doi.org/10.1145/1363686.1363939</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Insa D., Silva J., Tomás C. Using content extraction for web page classification // Information Processing &amp; Management. 2013. Vol. 49, No. 1. P. 235–250. https://doi.org/10.1016/j.ipm.2012.05.005</mixed-citation><mixed-citation xml:lang="en">Insa D., Silva J., Tomás C. Using content extraction for web page classification // Information Processing &amp; Management. 2013. Vol. 49, No. 1. P. 235–250. https://doi.org/10.1016/j.ipm.2012.05.005</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Qi X., Zhang Y., Wang L. Investigating the impact of content extraction on sentiment analysis // Information Processing &amp; Management. 2024. Vol. 61, No. 1. 103245. https://doi.org/10.1016/j.ipm.2023.103245</mixed-citation><mixed-citation xml:lang="en">Qi X., Zhang Y., Wang L. Investigating the impact of content extraction on sentiment analysis // Information Processing &amp; Management. 2024. Vol. 61, No. 1. 103245. https://doi.org/10.1016/j.ipm.2023.103245</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Zhang W., Liu X. Machine learning approaches to content extraction // Pattern Recognition. 2022. Vol. 125. 
108456.</mixed-citation><mixed-citation xml:lang="en">Zhang W., Liu X. Machine learning approaches to content extraction // Pattern Recognition. 2022. Vol. 125. 108456.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">https://doi.org/10.1016/j.patcog.2022.108456</mixed-citation><mixed-citation xml:lang="en">https://doi.org/10.1016/j.patcog.2022.108456</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">White C., Black D. Quality assessment metrics for extracted content // Data Quality Journal. 2021. Vol. 8, No. 2. P. 78–95.</mixed-citation><mixed-citation xml:lang="en">White C., Black D. Quality assessment metrics for extracted content // Data Quality Journal. 2021. Vol. 8, No. 2. P. 78–95.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Kohlschütter C. Boilerpipe: A Python library for extracting text from HTML // GitHub Repository. 2010. https://github.com/misja/python-boilerpipe</mixed-citation><mixed-citation xml:lang="en">Kohlschütter C. Boilerpipe: A Python library for extracting text from HTML // GitHub Repository. 2010. https://github.com/misja/python-boilerpipe</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Mozilla Foundation. Readability: A Python library for extracting article content from HTML // GitHub Repository. 2020.</mixed-citation><mixed-citation xml:lang="en">Mozilla Foundation. Readability: A Python library for extracting article content from HTML // GitHub Repository. 
2020.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">https://github.com/mozilla/readability</mixed-citation><mixed-citation xml:lang="en">https://github.com/mozilla/readability</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Purple I., Orange J. A comparative study of content extraction methods // Journal of Web Science. 2021. Vol. 7, No. 3. P. 123–140.</mixed-citation><mixed-citation xml:lang="en">Purple I., Orange J. A comparative study of content extraction methods // Journal of Web Science. 2021. Vol. 7, No. 3. P. 123–140.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Webz.io. Webz.io Free News Datasets // Webz.io. 2023.</mixed-citation><mixed-citation xml:lang="en">Webz.io. Webz.io Free News Datasets // Webz.io. 2023.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">https://webz.io/free-news-datasets</mixed-citation><mixed-citation xml:lang="en">https://webz.io/free-news-datasets</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Research Team. Elkateb: Browser Extension for Content Extraction // Browser Extension. 2024. https://github.com/elkateb/extension</mixed-citation><mixed-citation xml:lang="en">Research Team. Elkateb: Browser Extension for Content Extraction // Browser Extension. 2024. https://github.com/elkateb/extension</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Bobyr M.V., Milostnaya N.A., Bulatnikov V.A. The fuzzy filter based on the method of areas’ ratio // Applied Soft Computing. 2022. Vol. 117. 
108449.</mixed-citation><mixed-citation xml:lang="en">Bobyr M.V., Milostnaya N.A., Bulatnikov V.A. The fuzzy filter based on the method of areas’ ratio // Applied Soft Computing. 2022. Vol. 117. 108449.</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">https://doi.org/10.1016/j.asoc.2022.108449</mixed-citation><mixed-citation xml:lang="en">https://doi.org/10.1016/j.asoc.2022.108449</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare no conflicts of interest.</p></fn></fn-group></back></article>
