An Algorithmic Framework for Accurately Extracting Main Content from News Websites

Hamza Salem, Innopolis University, h.salem@innopolis.ru
Alexander Sergeevich Toschev, Kazan (Volga region) Federal University, atoschev@kpfu.ru

Article type: Research Article
Journal: Russian Digital Libraries Journal (Электронные библиотеки), ISSN 1562-5419
Publisher: Kazan (Volga region) Federal University
Issue: 2025, Vol. 28, No. 4, P. 931–942
Date: 19 December 2025
DOI: 10.26907/1562-5419-2025-28-4-931-942
Article ID: ellibs-601
URL: https://ellibs.elpub.ru/jour/article/view/601

© 2025 Salem H., Toschev A.S.
This work is licensed under a Creative Commons Attribution 4.0 License.

This paper presents MCE, a new high-precision algorithm for extracting the main content from news websites. The algorithm analyzes the structure of the Document Object Model (DOM) and uses content density metrics to identify and extract the informational core of a web page. It combines three key features of a candidate node: the largest number of direct child elements that contain text, the largest amount of text held without child elements that themselves contain text, and the position closest to the average node depth. On a dataset of 500 diverse web pages, the algorithm outperformed existing solutions such as Boilerpipe and Readability, achieving 99.96% precision, 99.69% recall, and a 99.80% F1-score. Because its design is language-independent, the algorithm is particularly effective for extracting multilingual content, including languages with complex structures such as Arabic.
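The three structural features named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the features are combined, so the lexicographic ranking below, the reading of the second feature as "text length of direct children that have no text-bearing descendants", and all names (`Node`, `TreeBuilder`, `select_main_node`, `extract_main_content`) are assumptions made for illustration only. It uses just the Python standard library.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # their contents are never treated as text
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.children = []  # child elements only
        self.parts = []     # strings and child Nodes in document order

    def own_text(self):
        """Text placed directly inside this element (not in descendants)."""
        return " ".join(p.strip() for p in self.parts
                        if isinstance(p, str) and p.strip())

    def full_text(self):
        """All text in this subtree, in document order."""
        out = []
        for p in self.parts:
            text = p.strip() if isinstance(p, str) else p.full_text()
            if text:
                out.append(text)
        return " ".join(out)


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current.parts.append(node)
        self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.parts.append(data)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def text_children(node):
    """Feature 1 (assumed reading): direct children that hold text."""
    return [c for c in node.children if c.own_text()]


def leaf_text_length(node):
    """Feature 2 (assumed reading): text in direct children that have
    no text-bearing element below them."""
    total = 0
    for c in text_children(node):
        if not any(d.own_text() for g in c.children for d in iter_nodes(g)):
            total += len(c.own_text())
    return total


def select_main_node(root):
    nodes = list(iter_nodes(root))
    text_depths = [n.depth for n in nodes if n.own_text()]
    candidates = [n for n in nodes if text_children(n)]
    if not candidates:
        return None
    mean_depth = sum(text_depths) / len(text_depths)
    # Feature 3: prefer nodes close to the average depth of text nodes.
    # Lexicographic ranking is an assumption, not the published scoring.
    return max(candidates, key=lambda n: (len(text_children(n)),
                                          leaf_text_length(n),
                                          -abs(n.depth - mean_depth)))


def extract_main_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = select_main_node(builder.root)
    return best.full_text() if best else ""
```

Calling `extract_main_content(page_html)` on a page whose article body is a container with several text paragraphs should return that container's text while skipping short navigation and footer blocks. Real pages need more care (comment sections, link-heavy sidebars), which this sketch does not attempt to handle.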

Keywords: NLP, data extraction, language-independent algorithm, RAG (Retrieval-Augmented Generation)

References

Jach T., Kaczmarek M., Kaczmarek T. Web content extraction: A survey of techniques and applications // Information Sciences. 2021. Vol. 570. P. 378–400. https://doi.org/10.1016/j.ins.2021.04.014

Brown K., Davis L. Content density metrics for web page analysis // Information Retrieval Journal. 2020. Vol. 23, No. 4. P. 512–530. https://doi.org/10.1007/s10791-020-09380-4

Gottron T. Content extraction from web pages // Proceedings of the 2008 ACM Symposium on Applied Computing. 2008. P. 1160–1164. https://doi.org/10.1145/1363686.1363939

Insa D., Silva J., Tomás C. Using content extraction for web page classification // Information Processing & Management. 2013. Vol. 49, No. 1. P. 235–250. https://doi.org/10.1016/j.ipm.2012.05.005

Qi X., Zhang Y., Wang L. Investigating the impact of content extraction on sentiment analysis // Information Processing & Management. 2024. Vol. 61, No. 1. 103245. https://doi.org/10.1016/j.ipm.2023.103245

Zhang W., Liu X. Machine learning approaches to content extraction // Pattern Recognition. 2022. Vol. 125. 108456. https://doi.org/10.1016/j.patcog.2022.108456

White C., Black D. Quality assessment metrics for extracted content // Data Quality Journal. 2021. Vol. 8, No. 2. P. 78–95.

Kohlschütter C. Boilerpipe: A Python library for extracting text from HTML // GitHub Repository. 2010. https://github.com/misja/python-boilerpipe

Mozilla Foundation. Readability: A Python library for extracting article content from HTML // GitHub Repository. 2020. https://github.com/mozilla/readability

Purple I., Orange J. A comparative study of content extraction methods // Journal of Web Science. 2021. Vol. 7, No. 3. P. 123–140.

Webz.io. Webz.io Free News Datasets // Webz.io. 2023. https://webz.io/free-news-datasets

Research Team. Elkateb: Browser Extension for Content Extraction // Browser Extension. 2024. https://github.com/elkateb/extension

Bobyr M.V., Milostnaya N.A., Bulatnikov V.A. The fuzzy filter based on the method of areas’ ratio // Applied Soft Computing. 2022. Vol. 117. 108449. https://doi.org/10.1016/j.asoc.2022.108449

The authors declare that they have no conflicts of interest.