An Algorithmic Framework for Accurately Extracting Main Content from News Websites

Hamza Salem; Alexander Sergeevich Toschev

doi:10.26907/1562-5419-2025-28-4-931-942

Abstract

This paper presents a new high-precision main content extraction (MCE) algorithm for news websites. The algorithm analyzes the Document Object Model (DOM) structure together with content density metrics to identify and extract the informational core of a web page. It combines three key features: the largest number of direct child elements containing text, the largest amount of text held in elements that have no text-bearing child elements, and the position closest to the average node depth. On a dataset of 500 diverse web pages, the algorithm outperformed existing solutions such as Boilerpipe and Readability, achieving 99.96% precision, 99.69% recall, and a 99.80% F1-score. Because its design is language-independent, the algorithm is particularly effective for multilingual content, including languages with complex structures such as Arabic.
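
To make the three features more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not state how the features are combined, so the equal-weight normalized sum used here is an assumption, as is the reading of the second feature as the total text length of direct children that have no text-bearing children of their own. The names Node, TreeBuilder, extract_main, main_text, and SAMPLE_HTML are hypothetical; only the Python standard library is used.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not become the current node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}  # their contents are not readable text


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element only
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (elements and their own text)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIPPED_TAGS:
            self.current.text += data.strip()


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract_main(html):
    """Return the element with the best combined score, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = [n for n in walk(builder.root) if n is not builder.root]
    if not nodes:
        return None
    mean_depth = sum(n.depth for n in nodes) / len(nodes)

    rows = []
    for n in nodes:
        text_children = [c for c in n.children if c.text]
        # Feature 1: number of direct child elements containing text.
        f1 = len(text_children)
        # Feature 2 (assumed reading): text length of those children that
        # have no text-bearing child elements themselves.
        f2 = sum(len(c.text) for c in text_children
                 if not any(g.text for g in c.children))
        # Feature 3: distance from the average node depth (smaller is better).
        f3 = abs(n.depth - mean_depth)
        rows.append((n, f1, f2, f3))

    candidates = [r for r in rows if r[1] > 0]
    if not candidates:
        return None
    max1 = max(r[1] for r in candidates) or 1
    max2 = max(r[2] for r in candidates) or 1
    max3 = max(r[3] for r in rows) or 1

    # Assumption: equal-weight sum of the three normalized features.
    def score(r):
        return r[1] / max1 + r[2] / max2 + (1 - r[3] / max3)

    return max(candidates, key=score)[0]


def main_text(node):
    return "\n".join(c.text for c in node.children if c.text)


SAMPLE_HTML = """
<html><body>
<nav><a>Home</a><a>World</a><a>Sport</a></nav>
<div id="article">
<h1>Headline of the story</h1>
<p>First paragraph with a reasonable amount of text in it.</p>
<p>Second paragraph continues the story with more detail.</p>
<p>Third paragraph wraps up the report.</p>
</div>
<footer><span>Copyright</span></footer>
</body></html>
"""

best = extract_main(SAMPLE_HTML)
print(best.tag, best.attrs.get("id"))
print(main_text(best))
```

In this toy page the navigation bar and the article container sit at the same depth, so the depth feature alone cannot separate them; the article container wins because it has more text-bearing children and far more text. Real pages with inline markup inside paragraphs would need a more careful treatment of the second feature than this sketch provides.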

Keywords

NLP, Data Extraction, Language-Independent Algorithm, RAG (Retrieval-Augmented Generation)

About the Authors

Hamza Salem

Innopolis University
Russian Federation

Alexander Sergeevich Toschev

Kazan (Volga region) Federal University
Russian Federation

For citations:

Salem H., Toschev A.S. An Algorithmic Framework for Accurately Extracting Main Content from News Websites. Russian Digital Libraries Journal. 2025;28(4):931-942. (In Russ.) https://doi.org/10.26907/1562-5419-2025-28-4-931-942

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 1562-5419 (Online)
