AI in Cancer Prevention: a Retrospective Study
https://doi.org/10.26907/1562-5419-2025-28-5-1253-1266
Abstract
This study investigates whether population-scale cancer screening can be supported effectively by artificial intelligence (AI) methods that predict the risk of malignant neoplasms from minimal electronic health record (EHR) data: medical diagnosis codes and service codes. We considered a broad spectrum of modern approaches, including classical machine learning, survival analysis, deep learning, and large language models (LLMs). In numerical experiments, gradient boosting that uses survival analysis models as additional predictors ranked patients by cancer risk better than the other approaches, and it accounts for both population-level and individual risk factors for malignant neoplasms. The predictors constructed from EHR data include demographic characteristics, healthcare utilization patterns, and clinical markers. The solution was tested in retrospective experiments under the supervision of specialist oncologists. In a retrospective experiment involving more than 1.9 million patients, the risk group captured up to 5.4 times more patients with cancer for the same volume of medical examinations. The method is scalable: it uses only diagnosis and service codes, requires no specialized infrastructure, and can be integrated into oncological vigilance processes, which makes it applicable to population-scale cancer screening.
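One plausible reading of the "up to 5.4 times more patients with cancer" figure is a lift: the cancer rate among the highest-ranked patients divided by the cancer rate in the whole cohort, at a fixed number of examinations. The sketch below illustrates that reading on toy data. The function name `lift_at_budget`, the scores, and the labels are hypothetical and are not taken from the study, whose exact evaluation protocol is not given in the abstract.

```python
def lift_at_budget(risk_scores, has_cancer, budget):
    """Cancer rate among the `budget` highest-risk patients divided by the
    cancer rate in the whole cohort (hypothetical helper, not from the study)."""
    if len(risk_scores) != len(has_cancer):
        raise ValueError("scores and labels must have the same length")
    if not 0 < budget <= len(risk_scores):
        raise ValueError("budget must be between 1 and the cohort size")
    total_cases = sum(has_cancer)
    if total_cases == 0:
        raise ValueError("lift is undefined when the cohort has no cases")

    # Rank patients from highest to lowest predicted risk.
    ranked = sorted(zip(risk_scores, has_cancer), key=lambda p: p[0], reverse=True)

    # Cases found if only the top `budget` patients are examined.
    cases_in_group = sum(label for _, label in ranked[:budget])

    rate_in_group = cases_in_group / budget
    base_rate = total_cases / len(has_cancer)
    return rate_in_group / base_rate


# Toy cohort of 10 patients (illustrative values only).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]

lift_top2 = lift_at_budget(scores, labels, budget=2)
lift_all = lift_at_budget(scores, labels, budget=10)
print(f"lift when examining 2 patients: {lift_top2:.2f}")
print(f"lift when examining everyone:   {lift_all:.2f}")
```

Examining everyone always gives a lift of 1, so the metric is only informative when the examination budget is smaller than the cohort.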
About the Authors
Petr Aleksandrovich Philonenko
Russian Federation
Vladimir Nikolaevich Kokh
Russian Federation
Pavel Dmitrievich Blinov
Russian Federation
For citations:
Philonenko P.A., Kokh V.N., Blinov P.D. AI in Cancer Prevention: a Retrospective Study. Russian Digital Libraries Journal. 2025;28(5):1253-1266. (In Russ.) https://doi.org/10.26907/1562-5419-2025-28-5-1253-1266