<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id custom-type="elpub" pub-id-type="custom">ellibs-720</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Квантование Vision Transformer: CPU-центричный анализ компромисса между размером модели и скоростью инференса</article-title><trans-title-group xml:lang="en"><trans-title>Vit Quantization: CPU-Centric Analysis of the Trade-Off between Size and Speed</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Нигматуллин</surname><given-names>Амир Рамисович</given-names></name><name name-style="western" xml:lang="en"><surname>Nigmatullin</surname><given-names>Amir Ramisovich</given-names></name></name-alternatives><email xlink:type="simple">am.nigmatullin@innopolis.university</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Лукманов</surname><given-names>Рустам Арифович</given-names></name><name name-style="western" xml:lang="en"><surname>Lukmanov</surname><given-names>Rustam Arifovich</given-names></name></name-alternatives><email xlink:type="simple">r.lukmanov@innopolis.university</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Таха</surname><given-names>Ахмад</given-names></name><name name-style="western" xml:lang="en"><surname>Taha</surname><given-names>Ahmad</given-names></name></name-alternatives><email xlink:type="simple">a.taha@innopolis.university</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Университет Иннополис</institution></aff><aff xml:lang="en"><institution>Innopolis University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2026</year></pub-date><pub-date pub-type="epub"><day>04</day><month>03</month><year>2026</year></pub-date><volume>29</volume><issue>1</issue><fpage>262</fpage><lpage>286</lpage><permissions><copyright-statement>Copyright &amp;#x00A9; Нигматуллин А.Р., Лукманов Р.А., Таха А., 2026</copyright-statement><copyright-year>2026</copyright-year><copyright-holder xml:lang="ru">Нигматуллин А.Р., Лукманов Р.А., Таха А.</copyright-holder><copyright-holder xml:lang="en">Nigmatullin A.R., Lukmanov R.A., Taha A.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://ellibs.elpub.ru/jour/article/view/720">https://ellibs.elpub.ru/jour/article/view/720</self-uri><abstract><p>Использование моделей Vision Transformer (ViT) в реальной медицинской практике, например в больницах или диагностических центрах, часто затруднено, потому что на рабочих компьютерах врачей обычно нет мощных графических процессоров (GPU), а имеющиеся вычислительные ресурсы ограничены. В настоящей работе рассмотрен полный путь практической реализации модели на этапе применения (pipeline инференса), направленный на снижение вычислительных затрат без существенной потери качества.


Предложенный подход объединяет несколько методов оптимизации. 
Во-первых, использована дистилляция знаний (knowledge distillation) – метод обучения, при котором компактная модель копирует поведение более крупной и точной модели-учителя. Во-вторых, применено экспоненциальное скользящее среднее (Exponential Moving Average, EMA) весов, позволяющее стабилизировать обучение и повысить обобщающую способность модели. 
В-третьих, исследована посттренировочная квантизация до целочисленного формата INT8 (post-training quantization, PTQ), направленная на уменьшение размера модели и ускорение инференса. Дополнительно рассмотрен упрощенный вариант квантизации совместно с обучением (QAT-lite), при котором эффекты квантизации частично учитываются во время дообучения модели.


Эксперименты проведены на датасете ISIC, содержащем дерматоскопические изображения кожных новообразований. Оценка качества моделей включает стандартные метрики классификации: точность (accuracy), макроусредненную F1-меру и площадь под ROC-кривой (ROC-AUC). Проанализированы характеристики производительности на центральном процессоре (CPU), включая задержку инференса, пропускную способность, потребление памяти и итоговый размер модели.


Полученные результаты показали, что посттренировочная INT8-квантизация позволяет сохранить качество, близкое к модели в формате FP32, при существенном снижении требований к памяти и вычислительным ресурсам. В то же время использование QAT-lite не демонстрирует устойчивых и воспроизводимых улучшений по сравнению с PTQ.
</p></abstract><trans-abstract xml:lang="en"><p>Using Vision Transformer (ViT) models in real medical practice – for example, in hospitals or diagnostic centers – is often difficult because doctors' work computers usually do not have powerful graphics processors (GPUs), and computing resources are limited. This work investigates a complete practical pipeline for model inference, aimed at reducing computational costs without significant loss of predictive performance.


The proposed approach combines several optimization techniques. First, knowledge distillation (KD) is used, where a compact student model learns to mimic the behavior of a larger, more accurate teacher model. Second, Exponential Moving Average (EMA) of the model weights is applied to stabilize training and improve generalization. Third, post-training INT8 quantization (PTQ) is explored to reduce model size and accelerate inference. Additionally, a simplified quantization-aware training variant (QAT-lite) is considered, where the effects of quantization are partially incorporated during fine-tuning.


Experiments are conducted on the ISIC dataset, which contains dermoscopic images of skin lesions. Model performance is evaluated using standard classification metrics, including accuracy, macro-averaged F1 score, and area under the ROC curve (ROC-AUC). CPU performance is also analyzed, including inference latency, throughput, memory consumption, and the final model size.


The results show that post-training INT8 quantization preserves performance close to the FP32 baseline while substantially reducing memory and computational requirements. In contrast, QAT-lite does not consistently provide reproducible improvements over PTQ.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>Визуальный трансформер (ViT)</kwd><kwd>дистилляция знаний</kwd><kwd>экспоненциальная скользящая средняя (EMA)</kwd><kwd>посттренировочная квантизация</kwd><kwd>обучение с учетом квантования</kwd></kwd-group><kwd-group xml:lang="en"><kwd>Vision Transformer</kwd><kwd>knowledge distillation</kwd><kwd>EMA</kwd><kwd>post-training quantization</kwd><kwd>quantization-aware training</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // arXiv. 2022.</mixed-citation><mixed-citation xml:lang="en">Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // arXiv. 2022.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">He K., Gan C., et al. Transformers in Medical Image Analysis: A Review // arXiv. 2022.</mixed-citation><mixed-citation xml:lang="en">He K., Gan C., et al. Transformers in Medical Image Analysis: A Review // arXiv. 2022.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Atabansi C.C., Nie J., et al. A Survey of Transformer Applications for Histopathological Image Analysis: New Developments and Future Directions // Biomedical Engineering Online. 2023. Vol. 22, No. 1. https://doi.org/10.1186/s12938-023-01069-5</mixed-citation><mixed-citation xml:lang="en">Atabansi C.C., Nie J., et al. A Survey of Transformer Applications for Histopathological Image Analysis: New Developments and Future Directions // Biomedical Engineering Online. 2023. Vol. 22, No. 1. https://doi.org/10.1186/s12938-023-01069-5</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Azad R., Kazerouni A., Heidari M., et al. Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review // arXiv. 2023.</mixed-citation><mixed-citation xml:lang="en">Azad R., Kazerouni A., Heidari M., et al. Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review // arXiv. 2023.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // Medical Image Analysis. 2024. Vol. 88. https://doi.org/10.1016/j.media.2023.102843</mixed-citation><mixed-citation xml:lang="en">Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // Medical Image Analysis. 2024. Vol. 88. https://doi.org/10.1016/j.media.2023.102843</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Liu Y., et al. A Recent Survey of Vision Transformers for Medical Image Segmentation // arXiv. 2023.</mixed-citation><mixed-citation xml:lang="en">Liu Y., et al. A Recent Survey of Vision Transformers for Medical Image Segmentation // arXiv. 2023.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Wu F., et al. Lite Transformer with Long-Short Range Attention // Proceedings of the International Conference on Learning Representations (ICLR). 2020.</mixed-citation><mixed-citation xml:lang="en">Wu F., et al. Lite Transformer with Long-Short Range Attention // Proceedings of the International Conference on Learning Representations (ICLR). 2020.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Jacob B., et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. P. 2704–2713. https://doi.org/10.1109/CVPR.2018.00286</mixed-citation><mixed-citation xml:lang="en">Jacob B., et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. P. 2704–2713. https://doi.org/10.1109/CVPR.2018.00286</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Nagel M., et al. A White Paper on Neural Network Quantization // arXiv. 2021.</mixed-citation><mixed-citation xml:lang="en">Nagel M., et al. A White Paper on Neural Network Quantization // arXiv. 2021.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Han S., Mao H., Dally W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding // arXiv. 2016.</mixed-citation><mixed-citation xml:lang="en">Han S., Mao H., Dally W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding // arXiv. 2016.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Yao Z., et al. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers // Advances in Neural Information Processing Systems (NeurIPS). 2022. Vol. 35.</mixed-citation><mixed-citation xml:lang="en">Yao Z., et al. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers // Advances in Neural Information Processing Systems (NeurIPS). 2022. Vol. 35.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Wikipedia contributors. Model Compression // Wikipedia. 2025.</mixed-citation><mixed-citation xml:lang="en">Wikipedia contributors. Model Compression // Wikipedia. 2025.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network // arXiv. 2015.</mixed-citation><mixed-citation xml:lang="en">Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network // arXiv. 2015.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Gou J., et al. Knowledge Distillation: A Survey // International Journal of Computer Vision. 2021. Vol. 129, No. 6. P. 1789–1819.https://doi.org/10.1007/s11263-021-01453-z</mixed-citation><mixed-citation xml:lang="en">Gou J., et al. Knowledge Distillation: A Survey // International Journal of Computer Vision. 2021. Vol. 129, No. 6. P. 1789–1819.https://doi.org/10.1007/s11263-021-01453-z</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Umirzakova S., et al. Simplified Knowledge Distillation for Deep Neural Networks: Bridging the Performance Gap with a Novel Teacher–Student Architecture // Electronics. 2024. Vol. 13, No. 3. https://doi.org/10.3390/electronics13030512</mixed-citation><mixed-citation xml:lang="en">Umirzakova S., et al. Simplified Knowledge Distillation for Deep Neural Networks: Bridging the Performance Gap with a Novel Teacher–Student Architecture // Electronics. 2024. Vol. 13, No. 3. https://doi.org/10.3390/electronics13030512</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Liang P., et al. Data-Free Knowledge Distillation with Feature Synthesis and Spatial Consistency for Image Analysis // Scientific Reports. 2024. Vol. 14, No. 1. https://doi.org/10.1038/s41598-024-53241-3</mixed-citation><mixed-citation xml:lang="en">Liang P., et al. Data-Free Knowledge Distillation with Feature Synthesis and Spatial Consistency for Image Analysis // Scientific Reports. 2024. Vol. 14, No. 1. https://doi.org/10.1038/s41598-024-53241-3</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
