References

ellibs

Электронные библиотеки

Russian Digital Libraries Journal

1562-5419

Kazan (Volga Region) Federal University

ellibs-720

Research Article

Articles

ViT Quantization: CPU-Centric Analysis of the Trade-Off between Size and Speed

Nigmatullin

Amir Ramisovich

am.nigmatullin@innopolis.university

Lukmanov

Rustam Arifovich

r.lukmanov@innopolis.university

Taha

Ahmad

a.taha@innopolis.university

Innopolis University

2026

04.03.2026

Vol. 29, No. 1, P. 262–286

2026

Nigmatullin A.R., Lukmanov R.A., Taha A.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://ellibs.elpub.ru/jour/article/view/720

Deploying Vision Transformer (ViT) models in real medical practice, for example in hospitals or diagnostic centers, is often difficult because clinicians' workstations usually lack powerful graphics processing units (GPUs) and have limited computing resources. This work examines a complete practical inference pipeline aimed at reducing computational cost without a significant loss of predictive performance. The proposed approach combines several optimization techniques. First, knowledge distillation (KD) is used: a compact student model learns to mimic the behavior of a larger, more accurate teacher model. Second, an exponential moving average (EMA) of the model weights is applied to stabilize training and improve generalization. Third, post-training quantization (PTQ) to the INT8 integer format is explored to reduce model size and accelerate inference. In addition, a simplified variant of quantization-aware training (QAT-lite) is considered, in which the effects of quantization are partially taken into account during fine-tuning. Experiments are conducted on the ISIC dataset, which contains dermoscopic images of skin lesions. Predictive performance is evaluated with standard classification metrics: accuracy, macro-averaged F1 score, and area under the ROC curve (ROC-AUC). Performance on the central processing unit (CPU) is also analyzed, including inference latency, throughput, memory consumption, and final model size. The results show that post-training INT8 quantization preserves quality close to that of the FP32 model while substantially reducing memory and computational requirements. In contrast, QAT-lite does not yield consistent, reproducible improvements over PTQ.
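The three techniques named in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the temperature of 2.0, the EMA decay of 0.999, and the choice of symmetric per-tensor INT8 quantization are all assumptions made for illustration, and the abstract does not state which variants or hyperparameters were actually used.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    Assumption: the classic soft-target formulation; the temperature
    value here is illustrative only.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2

def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (assumed scheme).

    Returns integer codes in [-127, 127] and the floating-point scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]

if __name__ == "__main__":
    # Hypothetical logits and weights, used only to exercise the functions.
    print(distillation_loss([0.2, 1.5, -0.3], [0.1, 2.0, -0.5]))
    print(ema_update([1.0, -1.0], [0.0, 0.0], decay=0.9))
    codes, scale = quantize_int8([-0.8, 0.0, 0.25, 0.8])
    print(codes, dequantize_int8(codes, scale))
```

In this scheme each weight is stored as one signed byte plus a shared scale, which is where the roughly fourfold size reduction relative to FP32 comes from. The round-trip error per value is bounded by half the scale.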

Vision Transformer (ViT), knowledge distillation, exponential moving average (EMA), post-training quantization, quantization-aware training

References

Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // arXiv. 2022.

He K., Gan C., et al. Transformers in Medical Image Analysis: A Review // arXiv. 2022.

Atabansi C.C., Nie J., et al. A Survey of Transformer Applications for Histopathological Image Analysis: New Developments and Future Directions // Biomedical Engineering Online. 2023. Vol. 22, No. 1. https://doi.org/10.1186/s12938-023-01069-5

Azad R., Kazerouni A., Heidari M., et al. Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review // arXiv. 2023.

Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // Medical Image Analysis. 2024. Vol. 88. https://doi.org/10.1016/j.media.2023.102843

Liu Y., et al. A Recent Survey of Vision Transformers for Medical Image Segmentation // arXiv. 2023.

Wu F., et al. Lite Transformer with Long-Short Range Attention // Proceedings of the International Conference on Learning Representations (ICLR). 2020.

Jacob B., et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. P. 2704–2713. https://doi.org/10.1109/CVPR.2018.00286

Nagel M., et al. A White Paper on Neural Network Quantization // arXiv. 2021.

Han S., Mao H., Dally W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding // arXiv. 2016.

Yao Z., et al. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers // Advances in Neural Information Processing Systems (NeurIPS). 2022. Vol. 35.

Wikipedia contributors. Model Compression // Wikipedia. 2025.

Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network // arXiv. 2015.

Gou J., et al. Knowledge Distillation: A Survey // International Journal of Computer Vision. 2021. Vol. 129, No. 6. P. 1789–1819. https://doi.org/10.1007/s11263-021-01453-z

Umirzakova S., et al. Simplified Knowledge Distillation for Deep Neural Networks: Bridging the Performance Gap with a Novel Teacher–Student Architecture // Electronics. 2024. Vol. 13, No. 3. https://doi.org/10.3390/electronics13030512

Liang P., et al. Data-Free Knowledge Distillation with Feature Synthesis and Spatial Consistency for Image Analysis // Scientific Reports. 2024. Vol. 14, No. 1. https://doi.org/10.1038/s41598-024-53241-3

The authors declare that they have no conflicts of interest.