Vit Quantization: CPU-Centric Analysis of the Trade-Off between Size and Speed
Abstract
Using Vision Transformer (ViT) models in real medical practice – for example, in hospitals or diagnostic centers – is often difficult because doctors' work computers usually do not have powerful graphics processors (GPUs), and computing resources are limited. This work investigates a complete practical pipeline for model inference, aimed at reducing computational costs without significant loss of predictive performance.
The proposed approach combines several optimization techniques. First, knowledge distillation (KD) is used, where a compact student model learns to mimic the behavior of a larger, more accurate teacher model. Second, Exponential Moving Average (EMA) of the model weights is applied to stabilize training and improve generalization. Third, post-training INT8 quantization (PTQ) is explored to reduce model size and accelerate inference. Additionally, a simplified quantization-aware training variant (QAT-lite) is considered, where the effects of quantization are partially incorporated during fine-tuning.
Experiments are conducted on the ISIC dataset, which contains dermoscopic images of skin lesions. Model performance is evaluated using standard classification metrics, including accuracy, macro-averaged F1 score, and area under the ROC curve (ROC-AUC). CPU performance is also analyzed, including inference latency, throughput, memory consumption, and the final model size.
The results show that post-training INT8 quantization preserves performance close to the FP32 baseline while substantially reducing memory and computational requirements. In contrast, QAT-lite does not consistently provide reproducible improvements over PTQ.
About the Authors
Amir Ramisovich NigmatullinRussian Federation
Rustam Arifovich Lukmanov
Russian Federation
Ahmad Taha
Russian Federation
References
1. Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // arXiv. 2022.
2. He K., Gan C., et al. Transformers in Medical Image Analysis: A Review // arXiv. 2022.
3. Atabansi C.C., Nie J., et al. A Survey of Transformer Applications for Histopathological Image Analysis: New Developments and Future Directions // Biomedical Engineering Online. 2023. Vol. 22, No. 1. https://doi.org/10.1186/s12938-023-01069-5
4. Azad R., Kazerouni A., Heidari M., et al. Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review // arXiv. 2023.
5. Shamshad F., Khan S., Zamir S.W., et al. Transformers in Medical Imaging: A Survey // Medical Image Analysis. 2024. Vol. 88. https://doi.org/10.1016/j.media.2023.102843
6. Liu Y., et al. A Recent Survey of Vision Transformers for Medical Image Segmentation // arXiv. 2023.
7. Wu F., et al. Lite Transformer with Long-Short Range Attention // Proceedings of the International Conference on Learning Representations (ICLR). 2020.
8. Jacob B., et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. P. 2704–2713. https://doi.org/10.1109/CVPR.2018.00286
9. Nagel M., et al. A White Paper on Neural Network Quantization // arXiv. 2021.
10. Han S., Mao H., Dally W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding // arXiv. 2016.
11. Yao Z., et al. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers // Advances in Neural Information Processing Systems (NeurIPS). 2022. Vol. 35.
12. Wikipedia contributors. Model Compression // Wikipedia. 2025.
13. Hinton G., Vinyals O., Dean J. Distilling the Knowledge in a Neural Network // arXiv. 2015.
14. Gou J., et al. Knowledge Distillation: A Survey // International Journal of Computer Vision. 2021. Vol. 129, No. 6. P. 1789–1819.https://doi.org/10.1007/s11263-021-01453-z
15. Umirzakova S., et al. Simplified Knowledge Distillation for Deep Neural Networks: Bridging the Performance Gap with a Novel Teacher–Student Architecture // Electronics. 2024. Vol. 13, No. 3. https://doi.org/10.3390/electronics13030512
16. Liang P., et al. Data-Free Knowledge Distillation with Feature Synthesis and Spatial Consistency for Image Analysis // Scientific Reports. 2024. Vol. 14, No. 1. https://doi.org/10.1038/s41598-024-53241-3
Review
For citations:
Nigmatullin A.R., Lukmanov R.A., Taha A. Vit Quantization: CPU-Centric Analysis of the Trade-Off between Size and Speed. Russian Digital Libraries Journal. 2026;29(1):262-286. (In Russ.)
JATS XML














