ViT Quantization: CPU-Centric Analysis of the Trade-Off between Size and Speed

Amir Ramisovich Nigmatullin; Rustam Arifovich Lukmanov; Ahmad Taha

Abstract

Deploying Vision Transformer (ViT) models in real medical practice, for example in hospitals or diagnostic centers, is often difficult because clinicians' workstations usually lack powerful graphics processing units (GPUs) and have limited computing resources. This work investigates a complete practical inference pipeline aimed at reducing computational cost without significant loss of predictive performance.

The proposed approach combines several optimization techniques. First, knowledge distillation (KD) is used, in which a compact student model learns to mimic the behavior of a larger, more accurate teacher model. Second, an exponential moving average (EMA) of the model weights is applied to stabilize training and improve generalization. Third, post-training INT8 quantization (PTQ) is explored to reduce model size and accelerate inference. Additionally, a simplified quantization-aware training variant (QAT-lite) is considered, in which the effects of quantization are partially incorporated during fine-tuning.
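The three techniques named above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the temperature of 2.0, the decay of 0.999, and the symmetric per-tensor INT8 scheme are all assumptions chosen for illustration, since the abstract does not state these details.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumed form (Hinton-style distillation); the temperature is illustrative.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (assumed scheme).

    Returns integer codes in [-127, 127] and the floating-point scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]
```

Under this scheme each weight is stored in one byte instead of the four bytes of FP32, and the round-trip error of any value is bounded by half the scale.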

Experiments are conducted on the ISIC dataset, which contains dermoscopic images of skin lesions. Model performance is evaluated using standard classification metrics, including accuracy, macro-averaged F1 score, and area under the ROC curve (ROC-AUC). CPU performance is also analyzed, including inference latency, throughput, memory consumption, and the final model size.
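As a rough illustration of how two of these measurements can be computed, the sketch below implements a macro-averaged F1 score and a simple CPU latency and throughput benchmark. The helper names `macro_f1` and `benchmark`, the warm-up and run counts, and the use of the median as the latency statistic are assumptions for illustration; the abstract does not describe the authors' measurement protocol.

```python
import statistics
import time

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def benchmark(predict, batch, warmup=3, runs=20):
    """Return (median latency in ms per batch, throughput in samples per second).

    `predict` is any callable that takes the batch; warm-up calls are
    excluded from timing so one-off setup costs do not skew the result.
    """
    for _ in range(warmup):
        predict(batch)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        timings.append(time.perf_counter() - start)
    median_s = statistics.median(timings)
    throughput = len(batch) / median_s if median_s > 0 else float("inf")
    return median_s * 1000.0, throughput
```

Running the same `benchmark` call on an FP32 model and its INT8 counterpart, with an identical batch and thread configuration, gives directly comparable latency and throughput figures.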

The results show that post-training INT8 quantization preserves performance close to the FP32 baseline while substantially reducing memory and computational requirements. In contrast, QAT-lite does not provide consistent, reproducible improvements over PTQ.

Keywords

Vision Transformer, knowledge distillation, EMA, post-training quantization, quantization-aware training

About the Authors

Amir Ramisovich Nigmatullin

Innopolis University
Russian Federation

Rustam Arifovich Lukmanov

Innopolis University
Russian Federation

Ahmad Taha

Innopolis University
Russian Federation

For citations:

Nigmatullin A.R., Lukmanov R.A., Taha A. ViT Quantization: CPU-Centric Analysis of the Trade-Off between Size and Speed. Russian Digital Libraries Journal. 2026;29(1):262-286. (In Russ.)

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 1562-5419 (Online)
