ellibs

Russian Digital Libraries Journal (Электронные библиотеки)

1562-5419

Kazan (Volga Region) Federal University

10.26907/1562-5419-2025-28-6-1385-1414

ellibs-625

Research Article

Articles

Post-Correction of Weak Transcriptions by Large Language Models in the Iterative Process of Handwritten Text Recognition

Zykov

Valerii Pavlovich

zykovvp@my.msu.ru

Mestetskiy

Leonid Moiseevich

mestlm@mail.ru

Lomonosov Moscow State University

National Research University Higher School of Economics

2025

19.12.2025

Vol. 28, No. 6, pp. 1385–1414

2025

Zykov V.P., Mestetskiy L.M.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://ellibs.elpub.ru/jour/article/view/625

This paper addresses the problem of accelerating the construction of accurate editorial annotations for handwritten archival texts within an incremental training cycle based on weak transcription. Unlike our previously published results, the present work focuses on integrating automatic post-correction of weak transcriptions using large language models (LLMs). We propose and implement a protocol for applying LLMs at the line level in a few-shot setup with carefully designed prompts and strict output format control: pre-reform orthography is preserved, proper names and numerals are protected, and the line structure may not be changed. Experiments are conducted on the corpus of diaries by A.V. Sukhovo-Kobylin, with the line-level variant of the Vertical Attention Network (VAN) as the base recognition model. The results show that LLM post-correction, exemplified by the ChatGPT-4o service, noticeably improves the readability of weak transcriptions and reduces the word error rate by about 12 percentage points in our experiments, without degrading the character error rate. Another service tested, DeepSeek-R1, behaved less stably. We discuss practical prompt settings and limitations (context length limits, the risk of "hallucinations") and give recommendations for safely integrating LLM post-correction into an iterative annotation pipeline, with the aim of reducing the expert annotator's workload and speeding up the digitization of historical archives.
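The two error rates and the output format control mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard Levenshtein-based definitions of character error rate (CER) and word error rate (WER). The function names, the toy strings, and the acceptance rule (same number of lines, identical digit sequences in each line) are hypothetical illustrations, not the authors' actual protocol or prompts.

```python
import re
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def numerals(text: str) -> List[str]:
    """Digit sequences in a line, in order of appearance."""
    return re.findall(r"\d+", text)


def accept_correction(weak_lines: List[str], corrected_lines: List[str]) -> List[str]:
    """Hypothetical guard applied to LLM output.

    If the line count changed, the whole correction is rejected.
    Otherwise a corrected line is kept only when its numerals match
    those of the weak line; if not, the weak line is kept.
    """
    if len(weak_lines) != len(corrected_lines):
        return list(weak_lines)
    return [c if numerals(c) == numerals(w) else w
            for w, c in zip(weak_lines, corrected_lines)]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    reference = "we left at 7 in the morning"
    weak = ["we lefl at 7 in the mornin"]
    llm_output = ["we left at 7 in the morning"]

    final = accept_correction(weak, llm_output)
    print("WER before:", wer(reference, weak[0]))
    print("WER after: ", wer(reference, final[0]))
    print("CER before:", cer(reference, weak[0]))
    print("CER after: ", cer(reference, final[0]))
```

Falling back to the weak line when a check fails is one conservative design choice: a rejected correction never leaves the transcription worse than the recognizer output.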

handwritten text recognition, weak markup, Vertical Attention Network (VAN), large language models (LLM), post-correction, iterative retraining

References

Penskaya E.N., Kuptsova O.N. The Invisible Quantity. A.V. Sukhovo-Kobylin: Theater, Literature, Life. Moscow: HSE Publishing House, 2024. 472 p. (In Russ.)

Mestetskiy L.M., Smirnova V.S. Line segmentation in images of handwritten documents // Proceedings of the International Conference on Computer Graphics and Vision (Grafikon-2025). Yoshkar-Ola: Volga State Technological University, 2025. (In Russ.)

Mestetskiy L.M., Zykov V.P. Incremental markup of 19th-century handwritten archival diaries // Software & Systems. 2025. Vol. 38, No. 4. https://doi.org/10.15827/0236-235X.152. (In Russ.)

Coquenet D., Chatelain C., Paquet T. End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023. Vol. 45, No. 1. P. 508–524. https://doi.org/10.1109/TPAMI.2022.3144899

Boltunova E.M., Laptev A.K. Handwriting recognition and data mining: Possibilities of neural network technologies (based on admiral Fyodor Lutke's diary) // Imagology and Comparative Studies. 2025. No. 23. P. 358–379. https://doi.org/10.17223/24099554/23/17. (In Russ.)

Brown T.B., Mann B., Ryder N., Subbiah M. et al. Language Models are Few-Shot Learners // Advances in Neural Information Processing Systems (NeurIPS). 2020. Vol. 33. P. 1877–1901.

Marti U.-V., Bunke H. The IAM-database: an English sentence database for offline handwriting recognition // International Journal on Document Analysis and Recognition (IJDAR). 2002. Vol. 5, No. 1. P. 39–46. https://doi.org/10.1007/s100320200071

Sánchez J., Romero V., Toselli A. H., Vidal E. ICFHR2016 competition on handwritten text recognition on the READ dataset // Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR 2016). 2016. P. 630–635.

Shi B., Bai X., Yao C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017. Vol. 39, No. 11. P. 2298–2304. https://doi.org/10.1109/TPAMI.2016.2646371

Graves A., Fernández S., Gomez F., Schmidhuber J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks // Proceedings of the 23rd International Conference on Machine Learning (ICML 2006). 2006. P. 369–376. https://doi.org/10.1145/1143844.1143891

Coquenet D., Chatelain C., Paquet T. SPAN: A Simple Predict & Align Network for Handwritten Paragraph Recognition // Document Analysis and Recognition – ICDAR 2021. Lecture Notes in Computer Science, Vol. 12823. Springer, 2021. P. 70–84. https://doi.org/10.1007/978-3-030-86334-0_5

Yousef M., Bishop T.E. OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by Learning to Unfold // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020). 2020. P. 14710–14719. https://doi.org/10.1109/CVPR42600.2020.01472

Li M., Lv T., Chen J., Cui L., Lu Y., Florencio D., Zhang C., Li Z., Wei F. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models // Proceedings of the AAAI Conference on Artificial Intelligence. 2023. Vol. 37, No. 12. P. 14216–14224.

Potanin M., Dimitrov D., Shonenkov A., Bataev V., Karachev D., Novopoltsev M., Chertok A. Digital Peter: New Dataset, Competition and Handwriting Recognition Methods // Proceedings of the 6th International Workshop on Historical Document Imaging and Processing. ACM, 2021. P. 43–48. https://doi.org/10.1145/3476887.3476892

Lakshminarayanan B., Pritzel A., Blundell C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles // Advances in Neural Information Processing Systems (NeurIPS). 2017. Vol. 30. P. 6402–6413.

The authors declare no conflicts of interest.