
ellibs

Russian Digital Libraries Journal (Электронные библиотеки)

1562-5419

Kazan (Volga Region) Federal University

10.26907/1562-5419-2025-28-6-1435-1453

ellibs-627

Research Article

Articles

Word Search in Handwritten Text Based on Stroke Segmentation

Ivan Dmitrievich Morozov

morozov-ivan-2003@yandex.ru

Leonid Moiseevich Mestetskiy

mestlm@mail.ru

Lomonosov Moscow State University

National Research University Higher School of Economics

2025

19 December 2025

Vol. 28, no. 6. P. 1435–1453

2025

Morozov I.D., Mestetskiy L.M.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://ellibs.elpub.ru/jour/article/view/627

Handwritten archival documents are a fundamental part of humanity's cultural heritage, yet analyzing them remains labor-intensive for professional researchers such as historians, philologists, and linguists. Unlike commercial applications of optical character recognition (OCR), work with historical manuscripts requires a fundamentally different approach because of the extreme diversity of handwriting, the presence of corrections, and the degradation of materials. This paper proposes a method for searching handwritten texts based on stroke segmentation. Instead of full text recognition, which is often unattainable for historical documents, the method answers researchers' search queries efficiently. The key idea is to decompose the text into elementary strokes, form semantic vector representations with contrastive learning, and then cluster and classify these representations to create an adaptive handwriting dictionary. Experiments show that search by comparing tuples of reduced sequences of the most informative strokes using the Levenshtein distance provides sufficient quality for the task. The method is robust to individual handwriting characteristics and writing variations, which is particularly important for authors' archives and historical documents. The proposed approach opens new possibilities for accelerating research in the humanities: it can reduce the time needed to find relevant information from weeks to minutes, which qualitatively changes what is feasible when working with large archives of handwritten documents.

handwritten text, search, stroke analysis, segmentation, vector representation, contrastive learning, clustering
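The search step named in the abstract, comparing stroke sequences by Levenshtein distance, can be illustrated with a minimal sketch. It assumes that each word has already been reduced to a sequence of stroke-class labels taken from the handwriting dictionary. The labels (`c1`, `c4`, ...), the function names, and the distance threshold are hypothetical and are not taken from the paper, and the sketch does not reproduce the authors' reduction to the most informative strokes.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences of stroke-class labels."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def search(query: Sequence[str],
           words: List[Sequence[str]],
           max_dist: int) -> List[Tuple[int, int]]:
    """Return (distance, word index) pairs within max_dist, closest first."""
    hits = [(levenshtein(query, w), idx) for idx, w in enumerate(words)]
    return sorted(h for h in hits if h[0] <= max_dist)


# Hypothetical stroke-label sequences for three words on a page.
page_words = [
    ["c1", "c4", "c2", "c7"],
    ["c1", "c4", "c7"],
    ["c5", "c5", "c9"],
]
query = ["c1", "c4", "c2", "c7"]
print(search(query, page_words, max_dist=1))  # [(0, 0), (1, 1)]
```

Working with label sequences rather than pixels lets a word written with one stroke missing or misclassified still match within a small edit distance, which is one plausible reading of the robustness to writing variations claimed in the abstract.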


The authors declare that they have no conflicts of interest.