<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.26907/1562-5419-2025-28-5-1036-1056</article-id><article-id custom-type="elpub" pub-id-type="custom">ellibs-607</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Нормализация текста, распознанного при помощи технологии оптического распознавания символов, с использованием легковесных LLM</article-title><trans-title-group xml:lang="en"><trans-title>Normalization of Text Recognized by Optical Character Recognition Using Lightweight LLMs</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Вершинин</surname><given-names>Владислав Константинович</given-names></name><name name-style="western" xml:lang="en"><surname>Vershinin</surname><given-names>Vladislav Konstantinovich</given-names></name></name-alternatives><email xlink:type="simple">vershinin@itmo.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" 
xml:lang="ru"><surname>Ходненко</surname><given-names>Иван Владимирович</given-names></name><name name-style="western" xml:lang="en"><surname>Khodnenko</surname><given-names>Ivan Vladimirovich</given-names></name></name-alternatives><email xlink:type="simple">Ivan.Khodnenko@itmo.ru</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Иванов</surname><given-names>Сергей Владимирович</given-names></name><name name-style="western" xml:lang="en"><surname>Ivanov</surname><given-names>Sergey Vladimirovich</given-names></name></name-alternatives><email xlink:type="simple">svivanov@itmo.ru</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Университет ИТМО</institution></aff><aff xml:lang="en"><institution>ITMO University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2025</year></pub-date><pub-date pub-type="epub"><day>19</day><month>12</month><year>2025</year></pub-date><volume>28</volume><issue>5</issue><fpage>1036</fpage><lpage>1056</lpage><permissions><copyright-statement>Copyright &#x00A9; Вершинин В.К., Ходненко И.В., Иванов С.В., 2025</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="ru">Вершинин В.К., Ходненко И.В., Иванов С.В.</copyright-holder><copyright-holder xml:lang="en">Vershinin V.K., Khodnenko I.V., Ivanov S.V.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 
License.</license-p></license></permissions><self-uri xlink:href="https://ellibs.elpub.ru/jour/article/view/607">https://ellibs.elpub.ru/jour/article/view/607</self-uri><abstract><p>Несмотря на значительный прогресс, технологии оптического распознавания символов (OCR) для исторических газет по-прежнему допускают 5–10% ошибок на уровне символов. В работе представлена полностью автоматизированная система нормализации пост-OCR, объединяющая легкие языковые модели (LLM) объемом 7–8 млрд параметров, обученные по инструкциям и квантизованные до 4 бит (INT4), с небольшим набором регулярных выражений. На наборе данных BLN600 (600 страниц британских газет XIX в.) лучшая модель YandexGPT-5-Instruct Q4 снижает Character Error Rate (CER) с 8.4% до 4.0% (–52.5%) и Word Error Rate (WER) с 20.2% до 6.5% (–67.8%), повышая при этом семантическое сходство до 0.962. Система работает на потребительском оборудовании (RTX-4060 Ti, 8 ГБ VRAM) со скоростью около 35 секунд на страницу и не требует дополнительного обучения или параллельных данных. Полученные результаты показывают, что компактные INT4-LLM являются практичной альтернативой крупным моделям для постобработки OCR исторических документов.
</p></abstract><trans-abstract xml:lang="en"><p>Despite recent progress, Optical Character Recognition (OCR) of historical newspapers still yields character error rates of 5–10%. We present a fully automated post-OCR normalization pipeline that combines lightweight, instruction-tuned LLMs with 7–8B parameters, quantized to 4 bits (INT4), with a small set of regex rules. On the BLN600 benchmark (600 pages of 19th-century British newspapers), our best model, YandexGPT-5-Instruct Q4, reduces Character Error Rate (CER) from 8.4% to 4.0% (–52.5%) and Word Error Rate (WER) from 20.2% to 6.5% (–67.8%), while raising semantic similarity to 0.962. The system runs on consumer hardware (RTX-4060 Ti, 8 GB VRAM) at about 35 seconds per page and requires no fine-tuning or parallel training data. These results indicate that compact INT4 LLMs are a practical alternative to large models for post-OCR cleanup of historical documents.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>оптическое распознавание символов</kwd><kwd>пост-OCR-коррекция</kwd><kwd>исторические газеты</kwd><kwd>большие языковые модели</kwd><kwd>квантизация</kwd><kwd>INT4</kwd><kwd>конвейер нормализации</kwd><kwd>ошибка на уровне символов</kwd><kwd>семантическое сходство</kwd><kwd>регулярные выражения</kwd><kwd>YandexGPT-5</kwd><kwd>легкие модели</kwd><kwd>обработка естественного языка</kwd><kwd>цифровые гуманитарные науки</kwd><kwd>оцифровка документов</kwd></kwd-group><kwd-group xml:lang="en"><kwd>optical character recognition</kwd><kwd>post-OCR correction</kwd><kwd>historical newspapers</kwd><kwd>large language models</kwd><kwd>quantization</kwd><kwd>INT4</kwd><kwd>normalization pipeline</kwd><kwd>character error rate</kwd><kwd>semantic similarity</kwd><kwd>regex rules</kwd><kwd>YandexGPT-5</kwd><kwd>lightweight models</kwd><kwd>natural language processing</kwd><kwd>digital humanities</kwd><kwd>document digitization</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Memon J., Sami M., Khan R.A. Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR) // IEEE Access. 2020. Vol. 8. P. 142642–142668. https://doi.org/10.1109/ACCESS.2020.3012542</mixed-citation><mixed-citation xml:lang="en">Memon J., Sami M., Khan R.A. Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR) // IEEE Access. 2020. Vol. 8. P. 142642–142668. https://doi.org/10.1109/ACCESS.2020.3012542</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Thomas A., Gaizauskas R., Lu H. Leveraging LLMs for Post-OCR Correction of Historical Newspapers // Proceedings of the 2nd Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA). 2024. P. 116–121. 
https://doi.org/10.18653/v1/2024.lt4hala-1.6</mixed-citation><mixed-citation xml:lang="en">Thomas A., Gaizauskas R., Lu H. Leveraging LLMs for Post-OCR Correction of Historical Newspapers // Proceedings of the 2nd Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA). 2024. P. 116–121. https://doi.org/10.18653/v1/2024.lt4hala-1.6</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Bourne J. Scrambled text: training Language Models to correct OCR errors using synthetic data // arXiv preprint. 2024. arXiv:2409.19735. https://doi.org/10.48550/arXiv.2409.19735</mixed-citation><mixed-citation xml:lang="en">Bourne J. Scrambled text: training Language Models to correct OCR errors using synthetic data // arXiv preprint. 2024. arXiv:2409.19735. https://doi.org/10.48550/arXiv.2409.19735</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Holley R. How Good Can It Get? Analysing and Improving OCR Accuracy in Large-Scale Historic Newspaper Digitisation Programs // D-Lib Magazine. 2009. Vol. 15, No. 3/4. https://doi.org/10.1045/march2009-holley</mixed-citation><mixed-citation xml:lang="en">Holley R. How Good Can It Get? Analysing and Improving OCR Accuracy in Large-Scale Historic Newspaper Digitisation Programs // D-Lib Magazine. 2009. Vol. 15, No. 3/4. https://doi.org/10.1045/march2009-holley</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">van Strien D., Beelen K., Coll Ardanuy M., Hosseini K., McGillivray B., Tolfo G.S. Assessing the Impact of OCR Quality on Downstream NLP Tasks // Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020). 2020. P. 484–496. 
https://doi.org/10.5220/0009169004840496</mixed-citation><mixed-citation xml:lang="en">van Strien D., Beelen K., Coll Ardanuy M., Hosseini K., McGillivray B., Tolfo G.S. Assessing the Impact of OCR Quality on Downstream NLP Tasks // Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020). 2020. P. 484–496. https://doi.org/10.5220/0009169004840496</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Drobac S., Friberg Heppin K., Wirén M., Lindén K. Optical Character Recognition with Neural Networks and Post-Correction with Finite State Methods // International Journal on Document Analysis and Recognition (IJDAR). 2020. Vol. 23. P. 279–295. https://doi.org/10.1007/s10032-020-00359-9</mixed-citation><mixed-citation xml:lang="en">Drobac S., Friberg Heppin K., Wirén M., Lindén K. Optical Character Recognition with Neural Networks and Post-Correction with Finite State Methods // International Journal on Document Analysis and Recognition (IJDAR). 2020. Vol. 23. P. 279–295. https://doi.org/10.1007/s10032-020-00359-9</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Neudecker C., Antonacopoulos A. Making Europe’s Historical Newspapers Searchable // DAS 2016 Workshop / Europeana Newspapers. 2016. (Workshop paper). URL: https://www.primaresearch.org/www/files/das2016/Europeana%20Newspapers.pdf</mixed-citation><mixed-citation xml:lang="en">Neudecker C., Antonacopoulos A. Making Europe’s Historical Newspapers Searchable // DAS 2016 Workshop / Europeana Newspapers. 2016. (Workshop paper). URL: https://www.primaresearch.org/www/files/das2016/Europeana%20Newspapers.pdf</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Boillet M., Kermorvant C., Paquet T. 
Robust text line detection in historical documents: learning and evaluation methods // International Journal on Document Analysis and Recognition (IJDAR). 2022. Vol. 25. P. 95–114. https://doi.org/10.1007/s10032-022-00395-7</mixed-citation><mixed-citation xml:lang="en">Boillet M., Kermorvant C., Paquet T. Robust text line detection in historical documents: learning and evaluation methods // International Journal on Document Analysis and Recognition (IJDAR). 2022. Vol. 25. P. 95–114. https://doi.org/10.1007/s10032-022-00395-7</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Ermakova L., Tolfo G.S., Hosseini K. On the Impact of OCR Quality on Named Entity Extraction from Historical Newspapers // DH Benelux 2021 (Extended abstracts). 2021. URL: https://dhbenelux.org/wp-content/uploads/booklet2021.pdf#page=66</mixed-citation><mixed-citation xml:lang="en">Ermakova L., Tolfo G.S., Hosseini K. On the Impact of OCR Quality on Named Entity Extraction from Historical Newspapers // DH Benelux 2021 (Extended abstracts). 2021. URL: https://dhbenelux.org/wp-content/uploads/booklet2021.pdf#page=66</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Kettunen K. Optical Character Recognition Quality Affects Perceived Usefulness and Trust // arXiv preprint. 2022. arXiv:2209.08222.</mixed-citation><mixed-citation xml:lang="en">Kettunen K. Optical Character Recognition Quality Affects Perceived Usefulness and Trust // arXiv preprint. 2022. arXiv:2209.08222.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Sreelekha S., Sumam A.R., Nair R.R. Systematic Review on Text Normalization Techniques and Its Approach to Non-Standard Words // Preprint. 2023 (ResearchGate). 
URL: https://www.researchgate.net/publication/373877004</mixed-citation><mixed-citation xml:lang="en">Sreelekha S., Sumam A.R., Nair R.R. Systematic Review on Text Normalization Techniques and Its Approach to Non-Standard Words // Preprint. 2023 (ResearchGate). URL: https://www.researchgate.net/publication/373877004</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Shi Y., Peng D., Liao W., Lin Z., Chen X., Liu C., Zhang Y., Jin L. Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-Depth Evaluation // arXiv preprint. 2023. arXiv:2310.16809.</mixed-citation><mixed-citation xml:lang="en">Shi Y., Peng D., Liao W., Lin Z., Chen X., Liu C., Zhang Y., Jin L. Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-Depth Evaluation // arXiv preprint. 2023. arXiv:2310.16809.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Guan S., Xu C., Lin M., Greene D. Effective Synthetic Data and Test-Time Adaptation for OCR Correction // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). 2024. P. 15412–15425 (ACL Anthology).</mixed-citation><mixed-citation xml:lang="en">Guan S., Xu C., Lin M., Greene D. Effective Synthetic Data and Test-Time Adaptation for OCR Correction // Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). 2024. P. 15412–15425 (ACL Anthology).</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Kanerva J., Ledins C., Käpyaho S., Ginter F. OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches // Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025). 2025. 
Tallinn, Estonia (ACL Anthology).</mixed-citation><mixed-citation xml:lang="en">Kanerva J., Ledins C., Käpyaho S., Ginter F. OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches // Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025). 2025. Tallinn, Estonia (ACL Anthology).</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Rijhwani S., Anastasopoulos A., Neubig G. OCR Post-Correction for Endangered Language Texts // Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. P. 5931–5942. https://doi.org/10.18653/v1/2020.emnlp-main.478</mixed-citation><mixed-citation xml:lang="en">Rijhwani S., Anastasopoulos A., Neubig G. OCR Post-Correction for Endangered Language Texts // Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. P. 5931–5942. https://doi.org/10.18653/v1/2020.emnlp-main.478</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Jin R., Du J., Huang W., Liu W., Luan J., Wang B., Xiong D. A Comprehensive Evaluation of Quantization Strategies for Large Language Models // Findings of ACL 2024 (also arXiv preprint). 2024. arXiv:2402.16775. https://doi.org/10.48550/arXiv.2402.16775</mixed-citation><mixed-citation xml:lang="en">Jin R., Du J., Huang W., Liu W., Luan J., Wang B., Xiong D. A Comprehensive Evaluation of Quantization Strategies for Large Language Models // Findings of ACL 2024 (also arXiv preprint). 2024. arXiv:2402.16775. https://doi.org/10.48550/arXiv.2402.16775</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Mekala A., Atmakuru A., Song Y., Karpinska M., Iyyer M. 
Does Quantization Affect Models’ Performance on Long-Context Tasks? // arXiv preprint. 2025. arXiv:2505.20276. https://doi.org/10.48550/arXiv.2505.20276</mixed-citation><mixed-citation xml:lang="en">Mekala A., Atmakuru A., Song Y., Karpinska M., Iyyer M. Does Quantization Affect Models’ Performance on Long-Context Tasks? // arXiv preprint. 2025. arXiv:2505.20276. https://doi.org/10.48550/arXiv.2505.20276</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Booth C.W., Thomas A., Gaizauskas R. BLN600: A Parallel Corpus of Machine/Human Transcribed Nineteenth-Century Newspaper Texts // Proceedings of LREC-COLING 2024. 2024. P. 2440–2446. https://doi.org/10.15131/shef.data.25439023</mixed-citation><mixed-citation xml:lang="en">Booth C.W., Thomas A., Gaizauskas R. BLN600: A Parallel Corpus of Machine/Human Transcribed Nineteenth-Century Newspaper Texts // Proceedings of LREC-COLING 2024. 2024. P. 2440–2446. https://doi.org/10.15131/shef.data.25439023</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Gupta H., Del Corro L., Broscheit S., Hoffart J., Brenner E. Unsupervised Multi-View Post-OCR Error Correction with Language Models // Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2021. P. 8647–8652. https://doi.org/10.18653/v1/2021.emnlp-main.680</mixed-citation><mixed-citation xml:lang="en">Gupta H., Del Corro L., Broscheit S., Hoffart J., Brenner E. Unsupervised Multi-View Post-OCR Error Correction with Language Models // Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2021. P. 8647–8652. 
https://doi.org/10.18653/v1/2021.emnlp-main.680</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare no conflicts of interest.</p></fn></fn-group></back></article>
