References

ellibs

Russian Digital Libraries Journal (Электронные библиотеки)

1562-5419

Kazan (Volga region) Federal University

10.26907/1562-5419-2025-28-5-1085-1102

ellibs-610

Research Article

Articles

Neuro-Symbolic Approach to Augmented Text Generation via Automated Induction of Morphotactic Rules

Isangulov Marat Vildanovich (marathon.our@gmail.com)

Elizarov Alexander Mikhailovich (amelizarov@gmail.com)

Kunafin Aygiz Razhapovich (aigizk@gmail.com)

Gatiatullin Airat Rafizovich (ayrat.gatiatullin@gmail.com)

Prokopyev Nikolai Arkadievich (nikolai.prokopyev@gmail.com)

Kazan (Volga region) Federal University

Academy of Sciences of the Republic of Tatarstan

Tatarstan Academy of Sciences

2025

Published: 19 December 2025

Vol. 28, No. 5, pp. 1085–1102

© 2025 Isangulov M.V., Elizarov A.M., Kunafin A.R., Gatiatullin A.R., Prokopyev N.A.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://ellibs.elpub.ru/jour/article/view/610

This work presents a hybrid neuro-symbolic method that combines a large language model (LLM) with a finite-state transducer (FST) to ensure morphological correctness in text generation for agglutinative languages. The system extracts rules automatically from corpus data: for local examples of word forms, the LLM produces chains of morphological analysis, which are then aggregated and ordered into compact descriptions of morphotactic rules (LEXC) and allomorph selection rules (regex). During generation, the LLM and the FST operate jointly: if the automaton does not recognize a token, the LLM derives a “lemma + tags” pair from the context, and the FST realizes the correct surface form. The dataset was a corpus of literary fiction (~1600 sentences). For a list of 50 nouns, 250 word forms were extracted. Following the proposed algorithm, the LLM generated 110 context-sensitive regex rules together with LEXC morphotactics, from which an FST was compiled that recognized 170 of the 250 forms (68%). In an applied machine translation test on a subcorpus of 300 sentences, integrating this FST into the LLM loop improved quality from BLEU 16.14 / ChrF 45.13 to BLEU 25.71 / ChrF 50.87 without fine-tuning the translator. The approach extends to other parts of speech (verbs, adjectives, etc.) and to other agglutinative and low-resource languages, where it can be used to populate lexical and grammatical resources and accelerate their development.
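The generation-time interaction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the FST is replaced by a small dictionary, the LLM call by a canned lookup, and all names (`TOY_GENERATOR`, `fst_accepts`, `fst_generate`, `stub_llm_analyze`, `repair_tokens`) as well as the example word forms are hypothetical and chosen only for illustration. Leaving a token unchanged when no analysis is available is likewise an assumption of this sketch.

```python
from typing import Callable, Dict, List, Optional

# Toy stand-in for a compiled FST (ASSUMPTION: a real system would query a
# transducer compiled from the induced LEXC and regex rules instead).
# Keys are "lemma+tags" analyses, values are surface forms.
TOY_GENERATOR: Dict[str, str] = {
    "китап+N+PL": "китаптар",
    "бала+N+PL": "балалар",
}
# Recognition direction: the surface forms the toy automaton accepts.
TOY_ACCEPTED = set(TOY_GENERATOR.values())


def fst_accepts(token: str) -> bool:
    """Return True if the toy automaton recognizes the surface form."""
    return token in TOY_ACCEPTED


def fst_generate(analysis: str) -> Optional[str]:
    """Map a 'lemma+tags' analysis to a surface form, or None if unknown."""
    return TOY_GENERATOR.get(analysis)


def stub_llm_analyze(token: str, context: List[str]) -> Optional[str]:
    """Hypothetical stand-in for the LLM step that derives 'lemma+tags'
    from context; here it is only a canned lookup."""
    canned = {"китаплар": "китап+N+PL"}
    return canned.get(token)


def repair_tokens(
    tokens: List[str],
    analyze: Callable[[str, List[str]], Optional[str]] = stub_llm_analyze,
) -> List[str]:
    """Keep tokens the FST accepts; for the rest, ask the analyzer for
    'lemma+tags' and let the FST realize the surface form."""
    repaired = []
    for token in tokens:
        if fst_accepts(token):
            repaired.append(token)
            continue
        analysis = analyze(token, tokens)
        surface = fst_generate(analysis) if analysis else None
        # Fall back to the original token when no repair is possible.
        repaired.append(surface if surface is not None else token)
    return repaired


print(repair_tokens(["китаплар", "балалар", "xyz"]))
```

In this sketch the symbolic component has the final say on surface forms, while the neural component only proposes an analysis; a production system would presumably also need tokenization, punctuation handling, and a policy for words outside the lexicon.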

Keywords: neuro-symbolic approach; large language model; finite-state transducers; two-level morphology; LEXC morphotactics; machine translation; agglutinative languages; Bashkir language

References

Sproat R., Østling R. The morphological gap between translation quality and surface accuracy // Proceedings of the WMT 2020 Conference. Online, 2020. P. 1015–1024.

Kann K., Cotterell R., Schütze H. Neural models of inflectional morphology // Proceedings of the 15th Conference of the European Chapter of the ACL (EACL 2017). Valencia, 2017. P. 322–334.

Mielke S., Eisenstein J., Cotterell R. Dialect-to-dialect translation and cross-dialect morphological robustness of language models // Transactions of the ACL. 2021. Vol. 9. P. 288–302.

Koskenniemi K. Two-level morphology: a general computational model for word-form recognition and production. Helsinki: University of Helsinki, Department of General Linguistics, 1983. 38 p.

Beesley K.R., Karttunen L. Finite-State Morphology. Stanford (CA): CSLI Publications, 2003. 550 p.

Stahlberg F., Hasler E., Waite A. SGNMT: A flexible NMT decoding toolkit for quick prototyping of new models // Proceedings of ACL System Demonstrations. Vancouver, 2017. P. 67–72.

Hulden M. FST-based grammar correction for richly inflected languages // Proceedings of ACL Workshop on Finite-State Methods. Montréal, 2012. P. 32–39.

Tamchyna A., Bojar O. Target-side context for morphological reinflection // Proceedings of the First Conference on Machine Translation (WMT 2016). Berlin, 2016. P. 586–594.

Schwartz L., Liu S., Surrain S. Bootstrapping a neural morphological analyzer from an existing FST // Proceedings of the ACL Workshop on Morphological Resources 2022. Seattle, 2022. P. 12–20.

The authors declare that they have no conflicts of interest.