Russian Digital Libraries Journal

A method for detecting artificial and non-scientific texts in the collection of documents

Abstract

In this paper, we propose a method for detecting machine-generated and non-scientific texts in a collection of scientific papers. The method relies on lexical and morphological analysis of the examined document, carried out with the help of language modeling. It yields an estimate of the probability that a text belongs to the class of scientific documents. Experiments demonstrate the feasibility of the approach.
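
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of scoring a text with language models, not the authors' method. It assumes two word-level unigram models with add-one smoothing (one trained on scientific texts, one on other texts) and combines their log-likelihoods with Bayes' rule. It covers lexical evidence only; the morphological analysis mentioned in the abstract is omitted. All names (`UnigramLM`, `scientific_probability`, `build_vocab`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocab(*corpora):
    """Collect the set of distinct tokens over all training corpora."""
    return {tok for corpus in corpora for doc in corpus for tok in tokenize(doc)}


class UnigramLM:
    """Word-level unigram language model with add-one (Laplace) smoothing."""

    def __init__(self, docs, vocab):
        self.counts = Counter(tok for doc in docs for tok in tokenize(doc))
        self.total = sum(self.counts.values())
        # One extra slot stands in for tokens unseen during training.
        self.vocab_size = len(vocab) + 1

    def log_prob(self, text):
        """Sum of smoothed log-probabilities of the tokens in `text`."""
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[t] + 1) / denom) for t in tokenize(text))


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scientific_probability(text, sci_lm, other_lm, prior=0.5):
    """Posterior probability that `text` belongs to the scientific class.

    `prior` is an assumed class prior; 0.5 treats both classes as equally likely.
    """
    log_sci = math.log(prior) + sci_lm.log_prob(text)
    log_other = math.log(1.0 - prior) + other_lm.log_prob(text)
    return _sigmoid(log_sci - log_other)
```

In use, one would build a shared vocabulary from both training corpora, fit one `UnigramLM` per class, and flag documents whose `scientific_probability` falls below a chosen threshold. A practical system would likely use higher-order n-grams and morphological features rather than raw unigrams.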

About the Authors

O. Bakhteev
Antiplagiat Company (115093
Russian Federation


M. Kuznetsova
Antiplagiat Company (115093
Russian Federation


A. Romanov
Antiplagiat Company (115093
Russian Federation


Yu. Chekhovich
Antiplagiat Company (115093
Russian Federation


For citations:

Bakhteev O., Kuznetsova M., Romanov A., Chekhovich Yu. A method for detecting artificial and non-scientific texts in the collection of documents. Russian Digital Libraries Journal. 2017;20(5):298-304. (In Russ.)


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1562-5419 (Online)