<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3.dtd">
<article article-type="research-article" dtd-version="1.3" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="ru"><front><journal-meta><journal-id journal-id-type="publisher-id">ellibs</journal-id><journal-title-group><journal-title xml:lang="ru">Электронные библиотеки</journal-title><trans-title-group xml:lang="en"><trans-title>Russian Digital Libraries Journal</trans-title></trans-title-group></journal-title-group><issn pub-type="epub">1562-5419</issn><publisher><publisher-name>Казанский (Приволжский) федеральный университет</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.26907/1562-5419-2022-25-2-159-178</article-id><article-id custom-type="elpub" pub-id-type="custom">ellibs-326</article-id><article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject></subj-group><subj-group subj-group-type="section-heading" xml:lang="ru"><subject>Статьи</subject></subj-group></article-categories><title-group><article-title>Разработка модуля проверки данных для удовлетворения метрики устаревания</article-title><trans-title-group xml:lang="en"><trans-title>Development of a Data Validation Module to Satisfy the Retention Policy Metric</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Сибгатуллина</surname><given-names>А. И.</given-names></name><name name-style="western" xml:lang="en"><surname>Sibgatullina</surname><given-names>A. I.</given-names></name></name-alternatives><email xlink:type="simple">aigul.sibgatulli@gmail.com</email><xref ref-type="aff" rid="aff-1"/></contrib><contrib contrib-type="author" corresp="yes"><name-alternatives><name name-style="eastern" xml:lang="ru"><surname>Якупов</surname><given-names>А. 
Ш.</given-names></name><name name-style="western" xml:lang="en"><surname>Yakupov</surname><given-names>A. S.</given-names></name></name-alternatives><email xlink:type="simple">azat.yakupov@it.kfu.ru</email><xref ref-type="aff" rid="aff-1"/></contrib></contrib-group><aff-alternatives id="aff-1"><aff xml:lang="ru"><institution>Казанский (Приволжский) Федеральный университет</institution></aff><aff xml:lang="en"><institution>Kazan (Volga region) Federal University</institution></aff></aff-alternatives><pub-date pub-type="collection"><year>2022</year></pub-date><pub-date pub-type="epub"><day>28</day><month>04</month><year>2022</year></pub-date><volume>25</volume><issue>2</issue><fpage>159</fpage><lpage>178</lpage><permissions><copyright-statement>Copyright &#x00A9; Сибгатуллина А.И., Якупов А.Ш., 2022</copyright-statement><copyright-year>2022</copyright-year><copyright-holder xml:lang="ru">Сибгатуллина А.И., Якупов А.Ш.</copyright-holder><copyright-holder xml:lang="en">Sibgatullina A.I., Yakupov A.S.</copyright-holder><license xml:lang="ru" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>Данная работа распространяется под лицензией Creative Commons Attribution 4.0.</license-p></license><license xml:lang="en" license-type="creative-commons-attribution" xlink:href="https://creativecommons.org/licenses/by/4.0/" xlink:type="simple"><license-p>This work is licensed under a Creative Commons Attribution 4.0 License.</license-p></license></permissions><self-uri xlink:href="https://ellibs.elpub.ru/jour/article/view/326">https://ellibs.elpub.ru/jour/article/view/326</self-uri><abstract><p>Из года в год возрастает объем мирового рынка больших данных. Их анализ является неотъемлемой частью для принятия немедленных и надежных решений. 
Технологии больших данных ведут к значительному снижению стоимости за счет использования облачных сервисов, распределенных файловых систем, когда возникает потребность в хранении больших объемов информации. Их аналитика неразрывно связана с понятием качества данных, что особенно важно, если они имеют определенный срок хранения – метрику устаревания – и мигрируют из одного источника в другой, увеличивая риск потери данных. Предупреждение негативных последствий достигается за счет процесса сверки данных – комплексной проверки больших объемов информации с целью подтверждения их согласованности.

В статье рассмотрены вероятностные структуры данных, которые могут быть использованы для решения задачи, а также предложена реализация – модуль проверки целостности данных с использованием фильтра Блума с подсчетом. Данный модуль интегрирован в Apache Airflow для автоматизации процесса.
</p></abstract><trans-abstract xml:lang="en"><p>The global big data market grows every year. Analysing these data is essential for making prompt and reliable decisions. When large amounts of information must be stored, big data technologies significantly reduce costs through the use of cloud services and distributed file systems. The quality of data analytics depends on the quality of the data themselves. This is especially important when the data are subject to a retention policy and migrate from one source to another, which increases the risk of data loss. The negative consequences of data migration are prevented through data reconciliation: a comprehensive verification of large amounts of information to confirm their consistency.

This article discusses probabilistic data structures that can be used to solve this problem and proposes an implementation: a data integrity verification module based on a counting Bloom filter. The module is integrated into Apache Airflow to automate its invocation.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>большие данные</kwd><kwd>метрика устаревания</kwd><kwd>партиция</kwd><kwd>фильтр Блума</kwd></kwd-group><kwd-group xml:lang="en"><kwd>Parquet file</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="cit1"><label>1</label><citation-alternatives><mixed-citation xml:lang="ru">Big Data Market worth $273.4 billion by 2026. URL: https://www.marketsandmarkets.com/Market-Reports/big-data-market-1068.html.</mixed-citation><mixed-citation xml:lang="en">Big Data Market worth $273.4 billion by 2026. URL: https://www.marketsandmarkets.com/Market-Reports/big-data-market-1068.html.</mixed-citation></citation-alternatives></ref><ref id="cit2"><label>2</label><citation-alternatives><mixed-citation xml:lang="ru">Data Retention Policy: What Is It and How to Build One. URL: https://www.techtarget.com/searchdatabackup/definition/data-retention-policy.</mixed-citation><mixed-citation xml:lang="en">Data Retention Policy: What Is It and How to Build One. URL: https://www.techtarget.com/searchdatabackup/definition/data-retention-policy.</mixed-citation></citation-alternatives></ref><ref id="cit3"><label>3</label><citation-alternatives><mixed-citation xml:lang="ru">Batra S., Garg S., Kaur R., Kumar N., Singh A., Zomaya A.Y. Probabilistic data structures for big data analytics: A comprehensive review // Knowledge-Based Systems. 2019. Vol. 188. No. 104987. P. 54–75.</mixed-citation><mixed-citation xml:lang="en">Batra S., Garg S., Kaur R., Kumar N., Singh A., Zomaya A.Y. Probabilistic data structures for big data analytics: A comprehensive review // Knowledge-Based Systems. 2019. Vol. 188. No. 104987. P. 54–75.</mixed-citation></citation-alternatives></ref><ref id="cit4"><label>4</label><citation-alternatives><mixed-citation xml:lang="ru">Choi K.W., Hossain E., Wiriaatmadja D.T. 
Discovering mobile applications in cellular device-to-device communications: Hash function and bloom filter-based approach // IEEE Transactions on Mobile Computing. 2016. Vol. 15. No. 2. P. 336–349.</mixed-citation><mixed-citation xml:lang="en">Choi K.W., Hossain E., Wiriaatmadja D.T. Discovering mobile applications in cellular device-to-device communications: Hash function and bloom filter-based approach // IEEE Transactions on Mobile Computing. 2016. Vol. 15. No. 2. P. 336–349.</mixed-citation></citation-alternatives></ref><ref id="cit5"><label>5</label><citation-alternatives><mixed-citation xml:lang="ru">Sasikala J., Thaiyalnayaki S. Indexing near-duplicate images in web search using minhash algorithm // International Conference on Processing of Materials, Minerals and Energy. 2018. Vol. 5. No. 1. P. 1943–1949.</mixed-citation><mixed-citation xml:lang="en">Sasikala J., Thaiyalnayaki S. Indexing near-duplicate images in web search using minhash algorithm // International Conference on Processing of Materials, Minerals and Energy. 2018. Vol. 5. No. 1. P. 1943–1949.</mixed-citation></citation-alternatives></ref><ref id="cit6"><label>6</label><citation-alternatives><mixed-citation xml:lang="ru">Drew J., Hahsler M., Moore T. Polymorphic Malware Detection Using Sequence Classification Methods // IEEE Security and Privacy Workshops (SPW). 2016. P. 81–87.</mixed-citation><mixed-citation xml:lang="en">Drew J., Hahsler M., Moore T. Polymorphic Malware Detection Using Sequence Classification Methods // IEEE Security and Privacy Workshops (SPW). 2016. P. 81–87.</mixed-citation></citation-alternatives></ref><ref id="cit7"><label>7</label><citation-alternatives><mixed-citation xml:lang="ru">Borgohain S.K., Nayak S., Patgiri R. rDBF: A r-Dimensional Bloom Filter for massive scale membership query // Journal of Network and Computer Applications. 2019. Vol. 136. P. 100–113.</mixed-citation><mixed-citation xml:lang="en">Borgohain S.K., Nayak S., Patgiri R. 
rDBF: A r-Dimensional Bloom Filter for massive scale membership query // Journal of Network and Computer Applications. 2019. Vol. 136. P. 100–113.</mixed-citation></citation-alternatives></ref><ref id="cit8"><label>8</label><citation-alternatives><mixed-citation xml:lang="ru">Batra S., Garg S., Kumar N., Singh A. Probabilistic data structure-based community detection and storage scheme in online social networks // Future Generation Computer Systems. 2019. Vol. 94. P. 173–184.</mixed-citation><mixed-citation xml:lang="en">Batra S., Garg S., Kumar N., Singh A. Probabilistic data structure-based community detection and storage scheme in online social networks // Future Generation Computer Systems. 2019. Vol. 94. P. 173–184.</mixed-citation></citation-alternatives></ref><ref id="cit9"><label>9</label><citation-alternatives><mixed-citation xml:lang="ru">Guo D., Luo L., Luo X., Ma R. T. B., Rottenstreich O. Optimizing Bloom Filter: Challenges, Solutions, and Comparisons // IEEE Communications Surveys &amp; Tutorials. 2019. Vol. 21. No. 2. P. 1912–1949.</mixed-citation><mixed-citation xml:lang="en">Guo D., Luo L., Luo X., Ma R. T. B., Rottenstreich O. Optimizing Bloom Filter: Challenges, Solutions, and Comparisons // IEEE Communications Surveys &amp; Tutorials. 2019. Vol. 21. No. 2. P. 1912–1949.</mixed-citation></citation-alternatives></ref><ref id="cit10"><label>10</label><citation-alternatives><mixed-citation xml:lang="ru">Boy O., Chazelle B., Kilian J., Rubinfeld R., Tal A. The Bloomier filter: An efficient data structure for static support lookup tables // SODA. 2004. P. 30–39.</mixed-citation><mixed-citation xml:lang="en">Boy O., Chazelle B., Kilian J., Rubinfeld R., Tal A. The Bloomier filter: An efficient data structure for static support lookup tables // SODA. 2004. P. 30–39.</mixed-citation></citation-alternatives></ref><ref id="cit11"><label>11</label><citation-alternatives><mixed-citation xml:lang="ru">Hazeyama H., Kadobayashi Y., Matsumoto Y. 
Adaptive Bloom filter: A space-efficient counting algorithm for unpredictable network traffic // IEICE Transactions on Information and Systems. 2008. Vol. 91. No. 5. P. 1292–1299.</mixed-citation><mixed-citation xml:lang="en">Hazeyama H., Kadobayashi Y., Matsumoto Y. Adaptive Bloom filter: A space-efficient counting algorithm for unpredictable network traffic // IEICE Transactions on Information and Systems. 2008. Vol. 91. No. 5. P. 1292–1299.</mixed-citation></citation-alternatives></ref><ref id="cit12"><label>12</label><citation-alternatives><mixed-citation xml:lang="ru">Song T., Wang X., Zhou Y. EABF: Energy efficient self-adaptive Bloom filter for network packet processing // IEEE International Conference on Communications (ICC). 2012. P. 2729–2734.</mixed-citation><mixed-citation xml:lang="en">Song T., Wang X., Zhou Y. EABF: Energy efficient self-adaptive Bloom filter for network packet processing // IEEE International Conference on Communications (ICC). 2012. P. 2729–2734.</mixed-citation></citation-alternatives></ref><ref id="cit13"><label>13</label><citation-alternatives><mixed-citation xml:lang="ru">Filippova D., Kingsford C., Pellow D. Improving Bloom filter performance on sequence data using k-mer Bloom filters // J. Comput. Biol. 2017. Vol. 26. No. 6. P. 547–557.</mixed-citation><mixed-citation xml:lang="en">Filippova D., Kingsford C., Pellow D. Improving Bloom filter performance on sequence data using k-mer Bloom filters // J. Comput. Biol. 2017. Vol. 26. No. 6. P. 547–557.</mixed-citation></citation-alternatives></ref><ref id="cit14"><label>14</label><citation-alternatives><mixed-citation xml:lang="ru">Calderoni L., Maio D., Palmieri P. Location privacy without mutual trust: The spatial Bloom filter // Computer Communications. 2015. Vol. 68. P. 4–12.</mixed-citation><mixed-citation xml:lang="en">Calderoni L., Maio D., Palmieri P. Location privacy without mutual trust: The spatial Bloom filter // Computer Communications. 2015. Vol. 68. P. 
4–12.</mixed-citation></citation-alternatives></ref><ref id="cit15"><label>15</label><citation-alternatives><mixed-citation xml:lang="ru">Du D.H.C., Lu G., Nam Y.J. BloomStore: Bloom filter based memory-efficient key-value store for indexing of data de-duplication on flash // IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST). 2012. P. 1–11.</mixed-citation><mixed-citation xml:lang="en">Du D.H.C., Lu G., Nam Y.J. BloomStore: Bloom filter based memory-efficient key-value store for indexing of data de-duplication on flash // IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST). 2012. P. 1–11.</mixed-citation></citation-alternatives></ref><ref id="cit16"><label>16</label><citation-alternatives><mixed-citation xml:lang="ru">Deng F., Rafiei D. Approximately detecting duplicates for streaming data using stable Bloom filters // ACM SIGMOD international conference on Management of data. 2006. P. 25–36.</mixed-citation><mixed-citation xml:lang="en">Deng F., Rafiei D. Approximately detecting duplicates for streaming data using stable Bloom filters // ACM SIGMOD international conference on Management of data. 2006. P. 25–36.</mixed-citation></citation-alternatives></ref><ref id="cit17"><label>17</label><citation-alternatives><mixed-citation xml:lang="ru">Ahmadi M., Geravand S. A novel adjustable matrix Bloom filter-based copy detection system for digital libraries // IEEE 11th International Conference on Computer and Information Technology. 2011. P. 518–525.</mixed-citation><mixed-citation xml:lang="en">Ahmadi M., Geravand S. A novel adjustable matrix Bloom filter-based copy detection system for digital libraries // IEEE 11th International Conference on Computer and Information Technology. 2011. P. 518–525.</mixed-citation></citation-alternatives></ref><ref id="cit18"><label>18</label><citation-alternatives><mixed-citation xml:lang="ru">Guo J., Li F., Peng Y., Qian W., Zhou A. 
Persistent Bloom Filter: Membership Testing for the Entire History // International Conference on Management of Data. 2018. P. 1037–1052.</mixed-citation><mixed-citation xml:lang="en">Guo J., Li F., Peng Y., Qian W., Zhou A. Persistent Bloom Filter: Membership Testing for the Entire History // International Conference on Management of Data. 2018. P. 1037–1052.</mixed-citation></citation-alternatives></ref><ref id="cit19"><label>19</label><citation-alternatives><mixed-citation xml:lang="ru">Nayak S., Patgiri R. A Review on Role of Bloom Filter on DNA Assembly // IEEE Access. 2019. Vol. 7. P. 66939–66954.</mixed-citation><mixed-citation xml:lang="en">Nayak S., Patgiri R. A Review on Role of Bloom Filter on DNA Assembly // IEEE Access. 2019. Vol. 7. P. 66939–66954.</mixed-citation></citation-alternatives></ref><ref id="cit20"><label>20</label><citation-alternatives><mixed-citation xml:lang="ru">Reviriego P., Rottenstreich O. The Tandem Counting Bloom Filter – It Takes Two Counters to Tango // IEEE/ACM Transactions on Networking. 2019. Vol. 27. No. 6. P. 2252–2265.</mixed-citation><mixed-citation xml:lang="en">Reviriego P., Rottenstreich O. The Tandem Counting Bloom Filter – It Takes Two Counters to Tango // IEEE/ACM Transactions on Networking. 2019. Vol. 27. No. 6. P. 2252–2265.</mixed-citation></citation-alternatives></ref><ref id="cit21"><label>21</label><citation-alternatives><mixed-citation xml:lang="ru">Announcing Amazon Redshift data lake export: share data in Apache Parquet format. URL: https://aws.amazon.com/about-aws/whats-new/2019/12/announcing-amazon-redshift-data-lake-export/#:~:text=The%20Parquet%20format%20is%20up,lake%20in%20an%20open%20format.</mixed-citation><mixed-citation xml:lang="en">Announcing Amazon Redshift data lake export: share data in Apache Parquet format. 
URL: https://aws.amazon.com/about-aws/whats-new/2019/12/announcing-amazon-redshift-data-lake-export/#:~:text=The%20Parquet%20format%20is%20up,lake%20in%20an%20open%20format.</mixed-citation></citation-alternatives></ref><ref id="cit22"><label>22</label><citation-alternatives><mixed-citation xml:lang="ru">Parquet. URL: https://databricks.com/glossary/what-is-parquet.</mixed-citation><mixed-citation xml:lang="en">Parquet. URL: https://databricks.com/glossary/what-is-parquet.</mixed-citation></citation-alternatives></ref></ref-list><fn-group><fn fn-type="conflict"><p>The authors declare that there are no conflicts of interest present.</p></fn></fn-group></back></article>
