Information Technology for Quantitative Analysis of Ukrainian-Language Textual Content Based on DocBin Structures
DOI:
https://doi.org/10.32515/2664-262X.2026.13(44).11-21
Keywords:
natural language processing, information system, data analysis, quantitative linguistics, information monitoring, corpus linguistics, lexical diversity, TTR, Honoré index, nominative index, DocBin, spaCy, automated text analysis, Ukrainian language
Abstract
The purpose of this article is to develop and implement an information technology for quantitative analysis of Ukrainian-language text corpora within a modular information processing system. The study provides algorithmic tools for the automatic computation of lexical and morphological indices from structured linguistic data. Particular attention is paid to scalability, offline operation, and compatibility with secure computing environments. The proposed solution targets applications in computer science, data analytics, and cybersecurity, where reliable and reproducible text metrics are required. The research also addresses the lack of quantitative tools adapted to morphologically rich languages such as Ukrainian.
The implemented subsystem operates on preprocessed DocBin structures and automatically extracts lemmas, tokens, and part-of-speech (POS) tags using spaCy-based pipelines. Algorithms for calculating the Type-Token Ratio (TTR), the Honoré index, the nominative index, and frequency distributions of lemmas and POS tags were developed and integrated into a unified TextMetrics class. The architecture is modular and separates the preprocessing and statistical analysis stages, which keeps the subsystem extensible and maintainable. Experimental validation was conducted on a corpus of seven Ukrainian texts with a total volume of approximately 18,000 tokens. Performance evaluation showed stable execution time and linear scaling with corpus size. Processing time per 1,000 tokens ranged from 0.11 to 6.72 seconds depending on the selected NLP agent. The subsystem produces structured statistical outputs in tabular formats suitable for further visualization, reporting, or integration into analytical platforms. The design supports deployment in offline environments, which reduces the risk of data leakage and makes the subsystem applicable in protected infrastructures.
The results confirm the correctness, robustness, and scalability of the developed subsystem for quantitative linguistic analysis. The approach enables efficient extraction of measurable textual characteristics and can be integrated into broader information systems for corpus management, anomaly detection, and information monitoring. The proposed solution contributes to the advancement of computational linguistics tools for Ukrainian and supports interdisciplinary applications in computer science and cybersecurity. Future development includes performance optimization for large-scale corpora and extension of the metric set with syntactic and complexity-based indicators.
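The indices named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' TextMetrics class: the function name `text_metrics`, its parameters, and the sample data are hypothetical. It assumes that lemmas and POS tags have already been extracted (with spaCy these would typically come from `token.lemma_` and `token.pos_` of documents restored from a DocBin). Honoré's statistic is computed in its usual form R = 100 · ln(N) / (1 − V1/V), where N is the number of tokens, V the number of distinct lemmas, and V1 the number of lemmas occurring once. The abstract does not define the nominative index, so the share of nominal tags (NOUN, PROPN) among all tokens is used here purely as an assumption.

```python
import math
from collections import Counter


def text_metrics(lemmas, pos_tags, nominal_tags=("NOUN", "PROPN")):
    """Compute simple lexical indices from parallel lists of lemmas and POS tags.

    Hypothetical helper for illustration; not the TextMetrics class from the article.
    """
    if len(lemmas) != len(pos_tags):
        raise ValueError("lemmas and pos_tags must have the same length")
    n_tokens = len(lemmas)
    if n_tokens == 0:
        raise ValueError("at least one token is required")

    lemma_freq = Counter(lemmas)   # frequency distribution of lemmas
    pos_freq = Counter(pos_tags)   # frequency distribution of POS tags

    n_types = len(lemma_freq)                                   # V
    n_hapax = sum(1 for c in lemma_freq.values() if c == 1)     # V1

    # Type-Token Ratio: distinct lemmas divided by total tokens.
    ttr = n_types / n_tokens

    # Honoré's R = 100 * ln(N) / (1 - V1 / V).
    # Undefined when every lemma occurs exactly once (division by zero).
    if n_hapax == n_types:
        honore = None
    else:
        honore = 100 * math.log(n_tokens) / (1 - n_hapax / n_types)

    # ASSUMPTION: nominative index taken as the share of nominal tags among
    # all tokens; the article may define it differently.
    nominative = sum(pos_freq[t] for t in nominal_tags) / n_tokens

    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": ttr,
        "honore": honore,
        "nominative": nominative,
        "lemma_freq": lemma_freq,
        "pos_freq": pos_freq,
    }


# Toy input: eight tokens, five distinct lemmas, three of them occurring once.
lemmas = ["кіт", "бачити", "кіт", "і", "пес", "бачити", "кіт", "спати"]
pos_tags = ["NOUN", "VERB", "NOUN", "CCONJ", "NOUN", "VERB", "NOUN", "VERB"]

metrics = text_metrics(lemmas, pos_tags)
print(metrics["ttr"], metrics["honore"], metrics["nominative"])
```

Keeping the computation on plain lists, separate from any spaCy objects, mirrors the separation of preprocessing and statistical analysis described in the abstract and lets the metrics be checked without loading a language model.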
References
1. Buk, S. N., & Rovenchak, A. A. (2004). The Rank-Frequency Analysis for the Functional Style Corpora in the Ukrainian Language. Journal of Quantitative Linguistics, 11(3), 161–171. https://doi.org/10.1080/0929617042000314912
2. Kozak, I., & Kunanets, N. (2023). Problems and challenges in creating a corpus of Ukrainian texts. Naukovyi visnyk NULP, (4), 101–108 [in Ukrainian]. https://doi.org/10.36930/40340213
3. Kozak, I., & Kunanets, N. (2024). Information systems for working with text corpora: classification and comparative analysis. Visnyk «Informatsiini systemy ta merezhi» Natsionalnoho universytetu «Lvivska politekhnika», 16, 273–289. https://doi.org/10.23939/sisn2024.16.273
4. Stetsenko, D., & Okulska, I. (2023). The Grammar and Syntax Based Corpus Analysis Tool For The Ukrainian Language. Communication Papers of the 18th Conference on Computer Science and Intelligence Systems, 303–311. https://doi.org/10.48550/arXiv.2305.13530
5. Kozak, I., & Kunanets, N. (2025). Information system for text corpora management through the lens of business requirements. Informatsiini tekhnolohii: teoriia i praktyka: materialy II (VIII) Mizhnarodnoi naukovo-praktychnoi konferentsii (pp. 55–58). Zaporizhzhia: NU «Zaporizka politekhnika» [in Ukrainian].
6. Chiarcos, C. (2021). CoNLL-Merge: Efficient harmonization of concurrent tokenization and textual variation. Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) (pp. 41–52). Association for Computational Linguistics.
7. Explosion AI. (2023). spaCy documentation: DocBin serialization. https://spacy.io/api/docbin
8. Fedchuk, R., & Vysotska, V. (2024). Information technologies for solving the problem of error correction in Ukrainian-language texts. Visnyk «Informatsiini systemy ta merezhi» Natsionalnoho universytetu «Lvivska politekhnika», 16, 11–34 [in Ukrainian]. https://doi.org/10.23939/sisn2024.16.011
9. Khoptyar, A. O., Katunina, O. S., & Kalinichenko, T. M. (2024). The use of language corpora in the study of oral and written translation: Analysis of large-scale language data. Visnyk nauky ta osvity. Seriia: Filolohiia, 9(27), 500–514 [in Ukrainian]. https://doi.org/10.52058/2786-6165-2024-9(27)-500-514
10. Anthony, L. (2013). AntConc (Version 3.4.4) [Computer software]. Waseda University. https://www.laurenceanthony.net/software/antconc
11. Kilgarriff, A., Baisa, V., Bušta, J., et al. (2014). The Sketch Engine. Lexicography, 1(1), 7–36. https://doi.org/10.1007/s40607-014-0009-9
12. Straka, M., Hajic, J., & Straková, J. (2016). UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4290–4297). European Language Resources Association.
License
Copyright (c) 2026 Ivan Kozak, Victoria Vysotska, Lyubomyr Chyrun

This work is licensed under a Creative Commons Attribution 4.0 International License.