Гібридний метод ідентифікації автоматично згенерованих природномовних текстів

Никоненко, Андрій Олександрович

Please use this identifier to cite or link to this item: https://er.chdtu.edu.ua/handle/ChSTU/9499

Title:	Гібридний метод ідентифікації автоматично згенерованих природномовних текстів
Other Titles:	A Hybrid Method for the Identification of Automatically Generated Natural Language Texts
Authors:	Никоненко, Андрій Олександрович
Keywords:	Hybrid method;Binary classification;Neural networks;Machine learning algorithms;Text document;Classifier ensembles;False positive rate;Information security;Decision-making methods;Anomaly detection
Issue Date:	16-May-2026
Abstract:	This dissertation addresses a pressing scientific and applied problem: the development of a hybrid method for identifying automatically generated natural-language texts that is resistant to transformational masking attacks and epistemic uncertainty. The widespread adoption of large language models (LLMs), in addition to its positive effects, has given rise to a number of challenges, including the creation and dissemination of misinformation, the use of AI in social engineering, the manipulation of the public, and the undermining of academic integrity. In scenarios of unethical LLM use, their greatest advantage, the ability to generate texts that are lexically and structurally indistinguishable from those written by humans, becomes the greatest threat, because artificially generated texts cannot be detected without specialized tools. Addressing these new challenges in information security requires reliable tools for identifying artificially generated content, that is, AI detectors.
The study showed that commercially available systems for masking generated data perform deep structural and semantic transformations of texts, which hide the AI trace and significantly reduce the ability of existing detectors to recognize generated texts. In addition, most modern solutions face the problem of epistemic uncertainty when analyzing data outside the training distribution, such as texts written by non-native speakers, which leads to an increased rate of false positives and undermines trust in detection tools in general. In view of the above, developing a mathematical model of the AI trace and a hybrid method for identifying automatically generated natural-language texts that is resistant to transformational masking attacks and epistemic uncertainty is a pressing scientific and applied problem of significant importance for the development of content moderation systems, information security, and academic integrity platforms. To define the research objectives, existing approaches to the identification of automatically generated texts were systematized and analyzed. It was found that, despite the wide range of approaches to AI detection, each has both serious theoretical limitations on its applicability and vulnerabilities to masking scenarios, which makes the identification of generated data fairly unreliable. It has been proven that AI detection based on a single feature level (lexical, syntactic, or semantic) suffers both from vulnerability to transformational attacks and from a high false positive rate.
The imperfect accuracy of existing classifiers, the theoretical and practical limitations on their application, the large number of attacks and types of obfuscation, and the lack of robust and fast tools for interpreting results point to the need for both a reliable hybrid method for identifying automatically generated texts and an interpretation model for visualizing detection evidence. The vulnerability of modern detection methods is directly linked to treating the AI trace as a static object, which makes it easy to manipulate detection accuracy by introducing perturbations through paraphrasing, character substitution, or the release of new model versions or architectures. To improve the robustness of AI detection against transformational attacks, it is proposed, instead of scaling the detector (increasing the model size and the volume of training data), to change the concept of the AI trace so as to obtain a robust multidimensional object that includes features at three levels: semantic, stylistic, and structural. Specifically, a mathematical model of the AI trace in a multidimensional feature space F = F_lex × F_syn × F_sem (lexical, syntactic, and semantic) has been developed, which formalizes the trace as a vector T_AI(X) = Φ(f_lex(X), f_syn(X), f_sem(X)). The main levels of analysis are also introduced: lexical (f_lex(X)), which captures the probabilistic smoothing of text by generative models through perplexity and burstiness; syntactic (f_syn(X)), which measures the structural monotony of generated text via the Kullback-Leibler divergence of POS-tag N-grams relative to a human reference, as well as via the variance of syntactic tree depth; and semantic (f_sem(X)), which detects abnormally high local coherence between the embeddings of neighboring sentences and global topic drift. Additionally, the conceptual model for assessing the robustness of AI trace detection systems has been refined.
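As a rough illustration of the syntactic level described above, the following Python sketch computes the Kullback-Leibler divergence between the POS-tag N-gram distribution of a text and that of a human reference, together with the variance of parse-tree depths. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the additive smoothing constant `eps`, and the use of pre-computed tag sequences and tree depths are all hypothetical choices made for illustration.

```python
import math
import statistics
from collections import Counter


def ngram_distribution(tags, n=2):
    """Relative frequencies of POS-tag n-grams in one tag sequence."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q) over the union of n-grams.

    Assumption: additive smoothing with eps keeps the divergence finite
    when an n-gram occurs in only one of the two distributions.
    """
    support = set(p) | set(q)
    p_s = {g: p.get(g, 0.0) + eps for g in support}
    q_s = {g: q.get(g, 0.0) + eps for g in support}
    zp, zq = sum(p_s.values()), sum(q_s.values())
    return sum(
        (p_s[g] / zp) * math.log((p_s[g] / zp) / (q_s[g] / zq))
        for g in support
    )


def depth_variance(tree_depths):
    """Population variance of per-sentence syntactic tree depths."""
    return statistics.pvariance(tree_depths)


# Hypothetical tag sequences; a real pipeline would obtain them from a tagger.
human_ref = ngram_distribution(["DET", "NOUN", "VERB", "ADV", "ADJ", "NOUN"])
candidate = ngram_distribution(["DET", "NOUN", "VERB", "DET", "NOUN", "VERB"])
print(kl_divergence(candidate, human_ref))
print(depth_variance([3, 3, 4, 3]))
```

In this sketch a candidate whose tag patterns repeat more than the reference yields a positive divergence, and uniform tree depths yield a low variance; how such values are thresholded or combined is not specified in the abstract.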
By introducing the formalized concept of a Breakdown point and integrating the mathematical model of the multi-level feature space, this model enables a transition from existing empirical robustness assessments on fixed datasets to a direct mathematical relationship between the intensity of structural text transformations (replacement rate RR), the preservation of semantic integrity (I_sem), and the probability of detection (P_det). The model identifies three regions that characterize the effectiveness of an attack and proposes a method for assessing detector robustness under uncertainty. For the first time, a hybrid method for identifying automatically generated natural-language texts has been developed. Unlike classical detectors, which treat the AI trace as a static and easily modifiable object, it uses a cascaded ensemble of independent classifiers with additional validation mechanisms to analyze the cross-level consistency of features through adaptive weighting and the calculation of a discrepancy signal S_mismatch. Unlike the traditional simple concatenation of features into a single vector, the proposed method distributes the analysis among three independent modules: lexical (M_lex), syntactic (M_syn), and semantic (M_sem). Each module produces its own independent score, which is passed to the meta-classifier for the final decision. Because it is extremely difficult for obfuscation methods to alter style, syntax, and semantics simultaneously and proportionally during a transformational attack, any attack creates a gap between the levels. This gap is analyzed through an adaptive weighting mechanism that uses two-level weight logic: w_j(X) = w_j^stat · α_j(X), where the learned static weight is multiplied by a dynamic confidence coefficient.
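The two-level weighting described above can be sketched in Python as follows. This is an illustrative sketch only: the abstract gives the form w_j(X) = w_j^stat · α_j(X) but does not define how α_j(X) or S_mismatch are computed, so the spread-based mismatch signal, the logistic combiner, and all function names and numeric values below are assumptions.

```python
import math


def s_mismatch(scores):
    """Hypothetical mismatch signal: spread between the most and the
    least suspicious module score (the source does not give a formula)."""
    return max(scores.values()) - min(scores.values())


def adaptive_weights(static_weights, confidence):
    """w_j(X) = w_j^stat * alpha_j(X) for each module j."""
    return {j: static_weights[j] * confidence[j] for j in static_weights}


def meta_score(scores, static_weights, confidence, bias=0.0):
    """Assumed logistic meta-classifier over the weighted module scores."""
    w = adaptive_weights(static_weights, confidence)
    z = bias + sum(w[j] * scores[j] for j in scores)
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values: a paraphrase attack lowers the lexical score only.
scores = {"lex": 0.1, "syn": 0.9, "sem": 0.8}   # per-module P(AI)
w_stat = {"lex": 2.0, "syn": 1.0, "sem": 1.0}   # learned static weights
alpha = {"lex": 0.5, "syn": 1.5, "sem": 1.5}    # trust shifted away from lex

print(s_mismatch(scores))
print(adaptive_weights(w_stat, alpha))
print(meta_score(scores, w_stat, alpha, bias=-1.0))
```

The design point this illustrates is that a large gap between module scores can be used to reduce the confidence coefficient of the evaded level and raise it for the others, so that the final decision leans on the levels the attack did not reach.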
Depending on the type of attack detected, the algorithm adjusts its trust in the different modules, so that an attempt to bypass the detector at one level automatically increases the probability of detection at the other levels. This approach captures the cognitive dissonance between the lexical, syntactic, and semantic levels of a text under transformational attacks. In addition, a method for ensuring the robustness of natural language processing systems under epistemic uncertainty and out-of-distribution data has been further developed. Classical detection methods produce overconfident point probabilities and provide no information on whether the analyzed text resembles the texts in the training set. This makes them critically vulnerable to out-of-distribution data, such as texts written by non-native speakers, academic texts, or non-standard writing styles, and leads to an increase in the number of false positives. To address this issue, an extension has been developed that converts heuristic point predictions into mathematically grounded prediction sets Γ^ε(x). By computing nonconformity measures on a calibration sample, the inductive conformal prediction (ICP) algorithm guarantees that the true class of the text belongs to the generated set at a specified confidence level 1 − ε. This turns the point predictions of the meta-classifier into prediction sets with guaranteed statistical validity, which significantly reduces the false positive rate, enables global anomaly detection, and produces empty sets instead of misclassifications.
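The conversion of point predictions into prediction sets can be illustrated with a minimal inductive conformal prediction sketch in Python. The abstract does not state which nonconformity measure is used, so the common choice 1 − p(label) is assumed here; the function names and calibration values are hypothetical.

```python
def nonconformity(prob_of_label):
    """Assumed nonconformity measure: 1 minus the probability of the label."""
    return 1.0 - prob_of_label


def calibrate(cal_probs, cal_labels):
    """Nonconformity scores of the calibration texts for their true labels.

    cal_probs[i] is the detector's P(AI) for text i; labels: 0 human, 1 AI.
    """
    scores = []
    for p_ai, y in zip(cal_probs, cal_labels):
        p_true = p_ai if y == 1 else 1.0 - p_ai
        scores.append(nonconformity(p_true))
    return scores


def p_value(cal_scores, score):
    """Share of calibration scores at least as nonconforming as `score`."""
    at_least = sum(1 for s in cal_scores if s >= score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_scores, p_ai, epsilon):
    """Labels whose p-value exceeds epsilon; the set may be empty."""
    result = set()
    for label, prob in ((0, 1.0 - p_ai), (1, p_ai)):
        if p_value(cal_scores, nonconformity(prob)) > epsilon:
            result.add(label)
    return result


cal = calibrate(
    [0.9, 0.8, 0.95, 0.1, 0.2, 0.05, 0.85, 0.15, 0.7],
    [1, 1, 1, 0, 0, 0, 1, 0, 1],
)
print(prediction_set(cal, 0.97, epsilon=0.2))  # confident case
print(prediction_set(cal, 0.50, epsilon=0.2))  # atypical case
```

With this toy calibration set, the confident input yields the single label 1, while the input that conforms to neither class yields an empty set, which is the behavior the abstract describes for texts the detector should not judge.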
Furthermore, this study presents for the first time an explainable-AI interpretation model for substantiating detection evidence, which, unlike the classical iterative LIME/SHAP methods, does not require significant computational resources and is stable on text data. It addresses the black-box problem inherent in classical detectors, whose results often lack the transparency that experts need to make decisions. By using the internal static weights of the meta-classifier w_j^stat and the adaptive confidence coefficients α_j(X), this model generates three levels of attribution: global, local, and an Evidence Map. The global attribution profile shows which features play the greatest role in the hybrid method's analysis of texts overall. The local attribution profile shows the influence of each meta-classifier parameter on the final decision for a specific text, while the interactive Evidence Map visualizes anomalies in lexical perplexity and structural threats. The developed mathematical models and the hybrid method were validated experimentally on specially constructed test data. In addition to texts written by humans and texts generated by Instant models, the test data include texts from MoE and Reasoning models, as well as texts subjected to transformational attacks by the T5 and DIPPER paraphrasers and by commercial masking systems. It has been experimentally proven that for the SOTA detector Fast-DetectGPT the Breakdown point is reached even with minimal changes (replacement rate RR = 0.03), whereas the hybrid method shows a significant increase in detection robustness and does not reach the Breakdown point even at RR > 0.9. Methods for modeling and extracting the AI trace have been experimentally investigated and adapted for modern MoE and Reasoning architectures.
Unlike approaches that focus on microstructural generation artifacts through kernel density estimation (KDE) or FFT-based spectral analysis of rhythm, the proposed methods make it possible to reliably identify the model's cognitive trace. Based on the developed mathematical model and hybrid method, a software system for identifying automatically generated texts has been created. The system is deployed on AWS cloud infrastructure, which provides high fault tolerance, scalability, and parallel request processing when analyzing large datasets in real time. The overall asymptotic complexity of the system, which consists of the hybrid method and the interpretation subsystem, is O(N^2), and the average processing time for a single document in the AWS cloud environment is 0.33 seconds. Unlike existing solutions, which show a critical drop in detection recall in Region 2 (the zone of successful attacks) under transformational attacks, the hybrid method, owing to its cascaded ensemble architecture and cross-level consistency analysis, completely eliminates the zone of successful masking of generated content. The proposed meta-classifier architecture, the use of canonicalization algorithms, and the adaptive feature weighting mechanism achieve an F1 score of 0.92 on stationary data and maintain a recall of 88.77% under a strict false positive rate (FPR) limit of 1%. Against modern commercial obfuscation systems (StealthGPT, AIHumanize, Phrasly), the system outperforms comparable systems worldwide, achieving an average recall of 77.16% and the lowest attack success rate among them (ASR = 22.84%). Applying the ICP framework significantly reduced the false positive rate (FPR) on texts written by non-native speakers (using the IELTS dataset as an example) from 4.81% to 0.98%.
The integration of the ICP framework enabled the system to recognize the epistemic uncertainty of these texts and to generate empty sets in difficult cases. The resulting Rescue Rate was 79.71%, indicating a significant reduction in the risk of unfair accusations against human authors. The practical significance of these results lies in the creation of a production-ready toolkit for content moderation systems, information security, and academic integrity verification platforms.
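As a consistency check on the figures reported for the commercial obfuscation systems, the average recall and the attack success rate sum to 100%. This suggests, as an assumption not stated in the abstract, that ASR is defined as the share of attacked generated texts that evade detection:

```latex
\mathrm{ASR} = 1 - \mathrm{Recall} = 1 - 0.7716 = 0.2284 \quad (22.84\%)
```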
URI:	https://er.chdtu.edu.ua/handle/ChSTU/9499
Number of Pages:	248
Specialization:	122 Computer Science
Appears in Collections:	122 Computer Science

Files in This Item:

File	Size	Format
Дисертація_Никоненко.pdf.p7s.zip	9.67 MB	Unknown
Дисертація_Никоненко.pdf	11.62 MB	Adobe PDF
Витяг_Никоненко.pdf	1.4 MB	Adobe PDF
Висновок_про_наукову_новизну_Никоненко.pdf	526.32 kB	Adobe PDF
