Study of the impact of different categorical feature encoding techniques on cluster structures

Kondruk, Natalia; Neroda, Inna; Кондрук, Наталія; Нерода, Інна

doi:https://doi.org/10.62660/bcstu/3.2025.93

Please use this identifier to cite or link to this item: https://er.chdtu.edu.ua/handle/ChSTU/9060

Title:	Study of the impact of different categorical feature encoding techniques on cluster structures
Other Titles:	Дослідження впливу різних технік кодування категоріальних ознак на структури кластерів
Authors:	Kondruk, Natalia; Neroda, Inna; Кондрук, Наталія; Нерода, Інна
Keywords:	data analysis; machine learning; unsupervised learning; automatic object grouping; segmentation
Issue Date:	2025
Publisher:	Вісник Черкаського державного технологічного університету
Abstract:	Categorical features are common in data analysis, but their non-metric nature makes standard clustering algorithms difficult to apply. The study is motivated by the need to assess how different methods of recoding (digitising) such features affect the effectiveness of cluster analysis. Its purpose was to investigate how different categorical data processing techniques affect the quality and structure of clusters. The methodology comprised three models with different approaches to variable encoding: one that ignores domain specifics, one that considers the content of the features, and one that alternates the order in which clustering and dimensionality reduction are applied. LabelEncoder, OrdinalEncoder, One-Hot Encoding, Mapping, and MultiLabelBinarizer were used for encoding. In each model, clustering was performed with two algorithms, K-Means and agglomerative clustering, which allowed their sensitivity to changes in data representation to be compared. The t-distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction method was used to visualise the cluster structure in two-dimensional space. Clustering quality was evaluated with the Silhouette Score, Dunn Index, Davies-Bouldin Index, and Calinski-Harabasz Index. The data were obtained from an open source and contained information about the psycho-emotional state of students. The study found that basic recoding of categorical features without regard to their semantics and context degraded clustering quality, reducing the accuracy of the partition and complicating interpretation of the results. In contrast, domain-oriented encoding approaches produced clusters with clearer boundaries and a more logical internal structure.
In addition, changing the sequence of clustering and dimensionality reduction was found to affect how well local relationships in the data are preserved. The analysis showed that different approaches change both the number and the quality of clusters, as reflected in the values of the evaluation metrics. The results are of practical use to data analysts and machine learning specialists seeking to improve the accuracy of segmentation of complex categorical data.
URI:	https://er.chdtu.edu.ua/handle/ChSTU/9060
ISSN:	2306-4412 (print) 2708-6070 (online)
DOI:	https://doi.org/10.62660/bcstu/3.2025.93
Volume:	30
Issue:	3
First Page:	93
End Page:	105
Appears in Collections:	Vol. 30, No. 3/2025
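
The workflow summarised in the abstract (encode categorical features in several ways, cluster with K-Means and agglomerative clustering, then score the partitions) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data are random, the column names (`stress`, `sleep_quality`, `mood`), the three-level scale, and the choice of three clusters are all assumptions made for illustration, and the Dunn Index is computed with a small hand-written helper because scikit-learn does not provide one. Since the data are random, the printed metric values carry no meaning; only the mechanics are shown.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: column names and categories are illustrative only.
rng = np.random.default_rng(0)
levels = ["low", "medium", "high"]
df = pd.DataFrame({
    "stress": rng.choice(levels, size=120),
    "sleep_quality": rng.choice(levels, size=120),
    "mood": rng.choice(levels, size=120),
})


def dunn_index(X, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    max_diameter = max(pdist(c).max() if len(c) > 1 else 0.0 for c in clusters)
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter if max_diameter > 0 else float("inf")


def evaluate(X, n_clusters=3):
    """Cluster X with both algorithms and collect the four quality metrics."""
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "agglomerative": AgglomerativeClustering(n_clusters=n_clusters),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        results[name] = {
            "silhouette": silhouette_score(X, labels),
            "dunn": dunn_index(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
        }
    return results


# Assumed semantic order of the categories (domain-aware mapping).
mapping = {"low": 0, "medium": 1, "high": 2}

encodings = {
    # Default OrdinalEncoder sorts categories alphabetically (high < low < medium),
    # so the codes ignore the meaning of the levels.
    "ordinal_alphabetical": OrdinalEncoder().fit_transform(df),
    # Explicit mapping keeps the low < medium < high order.
    "ordinal_mapped": df.apply(lambda col: col.map(mapping)).to_numpy(dtype=float),
    # One-hot encoding treats every category as an independent binary feature.
    "one_hot": OneHotEncoder().fit_transform(df).toarray(),
}

scores = {name: evaluate(X) for name, X in encodings.items()}

# Reversed order: reduce to two dimensions with t-SNE first, then cluster.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    encodings["ordinal_mapped"]
)
scores["tsne_then_cluster"] = evaluate(embedding)

for encoding_name, by_model in scores.items():
    for model_name, metrics in by_model.items():
        rounded = {k: round(float(v), 3) for k, v in metrics.items()}
        print(encoding_name, model_name, rounded)
```

Higher Silhouette, Dunn, and Calinski-Harabasz values and lower Davies-Bouldin values indicate more compact, better separated clusters, so comparing the rows for one algorithm across encodings shows how sensitive that algorithm is to the data representation.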

Files in This Item:

File	Size	Format
зміст.pdf	117.47 kB	Adobe PDF
титул.pdf	234.55 kB	Adobe PDF
10.pdf	1.72 MB	Adobe PDF

