Dissertation
Modification of the SMOTE method using noise reduction and clustering to address imbalanced health data / Hairani
Abstract
Data imbalance occurs when one class is underrepresented in the data (the minority class) while another is overrepresented (the majority class). Imbalanced data can degrade the performance of classification methods and cause overfitting. In other words, a classifier can achieve high accuracy on the majority class and low accuracy on the minority class. Imbalanced data is typically addressed using the Synthetic Minority Oversampling Technique (SMOTE). Recently, the SMOTE-LOF method was developed, which removes noisy minority-class data by considering only outliers. However, SMOTE-LOF has several weaknesses. It treats only minority data in the outlier region as noise, whereas the more important type of noise to address is minority data adjacent to the majority class. Furthermore, it detects noise using LOF filtering alone, without any clustering mechanism. It also does not address overlap in the synthetic minority-class data it generates. Therefore, this study proposes a combined approach of filtering, clustering, and distance modification to reduce the noise and overlap produced by SMOTE. Filtering, performed with the k-NN method, removes minority-class data (noise) located in majority-class regions. This Noise Reduction (NR) step, which removes data considered to be noise before SMOTE is applied, has a positive impact on handling data imbalance. Clustering establishes decision boundaries by partitioning the data into clusters, allowing SMOTE with a modified distance metric to generate minority-class data within each cluster. Combining SMOTE with clustering and distance modification aims to minimize overlap in the synthetic minority data, which would otherwise introduce noise.
The proposed method, called NR-Clustering SMOTE, balances data in three stages: (1) filtering, which uses the k-NN method to remove minority-class data close to the majority class (noise); (2) clustering with K-means, which establishes decision boundaries by partitioning the data into several clusters; and (3) SMOTE oversampling with Manhattan distance within each cluster. Test results indicate that NR-Clustering SMOTE achieves the best performance across all evaluation metrics for classification methods such as Random Forest, SVM, and Naive Bayes, compared to both the original data and traditional SMOTE. These results imply that the proposed methods, namely NR-Modified SMOTE and NR-Clustering SMOTE, improve classification performance on two-class imbalanced health data compared to traditional SMOTE and its recent variants such as SMOTE-LOF, Radius-SMOTE, and RN-SMOTE. They therefore offer a better alternative for handling imbalanced data than these other SMOTE variants. In practice, the findings have the potential to be applied in medical decision support systems for the early detection of diabetes. Improved accuracy on the minority class is expected to enhance the reliability of predictive models and strengthen the basis for healthcare decision-making. Thus, the study contributes theoretically to the development of methods for handling imbalanced data and adds value to the application of intelligent technology in support of more accurate diagnosis and clinical decision-making.
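The three stages described above can be illustrated with a minimal NumPy sketch. It is not the dissertation's implementation, and several details are assumptions because the abstract does not specify them: the noise filter uses Euclidean distance and a majority vote among the k nearest neighbours, K-means is run on the filtered minority samples only, and the function names (`knn_noise_filter`, `kmeans`, `cluster_smote`, `nr_clustering_smote`), the values of `k` and `n_clusters`, and the toy data are all hypothetical.

```python
import numpy as np

def knn_noise_filter(X, y, minority, k=5):
    """Stage 1 (assumed rule): drop minority samples whose k nearest
    neighbours (Euclidean, an assumption) are mostly majority-class."""
    keep = np.ones(len(X), dtype=bool)
    for i in np.where(y == minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        nn = np.argsort(d)[:k]
        if np.sum(y[nn] != minority) > k / 2:
            keep[i] = False
    return X[keep], y[keep]

def kmeans(X, n_clusters, rng, n_iter=50):
    """Stage 2: plain Lloyd's K-means, returning a cluster label per row."""
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def cluster_smote(X_min, labels, n_new, rng, k=5):
    """Stage 3: SMOTE interpolation restricted to one cluster at a time,
    with neighbours ranked by Manhattan (L1) distance."""
    usable = [c for c in np.unique(labels) if np.sum(labels == c) >= 2]
    if not usable:
        raise ValueError("no cluster has at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        members = X_min[labels == rng.choice(usable)]
        i = rng.integers(len(members))
        d = np.abs(members - members[i]).sum(axis=1)   # Manhattan distance
        d[i] = np.inf
        nn = np.argsort(d)[:min(k, len(members) - 1)]
        j = rng.choice(nn)
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(members[i] + gap * (members[j] - members[i]))
    return np.array(synthetic).reshape(-1, X_min.shape[1])

def nr_clustering_smote(X, y, minority=1, k=5, n_clusters=2, seed=0):
    """Filter, cluster, then oversample until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    X, y = knn_noise_filter(X, y, minority, k)
    X_min = X[y == minority]
    n_new = int(np.sum(y != minority) - len(X_min))
    if n_new <= 0:
        return X, y
    labels = kmeans(X_min, min(n_clusters, len(X_min)), rng)
    X_syn = cluster_smote(X_min, labels, n_new, rng, k)
    return np.vstack([X, X_syn]), np.concatenate([y, np.full(len(X_syn), minority)])

# Toy two-class data (hypothetical): 40 majority points near the origin,
# two small minority groups, and one noisy minority point inside the majority.
data_rng = np.random.default_rng(42)
X_demo = np.vstack([
    data_rng.normal(0.0, 0.5, (40, 2)),
    data_rng.normal([4.0, 4.0], 0.3, (5, 2)),
    data_rng.normal([4.0, -4.0], 0.3, (5, 2)),
    [[0.0, 0.0]],
])
y_demo = np.array([0] * 40 + [1] * 11)
X_bal, y_bal = nr_clustering_smote(X_demo, y_demo)
```

In this sketch the noisy minority point at the origin is discarded before oversampling, and each synthetic sample is interpolated between two minority samples in the same cluster. Whether the clusters are formed over the minority class alone or over all data is a design choice the abstract leaves open.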