Model of the data normalization quality evaluation on the basis of application of the objects classification quality criteria

Author(s): Yasinska-Damri L. M., Durniak B. V.
Collection number: № 1 (81)
Pages: 35-44

The paper presents a comparative analysis of data normalization techniques. The main criterion for evaluating the quality of a normalization method was the accuracy of the classification performed on the normalized data. Various normalization techniques available in the clusterSim package for R were applied to four datasets downloaded from the UCI Machine Learning Repository. For each technique, the quality of normalization was evaluated by classifying the normalized data and calculating the accuracy with which objects were assigned to classes. A multilayer perceptron neural network was used as the classifier at this step.
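
To make the method codes concrete, the sketch below implements a few column-wise normalizations in plain Python. The formulas attached to the codes n1, n4, n8, n9 and n11 follow our reading of the `data.Normalization` function in clusterSim and should be checked against the package documentation. The helper names `normalize_column` and `normalize_table` are hypothetical and are not part of clusterSim or of the paper.

```python
import math
import statistics


def normalize_column(values, method):
    """Normalize one numeric feature column.

    The method codes mirror clusterSim naming as we understand it (assumption):
      n1  standardization          (x - mean) / sd
      n4  unitization, zero min    (x - min) / range
      n8  quotient transformation  x / max
      n9  quotient transformation  x / mean
      n11 normalization            x / sqrt(sum of squares)
    Constant or all-zero columns would divide by zero; this sketch does not guard against that.
    """
    mean = statistics.fmean(values)
    if method == "n1":
        sd = statistics.stdev(values)  # sample standard deviation, as R's sd()
        return [(x - mean) / sd for x in values]
    if method == "n4":
        lo, hi = min(values), max(values)
        return [(x - lo) / (hi - lo) for x in values]
    if method == "n8":
        hi = max(values)
        return [x / hi for x in values]
    if method == "n9":
        return [x / mean for x in values]
    if method == "n11":
        norm = math.sqrt(sum(x * x for x in values))
        return [x / norm for x in values]
    raise ValueError(f"unsupported method code: {method}")


def normalize_table(rows, method):
    """Apply the same method to every column of a row-major numeric table."""
    columns = [list(col) for col in zip(*rows)]
    normalized = [normalize_column(col, method) for col in columns]
    return [list(row) for row in zip(*normalized)]
```

Each feature is rescaled independently, so features measured in very different units end up on comparable scales before they reach the classifier.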

The four datasets used as experimental data in the simulation were Iris Plants, Seeds, Wine and Glass. The simulation results show that the normalization stage significantly influences classification accuracy and that the best normalization method depends on the type of data. Consequently, the normalization technique should be selected separately in each case.

The analysis of the results also shows that the normalization methods yielding the maximum classification accuracy differ between datasets. For the Iris dataset, the methods n1, n6a, n8, n9 and n11 are optimal; each gives 100% classification accuracy on the test set. For the Seeds data, n11 is optimal and gives the highest classification accuracy, close to the maximum. For the complex Wine data, n3a and n5a are optimal. For the Glass dataset, n1 and n5 are optimal.
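
The selection procedure described above, scoring each normalization method by test-set classification accuracy and keeping the methods tied for the maximum, can be sketched as follows. This is a minimal illustration under stated assumptions: a nearest-centroid rule stands in for the multilayer perceptron used in the paper, the whole table is normalized before the train/test split, and the names `accuracy`, `min_max`, `nearest_centroid_predict` and `rank_normalizers` are hypothetical.

```python
import math
import statistics


def accuracy(y_true, y_pred):
    """Share of objects assigned to their correct class."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def min_max(rows):
    """Column-wise (x - min) / range, used here as one example normalizer."""
    cols = [list(c) for c in zip(*rows)]
    scaled = [[(x - min(c)) / (max(c) - min(c)) for x in c] for c in cols]
    return [list(r) for r in zip(*scaled)]


def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier (NOT the paper's MLP): pick the closest class mean."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return [min(centroids, key=lambda c: math.dist(x, centroids[c]))
            for x in test_X]


def rank_normalizers(normalizers, X, y, test_idx):
    """Return ({method: test accuracy}, methods tied for the best accuracy)."""
    held_out = set(test_idx)
    scores = {}
    for name, normalize in normalizers.items():
        Xn = normalize(X)  # assumption: normalize the full table, then split
        train_X = [row for i, row in enumerate(Xn) if i not in held_out]
        train_y = [lab for i, lab in enumerate(y) if i not in held_out]
        test_X = [Xn[i] for i in test_idx]
        test_y = [y[i] for i in test_idx]
        predicted = nearest_centroid_predict(train_X, train_y, test_X)
        scores[name] = accuracy(test_y, predicted)
    best = max(scores.values())
    return scores, sorted(n for n, s in scores.items() if s == best)
```

Calling `rank_normalizers` once per dataset with a dictionary such as `{"raw": lambda rows: rows, "n4": min_max}` returns a per-method accuracy table and the list of best methods. Returning a list rather than a single winner reflects the fact that several methods can tie, as reported for the Iris data.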

Keywords: data normalization, classification quality criteria, multilayer perceptron, data processing, classification accuracy.

doi: 10.32403/0554-4866-2021-1-81-35-44

