Previous |  Up |  Next

Article

Keywords:
big data; cluster analysis; $k$-means algorithm; batch clustering; mini batch $k$-means
Summary:
Information technologies such as social media, mobile computing, and the realization of the industrial Internet of Things (IoT) produce huge amounts of data every day. The development of powerful tools for knowledge-discovery is imperative to deal with such a volume of data. Clustering methods are among the most important knowledge-discovery techniques. The growth in computational power and algorithmic developments allow us to efficiently and accurately solve clustering problems in large datasets. However, these developments are insufficient to deal with clustering problems in big datasets. This is because these datasets cannot be processed as a whole due to hardware and computational restrictions. In this paper an iterative batch $k$-means ($ibk$-means) algorithm is proposed that yields good clustering results with low computation costs on big datasets. It is designed to cluster datasets using batch data. The efficiency and accuracy of the proposed algorithm are investigated depending on the size of batches, the number of attributes and clusters. The algorithm is compared with the classic $k$-means and mini batch $k$-means algorithms using computational results on several real-world datasets, all of which are available from the UCI Machine Learning Repository. The smallest dataset has 500000 data points and 2 attributes and the largest one contains 43930257 data points and 16 attributes. Results demonstrated that the $ibk$-means algorithm outperforms both the $k$-means and mini batch $k$-means algorithms in the sense of both efficiency and accuracy and it is applicable for the clustering of big datasets. The proposed algorithm provides real time clustering and may have direct applications in expert and intelligent systems. Furthermore, results from this paper will have a clear impact in the sense of designing more accurate and efficient clustering algorithms for big datasets taking into account available computer resources.
References:
[1] Aggarwal, C. C., Reddy, C. K.: Data Clustering: Algorithms and Applications (First edition). CRC Press, Taylor and Francis Group, Boca Raton, London, New York 2014. DOI 
[2] Ahmed, E., Yaqoob, I., T.Hashem, I. A., al., et: The role of big data analytics in Internet of Things. Computer Networks 129 (2017), 2, 459-471. DOI 
[3] Alguliyev, R., Aliguliyev, R., Sukhostat, L.: Anomaly detection in big data based on clustering. Statistics, Optim. Inform. Computing 5 (2017), 4, 325-340. DOI 
[4] Alguliyev, R., Aliguliyev, R., Bagirov, A., Karimov, R.: Batch clustering algorithm for big data sets. In: Proc. 2016 IEEE 10th International Conference on Application of Information and Communication Technologies, IEEE Press 2016, pp. 79-82. DOI 
[5] Alguliyev, R., Aliguliyev, R., Imamverdiyev, Y., Sukhostat, L.: An anomaly detection based on optimization. Int. J. Intell. Systems Appl. 9 (2017), 12, 87-96. DOI 
[6] Alguliyev, R., Aliguliyev, R., Imamverdiyev, Y., Sukhostat, L.: Weighted clustering for anomaly detection in big data. Statist. Optim. Inform. Comput. 6 (2018), 2, 178-188. DOI 
[7] Alguliyev, R. M., Aliguliyev, R. M., Alakbarov, R. G.: Constrained $k$-means algorithm for resource allocation in mobile cloudlets. Kybernetika 59 (2023), 88-109. DOI 
[8] Alguliyev, R., Aliguliyev, R., Sukhostat, L.: Improved parallel big data clustering based on $k$-medoids and $k$-means algorithm. Probl. Inform. Technol. 15 (2024), 18-25. DOI 
[9] Alguliyev, R., Aliguliyev, R., Sukhostat, L.: Parallel batch $k$-means for big data clustering. Comput. Industr. Engrg. 152 (2021), 107023, 1-11. DOI 
[10] Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of Euclidean sum-of-squares clustering. Machine Learn. 75 (2009), 2, 245-248. DOI 
[11] Amrahov, S. E., Ar, Y., Tugrul, B., Akay, B. E., Kartli, N.: A new approach to Mergesort algorithm: Divide smart and conquer. Future Gener. Comput. Syst. 157 (2024), 330-343. DOI 
[12] Bagirov, A. M., Taheri, S., Ordin, B.: An adaptive $k$-medians clustering algorithm. Probl. Inform. Technol. 13 (2022), 3-15. DOI 
[13] Bagirov, A. M., Ugon, J., Webb, D.: Fast modified global $k$-means algorithm for incremental cluster construction. Pattern Recogn. 44 (2011), 866-876. DOI 
[14] Bahmani, B., Moseley, B., Vattani, A., al., et: Scalable $k$-means++. Proc. VLDB Endowment 5 (2012), 7, 622-633. DOI 
[15] Béjar, J.: $k$-means vs mini batch $k$-means: A comparison. Technical Report, Universitat Politécnica de Catalunya, 2013.
[16] Bose, A., Munir, A., Shabani, N.: A comparative quantitative analysis of contemporary big data clustering algorithms for market segmentation in hospitality industry. 2017.
[17] Bottou, L., Bengio, Y.: Convergence properties of the $k$-means algorithm. In: Proc. 7th International Conference on Neural Information Processing Systems, MIT Press, Cambridge 1995, pp. 585-592.
[18] Bradley, P. S., Fayyad, U., Reina, C.: Scaling clustering algorithms to large databases. In: Proc. Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, New York 1998, pp. 9-15. DOI 
[19] Cai, X., Nie, F., Huang, H.: Multi-view $k$-means clustering on big data. In: Proc. Twenty-Third International Joint Conference on Artificial Intelligence, ACM Press, New York 2013, pp. 2598-2604. DOI 
[20] Catherine, A., Alejandro, C., Ricardo, F., Badih, G.: Multivariate and functional robust fusion methods for structured big data. J. Multivar. Anal. 170 (2019), 149-161. DOI 
[21] Cetin, P., Tanriöver, Ö. Ö.: Priority rule for resource constrained project planning problem with predetermined work package durations. J. Fac. Engrg. Architect. Gazi University 35 (2020), 3, 149-161. DOI 
[22] Chen, C. L. P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inform. Sci. 275 (2014), 314-347. DOI 
[23] Cui, X., Zhu, P., Yang, X., al., et: Optimized big data k-means clustering using MapReduce. J. Supercomput. 70 (2014), 3, 1249-1259. DOI 
[24] Dua, D., Taniskidou, E. Karra: UCI Machine Learning Repository. Irvine, Univ. California, School of Information and Computer Science, 2017.
[25] Fahad, A., Alshatri, N., Tari, Z., al., et: A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans. Emerging Topics Computing 2 (2014), 3, 267-279. DOI 
[26] Ghadiri, N., Ghaffari, M., Nikbakht, M. A.: BigFCM: Fast, precise and scalable FCM on Hadoop. Future Gener. Computer Syst. 77 (2018), 29-39. DOI 
[27] Hadian, A., Shahrivari, S.: High performance parallel $k$-means clustering for disk-resident datasets on multi-core CPUs. J. Supercomput. 69 (2014), 2, 845-863. DOI 
[28] Haibo, L., Zhi, W.: Application of an intelligent early-warning method based on DBSCAN clustering for drilling overflow accident. Cluster Comput. (2019). DOI 
[29] Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques (Third edition). Morgan Kaufmann, 2011. DOI 
[30] Hassan, I.: $I-k$-means-+: an iterative clustering algorithm based on an enhanced version of the $k$-means. Pattern Recogn. 79 (2018), 402-413. DOI 
[31] Hathaway, R. J., Bezdek, J.C ., Huband, J. M.: Scalable visual assessment of cluster tendency for large data sets. Patt. Recog. 39 (2006), 7, 1315-1324. DOI 
[32] Havens, T. C., Bezdek, J. C.: An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans. Knowl. Data Engrg. 24 (2012), 5, 813-822. DOI 
[33] Havens, T. C., Bezdek, J. C., Palaniswami, M.: Scalable single linkage hierarchical clustering for big data. In: Proc. 2013 IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, IEEE Press 2013. pp. 396-401. DOI 
[34] Hilbert, M., López, P.: The world's technological capacity to store, communicate, and compute information. Science 332 (2011), 6025, 60-65. DOI 
[35] Ilango, S. S., Vimal, S., Kaliappan, M., Subbulakshmi, P.: Optimization using Artificial Bee Colony based clustering approach for big data. Cluster Comput. (2019). DOI 
[36] Imamverdiyev, Y., Abdullayeva, F.: Deep learning method for DoS attack detection based on restricted Boltzmann machine. Big Data 6 (2018), 2, 159-169. DOI 
[37] Karmitsa, N., Bagirov, A. M., Taheri, S.: New diagonal bundle method for clustering problems in large data sets. European J. Oper. Res. 263 (2017), 2, 367-379. DOI 
[38] Karmitsa, N., Bagirov, A. M., Taheri, S.: Clustering in large data sets with the limited memory bundle method. Pattern Recogn. 83 (2018), 245-249. DOI 
[39] Kumar, D., Bezdek, J. C., Palaniswami, M., al., et: A hybrid approach to clustering in big data. IEEE Trans. Cybernet. 46 (2016), 10, 2372-2385. DOI 
[40] Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28 (1982), 2, 129-137. DOI 
[41] MacQueen, J. B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Berkeley, University of California Press, 1967, pp. 281-297. Zbl 0214.46201
[42] Marjani, M., Nasaruddin, F., Gani, A., al., et: Big IoT data analytics: architecture, opportunities, and open research challenges. IEEE Access 5 (2017), 5247-5261. DOI 
[43] Newling, J., Fleuret, F.: Nested mini-batch $k$-means. In: Proc. 30th International Conference on Neural Information Processing Systems, Curran Associates Inc. 2016, pp. 1360-1368. DOI 
[44] Peng, K., Leung, V. C. M., Huang, Q.: Clustering approach based on mini batch $k$-means for intrusion detection system over big data. IEEE Access 6 (2018), 11897-11906. DOI 
[45] Sabo, M.: Consensus clustering with differential evolution. Kybernetika 50 (2014), 661-678. DOI 
[46] Saini, A., Minocha, J., Ubriani, J., Sharma, D.: New approach for clustering of big data: disk-means. In: Proc. International Conference on Computing, Communication and Automation, IEEE Press, 2016, pp. 122-126. DOI 
[47] Sculley, D.: Web-scale $k$-means clustering. In: Proc. 19th International Conference on World Wide Web, ACM Press, New York 2010, pp. 1177-1178. DOI 
[48] Shirkhorshidi, A. S., Aghabozorgi, S., Wah, T. Y., Herawan, T.: Big data clustering: a review. In: Proc. International Conference on Computational Science and its Applications, LNCS 8583, Part V, Springer 2014, pp. 707-720. DOI 
[49] Sun, Z., Wang, P. P.: A mathematical foundation of big data. New Math. Natur. Comput. 13 (2017), 2, 83-99. DOI 
[50] Tong, Q., Li, X., Yuan, B.: Efficient distributed clustering using boundary information. Neurocomput. 275 (2018), 2355-2366. DOI 
[51] Torra, V., Endo, Y., Miyamoto, S.: On the comparison of some fuzzy clustering methods for privacy preserving data mining: towards the development of specific information loss measures. Kybernetika 45 (2009), 548-560.
[52] Tsai, C.-W., Liu, S.-J., Wang, Y.-C.: A parallel metaheuristic data clustering framework for cloud. J. Parallel Distribut. Comput. 116 (2018), 39-49. DOI 
[53] Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Networks 16 (2005, 3, 645-678. DOI 
[54] Zhang, Q., Yang, L. T., Chen, Z., Li, P.: High-order possibilistic $c$-means algorithms based on tensor decompositions for big data in IoT. Inform. Fusion 39 (2018), 72-80. DOI 
[55] Zhao, W.-L., Deng, C.-H., Ngo, C.-W.: $k$-means: a revisit. Neurocomput. 291 (2018), 195-206. DOI 
Partner of
EuDML logo