Article

Keywords:
contiguous missing values; seasonal patterns; time-series
Summary:
This work presents a new approach for imputing missing data in weather time-series that exhibit a seasonal pattern: the seasonal time-series imputation of gap missing algorithm (STIGMA). The algorithm exploits the seasonal pattern to impute unknown values by averaging the available data. We test the algorithm on data measured every $10$ minutes over the $365$ days of the year 2010; the variables are global irradiance, diffuse irradiance, ultraviolet irradiance, and temperature, arranged in a matrix with $52,560$ rows (data points over time) and $4$ columns (weather variables). The distinguishing feature of the algorithm is that it is well suited to imputation when the missing values occur in contiguous blocks and in seasonal patterns. It uses a date-time index to collect the available data from which the missing values are imputed, and repeats the process until all missing values have been calculated. The tests are performed by removing $5\%$, $10\%$, $15\%$, $20\%$, $25\%$, and $30\%$ of the available data, and the results are compared with those of autoregressive models. The proposed algorithm has been tested successfully with up to $2,736$ contiguous missing values, which correspond to $19$ consecutive days of a single month; this gap is part of the missing values when the time-series lacks $30\%$ of all data. Performance is measured with the root-mean-square error (RMSE) and the coefficient of determination ($R^{2}$). The results indicate that the proposed algorithm outperforms autoregressive models while preserving the seasonal behavior of the time-series. STIGMA is also tested on non-weather time-series, namely beer sales and the number of air passengers per month, which also have a cyclical pattern, and the results show accurate imputation of the data.
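The summary describes the idea only in outline: group the available data by a date-time index, average them to fill the gaps, and repeat until nothing is missing. The following minimal Python sketch illustrates that general idea together with the two metrics named above. It is not the authors' STIGMA implementation. The function names, the choice of seasonal keys (same time slot within the same month, then the same time slot in any month), and the toy data are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from math import sqrt
from statistics import mean


def seasonal_mean_impute(times, values, key_funcs):
    """Fill None entries with the mean of observed values sharing a seasonal key.

    key_funcs is tried in order (finest key first); each pass fills what it
    can, and the loop stops once no missing value remains.
    Hypothetical helper, not the published algorithm.
    """
    filled = list(values)
    for key in key_funcs:
        # Group the originally observed values by their seasonal key.
        groups = {}
        for t, v in zip(times, values):
            if v is not None:
                groups.setdefault(key(t), []).append(v)
        # Fill every still-missing entry whose key has observed values.
        for i, t in enumerate(times):
            if filled[i] is None and key(t) in groups:
                filled[i] = mean(groups[key(t)])
        if all(v is not None for v in filled):
            break
    return filled


def rmse(truth, estimate):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(mean((a - b) ** 2 for a, b in zip(truth, estimate)))


def r_squared(truth, estimate):
    """Coefficient of determination of the estimate against the truth."""
    centre = mean(truth)
    ss_res = sum((a - b) ** 2 for a, b in zip(truth, estimate))
    ss_tot = sum((a - centre) ** 2 for a in truth)
    return 1.0 - ss_res / ss_tot


# Toy data (assumed): 6 days sampled every 10 minutes, pure daily cycle.
start = datetime(2010, 1, 1)
times = [start + timedelta(minutes=10 * i) for i in range(6 * 144)]
truth = [float(t.hour * 6 + t.minute // 10) for t in times]

# Remove one contiguous gap of two full days (288 samples).
gap = range(2 * 144, 4 * 144)
observed = list(truth)
for i in gap:
    observed[i] = None

keys = [
    lambda t: (t.month, t.hour, t.minute),  # same time slot, same month
    lambda t: (t.hour, t.minute),           # fallback: same slot, any month
]
imputed = seasonal_mean_impute(times, observed, keys)

gap_truth = [truth[i] for i in gap]
gap_imputed = [imputed[i] for i in gap]
print(rmse(gap_truth, gap_imputed), r_squared(gap_truth, gap_imputed))
```

The toy series is noise-free and perfectly periodic, so it only checks that the mechanics work; it says nothing about accuracy on real measurements. Restricting the evaluation to the removed samples mirrors the testing scheme in the summary, where a known fraction of the data is deleted and then compared with the imputed values.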