Hazards caused by the concentration of pollutants PM_2.5 by using Regression Methods and Spatial-temporal Similarity in Order to Impute the Missing Values in their Time Series (Case Study of Tehran)

Document Type : Applied Article


1 PhD in Remote Sensing, Department of Geomatics Engineering, Faculty of Civil and Transportation Engineering, University of Isfahan

2 Assistant Professor, Department of Geomatics Engineering, Faculty of Civil and Transportation Engineering, University of Isfahan, Isfahan


With the growing industrialization of cities, air pollution has become one of the most serious environmental hazards in the world's largest cities, including Tehran. Because pollutants have undesirable effects on the environment and human health, the analysis of air quality data plays an important role in protecting the environment, managing these hazards, and tackling air pollution problems. During the last decade, pollution monitoring stations in different cities of the country have collected a large volume of air quality data, including the concentrations of pollutants in the atmosphere. For reasons such as calibration, maintenance, device errors, and processing errors, these records contain missing values at different intervals. The missing values cause problems in data analysis and make decisions based on these data more difficult. Missing data is a common problem in time series analysis, and introducing efficient models and methods for handling it is an effective step towards reducing bias and increasing the power of air pollution models.
Materials and Methods
This paper uses PM2.5 pollutant concentration data recorded at 12 air quality monitoring stations, which are operated by the air quality control company. Data were collected on an hourly basis from Dec. 7, 2016 to Feb. 27, 2019 through the air quality control site.
The purpose of this paper is to introduce an innovative method that imputes the missing values at each pollution monitoring station by exploiting the spatial correlations between the time series of stations that behave similarly over time. In the first step, the spatio-temporal similarity between the PM2.5 concentration time series of each pair of stations is calculated using dynamic time warping. Then, for imputation at each target station, the dependence on the stations most similar to the target station is exploited. In the second step, an initial complete dataset is formed by deleting the missing values at each station.
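The pairwise similarity step can be illustrated with a minimal sketch of dynamic time warping (DTW). This is not the authors' implementation: the function names, the absolute-difference cost, and the toy station series are assumptions made for illustration only.

```python
from math import inf

def dtw_distance(a, b):
    """Classic DTW distance using the absolute difference as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def most_similar_stations(series_by_station, target, k=2):
    """Rank the other stations by DTW distance to the target (smaller = more similar)."""
    distances = {
        name: dtw_distance(series_by_station[target], values)
        for name, values in series_by_station.items()
        if name != target
    }
    return sorted(distances, key=distances.get)[:k]

# Hypothetical toy series, not data from the study
stations = {
    "A": [10, 12, 15, 14, 11, 9],
    "B": [10, 10, 12, 15, 14, 11],  # same shape as A, delayed by one step
    "C": [30, 28, 25, 27, 31, 33],
}
print(most_similar_stations(stations, "A", k=1))
```

Because DTW allows one series to be stretched against the other, station B is ranked closest to A even though its pattern is shifted in time.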
In the third step, artificial gaps are introduced into the complete data, following a pattern similar to that of the original missing data, to produce datasets with 10, 15 and 20% missing values. The fourth step involves implementing and comparing different multiple and single imputation algorithms to fill in the missing data. Finally, the performance of the imputation methods is evaluated using the introduced indicators.
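The creation of artificial gaps at a fixed missing rate might look like the following sketch. The text does not specify how the original missing pattern is reproduced, so the short random runs used here (`max_gap`), the function name, and the toy data are all assumptions.

```python
import random

def mask_values(series, rate, max_gap=3, seed=0):
    """Return a copy of `series` with a share `rate` of values set to None, in short runs."""
    rng = random.Random(seed)
    masked = list(series)
    target = round(rate * len(series))  # number of values to hide
    hidden = set()
    while len(hidden) < target:
        start = rng.randrange(len(series))
        length = rng.randint(1, max_gap)  # assumed gap length, 1..max_gap
        for idx in range(start, min(start + length, len(series))):
            if len(hidden) < target:
                hidden.add(idx)
    for idx in hidden:
        masked[idx] = None
    return masked, sorted(hidden)

# Hypothetical complete series of 100 hourly values
complete = [float(v) for v in range(100)]
for rate in (0.10, 0.15, 0.20):
    masked, hidden = mask_values(complete, rate)
    print(rate, len(hidden))
```

Keeping the indices of the hidden values makes it possible to compare the imputed values with the true ones afterwards.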
Discussion and Results
In this study, the R programming language was used to implement multiple imputation algorithms such as predictive mean matching, classification and regression tree, and random sample, as well as single imputation algorithms such as interpolation methods and last observation carried forward.
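Two of the single imputation methods named above, last observation carried forward and linear interpolation, can be sketched as follows. The study used R; this Python version is only an illustration, and the function names are hypothetical.

```python
def locf(series):
    """Last observation carried forward; gaps before the first observation stay None."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def linear_interpolate(series):
    """Fill interior gaps on a straight line between the neighbouring observations."""
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical hourly values with a two-hour gap
gappy = [20.0, None, None, 26.0, 25.0]
print(locf(gappy))
print(linear_interpolate(gappy))
```

Both methods use only the target series itself, which is why their quality depends on how long each gap is.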
The CART imputation method showed the best performance among the multiple imputation methods, with an R-squared of 0.66 and a correlation coefficient of 0.8 at 10% missing values, an R-squared of 0.6 and a correlation coefficient of 0.76 at 15% missing values, and an R-squared of 0.58 and a correlation coefficient of 0.75 at 20% missing values. As the percentage of missing values increases, the accuracy measured by the evaluation criteria decreases.
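The evaluation criteria can be computed as in the sketch below, which compares imputed values with the true values that were artificially removed. The text names R-squared and the correlation coefficient; the exact R-squared definition and the use of RMSE as the error measure are assumptions, as is the function name.

```python
from math import sqrt

def evaluate(actual, imputed):
    """Return R-squared, Pearson correlation, and RMSE for imputed vs. actual values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(imputed) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, imputed))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, imputed))
    var_p = sum((p - mean_p) ** 2 for p in imputed)
    return {
        "r2": 1 - ss_res / ss_tot,           # assumed definition: 1 - SS_res / SS_tot
        "corr": cov / sqrt(ss_tot * var_p),  # Pearson correlation coefficient
        "rmse": sqrt(ss_res / n),            # assumed error measure
    }

# Hypothetical true and imputed values at the masked positions
print(evaluate([30.0, 42.0, 35.0, 50.0], [32.0, 40.0, 36.0, 47.0]))
```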
The predictive mean matching method and the random sample method showed similar performance, and both performed worse than the tree regression method.
Based on all three evaluation criteria, the linear interpolation method performed better than the other introduced methods. Among the single imputation methods, it is therefore the most appropriate for the given data. The spline interpolation method showed the weakest performance among all multiple and single imputation methods.
For data with 10% missing values, the linear interpolation method achieves a higher coefficient of determination and correlation and a lower error in the evaluation indicators than the tree regression method. However, linear interpolation performs well only when the missing intervals are short. When the missing intervals become longer, for example at 20% missing values, interpolation methods cannot provide a good imputation for the lost data: they apply a fixed rate, or a rate with little variation, to all the missing values in each interval.
Missing data in pollutant concentration time series degrade the performance of data analysis with machine learning algorithms and introduce bias. The results show that determining the spatio-temporal similarity of stations with the dynamic time warping algorithm, and using the patterns of similar stations in combination with regression-based methods, improves model performance when the missing intervals are long, and that the tree regression model is the most suitable method for multiple imputation. Single imputation methods, though fast and simple, depend on the length of the missing intervals, and their performance depends on the variable under study.
Therefore, single imputation methods are not recommended for air pollution data with long missing intervals. Because other factors such as meteorological parameters affect air pollution, future studies can increase the accuracy of the model by adding these parameters.
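The idea of filling gaps at a target station from the pattern of a similar station can be sketched with a simple regression. Note that this example deliberately uses ordinary least squares on a single neighbour as a stand-in; it is not the CART model evaluated in the study, and all names and values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def impute_from_neighbour(target, neighbour):
    """Fill None values in `target` using a regression on the similar station."""
    # Fit only on time steps where both stations have an observation
    pairs = [(n, t) for t, n in zip(target, neighbour) if t is not None and n is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [
        t if t is not None else (intercept + slope * n if n is not None else None)
        for t, n in zip(target, neighbour)
    ]

# Hypothetical series: the target station has a gap, the similar station does not
neighbour = [10.0, 20.0, 30.0, 40.0, 50.0]
target = [12.0, 22.0, None, 42.0, 52.0]
print(impute_from_neighbour(target, neighbour))
```

Unlike interpolation, the imputed values follow the variation observed at the similar station, so long gaps are not filled with a flat or nearly flat segment.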

