Improving Air Quality Prediction via Self-Supervision Masked Air Modeling

Abstract: Presently, the harm to human health caused by air pollution has drawn great public attention, in particular vehicle emissions, including nitrogen oxides and particulate matter. How to predict air quality, e.g., pollutant concentration, efficiently and accurately is a core problem in environmental research. Developing a robust air quality predictive model has become an increasingly important task, with practical significance for the formulation of effective control policies. Recently, deep learning has progressed significantly in air quality prediction. In this paper, we go one step further and present a neat scheme of masked autoencoders, termed masked air modeling (MAM), for self-supervised learning on sequence data, which addresses the challenges posed by missing data. Specifically, the front end of our pipeline integrates a WRF-CAMx numerical model, which can simulate the emission, diffusion, transformation, and removal of pollutants based on atmospheric physics and chemical reactions. Then, the predicted results of WRF-CAMx are concatenated into a time series and fed into an asymmetric Transformer-based encoder-decoder architecture for pre-training via random masking. Finally, we fine-tune an additional regression network, based on the pre-trained encoder, to predict ozone (O3) concentration. Coupling these two designs enables us to consider the atmospheric physics and chemical reactions of pollutants while inheriting the long-range dependency modeling capabilities of the Transformer. The experimental results indicate that our approach effectively enhances the WRF-CAMx model's predictive capabilities and outperforms purely supervised network solutions. Overall, using advanced self-supervision approaches, our work provides a novel perspective for further improving air quality forecasting, which allows us to increase the smartness and resilience of air prediction systems. This matters because accurate prediction of air pollutant concentrations is essential for detecting pollution events and implementing effective response strategies, thereby promoting environmentally sustainable development.


Introduction
Air pollution is one of the main environmental issues that has a severe effect on public health [1][2][3]. Urbanization, industrialization and fossil fuel consumption are the main causes of severe air pollution issues. In particular, transportation is a significant contributor to fossil fuel consumption and is associated with devastating health impacts, such as respiratory and cardiovascular diseases, and even death [4][5][6]. During the past few decades, air quality forecasting has become a research hotspot in controlling air pollution. Air pollutant concentration information is crucial for preventing human health issues and strengthening environmental management. Therefore, researchers employ various strategies to predict air pollutant concentrations. These methods can be grouped into two categories [7,8]: (i) deterministic methods based on hypothesis theory and prior knowledge and (ii) statistical methods based on capturing characteristics from data (see Figure 1, left-hand side).
Atmosphere 2024, 1, 0

The prediction of air pollutant concentrations (APCs) is influenced by various complicated factors. The generation of air pollutants involves intricate chemical reactions in the atmosphere. Furthermore, meteorological factors (e.g., wind speed, temperature, relative humidity, wind direction) influence not only the diffusion of air pollutants, but also photochemical reactions and subsequent concentration changes. Temperature affects atmospheric and ventilation conditions; relative humidity and precipitation alter the deposition characteristics of particulate matter; and wind speed facilitates the diffusion and spread of pollutants [9]. Overall, meteorological forecast deviation, complex chemical processes, uncertainties in pollutant emission inventories, and imperfect parameterization of physical processes in the model lead to errors between the predicted results and measured values [10,11]. Developing a robust model for predicting APCs remains challenging due to inaccurate or missing observations.

To address the above issues, a promising direction lies in data-driven air quality forecasting with Artificial Intelligence (AI) models, in particular deep learning models such as the Transformer. The Transformer [12] is a deep learning model primarily applied in natural language processing tasks. It relies on self-attention mechanisms to process sequential data, enabling it to capture dependencies regardless of their distance in the input sequence. Such data-driven simulation optimization can automatically identify patterns and regularities in data. However, this requires a large amount of labeled data. Recently, self-supervised learning via masked autoencoding has proven to be a promising scheme for learning generalized pre-trained representations [13,14]. For example, BERT [13] uses masked language modeling, achieving state-of-the-art results in tasks like text classification and question answering. Nevertheless, self-supervised pre-training has not been fully explored for APC prediction. In fact, due to limited or missing observations, masked autoencoding that removes a portion of the air quality data and learns to predict the removed content is natural and applicable in air quality prediction. We propose a composite model that integrates the WRF-CAMx model and a neat scheme of masked autoencoders to accurately predict concentrations of the air pollutant O3 (see Figure 1, right-hand side), which is one of the highest risk factors for global premature mortality [15][16][17]. The main contributions of this research are as follows:

1. We propose a hybrid air quality prediction pipeline that not only simulates atmospheric physics and chemical reactions of pollutants, but also inherits the long-range dependency modeling capabilities of the Transformer.

2. We design an asymmetric Transformer-based encoder-decoder architecture as a promising scheme of masked air modeling, which yields a nontrivial and meaningful self-supervisory sequence representation learning task.

3. In terms of hour-by-hour simulation performance, the proposed MAM can effectively boost the predictive capabilities of WRF-CAMx and purely supervised learning models, providing a performance improvement of more than 26 percent (correlation coefficient).

Related Work
According to the features of existing research, air quality forecasting strategies can be grouped into two major categories: deterministic methods and statistical methods.

The structure of a deterministic model is predefined according to certain theoretical assumptions and prior knowledge. Thus, deterministic methods utilize a set of equations describing the atmospheric physical and chemical processes to simulate diffusion with meteorological and other data inputs [7]. Various representative air quality models have been proposed to simulate the complex changes in atmospheric pollutants. The Community Multiscale Air Quality (CMAQ) model [18][19][20], the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem) [21,22], the Chemical Lagrangian Model of the Stratosphere (CLaMS) [23], and the Comprehensive Air Quality Model with Extensions (CAMx) [24,25] are typically employed in air pollutant concentration forecasting and are widely used in scenario and policy analyses. Although the theoretical understanding of pollutant diffusion mechanisms continues to be enriched and refined, deterministic models typically depend on sophisticated a priori knowledge, such as determining a model structure using theoretical assumptions and estimating parameters empirically, which limits their predictive performance [26][27][28]. Furthermore, the accuracy of such methods depends on the abundance of information and data about emission sources. In general, the errors of deterministic models fall into two major types: (i) inherent biases from parameterizing physical processes and discretizing differential equations, which reduce simulation accuracy, and (ii) internal variability driven by the sensitivity to initial conditions, such as meteorological fields and emissions.

Unlike deterministic models, statistical methods avoid complex theoretical models and have gradually emerged in air pollution prediction [29]. Statistical methods aim to capture patterns and regularities between input data and predictive variables, without relying on explicit knowledge of the underlying physical and chemical processes in the atmosphere [7,30]. Statistical methods are typically divided into classical statistical methods and machine learning methods. Classical statistical methods establish a certain statistical relationship (e.g., AutoRegressive Integrated Moving Average [31] or Geographically Weighted Regression [32]) by analyzing the forecast and monitoring data within the same time period. Traditional machine learning methods include Support Vector Machine (SVM) [33,34], Bayesian-based multilabel classifiers [35], Random Forest [36], hidden Markov models [37], Boosted Regression Trees [38], and XGBoost [39]. In summary, statistical forecasting methods analyze the statistical regularity of pollutants and then predict the pollution trend. However, statistical models tend to degrade severely when simulating extreme episodes, because the training data are limited in their representation of complex meteorological phenomena and nonlinear patterns [40].

As an emerging research branch of statistical methods, deep learning is able to effectively capture potential nonlinear relationships from data, and its ability to forecast nonlinear relationships is superior to that of traditional statistical methods. Typical deep learning networks for forecasting air pollution concentrations include the Multilayer Perceptron (MLP) [41], Recurrent Neural Network (RNN) [42], Generative Adversarial Network (GAN) [43], Long Short-Term Memory (LSTM) neural network [44], CNN-LSTM model [45], LSTM variants [46], etc. Deep learning methods show satisfactory performance in extracting latent patterns and inherent features from data [47]. Since the emission, diffusion, conversion, and removal of air pollutants are dynamic processes that evolve over time, air pollutant prediction can be cast as a time series forecasting task that captures the spatiotemporal features of pollutants.

Method
The proposed algorithm consists of two parts: (1) the Weather Research and Forecasting-Comprehensive Air Quality Model with Extensions (WRF-CAMx) model, and (2) a neat scheme of masked autoencoders that reduces uncertainty and improves simulation accuracy. The implementation details are shown in Figure 2.

WRF-CAMx Modeling
The Weather Research and Forecasting (WRF) model provides hourly weather simulation data for the subsequent steps. The Comprehensive Air Quality Model with Extensions (CAMx) is applied to simulate pollutant concentrations, taking the processed WRF output together with the emission inventory as its input. The time resolution of the model forecast results is 1 h.

Simulation Domain
The Yangtze River Delta (YRD) region, one of China's most industrialized regions, is located on the eastern coast of China. The YRD region is composed of 41 cities in the Shanghai municipality and the Zhejiang, Jiangsu, and Anhui provinces. The air quality issue in the YRD region has consistently attracted considerable attention. For these reasons, the YRD region is selected as the research area. The meteorological fields of three successive nested domains with horizontal resolutions of 27 km (d01), 9 km (d02), and 3 km (d03) were simulated by WRF model version 3.9 [25]. The outer domain covers the Chinese mainland, the middle domain covers the eastern part of China, and the inner domain covers the YRD region. CAMx employs a two-layer nested grid with resolution and grid center points identical to the second and third layers of WRF. Each layer of the CAMx grid has slightly smaller coverage than the WRF grid to reduce the influence of boundary fields on simulation results [48][49][50].

Model Building
The Global Final Analysis data provided by the National Centers for Environmental Prediction (NCEP) provide the initial and boundary conditions for the WRF model, with a spatial resolution of 1° × 1° and a time interval of 6 h. Meteorological data output from the WRF model and the emission inventory were input into the CAMx version 6.5 model to simulate air pollutant concentrations. The emission inventory of the YRD region provided by the Shanghai Academy of Environmental Sciences was adopted within the inner domain, with a resolution of 4 km. The Multi-resolution Emission Inventory for China (MEIC) developed by Tsinghua University was adopted within the other two domains, with a spatial resolution of 0.25° × 0.25° (http://meicmodel.org.cn) [51,52]. According to the principle of conservation of total emissions, bilinear interpolation was used to interpolate the involved emission inventories to a resolution that matches each nested layer of the CAMx model. The essential parameterization schemes of the WRF-CAMx model are listed in Table 1 [48].

Given the WRF-CAMx simulation results $\{D_0, D_1, \cdots, D_{h-1}\}$ of meteorology and air quality for the past $h$ time periods, we aim to predict the real air quality concentration for the next time period, $O_h$. In other words, our goal is to find a mapping for predicting $O_h$, which can be written as

$$O_h^{*} = \mathcal{F}(D_0, D_1, \cdots, D_{h-1}; \theta),$$

where $O_h^{*}$ denotes the predicted value for the next time period of the input sequence, and $\theta$ indicates learnable parameters. To infer $\theta$, a popular practice is to directly optimize the error between $O_h$ and $O_h^{*}$. However, limited data annotation may result in poor generalization of the model. Therefore, in this work, we focus on leveraging a self-supervised model to learn a good sequence representation and then fine-tune it on the downstream task, i.e., the prediction of the concentration of the air pollutant O3.
Note that O3 concentration is confirmed to exhibit a causal relationship with air pollution data, e.g., SO2, NO2, and PM2.5, and with meteorological data. Specifically, wind direction determines the direction of dispersion; higher wind speeds accelerate dispersion; and relative humidity and temperature typically affect the rate of atmospheric chemical reactions. Therefore, four meteorological parameters (temperature, relative humidity, wind direction, and wind speed) and four air pollutant concentrations simulated by CAMx (SO2, NO2, PM2.5, and O3) are selected as the model input in this research, and we set the time span of the sequence to 12 h. We detail our masked air modeling in the rest of the section.
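To make the input layout concrete, the following is a minimal sketch of how 12 h windows of the eight simulated variables could be paired with the observed O3 value of the following hour. The feature order, the function name `make_windows`, and the toy numbers are assumptions for illustration only; the paper does not specify them.

```python
from typing import List, Sequence, Tuple

# Assumed feature order; the text lists these eight inputs but not their order.
FEATURES = ("temperature", "relative_humidity", "wind_direction", "wind_speed",
            "SO2", "NO2", "PM2.5", "O3")
WINDOW = 12  # sequence length in hours, as stated in the text


def make_windows(simulated: Sequence[Sequence[float]],
                 observed_o3: Sequence[float],
                 window: int = WINDOW) -> List[Tuple[List[List[float]], float]]:
    """Pair each run of `window` simulated hours with the observed O3 of the next hour."""
    if len(simulated) != len(observed_o3):
        raise ValueError("simulated and observed series must have equal length")
    samples = []
    for start in range(len(simulated) - window):
        x = [list(row) for row in simulated[start:start + window]]  # D_0 .. D_{h-1}
        y = observed_o3[start + window]                              # O_h
        samples.append((x, y))
    return samples


# Toy usage with synthetic numbers (not real data).
sim = [[float(t + k) for k in range(len(FEATURES))] for t in range(15)]
obs = [float(10 * t) for t in range(15)]
pairs = make_windows(sim, obs)
```

Each sample is a 12 × 8 input paired with one scalar target, which matches the mapping from $\{D_0, \cdots, D_{h-1}\}$ to $O_h$ with $h = 12$.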

Masked Autoencoders for Context Understanding
Masked language and image modeling, which aims to hold out a portion of the input and train networks to predict the masked content, has made great progress in the natural language processing (NLP) and computer vision (CV) communities. The preponderance of evidence continues to indicate that this kind of self-supervised learning can produce generalized pre-trained representations for various downstream tasks.

Significant interest in this pre-training paradigm arose following the success of milestones such as BERT [13] and MAE [14]. However, self-supervised pre-training has not been fully explored in air quality forecasting (AQF). In fact, due to inaccurate or missing observations, a scheme that removes a portion of the air quality data and learns to predict the removed content is natural and applicable in air quality prediction. In this work, we attempt to explore the potential of this pre-training strategy in AQF, and refer to it as masked air modeling (MAM). This practice not only directly addresses the problem of missing data, but also promises to provide an excellent representation for prediction tasks through fine-tuning.
Formally, the proposed MAM is a learning paradigm that is neutral to the choice of framework. In this work, following MAE, we leverage a simple Transformer-based autoencoder as an instance to reconstruct the missing signal given its partial observation. To this end, we randomly select time-continuous samples $[x_1, x_2, \cdots, x_n]$ (where $x_i = [D_i] \in \mathbb{R}^{8}$) from the dataset to serve as our sequence input, and mask (i.e., remove) a subset of the sequence without replacement based on a uniform distribution. Our training strategy is straightforward for two reasons. First, the input to the MAM encoder consists only of the visible, unmasked vectors. The MAM encoder is a ViT [53] with alternating layers of multi-headed self-attention (MSA) and MLP blocks:

$$p_0 = [x_g;\ x_{v_1}E;\ x_{v_2}E;\ \cdots;\ x_{v_m}E] + E_{pos},$$

$$p'_{\ell} = \mathrm{MSA}(F_N(p_{\ell-1})) + p_{\ell-1}, \qquad \ell = 1, \cdots, L,$$

$$p_{\ell} = \mathrm{MLP}(F_N(p'_{\ell})) + p'_{\ell}, \qquad \ell = 1, \cdots, L,$$

where $x_g$ is the learnable global token; $v_1, \cdots, v_m$ index the $m$ visible vectors; $F_N(\cdot)$ is the normalization layer, which is applied before network blocks ($L$ is the number of blocks); and $E \in \mathbb{R}^{K \times D}$ and $E_{pos}$ denote trainable linear projection parameters and position embeddings, respectively. Second, the decoder input is the full set of tokens, including (i) encoded visible features and (ii) mask tokens, i.e.,

$$Q = [\,p_L \,\|\, X\,],$$

where $p_L = [p_L^{g}; p_L^{1}; \cdots; p_L^{m}]$ is the encoder output, $X = [X_1; X_2; \cdots; X_{n-m}]$ denotes a learnable vector sequence indicating mask tokens, and $[\cdot \| \cdot]$ is the concatenation operation. Finally, $Q$ is fed into another series of Transformer blocks to predict the missing data. The decoder is only used during pre-training to address the missing data problem. Therefore, the architecture of the decoder can be flexibly designed. It is important to note that, unlike the original ViT model, we attach the extra learnable embedding $p_L^{g}$ to the sequence representations, thus enhancing the interaction of local and global features. In the original ViT, $p_L^{g}$ often acts as a class embedding for the final classification task.
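The masking and token assembly described above can be sketched in a few lines. This is an illustrative sketch only: the mask ratio of 0.5, the identity stand-in for the encoder, the single shared mask token (as in MAE), and the omission of the global token are all assumptions, not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def split_visible_masked(n: int, mask_ratio: float,
                         rng: random.Random) -> Tuple[List[int], List[int]]:
    """Pick masked positions uniformly at random, without replacement."""
    num_masked = int(n * mask_ratio)
    masked = sorted(rng.sample(range(n), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(n) if i not in masked_set]
    return visible, masked


def build_decoder_input(encoded_visible: Sequence[Vector], num_masked: int,
                        mask_token: Vector) -> List[Vector]:
    """Concatenate encoded visible tokens with copies of one shared mask token."""
    return ([list(v) for v in encoded_visible]
            + [list(mask_token) for _ in range(num_masked)])


rng = random.Random(0)
sequence = [[float(t)] * 8 for t in range(12)]   # 12 hourly vectors, 8 features each
visible, masked = split_visible_masked(len(sequence), mask_ratio=0.5, rng=rng)
encoder_input = [sequence[i] for i in visible]    # the encoder sees visible steps only
encoded = encoder_input                           # identity stand-in for the real encoder
decoder_input = build_decoder_input(encoded, len(masked), mask_token=[0.0] * 8)
```

Because the encoder processes only the visible subset, its cost shrinks with the mask ratio, while the decoder still receives a sequence of the original length.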

Learning Prediction Representation
In order to fulfill air quality prediction, we remove the pre-trained MAM decoder and introduce a predictor, which is applied to the sequence features extracted from the pre-trained MAM encoder. The predictor also consists of alternating layers of MSA and MLP blocks, but here, the extra learnable embedding serves as a "regression token" Z, i.e., the prediction representation, which is fed into a regression head implemented by an MLP with one hidden layer. During this training phase, the parameters of the encoder are frozen and only the predictor is trainable, which allows the predictor to directly inherit the encoder's powerful context modeling capabilities acquired during pre-training. In addition, the pre-trained encoder-decoder provides a data augmentation method: random masking is performed on the input sequences, and because the masks differ in each iteration, they generate new training samples.

Loss Function
Our approach consists of two targets, namely reconstruction and prediction; both are regression tasks. Therefore, in this work, we use a simple element-wise mean-squared error (MSE) loss to optimize our model, and we find that this works well in our experiments:

$$\mathcal{L}_{rec} = \mathrm{MSE}\big(F_D(F_E(x)),\, x\big), \qquad \mathcal{L}_{pred} = \mathrm{MSE}\big(F_P(F_E(x)),\, y\big),$$

where $x = [x_1, x_2, \cdots, x_n]$ denotes the input sequence; $y$ indicates the ground truth label; and $F_E$, $F_D$, and $F_P$ are the encoder, decoder, and predictor, respectively. More complex loss functions are worth exploring, but we leave that to future work.
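As a small worked illustration of the element-wise MSE used for both targets, the sketch below computes the loss for a toy reconstruction and a toy prediction. The helper names and the numbers are hypothetical, and whether the reconstruction loss is averaged over all positions or only the masked ones is not stated in the text; this sketch averages over all elements.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Element-wise mean-squared error between two equally long sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def flatten(rows: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a (time, feature) sequence so every element contributes equally."""
    return [value for row in rows for value in row]


# Reconstruction target: decoder output compared with the original sequence.
original = [[1.0, 2.0], [3.0, 4.0]]
reconstruction = [[1.0, 2.0], [3.0, 5.0]]
rec_loss = mse(flatten(reconstruction), flatten(original))

# Prediction target: predictor output compared with the observed label.
pred_loss = mse([50.0], [48.0])
```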

Experiment

Ground-Level Air Pollutant Measurements
The Yangtze River Delta region includes a total of 41 cities, as shown in Figure 3. Hourly air pollutant concentration observation data are obtained from the National Urban Air Quality Realtime Release Platform (http://www.cnemc.cn/, accessed on 1 May 2024). The simulated data of the WRF-CAMx model were extracted according to the longitude and latitude of the air quality monitoring sites and matched to the observed data. Air pollution concentration observation data were used as labels for the forecast data, aiming to calculate simulation errors. The experiment involved pollutant concentration and meteorological data from the YRD in January, April, July, and October 2021.

Performance Metrics
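In this section, we focus on the performance of MAM in predicting air pollutant concentrations and compare it against other algorithms. Mean Bias (BIAS), Root-Mean-Squared Error (RMSE), Index of Agreement (IOA), and Correlation Coefficient (COR) are applied to evaluate the accuracy of air pollutant concentration predictions. The evaluation metrics are described as follows:

$$\mathrm{BIAS} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right),$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^{2}},$$

$$\mathrm{IOA} = 1 - \frac{\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^{2}}{\sum_{i=1}^{N}\left(\left|x_i - \bar{\hat{x}}\right| + \left|\hat{x}_i - \bar{\hat{x}}\right|\right)^{2}},$$

$$\mathrm{COR} = \frac{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(\hat{x}_i - \bar{\hat{x}}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^{2}\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)^{2}}},$$

where $N$ is the total number of predicted (or monitored) data points, $x_i$ represents the simulated value of pollutant concentration, $\hat{x}_i$ represents the monitored value of air pollutant concentration, $\bar{x}$ is the mean of $\{x_1, \ldots, x_N\}$, and $\bar{\hat{x}}$ is defined in the same way.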
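The four metrics can be computed with a few lines of plain Python. The sketch below assumes the commonly used definitions (bias as simulated minus monitored, and Willmott's index of agreement with the monitored mean in the denominator); the function names and the toy values are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def bias(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Mean bias: average of (simulated - monitored)."""
    return sum(s - o for s, o in zip(sim, obs)) / len(sim)


def rmse(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Root-mean-squared error between simulated and monitored values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(sim))


def ioa(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Index of agreement (Willmott form, assumed here)."""
    obs_mean = sum(obs) / len(obs)
    num = sum((s - o) ** 2 for s, o in zip(sim, obs))
    den = sum((abs(s - obs_mean) + abs(o - obs_mean)) ** 2 for s, o in zip(sim, obs))
    return 1.0 - num / den


def cor(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    sim_mean = sum(sim) / len(sim)
    obs_mean = sum(obs) / len(obs)
    cov = sum((s - sim_mean) * (o - obs_mean) for s, o in zip(sim, obs))
    var_s = sum((s - sim_mean) ** 2 for s in sim)
    var_o = sum((o - obs_mean) ** 2 for o in obs)
    return cov / math.sqrt(var_s * var_o)


# Toy values: the simulation is uniformly 1 unit below the monitored series.
simulated = [1.0, 2.0, 3.0, 4.0]
monitored = [2.0, 3.0, 4.0, 5.0]
scores = {
    "BIAS": bias(simulated, monitored),
    "RMSE": rmse(simulated, monitored),
    "IOA": ioa(simulated, monitored),
    "COR": cor(simulated, monitored),
}
```

In this toy case the correlation is perfect even though the bias is nonzero, which is why the four metrics are reported together rather than any one alone.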

Results and Discussion
To verify the effectiveness of MAM, we designed a series of experiments on the obtained air quality dataset, including simulated data and corresponding monitoring data in the Yangtze River Delta. A 10-fold cross-validation method was applied to assess the performance of the various methods. The input dataset was split into ten equally sized subsets called folds. During each evaluation round, nine folds were used as the training set and the remaining fold was used for validation. This process was repeated ten times so that each fold was used for validation once. For each assessment of the proposed model's performance, BIAS (µg/m3), RMSE (µg/m3), IOA, and COR were employed as statistical indicators to quantify the accuracy of the O3 simulations.
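The 10-fold protocol can be sketched as an index generator. Shuffling before splitting and the fixed seed are assumptions of this sketch; the paper does not say how samples were assigned to folds.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists; each fold is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: samples are shuffled once
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


splits = list(k_fold_indices(100, k=10))
```

Each metric would then be computed on the validation fold of every round and averaged over the ten rounds.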

Comparison with Baseline
To test the performance of our self-supervised framework, we compared our method with the baseline (WRF-CAMx model). Cross-validation results on the air quality dataset (i.e., O3) are shown in Figure 4. Overall, the proposed MAM performed better than the baseline, with higher IOA and COR, and lower BIAS and RMSE. O3 concentrations varied in different seasons. January, April, July, and October were selected to represent winter, spring, summer, and autumn, respectively. According to the Mean Bias shown in Figure 4, the hourly O3 concentration data simulated by WRF-CAMx in the YRD region are generally lower than the monitoring station data. This phenomenon is more obvious in April.

Our MAM framework outperformed the WRF-CAMx model in all four months, with a 0.10-0.26 IOA enhancement and a 0.13-0.27 COR increase, demonstrating that MAM has a stable positive effect. To be specific, compared with the WRF-CAMx model, the RMSE of the April simulation results decreased from 40.69 to 22.87, and the IOA increased from 0.60 to 0.86, which is the most obvious change. This may be because the accuracy of the WRF-CAMx model is low in April, so the effect of MAM is more pronounced. As shown in Figure 4, in April, there is a significant discrepancy between the simulation results of the WRF-CAMx model and the observed data at monitoring stations. Limited knowledge of pollutant sources and imperfect representation of physicochemical processes introduce biases into the predicted results of WRF-CAMx.

The hour-by-hour time series comparison results of O3 concentration in the YRD region (Shanghai, Zhejiang, Jiangsu, and Anhui) are shown in Figure 5. The O3 simulated data in the YRD region are divided into four datasets based on administrative areas, and hourly average values are validated against monitoring data. The temporal variation trend and numerical range of the simulated concentration produced by the proposed model are generally consistent with the observed values. Table 2 shows the forecast performance of the proposed method in the four regions, evaluated using correlation coefficients. For the four regions, the simulated hourly O3 concentrations in each month are compared with the monitoring data.

In order to further analyze the effectiveness of MAM in air quality forecasting, we validate the predicted results based on the four months of data provided by each monitoring site, as shown in Figure 6. The correlation coefficient is used to evaluate the agreement between forecast data and monitoring data, where monitoring data are used as labels. The correlation coefficients are visualized at the corresponding geographical locations, and different colors correspond to different levels of correlation. It can be concluded that MAM achieved satisfactory accuracy in the YRD region. In detail, most of the correlation coefficients range between 0.655 and 0.711, with the highest reaching 0.768. These results show that the proposed MAM produces satisfactory prediction accuracy for different geographical locations in the Yangtze River Delta region.

[...] Fully connected Neural Network (FNN), Random Forest (RF)), and WRF-CAMx and Transformer + MAM (w/o WRF-CAMx). In this experiment, all models are tested on the dataset mentioned above, and the performance of each machine learning model is verified by the 10-fold cross-validation method. A comparison of validation results between our method and other models is shown in Table 3. From the results, we found that MAM pre-training can lead to significant improvements in both the IOA and COR metrics. It is worth noting that although the Transformer model is more advanced, it does not exhibit a significant advantage over the traditional FNN and RF models. The Transformer framework often generalizes poorly when trained on a limited dataset, since it lacks certain inductive biases such as locality.

Conclusions
In this paper, a deep learning model, termed masked air modeling (MAM), is proposed to investigate the effectiveness of self-supervised learning in air quality prediction. Moreover, to account for the atmospheric physics and chemical reactions of pollutants, we combine a conventional atmospheric model (WRF-CAMx) with data-driven deep learning methods. This design leverages the strengths of both approaches to enhance simulation accuracy and predictive capabilities. The experimental results show that, in terms of hour-by-hour simulation performance, MAM effectively boosts the model's robustness. Accurate prediction of atmospheric pollutant concentrations is crucial for formulating air pollution control strategies, protecting human health, and supporting environmental management.
Although the proposed self-supervised masked air modeling (MAM) has an advantage in air quality prediction, it often requires large-scale data and computational resources for effective pre-training [54], which may be a limitation. Moreover, our method may suffer performance degradation in unseen contexts due to the domain bias between training data and test data. At the same time, the reconstruction objective may not always align with downstream tasks, which can lead to poor generalization in real-world applications. Transformer models can be extended to larger spatial domains, but doing so poses some challenges. For example, a larger spatial domain increases the number of tokens, resulting in higher computational costs and memory usage, because a Transformer scales quadratically with the number of tokens [12]. In addition, scaling to larger spatial domains typically requires more diverse and extensive training data to capture the additional variability and complexity. These challenges may be addressed by using advanced initialization techniques or lightweight Transformer variants. For future work, exploring air pollutant interactions among different locations could provide insights into spatial dependencies and pollutant dispersion patterns. Implementing multi-source data fusion techniques and advanced spatiotemporal models can further improve predictive capabilities and inform effective pollution control strategies.
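To make the quadratic scaling noted above concrete, the following sketch estimates the memory needed for the self-attention score matrices of a single layer and a single input sequence. The head count (8) and the 4-byte float32 storage are illustrative assumptions, not the configuration used in this paper, and the function name is hypothetical.

```python
def attention_score_bytes(num_tokens, num_heads=8, bytes_per_value=4):
    """Memory for the (tokens x tokens) attention scores of one layer, one sample.

    num_heads and bytes_per_value are illustrative assumptions.
    """
    return num_heads * num_tokens * num_tokens * bytes_per_value

# Doubling the number of tokens quadruples the attention memory.
for n in (1_000, 2_000, 4_000):
    mib = attention_score_bytes(n) / 2**20
    print(f"{n} tokens: {mib:.1f} MiB")
```

Under these assumptions, a domain that yields twice as many tokens needs four times the attention memory per layer, which is the cost that lightweight Transformer variants aim to reduce.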

Figure 1. Left: Traditional air prediction pipeline. Right: The proposed masked air modeling framework for improving air quality prediction.

Figure 2. Schematic illustration of the Transformer-based masked air modeling.

Figure 3. Left: The location of the YRD. Right: The spatial distribution of air quality monitoring sites.

Figure 4. Scatter density plots of cross-validation results for the WRF-CAMx model (left) and our MAM model (right). Cells with aggregate counts up to 1% of the total will be colored. Each row from top to bottom represents the simulation results in January, April, July, and October, respectively.

Table 1. The parameterization schemes of the WRF-CAMx model.

Table 3. Performance comparison of all models. The best results are highlighted in bold.