Introduction

Air pollution can have serious negative impacts on human health, including cardiovascular and respiratory diseases. Monitoring and controlling air pollutants is crucial for protecting public health and the environment. Commonly used strategies for air pollution monitoring include statistical analysis, data visualization, and identifying correlations and trends in pollution levels. These tools can help identify sources of pollution and inform the development of policies and regulations to reduce and control pollution levels. Additionally, monitoring and control efforts may also include the use of specialized equipment and technology, such as air quality sensors and monitoring stations, to measure and track specific pollutants in the air (Manisalidis et al. 2020; Cook et al. 2021; Li et al. 2022c). Modern statistical approaches, such as spatiotemporal modeling and machine learning techniques, can effectively handle high-dimensional data with temporal and spatial characteristics in environmental air pollution monitoring. These methods can capture the underlying variations and dynamics trends of air pollutants over the entire temporal-spatial scale, making them more suitable for this type of data compared to classical statistical methods (Acal et al. 2022; Wang et al. 2022).

Functional data analysis (FDA) is a powerful technique for working with multi-dimensional air pollutants data. It utilizes additional information, such as the smoothness of the data structure, rate of change, acceleration, and dynamic changes over a large-scale domain, to extract more information from the data compared to traditional vectorial approaches (King et al. 2018; Al-Janabi et al. 2021; Reinholdt Jensen et al. 2022). FDA has been well-established in the literature over the past two decades, with a strong methodological and operational framework. There are several advantages of using FDA compared to traditional vectorial approaches: (i) flexibility: FDA can handle data that is not easily represented by vectors, such as data that is curve- or surface-based, or data that varies over a continuous domain, (ii) smoothness: FDA can incorporate information about the smoothness of the data structure, which is not captured by traditional vector-based methods, (iii) dynamics: FDA can capture dynamic changes in the data, such as rates of change or acceleration, which are not possible with traditional vector-based methods, (iv) high-dimensional data: FDA can handle high-dimensional data, which is a challenge for traditional vector-based methods, (v) modeling: FDA can be used to model complex relationships between variables that are not easily represented by simple linear or polynomial models, (vi) visualization: FDA also allows for better visualization of the data, which can aid in understanding and interpreting the results (Al-Janabi et al. 2020b; Betancourt-Odio et al. 2021). The principles and foundations of the FDA methods are found in Ramsay and Silverman (2002, 2005) besides the nonparametric methods of functional data are presented in a monograph study by Ferraty and Vieu (2006).

The use of FDA techniques in analyzing environmental data has received remarkable attention in the past two decades. Escabias et al. (2005) combined functional logistic regression and principal component analysis of environmental data modeling besides the proposed method were used to estimate drought risk in terms of temporal evolution in temperatures. Ground ozone represents one of the most dangerous environmental pollutants; complex chemical and physical processes generate it in the atmosphere and combustion processes in the troposphere. Several FDA approaches have been proposed to analyze the ozone concentration level, for instance, functional principal components analysis (FPCA) to extract manifest features for ground ozone concentration levels (Caligiuri et al. 2005), smooth-spline-based models to study time trends and oscillations in stratospheric ozone (Meiring 2007), mixed functional methods to model trends in the profiles of stratospheric ozone (Park et al. 2013), and the Kendall’s Tau functional statistic (KFT) to discover significant correlations between tropospheric ozone levels in urban and rural sites (Betancourt-Odio et al. 2021). In the same context, Gao (2007) and Gao and Niemeier (2008) used FDA techniques to model the dynamic pattern of nitrogen oxide and diurnal ozone cycles; they showed important results about the structure of spatiotemporal variations in diurnal cycles. In the atmosphere, the concentration of particulate matter (PM) is a highly time–space variable, which follows a periodic cycle dominated by meteorological situations as well as anthropogenic activities. The study by (Broomandi et al. 2021) examined the impact of fine PM2.5 on respiratory and heart diseases. They used a data-driven directed graph representation to infer the causal directionality and spatial embeddedness of PM2.5 concentrations in 14 UK cities over the course of one year. They found notable spatial embedding in the summer and spring and stability to disturbances through the network trophic coherence parameter, with winter being the most significant vulnerability. Many studies have employed FDA to analyze PM and its relationship to air quality. FDA can be used to model the temporal and spatial variation of PM levels, and to identify patterns and trends in the data. It can also be used to estimate the relationship between PM levels and other factors such as weather, traffic, and land use. Additionally, FDA can be used to make predictions about future PM levels and to assess the effectiveness of interventions aimed at reducing PM exposure. For instance, Shaadan et al. (2012) used a functional approach to assess the PM10 pollutant behaviour and compare data from two different years. In another related work, Hörmann et al. (2015) proposed a dynamic version of functional principal component analysis (dynamic FPCs), and the advantage of this approach has been illustrated by applying it to PM10 changes. In another related paper, Kosiorowski et al. (2017) adapted a hierarchical functional time series on a micro-model to forecast day and night PM10 air pollution. In another related study, King et al. (2018) applied modern FDA methods to study the spatial and temporal trends and variability of fine PM components across the USA. In recent years, research on the concentrations of multivariate air pollutants has been investigated by FDA techniques. In another related work, Ruggieri et al. (2013) focused on the principal component analysis of functional data (FPCA) to investigate the variability of multivariate air pollutants data, including (CO, NO2, PM10, and SO2). More recently, an analysis of variance based on functional data analysis (FANOVA) has been proposed by Acal et al. (2022). This method has been applied to four air pollutant concentrations, namely PM2.5, benzene, NO2, and PM10, to assess air pollution changes during the COVID-19 lockdown.

In the environmental pollution framework, an unusually high concentration of air pollutants, known formally as anomalies, may bring problems in the air quality index. Martínez et al. (2014), Sancho et al. (2014), and Torres et al. (2020) implemented a model relying on functional analysis to identify outliers samples, with the overall goal of achieving a better air quality monitoring solution. In another related paper, Shaadan et al. (2015) conducted a study to detect anomalies in daily PM10 functional data, investigate behaviour patterns, and identify potential factors determining PM10 abnormalities at three selected air quality monitoring stations. More applications that demonstrate the usefulness and advantages of FDA methods in environmental data analysis are found in Ocana-Peinado et al. (2008), Valderrama et al. (2010), Embling et al. (2012), Escabias et al. (2013), Ignaccolo et al. (2014), Xiao and Hu (2018), Ochoa et al. (2020), Reinholdt Jensen et al. (2022).

Machine learning is widely used to perform in-depth analysis in various fields such as biomedicine, energy, and economics (Saleh et al. 2023). To make our proposed method more comprehensive, we will compare it to recent algorithms in the context of machine learning and deep learning. For example, in biomedicine, Al-Janabi and Alkaim (2022) proposed a novel optimization method called Lion-AYAD to find optimal DNA protein generated through DNA synthesis. Their results showed the method to be robust with dynamic DNA sequence lengths, with increased accuracy and reduced execution times. In a similar context, Kadhuim and Al-Janabi (2023) presented a model that uses Deep Optimal Neurocomputing Technique (DLSTM-DSN-WOA) and Multivariate Analysis to predict Codon-mRNA. Their proposed model is a pragmatic intelligent data analysis model that reduces computation and handling time for large real data. In the field of renewable energy, Al-Janabi et al. (2020a) proposed deep learning techniques (DCapsNet and DCOM), and Mohammed and Al-Janabi (2022) proposed optimization techniques (FDIRE-GSK) for the generation of electrical energy from natural resources such as wind energy. Another approach, called DRFLLS, has been developed to estimate missing values in various datasets (Al-Janabi and Alkaim 2020). Additionally, the use of machine learning algorithms for high-dimensional functional data classification has become increasingly important in environmental air pollution research. The current study specifically focuses on using the FDA approach for the classification and visualization of high-frequency spatiotemporal air pollution data. Researchers have previously attempted to use FDA methods to cluster air pollution levels, which can help identify patterns and trends in the data and better understand the factors that contribute to air pollution. The use of the FDA, in combination with machine learning algorithms, can help to improve the accuracy and robustness of air pollution classification and visualization.

There have been several studies that have used functional data clustering approaches to analyze the network paths of air quality. To show an example, Ignaccolo et al. (2008) proposed an early study on analyzing the network paths of air quality using functional data clustering; they considered the air pollutant variable as a functional data object and classified them using the Partitioning Around Medoids (PAM) algorithm. Similarly, Ranalli et al. (2016) used FDA and PAM clustering approach to analyze high-frequency spatiotemporal data on the size distribution of particulate matter (PM). In another paper, Kosiorowski and Szlachtowska (2017) proposed a novel k–local functional median algorithm applied to the analysis of a real data set concerning air pollution monitoring. More recently, Bouveyron et al. (2022) developed a functional co-clustering approach based on the functional latent block model (funLBM) and illustrated by the analysis of multivariate air pollution data in the South of France. All these studies have made significant progress in the field of clustering functional air pollution data in terms of methodology and practical applications.

Research on clustering spatiotemporal air pollution using FDA is still an active area of interest, and new studies are needed to further advance the field. Therefore, this study has two main contributions: 1) from a methodological perspective, it presents a method based on the FDA approach for clustering and visualizing spatiotemporal functional data, and 2) from a practical aspect, it applies the proposed method to identify, classify and visualize multiple air pollutants, such as sulphur dioxide (SO2), carbon monoxide (CO), ozone (O3), and particulate matter (PM2.5) measured over multiple sites in Yemen during the period of January 1980 to April 2022. As far as the authors know, the air pollution problem in Yemen has not been investigated before, and this is the first study to analyze the multivariate air pollution concentrations using the FDA method. The study highlights several steps to achieve its goal: (1) transforming the discretization air pollution data into functional data to work with functional realm; (2) smoothing the functional air pollution data to improve the structured data from step 1; (3) visualizing the spatiotemporal features of functional air pollution data to discover the mechanism of variability; (4) clustering spatiotemporal functional air pollution data to group similar patterns found in step 3 for both spatial and temporal profiles.

Data and methods

Study area

Yemen, officially the “Republic of Yemen,” is a west Asian country located in the Middle East in the southern part of the Arabian Peninsula. It is bordered by the Kingdom of Saudi Arabia to the north, the Sultanate of Oman to the east, the Arabian Sea to the south, and the Red Sea to the west, and it shares maritime borders with Djibouti, Eritrea, and Somalia. It is the second-largest Arab sovereign state on the peninsula, occupying 555,000 km2 (214 thousand square miles). Yemen’s total coastline extends over a length of approximately 2000 square kilometers (1200 miles). Sana’a is the constitutionally stipulated capital and largest city of Yemen. As of 2021, Yemen has an estimated population of 30,491,000. Yemen lies within latitude and longitude 15° 0ʹ North and 48° 0ʹ East which includes an area mostly desert. It also consists of a narrow coastal plain surrounded by rugged mountains.

Yemen’s climate is a mixture of temperate, humid, and hot. The western part is exposed to the influence of the monsoon monsoons. Towards the inland eastern region of Yemen, the climate becomes unbearably hot. During the summer, the temperature can reach 54 °C, and the winters are much colder, with frost in some parts. The average annual temperature of the capital, Sanaa, is 18° C. Yemen is also exposed to natural hazards in the form of dust and sand storms. Yemen’s climate can be described as a dry subtropical, hot desert climate with low annual rainfall, very high summer temperatures, and a large difference between the maximum and minimum temperatures, especially in the interior regions.

Figure 1 depicts a geopolitical map of Yemen with detailed legends for its major cities, road networks, airports, railways/railroads, and waterways.

Fig. 1
figure 1

The map of Yemen with major geographical features (Worldmaps 2023)

Data and variables

In this study, the selection of the 37 major cities for analysis was based on the criteria of high population density and wide geographical coverage across the entire country. The population density of a city is an important factor in determining the level of air pollution, as a higher population density typically leads to higher levels of industrial and vehicular emissions. The geographical location of the cities was also considered to ensure that a diverse range of regions was represented in the analysis. Additionally, the availability of historical air pollution data was taken into account to ensure that an accurate and comprehensive analysis could be performed. The geographical location of the selected cities is illustrated in Fig. 2, which provides a visual representation of the distribution of the cities across the country. This information can be useful in understanding the regional variations in air pollution levels and trends. The name of selected stations and their geographical characteristics (latitude, longitude, and average elevation) are given in Supplementary Table 1.

Fig. 2
figure 2

The spatial distribution of the selected sites in Yemen

The concentrations of a pollutant are typically gauged by using the metric of micrograms per meter cube\((\mu g/{m}^{3})\). There are two different metrics to measure the particulate matter concentrations: (PM2.5) and (PM10) refer to the particles that are less than \(2.5\mu g/{m}^{3}\) and \(10\mu g/{m}^{3}\) in diameter, respectively. This study focuses on four main air pollutants: PM2.5, O3, SO2, and CO. The area-averaged monthly records for four primary air pollutant variables for several locations in Yemen During the period 1980–2022 were extracted as satellite data from (NASA 2023). The total sample size for CO, SO2, and PM2.5 pollutants measurements is equal to (n = 500) discrete records discrete records\({X}_{i}\left({t}_{j}\right) , { t}_{j}\in \left[\mathrm{1,500}\right] , i=1,\dots ,37\), the \(ith\) discrete observation \({X}_{i}\left({t}_{j}\right) , j=1,\dots ,500\) indicates pollutants values for the \(ith\) month. The total sample size for O3 pollutant measurements is equal to (n = 508) discrete records\({X}_{i}\left({t}_{j}\right) , { t}_{j}\in \left[\mathrm{1,508}\right] , i=1,\dots ,37\), the \(ith\) discrete observation \({X}_{i}\left({t}_{j}\right) , j=1,\dots ,508\) indicates pollutants values for the \(ith\) month.The discrete monthly pollutants dataset \({X}_{i}\left({t}_{j}\right)\) will be transformed into continuous functions by adapting a suitable basis functions system. More details about the variables and their features are summarized in Table 1.

Table 1 Details of variables and data

Model hypothesis and limitations

The aim of this paper is to analyze and categorize the dynamic changes in air pollution concentrations using functional analysis techniques and a functional mixtures clustering model. The method was applied to multivariate high-dimensional air pollution data collected from cities in Yemen from 1980 to 2022. Fourier transformation, B-spline functions, and generalized cross-validation were utilized to smooth and reconstruct the data. Two enhanced 3D visualization tools were used to examine the spatiotemporal variations in air pollutants and a functional mixture model was employed to classify the functional air pollutants data based on their spatiotemporal characteristics. The paper sets forth three hypotheses: (i) that the air pollution data is measured at a set of ordered times and the discrete observations \({X}_{i}\left({t}_{j}\right)\) are dense and regular over a specified time interval, (ii) that the air pollutants data follows a Gaussian mixture model-based \(\mathrm{FDM}({\Sigma }_{k},{\beta }_{k})\) model, and (iii) that the number of clusters (K) and the intrinsic dimension (d) must be predetermined. The first hypothesis is supported by the regular observation of air pollution data from 1980 to 2022. The second hypothesis is supported by the use of Gaussian mixture models in functional data analysis, and the option to use a robust mixture model in the future. The third hypothesis, which assumes a fixed number of clusters, is based on the belief that using selection methods to determine K would lead to unclear results. In this study, the number of clusters was set at 4, as it was determined to provide the best segmentation of the air pollution data.

Model estimation

The structure of the spatiotemporal functional data (STFD) model in a multivariate pollutant’s context is a statistical framework for analyzing functional data that varies over both space and time. The model typically consists of several components: (i) spatial component: this captures the spatial variation in the functional data, often represented as a spatial random effect, (ii) temporal component: this captures the temporal variation in the functional data, often represented as a temporal random effect, (iii) functional component: this captures the functional variation in the data, often represented as a functional principal component analysis (FPCA) model, (iv) covariate component: this captures the relationship between the functional data and any additional covariate information, often represented as a linear or nonlinear regression model, and (v) error component: this captures the residual variation in the data not explained by the other components (King et al. 2018; Wang et al. 2020; Hael et al. 2020). Overall, the STFD model is a flexible framework that can be used to analyze multivariate pollutant’s objects over space and time and can be extended to include other sources of variation or additional information as needed. Additional information has been included in the supplementary section regarding the theoretical framework, including elements like the structure of spatiotemporal functional data, the basis functions and smoothing techniques used, and the functional mixture model for analyzing STFD. This section will cover the concepts of model estimation, including the Expectation, Discrimination, and Maximization phases. Expectation (E) phase: In the E-step, the model uses the current estimate of the parameters to calculate the probability of each data point belonging to each cluster. Whereas Discrimination (D) phase: In the D-step, the model uses the probabilities calculated in the E-step to re-estimate the parameters of the clusters. While Maximization (M) step: In the M-step, the model uses the re-estimated parameters from the D-step to update the overall estimate of the parameters of the model (Preda 2007; Bouveyron et al. 2015). The Expectation-Discrimination-Maximization (EDM) procedure is a three-step process used to estimate the parameters of the mixture model. It involves alternating between the E-step, the D-step, and the M-step until a specified criterion is met. In this study, the criterion was set as 100 iterations (q = 100). The EDM procedure continues to iterate until this iteration number is reached.

The expectation phase

The expectation is the first step which computes the posterior probabilities \({t}_{ik}^{(q)}\) under the condition the current value of the parameter \({\theta }^{\left(q\right)}\), at iteration q. The probability \({P(z}_{ik}=1)\) points out that the curve brings from the \(kth\) component and \({P(z}_{ik}=0)\) otherwise. In the functional discriminative model, the posterior probabilities \({t}_{ik}^{(q)}, \mathrm{i}=\mathrm{1,2},\dots ,\mathrm{n};\mathrm{ k}=\mathrm{1,2},\dots ,K\) that each curve suits the \(kth\) component can be given as (Bouveyron et al. 2015):

$${t}_{ik}^{(q)}=E\left[{z}_{ik}\left|{y}_{i},{\theta }^{\left(q\right)}\right.\right]=\frac{{\pi }_{k}^{(q)} \phi ({y}_{i},{\theta }_{k}^{(q)})}{\sum_{l=1}^{K}{\pi }_{l}^{(q)}\phi ({y}_{i}\mid {\theta }_{l}^{(q)})}$$
(1)

where \({\theta }_{k}^{(q)}=({\pi }_{k}^{(q)},{\mu }_{k}^{(q)},{\Sigma }_{k}^{(q)},{\beta }^{(q)})\) are the combination of parameters for the \(kth\) mixture component and \(\phi (.)\) is the Gaussian density. The model parameters will be updated in the Maximization(M) step (mentioned below) and estimated at an optimal point in the last iteration \((q)\).

The discrimination phase

The Discrimination (D) step is aimed to determine the orientation matrix \({U}^{q}\) of the discriminative latent space F conditionally on the posterior probabilities \({t}_{ik}^{(q)}\) through maximizing the standard Fisher’s (F) criterion (Preda 2007):

$${\widehat{U}}^{(q)}=\underset{U}{Max }\mathrm{trace}\left({\left({U}^{t}SU\right)}^{-1} {U}^{t}{S}_{B}^{\left(q\right)}U\right)$$
(2)

The S refers to the whole sample covariance matrix and \({S}_{B}^{(q)}\) refers to the soft between-cluster covariance matrix, which is defined as: \({S}_{B}^{(q)}=\frac{1}{n}\sum_{k=1}^{K}{n}_{k}^{(q)}({m}_{k}^{(q)}-\overline{y}){({m}_{k}^{(q)}-\overline{y})}^{t}\). In the functional unsupervised classification framework with an unobserved variable \((z)\), the Fisher criterion optimizes the discriminative function \(U\in {L}_{2}[0,T]\) by (Bouveyron et al. 2015):

$$\underset{U}{Max} \frac{Var\left({\mathbb{E}}\left\{{\int }_{0}^{T}X(t)U(t)dt\mid z\right\}\right)}{Var\left({\int }_{0}^{T}X(t)U(t)dt\right)}$$
(3)

The optimization procedure of (4) is the eigenfunction U associated with the highest eigenvalue λ of the following generalized eigenproblem:

$${\int }_{0}^{T}B\left(t,s\right) U\left(s\right) ds=\uplambda {\int }_{0}^{T}C\left(t,s\right) U\left(s\right) ds$$
(4)

The estimation of the covariance operator \(C\left(t,s\right)\) based on the basis function \({({\psi }_{j})}_{j}=1,\dots ,p\), is given as:

$$\widehat{C}\left(t,s\right)=\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{p}{y}_{ij}{\psi }_{j}\left(t\right)\right)\left(\sum_{j=1}^{p}{y}_{ij}{\psi }_{j}\left(t\right)\right)$$
(5)

The \(B(t,s)\) here indicates the integral between cluster covariance operators and conditionally on the posterior probabilities \({t}_{ik}^{(q)}\) obtained from the Expectation (E)-step, the estimator of \(B(t,s)\) at iteration \((q)\) is defined as:

$${\widehat{B}}^{\left(q\right)}\left(t,s\right)=\frac{1}{n}\sum_{k=1}^{k}\frac{1}{{n}_{k}^{\left(q\right)}}\left(\sum_{i=1}^{n}{t}_{ik}^{\left(q\right)}v\left(t\right)\right)\left(\sum_{i=1}^{n}{t}_{ik}^{\left(q\right)}v\left(s\right)\right)$$
(6)

The maximization phase

The Maximization step (M) is aimed to estimate the parameters of the functional latent mixture model. In this step, maximizing the conditional expectation of the complete data log-likelihood conditionally to the orientation matrix \({U}^{q}\) is computed in the following form (Bouveyron et al. 2015):

$$Q\left(\theta \right)=\frac{-1}{2}\sum_{k=1}^{K}{n}_{k}^{\left(q\right)} \left[ -2\mathrm{log}\left({\pi }_{k}\right)+A+\mathrm{log}\left(\left|{\Sigma }_{k}\right|\right)\right. \left.+B+D+h\right]$$
(7)

The explanation of these notations is as follows: \(\theta\) indicates the parameters of the mixture model \(\theta =({\pi }_{k},{\mu }_{k},{\Sigma }_{k},{\beta }_{k})\), \(A=\mathrm{trace}\left({\Sigma }_{k}^{-1}{U}^{\left(q\right)t}{C}_{k}^{\left(q\right)}{U}^{\left(q\right)}\right)\), \(B=\left(p-d\right)\mathrm{log}\left({\beta }_{k}\right)\), D = \(\frac{\mathrm{trace}\left({C}_{k}^{\left(q\right)}\right)-\sum_{j=1}^{d} {u}_{j}^{\left(q\right)t}{C}_{k}^{\left(q\right)}{u}_{j}^{\left(q\right)}}{{\beta }_{k}}\), \({C}_{k}^{(q)}=\frac{1}{{n}_{k}^{(q)}} \sum_{i=1}^{n} {t}_{ik}^{(q)}\left({y}_{i}-\left.{\mu }_{k}^{(q)}\right){\left({y}_{i}-{\mu }_{k}^{(q)}\right)}^{t}\right.\) presents the empirical covariance matrix of the \(kth\) cluster,\({u}_{j}^{\left(q\right)}\) is the \(jth\) column vector of \({U}^{(q)}\), and \(h=p \mathit{log}\left(2\pi \right)\) is a constant term. At iteration q, the maximization of \(Q(\theta )\) is conditional on \({U}^{(q)}\) conduces to the estimation of mixture parameters of the \(\mathrm{FDM}({\Sigma }_{k},{\beta }_{k})\) model according to the following update formulas (Bouveyron et al. 2015):

$$\Rightarrow\mathrm{The}\;\mathrm{proportion}\;\mathrm{parameter},\pi_k^{(q)}=\frac{n_k^{(q)}}n$$
(8)
$$\Rightarrow\mathrm{The}\;\mathrm{mean}\;\mathrm{parameter},\mu_k^{(q)}=\frac1{n_k^{(q)}}\sum_{i=1}^nt_{ik}^{(q)}U^{(q)t}y_i$$
(9)
$$\Rightarrow\mathrm{The}\;\mathrm{covariance}\;\mathrm{parameter},\Sigma_k^{\left(q\right)}=U^{(q)t}C_kU^{(q)}$$
(10)
$$\Rightarrow\mathrm{The}\;\mathrm{noise}\;\mathrm{parameter},\beta_k^{\left(q\right)}=\frac{\mathrm{trace}\left(C_k\right)-\sum_{j=1}^du_j^{(q)t}C_ku_j^{(q)}}{p-d}$$
(11)

The flowchart illustrating the methods proposed in the current study can be found in Fig. 3. Additionally, this flowchart serves as a framework for understanding the terminology used in the study’s methodology and statistics. Besides, the proposed method for the visualization and clustering of the STFD uses a multivariate framework implemented in R programming language with the help of several package environments. Spatial–Temporal Functional Air Pollution Data Analyzer (STFAPDA), which is a useful tool for analysing and understanding the dynamics of air pollution, is the name of the proposed algorithm. The algorithm consists of several steps, which are described in detail as follows:

Fig. 3
figure 3

The flowchart of the proposed methods in the current study

figure a

Results and discussion

Transforming and smoothing data

The discretization datasets of four primary air pollutants over 37 major cities in Yemen during the period from 1980 to 2022 are given in Fig. 4. Discretization is the process of dividing continuous data into discrete intervals or bins. In the context of air pollution data, discretization is often used to convert continuous measurements of pollutants (e.g., in micrograms per cubic meter) into categorical levels (e.g., low, medium, high). The discretization points chosen are determined by the nature of the data and the research question. As a result, it is critical to remember that the discretization points used can have a significant impact on the analysis’s outcome, and it may be necessary to experiment with different options in order to find the best representation of the data. In Fig. 5, discrete air pollution data (CO, SO2, PM2.5, and O3) with the spatial elements are presented. The initial step is converting the discretely observed air pollutants curves (CO, SO2, PM2.5, and O3) into continuous functional objects for reconstructing the data framework. In particular, Fourier basis functions are applied to the O3 and PM2.5 air pollutants curves, which exhibit seasonal variability throughout the entire data domain. It is observed that some functional O3 and PM2.5 data objects have a high level of fluctuations, which is unusual.

Fig. 4
figure 4

Discretization points for a SO2, b PM2.5, c O3, and d CO datasets over 37 cities in Yemen

Fig. 5
figure 5

Transformed and smoothed functional data of multivariate air pollutants over multiple cities a SO2, b PM2.5, c O3, and d CO (legend below)

Spatial–temporal functional data analysis is a technique that can be used to transform and smooth air pollution data. It involves analyzing data over both space and time and modeling the data as a collection of functions rather than a set of discrete points. This allows for a more accurate representation of the data, as well as the ability to smooth and interpolate missing values. This technique utilizes various methods such as functional principal component analysis and functional regression to analyze the data and gain insights from it. Additionally, it allows modeling the temporal and spatial correlations in the data, which can help to understand the underlying patterns and trends in the air pollution data. Transforming and smoothing air pollution data involves several steps: data preprocessing, data transformation, data smoothing, data visualization, and data modeling. Data preprocessing involves cleaning and preparing the data for analysis. Data transformation involves converting the data into a more suitable format for analysis. Data smoothing involves removing noise or random variations from the data to make patterns and trends more visible. Data visualization involves creating visual representations of the data to gain insights and identify patterns and trends. Data modeling involves developing statistical or machine learning models to better understand the data and make predictions about future air pollution patterns. These steps are not always applied in a linear fashion and the analyst may have to iterate over the process to reach a final solution. Figure 5 depicts the charts of transforming and smoothing multivariate functional air pollutants data for multiple locations. The data of transforming and smoothing from air pollutants (CO, SO2, PM2.5, and O3) will be discussed in the following subsections.

Analysis of SO2 concentrations

The process of converting discrete data to a continuous functional form is known as functional data analysis. B-spline functions are commonly used in this process because they are flexible and can be used to approximate a wide range of shapes. B-splines are piecewise polynomial functions that are defined over a set of control points, called knots. When applying B-spline functions to discrete SO2 pollution data, the goal is to find a smooth curve that accurately represents the underlying trend in the data. However, in some cases, the resulting curve may have a rough level that can be improved using a smoothing method. The GCV standard is a commonly used method for smoothing functional data. It is a computational method that allows for the estimation of the optimal smoothing parameter, which controls the degree of roughness in the final curve. The present smoothing aims to improve the structure of the functional SO2 data and facilitate its subsequent interpretation and analysis. This can be useful for identifying patterns and trends in the data that may be difficult to discern when working with discrete data. Additionally, the continuous functional form of the data can be used in further statistical analysis, such as regression or model fitting. Figure 5a describes the use of B-spline bases function with cubic degree and a smoothing parameter of 0.06 to transform and smooth SO2 air pollutant data. This process makes the data more informative and allows for an easy visual representation of the shape of the SO2 pollutants over the entire domain. From Fig. 5a, it is clear that the SO2 shape has two significant periods of long-term changes, with higher variations in the middle and end of the domain. The highest peak of SO2 is located around the year 2021 with an average pollutant level of 0.0000007 kg/m3. Table 2 gives more details on the type and number of basis functions, smooth method, and parameters for all functional variables. Previous research has shown that the functional form has not been used to study SO2 in the past, but there have been studies using the classical form. The study by Al-Janabi et al. (2020b) used a different analysis method and included other pollutants, whereas the current study focuses specifically on SO2 and processes the data in a discontinuous/discretization form without any data transformation or smoothing, which sets it apart from the aforementioned study.

Table 2 Details on type and number of basis functions, smooth method, and parameters for all functional variables

Analysis of PM2.5 concentrations

Fourier basis functions are used to adapt to the seasonal variability of PM2.5 pollution in the entire data domain. The structure of functional PM2.5 air pollutant data is also improved using a penalized roughness method based on generalized cross-validation criteria for easier interpretation. Thirty Fourier bases functions and a smoothing parameter of 0.04 are used for converting and smoothing PM2.5 air pollutant data. The smoothing parameters determine the strength of the smoothing applied to the data. A smaller smoothing parameter will result in a smoother function with less noise but may also smooth out important features or outliers in the data. A larger smoothing parameter will result in a less smooth function with more noise but will also be more responsive to changes in the data, including outliers. The use of Fourier basis functions in this context is to transform the data into a different domain where it can be more easily smoothed. Smoothing functions can be configured to handle outliers in various ways such as excluding or down-weighting them or treating them as valid data points. Additionally, outliers can be handled before the smoothing step through preprocessing techniques such as removing or transforming the data. Table 2 and Fig. 5b provide additional information and a graphical representation of PM2.5 pollutant data. The proposed procedures make the data more informative and easier to understand. Figure 4b shows that the PM2.5 shape exhibits significant variations at the end of the time domain, which indicates a change in the levels of the pollutant compared to the rest of the time domain. This could potentially be due to changes in industrial activity, transportation patterns, or other factors that contribute to PM2.5 emissions. It is important to further investigate the cause of these variations to understand the impact on air quality and take appropriate measures to reduce emissions. Additionally, Fig. 5b also highlights that some functional PM2.5 data objects have a high level of fluctuations. These fluctuations could be caused by various factors such as weather conditions, changes in population density, and industrial activity. These fluctuations can have a significant impact on air quality and public health, so it is important to analyze them more deeply in later sections of the research. This could include identifying specific sources of emissions, evaluating the effectiveness of current air quality regulations, and developing strategies to reduce emissions and improve air quality. The highest peak of PM2.5 is located at the time domain labeled around 450 to 470, corresponding to the years 2017–2018 with an approximate pollutant average of 0.000006 kg/m3. Previous studies have examined the PM2.5 pollutant variable using a functional form, including King et al. (2018) and more recently Acal et al. (2022). King et al. (2018) used both B-spline and restricted maximum likelihood (REML) techniques to transform and smooth the PM2.5 data, while Acal et al. (2022) only used B-spline smoothing. In contrast, recent studies have also analyzed PM2.5 pollutant data without any transformation or smoothing, such as Al-Janabi et al. (2021) and Wang et al. (2022).

Analysis of O3 concentrations

There are several methods for transforming and smoothing O3 concentrations. One common method for transforming O3 concentrations is to convert them from a raw measurement to a pollutant index, such as the Air Quality Index (AQI), which provides a more easily understandable and comparable value. Another method is to apply a smoothing technique, such as moving average or lowest smoothing, to reduce the impact of random measurement errors and to better identify trends and patterns in the data. It’s important to note that before applying any transformation or smoothing technique, it is necessary to ensure that the data is of good quality and that any outliers or missing values have been properly handled.

The use of Fourier basis functions for adapting to O3 pollutant data with seasonal variability is a common approach in functional data analysis. The Fourier basis functions are able to capture the periodic nature of the data and can be used to model and smooth the data (Hael 2021). The use of the penalized roughness method with the GCV criteria is a method for selecting the optimal smoothing parameter (Guo et al. 2022). The GCV criteria is a way of evaluating the performance of the model by comparing the observed data to the predicted data. The penalized roughness method is a way of controlling the smoothness of the model by adding a penalty term to the objective function. By using the GCV criteria, the optimal smoothing parameter is chosen to minimize the difference between the observed and predicted data while also controlling the smoothness of the model. In this specific case, thirty-five Fourier basis functions and a smoothing parameter of 0.03 were used for converting and smoothing the O3 pollutant data. This approach aims at reducing the high fluctuations level present in some functional O3 data objects. The overall goal is to improve the structure of functional O3 air pollutant data and make it with enhanced sights. Table 2 and Fig. 5c provide additional information and a graphical representation of the O3 pollutant data. The transformation and smoothness procedures make the data more informative and easier to understand. Figure 5c shows that the functional shape of the O3 pollutant has several peaks, indicating significant dynamic changes over time. The highest peaks occur in the time domain labeled around 410 to 430, corresponding to the years 2014–2015, with an average pollutant level of 265 Dobson. The functional analysis framework used in this study is similar to that of a study by (Bouveyron et al. 2022) in that it uses the Fourier basis to reconstruct the functions because the O3 pollutant exhibits clear periodicity. Additionally, several recent studies have examined O3 pollutant data without any alteration or smoothing, such as those by Yang et al. (2020), Liu et al. (2022), and Shams et al. (2022). These studies differ from the current one in terms of the way they analyze and process O3 pollutant data.

Analysis of CO concentrations

In this study, B-spline functions are being applied to discrete CO pollution data to convert it to a continuous functional form. The rough level of some curves can be improved using a smoothing method, and the GCV standard is used to handle the transformed functional data. This smoothing aims to improve the structure and interpretability of the functional CO data. Twenty-five B-spline bases functions with a cubic (fourth-order) degree and a smoothing parameter of 0.05 are being used for this transformation and smoothing process. The use of the GCV criteria to penalize the functional CO data suggests that the data with high noise levels are being treated as outliers. Outliers in this context are likely to be data points that are significantly different from the rest of the data and do not fit the underlying structure of the functional data. By penalizing these data points, the noise is removed, and the structure of the functional CO data is improved. However, it is important to note that it’s not always correct to consider that these data are outliers, they might be real events, so it is necessary to examine the data carefully and consider other factors before removing any data. One possible way to avoid removing real events and to smooth the data is to use moving averages. Table 2 and Fig. 5d provide detailed information about the process of CO data and its graphical representation. The proposed procedures make the CO pollution data more informative and easier to understand. The table displays the numerical values of the CO pollution data, while Fig. 5d provides a visual representation of the data. From Fig. 5d, it is clear that the shape of the CO pollutants has high significant variations over a long period of time. The CO concentrations started in the time domain labeled around 100, corresponding to the year 1990. The highest peak is located in the time domain labeled around 195, corresponding to the year 1996 with an approximate pollutant average of 0.000004 kg/m2. Additionally, a more accurate analysis for CO curve clustering will be explained in “Clustering of functional CO data” Section of the paper. This section will provide a deeper understanding of the data and possible patterns in pollution levels over time. This will assist in identifying areas that need to be addressed for pollution control and management. CO had not been studied before using the functional form, according to earlier research. However, this pollutant has received some traditional research. Al-Janabi et al. (2021) and Li et al. (2022d) currently studied the CO pollutant using different analysis methods in a recent study. As a result, the current study differs from their study in that the data was processed in a discontinuous/discretization form without any transformation or smoothing.

Visualizing the spatiotemporal variability

Visualizing the spatiotemporal variability of data for major pollutants is important for several reasons: (i) understanding patterns and trends: visualizing the data can help identify patterns and trends in pollutant levels over time and in different locations. this information can inform decision-making and management strategies for reducing pollution, (ii) identifying high-risk areas: by visualizing the data, it is possible to identify areas or times where pollutant levels are particularly high or variable. this can help target interventions and resources to areas that need them the most, (iii) communication and education: visualizing the data can be an effective way to communicate and educate the public about air pollution and its impacts. by providing clear and easy-to-understand information, it can help to raise awareness and engage the community in efforts to reduce pollution, (iv) monitoring progress: visualizing the data over time can help to track the progress of pollution reduction efforts and evaluate the effectiveness of different management strategies, (v) decision-making and policy formulation: the spatiotemporal variability data when visualized, it can be used to evaluate the distribution, intensity, and frequency of pollutants, which can be used as an input for decision-making and policy formulation (Li et al. 2022a; Nikolaou et al. 2023). Overall, visualizing the spatiotemporal variability of data for major pollutants can provide valuable insights that can help to improve air quality and protect public health.

The use of static 3-D perspectives charting and dynamic web-based interactive surface mapping as visualization tools for high-dimensional, voluminous, and spatial–temporal pollutants data is a promising approach. Static 3-D perspectives charting allows for the visualization of the data in a three-dimensional space, providing a clear and intuitive representation of the distribution of pollutants over time and in different locations. This can be particularly useful for identifying patterns and trends in the data, and for identifying areas or times where pollutant levels are particularly high or variable. Dynamic web-based interactive surface mapping, on the other hand, allows for the creation of interactive maps that can be viewed on the web. This can be useful for visualizing the data in a geographic context, and for exploring the data in more detail. The interactivity allows users to navigate and zoom in and out of the data, and to view the data at different levels of detail. Both techniques can be used to analyze the spatial–temporal data for four major pollutants variables (SO2, PM2,5, O3, and CO) and provide a comprehensive understanding of the patterns and trends in the data, as shown in Fig. 6. The use of these visualization tools can be very effective for data analysis, decision-making, and policy formulation, and for communicating and educating the public about air pollution and its impacts. The next sub-sections analyze the spatial–temporal data for four major pollution variables (SO2, PM2,5, O3, and CO).

Fig. 6
figure 6

3D-visualization of Spatiotemporal functional data for multivariate air pollutants using dynamic web-based interactive method (right penal), and static perspective surface (left panel) a SO2, b PM2.5, c O3, and d CO

Visualizing of SO2 data

The visualization in Fig. 6a shows two different tools for displaying spatiotemporal variation in SO2 pollutant data. The first tool is a static 3D chart with a gradient heatmap (left penal), while the second tool is a dynamic, web-based interactive surface map (right penal). The proposed two tools for visualizing SO2 pollutants can aid users in understanding the spatiotemporal dynamics of low and high air pollution over a long-term period. The dynamic web-based interactive surface mapping tool, in particular, has interactive features that allow users to click buttons to view the concentration of SO2 pollution. The heatmap metrics in this tool provide detailed information on the structure of heterogeneity and fluctuations in SO2 levels. Specifically, a red color on the gradient heatmap indicates intense high variations across spatial and temporal SO2 cycles, while a blue color on the gradient heatmap indicates low variations across these cycles. It is clear that the highest variability of SO2 occurs from 2007 to 2021, with insignificant variations observed in the rest of the time interval, particularly at the beginning of the interval. Using the same technique, (Ranaarif and Yuwono 2021) recently observed variations in the concentration and spatial distribution patterns of SO2 pollutants in Bali Island from 2011 to 2020. The main difference between current SO2 visualization tools and the study by (Ranaarif and Yuwono 2021) is that the latter used traditional visualization approaches to analyze changes in the concentration and spatial distribution patterns of SO2 pollutants, while current tools may use more advanced techniques such as 3D modeling or interactive maps.

Visualizing of PM2.5 data

The visualization of variance–covariance 3D surfaces for spatiotemporal functional PM2.5 pollutant data is an important tool for understanding and identifying patterns in air quality. Two different methods have been implemented for the visualization of spatiotemporal PM2.5 variation. The first is a static 3-D perspective chart with a gradient heatmap, as shown in the left panel of Fig. 6b. The second is a dynamic web-based interactive surface mapping that provides enhanced animation, as shown in the right panel of Fig. 6b. The implementation of these visualization tools is crucial for assisting users in developing an awareness of the air quality in a specific area. It allows users to identify the spatiotemporal dynamics of low and high air pollution over a long-term period, as well as provide detailed information about the structure of heterogeneity and fluctuations in PM2.5 concentrations. For example, the red color on the gradient heatmap indicates an intense high variation across spatial and temporal PM2.5 cycles, while the blue color on the gradient heatmap indicates a low variation across spatial and temporal PM2.5 cycles. The PM2.5 visualization results have also revealed different fluctuations throughout the whole domain, with the highest fluctuation period of the PM2.5 pollutants observed between 2016 and 2020. Additionally, the dynamic web-based interactive surface mapping offers interactive features such as buttons that allow users to view the PM2.5 pollution concentration in an easy-to-understand format, making it a powerful tool for understanding air quality patterns. In addition to the visualization methods described earlier, it is important to note that some existing visualization approaches are limited in their ability to process large PM2.5 data sets and support dynamic visualizations. To compare the current PM2.5 visualization tools with recent previous studies, the current study can be compared to the approach used by Li et al. (2016) who employed a visualization approach to analyze the air quality index based on PM2.5 pollutant in Beijing, China. Similar studies include those by Tang et al. (2021) and Medhi and Gogoi (2021) used visualization tools to analyze the impact of COVID-19 on PM2.5 in China and India, respectively. All of these previous studies have used traditional visualization and interpolation methods which differ from the current study which used spatial–temporal dynamic visualization for PM2.5 functions. This current study presents a novel approach that addresses the limitations of traditional visualization methods and provides a more comprehensive understanding of PM2.5 data.

Visualizing of O3 data

The visualization of variance–covariance 3D surfaces for spatiotemporal functional O3 pollutant data is shown in Fig. 6c. Two different tools have been implemented for the visualization of spatiotemporal O3 variation. The first one is a static 3-D perspective charting with a gradient heatmap drawn in Fig. 6c (left panel). The second one is a dynamic web-based interactive surface mapping which provides enhanced animation drawn in Fig. 6c (right panel). The visualization of O3 pollutants could help the users to identify the spatiotemporal dynamic of low and high air pollution over a long-term period. Moreover, in the dynamic web-based interactive surface mapping, interactive features are designed where users are able to click buttons to view the O3 pollution concentration. It is easy to understand the O3 variability by heatmap metrics which provide detailed information regarding the structure of heterogeneity and fluctuations. Specifically, the red color on the gradient heatmap indicates an intensely high variation across spatial and temporal O3 cycles, and the blue color on the gradient heatmap indicates a low variation across spatial and temporal O3 cycles. The O3 visualization results discovered different fluctuations throughout the whole domain, with the highest fluctuation period of the O3 pollutant being during the period from 2013 to 2017. Additionally, recent studies (Nurgazy et al. 2019; Gagliardi and Andenna 2020; Ahmad et al. 2022) have used machine learning approaches for the visualization of O3 pollutants, such as the presentation of CAVisAP, a context-aware system for outdoor air pollution visualization by using internet of thing (IoT) platforms, and the exploration of surface ozone behavior by using machine learning approaches. However, different from these studies, the current analysis considers the dynamic visualization of O3 pollutants over spatial–temporal.

Visualizing of CO data

The visualization of variance–covariance 3D surfaces for spatiotemporal functional CO pollutant data is shown in Fig. 6d. Two different tools have been implemented for the visualization of spatiotemporal CO variation. The first one is a static 3-D perspective charting with a gradient heatmap, as depicted in the left panel of Fig. 6d. The second tool is a dynamic web-based interactive surface mapping, which provides enhanced animation and is shown in the right panel of Fig. 6d. It is observed that the significant central part of the high CO variability occurs in the periods from January 1990 to 2000 and it is insignificant elsewhere. In recent literature, the CO pollutant concentrations have been analyzed with other pollutants using traditional and machine learning methods. For example, Grace et al. (2020) used a traditional visualization approach to analyze CO pollutant concentrations with other pollutants. Jain and Kaur (2021) proposed machine learning and visualization techniques for the analysis of air pollution concentrations during the COVID-19 pandemic. The clustering method used in the study aims to identify and differentiate different layers within the temporal and spatial cycle of air pollution. The results of this analysis, including any functional clustering of temporal and spatial pollution variables, will be discussed in a subsequent section of the study.

Clustering the spatiotemporal functional data

Clustering is a technique used in machine learning to group similar data points together. In the context of spatiotemporal functional data of air pollutants, clustering can be used to group locations with similar air pollution patterns over time. This can be useful for identifying hotspots of pollution and for understanding the factors that contribute to air pollution in different areas. There are several different clustering algorithms that can be used, such as k-means, hierarchical clustering, and density-based clustering (Schmutz et al. 2020). The choice of algorithm will depend on the specific characteristics of the data and the research question being addressed. In order to cluster spatiotemporal functional data, it is necessary to first extract relevant features from the data that capture the patterns of interest (Hael et al. 2021). This could include measures of the overall level of pollution, the variability of pollution over time, or the similarity of pollution patterns between different locations. These features can then be used as inputs to the clustering algorithm (Shi et al. 2022). Once the clusters have been identified, various visualization and analysis techniques can be used to explore the results. This can include mapping the clusters to examine their spatial distribution, plotting the time series data for each cluster to examine their temporal patterns, and comparing the pollution levels and patterns across different clusters to identify key differences. The study presented an advanced clustering approach based on functional mixture models to effectively deal with high-dimensional, large-scale, and spatial–temporal air pollutants data. The results of this functional clustering approach provided meaningful insights into the spatial–temporal dynamics of pollutants and informative visualizations for both temporal cluster profiles and spatial cluster mapping. In this section, clustering the spatiotemporal functional data of air pollutants (SO2, PM2.5, O3, and CO) will be discussed.

Clustering of functional SO2 data

Sulfur dioxide (SO2) is a gaseous air pollutant composed of sulfur and oxygen. It can have a significant impact on human health, animal health, and the environment. In this study, the functional clustering of smooth functional SO2 pollutant levels was conducted over multiple cities in Yemen and the results were used to identify groups of cities with similar levels of pollution. The spatial distribution of these clusters is presented in Fig. 7. Cluster group 3, indicated by the color green, is identified as having the least polluted cities in terms of SO2. These cities are primarily located in southern Yemen and represent approximately half of the total cities studied, including the Al-Mahrah Governorate, Hadramout Governorate, Abyan Governorate, and Shabwah Governorate. The largest source of sulfur dioxide emissions is the burning of high-sulfur fossil fuels by heavy equipment power plants and other industrial facilities. Other sources of sulfur dioxide emissions include natural sources such as volcanoes and industrial processes such as extracting minerals from ore, as well as ships, locomotives, and other vehicles. Table 3 provides more detailed information about the memberships, spatiotemporal features, and degree of pollution for each cluster group. It is worth noting that short-term exposure to sulfur dioxide can damage the respiratory system, especially lung function, and irritate the eyes. It causes coughing and mucus secretion and exacerbates chronic bronchitis and asthma conditions. Generally, sulfur dioxide emissions led to high sulfur dioxide concentrations in the air, which can form other sulfur oxides (SOx) that can be harmful as well. Based on a thorough review of recent literature, it appears that the clustering of spatial–temporal dynamics of SO2 data using a functional data framework has not been studied or investigated before. However, a few studies have analyzed SO2 pollutant data using vectorized-based methods. For example, the study conducted by Al-Janabi et al. (2021) employed the intelligent prediction method called IFCsAP to handle SO2 pollutant data along with other pollutant variables. Another study by Kujawska et al. (2022) used an artificial neural network model to forecast sulfur dioxide levels in the air. Both of these studies were focused on forecasting SO2 pollutant levels using artificial data analytics, while the proposed functional model in this current study aims to cluster the hidden features of spatial–temporal SO2 dynamics.

Fig. 7
figure 7

a Functional clustering results of the smooth SO2 data, b its group means, and c the spatial distribution map of the obtained clusters

Table 3 Summary of the cluster characteristics for spatial–temporal SO2 patterns in cities of Yemen

Clustering of functional PM2.5 data

In this study, we aim to investigate the dynamic behavior of PM2.5 air pollutants and identify potential spatiotemporal functional clusters over various locations in Yemen. Our proposed approach will provide a meaningful result and offer a graphical interpretation of the spatiotemporal variations of PM2.5. As shown in Fig. 8, our findings indicate that the highest PM2.5 variability is concentrated in the cities located on the western side of the Red Sea, as represented by cluster 2 (red color) which is characterized by two unique peaks: a low-volume peak and large-volume peak. Clusters 4 and 1 also show considerable fluctuations throughout the entire domain. It is worth noting that the PM2.5 air pollution concentration levels significantly decreased during the COVID-19 outbreak period. Table 4 provides detailed information on the cluster memberships, spatiotemporal features, and degree of PM2.5 pollution.

Fig. 8
figure 8

a Functional clustering results of the smooth PM2.5 data, b its group means, and c the spatial distribution map of the obtained clusters

Table 4 Summary of the cluster characteristics for spatial–temporal PM2.5 patterns in cities of Yemen

It is concluded that the levels of PM2.5 in Yemen follow a periodic cycle that is controlled by meteorological factors such as temperature and solar radiation. This conclusion aligns with the findings of Abdul-Rahim et al. (2022) who found that statistical analysis revealed a positive correlation between PM2.5 concentrations and temperature for both fall and summer samples. However, the analysis also revealed a positive correlation between PM2.5 concentrations and relative humidity for fall samples and a negative correlation for summer samples. While their study only focused on a small area in Sanna city, the results may be applicable to many other cities in Yemen. Additionally, previous studies have shown that the levels of PM2.5 in Yemen are influenced by the rhythm of human activities that modulate anthropogenic emission rates. The change in PM2.5 air pollution is tied to population-weighted exposure levels (PWEL). As per the research (Li et al. 2022b), areas with high PWEL and rapid increases in PM2.5 concentrations were primarily found in developing countries such as India, Bangladesh, Nepal, and Pakistan, as well as in the developed country of Saudi Arabia, and the least developed countries of Yemen and Myanmar. Moreover, the study by Fang et al. (2020) found that the regions with the highest levels of pollution are primarily located in China, Southeast Asia, South Asia, West Asia, and North Africa, particularly in the Arabian Gulf region. The study also identified energy intensity and per capita electricity consumption as the primary drivers of PM2.5 concentrations, whereas an expanding forest area was found to significantly decrease PM2.5 concentrations. In recent years, there has been a growing body of research that has focused on studying the PM2.5 pollutant using functional data analysis. For example, a study by Wang et al. (2019) adapted the framework of functional data analysis to compare the fluctuation patterns of PM2.5 concentration between provinces in China from 1998 to 2016, both spatially and temporally. Another study by Liang et al. (2021) used a spatial-functional mixture method to model and cluster PM2.5 concentrations across China. The current study is similar to these two studies in that it also uses the same functional framework. However, there are also several recent studies that have used a different approach to cluster PM2.5 pollutant data, such as the studies by Jorquera and Villalobos (2020), Liu et al. (2020), Su et al. (2020), and Park et al. (2022). These studies differ from the current study in that they use classical analysis frameworks to process and cluster PM2.5 pollutant data.

Clustering of functional O3 data

The ozone (O3) is formed when sunlight and heat cause chemical reactions between volatile organic compounds (VOC) and nitrogen oxides (NOX), also known as hydrocarbons. These reactions can occur both near the ground, in the troposphere, and high in the stratosphere. In the stratosphere, O3 forms a protective layer that shields the Earth from harmful ultraviolet radiation from the sun, but at ground level, O3 is a harmful air pollutant (Wang et al. 2020). In Fig. 9, the functional clustering of smooth functional ground-level O3 levels in multiple cities in Yemen is presented, along with the cluster mean and the spatial distribution of the obtained clusters. The data indicates that the highest ozone concentrations are found in coastal cities and islands located on the western side of the Red Sea, the southern side of the Arabian Sea, and the Gulf of Aden. This may be due to the higher levels of pollutants, such as volatile organic compounds and nitrogen oxides, present in these areas, which contribute to ozone formation. Additionally, the unique meteorological conditions in these regions may also make them more susceptible to ozone formation. Table 5 provides a summary of the obtained cluster profiles for ozone air population levels and the degree of air pollution in cities of Yemen.

Fig. 9
figure 9

a Functional clustering results of the smooth O3 data, b its group means, and c the spatial distribution map of the obtained clusters

Table 5 Summary of the cluster characteristics for spatial–temporal ozone patterns in cities of Yemen

High ozone concentrations near ground level can have serious consequences for human health, as well as for crops, animals, and other substances. O3 is a powerful oxidant that can irritate the respiratory system, causing symptoms such as coughing, sore throat, and chest discomfort. People with asthma and other lung conditions are particularly vulnerable to the effects of ozone pollution, as it can worsen their symptoms and increase the risk of respiratory infections. Long-term exposure to O3 can also lead to inflammation and damage to the cells lining the lungs, which can increase the risk of chronic lung diseases such as bronchitis and emphysema. Additionally, high O3 concentrations can weaken the immune system’s ability to fight off bacterial infections in the respiratory tract. There are many factors that influence the development of ground-level O3, including wind direction and speed, temperature, timing cycles, and vehicle driving patterns. O3 is formed when pollutants from cars, power plants, and other sources react with sunlight, so weather conditions play a key role in determining O3 levels. O3 is typically a pollutant in the summer, when temperatures are high and sunlight is abundant, and it is a major component of smog in many urban areas during the summer months. Due to its relation to climate conditions, ground-level ozone is also known as “summer smog.” It is important to note that Ozone, though it is harmful at ground level, is beneficial in the upper atmosphere where it protects the earth from harmful UV rays.

The analysis in this study builds upon the work of Schmutz et al. (2020) by utilizing a functional clustering framework to analyze O3 pollutant curves. However, it also diverges from previous studies, such as those conducted by Pineda Rojas et al. (2019) and Saeipourdizaj et al. (2022). Pineda Rojas et al. (2019) employed traditional clustering techniques to examine the spatial patterns that lead to peak ozone hourly concentrations, using Monte Carlo outcomes as the basis for their analysis. On the other hand, Saeipourdizaj et al. (2022) utilized a classical spatiotemporal mixture model-based clustering framework to cluster days of the year 2017, based on hourly O3 amounts collected from four stations in Tabriz. This study takes a different approach, utilizing the functional clustering framework to analyze O3 pollutant curves, which sets it apart from these previous studies.

Clustering of functional CO data

In this sub-section, we will present and discuss the main results of spatiotemporal functional clustering that have been adapted to the transformed air pollination data structure. Specifically, we will examine the functional clustering findings of the smooth functional CO pollutant over multiple cities in Yemen, including the group average and the spatial distribution of the obtained clusters, as depicted in Fig. 10. Overall, the spatiotemporal functional dynamic pattern of the CO air pollutant can be divided into three distinct phases. The first phase, which spans from January 1991 to December 2001, is characterized by a prominent polluting peak with a high volume of pollution. The second phase, which begins in January 2009 and ends in December 2019, is characterized by a stable and constant polluting pattern. The final phase, which is related to the COVID-19 pandemic, is representative of the COVID-19 lockdown period, during which pollution levels decreased dramatically. It has been observed that there has been a significant decrease in carbon monoxide (CO) air pollution in Yemen starting from January 2020 to April 2022. This has led to an overall improvement in air quality in the country. The objective of the study is to identify and classify the spatiotemporal patterns of functional CO data across multiple locations in Yemen. Our proposed method has been able to provide the best partition of potential clusters, which have been divided into four main groups. Specifically, cluster group 2 (colored red) comprises three major cities in Yemen—Sanaa, Taiz, and Ibb—which are considered to be more polluted compared to other cities in the country. Following group 2, group 4 comprises moderate polluting cities located on the western sides of Yemen. Group 3 (colored green) includes cities with zero CO pollution throughout the whole domain, owing to low population density and fewer human activities. The detailed profile, characteristics, and degree of pollution for CO concentration for each cluster are listed in Table 6. The study by Grace et al. (2020) employed the commonly used method of Fuzzy c-Means clustering to examine and present the data on CO pollutants alongside other pollutants, using real-time sensor data. On the other hand, the research conducted by Jain and Kaur (2021) introduced the use of machine learning and visualization techniques for forecasting and analyzing the air quality in major cities in India, taking into account six major pollutants, including CO.

Fig. 10
figure 10

a Functional clustering results of the smooth CO data, b its group means, and c the spatial distribution map of the obtained clusters

Table 6 Summary of the cluster characteristics for spatial–temporal CO patterns in cities of Yemen

Finally, the proposed method in this study can be compared to recent approaches that utilize Big Data and intelligent computation, such as those presented by Al-Janabi et al. (2021), Al-Janabi et al. (2020b), and Al-Janabi et al. (2019). These methods aimed to predict multiple air pollution concentrations using the Intelligent Forecaster of Concentrations caused air pollution (IFCsAP) (Al-Janabi et al. 2021), a pragmatic method based on intelligent big data analytics (Al-Janabi et al. 2019), and intelligent computation (Al-Janabi et al. 2020b). However, the main difference between our method and these previous studies is that our focus is on clustering and visualizing spatial–temporal air pollutant curves through functional data approaches, while their focus was on predicting discrete air pollutant data through intelligent big data analytics. As previously stated, it is more efficient to use statistical methods that can analyze the temporal and spatial variations of pollutants over time. The functional data analysis approach enables the examination of the entire time spectrum of pollutant variables. The table below compares the current study to recent studies that employed similar methods within the functional framework. However, the functional techniques used differ based on the purpose of the study. For example, Acal et al. (2022) focused on investigating the potential impact of the COVID-19 lockdown on air quality in the Pescara-Chieti urban area in Italy, which is known for high air pollution levels. Betancourt-Odio et al. (2021) used functional data and Kendall’s functional Tao (KFT) to study the relationship between O3 pollution levels in rural and urban areas in the Spanish Community of Madrid. Their findings indicate a complex, non-linear relationship between urban and rural areas. Torres et al. (2020) compared the effectiveness of three different analytical methods in identifying pollution episodes and outliers. More information about these studies is summarized in Table 7. A comparison of the current proposed method with other approaches applied to environmental pollution data is shown in Table 7. The comparison focuses on both functional and non-functional frameworks, univariate and multivariate settings, and lists the advantages and disadvantages of each approach. The disadvantages are based on the author’s opinion, but other drawbacks may also exist. This information can be useful for understanding how the current work builds upon or differs from previous research in the field and can provide insights into the strengths and limitations of different methods.

Table 7 Comparison of the currently proposed method with other approaches applied to environmental pollution data

Conclusions and recommendations

The study aimed to visualize and cluster the dynamic behavior of multiple air pollution concentrations using functional analysis techniques and functional mixtures clustering model. The method was applied to multivariate high dimensional air pollution data from cities in Yemen from January 1980 to April 2022. Fourie transformation, B-spline functions, and generalized-cross validation were used to reconstruct and smooth data. The study used two enhanced 3D visualization tools to explore the spatiotemporal variations in the functional air pollutants cycle and a functional mixture model was used to identify and classify the spatiotemporal functional air pollutants data. The study found four substantial clusters for all functional air pollutants variables and demonstrated the ability to identify, visualize, and classify the continuous functional dynamic patterns of air pollutants (SO2, PM2.5, O3, and CO) over multi-sites in Yemen. Some main results have been concluded as follows:

  • Yemen has experienced substantial dynamic patterns of air pollution concentrations over different spatial locations from the period 1980–2022.

  • The obtained results have also provided evidence that vehicular emission is the primary source of air pollution in Yemen besides industrial activity and mixing factors are also shown to be the secondary contributing factors towards air pollution variation.

  • The functional clustering findings showed a noteworthy decline in CO emissions during the COVID-19 pandemic; additionally, the cities of Sanaa, Ibb, and Tiaz were classified as the more polluted cities in Yemen.

  • Regarding the Ground-level O3 pollutant, the results showed great fluctuations with increase and decrease during the entire domain; however, there was no effect on ozone level concentrations due to the COVID-19 pandemic period.

  • Although PM2.5 concentrations have witnessed an extremally significant increase before the COVID-19 pandemic period, they have shown a noticeable decrease during the COVID-19 pandemic period.

  • In general, the results showed that there was stability and no significant changes in SO2 levels, particularly during the last two decades.

Overall, ambient air pollution can be controlled and reduced with the implementation of strategic measures, led by sound leadership and development efforts to help emerging economies recover from past losses. Successful pollution control methods, that are technically, politically, and economically feasible for a specific country, can be shared globally to minimize air pollution. Recommendations to control and reduce air pollutants include the development and implementation of new environmental standards, the use of intervention techniques to decrease concentration, the prohibition of polluting materials and fuels in urban and rural areas, regulation of private vehicles, and an increase in public transportation. Additionally, promoting the use of clean fuels and implementing effective policies to ensure standard operating protocols in workplaces, industries, and hospitals can help control the spread of pathogenic microbes.