1. Introduction
In 1982, Ramsay [1] first proposed the definition of functional data, laying a foundation for the development of functional data analysis. In 2005, Ramsay and Silverman provided a detailed introduction to the general methods and steps of functional data analysis, including functional principal component analysis and functional linear regression models, in their book [2]. In 2012, Horváth and Kokoszka [3] focused on the inferential methods in functional data analysis.
In 2009, Shin [4] proposed a partial functional linear model (PFLM), which explores the relationship between a scalar response variable and mixed-type predictors. In 2012, Shin and Lee [5] derived the asymptotic prediction rate of the PFLM and compared it with that of other functional regression models.
In 2002, James [6] proposed generalized linear models with functional predictors and applied them to standard missing data problems. In 2005, Müller and Stadtmüller [7] proposed a generalized functional linear regression model where the response variable is a scalar and the predictor is a random function. They also considered the situation where the link and variance functions were unknown. In 2015, Shang and Cheng [8] proposed a roughness regularization approach to nonparametric inference for generalized functional linear models with known link functions. In 2019, Wong et al. [9] investigated a class of partially linear functional additive models that predict a scalar response from both the parametric effects of a multivariate predictor and the non-parametric effects of a multivariate functional predictor.
In a generalized linear model, sometimes the link function may not be known exactly, but can be assumed to be of some general ‘parametric’ form. In 1984, Scallan et al. [10] showed how generalized linear models can be extended to fit models with such link functions.
In 1994, Weisberg and Welsh [11] estimated the link function by kernel smoothing and the regression coefficients given the estimated link, and alternated between these two steps, which effectively solves the fitting problem when the link function is unknown. However, kernel smoothing may perform poorly at the boundary, so local polynomial fitting, which performs better near the boundary, was introduced.
In 1998, Chiou and Müller [12] considered the case in which the link and variance functions are unknown but smooth. They obtained consistency results for the link and variance function estimators, as well as the sampling distribution of the regression coefficients. In 2005, Chiou and Müller [13] introduced a flexible marginal modeling approach for statistical inference for clustered and longitudinal data under minimal assumptions. In their model, the predictors were longitudinal data, and the semi-parametric model, based on estimated estimating equations, was fitted by quasi-likelihood regression. They established the consistency of the estimates of the link and variance functions and the asymptotic limit distribution of the regression coefficients. There are also other methods for estimating unknown functions. In 2009, Bai et al. [14] focused on single-index models for longitudinal data. They proposed a procedure to estimate the single-index component and the unknown link function based on a combination of penalized splines and quadratic inference functions. In 2012, Pang and Xue [15] generalized single-index models to scenarios with random effects. The link function was estimated with a local linear smoother, and a new set of estimating equations modified for boundary effects was proposed to estimate the index coefficients. In 2017, Yuan and Diao [16] developed sieve maximum likelihood estimation for generalized linear models, in which the estimator of the unknown link function was assumed to lie in a sieve space. Several sieve methods, including B-spline- and P-spline-based methods, were introduced.
In 2017, Kokoszka and Reimherr [17] wrote a book that introduced the basic concepts, methods, and applications of functional data analysis. The book provided a clear and systematic overview, covering key areas such as representation, smoothing, interpolation, statistical modeling, and inference for functional data. It also included detailed explanations of practical examples and computational methods. In 2023, Rao and Reimherr [18] introduced a neural network-based nonlinear model designed to exploit the structure of functional data and fitted it with a derived functional gradient optimization algorithm, demonstrating the effectiveness of these methods for complex functional models and extending deep learning to functional data analysis.
The relationship between environmental factors and human health has been a topic of significant research interest in recent years. In 2012, Huang et al. [19] explored the relationship between temperature and years of life lost (YLL). The study found that both high and low temperatures lead to an increase in YLL, with high temperatures having a greater impact. In 2020, Yang et al. [20] applied a generalized additive model to assess the associations between daily PM2.5 exposure and YLL due to respiratory diseases in 96 Chinese cities during 2013–2016. They further estimated the avoidable YLL and potential gains in life expectancy under the assumption that daily PM2.5 levels met World Health Organization standards. In 2021, Deryugina and Molitor [21] explored the factors influencing life expectancy across the United States. The study found that individuals living in areas with severe air pollution, poor water quality, and inadequate healthcare facilities generally had shorter life expectancy and poorer health conditions.
In summary, existing models with unknown link functions have not addressed the generalized partially functional regression setting, in which the response variable is regressed on multiple functional and scalar predictors. To fill this gap, this study proposes a generalized partially functional linear model with an unknown link function. Because the link function is estimated from the data, the proposed model avoids the loss of accuracy caused by selecting an incorrect link function. Its predictors include both multiple functional and multiple scalar variables, which provides a flexible way to capture complex relationships between variables and to improve prediction and interpretation.
The paper is organized as follows. The published works and definitions referred to in the proofs of the theorems are introduced in Section 2. The abbreviations used in the article are introduced in Section 3.1. The generalized partial functional linear model with an unknown link function is proposed in Section 3.2. The estimation of the regression coefficients and the link function is discussed in Section 3.3. In Section 4, the asymptotic normality of the estimators is derived. Simulation results are reported in Section 5. The average life expectancy study of 58 cities in China is given in Section 6. In Section 7, a brief summary, the limitations of the research, possible applications, and future directions are presented.
2. Preliminaries
In this section, we provide an overview of the published works and definitions that are relevant to our research. These preliminary concepts and references lay the foundation for a better understanding of the subsequent discussion.
(1) In 1982, Mack and Silverman [22] provided a comprehensive analysis of the weak and strong uniform consistency properties of kernel regression estimates, highlighting their theoretical properties and practical significance in non-parametric regression modeling. We directly apply Proposition 4 of their work as Lemma 1 in the proof of Theorem 1 in this paper.
(2) In 1995, Masry and Tjøstheim [23] discussed the estimation and identification of nonlinear time series of ARCH type. They provided an estimation method that yields consistent parameter estimates, proved its asymptotic normality, and explored model identification methods. Their results are important for modeling and analyzing financial time series. Theorem 3.3 in their work is used to prove Theorem 1 in this paper.
(3) In 1999, Chiou and Müller [24] studied non-parametric quasi-likelihood methods. They provided the theoretical derivation of the method and explored its applications in statistical inference. Theorem 4.1 in their paper is used to prove Lemma 2 and Lemma 3 for Theorem 2 in this paper.
(4) In 2021, Xiao et al. [25] proposed a generalized partially functional linear regression model in which the response variable is 0 or 1 and the predictors are multiple functional and scalar variables, and established the asymptotic properties of the estimated coefficients. The proof method of Theorem 1 in [25] is used to prove Theorem 2 in this work.
3. Model and Estimation
The data we observe for the i-th subject are We assume that these data are independent, identically distributed (i.i.d) copies of . For the functional predictor is a random curve. are samples of and are square integrable on a real bounded interval T, i.e., . refers to the space of square integrable functions defined on T. And the scalar predictor vector is a q dimensional random vector. The response Y is a real-valued random variable that may be binary or count.
3.1. Abbreviation Introduction
Table 1 is a list of the abbreviations we use in this work along with their corresponding full forms:
3.2. Model
We establish a model for the relationship between the response variable
and the predictors
and
:
where
is the regression coefficient function that needs to be estimated for the functional predictors
;
is a
q dimensional vector with the elements to be the regression coefficients for the scalar predictors
that need to be estimated, i.e.,
. Here
is i.i.d copies of
, which is the random error variable and
,
, where
The relationship between the response variable
Y and
is established through
, i.e.,
.
is the link function that is unknown and needs to be estimated in this paper.
Let
be a variance function that satisfies
for a constant
, such that
To reduce the dimensionality of the functional predictors , we adopt the method of FPCA in this paper. First, we need to standardize the original data by centering them, so that and .
By the Karhunen–Loève (KL) expansion and Mercer’s theorem,
can be expanded as
where
represents the functional principal component scores, and
are called functional principal components, which are the eigenfunctions of the covariance operator of
. Notice that
form an orthonormal basis for the function space
. Then the regression coefficient function
can be expanded as
where
represents the coefficients of the regression coefficient function with respect to this basis.
After plugging the above two expansions into (
1), we have
In (
4), we truncated the predictors at
(depending on sample size
n), and
increases asymptotically with
.
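The FPCA truncation described above can be sketched numerically as follows. This is only an illustration under stated assumptions: the curves are observed on a common, equally spaced grid, integrals are approximated by Riemann sums, and the function name `fpca` and the cumulative-variance threshold argument are hypothetical names introduced here, not notation from this paper.

```python
import numpy as np

def fpca(X, t, var_explained=0.85):
    """FPCA sketch for curves X of shape (n, m) on an equally spaced grid t."""
    dt = t[1] - t[0]
    Xc = X - X.mean(axis=0)                      # center the curves
    cov = Xc.T @ Xc / X.shape[0]                 # sample covariance on the grid
    evals, evecs = np.linalg.eigh(cov * dt)      # discretized covariance operator
    order = np.argsort(evals)[::-1]              # sort eigenvalues, largest first
    evals = np.clip(evals[order], 0.0, None)
    evecs = evecs[:, order]
    ratio = np.cumsum(evals) / evals.sum()       # cumulative variance explained
    k = min(int(np.searchsorted(ratio, var_explained)) + 1, len(evals))
    phi = evecs[:, :k] / np.sqrt(dt)             # eigenfunctions, integral of phi^2 = 1
    scores = Xc @ phi * dt                       # scores: integral of Xc_i(t) phi_k(t)
    return evals[:k], phi, scores
```

The returned score matrix is what replaces each functional predictor in the truncated model; the number of retained columns plays the role of the truncation level.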
3.3. Estimation
Define a parameter vector
, where
For the estimation of the parameter vector and the link function g, we use an iterative estimation method to obtain the final estimates. Let there exist a constant ; with this c and n, we can define . The norm of finite dimensional spaces used in this paper is the Euclidean norm. The overall iterative process is briefly described below:
Step 1 To obtain the estimate
of
by solving Equation (
5), it is assumed that the link function
is known. The link function
is required to be twice continuously differentiable to ensure the existence of the Hessian matrix; moreover, the variance function
is defined on the range of the link function and is strictly positive.
where
,
near
,
near
,
and
Here,
and
represent the corresponding estimated value in step 1 but not the final estimate.
We introduce the following matrix:
Then, Equation (
5) can be expressed in matrix form, i.e.,
We can solve it by the weighted least squares method. A Taylor expansion of
, where
and then we can get
where
. Simplification yields estimates
where
,
.
Step 2 By local linear regression, the estimates , of the link functions g, are obtained.
Let the bandwidth
of the kernel function
converge to zero and define
. Since the convergence rates of
and
are different, their bandwidth choices should also be different. Let
denote the bandwidth of
, and
denote the bandwidth of
, but in this paper, for simplicity, the bandwidth
is chosen. Let the distributions of both the functional predictors
and the scalar predictors
Z belong to a compact support set
U, and we have
. To simplify the expression, we let
,
. For a fixed
, apply the method of local linear regression to obtain an initial estimate of
and
for
g and
, respectively. We minimize the weighted sum of squares at any point
u, and the formula for calculating the weighted sum of squares is
Through minimizing (
6), we can obtain
and
, and they can be represented as
,
, where
Step 3 Using the method of
Step 1, the link function is replaced by the estimated link functions
and
, where
. To update
, solve the estimation equation (
5) for
. From this we can obtain the estimated value of
Step 4 Using the method in Step 2, the parameter vector is replaced by the estimated , where From this we obtain the estimates and for g and , where
Step 5 Repeat Steps 3 and 4 until convergence, and then stop the iteration.
Step 6 The final estimate of the regression coefficient is obtained as , and the estimate of the link function g is obtained as .
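The alternating scheme in Steps 1 to 6 can be sketched in a few lines. The code below is a simplified illustration, not the exact estimator of this section: it assumes a constant variance function, a Gaussian kernel, a single common bandwidth `h`, a Gauss-Newton update in place of the weighted least squares step, an identity link as the starting value, and a unit-norm constraint on the parameter vector for identifiability. The names `local_linear` and `fit_unknown_link` are hypothetical, and the design matrix `X` stands for the stacked FPC scores and scalar predictors.

```python
import numpy as np

def local_linear(u0, u, y, h):
    """Local linear estimates of g(u0) and g'(u0) with a Gaussian kernel."""
    d = u[None, :] - np.atleast_1d(u0)[:, None]   # differences u_i - u0
    w = np.exp(-0.5 * (d / h) ** 2)               # kernel weights
    s0, s1, s2 = w.sum(1), (w * d).sum(1), (w * d * d).sum(1)
    t0, t1 = (w * y).sum(1), (w * d * y).sum(1)
    det = s0 * s2 - s1 ** 2
    # Closed-form solution of the 2x2 weighted least squares problem.
    return (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det

def fit_unknown_link(X, y, h, max_iter=100, tol=0.01):
    """Alternate between estimating the link and updating the coefficients."""
    # Assumed start: identity link, so the first step is ordinary least squares.
    theta = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
    theta /= np.linalg.norm(theta)
    for _ in range(max_iter):
        eta = X @ theta
        g, dg = local_linear(eta, eta, y, h)      # link and derivative at eta_i
        D = dg[:, None] * X                       # derivative of g(X theta) in theta
        step = np.linalg.lstsq(D, y - g, rcond=None)[0]
        new = (theta + step) / np.linalg.norm(theta + step)
        converged = np.linalg.norm(new - theta) < tol
        theta = new
        if converged:
            break
    return theta
```

After convergence, calling `local_linear` once more on a grid of index values with the final coefficient vector gives the estimated link curve. The stopping rule (at most 100 iterations, or a coefficient change below 0.01) mirrors the one used in the simulation section.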
5. Simulation
We consider a binary response and two functional predictors as well as three scalar predictors. The functional predictors and () are observed at 50 equally spaced time points on the interval .
The sample sizes are
. Let the score coefficients
for each functional predictor satisfy the following assumptions:
where
.
where
.
We define the orthonormal basis functions
and
,
, which satisfy
Then,
can be represented through the Karhunen–Loève expansion as follows:
Figure 1 shows the 50 trajectories of the two functional predictors
and
.
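To illustrate how trajectories of this kind can be generated from a truncated Karhunen–Loève expansion, the following sketch draws random scores and combines them with an orthonormal basis. The specific choices here are assumptions made only for the example and are not the settings of this simulation: a sine basis on [0, 1], four basis functions, and independent normal scores with variances 1/k^2. The function name `simulate_curves` is hypothetical.

```python
import numpy as np

def simulate_curves(n, n_grid=50, n_basis=4, seed=0):
    """Draw n curves X_i(t) = sum_k xi_ik * phi_k(t) on an equally spaced grid."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_grid)                     # assumed interval [0, 1]
    k = np.arange(1, n_basis + 1)
    phi = np.sqrt(2.0) * np.sin(np.pi * np.outer(k, t))   # assumed orthonormal basis
    lam = 1.0 / k ** 2                                    # assumed score variances
    xi = rng.normal(size=(n, n_basis)) * np.sqrt(lam)     # FPC scores
    return t, xi, xi @ phi                                # grid, scores, curves

t, xi, X = simulate_curves(50)
```

Each row of `X` is one trajectory evaluated at the 50 grid points, so plotting the rows against `t` gives a picture comparable in form to Figure 1.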
The scalar predictor
satisfies the following assumption
We assume that the regression coefficient functions of the functional predictors satisfy the following assumption
where
and
. Moreover, we assume that the regression coefficients
of the scalar predictors satisfy
,
,
.
And we select the link function as
We generate the binary response
as a pseudo-random sequence.
We obtain a sample
where
n is the sample size. The numbers of functional principal components that explain 85% of the cumulative variance are
,
, respectively. We run 100 simulations.
Figure 2 shows the asymptotic behavior of the link function under different sample sizes. The black lines in
Figure 2 show the relationship between
and
, where
The additional colored lines shown in
Figure 2 represent the estimated link function
for different sample sizes. These lines are obtained through iterative processes, starting with an initial value of
g set to
. The iterative process continues until one of the following conditions is met: 100 iterations have been performed, or the error in the regression coefficients is less than 0.01. The purpose of these lines is to illustrate the relationship between
and
, where
Since in this case, both
and
are in
, we denote the argument of
g and
by
, and the x-axis in
Figure 2 is denoted by
and is shown in the interval
.
Table 2 presents the estimates of
evaluated through RMISE under different sample sizes. The RMISE is defined as follows:
where
is the number of simulations here. In summary,
Figure 2 and
Table 2 demonstrate that as the sample size increases, the estimated link function
becomes closer and closer to the true link function
g.
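A criterion of this type can be computed as in the sketch below. It assumes one common definition of the root mean integrated squared error, namely the square root of the integrated squared error averaged over simulation runs, with the integral approximated by a Riemann sum on an equally spaced grid; the exact formula used for Table 2 may differ in normalization. The function name `rmise` is hypothetical.

```python
import numpy as np

def rmise(g_hat, g_true, u):
    """Root mean integrated squared error over simulation runs.

    g_hat:  array (S, m), estimated curves from S runs evaluated on the grid u
    g_true: array (m,), true curve on the same equally spaced grid
    """
    du = u[1] - u[0]
    ise = ((g_hat - g_true) ** 2).sum(axis=1) * du   # integrated squared error per run
    return float(np.sqrt(ise.mean()))                # root of the mean over runs
```

With this definition, a value that shrinks as the sample size grows indicates that the estimated curves concentrate around the true curve.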
In
Table 3, it can be seen that both the SD and RMISE of the estimated regression coefficient functions
and
decrease as the sample size
n increases.
Figure 3 displays the estimated functional regression coefficients
and
, as well as their 95% confidence intervals under different sample sizes. The red curve in the figure represents the theoretical values of
and
, while the blue curve represents the estimated values
and
. The gray shaded area represents the 95% confidence interval of the estimates. It can be seen that as the sample size increases, the estimated values become closer to the true values.
Table 4 presents the estimated scalar regression coefficient
and corresponding standard deviation under different sample sizes. It can be seen that as the sample size
n increases,
becomes closer to the true values
. Moreover, as the sample size
n increases, the SD becomes smaller, indicating that the estimates become more precise.
Table 5 presents the M1 and M2 values for different sample sizes, where
, MAE=
,
, MSE =
, and
and
represent the real and the predicted values of the response variable, respectively. We can find that as the sample size increases, the values of M1 and M2 become smaller, indicating that the predictive performance of the model improves.
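For reference, the two error measures can be computed as follows. The sketch uses the standard definitions of mean absolute error and mean squared error for a single data set; how these are aggregated over simulation runs into M1 and M2 is not assumed here, and the function names `mae` and `mse` are illustrative.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error between observed and predicted responses."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs(y - y_hat)))

def mse(y, y_hat):
    """Mean squared error between observed and predicted responses."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

y_obs = [1, 0, 1, 1]
y_pred = [0.9, 0.2, 0.4, 1.0]
print(mae(y_obs, y_pred), mse(y_obs, y_pred))
```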
6. Application
As is well known, research on average life expectancy is crucial for social development, health policies, and population management. Studies on average life expectancy can help governments, health departments, and social institutions develop relevant policies and plans to improve people’s quality of life and health conditions. By understanding people’s life expectancy, the efficiency of healthcare systems and the effectiveness of social welfare and public health policies can be evaluated, providing a basis for resource allocation and planning. Additionally, research on average life expectancy can also help people understand population structure and trends, providing references for social-economic development, pension systems, and labor market planning. Therefore, in the application of our proposed model, we investigate factors that influence average life expectancy, including air quality index (AQI), temperature, GDP, and number of beds in hospitals.
6.1. Data Description
We collected average daily temperature (Temp) data for 58 cities in China in 2020 from the National Meteorological Science Data Sharing Service Platform, and average daily Air Quality Index (AQI) data from the National Environmental Monitoring Station. We also collected GDP, number of beds in hospitals, and life expectancy data for each city from local statistical bulletins and government documents. There are two functional predictors, the daily AQI and the daily temperature from 1 January to 31 December 2020 (366 days) in the 58 cities, and two scalar predictors, the GDP and the number of beds in hospitals of the 58 cities in 2020. The response variable is the life expectancy of residents in each city in 2020.
Figure 4 shows the daily AQI and temperature for 58 cities in 2020.
6.2. Data Analysis
According to a report released by the National Health Commission, the average life expectancy of Chinese residents in 2020 was 77.9 years. Therefore, we dichotomize the response variable as follows: when the life expectancy of a city is greater than 77.9 years, we code it as 1; when it is less than 77.9 years, we code it as 0. For the functional predictors, we first center the data. Second, we conduct FPCA and select the number of functional principal components that explain 75% of the variation. The number of components for AQI and temperature is and , respectively. We use GCV to demonstrate the predictive accuracy of the estimators. In this application, .
6.3. Results Analysis
By inputting the data into the generalized partially functional linear model, we obtain the regression coefficient function
for the functional predictors and the regression coefficients
for the scalar predictors. The results are shown in
Table 6 and
Figure 5, respectively.
Table 6 presents the estimated values of the regression coefficient
for scalar predictor variables. We can see that both GDP and number of beds in hospitals have a positive relationship with life expectancy, and are significant at the 5% level. This means that when a region has a higher GDP and more hospital beds, the life expectancy in that region is longer. In other words, the better the economic development and medical resources of a region, the longer the life expectancy.
In
Figure 5, we see the estimated values of the regression coefficient function
. For AQI, we find an overall negative relationship between AQI and life expectancy: the higher the AQI, the more serious the air pollution and the lower the corresponding life expectancy. However, there is a noticeable positive trend from February to April, which may be influenced by other external factors. For temperature, we find that its effect on life expectancy varies with the seasons. In spring, summer, and fall (March to October), temperature is negatively associated with life expectancy, and in winter (November to February), it is positively associated, which is consistent with the conclusion in Huang et al. [
19].
To confirm the necessity of leaving the link function unspecified, we consider a model without a link function and a model with the logit link function (i.e.,
) and compare them with our proposed model with an unknown link function. To evaluate the prediction performance of the three models, we use MAE, MSE, and
. Additionally, we calculate the accuracy from the confusion matrix, where TP is the number of samples correctly classified as positive, TN is the number of samples correctly classified as negative, FP is the number of samples incorrectly classified as positive, and FN is the number of samples incorrectly classified as negative (missed detections). The model’s accuracy is obtained as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Smaller values of MAE and MSE indicate a smaller prediction error and better performance. When
is closer to 1, it indicates that the model has a stronger ability to explain the response variable. The experimental results are shown in
Table 7. It can be seen that the model we proposed has the best performance.
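The accuracy computation from the confusion matrix can be illustrated as follows: accuracy is the share of correctly classified samples, (TP + TN) divided by the total number of samples. The sketch assumes labels coded as 0 and 1; the function names and the toy labels are made up for the example and are not the data behind Table 7.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))      # correctly classified as positive
    tn = int(np.sum(~y_true & ~y_pred))    # correctly classified as negative
    fp = int(np.sum(~y_true & y_pred))     # incorrectly classified as positive
    fn = int(np.sum(y_true & ~y_pred))     # incorrectly classified as negative
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Share of correctly classified samples."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

# Toy labels, for illustration only.
print(accuracy([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```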
7. Conclusions
This article proposes a generalized partially functional linear model for a scalar response with predictors that include both functional and scalar components, without specifying a link function. We use functional principal component analysis to reduce the dimensionality of the functional data, estimate the regression coefficients using the maximum likelihood estimation method, estimate the link function by local linear regression, iterate between these two steps to obtain the final estimator, and establish the asymptotic normality of the estimator. The accuracy of the proposed model is validated through simulation studies.
The article applies the proposed model to the study of average life expectancy. Using daily AQI, temperature, GDP, and number of beds in hospitals for 58 cities in China in 2020, the study explores the impact of environmental, economic, and medical factors on life expectancy. The results indicate that GDP and number of beds in hospitals have a positive correlation with the life expectancy, while the AQI has an overall negative correlation. Temperature has a negative correlation with the average life expectancy in spring, summer, and autumn, and a positive correlation in winter. Overall, the study concludes that the average life expectancy is higher in areas with better environmental, economic, and medical development.
This model can be used in various fields, including economics, biomedicine, and engineering. However, it still has certain limitations. For example, the relationship between air quality and temperature needs to be considered further, since the two are correlated: in general, an increase in temperature can intensify the volatilization and diffusion of pollutants in the air, thereby lowering air quality. In the next phase of research, we will consider interactions between functional predictors to make the results more accurate. In addition, the algorithms and optimization methods of the model can be further improved to enhance computational efficiency, and combining this model with other machine learning methods may further improve predictive performance.