THE PERFORMANCE OF COUNT PANEL DATA ESTIMATORS: A SIMULATION STUDY AND APPLICATION TO PATENTS IN ARAB COUNTRIES

This paper presents four estimators of count panel data (CPD) models: fixed effects Poisson (FEP), random effects Poisson (REP), fixed effects negative binomial (FENB), and random effects negative binomial (RENB). For the FEP and FENB models, we use the conditional maximum likelihood (CML) estimation method, while for the REP and RENB models we use the maximum likelihood (ML) estimation method. We conduct a Monte Carlo simulation study to compare the behavior of these estimators in the four models. The simulation results show that FENB is the best estimator compared with the others (FEP, REP, and RENB), because it has the minimum values of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), especially when the data are overdispersed. Moreover, a real dataset is used to study the effect of some economic variables on the number of patents in seven Arab countries over the period from 2000 to 2016. The application results indicate that RENB is the suitable model for these data, and that the statistically significant variable affecting the number of patents is the gross domestic product per capita.

AHMED H. YOUSSEF, MOHAMED R. ABONAZEL, ELSAYED G. AHMED


INTRODUCTION
Recently, panel (or longitudinal) data sets have become one of the most active areas in the econometrics literature, owing to new sources of data that observe cross-sections of individuals over time. Such data allow constructing and testing more realistic behavioral models that could not be identified using a single cross-section or a single time series. Panel data analysis is therefore a core field in modern econometrics and multivariate statistics. Panel data sets have become widely available, and many recent studies have analyzed them. For example, Baltagi [1] stated that panel data refers to the pooling of observations on a cross-section of households, countries, firms, etc., over several time periods.
According to Vijayamohanan [2], panel data refers to a data set containing observations on multiple phenomena over multiple time periods; it has two dimensions: a spatial (cross-sectional) dimension and a temporal (time-series) dimension. Greene [3] pointed out that the analysis of panel data is one of the most important and common topics in economics, because it allows great flexibility in modeling differences in behavior across individuals and provides rich sources of information and a rich environment for the development of estimation techniques. Furthermore, researchers use time-series cross-sectional data to examine issues that could not be studied with either cross-sectional or time-series data alone. The analysis of panel data also allows the model builder to learn about economic processes while accounting for both heterogeneity across individuals, firms, countries, etc., and dynamic effects that are not visible in cross-sections.
Abonazel [4] explained that pooling cross-sectional and time-series data (panel data) permits a deeper analysis of the data and gives a richer source of variation, which allows more efficient estimation of the parameters and makes it easier to identify and estimate effects that are simply not detectable in cross-sectional or time-series data. Panel data sets are also more effective in studying complex issues of dynamic behavior.
Panel data models have become increasingly popular among applied researchers due to their heightened capacity for capturing the complexity of human behavior compared with cross-sectional or time-series data models. We therefore discuss the two most popular models in panel data modeling: the fixed effects and random effects models.
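In general, the fixed effects model has different intercepts: the intercept differs from unit to unit but is fixed over time. The general form of the fixed effects model is [5,6,7]:
$$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}, \quad i = 1, \dots, N; \; t = 1, \dots, T. \qquad (1)$$
The random effects model [6,8] is given by:
$$y_{it} = \alpha + x_{it}'\beta + u_i + \varepsilon_{it}. \qquad (2)$$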

POISSON PANEL MODELS
The most common probability model for CPD is the Poisson panel model. In the Poisson distribution the mean and the variance are equal, so the higher the mean of the distribution, the greater the variance or variability in the data [14]. The Poisson panel model assumes that the dependent variable ($y_{it}$) has a Poisson distribution. The probability mass function of $y_{it}$ with parameter $\lambda$ can be expressed as:
$$f(y_{it}; \lambda) = \frac{e^{-\lambda}\lambda^{y_{it}}}{y_{it}!}, \quad y_{it} = 0, 1, 2, \dots, \qquad (3)$$
where $y_{it}$ represents a variable consisting of count values and $\lambda > 0$.
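As a small illustration of the equality of mean and variance described above, the following Python sketch evaluates the Poisson probability mass function and checks the two moments numerically. It is only an illustration: the study itself used R, and the function names and the value 3.5 are hypothetical choices made here.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """Poisson probability mass function for a count y and rate lam > 0."""
    if y < 0 or lam <= 0:
        raise ValueError("y must be a non-negative integer and lam must be positive")
    # Work on the log scale to avoid overflow in lam**y and y!
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def poisson_mean_and_variance(lam: float, upper: int = 200):
    """Mean and variance obtained by summing the pmf over 0..upper."""
    probs = [poisson_pmf(y, lam) for y in range(upper + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Illustrative rate (an assumption, not a value from the study)
mean, var = poisson_mean_and_variance(3.5)
print(mean, var)  # both should be numerically close to the rate
```

Count data whose sample variance clearly exceeds the sample mean depart from this property, which is the overdispersion case that motivates the negative binomial models.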

Fixed Effects Poisson Model
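In the FEP model, all characteristics that are not time-varying are captured by the individual effects ($\alpha_i$). The intercept (constant term) is merged into $\alpha_i$, hence the explanatory variables ($x_{it}$) do not contain an intercept [15]. The conditional probability function of the FEP model is:
$$f(y_{it} \mid x_{it}, \alpha_i) = \frac{\exp(-\alpha_i\lambda_{it})\,(\alpha_i\lambda_{it})^{y_{it}}}{y_{it}!}, \qquad (4)$$
where $\lambda_{it} = \exp(x_{it}'\beta)$. The last equality specifies an exponential functional form. To estimate the parameters of model (4), one can use the CML estimation method developed by Hausman et al. [16]. Since $y_{it}$ and $\sum_{t=1}^{T} y_{it}$ both follow Poisson distributions, the conditional joint density function (CJDF) for the $i$th observation is:
$$f\left(y_{i1}, \dots, y_{iT} \,\Big|\, \sum_{t=1}^{T} y_{it}\right) = \frac{\left(\sum_{t=1}^{T} y_{it}\right)!}{\prod_{t=1}^{T} y_{it}!} \prod_{t=1}^{T} \left(\frac{\lambda_{it}}{\sum_{s=1}^{T}\lambda_{is}}\right)^{y_{it}}.$$
Taking the logarithm of the CJDF and summing over all individuals, the conditional log-likelihood is:
$$\ln L_c(\beta) = \sum_{i=1}^{N}\left[\ln\Gamma\left(\sum_{t=1}^{T} y_{it} + 1\right) - \sum_{t=1}^{T}\ln\Gamma(y_{it} + 1) + \sum_{t=1}^{T} y_{it}\ln\left(\frac{\lambda_{it}}{\sum_{s=1}^{T}\lambda_{is}}\right)\right].$$
The estimated parameters of the FEP model are obtained by solving:
$$\frac{\partial \ln L_c(\beta)}{\partial \beta} = 0.$$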
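The conditioning argument can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: the conditional likelihood is written in the multinomial form of Hausman et al. [16] with $\lambda_{it} = \exp(x_{it}'\beta)$, the function names and the toy data are hypothetical, and a one-dimensional golden-section search stands in for a proper optimizer.

```python
import math

def fep_conditional_loglik(beta, panels):
    """Conditional log-likelihood of the fixed effects Poisson model.

    panels is a list of (y, X) pairs, one per unit: y is a list of T counts
    and X is a list of T covariate rows (no intercept column).
    """
    total = 0.0
    for y, X in panels:
        # lambda_it = exp(x_it' beta); the individual effect has cancelled out
        lam = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        lam_sum = sum(lam)
        total += math.lgamma(sum(y) + 1)
        for y_it, lam_it in zip(y, lam):
            total += y_it * math.log(lam_it / lam_sum) - math.lgamma(y_it + 1)
    return total

def maximize_1d(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2

# Hypothetical toy panel: two units, three periods, one covariate
panels = [
    ([1, 2, 6], [[0.0], [1.0], [2.0]]),
    ([0, 1, 3], [[0.0], [1.0], [2.0]]),
]
beta_hat = maximize_1d(lambda b: fep_conditional_loglik([b], panels), -5.0, 5.0)
print(beta_hat)
```

Because the individual effects do not appear anywhere in `fep_conditional_loglik`, the estimate of $\beta$ does not require estimating them, which is the point of the conditional approach.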

Random Effects Poisson Model
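In the REP model, the individual effects (unobserved heterogeneity) are expressed as $\nu_i$ instead of $\alpha_i$, while the intercept is included and merged into $x_{it}$. The individual effect must follow a specified distribution in order to estimate the parameters of the REP model. Many researchers have therefore assumed that the individual effect in the REP model has a gamma distribution with parameters ($\delta$, $\delta$); see e.g. [5,14,16,17,18].
The REP model assumes that the response variable ($y_{it}$) has a Poisson distribution and that the individual effect has a gamma distribution; the ML estimation method is then used to estimate the parameters of the REP model. The likelihood function for the $i$th observation is:
$$L_i(\beta, \delta) = \left[\prod_{t=1}^{T}\frac{\lambda_{it}^{y_{it}}}{y_{it}!}\right] \left(\frac{\delta}{\sum_{t=1}^{T}\lambda_{it} + \delta}\right)^{\delta} \left(\sum_{t=1}^{T}\lambda_{it} + \delta\right)^{-\sum_{t=1}^{T} y_{it}} \frac{\Gamma\left(\sum_{t=1}^{T} y_{it} + \delta\right)}{\Gamma(\delta)},$$
and the log-likelihood function is:
$$\ln L(\beta, \delta) = \sum_{i=1}^{N}\left[\sum_{t=1}^{T}\left(y_{it}\ln\lambda_{it} - \ln\Gamma(y_{it} + 1)\right) + \delta\ln\left(\frac{\delta}{\sum_{t=1}^{T}\lambda_{it} + \delta}\right) - \sum_{t=1}^{T} y_{it}\,\ln\left(\sum_{t=1}^{T}\lambda_{it} + \delta\right) + \ln\Gamma\left(\sum_{t=1}^{T} y_{it} + \delta\right) - \ln\Gamma(\delta)\right].$$
The estimated parameters of this model are obtained by solving:
$$\frac{\partial \ln L(\beta, \delta)}{\partial \beta} = 0, \qquad \frac{\partial \ln L(\beta, \delta)}{\partial \delta} = 0.$$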
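The Poisson-gamma mixture can be written down directly. The sketch below assumes, as a labeled assumption, a gamma individual effect with shape and rate both equal to `delta` (so its mean is one), following the form in Hausman et al. [16]; the function names and the toy panel are hypothetical.

```python
import math

def rep_loglik_unit(y, lam, delta):
    """Log-likelihood contribution of one unit in the random effects Poisson
    model, assuming a gamma(delta, delta) individual effect (shape, rate)."""
    lam_sum, y_sum = sum(lam), sum(y)
    ll = sum(y_t * math.log(l_t) - math.lgamma(y_t + 1) for y_t, l_t in zip(y, lam))
    ll += delta * math.log(delta / (lam_sum + delta))
    ll -= y_sum * math.log(lam_sum + delta)
    ll += math.lgamma(y_sum + delta) - math.lgamma(delta)
    return ll

def rep_loglik(beta, delta, panels):
    """Sum of unit contributions; X rows include a leading 1 for the intercept."""
    total = 0.0
    for y, X in panels:
        lam = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        total += rep_loglik_unit(y, lam, delta)
    return total

# Hypothetical toy panel: two units, three periods, intercept plus one covariate
panels = [
    ([1, 2, 6], [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]),
    ([0, 1, 3], [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]),
]
print(rep_loglik([0.1, 0.8], 2.0, panels))
```

In practice this function would be passed to a numerical optimizer over `beta` and `delta`; as `delta` grows, the gamma effect concentrates at one and the contribution approaches that of an ordinary Poisson model.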

NEGATIVE BINOMIAL PANEL MODELS
The negative binomial model is one of the basic models for count data analysis. It has found widespread use in the health, social, economic, and physical sciences when the response variable takes the form of non-negative integers or counts [19].

Fixed Effects Negative Binomial Model
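The FENB model assumes that, for a given unit $i$, the response variable ($y_{it}$) is independent over time and $\sum_{t=1}^{T} y_{it}$ has a negative binomial distribution with parameters $\theta_i$ and $\sum_{t=1}^{T}\lambda_{it}$. These assumptions imply that:
$$E(y_{it}) = \theta_i\lambda_{it}, \qquad Var(y_{it}) = (1 + \theta_i)\,\theta_i\lambda_{it},$$
where $\theta_i = \alpha_i / \phi_i$. Hausman et al. [16] showed that the CJDF of the FENB model for the $i$th observation is:
$$f\left(y_{i1}, \dots, y_{iT} \,\Big|\, \sum_{t=1}^{T} y_{it}\right) = \frac{\Gamma\left(\sum_{t=1}^{T}\lambda_{it}\right)\Gamma\left(\sum_{t=1}^{T} y_{it} + 1\right)}{\Gamma\left(\sum_{t=1}^{T}\lambda_{it} + \sum_{t=1}^{T} y_{it}\right)} \prod_{t=1}^{T}\frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it})\,\Gamma(y_{it} + 1)},$$
where $\Gamma(\cdot)$ is the gamma function. To estimate the parameters of this model, Hausman et al. [16] used the CML estimation method. The CML estimates are thus obtained by maximizing the following conditional log-likelihood function:
$$\ln L_c(\beta) = \sum_{i=1}^{N}\left[\ln\Gamma\left(\sum_{t=1}^{T}\lambda_{it}\right) + \ln\Gamma\left(\sum_{t=1}^{T} y_{it} + 1\right) - \ln\Gamma\left(\sum_{t=1}^{T}\lambda_{it} + \sum_{t=1}^{T} y_{it}\right) + \sum_{t=1}^{T}\left(\ln\Gamma(\lambda_{it} + y_{it}) - \ln\Gamma(\lambda_{it}) - \ln\Gamma(y_{it} + 1)\right)\right].$$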
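A minimal Python sketch of the FENB conditional log-likelihood follows. It assumes the gamma-function form given by Hausman et al. [16] with $\lambda_{it} = \exp(x_{it}'\beta)$; the function names and toy data are hypothetical, and the sketch is an illustration rather than the R code used in the study.

```python
import math

def fenb_conditional_loglik_unit(y, lam):
    """Log of the conditional joint density of one unit's counts given their sum."""
    lam_sum, y_sum = sum(lam), sum(y)
    ll = math.lgamma(lam_sum) + math.lgamma(y_sum + 1) - math.lgamma(lam_sum + y_sum)
    for y_t, l_t in zip(y, lam):
        ll += math.lgamma(l_t + y_t) - math.lgamma(l_t) - math.lgamma(y_t + 1)
    return ll

def fenb_conditional_loglik(beta, panels):
    """Sum of unit contributions; panels is a list of (y, X) pairs."""
    total = 0.0
    for y, X in panels:
        lam = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        total += fenb_conditional_loglik_unit(y, lam)
    return total

# Hypothetical toy panel: two units, three periods, one covariate
panels = [
    ([1, 2, 6], [[0.0], [1.0], [2.0]]),
    ([0, 1, 3], [[0.0], [1.0], [2.0]]),
]
print(fenb_conditional_loglik([0.5], panels))
```

As in the Poisson case, conditioning on the unit total removes the unit-specific parameter from the expression, so only $\beta$ remains to be estimated.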

Random Effects Negative Binomial Model
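For the RENB model, Hausman et al. [16] assumed that the dependent variable ($y_{it}$) is independent and identically distributed negative binomial, and that $1/(1 + \theta_i)$ follows a beta distribution with parameters ($a$, $b$), with $\theta_i$ defined as in the FENB model.
The ML estimates of the RENB model are obtained by maximizing the following log-likelihood function:
$$\ln L(\beta, a, b) = \sum_{i=1}^{N}\left\{\ln\Gamma(a + b) + \ln\Gamma\left(a + \sum_{t=1}^{T}\lambda_{it}\right) + \ln\Gamma\left(b + \sum_{t=1}^{T} y_{it}\right) - \ln\Gamma(a) - \ln\Gamma(b) - \ln\Gamma\left(a + b + \sum_{t=1}^{T}\lambda_{it} + \sum_{t=1}^{T} y_{it}\right) + \sum_{t=1}^{T}\left[\ln\Gamma(\lambda_{it} + y_{it}) - \ln\Gamma(\lambda_{it}) - \ln\Gamma(y_{it} + 1)\right]\right\}.$$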
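The RENB likelihood can be sketched in the same style. The code below assumes the beta-mixed negative binomial form of Hausman et al. [16], in which the negative binomial probability parameter of each unit follows a beta distribution with parameters `a` and `b`; the function names, parameter values, and toy panel are hypothetical.

```python
import math

def renb_loglik_unit(y, lam, a, b):
    """Log-likelihood contribution of one unit in the random effects negative
    binomial model, assuming a beta(a, b) mixing distribution."""
    lam_sum, y_sum = sum(lam), sum(y)
    ll = (math.lgamma(a + b) + math.lgamma(a + lam_sum) + math.lgamma(b + y_sum)
          - math.lgamma(a) - math.lgamma(b)
          - math.lgamma(a + b + lam_sum + y_sum))
    for y_t, l_t in zip(y, lam):
        ll += math.lgamma(l_t + y_t) - math.lgamma(l_t) - math.lgamma(y_t + 1)
    return ll

def renb_loglik(beta, a, b, panels):
    """Sum of unit contributions; X rows include a leading 1 for the intercept."""
    total = 0.0
    for y, X in panels:
        lam = [math.exp(sum(c * x for c, x in zip(beta, row))) for row in X]
        total += renb_loglik_unit(y, lam, a, b)
    return total

# Hypothetical toy panel: two units, three periods, intercept plus one covariate
panels = [
    ([1, 2, 6], [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]),
    ([0, 1, 3], [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]),
]
print(renb_loglik([0.1, 0.8], 4.0, 2.0, panels))
```

Unlike the FENB case, the full likelihood is maximized here, so `a` and `b` are estimated jointly with `beta`.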

SIMULATION DESIGN
We use a Monte Carlo simulation to compare the behavior of the FEP, REP, FENB, and RENB estimators of the four CPD models above. The simulation was conducted in the R language [20,21]. Its design draws on several earlier Monte Carlo studies, such as [4,22,23,24,25,26].

In Case of Moderate and Large Samples
The simulation study was carried out in the moderate and large samples based on the following: thus we speak about overdispersion.

In Case of Small Samples
In this section, we study the behavior of the four estimators in small samples. The data were generated by the same method as in the moderate and large sample case, except that the cross-section size is 5, 10, 15, or 20, the time-series length is 15 or 20, and the dispersion parameter is one.
The Monte Carlo experiment is designed to compare the small-, moderate-, and large-sample performance of the ML estimators of the REP and RENB models and the CML estimators of the FEP and FENB models, based on AIC [27] and BIC [28].
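To make the small-sample grid concrete, the following Python sketch generates overdispersed count panels for each combination of cross-section size and time-series length. The data-generating process shown here is an assumption made for illustration (a gamma-mixed Poisson with variance equal to mean + dispersion × mean², a uniform covariate, and a gamma individual effect); it is not claimed to be the design used in the study, which was run in R, and all names are hypothetical.

```python
import math
import random

def draw_poisson(mu, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_panel(n_units, n_periods, beta, dispersion, seed=0):
    """Return a list of (y, X) pairs with overdispersed counts (assumed design)."""
    rng = random.Random(seed)
    panels = []
    for _ in range(n_units):
        alpha_i = rng.gammavariate(1.0, 1.0)  # hypothetical individual effect
        y, X = [], []
        for _ in range(n_periods):
            x_it = rng.uniform(0.0, 1.0)
            mean_it = alpha_i * math.exp(beta * x_it)
            # Gamma multiplier with mean 1 and variance equal to `dispersion`
            g = rng.gammavariate(1.0 / dispersion, dispersion)
            y.append(draw_poisson(mean_it * g, rng))
            X.append([x_it])
        panels.append((y, X))
    return panels

# Small-sample grid from the text: N in {5, 10, 15, 20}, T in {15, 20}, dispersion 1
for n_units in (5, 10, 15, 20):
    for n_periods in (15, 20):
        panels = simulate_panel(n_units, n_periods, beta=0.5, dispersion=1.0, seed=1)
        counts = [y_it for y, _ in panels for y_it in y]
        print(n_units, n_periods, sum(counts) / len(counts))
```

Each generated panel would then be passed to the four estimators, and the procedure repeated over many replications to average the information criteria.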
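The two criteria are simple functions of the maximized log-likelihood: AIC = −2 ln L + 2k and BIC = −2 ln L + k ln(n), where k is the number of estimated parameters and n the number of observations. The sketch below applies them to made-up log-likelihood values that are purely illustrative and are not results from the study; the function and variable names are hypothetical.

```python
import math

def aic(loglik, k):
    """Akaike information criterion for a fit with k estimated parameters."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n_obs):
    """Bayesian information criterion for a fit with k parameters and n_obs observations."""
    return -2.0 * loglik + k * math.log(n_obs)

def best_model(fits, n_obs, criterion="aic"):
    """Return the model name with the smallest value of the chosen criterion.

    fits maps a model name to a (log-likelihood, number of parameters) pair.
    """
    def score(item):
        loglik, k = item[1]
        return aic(loglik, k) if criterion == "aic" else bic(loglik, k, n_obs)
    return min(fits.items(), key=score)[0]

# Made-up values for illustration only (not taken from the paper's tables)
fits = {
    "FEP": (-250.0, 3),
    "REP": (-245.0, 5),
    "FENB": (-200.0, 4),
    "RENB": (-205.0, 6),
}
n_obs = 100  # hypothetical N * T
print(best_model(fits, n_obs, "aic"), best_model(fits, n_obs, "bic"))
```

Smaller values indicate a better trade-off between fit and complexity; BIC penalizes additional parameters more heavily than AIC whenever n exceeds about seven observations, since ln(n) > 2 there.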

SIMULATION RESULTS
The results of the Monte Carlo simulation study for the moderate and large samples are provided in Tables 1 to 6, while Figures 1 to 4 display the small-sample results.
[Tables 1 to 6; column headings: T = 10, T = 15, T = 40, T = 50, T = 100, T = 200]

EMPIRICAL STUDY: PATENTS IN ARAB COUNTRIES
Many economic studies are concerned with patent applications, e.g. [9,13,16,29,30,31,32]. In our application, we follow the methodology presented by Youssef et al. [13], which summarizes the estimation steps and how to select the appropriate model. In our study, the dependent variable is the number of patent applications, and there are three explanatory variables: GDPC, IMPO, and UNEM, where GDPC is the gross domestic product per capita (U.S. dollars), IMPO is the information and communication technology goods imports (percentage of total goods imports), and UNEM is the unemployment rate (percentage of total labor force).
We prepared the data before estimating the parameters of the CPD models. The data contain some missing values in the number of patents and in IMPO; these missing values were estimated using the mean-imputation method [21,33]. We performed a unit root test for all variables, and the results indicated that the data are stationary in levels [34]. The variance inflation factor (VIF) was calculated to check for multicollinearity among the explanatory variables; the results indicated that the data do not have a multicollinearity problem, because all VIF values are less than five. For more details on how to deal with the multicollinearity problem in regression models, see e.g. [20,35,36].
We estimated the parameters of the fixed effects models using the CML method, while the ML estimation method was used for the random effects models. Table 7 presents the results of the FEP and REP models; the two models are statistically significant because the P-value of the Wald test is less than 0.05. Based on the results of the Hausman test, the P-value of the chi-squared statistic is greater than 0.05, so we do not reject the null hypothesis, which means that the REP model is more appropriate.
In the RENB model, we find that GDPC is statistically significant because the P-value of the Z-value for this variable is less than 0.05, while the IMPO and UNEM variables are not statistically significant.
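The two data-preparation steps can be sketched in Python as follows. This is an illustration with made-up numbers, not the study's data or its R code: `mean_impute` replaces missing entries (marked `None`) by the mean of the observed ones, and `vif` computes 1/(1 − R²) from an ordinary least squares regression of one explanatory variable on the others. All function names are hypothetical.

```python
def mean_impute(values):
    """Replace None entries by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def vif(columns, j):
    """VIF of column j, from regressing it on the other columns plus an intercept."""
    target = columns[j]
    others = [col for k, col in enumerate(columns) if k != j]
    n = len(target)
    design = [[1.0] + [col[i] for col in others] for i in range(n)]
    p = len(design[0])
    XtX = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * t for row, t in zip(design, target)) for a in range(p)]
    coef = solve(XtX, Xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    mean_t = sum(target) / n
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return ss_tot / ss_res  # equals 1 / (1 - R^2)

# Made-up explanatory variables (not the study's data), one with a missing value
gdpc = [2.1, 3.4, 1.9, 4.8, 3.0, 2.7]
impo = mean_impute([5.0, None, 4.2, 6.1, 5.5, 4.9])
unem = [11.0, 9.5, 12.3, 8.1, 10.2, 10.9]
print([vif([gdpc, impo, unem], j) for j in range(3)])
```

Under the rule of thumb stated in the text, values below five would be read as no serious multicollinearity.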

CONCLUSION
In this paper, we used a Monte Carlo simulation to compare the four estimation methods of CPD models. Furthermore, we examined the effect of some economic variables on the number of patent applications in seven Arab countries by applying the four CPD models. We can summarize the main conclusions of our Monte Carlo simulation and the empirical study in the following points:
In future work, we plan to study the efficiency of ML estimators in the presence of outliers [19,21,26] or missing data [21,33] in CPD models. Moreover, the impact of the COVID-19 pandemic [37] or of food and non-food expenditures [38,39] on the number of patents in the Arab countries could be studied using modern CPD models.

CONFLICT OF INTERESTS
The author(s) declare that there is no conflict of interests.