Value-added in higher education: ordinary least squares and quantile regression for a Colombian case

Colombia administers two mandatory national state tests every year. The first, known as Saber 11, is taken by students who finish high school, whereas the second, called Saber Pro, is taken by students who finish higher education. In this paper, the result obtained by a student on the Saber 11 exam, along with his/her gender and socioeconomic stratum, are our independent variables, while the Saber Pro outcome is our dependent variable. We compare the results of two statistical models for the Saber Pro exam. The first model, multi-linear regression or ordinary least squares (OLS), produces an overall well-fitted result but is highly inaccurate for some students. The second model, quantile regression (QR), weights the population according to their quantile groups. OLS minimizes the errors for the students whose Saber Pro result is close to the mean (a process known as estimation in the mean), while QR can estimate a value in the θ-quantile for every 0 < θ < 1. We show that QR is more accurate than OLS and reveal the previously unknown behavior of the socioeconomic stratum, the gender, and the initial academic endowments (estimated by the Saber 11 exam) for each quantile group.


Introduction
Representing the educational phenomenon through mathematical models, which involve the cognitive state at the beginning and at the end of the cycle as well as features of the process carried out, makes it possible to study the impact and efficiency of the projects deployed by a universe of educational institutions.
In particular, estimating the contribution that institutions and their professors make to the academic achievement of a group of students requires valid assessment tools that reliably measure the states reached at the beginning and at the end of a period. This ensures the credibility of the calculated efficacy (Amrein-Beardsley, 2008, p. 71). Measuring the impact of the educational project facilitates the accountability of inclusive institutions with respect to feasible goals, since the cognitive state demonstrated at the end of a cycle depends, to a large degree, on the state that the students show at its beginning (Hanushek & Raymond, 2001, p. 375).
To estimate the educational performance of a student $i$ at a given moment $t$, Equation (1) has been formulated (Hanushek, 1979, p. 363). Its variables are: innate capacities $I_i$, characteristics accumulated up to moment $t$, $B_i^{(t)}$, peer influence $P_i^{(t)}$, and institution contribution $S_i^{(t)}$.

$$A_{it} = f\left(B_i^{(t)}, P_i^{(t)}, S_i^{(t)}, I_i\right). \tag{1}$$

Moving towards the value-added notion, Equation (2) has been proposed to estimate the educational performance of a student $i$ at the end of a period (Hanushek, 1979, p. 364), according to the state of the considered variables at the beginning of that period, $t^*$.

Among the variety of proposed models, Equation (3) stands out (Hanushek & Rivkin, 2012, p. 134); it estimates the educational performance of a student $i$ from the following variables: peer and school influence ($S_i$), family and neighborhood influence ($X_i$), and individual student capacity ($\mu_i$).

Concerning the connection between family background and academic performance, very low and often negative correlation values have been reported (Woessmann, 2004, p. 17). For average performance on a mathematics test, based on data from TIMSS 95, the coefficient found is -0.11. It reflects the difference in academic performance between students whose parents have no high school degree and those whose parents have a professional degree.
Other types of models focus on variables related to the educational institutions; for example, Equation (4) estimates the quality of an educational institution (Bishop & Woessmann, 2004, p. 8). This quality is determined by the learning ability and effort of the students (AE) and by the quantity of resources and the effectiveness of their use (IR).
Regarding the cognitive progress of a group of students, the linear Equation (5) introduces the value-added $v$, in which $\beta_1$, $\beta_2$, and $\beta_3$ are real constants, $x_1$ and $y$ represent the cognitive state at the beginning and at the end of an educational cycle, respectively, $x_2$ reflects the socioeconomic condition, $s$ is the student program, and $\varepsilon$ is the estimated error. The group value-added is calculated as the average of the deviations of the observed results at the individual level (Bogoya & Bogoya, 2013, p. 78).
For a case study, the authors proposed three approximations at the student level:
• The value of the cognitive state variable at the beginning of the higher education cycle in Colombia is taken as the Saber 11 exam result.
• The value of the cognitive state variable at the end of the higher education cycle in Colombia is taken as the Saber Pro exam result.
• The value of the socioeconomic condition variable is taken as the socioeconomic stratum.
With the case study data, solving the model led to the following finding: the cognitive state at the beginning of the cycle explains a portion of the variance of the corresponding variable at the end of that cycle that is thirteen times greater than the portion related to the socioeconomic condition (Bogoya & Bogoya, 2013, p. 81).
The use of value-added models calls for four considerations. First, it is necessary to remember that the significance of the findings depends, among other variables, on the number of evaluated students: the larger the population, the more reliable the estimated value for the effectiveness of an educational institution (Ray, 2006, p. 34). Second, in trend studies, the variation of the student cognitive state at the end of a cycle can fluctuate between two consecutive years. This volatility reduces the reliability of the estimate, so it is important to use averages over several years for small populations (Ray, 2006, p. 34). Third, the estimated variation of the cognitive state is uncertain for students who, at the beginning of a period, are placed at the top of the scale; in this case it is possible to take the average over several students (Tymms & Dean, 2004, pp. 14-15). Finally, in a regression, the coefficient of determination is greater for aggregated data than for individual data. The ecological fallacy must be avoided, because the independent effects tend to be confounded in the aggregate and are hard to disentangle (Hanushek, Jackson, & Kain, 1974, p. 100).
To use the quantile regression methodology to solve value-added models, we start from the original definition of the quantiles of an ordered set of sample observations, a definition that carries over to linear models. Considering $\{y_t : t = 1, \ldots, T\}$ as a sample of a random variable $Y$ with cumulative distribution function $F$, the sample $\theta$-quantile, $0 < \theta < 1$, can be defined as any solution of (6) (Koenker & Bassett, 1978, p. 38).
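For reference, the minimization problem with which Koenker and Bassett define the sample quantile is usually written as follows. The displayed form of Equation (6) is not reproduced in this text, so the notation below (in particular the scalar $b$) is an assumption based on the standard formulation rather than a copy of the original display.

```latex
\min_{b \in \mathbb{R}} \left[ \sum_{t \,:\, y_t \ge b} \theta \, |y_t - b|
  \; + \sum_{t \,:\, y_t < b} (1 - \theta) \, |y_t - b| \right]
```

For $\theta = 1/2$ both weights coincide, and any solution is a sample median.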
A goodness-of-fit procedure for quantile regression has been developed, analogous to the conventional $R^2$ statistic of least squares regression (Koenker & Machado, 1999, p. 1). In addition, several inferential procedures can be formulated to test hypotheses about the combined effects of covariates over a whole range of conditional quantile functions. Quantiles are usually linked with the operations of ordering and sorting the observations that are used to define them (Koenker & Hallock, 2001, p. 145). It is also possible to define the quantiles through an optimization problem: just as the sample mean is the solution that minimizes the sum of squared residuals, the median is the solution that minimizes the sum of absolute residuals. By symmetry, minimizing the sum of absolute residuals must balance the number of positive and negative residuals, which guarantees the same number of observations above and below the median.
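The optimization view of the mean, the median, and a general quantile can be checked numerically. The following is a minimal sketch on a small set of hypothetical scores (not data from this study); the function names and the grid search are illustrative choices, not the procedure used by the authors.

```python
import statistics


def squared_loss(sample, b):
    """Sum of squared residuals of the sample around the value b."""
    return sum((v - b) ** 2 for v in sample)


def absolute_loss(sample, b):
    """Sum of absolute residuals of the sample around the value b."""
    return sum(abs(v - b) for v in sample)


def check_loss(sample, b, theta):
    """Asymmetric absolute loss: weight theta above b, 1 - theta below b."""
    return sum(theta * (v - b) if v >= b else (1 - theta) * (b - v)
               for v in sample)


# Hypothetical scores, including one extreme value.
scores = [82.0, 91.0, 95.0, 100.0, 104.0, 108.0, 150.0]

# Candidate values 80.00, 80.01, ..., 150.00.
grid = [k / 100 for k in range(8000, 15001)]

best_sq = min(grid, key=lambda b: squared_loss(scores, b))
best_abs = min(grid, key=lambda b: absolute_loss(scores, b))
best_q25 = min(grid, key=lambda b: check_loss(scores, b, 0.25))

print("minimizer of squared loss:", best_sq, "mean:", statistics.mean(scores))
print("minimizer of absolute loss:", best_abs, "median:", statistics.median(scores))
print("minimizer of check loss (theta = 0.25):", best_q25)
```

The squared loss is minimized near the sample mean, which the extreme score pulls upward, while the absolute loss is minimized at the median and the asymmetric loss at a lower order statistic.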
It is important to point out that, although quantile regression has seen considerable development and a variety of applications, numerous aspects remain open for research, especially concerning regularization parameters (Koenker, 2004, p. 88). There are different versions of the model that might extend the optimal structure for the fixed effects and incorporate ordinal factors and nonparametric components. The analysis of the method's performance for samples of fixed size remains a research avenue, as do applications to growth curves, which can serve as a natural laboratory for future developments of quantile regression models for longitudinal data.

Econometric methodology
The learning outcomes of higher education programs worldwide depend on several conditions and variables (Hanushek, 1979). We study, in two different ways, some possible relationships between them. These variables are approximations of certain general conditions for each individual, such as the socioeconomic and cultural environment, the learning level of the students at the beginning of their university studies, and the existence of a wide variety of academic value-added elements of such programs.
We define the following input variables: the score obtained by the student on the national higher education admission exam (Saber 11), as a synthesis of the partial scores observed in the evaluated areas; the student's socioeconomic stratum at the end of his/her university studies; and the student's gender. The Saber 11 result is understood as a proxy of the initial academic level of a student when starting a university program, while the socioeconomic stratum is understood as a proxy of the family income and socioeconomic conditions. For economic decision purposes, the Colombian state uses a number between 1 and 6, called "socioeconomic stratum", to indicate the relative wealth of the people in a certain location; we use this indicator as an input variable. The student's gender, in turn, is a frequently used control variable in this kind of study.
The output variable is the student's score on the national higher education exit exam (Saber Pro), understood as a proxy of the academic level when finishing a university program. Our objective is to compare two statistical models for the output, using the same input variables. The first one is the well-known multi-linear regression or ordinary least squares (OLS) and the second one is quantile regression (QR). Generally speaking, the QR method gives a more detailed view than OLS when analyzing linear models, since it additionally estimates the outcome variable for each possible quantile (Brennan, Cross, & Creel, 2015; Frumento & Bottai, 2016). Thus, OLS and QR are different econometric tools, and we are interested in comparing them in our specific study.
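Let $x$ be an $n \times p$ matrix of independent variables (Saber 11 outcome, socioeconomic strata, and gender) and $y \in \mathbb{R}^n$ a vector of dependent variables (Saber Pro outcome). We assume the following linear model

$$y = x\beta + \varepsilon, \tag{7}$$

where $\beta \in \mathbb{R}^p$ and $\varepsilon \in \mathbb{R}^n$ are constant vectors. Let $x_j \in \mathbb{R}^p$ be the $j$-th row of the matrix $x$. We can split Equation (7) as

$$y_j = x_j \beta + \varepsilon_j, \qquad j = 1, \ldots, n. \tag{8}$$

The vector $\beta$ which minimizes $\sum_{j=1}^{n} \varepsilon_j^2$ is given by the multi-linear or ordinary least squares regression (OLS) of $y$ with respect to $x$; the well-known solution is $\beta = (x^T x)^{-1} x^T y$, where $x^T$ stands for the transpose of $x$. This solution is based on the assumption that the expected value of the errors $\varepsilon_j$ is zero. In statistics, $\beta$ is known as the regression vector and $\varepsilon$ as the error vector. OLS minimizes the errors $\varepsilon_j$ for the students whose Saber Pro result is close to the mean of $y$ while paying less attention to the rest of the population; this behavior is known as estimation in the mean.

The second model that we study, quantile regression (QR), is described as follows. For each real $0 < \theta < 1$, the regression vector $\beta(\theta)$ of the $\theta$-quantile is a solution of

$$\min_{\beta \in \mathbb{R}^p} \left[ \sum_{j \,:\, y_j \ge x_j \beta} \theta \, |y_j - x_j \beta| + \sum_{j \,:\, y_j < x_j \beta} (1 - \theta) \, |y_j - x_j \beta| \right]. \tag{9}$$

Writing $\varepsilon_j = y_j - x_j \beta$ as in (8), we can write (9) as

$$\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^{n} \tilde{\varepsilon}_j,$$

where $\tilde{\varepsilon}_j := \theta \varepsilon_j$ if $\varepsilon_j \ge 0$, and $\tilde{\varepsilon}_j := (\theta - 1) \varepsilon_j$ if $\varepsilon_j < 0$; i.e., $\beta(\theta)$ minimizes the sum of the absolute values of the errors with certain weights. In our case, a $\theta$-quantile is a value of the outcome variable $y$ that is bigger than a $\theta$ portion of the observations and less than the remaining $1 - \theta$ portion. Additionally, some authors give a clear step-by-step explanation of how to run QR in Stata (Cameron & Trivedi, 2010).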
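The difference between the two estimators can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the Stata procedure used in this study: it uses hypothetical data with a single regressor plus an intercept, computes OLS in closed form, and finds the QR line by brute force, relying on the known property that some optimal quantile regression line passes through at least two observations.

```python
from itertools import combinations


def ols_line(x, y):
    """Closed-form OLS for one regressor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def check_loss(x, y, intercept, slope, theta):
    """Weighted absolute errors: theta * e if e >= 0, (theta - 1) * e otherwise."""
    total = 0.0
    for a, b in zip(x, y):
        err = b - (intercept + slope * a)
        total += theta * err if err >= 0 else (theta - 1) * err
    return total


def qr_line(x, y, theta):
    """Brute-force QR: try every line through two observations, keep the best."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = check_loss(x, y, intercept, slope, theta)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical fan-shaped data: the spread of y grows with x.
x = [float(v) for v in (1, 2, 3, 4) for _ in range(3)]
y = [v * r for v in (1, 2, 3, 4) for r in (0.5, 1.0, 1.5)]

print("OLS (intercept, slope):", ols_line(x, y))
for theta in (0.25, 0.5, 0.75):
    print("QR theta =", theta, "(intercept, slope):", qr_line(x, y, theta))
```

On this artificial data set OLS returns a single slope of 1.0, while the QR slopes are 0.5, 1.0, and 1.5 for θ = 1/4, 1/2, and 3/4, which is the kind of quantile-dependent behavior that a single OLS coefficient cannot show. The brute-force search is only practical for very small samples; statistical packages solve the same problem with linear programming.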

Business administration
We used an extensive data set with the results of 160,207 students who took the 2009 Saber Pro exam in Colombia. For these students we know their Saber 11 result, socioeconomic stratum, gender, and selected higher education program. From this universe, we consider the set of students evaluated through the business administration Saber Pro exam (the largest). For reliability reasons, only programs with 20 or more students are taken into account. The resulting database contains 10,783 students. The socioeconomic stratum and gender variables, being categorical, are treated as dummies. Saber Pro is scaled with mean 100 and standard deviation 10, while Saber 11 has 30. In the analysis of the model outcomes, we normalized them both, i.e. mean 0 and standard deviation 1.

Figures 1 and 2 show some characteristics of the behavior of the main variables. With colored regions, Figure 1 shows the approximate distribution of Saber Pro in each stratum. Figure 1 also reveals the linear relation between the two variables: in general, the higher the stratum of an individual, the higher his/her Saber Pro result will be. Figure 2 also shows the linear relation between Saber Pro and stratum; in general, gender 1 (male) gets higher Saber Pro results than gender 0 (female), but the difference decreases as the stratum increases.

We assume the model (7), where $y_j$ is the Saber Pro outcome for student $j$ and the row vector $x_j$ is $(x_{j,1}, x_{j,2}, \ldots, x_{j,7})$: the entry $x_{j,1}$ is the Saber 11 test outcome for student $j$; $x_{j,k}$, for $k = 2, \ldots, 6$, takes the value 1 if student $j$ lives in a socioeconomic stratum $k$ area and the value 0 otherwise; finally, $x_{j,7}$ takes the value 0 for a male and the value 1 for a female. This means that socioeconomic stratum 1 and male gender play the role of base variables.
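The normalization of the scores and the dummy coding of the row vector can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def standardize(values):
    """Rescale scores to mean 0 and standard deviation 1.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def design_row(saber11, stratum, is_female):
    """Build the row (x_1, ..., x_7) for one student.

    x_1 is the (normalized) Saber 11 score, x_2..x_6 are dummies for
    strata 2..6 (stratum 1 is the base), and x_7 is 1 for a female and
    0 for a male (male is the base).
    """
    if stratum not in range(1, 7):
        raise ValueError("stratum must be an integer between 1 and 6")
    stratum_dummies = [1 if stratum == k else 0 for k in range(2, 7)]
    return [saber11] + stratum_dummies + [1 if is_female else 0]


# Hypothetical students: (raw Saber 11 score, stratum, is_female).
students = [(45.0, 1, False), (52.0, 4, True), (60.0, 6, True)]
z_scores = standardize([s[0] for s in students])
rows = [design_row(z, stratum, female)
        for z, (_, stratum, female) in zip(z_scores, students)]
for row in rows:
    print(row)
```

A student in stratum 1 has all five stratum dummies equal to zero, so the coefficients of strata 2 to 6 are read as differences with respect to stratum 1.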

Results
Table 1 shows the numerical results produced by the OLS model. The data were obtained with Stata. We get $R^2 = 0.842$, showing an accurate model.

In the quantile plots, the gray band is the 95% confidence interval and the dotted (black) line is the OLS value for $\beta_\ell$. Thus, when the dotted line falls outside the gray band (see Figure 3), the OLS model will generate big errors. Source: Authors.

3. In our modeling procedure, we take stratum 1 as a base variable; thus Figures 4 to 8 actually show the strata-quantile behavior in relation to stratum 1.
Looking at the vertical axis labels, we can see a progressively increasing value of the regression coefficient (see also Figure 1), which reveals an academic inequality related to the socioeconomic stratum. Additionally, note that all the strata behave similarly and that the OLS model is accurate enough for these variables.
4. Figure 9 shows the male-female regression coefficient for the different quantile groups. It reveals that we can expect higher Saber Pro results from males in every quantile group and that the OLS method overestimates the lowest Saber Pro outcomes and underestimates the highest ones. This shows again that the QR method gives a different regression coefficient value for each segment of the population, taking into account the interplay between the dependent and independent variables, whereas the OLS method prevents us from noticing the changes in the association between the variables.

Figure 2. The Saber Pro, stratum, and gender interplay. Genders 0 and 1 stand for female and male, respectively. Source: Authors.
Tables 2, 3, and 4 show the numerical results produced by the QR model. The data were obtained with Stata. Each table reports the coefficient $\beta_\ell$, the standard deviation $\sigma_\ell$, and the 95% confidence interval for each variable $x_{j,\ell}$.

Table 1. OLS results. All the variables are statistically significant. $\beta_\ell$ is the $\ell$-th component of the vector $\beta$ and $\sigma_\ell$ is the standard deviation of the column vector $(x_{j,\ell})$, see (7).

Table 2. QR results for θ = 1/4. The variable stratum 6 is not statistically significant in this quantile.

Table 3. QR results for θ = 1/2. All the variables are statistically significant in this quantile.

Table 4. QR results for θ = 3/4. All the variables are statistically significant in this quantile.