School-level inequality measurement based on categorical data: a novel approach applied to PISA

This paper introduces a new method to measure school-level inequality based on Item Response Theory (IRT) models. Categorical data collected by large-scale assessments pose diverse methodological challenges that hinder measuring inequality, due to data truncation and asymmetric intervals between categories. I use family possessions data from PISA 2015 to exemplify the process of computing the measurement and develop a set of country-level mixed-effects linear regression models comparing the predictive performance of the novel inequality measure with school-level Gini coefficients. I find that school-level inequality is negatively associated with learning outcomes across many non-European countries.


Introduction
Although the relevance of socioeconomic factors as predictors of children's cognitive learning attainment is a highly disputed issue in terms of causality (Mayer, 1997), there is extensive and long-standing research recognising their important role in explaining educational disparities in terms of access and outcomes (Coleman, 1966;Del Bello et al., 2015). Furthermore, research from a range of disciplines has highlighted a negative association between socioeconomic disparity and individual outcomes, offering various explanations for a detrimental role of inequality on domains such as health and subjective well-being (Deaton, 2003;Schneider, 2016;Wilkinson & Pickett, 2006).
Socioeconomic variables also play an important role in Large-Scale Assessments to explain or control for differences among groups in terms of learning outcomes and other variables of interest (Hopfenbeck et al., 2018). However, the possible interplay between school-level inequality and educational outcomes has been less addressed. Although previous research has developed alternatives to address the measurement of inequality based on dichotomous or ordinal data, there has not been, to my knowledge, an alternative that computes inequality in the same statistical framework used in Large-Scale Assessments, namely Item Response Theory (IRT). In this paper, I develop a novel method to measure school-level assets inequality utilising IRT models, based on the discrimination parameter α. The proposed inequality measure computes the dispersion of the data at a certain aggregated level, such as schools or countries. The measure allows both ranking observations in terms of inequality and comparing the average inequality across schools. I exemplify this by computing inequality based on the PISA 2015 home possessions index (HOMEPOS).
The remainder of the paper is structured as follows. "Socioeconomic measurement in PISA" section discusses the role and limitations of socioeconomic variables in PISA and International Large-Scale Assessments (ILSAs), while "The complexity of measuring inequality based on categorical data" section reviews the relevant literature regarding the measurement of inequality using categorical data, discussing the main methods previously developed in recent literature. "Alpha inequality: inequality based on an Item Response Theory paradigm" section briefly introduces IRT and summarises the methodological construction process of the inequality measure, named as Alpha Inequality. "Methods" section introduces the criteria used to analyse Alpha Inequality and the data used in the empirical section. "Results and discussion" section presents the findings of the construction process of Alpha Inequality and a comparative analysis of results with a Gini coefficient in terms of descriptive and inferential parameters, while "Conclusion" section concludes the study.

Socioeconomic measurement in PISA
The relevance of socioeconomic background questions in PISA, as well as in ILSAs, is twofold. First, socioeconomic variables are constantly used as control regressors as well as in the analysis of equality of opportunities of educational systems. For instance, PISA reports differences among scores within quintiles of wealth and reports gaps explained by less privileged socioeconomic backgrounds (OECD, 2016). Second, due to the nature of PISA and other ILSAs, where there is limited time to cover diverse aspects of knowledge, students are exposed only to a portion of the cognitive tests. Consequently, socioeconomic information is used as auxiliary information to impute final learning scores, through a technique called plausible values, which are "drawn from a posteriori (data) distribution by combining the IRT scaling of the test items with a latent regression model using information from the student context questionnaire in a population model" (OECD, 2017, p. 128).
Extensive research has been done analysing background questionnaires in PISA, showing diverse limitations on socioeconomic indicators. For instance, there is evidence of cross-country comparability deficiencies within and between PISA cycles (Lee & Von Davier, 2020; Sandoval-Hernandez et al., 2019) and poor model fit (Rutkowski & Rutkowski, 2013). One of the main consequences is the distortion of achievement estimates; see, for example, Rutkowski (2011, 2014) and also Rutkowski and Zhou (2015). Additionally, prior research also reports deficiencies regarding the cultural validity of some questions. For instance, there is a particular bias towards describing better contexts of developed countries, such as the number of questions that reflect a certain type of cultural possession , 2013, The greater access to electronic goods or internet in current days does not necessarily differentiate among higher and lower classes as could happen in a recent past (Avvisati, 2020).
Turning specifically to HOMEPOS in PISA 2015, I observe question wording that raises concerns regarding the items' weight in the index computation. For instance, 6 of the common 22 questions (27%) refer to the possession of different books, while 4 questions (18%) refer to electronic possessions. In that dimension, two questions present similar topics ('Computers [desktop computer, portable laptop, or notebook]' and 'A computer you can use for school work', which present a strong polychoric correlation, r(492,640) = 0.739, p < 0.001). Additionally, there is one general question that does not seem to reflect socioeconomic status ('a quiet place to study'), but rather an educational or academic environment. Finally, the question asking about the possession of 'works of art' at home is open to diverse interpretations, which may confuse respondents. The parameter for this last question is not included in official reports, although it was not formally excluded from the index (OECD, 2016, 2017). Another relevant topic relates to the national items, three questions used by each country, which have been praised as a step forward in terms of each country's better contextualisation of socioeconomic status (Rutkowski & Rutkowski, 2013). However, several points can be raised about those questions. First, they do not necessarily discriminate socioeconomic status but household choices (e.g., espresso machine in France or cultural television programs with payment in Albania). Second, they may refer to outdated technology ('BluRay player' in Mexico) or are biased towards specific sensitivities ('Violin/Cello' in Hong Kong, 'Piano or violin' in Taipei and Macao, or a 'piano' in the Netherlands). Third, only in a few cases do they relate to the possession of luxury goods ('summer residence' and 'swimming pool' in Malta), which produce extreme parameters. It is also possible to detect redundancy of those national questions with the common questions.
For instance, many questions regarding electronics are repeated (e.g. 'laptop' in Moldova and Finland or 'tablet' in Norway, Spain, Switzerland and UK; 'musical instruments' in the United States; an 'encyclopaedia' in Colombia), while local dependencies and inconsistencies among answers are not explicitly assessed by PISA (Avvisati, 2020). Finally, it is possible to find important differences in terms of factor loadings among countries (OECD, 2017), which suggests room for improvement in terms of capturing wealth in families. Additionally, one of the trade-offs of extending national items in HOMEPOS is the difficulty of addressing cross-country comparability issues using fewer common items across countries. While many criticisms can be made of HOMEPOS, highlighting limitations and challenges, it is still a relevant source to be used with caution to shed light on the role of socioeconomic differences in schools.

The complexity of measuring inequality based on categorical data
Measuring inequality based on ordinal or binomial data, or a mixture of both, poses a set of methodological challenges. First, certain distributional statistics such as the mean, variance or standard deviation cannot be properly drawn (Cowell & Flachaire, 2017; Zheng, 2008). Proportions and modes are the appropriate tools to analyse this type of data. Second, in many cases, ordinal data depict an arbitrary scale or asymmetric intervals in their response alternatives, which may also bias the analysis. For instance, a 5-point Likert scale question does not necessarily represent the same difference between pairs of options. Asked for an opinion on a certain policy addressing inequality within schools, I could choose either 'agree' or 'strongly agree', as both options are close in my mind, yet I would never choose the middle-point category, 'neither agree nor disagree', because I perceive it as very far from the 'agree' I might have chosen.
One of the consequences of dealing with categorical data is that traditional inequality measures, such as the Gini coefficient and generalised entropy indexes-for example, Theil or Atkinson indexes, which refer to inequality as a deviation from the mean or are mean-normalised, cannot be suitably employed to measure inequality using categorical raw data (Cowell & Flachaire, 2017;Zheng, 2011).
Recent research has proposed alternative inequality measurements based on categorical data. Allison and Foster (2004) suggest comparing one-variable cumulative distributions of Likert-type questions by ordering the data and identifying the distance from the median as an inequality measure. As they mention, their method only applies when the distributions being compared share the same median. Additionally, this method does not meet a desirable characteristic of any inequality index, the normalisation axiom, under which a distribution of identical observations (total equality) desirably takes a zero value. Based on that seminal idea, Abul Naga and Yalcin (2008) introduce a family of inequality indices based on the analysis of one variable, normalising different questions' scales. Under their method, different Likert-scale questions (with 3, 5 or 7 alternatives) can be compared in terms of inequality. Zheng (2011) extends the approach to measuring inequality based on two variables. However, if the median does not provide an adequate reference for inequality, for example when the data are skewed, all the previous measures may not capture the extent of the inequality.
A second approach developed to address this limitation is proposed by Cowell and Flachaire (2012, 2017). Instead of using the median as a reference, they compute inequality relative to a reference status. They suggest counting ranking positions of all observations and expressing them as proportions of the population. The measure could be either 'downwards' or 'upwards' in terms of relative position on a scale. Although very suggestive, this method does not seem adequate for measuring assets inequality due to the multivariate nature of a continuous wealth trait. However, the ideas of maintaining the ordinality of the scales and of ranking observations rather than measuring inequality directly remain in my proposed approach.
A third approach that addresses multiple variables consists of computing inequality based on latent variable methods. For instance, McKenzie (2005) suggests a relative inequality measure for identifying subpopulations' disparity based on a polychoric Principal Component Analysis index. His method computes each subpopulation's standard deviation divided by the variance explained by the first principal component, which additionally allows the comparison of subgroups to the overall population inequality. The ideas of using ratios and of comparing to the overall inequality average are kept in my proposal. In this case, IRT is chosen over polychoric PCA as a specific approach to model categorical data.
Finally, at least three caveats can be drawn when assessing school-level inequality based on HOMEPOS. First, HOMEPOS is derived through a posterior weighted maximum likelihood estimation (WLE), which assumes a normal distribution (Warm, 1989). In the case of PISA 2015, significant differences between countries occur in terms of the mean of HOMEPOS while there are fewer variations in the distribution across countries (see Fig. 5 in Annex 2). Second, simulations show that WLE tends to overestimate within-school variance (OECD, 2009). This is relevant for our case as school-level inequality is relative to the variance of school HOMEPOS. Third, WLE is sensitive to ceiling and floor effects if items are too easy or difficult, respectively. This contradicts another desired property of any inequality measure, scale invariance, under which proportional changes to answers should not modify inequality. For example, if we add 10% of wealth to everyone, inequality remains the same as before. Finally, as WLEs are only a single possible realization of the estimation, they do not address the uncertainty of the model, which could be handled by using plausible values as independent variables (Pokropek, 2015). To address these limitations of measuring inequality based on WLE, I compute inequality based on the raw answers on family possessions rather than using the derived index HOMEPOS.
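The two axioms mentioned so far, normalisation (identical observations give zero) and scale invariance (a proportional change leaves inequality unchanged), can be checked numerically on a conventional index. The following is a minimal sketch using a textbook rank-based Gini formula; the function name `gini` and the wealth values are hypothetical and are not taken from the paper or from PISA data.

```python
def gini(values):
    """Gini coefficient of non-negative values (rank-weighted formula)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n, ranks from 1
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical wealth values, for illustration only
wealth = [1.0, 2.0, 3.0, 10.0]
scaled = [1.10 * w for w in wealth]   # everyone gains 10%
equal = [3.0, 3.0, 3.0]               # total equality

g_wealth = gini(wealth)
g_scaled = gini(scaled)
g_equal = gini(equal)
```

Under these assumptions `g_scaled` matches `g_wealth` up to floating-point error, and `g_equal` is zero, which is the behaviour the text asks of any inequality measure.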

Item Response Theory models
The proposed inequality measure (hereafter, Alpha Inequality) builds upon the discrimination parameter from IRT models. IRT is a statistical family of latent construct analysis that focuses on categorical data and is mainly used in educational and psychological fields. IRT assumes that each person has a certain level, called the individual trait, of an unobservable continuous construct (e.g., knowledge, competences, attitudes) that predicts the probability of answering correctly or endorsing an observable item (e.g., cognitive questions). In this case, the higher the level of the construct (family wealth), the higher the probability of reporting the possession of an item (an electronic good).
It is based on the notion that the probability of a correct response or endorsement to an item is a function of both the person's trait and certain item parameters-such as difficulty, discrimination or pseudo guessing (Embretson & Yang, 2006). The item parameters determine the information offered by each item to any person's trait level.
The simplest IRT model is often called the Rasch model (Rasch, 1960). According to the Rasch model, an individual's response to a binary item (i.e., right/wrong, agree/disagree) is determined by the individual's trait level and one item parameter-the difficulty of the item. Because this model uses the logistic density function and uses a single item parameter, it is called the one-parameter logistic model (1-PL) (Fischer, 1995)-although there are some conceptual differences between Rasch and 1-PL. Other IRT models have been developed covering ordinal and nominal data; adding parameters to the logistic function such as the discrimination or guessing parameters (Embretson & Yang, 2006); and also using distinct methods towards dichotomising data for the analytical modelling process.
For instance, in 2015, PISA uses two IRT models: the generalised partial credit model (GPCM) (Muraki, 1992) for multi-item questions and the two-parameter logistic model for dichotomous items. In both cases, it adds the item discrimination parameter $\alpha_i$ to the function, which is explained below. The GPCM expresses the probability of an individual $j$'s correct response (or endorsement) $X_{ij}$ to an item $i$ for the total number of categories $K$ of each question. $\theta_j$ represents the individual's trait level, while $\beta_k$ refers to the item difficulty or location. The parameter $ak_k$ indicates the ordering of the categories from 0 to $K - 1$ (Chalmers, 2012). The discrimination parameter $\alpha_i$ represents the degree to which an item differentiates between respondents in different regions of the measured latent trait $\theta_j$ (in this case, household possessions). The parameter defines the steepness of the slope when $P(\theta) = 0.5$, where higher values suggest a better separation between individuals with higher and lower latent traits. Therefore, if $\alpha_i \to \infty$, the item represents a perfect separation between those who respond correctly (in this case, have a specific possession) and those who do not have ownership of it. Figure 1 is a simulated example of item characteristic curves (ICC) for three items, where item 3 has a higher discrimination parameter than the other two items because it shows a steeper curve than items 1 and 2. The item discrimination parameter $\alpha_i$ reflects the sensitivity of the response probability to changes in trait level (Embretson & Yang, 2006) and gives information on the importance of the item to the individual trait, in this case, how strongly possessing a certain good reflects family wealth.
Now I depart from the usual IRT parameter interpretation to turn to the consideration of inequalities. First, let us remember that inequality is an aggregated measure and not an individual condition. Therefore, we can think of the latent trait as a continuum of equality (or inequality) of wealth for all respondents. In the hypothetical case that all respondents fall into the same value of $\theta$, the item represents an egalitarian condition, irrespective of the location on the x-axis of $P(\theta) = 0.5$ (where values on the left of the axis would represent poverty while values on the right would represent richness). If the same occurred for all items, there would be a status of full egalitarianism. Additionally, as the parameter defines the steepness of the ICC, larger item discrimination also means that the gap between those below the 50% probability of endorsing the item and those over that threshold has greater weight in terms of splitting individuals.
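To make the role of the discrimination parameter concrete, the sketch below evaluates the two model families named above. It assumes the slope-intercept form documented for the mirt package, P(X = k | θ) = exp(ak_k·α·θ + β_k) / Σ_h exp(ak_h·α·θ + β_h), which may differ in detail from the notation used in the original equation. The function names and all parameter values are hypothetical.

```python
import math


def gpcm_probs(theta, alpha, intercepts, ak=None):
    """Category probabilities for one polytomous item (assumed slope-intercept form).

    intercepts[k] plays the role of beta_k; ak[k] is the category score,
    defaulting to 0, 1, ..., K-1.
    """
    n_cat = len(intercepts)
    if ak is None:
        ak = list(range(n_cat))
    logits = [ak[k] * alpha * theta + intercepts[k] for k in range(n_cat)]
    top = max(logits)                      # subtract the max for numerical stability
    expd = [math.exp(z - top) for z in logits]
    denom = sum(expd)
    return [e / denom for e in expd]


def icc_2pl(theta, alpha, beta):
    """Endorsement probability under a 2-PL item with difficulty beta."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))


# Hypothetical items: same location, different discrimination
p_flat = icc_2pl(0.5, 1.0, 0.0)
p_steep = icc_2pl(0.5, 3.0, 0.0)
probs = gpcm_probs(0.3, 1.2, [0.0, 0.5, -0.4])
```

Both hypothetical items cross P(θ) = 0.5 at the same location, but the item with the larger α separates respondents just above and just below that point more sharply, which is the property Alpha Inequality builds on.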

Developing Alpha Inequality
The building process of Alpha Inequality, $I_j(x)$, of any economic variable of interest (in this case, household assets possession) implies the following steps. First, the method involves estimating any IRT or latent variable model that considers the binary or ordinal nature of the responses (such as the graded response model or the continuation ratio model, among many) and assumes the existence of a discrimination parameter that differs between items, which is not the case for a 1-PL model. In this example, I use GPCM for polytomous questions and 2-PL for binary items to coincide with the PISA 2015 modelling strategy.
The first step involves computing the IRT models for each item used in building the index and extracting the $\alpha_i$ parameters. The second step consists of normalising all answer alternatives, $\varsigma_i$, into the same range of values, in this case from 0 to 1. This is done to give the same importance to polytomous and binomial questions in terms of a similar contribution to the inequality measure. The third step involves the sum of the products of each parameter $\alpha_i$ and the observation score $\varsigma_{ij}$ for each observation (person) $j$ of the dataset, denoted $\xi_j$. In the case of missing data, I weight each observation $j$ according to the number of questions answered, $q_j$, to differentiate questions not answered from the absence of possession of an item; the weighted score is denoted $\omega\xi_j$. The final step implies computing the inequality measure for each school, $I_\phi$, which allows comparing between schools, as well as assessing whether schools reach an egalitarian status, where $I_\phi = 0$. The inequality measure for each school $\phi$ is computed as the ratio of the standard deviation of $\omega\xi_j$ within the school to the standard deviation for the entire population $c$, in this case each country, $\xi_c$. Following McKenzie (2005), this provides additional information: if $I_\phi$ is greater than one, the school displays more inequality than the country average inequality.
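The steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the weighting for missing answers is assumed to be a division of the summed score by the number of answered questions, the population (not sample) standard deviation is used, and the discrimination values, item codings and responses are all hypothetical.

```python
import statistics


def weighted_score(responses, alphas, max_cats):
    """Sum of alpha_i * normalised answer, divided by the number of answers given.

    responses uses None for a missing answer; max_cats[i] is the highest
    category code of item i, so r / max_cats[i] lies in [0, 1].
    Dividing by the answered count is an assumed reading of the weighting step.
    """
    total, answered = 0.0, 0
    for r, a, m in zip(responses, alphas, max_cats):
        if r is None:
            continue
        total += a * (r / m)
        answered += 1
    return total / answered if answered else None


def alpha_inequality(school_scores, country_scores):
    """Ratio of the within-school SD to the country-wide SD of the scores."""
    return statistics.pstdev(school_scores) / statistics.pstdev(country_scores)


# Hypothetical discrimination parameters: two binary items, one 4-category item
alphas = [1.2, 0.8, 2.0]
max_cats = [1, 1, 3]
schools = {
    "A": [[1, 1, 3], [1, 1, 3]],
    "B": [[0, 0, 0], [1, 1, 3], [1, None, 0]],
}

scores_by_school = {
    name: [weighted_score(r, alphas, max_cats) for r in students]
    for name, students in schools.items()
}
country_scores = [s for scores in scores_by_school.values() for s in scores]
inequality = {
    name: alpha_inequality(scores, country_scores)
    for name, scores in scores_by_school.items()
}
```

In this toy example school A, where every student gives the same answers, obtains a value of zero, while school B obtains a positive value. The sketch does not guard against a country in which all scores are identical, where the denominator would be zero.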
Every inequality measure has some properties to fulfil to provide reliable information regarding the distribution of any variable, in this case, wealth: scale and anonymity invariance, population independence, and binding the Pigou-Dalton transfer principle (Cowell, 2016). The Lemma containing how $I_\phi$ fulfils all main axioms and its proof can be found in Annex 1.

Sempé Large-scale Assess Educ (2021) 9:9

Methods

Data
I use the wealth index, HOMEPOS, from PISA 2015 to exemplify and evaluate the performance of Alpha Inequality. PISA 2015 collects data from dichotomous and ordinal questions on 25 household indicators across 73 countries and economies. The target population and sampling strategy aimed to represent the universe of 15-year-old students enrolled in each educational system. Students are sampled following a stratified design, where a minimum of 150 schools with proportional probabilities to the student population is initially selected. The minimum sample expected per school is 20 students to ensure adequate accuracy in estimating between- and within-school variance (OECD, 2017).
Following PISA's criteria (OECD, 2017), I subset those observations with at least 3 answers on the HOMEPOS scale and no missing values for the computed HOMEPOS scale. I exclude observations from schools with fewer than 20 observations. Additionally, data from two USA states and Puerto Rico, which did not provide identification of schools, are also excluded. The sample was reduced from 519,334 to 454,734 observations belonging to 69 countries, administrative regions, and economies and 13,387 schools. Descriptive statistics per country used in this study are in Table 1, and Table 7 in Annex 2 shows the frequency of observations per country.
PISA's modelling strategy for HOMEPOS is a two-step process. First, a multiple group IRT two-parameter model is estimated (GPCM for ordinal questions and 2PL for dichotomous questions). Subsequently, HOMEPOS is computed based on the posterior weighted maximum likelihood estimation (WLE) (OECD, 2017). As HOMEPOS published parameters by PISA are estimated from a sample and do not reflect the observations used in this study (OECD, 2017), I replicate the first step of PISA's modelling strategy to extract the α discrimination parameters for each country and items. Following PISA, I estimate 22 common questions with equal parameters while 3 questions had parameters freely estimated per country. Correlations between PISA's HOMEPOS and the replicated index are over 0.939 for each country (see Table 8 in Annex 2).
Great variability is seen in terms of discrimination across items (Table 2). For instance, the questions 'book of poetry' and 'classic literature' present lower values, while, at the opposite end, 'internet access' and 'computers' present the highest values among the common parameters.
There is also large variability in the parameters of the national-specific items, shown in Table 3. For instance, some countries present higher values in all three items, such as the case of Thailand, while the opposite also occurs, such as in the case of the United Kingdom. Germany is the only case that presents a negative discrimination parameter for the question ' A TV in your own room' . A negative discrimination parameter suggests the latent trait diminishes with the ownership of the good.
As the objective of the study is to exemplify the construction of the inequality measure, I do not address or evaluate model fit and invariance analysis. I rely on PISA's item invariance analysis, based on the root mean square deviance (RMSD), which states that invariance of HOMEPOS items across countries was analysed and 'unique parameters were assigned if necessary' (OECD, 2017, p. 342). However, as previously mentioned, prior research disputes the reliability and validity of socioeconomic scales in PISA. I acknowledge those limitations and focus, in the present study, only on the methodological contribution of building an inequality measure.

Criteria to assess Alpha Inequality validity
The strategy chosen to examine Alpha Inequality is to assess its validity in comparison to prior evidence and to compare results with a well-known inequality index based on HOMEPOS, the Gini coefficient. The Gini coefficient is computed based on HOMEPOS applying a correction for finite populations (Nygård & Sandström, 1985). HOMEPOS was transformed into a range of positive values [0, 15.457] to address a requirement of the Gini coefficient computation. First, I compare cross-country ranking statistics from both measures and exemplify the relevance of inequality for learning scores in the case of the USA by comparing schools at both extremes of the inequality continuum. Second, I model a set of textbook regressions to examine how Alpha Inequality and the Gini coefficient are associated with Mathematics scores. For each country, I fit two sets of two-level mixed-effects linear models, allowing random intercepts to vary at the school level. This addresses the hierarchical structure of PISA, where students are nested in schools. Formally, the two-level random intercept model reads as:

$$Y_{ij} = \beta_{0j} + \beta_1\,\mathrm{homepos}_{1ij} + \beta_2 x_{1ij} + u_j + \epsilon_{ij}, \tag{5}$$

where $Y_{ij}$ denotes the outcome variable for the i-th observation (student) of group j (school) and $\beta_{0j}$ the school intercepts (which are random variables enabling the quantification of the differences between groups). The $\beta$s are regression parameters invariant across groups. The different inequality measures are denoted by $x_{1ij}$, while $u_j$ is the group-dependent deviation from the intercept mean and $\epsilon_{ij}$ represents the error term.
HOMEPOS was included in the model due to the influence of the difficulty parameter on the posterior estimation of HOMEPOS, which may allow a better understanding of the role of an inequality measure independent from the wealth possessions.
There are three key methodological considerations when modelling data from PISA. First, PISA is based upon a two-stage stratified sampling strategy to select schools and students. I address this using sampling weights to account for differences in the probabilities of students, classes and schools being selected in the sample. Considering a multilevel analysis setting, I follow PISA's practice since 2012 (OECD, 2017), using weights both at the student and school levels in the regression analysis. For the student level, I scale student weights following Rabe-Hesketh and Skrondal (2006), adjusting students' weights by the ratio of the school size to the sum of students' weights; the scaled weight is denoted $\mathrm{W\_FSTUWT}_{RH-S}$. School-level weights correspond to the sum of $\mathrm{W\_FSTUWT}_{RH-S}$ for each school. Second, due to PISA's design, test scores are estimated as plausible values, where each student has 10 different marks. To address this uncertainty, I apply Rubin's rules for handling multiple imputations (Rubin, 1987) both in computing school averages and in modelling regressions for each plausible value, where I compute adjusted sets of coefficients and standard error estimates and combine them in a final estimate. Finally, due to the stratified multistage sampling design mentioned earlier, I estimate the uncertainty associated with the sampling using PISA's approach, Fay's modification of the balanced repeated replication (BRR) method, which allows computing the sampling variance. Item parameters are estimated through an iterative marginal maximum likelihood approach (Bock & Aitkin, 1981), using the expectation-maximization algorithm provided by the mirt package (Chalmers, 2012) in the statistical software R (R Core Team, 2020), and the statistical analysis was performed using the package BIFIEsurvey (Robitzsch & Oberwimmer, 2015).
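Two of the considerations above reduce to short formulas, sketched here under stated assumptions. The weight scaling assumes that "school size" means the number of sampled students in the school, so that scaled weights sum to that number. The pooling follows the standard form of Rubin's rules, with total variance equal to the average sampling variance plus (1 + 1/M) times the between-imputation variance. Function names and numbers are hypothetical; the study itself relied on the R package BIFIEsurvey, and the BRR step is not reproduced here.

```python
import statistics


def scale_student_weights(weights):
    """Rescale one school's student weights so that they sum to the sample size.

    Assumes 'school size' refers to the number of sampled students.
    """
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]


def pool_rubin(estimates, variances):
    """Combine one estimate and sampling variance per plausible value.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average sampling variance
    between = statistics.variance(estimates)    # variance across plausible values
    total_var = within + (1.0 + 1.0 / m) * between
    return pooled, total_var ** 0.5


# Hypothetical final student weights for one school
scaled = scale_student_weights([10.0, 20.0, 30.0])

# Hypothetical coefficients and sampling variances from three plausible values
coef, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

In an actual analysis `pool_rubin` would receive ten estimates, one per plausible value, with sampling variances obtained from the replicate weights.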

Comparison between two school-level inequality measurements
Comparisons between countries are only feasible if we assume the existence of measurement invariance across countries, which allows further inferential analysis in the same metric. Conditional on the assumption of measurement invariance claimed by PISA (OECD, 2017, p. 342), Table 3 presents the average inequality per country and the inequality coefficient of variation (CV) for both inequality measurements. While Alpha Inequality/Gini aims to assess the level of school-level inequality per country, CV provides a sense of the variability of inequality within the educational system.
Looking at the Alpha Inequality values, countries from Latin America and South Asia such as Peru, China (4 cities), Indonesia, Thailand and Colombia present the lowest values of Alpha Inequality and, at the same time, high values of CV. The opposite occurs with countries such as Iceland, Finland, Estonia, Poland, and Norway, which present Alpha Inequality close to 1 while having low values of CV. This suggests important differences between the two groups of countries. The first group of countries are characterised by educational systems with socioeconomically more homogeneous schools and larger degrees of segregation between schools, dividing poor and rich in different schools. The second group presents relatively smaller socioeconomic differences between schools while having larger within schools' economic diversity. This coincides with recent research focused on the analysis of segregation on different waves of PISA (Gutiérrez et al., 2019). Additionally, Alpha Inequality allows comparisons between countries (Table 4). For instance, Iceland, Kosovo, Moldova, Montenegro, Iceland, New Zealand, and Qatar present more than 35% schools with school-inequality above their national average, while Indonesia, Israel, Peru, China (4 cities) and Thailand only present less than 5% schools above the national average of inequality (see Table 9 in Annex 2). Figure 2 shows the distribution of Alpha Inequality for each school by countries. Alpha Inequality presents different distributions across countries, as could be expected based on prior cross country analysis (Thomas et al., 2001). In some cases, they approximate to Gaussian functions, such as the case of Brazil, Indonesia, and Australia, while in other cases there are bimodal distributions such as in the case of Malta, Macedonia, and Trinidad and Tobago. In many cases, kurtosis and skewness are relevant features to be observed on the distributions and inferential analysis.
On the other hand, the Gini inequality presents, in general, very low coefficients across countries and schools. National averages are in a range between 0.003 and 0.006, and countries such as The Netherlands, Denmark and the Slovak Republic appear with the  of segregation in schooling systems (Gutiérrez et al., 2019). Figure 3 shows school Gini density functions for each country, where in general, they present heavy-tailed distributions. Exceptions of bimodal distributions are Macedonia and Montenegro. Country-level correlations of both inequality measurements present an overall mean of 0.612 (SD 0.131), ranging from 0.105 (Israel) to 0.846 (Qatar) (Table 10 in Annex 2).
To examine the impact of differences between both measurements, I turn to the case of the USA, which has more prior empirical analysis on segregation and inequality. The Gini coefficient does not provide a hint of difference between schools in the top 20% and the bottom 20% of the Gini index in terms of the average of mathematics learning scores per school. This contradicts prior estimations (Rutkowski et al., 2018) as well as cross-country studies that focus on the segregation levels of USA schools and educational scores (Benito et al., 2014; OECD, 2018). In contrast, Fig. 4 shows that schools with lower Alpha Inequality outperform schools with the largest share of inequality by 0.57 standard deviations in average Mathematics scores, with statistically significant differences between groups, t(60.36) = −7.01, p < 0.001. This represents about 2 more years of schooling according to PISA (2009).
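The fractional degrees of freedom in the reported statistic suggest an unequal-variance (Welch) comparison of the two groups of schools, although the text does not name the test. The sketch below shows how such a statistic and its Welch-Satterthwaite degrees of freedom are obtained from two samples. It is a simplified, unweighted illustration with hypothetical school averages, and it ignores plausible values and sampling weights.

```python
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va = statistics.variance(a) / na    # squared standard error of mean(a)
    vb = statistics.variance(b) / nb    # squared standard error of mean(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical school averages, for illustration only
high_inequality = [1.0, 2.0, 3.0, 4.0]
low_inequality = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = welch_t(high_inequality, low_inequality)
```

The sign of the statistic depends on the order of the groups: placing the lower-scoring group first gives a negative value, as in the reported result.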

Models' coefficients
Results from the country-level mixed-effects regression models, with Alpha Inequality as a predictor of the Mathematics score, can be seen in Table 5. I find that 67 out of 69 countries show statistically significant negative parameters, while in the cases of Indonesia and Vietnam the null hypothesis of a parameter equal to 0 cannot be rejected under a standard cut-off of p < 0.05.
On the other hand, Table 6 presents the estimates of the regression parameters using the Gini coefficient for each country. In this case, the number of countries not showing a statistically significant association rises to 5: Estonia, Iceland, Latvia, the United Kingdom, and the United States of America. The case of the United States, as previously discussed, raises concerns about the reliability of the Gini parameter estimate, since it fails to reach statistical significance despite the previous empirical evidence found in the literature. Additionally, Luxembourg is the only case showing a positive coefficient for the slope of school inequality on mathematics scores.

Conclusion
This paper has found that a set of multivariate household possessions collected as categorical data can be used to provide a novel measure of inequality. The proposed inequality measure is independent of the scale of wealth and fulfils the main properties of inequality measures. Alpha Inequality also allows for comparisons between and within countries.
Computing school-level inequality using PISA 2015 data, I find a consistent and significant negative association between school-level inequality and Mathematics scores across countries, the main exception being a majority of European countries. The results also suggest that the inequality measure outperforms the Gini coefficient in assessing the association between school-level inequality and learning outcomes. This is consistent with previous research on the topic that identifies different levels of inequality within and across countries. Where negative effects of inequality are already known, Alpha Inequality is shown to better capture the relevance of socioeconomic disparities between schools in terms of learning scores.
There are important limitations to be acknowledged. First, the improvement of socioeconomic scales such as HOMEPOS, with a focus on updating items to represent wealth in current times, cross-compatibility and model fit, becomes a requisite to apply the measure and study thoroughly the effects of school inequality. Further research could point in different directions, such as the assessment of inequality on cognitive and non-cognitive educational outcomes across different waves of PISA, as well as the interplay between inequality, segregation and educational outcomes.
Second, there is a methodological debate regarding the inclusion of survey design weights in IRT scoring procedures to take account of the complex sampling designs and the nested structure of the item response data of PISA and other ILSAs. This debate involves multilevel item response methods and different weighting strategies (Zheng & Yang, 2016).
Third, alternative methods for scaling sampling weights at both levels were explored (Mang et al., 2021), addressing the complexity of using within and between weights in multilevel clustered analysis. Although the number of statistically significant models varied, similar negative coefficients were found in all cases, and in all cases models with Alpha Inequality predictors were more sensitive than those with Gini. However, in some weighting configurations, large standard errors were found, suggesting model identification or convergence issues. This is relevant because the sample design in PISA is informed by school socioeconomic attributes, and the estimation of parameters, discrimination among them, could be affected by the lack of weights. Further research could address the relevance of weighting IRT models to account for socioeconomic sampling variances. In this case, I mimic the single-level IRT modelling strategy and address the stratified complex sampling design in the multilevel regression analysis by including replicate and scale weights.

Annex 1
Lemma 1. I ϕ satisfies the main properties of an inequality measure.
• I ϕ is continuous on the domain of distributions I.
• I ϕ is invariant to permutations of the measure among students in the same population (anonymity invariance).
• I ϕ is invariant to any multiplication of each student score observation by any positive integer constant. The inequality measure is, therefore, independent of the aggregate level of income (scale invariance).
• I ϕ remains invariant to the size of the population, and therefore, to the replication of observations of the original population (population independence).
• Redistributing benefits from richer to poorer individuals (without individuals' reranking) reduces I ϕ, as the standard deviation at the numerator decreases while the denominator remains unchanged (Pigou-Dalton transfer).
• I ϕ takes a zero value when all individuals rank their health status identically (normalisation).
Proof of Lemma 1 (Continuity) I ϕ1 and I ϕ2 represent two inequality measures. If I ϕ1 ≈ I ϕ2, then they will have very similar inequality values.
(Anonymity) Let x denote any distribution of assets with elements x1, x2, .... As I ϕ(x) depends only on the set x1, x2, ..., any permutation P of the elements of x does not produce changes in I ϕ, so I ϕ(P(x)) = I ϕ(x).
This section suggests that the inequality measure fulfils the main properties customarily deemed desirable for an inequality measure and can, therefore, be accepted as a suitable measurement of inequality.
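The properties listed in Lemma 1 can also be checked numerically. The sketch below does so for a stand-in measure, the coefficient of variation (population standard deviation over the mean). This choice is an assumption made for illustration, prompted by the Pigou-Dalton item's reference to a standard deviation in the numerator; it is not the paper's I ϕ, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def cv(xs):
    """Coefficient of variation: population SD divided by the mean.

    Stand-in for an inequality measure (assumption, not the paper's I_phi).
    Requires a non-zero mean.
    """
    return pstdev(xs) / mean(xs)


def check_properties(xs, scale=3, tol=1e-12):
    """Return a dict of booleans, one per property checked on xs.

    xs must be sorted ascending with at least two distinct values, and the
    gap between the richest and poorest must allow a transfer of one unit
    without re-ranking (assumptions of this sketch).
    """
    base = cv(xs)
    permuted = list(reversed(xs))
    scaled = [scale * x for x in xs]
    replicated = xs + xs
    # Transfer one unit from the richest to the poorest individual.
    transferred = [xs[0] + 1] + list(xs[1:-1]) + [xs[-1] - 1]
    equal = [mean(xs)] * len(xs)
    return {
        "anonymity": abs(cv(permuted) - base) < tol,
        "scale_invariance": abs(cv(scaled) - base) < tol,
        "population_independence": abs(cv(replicated) - base) < tol,
        "pigou_dalton": cv(transferred) < base,
        "normalisation": abs(cv(equal)) < tol,
    }
```

The population standard deviation is used deliberately: with the sample standard deviation, replicating the population would change the value slightly, so population independence would hold only approximately.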