COVID-19 crisis in Cambodia: A dataset containing linked survey and administrative data of ninth-graders in rural areas

Between 2019 and 2021, we collected detailed administrative data on grade 9 students from 54 lower-secondary schools in rural Cambodia. In July and August 2020, we also collected phone-survey data from these students to understand the implications of the nationwide lockdown that was implemented in March 2020 to curb the spread of COVID-19. The administrative data contain information on students' grades, school characteristics, and teacher characteristics from the school year 2019–2020, as well as information about the students' enrollment in high school after the end of the school year (January 2021). This information is available for 3258 students. The phone-survey data contain information on students' socio-economic background, parental education and occupation, as well as students' study behavior, time use, educational aspirations and expectations, and perceptions of the COVID-19 crisis. This information is available for 2197 students.


Specifications
Subject: Microeconomics, Economics of Education
Specific subject area: Students' study behavior, time use and perceptions of COVID-19, and their relationship with parental background, experience of economic hardships in the family, and academic performance.
Type of data: Stata data file (.dta); Comma delimited file (.csv)
How data were acquired: Phone-survey and administrative data, both collected by research assistants. Instruments: Open Data Kit (survey) and Excel (administrative data). Questionnaire available in the supplementary material.
Data format: Raw data; Coded data
Parameters for data collection: Administrative data were collected from grade 9 students of 54 lower-secondary schools across four provinces in Northwest Cambodia. Sample schools were eligible for high school scholarships provided by Child's Dream, an international NGO (65% of schools), or similar in characteristics to these schools and from neighboring districts (35% of schools). For the phone survey, one or two grade 9 classes per school were randomly selected, and all students of these classes were contacted. Students were asked for informed consent before the interview.
Description of data collection: Administrative data were collected on a monthly basis by a total of three local research assistants, who contacted school principals and class teachers during a period of 18 months. Phone-survey data were collected between July and August 2020. Eligible students were called by local interviewers (under the supervision of two team leaders, one supervisor, and the principal investigators

Value of the Data
• The main strength of this dataset is its combination of phone-survey data collected during the nationwide COVID-19 lockdown (July-August 2020) with detailed administrative data collected before and after the phone survey. The dataset contains comprehensive information about students' socio-economic characteristics, parental occupation before and during the COVID-19 pandemic, and students' perceptions of the pandemic.
• Researchers, practitioners, and education officials can use the data to understand the implications of the COVID-19 pandemic for students in low-income settings, and to link students' experiences and perceptions of the crisis to their academic performance before and after the pandemic.
• Since the dataset is linked at the individual level with administrative information on academic performance and on school, teacher, and class characteristics, it can also be used to study shock experiences at the group versus individual level.

Data Description
The data contain information about grade 9 students from rural Northwest Cambodia, collected between 2019 and 2021. The data collection was designed to understand the implications of the nationwide lockdown (implemented in March 2020 to curb the spread of COVID-19) for students' school performance, as well as for their educational and career aspirations. The first source of data is administrative data that cover the entire (prolonged) school year (November 2019 to November 2020) and the transition to high school in January 2021. The second source is data from a phone survey that was conducted during July and August 2020 with a subset of these students.
The dataset is available in Stata format and in an open format (CSV), and is complemented by the code to replicate Gehrke et al. [1] (Stata do-file), as well as the phone-survey questionnaire.

Administrative data
The variables in the administrative data are summarized in Tables 1-3. This dataset contains information on students' gender and age, and on whether she or he is a class leader (maximum of three per class), as summarized in Table 1. In addition, it contains monthly information on grades in the subjects Khmer, Math, and English, on the total grade over all subjects, and on days absent, covering the period from November 2019 until March 2020. The maximum points for Khmer, Math, and English are 100, 100, and 50, respectively. The total grade was standardized within each class by the authors, as the number of subjects included in that total grade varies across schools.
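The within-class standardization of the total grade can be illustrated with a minimal sketch. The field names `class_id` and `total` are hypothetical and not taken from the dataset, and the use of the population standard deviation is an assumption, since the text does not say which formula the authors applied.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_class(records):
    """Return one z-score per record, computed separately within each class."""
    totals_by_class = defaultdict(list)
    for record in records:
        totals_by_class[record["class_id"]].append(record["total"])

    # Class-level mean and (population) standard deviation of the total grade.
    stats = {c: (mean(v), pstdev(v)) for c, v in totals_by_class.items()}

    z_scores = []
    for record in records:
        class_mean, class_sd = stats[record["class_id"]]
        # Guard against classes in which every student has the same total.
        z = (record["total"] - class_mean) / class_sd if class_sd > 0 else 0.0
        z_scores.append(z)
    return z_scores


# Made-up totals: class B sums over more subjects than class A.
records = [
    {"class_id": "A", "total": 300.0},
    {"class_id": "A", "total": 340.0},
    {"class_id": "A", "total": 380.0},
    {"class_id": "B", "total": 500.0},
    {"class_id": "B", "total": 700.0},
]
z_scores = standardize_within_class(records)
```

After standardization, a student's score expresses the distance from the class mean in class-level standard deviations, so totals based on different numbers of subjects become comparable across classes.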
In addition to monthly grades, this dataset contains students' grades in the first midterm exam (Semester 1 exam) and in the lower-secondary graduation exam (final exam). The midterm exams usually determine whether the student will be allowed to participate in the final exam. This requirement was relaxed in 2020 due to the COVID-19 crisis. In some schools, the first midterm exam was conducted just before the lockdown (as indicated by the variable "Semester 1 exam was conducted before school closure"); in others, this exam could no longer be conducted. In that case, the Semester 1 grade is the average of the months December, January, and February. The final exam is a standardized, national exam consisting of 11 subjects. The total grade of the final exam is not standardized by the researchers, as its content is comparable across schools. Students pass the exam with 260 points and above. We also created dummy variables that indicate whether a particular student participated in the final exam (2864 students did so), whether the student ranked among the top 20%, 15%, 10%, or 5% of students in the final exam (the means are somewhat lower than the respective percentiles because this variable is coded as zero for students who did not participate in the final exam), whether the student passed the final exam (i.e. obtained more than 260 points), and finally whether the student dropped out during school closure, which is defined as one if the student did not participate in the final exam. The variable indicating whether the student transitioned to high school has fewer observations than the overall sample, as this information was collected from lower-secondary teachers (and in some cases students' classmates) after the new school year started in 2021. Teachers or classmates could not always say with certainty whether a particular student had indeed enrolled in high school.
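The logic of the exam dummies can be sketched as follows. This is only an illustration with made-up scores and hypothetical names. It assumes that exactly 260 points counts as a pass and that the top shares are ranked among exam participants only; neither detail is confirmed by the text.

```python
def exam_indicators(final_scores, top_share=0.20, pass_mark=260):
    """Build 0/1 indicators from final exam points (None = exam not taken)."""
    participants = [p for p in final_scores.values() if p is not None]
    n = len(participants)
    rows = {}
    for student, points in final_scores.items():
        took = points is not None
        # Number of participants who scored strictly higher than this student.
        higher = sum(1 for p in participants if took and p > points)
        rows[student] = {
            "took_final": int(took),
            "dropped_out": int(not took),  # defined as not taking the final exam
            "passed": int(took and points >= pass_mark),
            # Coded as zero for non-participants, as described in the text.
            "top_share": int(took and higher < top_share * n),
        }
    return rows


# Made-up scores for six students; s6 did not take the final exam.
scores = {"s1": 400, "s2": 350, "s3": 300, "s4": 250, "s5": 200, "s6": None}
indicators = exam_indicators(scores)
top_mean = sum(row["top_share"] for row in indicators.values()) / len(indicators)
```

In this toy example, one of five participants is in the top 20%, but the mean of the indicator over all six students is 1/6, below 0.20. This is the mechanism behind the remark that the means are somewhat lower than the respective percentiles.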
The administrative data also include information on the location of the students' homes (village name). We used this information to calculate the number of students living within a radius of 1 km of each other, as well as the geodesic distance from the student's home village to the school, to the district capital, to the province capital, and to the (Thai) border. For privacy reasons, we do not disclose the location of the students' villages or of the schools, but only the distances between them (see Table 2).
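The distance variables can be illustrated with a short sketch. It uses the haversine great-circle formula as a simple stand-in for a true geodesic (ellipsoidal) distance, which is an assumption rather than the authors' documented method. The coordinates are made up, since the dataset does not disclose locations.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def neighbours_within(points, radius_km=1.0):
    """For each point, count the other points located within radius_km."""
    counts = []
    for i, (lat_i, lon_i) in enumerate(points):
        counts.append(sum(
            1 for j, (lat_j, lon_j) in enumerate(points)
            if j != i and haversine_km(lat_i, lon_i, lat_j, lon_j) <= radius_km
        ))
    return counts


# Hypothetical student locations (latitude, longitude), not from the dataset.
homes = [(13.100, 103.200), (13.105, 103.200), (13.200, 103.200)]
distance_first_two = haversine_km(*homes[0], *homes[1])
neighbour_counts = neighbours_within(homes)
```

Over distances of a few kilometers, the great-circle approximation differs from an ellipsoidal geodesic by well under one percent, so it is adequate for illustrating how such variables are built.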
In terms of school and class characteristics, the dataset contains the class size (as of February 2020), the number of grade 9 classes at the lower-secondary school, the dropout rate in grade 9 in previous years, and the share of grade 9 students who transitioned to high school in previous years (all obtained from the school's principal). It also contains the class teacher's age, gender, and experience at this particular school, whether the class teacher has a university degree, and the distance at which the teacher lives from the school (all collected from the class teacher). The variables are summarized in Table 3. Finally, this dataset contains the number of flood days experienced at the school during the first week that students were allowed to return to school after the first lockdown (September 7 to 13, 2020), as well as during the entire period between school reopening and completion of the final exam (10 weeks), as a severe flood hit the region right at this time (also summarized in Table 3). To construct these variables, we collected daily flood maps from the MODIS Near Real-Time Global Flood Mapping Project [2] and constructed two sets of variables. The first-school-week variables range from 0 to 7 and indicate the number of days in that first school week on which a flooded area was detected within a 5, 10, or 30 km radius around the school. The variables for the total period range from 0 to 70 and indicate the total number of days in that entire 10-week period on which a flooded area was detected within a 5, 10, or 30 km radius around the school.
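The counting step behind the flood-day variables can be sketched as below. The input format is a hypothetical preprocessing result (for each day, the distance from the school to the nearest detected flooded area), not the authors' actual pipeline, and the distances are made up.

```python
def flood_days(nearest_flood_km, radii_km=(5, 10, 30)):
    """Count, per radius, the days with a detected flooded area within that radius.

    nearest_flood_km holds one entry per day: the distance in km from the school
    to the nearest detected flooded area, or None if no flood was detected.
    """
    return {
        radius: sum(1 for d in nearest_flood_km if d is not None and d <= radius)
        for radius in radii_km
    }


# Made-up distances for the seven days of the first school week.
first_week = [None, 25.0, 8.0, 4.0, 4.5, 12.0, None]
first_week_counts = flood_days(first_week)

# The same function applied to a 70-day list yields the total-period variables.
total_period = first_week + [None] * 63
total_counts = flood_days(total_period)
```

By construction, the counts are nested: a day counted for the 5 km radius is also counted for the 10 km and 30 km radii, so the variables are weakly increasing in the radius.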

Phone-survey data
In the phone survey, we collected information on student and family characteristics, on students' study behavior, time use, educational aspirations and expectations, as well as COVID-19 perceptions. These variables are summarized in Table 4, as well as in Figs. 1-6.
In terms of student and family characteristics, we collected the age and gender of the student and whether she or he has a smartphone, as well as the migration incidence in the family (see Table 4). We also collected information about parental education and parental occupation, as summarized in Figs. 1 and 2.
Further, we inquired about students' aspirations and expectations regarding the highest level of schooling, and what kind of job students would like to do as adults (open ended). For educational aspirations and expectations, the dataset contains the original variables (with categorical answers, see Fig. 3) as well as constructed variables that translate the categories into years of schooling (see Table 4). The likelihood variables allow students to state their confidence in achieving their aspired educational level or career on a scale from 0 (not at all confident) to 10 (extremely confident).
We also collected students' expectations about the costs of high school, what students expect the most expensive item to be, the cost of that item, and whether they had already applied for a high school scholarship (see Table 4). We then asked about students' motivation to return to school after the lockdown. Information about students' study behavior during the lockdown includes the regularity of contact with the teacher and the types of remote learning activities in which the student was involved (see Fig. 4). Time-use information covers whether the student studied in the last 7 days, whether the student worked in the last 7 days (and if yes, how many hours), and what the student's main activity was in the last 7 days (see Table 4).
In terms of COVID-19 effects, we collected information about whether one (which one) or both parents lost their job due to COVID-19, or whether one had lower income (see Table 4). In case one or both parents lost their job, we also asked about the pre-crisis occupation of the parent. These questions are similar to what has been collected in other COVID-19 related phone surveys [3]. We used this information in three ways. First, from the survey answers, we constructed the pre-crisis occupation of each parent (initial job) according to the ISIC Rev.4 classification [4]. Second, from the income/job loss questions, we calculated dummy variables (mother/father experienced income losses/changed job(s)) that equal one if the respective condition applies (see Table 4). Third, we calculated the probability of experiencing income losses per sector of occupation, and assigned this value to each parent based on her/his initial occupation, as described in Gehrke et al. [1]. The data also contain information about whether a migrating family member was forced to return to Cambodia, whether students had to change their working situation, and whether students adjusted their educational expectations due to COVID-19 (see Fig. 5).
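The third step can be illustrated with a minimal sketch that computes the share of parents with income losses within each sector and assigns that share back to every parent in the sector. The exact procedure is described in Gehrke et al. [1] and may differ; the field names and records below are hypothetical. The sector codes use ISIC Rev.4 section letters (A for agriculture, forestry and fishing; F for construction).

```python
from collections import defaultdict


def sector_loss_probability(parents):
    """Share of parents reporting income losses, by sector of initial occupation."""
    counts = defaultdict(lambda: [0, 0])  # sector -> [parents with loss, all parents]
    for parent in parents:
        counts[parent["sector"]][0] += parent["income_loss"]
        counts[parent["sector"]][1] += 1
    return {sector: losses / total for sector, (losses, total) in counts.items()}


# Made-up records: one row per parent with a 0/1 income-loss dummy.
parents = [
    {"sector": "A", "income_loss": 1},
    {"sector": "A", "income_loss": 0},
    {"sector": "A", "income_loss": 0},
    {"sector": "A", "income_loss": 0},
    {"sector": "F", "income_loss": 1},
    {"sector": "F", "income_loss": 1},
]
probabilities = sector_loss_probability(parents)

# Each parent receives the probability of her or his initial sector.
assigned = [probabilities[p["sector"]] for p in parents]
```

A sector-level measure of this kind varies only across sectors, not across parents within a sector, which distinguishes it from the individual income-loss dummies.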
In addition, the data include information about students' perceptions of the crisis evaluated on a 4-point Likert scale, as summarized in Fig. 6. We constructed three indices from these statements by averaging over the level of agreement with the respective sub-questions. The financial worries index is the average agreement with statements A, G, and J. The job prospects index is the average of statements B (reversed), D, and I. The studying index is the average of statements C, F (reversed), and H.
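The index construction can be sketched as follows. The statement letters and the reversed items follow the description above. The numeric coding of the answers from 1 to 4, and hence the reversal rule of 5 minus the answer, is an assumption, and the example answers are made up.

```python
SCALE_MIN, SCALE_MAX = 1, 4  # assumed coding of the 4-point Likert scale

# Each index lists its statements as (letter, reversed?) pairs.
INDEX_ITEMS = {
    "financial_worries": [("A", False), ("G", False), ("J", False)],
    "job_prospects": [("B", True), ("D", False), ("I", False)],
    "studying": [("C", False), ("F", True), ("H", False)],
}


def perception_indices(answers):
    """Average the (possibly reversed) answers of the statements in each index."""
    indices = {}
    for name, items in INDEX_ITEMS.items():
        values = [
            (SCALE_MIN + SCALE_MAX - answers[letter]) if is_reversed else answers[letter]
            for letter, is_reversed in items
        ]
        indices[name] = sum(values) / len(values)
    return indices


# Made-up answers of one student to the statements used in the indices.
answers = {"A": 4, "B": 1, "C": 2, "D": 3, "F": 4, "G": 3, "H": 2, "I": 4, "J": 2}
indices = perception_indices(answers)
```

Reversing an item before averaging ensures that a higher index value has the same direction for all statements that enter it.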
Finally, the data contain information entered by the interviewer concerning the quality of the call, whether the student was motivated during the call, and whether she or he seemed to have understood the questions.

Other variables
A few additional variables are included in the data for replication purposes (see Table 5). The variable "priority class" equals one if a student's class was (randomly) selected to receive the intervention described in Gehrke et al. [5], or was selected to serve as a control class. "Student participated in RCT" equals one if the student was present on the day of the intervention. "Unique combination of paternal/maternal occupations" are categorical variables used to cluster standard errors in Gehrke et al. [1].

Sample
The sample consists of grade 9 students from rural lower-secondary schools in Northwest Cambodia. Grade 9 students are usually about 15 years old and in their final year of compulsory schooling. To select the schools, we collaborated with Child's Dream, an NGO that offers high-school scholarships in the study area. We received the contact information of the schools that Child's Dream was working with (i.e. schools whose students were eligible for their scholarships) and, of those, sampled all schools with more than 30 students in grade 9 (resulting in a sample of 39 schools across 4 provinces: Banteay Meanchey, Battambang, Oddar Meanchey, and Siem Reap). To increase the sample size, we added 21 schools from other districts in the same provinces. These schools have similar characteristics to the schools that partner with Child's Dream. We were able to obtain students' administrative records and school characteristics from 54 of the 60 sampled schools; in six schools, the principals were not willing or able to share this information. Within the sample schools, we then aimed at collecting information from all grade 9 students in at least one class. Some schools have more than one grade 9 class; in those, we randomly selected one class (in some cases two) to be part of our sample.
This sample of students is not representative of students at lower-secondary schools in rural Cambodia. However, at least in terms of school characteristics, our data are broadly comparable to the average school in rural areas of the country, as discussed in Gehrke et al. [1] .

Timeline
Prior to our study, we contacted the principals of all schools and asked for permission to conduct a study at their school. If the principal was willing to participate, we collected a few school characteristics from her or him, as well as the contact details of the class teacher of the selected class.
The administrative data were collected from the class teacher on a monthly basis from November 2019 onward. Around this time, we randomly assigned half of the schools to be part of an educational intervention that was conducted between February and March 2020 [5].
On March 16, 2020, a national lockdown was announced to slow the spread of the COVID-19 virus. At that time, we had successfully visited and conducted a half-day workshop at 18 schools. Schools were closed until further notice, and the government quickly set up a remote learning system that encouraged students to keep studying during the school closure via TV programs and teacher assignments.
A few months into the first lockdown, we conducted a phone survey with students in our sample. Our target sample comprised all students from the administrative sample who were still participating in exams just before the first lockdown (n = 3258). Participation in the phone survey was entirely voluntary, and we explained the aim of the project and asked students for consent to participate in the study before starting the interview. We were able to reach 2197 students (response rate = 67%).
Shortly after the phone survey was completed, schools were reopened for grade 9 students for a period of two months (between September 2020 and November 2020) to allow students to prepare for their final exam, which determines admission to high school. Students had about 8 weeks to prepare for the final exam, which took place in November 2020 and was spread over a period of two weeks. A few weeks after the final exam was completed, the Prime Minister announced that all students who had participated in the final exam were allowed to transition to high school irrespective of whether they had obtained more than 260 points [6]. The new school year started in January 2021; however, schools were closed again after a few weeks as a new lockdown was imposed.
Between September 2020 and July 2021, we continued to collect administrative data to learn about students' performance in the final exam and, after the new school year started, whether students had enrolled in high school. The process of data collection is also detailed in Gehrke et al. [7].

Ethics Statement
This data collection obtained ethical approval from the Ethics Committee of the University of Göttingen (IRB approval date 11 February 2020), as well as from the Social Sciences Ethics Committee at Wageningen University and Research (IRB approval date 25 May 2020). Participants gave consent to the interview over the phone.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Data Availability
COVID-19 crisis in Cambodia: A dataset containing linked survey and administrative data of ninth-graders in rural areas (Original data) (Mendeley Data).