Data Resource Profile: Understanding the patterns and determinants of health in South Asians - South Asia Biobank

Background and aims: This paper describes the data resource profile of South Asia Biobank (SAB), which was set up in South Asia from November 2018 to March 2020, to identify the risk factors and their complex interactions underlying the development of type-2 diabetes mellitus, cardiovascular disease and other chronic diseases in South Asians. Data resource basics: This cross-sectional population-based study has recruited 52713 South Asian adults from 118 surveillance sites at five centres of excellence in South Asia (Bangladesh, North India, South India, Pakistan and Sri Lanka). Structured assessments of participants included six complementary domains: i). Registration and consent; ii). Questionnaire (information on behavioural risk factors, personal and family medical history, medications, socioeconomic status); iii). Physical measurements (height, weight, waist and hip circumference and bio-impedance for body fat composition, blood pressure, cardiac evaluation, retinal photography, respiratory evaluation); iv). Biological samples (blood and urine); v). Physical activity monitoring and vi). Dietary intake by a 24-hour recall. Aliquots of whole blood, serum, plasma, and urine were put into storage at -80{degrees}C for further analysis. Key results: The prevalence of obesity is 6.6% in Bangladesh, 19.7% in India, 33.9% in Pakistan and 15.7% in Sri Lanka. The prevalence of diabetes is 11.5%, 27.7%, 25.3%, and 24.8%, and the prevalence of hypertension is 26.7%, 36.9%, 44.5%, 35.0% in Bangladesh, India, Pakistan and Sri Lanka respectively. Collaboration and data access: SAB is the first comprehensive biobank of South Asian individuals. Collected data are available to the global scientific community upon request.


Data resource basics
Type-2 diabetes mellitus (T2DM) and cardiovascular disease (CVD) are leading and closely interlinked global health challenges 1 . The burdens of T2DM and CVD are especially high in South Asia, one of the most populous and the most densely populated regions of the world 3,4 .
The prevalence of diabetes in South Asia has risen more rapidly than in other large geographic regions, 5 while it is projected that South Asia will account for 40% of the global CVD burden by 2020 6 . In addition, T2DM and CVD develop at an earlier age in South Asians than in their European counterparts 7,8 .
Identification of the primary risk factors for T2DM and CVD is central to the development of effective approaches for the prevention and treatment of chronic diseases such as T2DM and CVC 9 . However, epidemiological data are currently sparse for South Asia, with evidence on the drivers of T2DM and CVD being predominantly based on cross-sectional studies that recorded a narrow range of exposures, and without longitudinal assessments 4,6,7,10 . The few available prospective studies are largely derived from investigations of South Asians residing in Western countries, and are further limited by small sample size and incomplete phenotypic characterisation 4 . To better understand the wide range of exposures that contribute to the development of T2DM and CVD in South Asians, a large-scale population-based study that collects information on demographic, lifestyle, clinical, environmental, and genomic variables is needed.
To address this important need, we have established a unique cross-sectional population study focussed on the South Asian population-South Asia Biobank (SAB). SAB was launched in 2018 as a partnership between collaborating centres in Bangladesh, India, Pakistan, Sri Lanka and the UK 11 . SAB includes rich baseline demographic, lifestyle, clinical, environmental, and phenotypic data, biological samples from more than 50,000 South Asian participants. This resource will enable a broad range of epidemiological research, including the development of All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint prevention and treatment approaches, discovery of novel molecular biomarkers, risk stratification algorithms and innovative therapeutic approaches for better prevention of T2DM and CVD in South Asians. The specific initial objectives of SAB are to: 1. Establish a network of NCD surveillance sites in Bangladesh, India, Pakistan and Sri Lanka, using common protocols and platforms, in partnership with regional centres of excellence in South Asia; 2. Complete structured health assessments on a representative sample of at least 50,000 South Asians aged 18 years and above residing at all surveillance sites; 3. Use the data to identify the genetic and environmental factors underlying noncommunicable diseases in South Asians, and translate the findings into new approaches for maintenance of health and well-being.

Data collected
SAB is a cross-sectional population-based study that recruited participants in five study regions: Bangladesh, South India and North India, Pakistan and Sri Lanka. Health information and biological samples of 52713 South Asians were collected from 118 surveillance sites.
Recruitment started in November 2018 and ended in March 2020 (due to the pandemic of COVID-19).

Inclusion and exclusion criteria
We recruited men and women of self-reported South Asian ethnicity aged 18 years and above.
We excluded women who were currently pregnant, as well as people who were not permanent residents of the surveillance site (residence for 12 months or more). We also excluded people with serious illness expected to reduce life expectancy to less than 12 months, those who plan All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint 6 to leave the surveillance site within the next 12 months, and those unable or unwilling to give informed consent.

Study settings and participants
The surveillance sites at which recruitment occurred in each study region are summarised in Supplementary Table 1 (available as Supplementary data at IJE online). Governmental census data and available household listings were used, together with house-house visits by research teams and local primary care workers, to identify (enumerate) the resident population. In each household, demographic details of the eligible adults were obtained, and all people meeting study entry criteria were invited to take part in. We worked closely with community senior members (e.g. teachers, employers, religious leaders) to support and facilitate engagement in the study. Explanations of the project's purpose were provided in writing and using videos, in relevant South Asian languages, supported by bilingual translators.
By March 2020 we recruited a total of 52853 subjects: 13954 from Bangladesh, 8620 from South India, 9469 from North India, 5875 from Pakistan and 14935 from Sri Lanka.

Measures
Participants were invited to attend the survey sites between 7am and 11am in the fasting state (water only after midnight). Structured assessments of participants were conducted in six complementary domains: i. Registration and consent; ii. Health and lifestyle questionnaire; iii. Physical measurements; iv. Biological samples (blood and spot urine); v. Physical activity monitoring and vi. 24-hour dietary recall. Procedures and training were standardised between countries and surveillance sites.
1. Registration and consent. Written, informed consent was obtained from all participants for data collection, and inclusion in the research. Informed consent included permission for the All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint 7 data and samples collected to be used for chronic disease research, including data sharing with national and international bodies concerned with prevention and control of T2DM and CVD, as well as for molecular epidemiological research. Consent was facilitated using videos (available in major South Asian languages). A unique study ID was allocated to each participant.

2.
Questionnaire. An interviewer-administered health and lifestyle questionnaire was used to collect information on behavioural risk factors (smoking, alcohol use, physical activity and consumption of fruits/vegetables), personal and family medical history, medications, and socio-economic status. The questionnaire was founded on the extended WHO STEPwise approach to Surveillance (STEPS) questionnaire that is widely used in global disease surveillance, which was adapted for use in South Asia context, through incorporating additional questions 12 .
3. Physical measurements. These included: a) Anthropometry (height, weight, waist and hip circumference and bio-impedance for body fat composition); b) Blood pressure by digital device; c) Cardiac evaluation by 12 lead ECG to identify arrhythmia, left ventricular hypertrophy and previous myocardial infarction; d) Retinal photography for assessment of retinal disease, including hypertensive and diabetic retinopathy; and e). Respiratory evaluation by spirometry to assess for smoking/environment-related lung injury.
4. Biological samples. 25ml venous blood was collected using venesection by trained phlebotomists and then distributed into EDTA, serum and citrate vacutainer tubes, and into tubes designed for RNA preservation (Tempus tube). Fasting glucose, and cholesterol were measured by point of care tests. An Oral Glucose Tolerance Test was carried out in a subset All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint 8 of participants, enabling validation of diabetes classification. A spot urine sample (6 ml in 3 aliquots) was also collected for analysis of albuminuria and other biomarkers. Aliquots of whole blood, buffy coat, serum, EDTA plasma, citrate plasma, and urine (Supplementary Table 2, available as Supplementary data at IJE online) were stored at -80℃ for future molecular epidemiological research (including genomics) to investigate the mechanisms underpinning the development of T2DM and CVD, and other complex diseases that are of importance to South Asians (including but not limited to: obesity, cancer, dementia, COPD, chronic kidney disease).
5. Physical activity was also objectively quantified in 100Hz resolution using a wrist-worn triaxial accelerometer, worn on participants' non-dominant wrist for seven days. This device is small, light-weight, wrist-watch shaped, battery-powered and uses triaxial acceleration in gravitational units to infer participant movement. It has been used recently to measure physical activity patterns amongst 100,000 people in the UK Biobank study 13 .
6. Dietary intake was recorded by interviewer-administered computerised 24-hour dietary recall based on the multiple pass method using the Intake24 system (https://intake24.org/). perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint All study participants received a report summarising the clinically relevant results of their health assessment, together with an explanatory booklet and a link to access an explanatory video. Participants identified with significant health conditions (e.g. T2DM, hypertension) had the opportunity to discuss the results with the study team, and to be referred to an appropriate healthcare facility for further assessment, counselling or treatment.

Environmental mapping
In each surveillance site, an environmental mapping exercise was carried out. The aim was to characterise the built environment in terms of retailers and advertisements for food and tobacco and physical activity facilities. The methodologies were adapted from food modules Convention on Tobacco Control 14-16 . In addition to geolocations, the main variables included food (e.g., fruit, vegetables, confectionery), drinks (e.g., soft drinks, sugar-free drinks), and tobacco products (e.g., cigarette, beedi) being sold or advertised. Data collection used KoboToolBox for Android (https://www.kobotoolbox.org) and covered each surveillance site with a 500-meter buffer beyond the site boundary.

Identification of outcomes
The primary outcomes were T2DM and cardiovascular disease. The secondary endpoints included respiratory and chronic kidney diseases, or cancer.

Quality control and data management
The surveillance teams, comprising research assistants, laboratory technicians, physicians and coordinators, were trained to follow standardised protocols (Supplementary Table 3, All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint available as Supplementary data at IJE online) Their training modules included interviewing techniques, ethics and specific instructions for data variables (demographic, socio-economic, food security, behavioural risk factors, medication and lifestyle practices, physical measurement and collection of biological samples).
Revalidation of the research teams in study procedures was done at regular intervals during the study to ensure high-quality data collection that was harmonised across surveillance sites.
Standardised operating procedures were established for all data collection procedures.
Questionnaires were translated into the local languages local to the communities, and back- The data management teams reviewed the data collected routinely for completeness and data quality, including using custom computer scripts to assess for biases in data entry, logical inconsistencies, internal correlations, digit preference, measurement drift or bias between machines and observers. Quality control reports were circulated at weekly intervals between the study investigators, to drive continuous evaluation and improvement in study processes. A random subset comprising up to 2% of the study participants, and/or a subset of biological samples were reassessed to provide additional quality control information. Data collection methods used were "field-friendly", culturally-acceptable and minimally-invasive in order to reduce participant attrition and improve logistical feasibility.

translated. Equipment used for physical and biological measurements is listed in Supplementary
Personal and clinical data were separated by pseudonymisation to enhance data security. All data were encrypted during transmission and stored securely both locally and in a cloud-based infrastructure. Data and all relevant documents will be stored for a minimum of ten years.
Samples collected were split and stored in both South Asia and the UK to ensure long-term (>20 years) sample integrity and preservation. While some laboratory assays on stored samples All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint were done in South Asia, the majority of assays were carried out in the UK or other countries with relevant technologies in the future.

Data resource use
Data collected in this cross-sectional investigation could be used to assess the epidemiology of T2DM, CVD and other chronic diseases in South Asia. The exploration of possible risk factors for T2DM and CVD could provide a scientific basis for evidence-based public health policymaking and interventions. Upon request, the rich resources of SAB are available to researchers from all over the world.

Key results and findings
A total of 52853 participants from the four participant countries took part in SAB; the locations of all surveillance sites are demonstrated in Figure 1. The response rate based on enumerated population in each surveillance site ranged from 17.6% in North India to 72.3 % in Pakistan (see Supplementary Table 5 for more details, available as Supplementary data at IJE online). The demographic structure of the study participants and the comparison with the National Population data in 2015, obtained from the United Nations Population Division, are shown in Table 1 17 . Based on the definition of obesity by WHO, the prevalence of obesity (body mass index ≥30 kg/m 2 ) is 6.6% in Bangladesh, 19.7% in India, 33.9% in Pakistan and 15.7% in Sri Lanka. The prevalence of diabetes, defined as a fasting glucose level >126 mg/dL or a physician-diagnosis, or currently on antidiabetic medications, is 11.5%, 27.7%, 25.3%, and 24.8%, and the prevalence of hypertension, defined as a systolic blood pressure ≥ 140 mmHg, or a diastolic blood pressure ≥ 90 mmHg, or a physician-diagnosis, or currently being on antihypertensive medications, is 26.7%, 36.9%, 44.5%, 35.0% in Bangladesh, India, Pakistan and Sri Lanka respectively. All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020.

Data resource access
Reports and major results of SAB will be released regularly on the SAB website (https://www.ghru-southasia.org/). Subject to data privacy requirements, and the permissions included in the consent form, individual-level data and samples are available for use to approved investigators.

Supplementary Data
Supplementary data are available at IJE online.
All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint  perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted August 14, 2020. ; https://doi.org/10.1101/2020.08.12.20171322 doi: medRxiv preprint