Fine-scale population structure of Malays in Peninsular Malaysia and Singapore and implications for association studies

The fine-scale population structure of the Malays, the major population in Malaysia, has not been well studied, which may have important implications for both evolutionary and medical studies. Here, we investigated the population sub-structure of the Malays using 431 samples collected from all states of Peninsular Malaysia and from Singapore. We identified two major clusters of individuals corresponding to the north and south of Peninsular Malaysia. On an even finer scale, the genetic coordinates of the geographical Malay populations correlate with latitude (R² = 0.3925; P = 0.029). This finding is further supported by the pairwise FST of the Malay sub-populations, among which the north and south regions showed the highest differentiation (FST [North–south] = 0.0011). Taken together, these findings suggest that the Malays are more heterogeneous than previously expected even within a small geographical region, possibly owing to factors such as different genetic origins and geographical isolation, and that this sub-structure could result in spurious associations, as demonstrated in our analysis. We suggest that caution be taken at the study design stage and when interpreting association signals in the disease mapping studies that are expected to be conducted in Malay populations in the near future.


Background
Malaysia, a multi-ethnic, multi-lingual, multi-cultural and multi-religious country, is located at the crossroads of Southeast Asia. It is separated by the South China Sea into two land masses, namely Peninsular Malaysia and East Malaysia (on the island of Borneo). Malaysia has a total population of about 30 million people, of whom approximately 26 million live in Peninsular Malaysia. The Malays are the largest ethnic group in Peninsular Malaysia, making up 63% of the total population, followed by the Chinese, Indians and other minority ethnic groups.
Many Malays are of Malayo-Polynesian (Austronesian) origin and are culturally and historically heterogeneous [1]. The Malays from the west coast of Peninsular Malaysia are historically linked to Sumatera across the Straits of Malacca, while those from the south are thought to have migrated from Jawa, Sulawesi and other parts of Indonesia [2]. The Malays from the north of the Peninsula have a closer affinity to the Malay Muslims of southern Thailand owing to geographical proximity. The history of Singapore was inseparable from that of Peninsular Malaysia from the first century until 1965, when Singapore became an independent republic. Therefore, it is very likely that the Malays of Singapore share a similar history of origin with those from the southern part of Peninsular Malaysia [3].
Earlier studies indicated potential genetic substructure among the different groups of Malays from Peninsular Malaysia [2,4,5], which could be attributed to the migration history of the respective sub-groups. However, the fine-scale sub-structure of the Malay population remains poorly described, particularly because previous studies were based on very limited sample sizes. Such undescribed sub-structure is a potential confounding factor in genetic association studies, in particular genome-wide association studies (GWAS), where it can lead to spurious association signals [6]. As the Malays are one of the major populations in Southeast Asia, characterizing their population substructure is crucial for designing, analyzing and interpreting any genetic association study in this region.
In this study, we show that the genetic diversity and population sub-structure of the Malays from Peninsular Malaysia are correlated with geographical latitude. Notably, the main differentiation was observed between populations from the north and south of Peninsular Malaysia. In addition, simulation analyses revealed that genetic association results are greatly affected by population sub-structure, suggesting that population stratification should be considered at the study design stage and that association signals should be interpreted carefully when mapping complex diseases in Malay populations.

Population substructure
We first compared the genetic diversity of the Peninsular Malays on a global scale with six populations from HapMap3: YRI, CHB, JPT, CEU, MEX and GIH. The PC plot indicated that the Malays clustered closely with the East Asian populations, as expected, and showed rather small genetic diversity. Several Malay individuals from northern Peninsular Malaysia (PMM) showed closer affinity to the South Asian population (GIH) (Fig. 1a). We then performed PCA for the Peninsular and Singapore Malays alone, which revealed a seemingly homogeneous cluster (Additional file 1: Figure S1). However, some differentiation was observed corresponding to three geographical regions (north, center and south), although samples from the center region were scattered across the plot (Fig. 1b). We subsequently excluded the samples from the center region (Pahang and Selangor) and re-ran smartPCA. Two clusters were observed, representing the north and south regions, respectively (Fig. 1c).
In the ADMIXTURE analysis, a significant difference was observed between the Malays from the north and south in the major component, with proportions of 57% and 65% in the north and south, respectively (P < 0.0001; Fig. 2). At K = 3, the newly appearing component (shown in green) was slightly higher in the central Malays than in the south Malays (6.8% vs 3%; P = 0.0415).

Correlation of genetic and geographic coordinates
Given that both PC1 and the ADMIXTURE analysis showed significant differences between northern and southern Malay samples, we investigated whether the genetic diversity among these sub-structures of Malays in Peninsular Malaysia could be attributed to geographical coordinates. Average PC1 values of the southern Malay samples (corresponding to Fig. 1b) were generally less than 0 (except for Johor), whilst all geographically defined northern regions had PC1 > 0 (Fig. 3). When we compared PC1 with the geographical latitude of the sample locations, a significant correlation was observed (R² = 0.3925; P = 0.029; Fig. 4). Peninsular Malaysia is divided into a west coast and an east coast by the Titiwangsa Ranges. We therefore asked whether the genetic diversity could be attributed to geographical longitude as well. The analysis of PC1 against geographical longitude, however, showed no significant correlation (R² = 0.0066; P = 0.7924; Additional file 1: Figure S2). We also evaluated whether the genetic diversity was related to the geographical distance between populations, but found no significant correlation between pairwise FST and geographical distance (R² = 0.01918; P = 0.1385; Additional file 1: Figure S3).

Genetic differentiation between northern and southern Malays
The regional FST values indicated the highest differentiation between the north and the south after 1,000 bootstrap replicates (FST = 0.001; CI = 2.07E-04) (Table 1; Additional file 2: Table S1). To further identify the genomic regions that are highly differentiated between northern and southern Peninsular Malays, we computed FST for the 41,400 SNPs between northern and southern samples, and identified 428 SNPs in the top 1% of the FST distribution (Additional file 2: Table S2), of which 80 (0.1%) had an FST value > 0.05 (Table 2). The SNP with the highest FST value was rs4149264, residing in the candidate gene ABCA1, a major gene responsible for high-density lipoprotein cholesterol (HDL-c) synthesis. Another highly differentiated SNP, rs4148475, is located in the candidate gene ABCC4. This gene is a member of the superfamily of ATP-binding cassette (ABC) transporters, which may play a role in cellular detoxification [7]. The missense variant rs1056836, one of the four highly differentiated SNPs, leads to a change of valine to leucine in the candidate gene CYP1B1 and had a minor allele frequency of 0.48 and 0.19 in northern and southern Peninsular Malays, respectively (FST = 0.2037). This candidate gene is involved in drug metabolism and the synthesis of cholesterol, steroids and lipids, and has been found to play a role in susceptibility to glaucoma [8,9]. We performed an enrichment analysis with DAVID (http://david.abcc.ncifcrf.gov/) on the top 1% highly differentiated variants, but identified no significant enrichment after Benjamini correction (Additional file 2: Table S3).
We observed that 1,666 SNPs had different minor alleles in the north and south Malays, and compared their allele frequencies in the Malays with those in South Asians (GIH) and East Asians (CHB) (Additional file 2). Although not substantial, differences in allele frequencies were observed between South and East Asians, as well as between the Malays and both South and East Asians. Notably, rs1126965, located in the candidate gene SLC39A11, had an alternative allele frequency of 0.8178 in the northern Malays and 0.4965 in the southern Malays. This gene has been reported in relation to liver enzymes and smoking initiation [10,11]. Whether this gene is under positive selection in the Malays, however, remains to be investigated. We subsequently assessed whether these SNPs play a role in phenotypic association, and found that 19 of them are reported in the GWAS catalogue (Additional file 4).
To evaluate the potential effect of population substructure on a disease association study, a series of computer simulations following a case-control GWAS design was carried out with PLINK (Additional file 2: Table S4). The GWAS simulations revealed that the effects on false positive rate and statistical power were greater than expected [12].
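The mechanism behind such inflation can be illustrated with a small Python sketch. This is not the PLINK simulation used in the study: it is a simplified allelic chi-square test on a single SNP with no true disease effect, and the function names, sample sizes and north/south sampling fractions are hypothetical. The two allele frequencies are borrowed from the rs1056836 example above (0.48 in the north, 0.19 in the south), which is an extreme case chosen to make the effect visible.

```python
import random

CHI2_CRIT_1DF_05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def allelic_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def draw_alt_alleles(rng, n_people, freq):
    """Count alternative alleles among 2 * n_people chromosomes."""
    return sum(1 for _ in range(2 * n_people) if rng.random() < freq)


def false_positive_rate(n_cases, n_ctrls, frac_north_cases, frac_north_ctrls,
                        p_north, p_south, n_reps=200, seed=1):
    """Fraction of null replicates declared significant at alpha = 0.05.

    The SNP has no effect on disease status; only the regional make-up of
    the case and control groups differs (hypothetical fractions).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        alt_counts = []
        for n_people, frac_north in ((n_cases, frac_north_cases),
                                     (n_ctrls, frac_north_ctrls)):
            n_north = round(n_people * frac_north)
            alt = (draw_alt_alleles(rng, n_north, p_north)
                   + draw_alt_alleles(rng, n_people - n_north, p_south))
            alt_counts.append(alt)
        stat = allelic_chi2(alt_counts[0], 2 * n_cases,
                            alt_counts[1], 2 * n_ctrls)
        if stat > CHI2_CRIT_1DF_05:
            hits += 1
    return hits / n_reps


# Cases and controls drawn equally from both regions (no stratification).
balanced = false_positive_rate(200, 200, 0.5, 0.5, 0.48, 0.19)
# Cases mostly northern, controls mostly southern (stratified sampling).
stratified = false_positive_rate(200, 200, 0.8, 0.2, 0.48, 0.19)
print(f"balanced: {balanced:.3f}, stratified: {stratified:.3f}")
```

When cases and controls are sampled in the same regional proportions, the null SNP is flagged at roughly the nominal rate; when the proportions differ, the frequency difference between regions is misread as a disease association.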

Discussion
We demonstrated in this study that the Malays from Peninsular Malaysia and Singapore are substructured. Although genetic correlation with geographical latitude had previously been reported in Chinese populations [12,13], it is surprising to find such differentiation among the Malay populations within a region as small as Peninsular Malaysia and Singapore (~800 km from north to south). In addition, the FST between the north and south Malays was similar to that reported between northern and southern Han Chinese (FST = 0.0011) [12] but lower than that within Europeans (FST = 0.0033) [14]. However, we observed higher diversity within the substructures of the Malays. For instance, the FST between two northern Peninsular Malay populations, from Kedah and Kelantan, was 0.017 (Table S4), which is in line with the finding of a recent study [4]. This suggests higher heterogeneity among the Malays than previously expected, possibly due to recent migration and gene flow from the surrounding populations in this region.
The Pahang Malays were found to have a closer affinity to the north, although they were classified as belonging to the central region in this study. This is likely because the samples were collected from Federal Land Development Authority (FELDA) settlers in Pahang state, the majority of whom originated from Kelantan. Separately, Selangor, the most developed and most populated state of Malaysia, is where the metropolitan city of Kuala Lumpur is located. PCA revealed that samples from this population were scattered across both the north and south regions (Fig. 1b & c). We believe that urbanization has likely blurred the boundaries. Similar findings were observed.
Ideally, a panel of ancestry informative markers (AIMs) would be identified to correct for population stratification in future genetic association studies [15]. However, the SNP coverage and the sample size of the current study are insufficient for this purpose. The highly differentiated SNPs between the north and south Malays could be due to genetic drift or, to a lesser extent, natural selection. These SNPs may nevertheless be considered a putative set of AIMs for the Malay populations. The candidate gene ABCA1 plays an important role in high-density lipoprotein cholesterol (HDL-c) synthesis and cholesterol transport [16]. Although we suspect that genetic drift is the likely cause, the reason this gene is highly differentiated between northern and southern Malays remains to be investigated. Caution should nevertheless be taken when positive signals for HDL-c and ABCA1 are identified in genetic association studies of Malays.
We acknowledge several limitations of this study. The numbers of samples collected from several locations were small, which might have produced outliers that confounded the correlation between genetic differentiation and geographical coordinates. In addition, self-reported ancestry might have confounded the analysis when samples were assigned to their respective states of origin. However, the number of samples, covering all states in Peninsular Malaysia (and Singapore), and the number of markers used in our study are larger than in previous reports, and thus provide further insight into the genetic structure of the Malays in Peninsular Malaysia. Notably, we revealed a close relationship between genetic and geographical coordinates in the Malay populations. In addition, our results point to the extent to which admixture in Southeast Asia could affect population stratification and thus genetic association studies. We therefore call for attention to alternative strategies for disease mapping in genetically complex populations, particularly those from Southeast Asia.

Conclusion
In summary, we revealed that the population substructure of the Malays is correlated with latitude. The genetic diversity of the Malays is more heterogeneous than previously expected, and we showed that such population sub-structure, even within a small geographical region, may lead to spurious signals in disease-based genetic association studies. Caution should therefore be taken when designing such studies.

Population and samples
A total of 431 Malay samples were included in this study. These samples were from self-identified Malays from Peninsular Malaysia, 116 of which were genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0, whilst the remaining samples were genotyped with the Illumina 660W (sample size, N = 90) and Illumina Omni Express (N = 119) arrays. An additional 17 Malay samples from Kelantan genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 [17], and 89 samples of metropolitan Malays from Singapore (SGVP) [3], were also included in this analysis. The studies were approved by the research and ethics committees of Universiti Teknologi [...] (Fig. 5), which were divided into three geographical regions for the purpose of this study, namely the North, South and Centre regions, according to their latitude (Table 3). The numbers of samples and their geographical locations are listed in Table 3.

Analysis of population structure
Principal component analysis (PCA) was first carried out using smartPCA in the EIGENSOFT package (version 4.0).

Latitude-PC correlation
Pearson's correlation coefficient was calculated to evaluate the relationship between the genetic coordinates (PC values) and the geographic latitudes.
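As a minimal sketch of this calculation (not the authors' code), the Python below computes Pearson's r and the corresponding R² from paired values. The function name pearson_r and the toy latitude and PC1 values are hypothetical and chosen only for illustration; they are not data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (NOT data from this study): one point per location.
latitudes = [1.5, 2.7, 3.8, 5.4, 6.1]
mean_pc1 = [-0.020, -0.012, 0.001, 0.009, 0.015]

r = pearson_r(latitudes, mean_pc1)
r_squared = r ** 2
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

A P value for the correlation would additionally require a t-test on r with n - 2 degrees of freedom, which is omitted from this sketch.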

Pairwise FST
An unbiased estimate of FST was calculated according to Weir and Hill (2002) [20], with confidence intervals estimated by bootstrapping with 1,000 replicates.
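The general shape of such a calculation can be sketched in Python as follows. Note the assumptions: the sketch substitutes the simpler Hudson ratio-of-averages estimator for the Weir and Hill (2002) estimator used in the study, it assumes that bootstrap replicates are drawn by resampling SNPs with replacement (the text does not state the resampling unit), and it reports a percentile interval. All function names and the toy allele frequencies are hypothetical.

```python
import random


def hudson_terms(p1, n1, p2, n2):
    """Numerator and denominator of Hudson's FST estimator for one SNP.

    p1, p2: sample allele frequencies in the two populations.
    n1, n2: numbers of sampled alleles (chromosomes), each > 1.
    """
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def fst_ratio_of_averages(terms):
    """Combine per-SNP terms as sum(numerators) / sum(denominators)."""
    total_num = sum(num for num, _ in terms)
    total_den = sum(den for _, den in terms)
    return total_num / total_den if total_den > 0 else 0.0


def bootstrap_fst(snps, n_boot=1000, seed=0):
    """Return (estimate, lower, upper) with a 95% percentile interval.

    snps: list of (p1, n1, p2, n2) tuples. SNPs are resampled with
    replacement, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    terms = [hudson_terms(*snp) for snp in snps]
    estimate = fst_ratio_of_averages(terms)
    replicates = sorted(
        fst_ratio_of_averages(rng.choices(terms, k=len(terms)))
        for _ in range(n_boot)
    )
    lower = replicates[int(0.025 * n_boot)]
    upper = replicates[min(n_boot - 1, int(0.975 * n_boot))]
    return estimate, lower, upper


# Hypothetical toy input (NOT data from this study).
toy_snps = [
    (0.48, 200, 0.19, 200),
    (0.30, 200, 0.32, 200),
    (0.10, 200, 0.12, 200),
    (0.55, 200, 0.50, 200),
]
fst, ci_low, ci_high = bootstrap_fst(toy_snps, n_boot=1000, seed=0)
print(f"FST = {fst:.4f} (95% interval {ci_low:.4f} to {ci_high:.4f})")
```

Summing numerators and denominators before dividing, rather than averaging per-SNP ratios, keeps SNPs with low heterozygosity from dominating the estimate.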

Additional files
Additional file 1: Figure S1. PCA plot including samples from the 11 states of Peninsular Malaysia. Figure S2. Correlation between PC1 and longitude. Figure S3. Correlation between geographical distance and FST.
Additional file 2: Table S1. Regional pairwise FST of the populations collected from different states in Malaysia. Table S2. Top 1% SNPs that are highly differentiated between the Malays from the northern and southern regions of Peninsular Malaysia. Table S3. Gene ontology and enrichment analysis of the candidate genes underlying the top 0.1% highly differentiated SNPs between the Malays from the north and south of Peninsular Malaysia. Table S4. Simulation analysis for GWAS before and after removing the 1,666 SNPs with different minor alleles between the north and south Malays.
Additional file 3: SNPs with different minor alleles between northern and southern Malays.

Author details