Analysis of sea water quality data in Jakarta Bay using the Principal Component Analysis (PCA) method during the 2012 transitional monsoon

Drawing a conclusion from a data matrix of 3 individuals and 2 variables is relatively easy, but a large data set is much harder to interpret and calls for data analysis methods that give a simpler representation. The sea water quality data for Jakarta Bay from BPLHD DKI Jakarta (Jakarta Environmental Management Board) cover 24 biological, physical, and chemical parameters at 23 stations. Based on the quality and quantity of the data, we use only one data set, from October 2012, as representative of the second transition monsoon. The analysis was conducted for 10 parameters, namely turbidity, total suspended solids (TSS), temperature, pH, salinity, dissolved oxygen (DO), biological oxygen demand (BOD), methylene blue active substances (MBAS), phenol, and zinc (Zn), at 23 stations. In this paper, we apply principal component analysis (PCA) to draw conclusions from these data. PCA is used to analyse the data matrix in terms of similarity between stations and correlation between parameters. The PCA yields four principal components: PC 1 (27.73% of the variance) is mainly related to TSS, temperature, salinity, and DO; PC 2 (16.33% of the variance) is mainly related to BOD; PC 3 (12.39% of the variance) is mainly related to MBAS, phenol, and zinc; and PC 4 (11.09% of the variance) is mainly related to turbidity.


Introduction
Jakarta Bay is a semi-enclosed bay facing the capital city of Indonesia, the Jakarta Greater Area, on the northern coast of Java Island. The bay is characterized by shallow water, receives the discharge of 13 river systems, and is connected to the Java Sea [1]. Jakarta is one of the fast-growing Asian megacities. The pollution of the waters around the bay has a long history and has been investigated by many researchers. Suwandana et al., summarizing several studies, reported that rapid development and population growth have introduced environmental issues such as urban waste, air and river water pollution, coastal pollution, and drinking water shortage [2]. In recent years, red tides, fish mortality, and oxygen-depleted water masses have occurred frequently in Jakarta Bay [3]. The latest issue concerns microplastic marine debris [4].
Numerous water quality indicators, i.e. physical, chemical, and biological parameters, are available to determine the level of pollution in Jakarta Bay. Temperature, turbidity, and total suspended solids (TSS) can be classified as physical indicators. Salinity, pH, biological oxygen demand (BOD), dissolved oxygen (DO), and the metal parameters can be classified as chemical indicators. Biological parameters are fewer; the common measures are Escherichia coli and fecal coliform.
Polluted coastal water and the processes behind it can be evaluated with several approaches, including field observation [2,5], satellite data, and numerical simulation [1,3]. In this research we apply principal component analysis (PCA), a statistical technique that reduces a large set of parameters (variables) to a smaller set of new, mutually uncorrelated variables while retaining most of the information in the original data set. The method can be applied to a data matrix to examine the similarity between stations and the relationships between parameters, so that we can analyze the contribution of specific parameters at selected stations.

Materials and Method
In this paper, we analyse the water quality data for Jakarta Bay provided by BPLHD Jakarta Province (Regional Environmental Management Agency) [6]. Every year BPLHD observes and records sea water quality in Jakarta Bay and in the estuaries surrounding the bay. Observations are made 2 or 3 times a year to cover the dry, transition, and rainy seasons. For example, in 2012 there were 3 observations: July to represent the dry season, October to represent the second transition season, and December to represent the rainy season. We selected the data based on their quality and quantity, and found the best data set to be that of October 2012, which represents the second transition monsoon. The monitoring indicators include 3 physical, 19 chemical, and 2 biological parameters. Water samples were taken at 23 fixed sampling stations in Jakarta Bay, namely A1-A7, B1-B7, C2-C5, and D3-D6. The locations of all sampling stations are shown in Figure 1, and their longitude and latitude are listed in Table 1. For October 2012, of the 24 monitored parameters, only 3 physical and 7 chemical parameters could be analyzed; parameters with many missing (not available) values were excluded. We therefore selected a total of 10 parameters collected at the 23 sampling stations for analysis. The selected 10 parameters are listed in Table 2.
The water quality data are analyzed by the principal component analysis (PCA) method to reduce the number of variables (parameters) [7] from 10 variables to 4 new variables (4 principal components). The new variables represent all the original variables as water quality indicators; they can be regarded as principal indicators. All PCA calculations were performed using MATLAB R2014a. In addition, we compare the parameters in each principal component with the water quality standards No. 51/2004 published by the Minister of Environment.
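For orientation, the reduction described above can be written in the usual textbook form of PCA on standardized data. The notation here is an assumption introduced for illustration and does not come from the original analysis: x_{ij} is the value of parameter j at station i, n = 23 stations, p = 10 parameters, and m = 4 retained components.

```latex
\begin{align}
z_{ij} &= \frac{x_{ij} - \bar{x}_j}{s_j}, & i &= 1,\dots,n,\quad j = 1,\dots,p \\
R &= \frac{1}{n-1}\, Z^{\top} Z, & Z &= (z_{ij}) \\
R\, v_k &= \lambda_k v_k, & \lambda_1 &\ge \lambda_2 \ge \dots \ge \lambda_p \\
t_{ik} &= \sum_{j=1}^{p} z_{ij}\, v_{jk}, & k &= 1,\dots,m
\end{align}
```

Here \bar{x}_j and s_j are the sample mean and standard deviation of parameter j, R is the correlation matrix, v_k and \lambda_k are its eigenvectors and eigenvalues, and t_{ik} is the score of station i on principal component k.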

Results and Discussion
The first step before PCA is to standardize the data, because the parameters are measured in different units; the covariance of the standardized data is then calculated [8], as shown in Table 3. Because the data are standardized, this covariance matrix is the correlation matrix of the parameters. Table 3 shows that the highest covariance between parameters is that between turbidity and total suspended solids (TSS), i.e. 0.655. This value indicates that the two parameters are correlated. PCA produces new variables that are not correlated with one another but still describe the original data. The original data can be transformed into uncorrelated (independent) variables using the eigenvectors of the correlation matrix. The eigenvectors are determined together with the eigenvalues, which can be found by eigenvalue decomposition (EVD) or singular value decomposition (SVD) [9]. In this research we use EVD; Figure 2 shows the decomposition matrix (D), sorted from the largest to the smallest value.
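The standardization and covariance step can be illustrated with a minimal Python sketch. The original analysis was performed in MATLAB R2014a; the code below is not that implementation, and the numbers are placeholder values chosen for illustration, not the BPLHD measurements. The function names are likewise hypothetical.

```python
import statistics


def standardize(columns):
    """Z-score each column: subtract its mean and divide by its sample std."""
    standardized = []
    for col in columns:
        mean = statistics.mean(col)
        std = statistics.stdev(col)  # sample standard deviation (divides by n - 1)
        standardized.append([(x - mean) / std for x in col])
    return standardized


def covariance_matrix(columns):
    """Sample covariance matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [statistics.mean(col) for col in columns]
    return [
        [
            sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
            for cj, mj in zip(columns, means)
        ]
        for ci, mi in zip(columns, means)
    ]


# Placeholder values for five hypothetical stations (NOT the study data).
data = {
    "turbidity_NTU": [2.0, 3.5, 6.1, 4.2, 1.8],
    "TSS_mg_per_L": [3.0, 4.1, 7.9, 5.5, 2.6],
    "temperature_C": [29.5, 30.2, 30.8, 29.9, 30.4],
}

names = list(data)
z = standardize(list(data.values()))
corr = covariance_matrix(z)  # covariance of z-scores = correlation matrix

for name, row in zip(names, corr):
    print(name, [round(value, 3) for value in row])
```

After standardization every parameter has unit variance, so the diagonal of the matrix is 1 and the off-diagonal entries are correlation coefficients, which is why parameters with different units become comparable.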
Based on the decomposition matrix (Figure 2), there are 4 principal components with eigenvalues > 1, shown in the first to fourth matrix columns. The eigenvalues for PC 1, PC 2, PC 3, and PC 4 are 2.77, 1.63, 1.23, and 1.11 respectively; the variance and cumulative variance of the principal components are given in Table 4 and Figure 3. Based on Table 4, PC 1 to PC 4 together explain about 67.54% of the variance in the data. The loadings of the four principal components are shown in Table 5. PC 1 (27.73% of the variance) is mainly related to TSS, temperature, salinity, and DO. PC 2 (16.33% of the variance) is mainly related to BOD. PC 3 (12.39% of the variance) is mainly related to MBAS, phenol, and zinc. PC 4 (11.09% of the variance) is mainly related to turbidity.
The score matrix provides information about the distribution of contamination patterns or sources among samples, and the loading matrix defines the contribution of the original variables to each of these patterns or sources [10]. We take loading values > 0.4 to identify the variables that affect sea water quality in Jakarta Bay. The results of the PCA are summarized in Table 6.
The principal component score plot describes the different characteristics of the stations and reveals the correlation between parameters through the parameter lines [10], as shown in Figure 4. The red dots denote the stations and the blue lines denote the parameters. Each of the ten parameters is represented by a vector that describes its contribution to the two principal components. For example, in Figure 4, PC 1 has positive coefficients for eight parameters and negative coefficients for two parameters (salinity and phenol). The highest positive coefficients in PC 1 are those of TSS, temperature, and DO, and the largest negative coefficient is that of salinity. This interpretation corresponds to the values in Table 5.
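The link between the eigenvalues and the variance percentages can be checked with a short Python sketch. It assumes, as is standard for PCA on a correlation matrix, that the total variance equals the number of standardized variables (10 here). The eigenvalues are the rounded figures quoted in the text, so the results agree with Table 4 only to within about 0.1 percentage point; the function name is hypothetical.

```python
def explained_variance(eigenvalues, total_variance):
    """Return per-component and cumulative variance shares in percent."""
    shares = [100.0 * value / total_variance for value in eigenvalues]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative


# Rounded eigenvalues quoted in the text for PC 1 to PC 4.
eigenvalues = [2.77, 1.63, 1.23, 1.11]
n_variables = 10  # trace of a 10 x 10 correlation matrix

# Retention rule used in the text: keep components with eigenvalue > 1.
retained = [value for value in eigenvalues if value > 1.0]

shares, cumulative = explained_variance(retained, n_variables)
for k, (share, total) in enumerate(zip(shares, cumulative), start=1):
    print(f"PC {k}: {share:.1f}% (cumulative {total:.1f}%)")
```

With these rounded inputs the four components account for roughly 67.4% of the variance, close to the 67.54% reported from the unrounded eigenvalues.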
If two parameter lines lie close together, the correlation between the parameters is strong. For example, in Figure 4 the TSS and turbidity lines overlap; to see them more clearly, we created a three-dimensional score plot, shown in Figure 5. In Figure 5 the TSS and turbidity lines do not overlap but lie close together, consistent with TSS and turbidity having the highest positive correlation coefficient (0.655). If the angle between two parameter lines is about 90°, there is little or no correlation between them, whereas lines pointing in opposing directions indicate a negative correlation, as for temperature and salinity (-0.515).
From the results above, we first consider PC 1 (TSS, temperature, salinity, and DO) and its effect on the water quality of Jakarta Bay. Figure 6 shows that TSS, salinity, and DO at all stations still meet the quality standards for marine biota and marine tourism, but the temperature at nine stations is not suitable for coral and seagrass (temperature > 30°C). TSS is generally less than 5 mg/L, except at stations B1, B2, C2, and D3 (Figure 6a). Salinity is below the quality standard at all stations (Figure 6c). DO ranges between 6 and 8 mg/L (Figure 6d).
Second, we consider PC 2 (BOD). BOD at all stations still meets the quality standard for marine biota, but BOD at station C3 does not meet the marine tourism standard, with a value > 10 mg/L (Figure 7).
Third, we consider PC 3 (MBAS, phenol, and zinc). In Figure 8, MBAS at all stations meets the marine biota standard but not the marine tourism standard (MBAS > 0.001 mg/L). Phenol is > 0 mg/L at all stations, which means that phenol at all stations does not meet the standards for marine tourism and marine biota. Zinc is > 0.005 mg/L at all stations, which means that zinc at all stations does not meet the marine biota standard, whereas only 9 of the 23 stations fail the marine tourism standard.
Fourth, we consider PC 4 (turbidity) (Figure 9). Turbidity at stations B7, C4, and D3 is > 5 NTU, which means that turbidity at these three stations does not meet the marine tourism and marine biota standards.
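The comparison of each parameter with a quality standard amounts to a per-station threshold check, sketched below in Python. The 5 NTU turbidity limit is the value quoted in the text; the station labels S1 to S4 and their readings are placeholders invented for illustration and are not the BPLHD measurements, and the function name is hypothetical.

```python
def stations_exceeding(readings, limit):
    """Return the labels of stations whose reading is strictly above the limit."""
    return sorted(name for name, value in readings.items() if value > limit)


# Placeholder turbidity readings in NTU for hypothetical stations (NOT study data).
turbidity_ntu = {"S1": 2.3, "S2": 5.0, "S3": 6.4, "S4": 7.9}

# Turbidity threshold quoted in the text for marine tourism and marine biota.
TURBIDITY_LIMIT_NTU = 5.0

exceeding = stations_exceeding(turbidity_ntu, TURBIDITY_LIMIT_NTU)
print("Stations above the turbidity limit:", exceeding)
```

The same function can be reused for any parameter whose standard is an upper limit by passing the relevant readings and threshold; a reading exactly at the limit is treated here as compliant, which is an assumption of this sketch.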