Multidimensional analysis of excitonic spectra of monolayers of tungsten disulphide: toward computer-aided identification of structural and environmental perturbations of 2D materials

Despite 2D materials holding great promise for a broad range of applications, the proliferation of devices and their fulfillment of real-life demands are still far from being realized. Experimentally obtainable samples commonly experience a wide range of perturbations (ripples and wrinkles, point and line defects, grain boundaries, strain field, doping, water intercalation, oxidation, edge reconstructions) significantly deviating the properties from idealistic models. These perturbations, in general, can be entangled or occur in groups with each group forming a complex perturbation making the interpretations of observable physical properties and the disentanglement of simultaneously acting effects a highly non-trivial task even for an experienced researcher. Here we generalise statistical correlation analysis of excitonic spectra of monolayer WS2, acquired by hyperspectral absorption and photoluminescence imaging, to a multidimensional case, and examine multidimensional correlations via unsupervised machine learning algorithms. Using principal component analysis we are able to identify four dominant components that are correlated with tensile strain, disorder induced by adsorption or intercalation of environmental molecules, multi-layer regions and charge doping, respectively. This approach has the potential to determine the local environment of WS2 monolayers or other 2D materials from simple optical measurements, and paves the way toward advanced, machine-aided, characterization of monolayer matter.

Despite monolayers holding great promise for a broad range of applications, the research around 2D materials suggests that proliferation of the potential devices and their fulfillment of real-life demands are still far from realization. In contrast to theoretical descriptions of the physical properties of various 2D materials, experimentally obtainable samples commonly experience a wide range of perturbations significantly deviating the properties from idealistic models, and thereby affecting the performance of the devices. Amongst these perturbations are the presence of ripples and wrinkles [27][28][29], point and line defects [30,31], grain boundaries [32], strain fields [33], doping [34][35][36][37], water intercalation [38], oxidation [39][40][41], and edge reconstructions [42]. These perturbations, in general, can be entangled or occur in groups, forming complex perturbations. This, in turn, makes the interpretations of observable physical properties and the disentanglement of simultaneously acting effects a highly non-trivial task even for an experienced researcher, and advanced characterization methods are often desirable.
Due to the monolayer nature of 2D materials, their optical signatures are highly sensitive to fluctuations in structure and the local environment. This sensitivity results in non-trivial spatial variations, which are hard to analyze manually, and indicates that unsupervised machine learning algorithms applied to multimodal optical imaging data may aid attempts to identify the fluctuations in structure and the local environment distributed across monolayers.
Here we consider a semiconducting monolayer of tungsten disulphide (WS 2 ) grown via chemical vapour deposition (CVD) on a sapphire substrate. Optical properties of WS 2 monolayers are dominated by excitonic effects manifested as intense signatures in their absorption and emission spectra [43]. We apply absorption and photoluminescence (PL) hyperspectral imaging to gather data on the spatial variations of the excitonic properties, which arise from the various perturbations. The spectra are fully parameterised, leading to a multidimensional parametric phase-space (hypercube) where a single data point represents the set of values corresponding to all parameters at a given spatial location on the monolayer sample. This then allows us to apply principal component analysis (PCA) [44][45][46] to identify the parameters that vary together and ideally combine to quantify specific perturbations and how they vary across the monolayer flake. A projection of the multidimensional data-cloud onto a 2D plane with axes given by the two most significant principal components preserves the maximum variance in the data. By using unsupervised K-means clustering [47][48][49] of the data-points in this PCA-plane, regions of the sample with similar properties can be identified and provide further insight into how the perturbations combine and vary across the monolayer sample.

Results and discussion
Typical absorption and emission spectra of WS 2 monolayers are shown in figure 1(a). Absorption spectra are approximated here by differential reflectance [36,50,51] and feature two distinct peaks corresponding to spin-orbit split A-and B-exciton transitions occurring at K symmetry points in the first Brillouin zone [52]. Red-shifted PL emission is evident as an asymmetric peak formed as a result of recombination of excitons and trions [53]. Figures 1(b) and (c) show the spatially-resolved peak absorption amplitude and wavelength corresponding to the A-exciton transition, revealing trigonally-symmetric variations. Similar trends are observed in the spatial maps of PL emission (figures 1(d) and (e)): the absorption and emission are blue-shifted in the regions spanning from the center of the flake toward its apexes. This type of behaviour has been attributed previously to elevated n-doping levels in those areas [37,54], and conversely, red-shifted absorption and emission peaks in the adjacent regions have been shown to result from greater tensile strain [37]. While the absorption and emission wavelength maps are somewhat similar, differences reveal variations in Stokes shift, which is closely related to charge doping. More obvious differences are observed between the patterns formed by absorption peak amplitudes (figure 1(b)) and emission peak intensities ( figure 1(d)). The most significant difference is that the edges of the triangular monolayer flake can be clearly distinguished in the PL emission intensity map. Along the edges the PL intensity is enhanced as a result of combined effects of water intercalation progressing toward the interior over time [38] and oxidation [40,41]. Additionally, three bright spots near the center of the flake can be clearly distinguished in the absorption amplitude map. These bright features are believed to represent multilayer WS 2 material formed at the nucleation centers of the monolayer since larger reflectance contrasts have been observed for TMdC multilayers [55]. All these observed differences point to the complementarity of absorption and PL measurements allowing for observations of nonzero correlations between various parameters.
Statistical correlation analysis has been already proven to be a powerful tool in the studies of optoelectronic properties of 2D materials [33,36,[56][57][58][59][60]. For example, by correlating spectral shifts of prominent Raman peaks in graphene and graphene/TMdC heterostructures it can be possible to disentangle the effects of doping and strain [56,59]. Another route to solve a similar problem for TMdC monolayers used correlations involving the PL Stokes shift [37]. Correlation analyses also facilitated the recognition of physically distinct edges of triangular TMdC flakes as domains hosting large number of point defects [57,60]   and the effects of strain on optoelectronic properties of various TMdC monolayers, including direct/indirect nature of the bandgap [33]. All these results, however, were based on scatter plots between specifically chosen pairs of parameters missing out other possible correlations, and were not able to recognise the presence of any subtle variations in the data. Here, we generalise statistical correlation analysis to an N-dimensional case to acquire more insights into the optoelectronic variations commonly found in 2D materials.
To make the N-dimensional correlation analysis possible we fully parameterise absorption and emission spectra (figure 2(a)) and use each of the parameters to represent a dimension of an N-dimensional parametric space (figure 2(b)), with N = 17 in our case. The parameters used are shown in table 1, and given a label γ i , and their spatial distributions are shown in the supporting information (S1) (available online at stacks.iop.org/MLST/2/025021/mmedia). These parameters include all fit-free parameters we extracted and some basic fitting parameters: exciton and trion peak intensity and wavelength as well as the trion's charging energy, even though they are strongly correlated with such fitting-free quantities as PL peak intensity and wavelength, SM and ∆SM [37]. Despite the fact that several parameters measure similar quantities we did not exclude any at this stage, in case there were small differences that could be correlated with other changes and specific perturbations. This parameter space is represented by an N-dimensional hypercube (N-cube) encapsulating an N-dimensional data-cloud where each data-point ⃗ γ is described by a set of N values (coordinates), i.e. ⃗ γ = {γ 1 , γ 2 , . . . , γ N }. The parameters whose spatial variations are mapped in figures 1(b)-(e) correspond to γ 5 , γ 6 , γ 12 , γ 14 . A specific location (pink point in figures 1(b)-(e)) on the monolayer island can, therefore, be assigned a set of N = 17 numbers corresponding to the values of the 17 parameters chosen to describe the optical properties of the material.
The natural approach to visualization of the geometry of a multidimensional object (data-cloud) is to look at its projections onto 2D planes (figure 2(b)). Amongst infinite number of possible planes and projecting angles, a particular case of orthogonal projections onto the sides of the N-cube is the simplest to realise. It is this particular case that was considered in the previously reported correlation analyses of optoelectronic properties of 2D materials where certain physical trends and clusters have been identified Figure 2. (a) Schematic diagram representing absorption (shaded in light blue) and PL emission (shaded in pink) spectra with PL spectrum decomposed into exciton (shaded in yellow) and trion (shaded in green) contributions. All parameters (total 17) used in multidimensional analysis are labeled in pink: PL I (peak) = PL peak intensity; PL I (int.) = PL integrated intensity; SS = Stokes shift; CE = trion charging energy; FWHM = PL full width at half maximum; SM = PL spectral median; PL λ (peak) = PL peak wavelength; ∆SM = the difference between the SM and PL λ (peak); PL λ (X) = exciton emission peak wavelength; PL λ (T) = trion emission peak wavelength; PL I (X) = exciton emission peak intensity; PL I (T) = trion emission peak intensity; SO = effective spin-orbit splitting at K symmetry points; DR (A) = differential reflectance peak amplitude of A-exciton; DR (B) = differential reflectance peak amplitude of B-exciton; DR λ (A) = differential reflectance peak wavelength of A-exciton; DR λ (B) = differential reflectance peak wavelength of B-exciton. (b) Schematic diagram of a multidimensional data-cloud (blue object) within a multidimensional hypercube. Qualitatively different trends (shaded in grey) can be observed depending on the angle of view. Generic parameters γ1, γ2, γ3, . . . , γN, N = 17, form dimensions (axes) of the hypercube.  [33,36,37,56,57,59,60]. In some instances, oblique projections can provide greater separation of the data, and in some cases correspond to meaningful parameters. For example, charging energy is defined as the difference between the exciton and trion energy, and would correspond to an oblique projection, as detailed in supporting information (S5).
An example of an orthogonal projection is shown in figure 3(a). In this case, the data is distributed in a ring, which indicates that the full data-cloud is a torus-isomorphic object (other projections showing torus-shaped data-shadows are shown in supporting information, S7, for data density estimation). This topology can then also be related to the spatial variations of material properties assuming they vary smoothly, which is typically the case. Specifically, the shape of the data-cloud indicates that it is possible to define loops where each point on the loop has distinct spectral properties (figures 3(b) and (c)). These loops will be around a specific point or points on the material.
To help visualise this we note that around the data-cloud ring in figure 3(a), there appears to be four main clusters of data, as identified in figure 3(d). The boundaries are defined as the lines connecting those points of the data density contours that have high negative curvature [61,62] (see supporting information, S4, for more details on how these boundaries were identified). The data-points within each cluster are mapped back into their real-space location in figure 3(e), showing that the clusters are indeed related to specific regions on the sample. The points about which the circular paths can be identified, are the points where the four colours (corresponding to the four clusters) meet.
It is apparent then that by selecting specific orthogonal projections and subsequent clustering analysis, we can gain some insight into how the optical properties (and corresponding structural/environmental properties) vary across the sample. However, it is likely that there is further fine structure in these clusters, indeed, it is expected that many of the sample perturbations vary smoothly and continuously. These perturbations will affect the 17 different parameters in different ways, and in most cases will affect more than one parameter. To identify the parameters that vary together and maintain the maximum variance of the data, we apply the PCA [44][45][46]. By identifying the orthogonal principal components (i.e. specific linear combinations of the 17 parameters) that maintain the maximal variance, the ability to resolve fine structure and small variations in the data cloud is enhanced. Furthermore, in identifying the parameters that vary together, this approach has the potential to separate each specific sample perturbation (e.g. strain, doping, molecular adsorption, etc) and the specific changes to the optical properties induced by each of them. It may then become possible to map the structural and environmental properties across the sample.
The method for PCA has been described previously [44][45][46] and the details of the approach used here are included in supporting information (S6). The result is a new set of axes (the principal components) for the data hypercube. These principal components are defined such that the variance ∆ i , i = {1, …, 17} of the data along the principal components (PCi) is maximised for PC1 and decreases for each successive component (i.e. ∆ 1 ⩾ ∆ 2 ⩾ · · · ⩾ ∆ 17 ). The corollary of defining the principal components in this way is that it also identifies the measurement parameters that vary together, and distills the variance of the 17-dimensional data-cloud into a hypercube of a lower dimensionality. In the case of the hypercube defined by the parameters of absorption and emission spectra considered here, the PCA approach showed that it is possible to reduce the number of dimensions from 17 (defined by the parameters γ 1 , . . . , γ 17 ) down to four (defined by the parameters PC1, …, PC4) and still preserve as much as 87.9% of the total variance (see supporting information, S6, figure 6).
Each of the principal components are formed by a linear combination of the measurement parameters. The relative weights of the parameters for each PC are shown in the supporting information (S6), tables 2 and 3. Among other things, this shows that the parameters that measure similar quantities are grouped together, as would be expected. For example, the fitted PL peak intensity (γ 1 ), PL maximum intensity (γ 5 ), and integrated PL intensity make contributions to each of the first eight principal components that are within 10% of each other, as can be seen in figure S7 and table S2. For components PC9-PC17, it is not clear that there is any significance associated with them since they make up less than 1% of the total variance of the data, combined. Furthermore, the derived parameters, such as ∆SM also identify the expected relationship with their source parameters. For example, the weight of ∆SM (γ 9 ), which is calculated by the difference between PL peak wavelength (γ 6 ) and SM (γ 8 ), is close to zero for PC1 where the weightings for γ 6 and γ 8 are relatively large and essentially equal. In PC2, however, ∆SM has the largest weight, and the weights of γ 6 and γ 8 are significant, but of opposite sign to each other, meaning that when the contributions are added to form the PC2, the difference between PL peak wavelength and SM appears naturally. This gives further confidence that the PCA is performing reliably and giving meaningful results.
The amplitude of each principal component then varies across the sample and can be mapped spatially in the same way the measurement parameters were mapped in figures 1(b)-(e). Based on the make-up of each PC, their spatial variations across the WS 2 flake, and previously reported understanding of these materials, we are able to correlate each PC with a specific perturbation (or group of perturbations) of the sample structure and/or environment. Specifically, we link PC1 with variations in strain, PC2 with disorder induced by adsorption and/or intercalation of environmental molecules, PC3 with multilayers and PC4 with charge doping.
The attribution of PC1 to strain variation is based on the observation that the dominant contributions to this component are the PL intensity and wavelength parameters, as well as the absorption wavelength for the A-exciton, consistent with previous measurements reporting the effect of strain on optical properties [55,63,64]. In addition, the spatial variation of PC1 across the sample (figure 4(a)) is consistent with previous observations of how the strain varies from the apexes to the middle of the sides in CVD-grown WS 2 monolayers [37,65,66]. There are other perturbations that can also alter the PL intensities and wavelengths, however, these also affect other parameters that are absent from this principal component (e.g. doping also leads to substantial Stokes shift). PC2 ( figure 4(b)) is dominated by variations in the PL FWHM, ∆SM, and (c) 2D projection of 17-dimensional data-cloud onto the PCA-plane spanned by the parameters PC1 and PC2. Colored areas are the domains separated by the K-means clustering algorithm featuring four main clusters (yellow, green, red and purple) and 12 subclusters (shades of yellow, green, red and purple). Gray data-points correspond to multilayers identified by an anomaly detection method. A blue-colored cluster is centered around 0 in the PCA plane and corresponds to a 'boundary' between the four main domains. (d) Real-space: all 13 clusters (including multilayers) are mapped back onto the WS2 monolayer flake. Colorbar is labeled in accordance with the previously reported results. A question mark '?' in front of a label indicates an unconfirmed and tentative assignment. The size of the labels' font symbolically represents the weight of the corresponding perturbation. charging energy. Alone, ∆SM and charging energy can be associated with doping density, however, a clearer signature of doping is the Stokes shift [36,37,54], which does not make a significant contribution to this PC. Furthermore, this is a flake that has been exposed to air for some time and previous work has shown that where freshly grown flakes have large amounts of n-doping in the region of the apexes, the aged flakes adsorb environmental molecules, which reduce the density of free charges, and increase the FWHM [36,37,67,68]. The spatial variation of PC2 adds further weight to this assignment, as in addition to the expected variations in the bulk of the 2D flake, it also reveals the edges where water intercalation has occurred [37,38].
PC3 is dominated by the DR peak intensity for both A and B excitons and the spatial map shows the most significant variance occurs in small regions near the centre of the flake. This is consistent with previous measurements that have attributed significant increases in DR intensity to increased scattering from multi-layer regions on the sample [69,70]. The dominant contribution to PC4 is the PL Stokes shift, which is strongly linked with charge doping [36,37,54]. The relatively low level of variation in PC4 is consistent with previous observations of small variations of doping density on aged CVD-grown flakes. Interestingly, the other significant contribution to PC4 is the effective spin-orbit splitting, or in other words the energy difference between the A and B excitons in the DR measurements. It was previously speculated that these variations could be due to increased doping [37], but other possibilities could not be ruled out. This observation adds further weight to the case that it is due to variations in doping density.
The ability to self-consistently attribute specific sample perturbations to the four primary principal components that arise naturally from PCA of a single flake is potentially of great value. It points to the possibility of developing a set of well-defined principal components, consisting of linear combinations of spectroscopic parameters, that could be applied to determine the precise structural and environmental perturbations at a specific location in a 2D semiconductor. Further analysis of how these perturbations vary across the flakes could then provide insight into growth mechanisms and causes of the variations from pristine materials. We note, however, that to have a greater level of confidence in the make-up of the significant principal components and their relationship to physical perturbations, analysis of a larger dataset and materials prepared under different conditions with different combinations of perturbations is required. This will be the subject of future work.
One of the challenges that may be resolved by this type of PCA is the ability to distinguish different contributions, where multiple perturbations occur at the same place. To demonstrate this we project the data onto the plane spanned by the first two principal components (PC1 and PC2) (figure 4(c)), associated with strain and the adsorption of environmental molecules/intercalation of water. This projection shows the maximum spread of the data and reveals some clustering of data points. In order to acquire more insights into the projected data and correlate the coordinates in this PC1-PC2 plane with the spatial location on the WS 2 flake, we applied the K-means-clustering unsupervised learning algorithm [47][48][49]. This algorithm, for a given input number K, tries to classify the data-set into K labeled clusters (see supporting information, S9, for details). The value of K, however, cannot be automatically identified by the algorithm, and, therefore, cluster identification methods are commonly used. Here we used the so called 'elbow' method [71] as one of the most popular methods for identification of the natural number of clusters, if there are any (see supporting information, S8). As expected, in the case of the WS 2 monolayer considered here, the 'elbow' method revealed the presence of four prominent clusters in the data-cloud (see supporting information, S8) corresponding to the two heterogeneous interior domains and two heterogeneous edge domains within the flake, as identified above in figure 3. However, the method also revealed that there are other natural cluster sets (K = 2, K = 8 and K = 12) present in the data, although they are not as prominent as the set of four clusters (K = 4). With increasing the number of clusters K, additional 'shades' are introduced to the K = 4 cluster set (figure 4(c)). We note that we excluded multilayer regions (gray data-points in figure 4(c), from the K-means analysis by treating them as 'anomalies' (or 'outliers') (see supporting information, S7) [72]. We do this because the multilayer regions have vastly different optical properties from the rest of the sample, and if we were to plot the data as a function of the first three principal components, they would form their own cluster due to being separated along the PC3 axis, again due to their vastly different properties.
We then mapped the data-points in the PCA-plane within each of the clusters back onto the monolayer flake (figure 4(d)) revealing fine-structure of the four main regions of the sample mentioned above. These four clusters can be clearly grouped in pairs, with the purple and green regions mapping the edges affected by water intercalation, and the red and yellow regions mapping the interior of the flake. This roughly correlates with the higher values for PC2 around the edges, although it can also be seen that the value of PC2 increases when going from purple to green. A similar trend is seen when going from yellow to red in the interior. This indicates that while the main change in going from yellow to red and purple to green is along the PC1 axis, and due to reducing tensile strain, this trend is accompanied by an increase in disorder due to adsorbed molecules, and hence a shift along the PC2 axis.
We note also that water intercalation does not change appreciably the amount of strain along the edges, as evidenced by the consistent variation along PC1 for both the edges and interior. This suggests that on average the strain field vectors are aligned angularly around the center of the monolayer island so that the radially-propagating water intercalation does not release strain (see also supporting information, S10).

Conclusions
These results indicate that the PCA based on the spectral parameters from PL and DR hyperspectral imaging has the potential to disentangle and quantify different types of perturbations in monolayer materials. This approach is effectively an extension of specific 2D correlation plots that have been used to help understand the variations across a monolayer flake [33,36,37,56,57,59,60], and which were used here to reveal different regions of the monolayer flake with clearly different combinations of perturbations. The PCA, however, is a more systematic and quantitative approach.
The PCA applied to the data here produced four dominant, orthogonal principal components. By examining the combination of spectral parameters that makes up each of these, and the variation of these PCs across the flake, we were able to assign a specific sample perturbation to each: tensile strain, disorder induced by adsorption/intercalation of environmental molecules, multi-layers, and charge doping. These assignments, the spatial variations and spectral parameters contributing are fully consistent with previous measurements and understanding developed from similar flakes [37, 55, 63-66, 69, 70]. However, these assignments are not definitive and may not be able to reliably predict the specific perturbations on a different flake. It does, however, point to the possibility of using this approach for this purpose. To achieve this, a larger sample size including multiple flakes with different levels of the different perturbations is needed, and a refinement of the parameters may be necessary to remove the intensity/amplitude parameters, which depend on the measurement system. Subsequent steps could involve a large labeled dataset, and combinations of PCA and cluster analysis to train neural networks and enable real-time identification of spatially-varying perturbation. Regardless, the demonstration here that a self-consistent attribution can be made using PCA on a single flake indicates that this approach is promising, and may allow creation of a tool capable of identifying the perturbations at a given location in a given 2D material with optical transitions simply from PL and DR spectra.
For other 2D materials a similar unsupervised learning approach could also be used, however, the data preparation may need to be different. For example, semiconducting monolayers grown on a SiO 2 /Si substrate may lead to additional thin-film interference effects which have to be compensated [73]. Other 2D materials, such as graphene, may lack distinct absorption/emission peaks, in which case other parameters can be used for the described multidimensional analysis. For example, in the case of graphene the parameters derived from Raman spectra may be more useful [74]. In general, the reported analysis is not limited to PL/DR/Raman spectroscopy methods and can be applied to imaging data obtained from numerous other techniques such as nano-ARPES, CARS microscopy, SHG imaging, etc. The list of the possible imaging methods that have been already applied to 2D materials to yield new insights is given in the last section of supporting information (S11).

Sample preparation
The sample preparation was performed in a similar way as described in [75]. Briefly, monolayers of WS 2 were grown on sapphire substrate via CVD using WO 3 and sulphur precursors. WO 3 precursor was placed in the middle of the chamber (high-temperature zone, 860 • C) while the sulphur precursor was placed further upstream (low-temperature zone, 180 • C). The substrate was placed in a direct proximity to the WO 3 precursor. Heating the chamber and maintaining the hot environment lasted over ∼45 min, after which a cooling process was initiated.

Experimental realization
The PL and DR imaging setups were implemented in the same way as described in [37]. In a nutshell, in PL experiments, linearly polarised cw radiation (∼410 nm) was focused on the sample through a 100× objective lens (NA = 0.95). The detection scheme was implemented in an epi-fluorescence geometry in a confocal way with the detection spatial resolution reaching ∼300 nm. DR measurements were performed using a broadband (400-800 nm) incoherent white light radiation with the detection spatial resolution reaching ∼380 nm.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.