DIVIS: A Semantic Distance to Improve the Visualization of Incomplete Heterogeneous Phenotypic Datasets


Background: Thanks to the wider spread of high-throughput experimental techniques, biologists are accumulating large amounts of data which often mix quantitative and qualitative variables and are not always complete, in particular when they concern phenotypic traits. To get a first insight into these datasets and reduce the size of the data matrices, scientists often rely on multivariate analyses. However, such approaches are not always easily practicable, in particular when faced with mixed datasets containing missing values. Moreover, displaying large numbers of individuals leads to cluttered visualizations which are difficult to interpret.

Results: We introduce a new methodology to overcome these limits. The underlying principle consists in (i) grouping similar individuals, (ii) representing each group by emblematic individuals we call archetypes and (iii) building sparse visualizations based on these archetypes. As a preliminary step to the clustering, we design a new semantic distance tailored for both quantitative and qualitative variables which allows a realistic representation of the relationships between individuals. This semantic distance is based on ontologies engineered to represent real-life knowledge about the underlying variables. Our approach is implemented as a Python pipeline and illustrated on a rosebush dataset including passport and phenotypic data.

Conclusions: The introduction of our new semantic distance and of the archetype concept allows us to build a comprehensive representation of an incomplete dataset characterized by a large proportion of qualitative data. The methodology described here could have wider use beyond information characterizing organisms or species and beyond plant science: the same approach can be applied to any incomplete mixed dataset.


Background
The 2000s and the sequencing of complete genomes sparked a scientific revolution in the study of living beings. The now accessible no-a-priori approach resulted in the wider spread of high-throughput experimental techniques such as transcriptomics, proteomics, metabolomics or phenomics and in the increase of the volume of publicly available data. As a consequence, biologists are accumulating large datasets which are characterized by an increasing heterogeneity:
• Heterogeneity of information sources: multiple databanks, which can be local or distant, with various formats and interfaces, and files with various formats,
• Heterogeneity of the data themselves: various scales (from the molecule to the population), natures (quantitative and qualitative), modes (text or images) and structuring levels (database fields, structured text, free text).
Therefore the demand by biologists to integrate heterogeneous and large datasets from "omics" and phenotyping activities is rapidly expanding [1].
In this context where large complex datasets are becoming more and more widespread, biologists often rely on multivariate analyses to project individuals into a new coordinate space, both to get a first insight into the data and to obtain smaller matrices to process. However such approaches are not always easily practicable, in particular when faced with mixed (qualitative and quantitative) and incomplete (that is to say, including missing values) datasets. Moreover displaying large numbers of individuals leads to cluttered visualizations which are difficult to interpret.
In this paper we introduce a new methodology designed to overcome these limits. The approach relies on a new semantic distance which is designed for both quantitative and qualitative variables and allows for a realistic representation of the relationships between individuals. This semantic distance is based on ontologies which are engineered to represent real life knowledge regarding the underlying variables. We associate this new distance definition with an archetype concept to overcome the cluttered displays issue. Indeed we define archetypes as individuals representing groups of similar individuals from the dataset. Limiting the visualizations to these archetypes leads to a sparser representation which still provides valuable insight into the data.
More precisely the structuring of the population in groups is conducted through clustering, for which numerous approaches exist [2,3]. A common characteristic of clustering techniques is that they group individuals based on their similarity. This similarity is estimated based on distances between the features of the individuals.
However most clustering methods rely on numeric arithmetic, so the features have to be represented by numeric values. This causes problems with qualitative variables, and even more so with mixed datasets. Classical approaches consist in the discretization or dummy-coding of qualitative variables to transform them into numeric variables, but if the number of modalities differs greatly between variables, the weight of each variable in the resulting similarity between individuals can be unbalanced [4]. Some distances are designed to cope with qualitative data, for instance Jaccard's coefficient [5], Dice's coefficient [6], Gower's distance [7] or the Chi-square distance [8]. These metrics are widely used in biology, and in particular in ecology, to characterize species populations. For example Pandey et al. rely on Jaccard's coefficient to cluster sesame (Sesamum indicum L.) populations [9], Pavoine and colleagues extend Gower's distance to characterize periurban woodland plant species populations [10] and de Bello et al. propose a solution to overcome the disproportionate contribution of certain traits observed with Gower's distance [11]. For a review of current clustering approaches for heterogeneous data, see [12].
Even if methods exist to account for qualitative variables, the fact that their modalities do not always consist in a flat list of categories is often overlooked. Indeed, in many cases these categories are structured. For instance a variable corresponding to the months of the year can be considered as a circular variable, and proposals have been made to take this into account in distance calculations through an extension of Gower's distance [10]. But the modalities of many qualitative variables are structured according to more elaborate schemes, and qualitative variables can be described as ontologies [13]. Ontologies structure knowledge as graphs where nodes represent concepts or terms and edges the relationships between them. Ontologies are heavily developed and used in life sciences to annotate data, in particular in almost every major biological database, and to reason over domain knowledge [14].
In an ontology representing the modalities of a variable, modalities/values could be viewed as concepts, while the complex links between them would be materialized by the graph of relationships between concepts. We therefore propose to use the distance between concepts in corresponding ontologies to measure the distance between modalities of qualitative variables.
The measurement of distances in ontologies is a fundamental Semantic Web notion which is exploited for clustering, data mining or information retrieval [15]. Numerous formulas and algorithms [16,17] exist to define distances between concepts in ontologies, but most are based on two main approaches or a mix of the two:
• Edge-based approaches count the number of edges between two concepts in the ontology graph,
• Node-based approaches compare the properties of the concepts involved, be it the concepts themselves, their parents, or their children. They generally rely on the Information Content (IC) notion, which evaluates how specific and informative a concept is.
However these approaches rely on the topology of the graph, with no regard for the reality of what the concepts represent. This can lead to inaccurate measurements. For instance, a geographical ontology graph usually positions France, Italy and Denmark as three concepts that are part of Europe. A classical ontological distance calculation would lead to identical pairwise distances between these countries. This is obviously false from a geographical point of view: Italy is closer to France than it is to Denmark.
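The pitfall of purely topological measures can be reproduced with a minimal edge-counting sketch. The toy graph below is an illustration, not an actual geographical ontology:

```python
from collections import deque

# Toy geographic ontology: every country is a direct child of "Europe".
edges = {
    "Europe": ["France", "Italy", "Denmark"],
    "France": ["Europe"], "Italy": ["Europe"], "Denmark": ["Europe"],
}

def edge_distance(graph, a, b):
    """Shortest number of edges between two concepts (breadth-first search)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # no path between the two concepts

# All three pairwise distances are identical (2 edges through "Europe"),
# even though Italy is geographically much closer to France than to Denmark.
print(edge_distance(edges, "France", "Italy"))    # 2
print(edge_distance(edges, "France", "Denmark"))  # 2
```

This is exactly the behaviour the augmented distances introduced below are meant to correct.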
We therefore intend to augment the ontology graphs with a priori knowledge represented as distance values associated with the relations between concepts.
Moreover clustering and distance calculations usually cannot be performed as is on datasets including missing data. But data matrices in biology are often incomplete, for example because of the cost of some experimental techniques or because an individual hasn't been available for the full duration of the study. The traditional approach to coping with missing data is to exclude the affected individuals or to ignore the variables with too many missing values. But this reduces the amount of data available for analysis. In the case of matrices with a large quantity of holes it could make the study completely useless. Estimating missing values using imputation is often presented as a better approach. A review of available methods is performed in [18] in the epidemiology field. However Johnson and colleagues show that estimating missing data is not always appropriate and none of the methods they tested could deal effectively with severe biases [19] which can be common in trait datasets.
We therefore decide to define a distance which can be calculated even when the features of an individual are not all described, i.e. in the presence of missing data.
To reduce the number of individuals displayed we also propose to represent each cluster by a limited number of individuals we call archetypes. Different strategies can be considered to define these archetypes, depending on the clustering results:
• In the case of a large number of small clusters, representing each group by a single individual is probably more relevant. In such a case we can base the archetype definition on the cluster centroids.
• In the case of a small number of large clusters, a single individual might not be sufficient to represent the intra-cluster diversity. In these conditions it is better to select several individuals with one of the existing sampling techniques [20,21].
A visualization of the archetypes declutters the initial display of the population and can be associated with means to access the whole groups.
In this paper we develop the new approach we hinted at to overcome the listed limitations of classical methods to manipulate large datasets. We apply it to a rosebush dataset which includes passport data and a collection of qualitative and quantitative phenotypic traits.

Methods
Use case: rosebush phenotypic traits
To illustrate our study we rely on information associated with the rosebush collection of the RosePom Biological Resource Center (BRC) in Angers, France. The dataset consists of passport data and the phenotypic traits evaluated during the study of French roses (Rosa sp.) performed by Liorzou et al. [22]. It includes 1434 rosebushes from European garden roses of the 18th and 19th centuries. Each rosebush is described by the variables listed in Table 1. With the exception of the number of flowers, all these variables are qualitative, and their respective modalities have been defined by domain experts. Passport data come from [22] or are inferred from them. The horticulture group is defined according to the American Rose Society (ARS) classification. Breeding dates are grouped into time periods. Phenotypic variables have been evaluated by the RosePom BRC and its partners: breeders and rose gardens.
The dataset is far from complete: the "Quantity of prickles", "Perfume intensity", "Repeat flowering level" and "Number of flowers by volume" variables are only filled in for a small number of rosebushes. Some individuals include information for only one or two variables.
Such a dataset could be difficult to analyze with classical approaches and we choose it to test our method and see if we can provide better insight into the data. The dataset is therefore subjected to the pipeline presented in Figure 1. This pipeline is developed in Python 3.7. It relies on the NumPy [23] and pandas [24] libraries to manipulate the data, scikit-learn [25] to perform machine learning and matplotlib [26] to draw the figures. The next subsections detail it more precisely.

Build ontologies and capture the distance between concepts
General principle
The first stages of the process consist in associating each qualitative variable in the dataset with an ontology, which corresponds to steps (1) and (2) of the pipeline in Figure 1. The various modalities of a variable then become concepts in an ontology. Two cases can be considered:
• A publicly available ontology corresponding to the variable exists: we can use it, possibly adapting it to fit our variable modalities,
• No public reference ontology exists: we rely on expert knowledge to transform the list of modalities of the variable into the concept graph of an ontology.
For each ontology and each pair of concepts in the ontology we then have to define a pairwise distance, as indicated in step (3) of the pipeline in Figure 1. Here again two cases exist:
• A distance can be defined based on what the variable represents in real life: we use this distance and calculate it as needed,
• No simple distance can be calculated: we rely on expert distance estimations, which are stored alongside the ontology graph.
This information is then processed to build a distance matrix for each variable, as in step (4) of Figure 1.
Therefore the qualitative variables in our dataset are handled as follows.
Variables associated with public ontologies
Public ontologies exist in relation to colours and geographic information and can be used for our "Petal colour" and "Geographic origin" variables.
Regarding colours we extract their descriptions from DBpedia [27], using its SPARQL endpoint. These descriptions include reference coordinates in different colour spaces. We choose to use the L*a*b* colour space because it is designed to approach the perception of colours by human vision. In this space, L* represents perceptual lightness, a* the green-red opponent colours and b* the blue-yellow ones. We then use as distance the ∆E (CIE 2000), which quantifies the visual difference between two L*a*b* colours and is presented in Equation (1).
In order to evaluate this distance we use the implementation from the Python colormath library [28].
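The idea can be sketched without the colormath dependency. The pipeline uses colormath's ∆E (CIE 2000) implementation; the minimal stand-in below uses the simpler CIE76 formula (plain Euclidean distance in L*a*b* space), and the L*a*b* coordinates are illustrative, not taken from DBpedia:

```python
import math

def delta_e_cie76(lab1, lab2):
    """Euclidean distance in L*a*b* space (CIE 1976): a simpler stand-in
    for the Delta E (CIE 2000) formula used in the actual pipeline."""
    return math.sqrt(sum((c1 - c2) ** 2 for c1, c2 in zip(lab1, lab2)))

# Illustrative L*a*b* coordinates for three colours.
red    = (53.2, 80.1, 67.2)
orange = (74.9, 23.9, 78.9)
blue   = (32.3, 79.2, -107.9)

# Red is perceptually closer to orange than to blue.
assert delta_e_cie76(red, orange) < delta_e_cie76(red, blue)
```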
Regarding regions of origin we take advantage of the GeoPy library [29] and the Nominatim geocoder to access OpenStreetMap data [30] and associate the locations with coordinates. Some region names in our dataset do not exist in OpenStreetMap. It is for example the case for the subdivision of France into four main quadrants. In such cases we consider the list of named areas composing the region and use the mean latitude and mean longitude of these areas as a proxy for its location.
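The mean-coordinates proxy can be sketched as follows. The area names and coordinates are illustrative, and a haversine great-circle distance stands in for a live GeoPy/Nominatim lookup:

```python
import math

# Hypothetical named areas composing a region absent from OpenStreetMap,
# with illustrative (latitude, longitude) pairs.
areas = {"Pays de la Loire": (47.5, -0.8), "Bretagne": (48.2, -2.9),
         "Normandie": (49.1, 0.1), "Centre-Val de Loire": (47.4, 1.7)}

# Proxy location: mean latitude and mean longitude of the composing areas.
lat = sum(c[0] for c in areas.values()) / len(areas)
lon = sum(c[1] for c in areas.values()) / len(areas)

def haversine_km(p1, p2):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# Distance from the proxy location to an illustrative reference point (Paris).
print(round(haversine_km((lat, lon), (48.85, 2.35)), 1))
```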
Variables with no associated public ontology
For the other variables no existing ontology could be located. The structuring of the possible values in the form of a graph is carried out for each variable in collaboration with rosebush experts and stored in an ontology file in OWL format using the Protégé editor [31]. We then have to define a distance between pairs of concepts in each graph.
For the time periods we consider, such a distance can be calculated. If S1 and S2 are the start years of two periods and E1 and E2 their end years, we define the distance ∆t between the two periods as the number of years between the middles of the periods, as presented in Equation (2).
Among the time period modalities, two have just one date: "< 1700" and "> 1920". For the calculations we consider 1600 as the start year of the former and 2020 as the end year of the latter.
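The midpoint distance of Equation (2), including the bounds assumed above for the two open-ended modalities, can be sketched as follows. The intermediate period labels are illustrative, not the actual modalities of the dataset:

```python
# Period bounds; the open-ended modalities use the bounds assumed in the text
# ("< 1700" starts in 1600, "> 1920" ends in 2020).
periods = {
    "< 1700":      (1600, 1700),
    "1700 - 1780": (1700, 1780),   # illustrative intermediate periods
    "1780 - 1850": (1780, 1850),
    "1850 - 1920": (1850, 1920),
    "> 1920":      (1920, 2020),
}

def period_distance(p1, p2):
    """Delta-t between two periods: years between their midpoints (Eq. 2)."""
    (s1, e1), (s2, e2) = periods[p1], periods[p2]
    return abs((s1 + e1) / 2 - (s2 + e2) / 2)

print(period_distance("< 1700", "> 1920"))  # 320.0
```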
Distances between the other phenotypic variables are defined with the help of rosebush experts. The corresponding ontologies are usually organized as trees. Moreover the modalities of the variables in the original sets are often ordered. For instance the set of modalities of the "Quantity of prickles" variable ("low", "average", "high" and "very high") can be ordered from the lowest quantity to the highest. In the ontology these modalities are organized in two subgroups, as presented in Figure 2. Distances between pairs of leaf concepts are defined with arbitrary but not random values: we choose them so that inter-subgroup distances are higher than distances within a subgroup and so that the original order is conserved when relevant. The resulting distance matrix for this example is presented in Table 2. Having defined values for the pairwise distances between concepts, we then need to store them in the OWL ontology. To do so we introduce a has_distance relationship, that is to say an Object Property, and associate it with a distance Data Property of type owl:real. This distance is the Range of has_distance. This principle is illustrated in Figure 3.
If we consider the previous example, the distances between the "Low" concept and the other concepts of the "Quantity of prickles" ontology are represented as in Figure 4.
Build distance matrices for the ontologies
To build the distance matrices for the colour and region ontologies, each pair of concepts in the ontology file is processed and the distance is calculated according to the previously defined methods. The OWL file containing the other ontologies we engineered is read using the Python Owlready2 library [32], which allows us to retrieve the list of concepts of each ontology along with the pairwise distances. These are formatted as distance matrices stored in a global Microsoft Excel file.
The ranges of distance values differ widely from one variable to another. To prevent some variables from outweighing the others in subsequent calculations, each distance matrix is normalized to a [0, 100] scale.
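This normalization step can be sketched as follows, on an illustrative per-variable distance matrix:

```python
import numpy as np

# Illustrative raw distance matrix for one variable (symmetric, zero diagonal).
raw = np.array([[0., 2., 8.],
                [2., 0., 5.],
                [8., 5., 0.]])

# Rescale so that every variable contributes on the same [0, 100] scale.
normalized = raw / raw.max() * 100

print(normalized)
```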

Build individuals distance matrix
The following step in Figure 1, that is to say step (5), consists in building the individuals distance matrix. We first of all have to define how to calculate the pairwise distance. Each individual can be represented as a vector of variable values. If we consider two individuals represented by the A and B vectors, the values of the i-th variable can be represented as Ai and Bi respectively. The distance dA,B(i) between A and B for the i-th variable can be found in the corresponding distance matrix as the distance between Ai and Bi. The distance D(A, B) between A and B can therefore be expressed as in Equation (3),
where N is the total number of variables in the vectors and M the number of variables for which a distance can be defined. Indeed our dataset contains missing data: if either Ai or Bi (or both) is missing, then dA,B(i) is missing too. An example for a subset of variables is presented in Figure 5.
We then repeat the operation for all pairs of individuals to build the final distance matrix, in which we store for each pair of individuals both the distance D(A, B) and M, the number of variables used in the calculation.
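A minimal sketch of this pairwise calculation follows. Since Equation (3) is not reproduced here, we assume one plausible reading of it, suggested by the definitions of N and M: the per-variable distances are averaged over the M variables that can actually be compared. The per-variable matrices and modalities are illustrative:

```python
import numpy as np

def semantic_distance(a, b, matrices):
    """Distance between two individuals, skipping variables where either
    value is missing (None). Assumes the per-variable distances are
    averaged over the M comparable variables; returns (distance, M)."""
    dists = []
    for ai, bi, dm in zip(a, b, matrices):
        if ai is not None and bi is not None:
            dists.append(dm[ai][bi])
    m = len(dists)
    return (sum(dists) / m if m else np.nan), m

# Toy per-variable distance matrices on a normalized [0, 100] scale.
dm1 = {"low": {"low": 0, "high": 100}, "high": {"low": 100, "high": 0}}
dm2 = {"pink": {"pink": 0, "red": 30}, "red": {"pink": 30, "red": 0}}

A = ("low", "pink")
B = ("high", None)          # second variable missing for individual B

print(semantic_distance(A, B, [dm1, dm2]))  # (100.0, 1)
```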
In order to have a baseline for comparison, we also calculate a distance matrix based on Gower's distance [7]. There is no official implementation of this distance in standard Python libraries, and we also have to adapt it to handle missing values the same way as our semantic distance. We therefore develop our own implementation based on the Dice and Manhattan distances as found in the scikit-learn library. The algorithm goes through all variables in the dataset and builds a distance matrix for each variable. If the variable is quantitative, it uses the Manhattan distance. If the variable is qualitative, it converts it into binary indicator variables, including one for missing values, calculates the Dice distance on the new dummy variables and marks as missing the pairwise distances which involved the missing-value indicator. The final distance matrix is the element-wise average of all the per-variable distance matrices.
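A compact sketch of this adapted Gower computation follows. It is not the scikit-learn-based implementation itself: the 0/1 mismatch used for qualitative columns is equivalent to the Dice distance on one-hot dummies, quantitative columns use a range-scaled Manhattan distance, and the data are illustrative:

```python
import numpy as np
import pandas as pd

def gower_like(df, quantitative):
    """Adapted Gower distance sketch: one matrix per variable (range-scaled
    Manhattan for quantitative columns, 0/1 mismatch for qualitative ones),
    pairwise distances involving a missing value are masked out, and the
    result is the element-wise average over the available variables."""
    n = len(df)
    stack = []
    for col in df.columns:
        v = df[col].to_numpy()
        d = np.full((n, n), np.nan)
        for i in range(n):
            for j in range(n):
                if pd.isna(v[i]) or pd.isna(v[j]):
                    continue  # marked missing, excluded from the average
                if col in quantitative:
                    d[i, j] = abs(v[i] - v[j])
                else:
                    d[i, j] = 0.0 if v[i] == v[j] else 1.0
        if col in quantitative:
            rng = np.nanmax(d)
            if rng:
                d /= rng  # scale to [0, 1], comparable with 0/1 mismatches
        stack.append(d)
    return np.nanmean(np.stack(stack), axis=0)

df = pd.DataFrame({"flowers": [3.0, 9.0, np.nan],
                   "colour": ["pink", "red", "pink"]})
D = gower_like(df, quantitative={"flowers"})
print(np.round(D, 2))
```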

Projection in coordinates space and clustering
The next stage of the process would be to group similar individuals based on the distance matrix. Since different clustering algorithms can produce different results depending on the structure of the population to cluster we choose to test several algorithms. However not all clustering algorithms can use a distance matrix as input.
Therefore we perform a Multi-Dimensional Scaling (MDS) [33] to project the distance matrix into a coordinate space and use the projection as input for all clustering algorithms, as indicated in step (6) of the Figure 1 pipeline. The distance matrix is subjected to a metric MDS as implemented in the scikit-learn Python library [25]. The MDS function provides a STRESS value which quantifies the quality of the representation. This indicator is normalized to obtain the "Kruskal stress" (stress1), defined in Equation (4), where δij corresponds to the observed distance between a pair of individuals (i, j) supplied as input to the multidimensional scaling algorithm and δ̂ij is the reconstructed distance in the Euclidean space representing the data.
stress1 is a widely used indicator in the literature [33] and thresholds exist to guide the selection of the number of dimensions to keep in the new space to have a sufficiently good representation.
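Assuming the standard Kruskal formulation for Equation (4), stress1 = sqrt(Σ(δij - δ̂ij)² / Σδij²), the indicator can be computed directly from the observed and reconstructed distance matrices:

```python
import numpy as np

def kruskal_stress1(delta, delta_hat):
    """Kruskal stress1 = sqrt( sum (d_ij - d_hat_ij)^2 / sum d_ij^2 ),
    computed over the upper triangle of the two distance matrices."""
    iu = np.triu_indices_from(delta, k=1)
    num = np.sum((delta[iu] - delta_hat[iu]) ** 2)
    den = np.sum(delta[iu] ** 2)
    return np.sqrt(num / den)

# Illustrative observed distances; a perfect reconstruction gives stress1 = 0.
delta = np.array([[0., 2., 4.],
                  [2., 0., 3.],
                  [4., 3., 0.]])
perfect = kruskal_stress1(delta, delta)
print(perfect)  # 0.0
```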
Regarding the clustering per se, that is to say step (7) of the pipeline in Figure 1, we rely on the scikit-learn implementations of the following algorithms: KMeans, KMedoids, Hierarchical Clustering, Birch, Spectral Clustering and Gaussian Mixture [39]. The objective is to compare the results between the various algorithms. Most of these algorithms require the number of clusters as a parameter. To assist in this choice we perform a Silhouette analysis [40] using the scikit-learn implementation of the Silhouette coefficient.
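The Silhouette coefficient can be illustrated with a minimal pure-NumPy re-implementation following the same definition as the scikit-learn function used in the pipeline (the data and labels below are illustrative):

```python
import numpy as np

def silhouette_samples(X, labels):
    """Silhouette coefficient per sample: s = (b - a) / max(a, b), where a
    is the mean distance to the sample's own cluster and b is the smallest
    mean distance to any other cluster."""
    # Full Euclidean pairwise distance matrix.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    s = np.zeros(len(X))
    for i, li in enumerate(labels):
        same = (labels == li)
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette defined as 0
        a = D[i, same].mean()
        b = min(D[i, labels == lj].mean() for lj in set(labels) if lj != li)
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated toy clusters -> mean silhouette close to 1.
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
labels = np.array([0, 0, 1, 1])
print(silhouette_samples(X, labels).mean())
```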
To compare the results of the various clustering algorithms we divert classification evaluation approaches from their usual purpose. These evaluation techniques usually compare a prediction with a ground truth; we do not have a ground truth but the results of several algorithms. We therefore consider the KMeans clustering results as the "ground truth" and the result of each of the other algorithms as predictions. We then compute concordance matrices using pandas and confusion matrices using scikit-learn.
We use the same approach to compare the clusters between the two distances.
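A concordance matrix of this kind can be obtained with a simple cross-tabulation; the cluster labels below are hypothetical:

```python
import pandas as pd

# Hypothetical cluster labels from two algorithms on the same individuals.
kmeans_labels = pd.Series([0, 0, 1, 1, 2, 2, 2])   # treated as "ground truth"
other_labels  = pd.Series([1, 1, 0, 0, 2, 2, 0])

# Concordance matrix: rows = KMeans clusters, columns = the other algorithm.
# Each cell counts the individuals assigned to that pair of clusters.
concordance = pd.crosstab(kmeans_labels, other_labels)
print(concordance)
```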

Archetypes definition and visualization
The next stage (8) of the process described in Figure 1 consists in representing each group by a small number of individuals. As previously stated, two main strategies can be considered:
• in the case of a large number of small clusters, we represent each group by a single individual,
• in the case of a small number of large clusters, we represent each group by several individuals obtained through sampling.
To define the single archetype, we identify the cluster centroids as implemented by the scikit-learn library and calculate the Euclidean distance between each individual and its cluster centroid. The single archetype is the individual closest to the centroid. To choose several archetypes per cluster, we perform a random sampling of 5% of the population of each cluster, using the pandas library.
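The single-archetype selection can be sketched in a few lines of NumPy; the cluster coordinates below are illustrative:

```python
import numpy as np

def single_archetype(points, centroid):
    """Index of the individual closest (Euclidean) to the cluster centroid."""
    return int(np.argmin(np.linalg.norm(points - centroid, axis=1)))

# Illustrative cluster in a 4-dimensional MDS space.
cluster = np.array([[0.1, 0.2, 0.0, 0.3],
                    [0.9, 0.8, 0.5, 0.7],
                    [0.4, 0.4, 0.2, 0.4]])
centroid = cluster.mean(axis=0)
print(single_archetype(cluster, centroid))  # 2
```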
The result visualizations are constructed using the seaborn library [41]. Given that the number of dimensions may be greater than three, we choose to draw pairplots. Since pairplots are symmetrical along the diagonal, we use each half to present different elements: the original scatter plots in the bottom-left part, and the archetypes associated with kernel density estimations, which give a rough estimate of the group envelopes, in the top-right section.

Results
Classical analysis: MCA
Before subjecting the dataset to the pipeline presented here, we explore it with a classical analysis. Given the types of the variables in the dataset, a Multiple Correspondence Analysis (MCA) [42] seems appropriate. However we have to deal with the missing data. Three approaches are usually used to do so:
• Delete the rows with missing values. In our case, four variables (out of 11) have a very high percentage of missing values (98.9%, 98.0%, 94.6% and 84.8%). We remove these variables before filtering the missing values by row; 405 individuals (out of 1434) contain no missing value. This reduces the number of variables by about a third and the population by more than two-thirds.
• Consider the missing data as a particular category. This is not relevant here, since the missing values certainly do not all belong to the same category.
• Use multiple imputation methods [43]. Relying on such an approach is not appropriate in our context since the rosebush varieties are by definition different: similarity in some features does not imply similarity between varieties.
We apply MCA with the prince Python package [44], beginning with the first method (removing all the missing values). We present in Table 3 the percentage of inertia explained by the first six components. With the first two components the total inertia is 7.5%, and it reaches 19.2% with six components. The projection in the MCA space is therefore very poor. Another point that requires care in MCA is the presence of rare categories (categories of small size), which can affect the results since their associated inertia will be high. Several solutions can be considered to remedy this; in particular, we can group categories when natural groupings exist. By grouping categories and removing the missing values, we increase the inertia carried by the axes of the MCA: the percentage reaches 12.2% for two components and 30.8% with six. However, for some variables, combining into bigger categories is not relevant. For instance, grouping underrepresented flower colours leads to treating very different colours together while distinguishing similar ones.
In the end, this attempt to use an MCA approach is not conclusive for our objective and for our type of data.

Distance matrices
Following the described pipeline for our dataset we first of all produce distances matrices between individuals using both distances: Gower's and our own semantic one. These matrices are displayed as heatmaps in Figures 6 and 7.
The matrix is sparser in the Gower case and a larger proportion of values are closer to the maximum. This can be explained by the way the two distances are constructed. The distance between individuals is based on a majority of qualitative variables and a single quantitative variable, and this quantitative variable is associated with a very limited amount of data. In the Gower case we mainly represent the proportion of variables whose values differ between individuals. Indeed the qualitative variables are somewhat "interchangeable", given that the distance between two modalities is binary. The values in the distance matrix are often above 0.5 because our rosebushes don't share a large proportion of values and each variable with differing values necessarily contributes a distance of 1. In the semantic case, the possible distance values between modalities differ from 1 and depend on the variable. The range of possible distances between individuals is therefore quite large but with a smaller maximum. The frequency of values in the two cases, presented in Figure 8, illustrates this.
Looking at the heatmaps from Figures 6 and 7 both distances seem to structure the population in 3 or 4 groups but the interpretation is less clear between the two larger groups in the semantic distance case.

Multi Dimensional Scaling
In order to choose the appropriate number of dimensions for the MDS we plot the value of stress1 from Equation (4) for an increasing number of axes, as presented in Figure 9. The stress1 values are similar for both distances. It is generally accepted that a stress1 value below 0.2 corresponds to a good representation of the distance matrix in a coordinate space [33]. We thus choose 4 dimensions for the MDS.
Looking at the individuals' coordinates in the new space, it appears that the data points are more spread out for the semantic distance than for Gower's distance. This is related to the wider range of distances between individuals.

Number of clusters choice
For both distances, and as a preliminary step to the clustering process, we perform a Silhouette analysis using two strategies:
• Plot the mean Silhouette coefficient as a function of the number of groups for three clustering algorithms: KMeans, KMedoids and Hierarchical Clustering,
• Perform a Silhouette analysis at the individual level for several numbers of clusters for the KMeans algorithm.
The mean Silhouette graphs for the KMeans algorithm are presented in Figure 10. For the semantic distance the number of clusters which maximizes the Silhouette coefficient is 6. Profiles are similar for the KMedoids and Hierarchical Clustering algorithms and suggest 5 or 6 clusters. A more precise rendering of Silhouette values at the individual level is presented for the KMeans algorithm and 5, 6 and 7 clusters in Figure 11 [see Additional file 1 for renderings for 2 to 19 clusters]. This figure confirms that the clustering quality from a Silhouette point of view is similar, and we choose to perform the next steps for 5, 6 and 7 clusters. For Gower's distance the situation is very different, since it seems that the more clusters there are, the better the representation [see Additional file 2 for Silhouette coefficient renderings for 2 to 19 clusters]. We nevertheless choose to perform the next steps with 5 to 7 clusters so that the results can be compared with those of the semantic distance.

Cluster analysis
As part of the analysis of the clustering results we want to:
• Compare the results of the different clustering algorithms for each distance (Gower's and semantic). To do so we consider the KMeans results as ground truth and compare each of the other algorithms with it,
• Compare the results of our new semantic distance with Gower's for all algorithms. Here we consider the results of the semantic distance as ground truth.
For both sets of comparisons and for 5, 6 and 7 clusters we calculate:
• Confusion matrices, which allow us to determine the number of rosebushes classified the same way or differently between two methods,
• Concordance tables, which provide a mapping between the clusters built according to the two methods.
As an illustration of the comparison between algorithms, Figure 12 presents the concordance matrices for the semantic distance [for the Gower distance see Additional file 3]. In each heatmap, rows correspond to the KMeans clusters and columns to the clusters of the other algorithm. The other algorithms are Birch, Hierarchical Clustering and Gaussian Mixture for the top three heatmaps, and KMedoids and Spectral Clustering for the two bottom ones.
Looking at the confusion and concordance matrices for both distances and all three numbers of clusters, it appears that:
• the Hierarchical Clustering and Birch clusters are very close to the KMeans clusters,
• the KMeans and KMedoids results are very close, except for KMeans cluster 1 which is split in two by the KMedoids algorithm,
• the Spectral Clustering and Gaussian Mixture results differ more, and the results of these two algorithms are not similar to each other.
We thus have a good concordance between three of the algorithms, a fourth one with a single major difference, and two algorithms whose clusters are mixed up. Moreover the concordance is better for the Gower distance for all algorithms except Gaussian Mixture, in particular for 6 clusters. We therefore decide to focus on the results of the KMeans algorithm with 6 clusters for the following analyses.
Regarding the comparison between the two distances, the concordance matrix is presented as Figure 13 for the KMeans algorithm and 6 clusters. Group 3 is the only one which is almost identical between the two distances. Groups 0 and 2 from the Gower distance are spread among the various semantic groups, with the largest subgroups in semantic groups 5 and 4 respectively. Almost all individuals from Gower group 1 are in semantic group 1, but the semantic distance associates them with individuals from Gower groups 4 and 5. Group sizes are more balanced with the Gower distance. We can suppose this is once again linked to the fact that the data points are more spread out in the semantic case.
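A concordance table of the kind used above is simply a cross-tabulation of two cluster label assignments over the same individuals. A minimal sketch (a hypothetical helper, not the pipeline's actual implementation):

```python
from collections import Counter

def concordance_table(labels_a, labels_b):
    """Cross-tabulate two clusterings of the same individuals.

    Entry (i, j) counts how many individuals fall in cluster i of the
    first clustering and cluster j of the second one.
    """
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    return [[counts[(r, c)] for c in cols] for r in rows]

# Toy example: two clusterings that agree on all but one individual.
kmeans = [0, 0, 1, 1, 2]
birch = [0, 0, 1, 2, 2]
table = concordance_table(kmeans, birch)
# table[0] == [2, 0, 0]: both methods agree on the first cluster.
```

A strong diagonal (up to a relabelling of the clusters) indicates good concordance between the two methods, which is what the heatmaps in Figure 12 visualize.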

Archetypes and visualisations
We calculate the archetype positions for both distances and both approaches: one or several archetypes per group. The resulting visualizations are presented for the KMeans algorithm, 6 clusters and the semantic distance in Figure 14 for the single-archetype approach and Figure 15 for the multiple-archetype approach. Equivalent figures for Gower's distance are provided as Additional file 4 and Additional file 5.
The projection according to the 4 axes of the MDS (half-matrix at the bottom left) shows a structuring of the 6 groups along plane (2,3), which separates the pink / light blue groups from the four other groups, and along axis 4, which separates the three light blue / yellow / red groups from the others. The planes which provide the best representation of the six groups are (2,3) and (3,4). The complex structure of the point cloud is captured by the KMeans clustering in 6 groups. We can also see that Gower's distance does not provide such a clear structure between the groups.
Comparing our representations (top right corner of the pairplot) with the whole dataset scatter plot (bottom left corner of the pairplot), it appears our representation provides a good overview of the dispersion of the population and of its structuring into groups. Both approaches (single or multiple archetypes per group) seem relevant and the choice between them is a matter of user preference and of the number of clusters: the more clusters there are, the fewer archetypes per cluster are required to provide a good overview of the dataset.
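In the single-archetype case, one natural choice of representative is the medoid: within each cluster, the individual minimizing the total distance to the other members. The sketch below illustrates this idea under the assumption of a precomputed pairwise distance matrix; it is not necessarily the exact selection rule used in the pipeline.

```python
def single_archetypes(dist, labels):
    """For each cluster, return the index of the individual whose total
    distance to the other cluster members is smallest (the medoid).

    dist   -- full pairwise distance matrix (list of lists)
    labels -- cluster label of each individual
    """
    archetypes = {}
    for cluster in set(labels):
        members = [i for i, l in enumerate(labels) if l == cluster]
        archetypes[cluster] = min(
            members,
            key=lambda i: sum(dist[i][j] for j in members),
        )
    return archetypes

# Toy distance matrix for 4 individuals forming 2 clusters.
dist = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 5.0, 6.0],
    [5.0, 5.0, 0.0, 2.0],
    [6.0, 6.0, 2.0, 0.0],
]
labels = [0, 0, 1, 1]
reps = single_archetypes(dist, labels)  # one representative per cluster
```

Because it only needs pairwise distances, this choice works directly on the semantic distance matrix, before or after the MDS projection.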

Discussion
The introduction of our new semantic distance and of the archetype concept allowed us to build a comprehensive representation of an incomplete dataset characterized by a large proportion of qualitative data. This can be useful from several perspectives.
Incomplete datasets including mixed (quantitative and qualitative) data are becoming more and more common in life sciences. Classical statistical approaches such as MCA present limits when it comes to providing a first insight into such data: it is often necessary to drop part of the original information to build complete matrices. Data imputation is a solution to the problem, but it also has drawbacks: imputed values remain probabilistic and might not represent reality. The approach we developed allows us to overcome these problems. Indeed, as long as a pairwise distance can be calculated for all pairs of individuals, they can all be taken into account in the subsequent MDS projection, clustering, archetype definition and visualization.
Regarding distances, we introduced a semantic distance as an alternative to distances tailored for mixed data such as Gower's. This semantic distance (Equation (3)) allows us to account for the underlying meaning of qualitative variables. It can be attached to real-life measures such as geographical distances or associated with specific calculations such as the distances between time periods we defined in Equation (2). It can also be based on expert knowledge regarding both the structuring of the modalities of the variable as the concept graph of an ontology and the distance values between two concepts. This semantic distance brings more precision regarding how two individuals relate to each other compared to Gower's, which is more binary. This allows a wider range of possible distance values in the dataset, and as a consequence a more realistic spread of the data points in the MDS coordinates space. Moreover this semantic distance is defined as a weighted sum. It therefore allows us to give more or less importance to some variables compared to others, thus granting the ability to fine-tune the way each facet of the dataset is managed.
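The weighted-sum form of such a distance can be sketched as follows. This is an illustrative reimplementation under assumed conventions (range-normalized quantitative differences, ontology-derived lookup tables for qualitative modalities, missing values skipped), not a verbatim transcription of the paper's Equation (3):

```python
def semantic_distance(a, b, weights, kinds, ranges=None, lookups=None):
    """Weighted average distance between two individuals a and b.

    a, b    -- dicts mapping variable name to value (None if missing)
    weights -- per-variable weights
    kinds   -- 'quant' or 'qual' for each variable
    ranges  -- observed range of each quantitative variable
    lookups -- per qualitative variable, a dict
               {(modality, modality): distance in [0, 1]}
               derived from the ontology
    Variables missing in either individual are skipped; the result is
    the weighted mean over the remaining ones.
    """
    num, den = 0.0, 0.0
    for var, w in weights.items():
        va, vb = a.get(var), b.get(var)
        if va is None or vb is None:
            continue  # skip missing data instead of imputing
        if kinds[var] == "quant":
            d = abs(va - vb) / ranges[var]
        elif va == vb:
            d = 0.0
        else:
            key = (va, vb) if (va, vb) in lookups[var] else (vb, va)
            d = lookups[var][key]
        num += w * d
        den += w
    return num / den if den else None

a = {"height": 50.0, "colour": "pink"}
b = {"height": 70.0, "colour": "red"}
d = semantic_distance(
    a, b,
    weights={"height": 1.0, "colour": 1.0},
    kinds={"height": "quant", "colour": "qual"},
    ranges={"height": 100.0},
    lookups={"colour": {("pink", "red"): 0.2}},
)
# d == (0.2 + 0.2) / 2 == 0.2
```

The variable names and the 0-to-1 scaling are hypothetical; the key point is that ontology-derived lookup values replace Gower's all-or-nothing treatment of qualitative mismatches.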
Relying on ad hoc distances in concept graphs for some variables, we had to capture this information in an ontology format. We did so in OWL by defining a has distance relationship. This relationship links two concepts and stores the distance value within a distance property attached to it. Giving a numeric value to the distance between two concepts is difficult for domain experts, but such an approach also presents advantages: these distances are data and can easily be changed, which once again brings flexibility to the way we process the datasets.
Moreover we tackled the problem of cluttered scatter plots by reducing the number of displayed individuals.
From an application point of view, we illustrated our approach with passport and phenotypic traits associated with a collection of rosebushes held by a BRC. But it could be used for any dataset describing a large set of organisms, for instance in ecology, and including other types of data, for instance genomic. More widely it could be used for any incomplete dataset mixing qualitative and quantitative variables. A problem which presents similar premises (reducing the number of individuals representing a population) is the constitution of core collections by BRCs. Indeed BRCs store large collections of biological material and associated information, and they often need to constitute sub-samples of a more manageable size, e.g. for experimental purposes. These core collections include, with a minimum of repeatability, the maximum diversity of the species in question [45] and are designed by exploiting the maximum amount of data available: origin of the samples, genetic and phenotypic characteristics, etc. The existing strategies for the selection of inputs are diverse: random strategy, partitioning (also called "stratification"), maximization, and some other so-called "hybrid" strategies [46]. The methodology presented here could add a new tool to the arsenal of BRCs.
However, even if it is functional, the methodology presents some limits. First of all, the method relies heavily on ontologies which have to be defined. Efforts towards reference characterization of individuals in the plant sciences domain exist, for instance the MIAPPE (Minimum Information About a Plant Phenotyping Experiment) [47] minimum requirements or the ontologies of the Planteome (https://planteome.org) databank [48], in particular the Plant Trait Ontology. Sharing more reference ontologies would reduce the knowledge engineering burden. Moreover we specifically defined the distances between concepts in our ontologies. This again may not be practical, depending on the number of concepts. Indeed these concepts correspond to modalities of qualitative variables. Some classes may be associated with measures, such as our time periods, colours and geographic locations, but this is not always the case. In this situation distances have to be defined artificially; not only is the process time-consuming, but it might bias the results. Methods to better anchor the distances with quantifiable information therefore have to be designed.
Secondly, the management of missing data could be further refined. The distance between individuals defined in Equation (3) allows us to calculate a distance even between individuals having missing data. However, depending on the number of variables where two individuals share values, the pairwise distance can be calculated based on different numbers of variables. For instance in our rosebush example we have distances calculated from 1 to 9 variables out of 11 potential variables for individuals with a complete record. In this context we might want to consider that a distance calculated from more variables is more accurate than one calculated from fewer. One approach to represent this accuracy might be to represent the distance not as a number but as an interval or a fuzzy number. Another approach would be to associate an error with the distance. We would then have to perform the next stages of the process (MDS, clustering, archetype definition and visualization) based either on fuzzy data or error-prone data. Methods are described in the literature, for instance for fuzzy MDS [49] or fuzzy clustering [50]. We however have to study the topic more thoroughly and find implementations of the described approaches or develop our own.
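One simple way to expose this varying support is to return, alongside each pairwise distance, the number of variables it was actually computed from, and to widen the distance into an interval when that number is low. The sketch below is a hypothetical illustration of the interval idea, not part of the described pipeline; the `spread` parameter and the plain mean-absolute-difference distance are assumptions for the example.

```python
def distance_with_support(a, b, weights):
    """Weighted mean absolute difference over shared variables, plus
    the number of variables actually used (the 'support')."""
    shared = [v for v in weights
              if a.get(v) is not None and b.get(v) is not None]
    if not shared:
        return None, 0
    d = (sum(weights[v] * abs(a[v] - b[v]) for v in shared)
         / sum(weights[v] for v in shared))
    return d, len(shared)

def as_interval(d, support, n_total, spread=0.1):
    """Widen the distance into an interval: the fewer variables used,
    the wider (i.e. less certain) the interval."""
    half = spread * (n_total - support) / n_total
    return max(0.0, d - half), d + half

a = {"x": 1.0, "y": 2.0, "z": None}
b = {"x": 2.0, "y": 2.0, "z": 5.0}
d, support = distance_with_support(a, b, {"x": 1.0, "y": 1.0, "z": 1.0})
# d computed from support == 2 of the 3 variables
lo, hi = as_interval(d, support, n_total=3)
```

A distance computed from all variables would collapse to a degenerate interval [d, d], while one computed from a single variable would get the widest interval; downstream fuzzy MDS or fuzzy clustering methods could then consume these intervals.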
Thirdly, the approaches we used to build the archetypes representing the clusters may not be the most relevant. We might want to better link the construction of these individuals with the values of the variables in the original dataset. A better archetype might indeed be an "artificial" one whose variable values are the most represented in its cluster. Defining an archetype this way however introduces new problems. It would have to be projected into the new coordinates space created by the MDS so that it could be represented in the visualizations. This is not a trivial task given that we can only calculate distances between individuals. The topic would have to be explored further.
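Such an "artificial" archetype could, for instance, be built by taking the most frequent non-missing modality of each qualitative variable within the cluster. A hypothetical sketch of that idea (the variable names are invented for the example):

```python
from collections import Counter

def modal_archetype(individuals, qual_vars):
    """Build an artificial archetype for one cluster: for each
    qualitative variable keep the most frequent non-missing modality."""
    archetype = {}
    for var in qual_vars:
        values = [ind[var] for ind in individuals
                  if ind.get(var) is not None]
        if values:
            archetype[var] = Counter(values).most_common(1)[0][0]
    return archetype

cluster = [
    {"colour": "pink", "scent": "strong"},
    {"colour": "pink", "scent": None},
    {"colour": "red", "scent": "strong"},
]
rep = modal_archetype(cluster, ["colour", "scent"])
# rep == {"colour": "pink", "scent": "strong"}
```

The difficulty mentioned above remains: this artificial individual has no MDS coordinates of its own and would have to be placed in the projection from its distances to the real individuals.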
Fourthly, the visualizations we produced are static pairplots. A big improvement would be to render them dynamically and make the visualization interactive. A graphic interface to choose which display to render (which distance, which clustering algorithm, how many clusters, etc.) would be most welcome. We could imagine allowing the user to rotate and zoom in and out of the display. Tooltips associated with the archetypes could provide information regarding the cluster they represent, such as the number of individuals, its main characteristics regarding the original variables, etc. Clicking on an archetype could change the display and lead to a visualization of the individuals composing the corresponding cluster (or a larger subset of it). The pipeline presented here was developed as a proof of concept regarding the interest of the semantic distance and archetype notions. Dynamic visualizations are left for future work.

Conclusion
In this paper, we present a new method to integrate heterogeneous datasets including missing data. The approach relies on a new semantic distance which is designed for both quantitative and qualitative variables and can be considered an alternative to, for instance, Gower's. This distance allows for a more realistic representation of the relationships between individuals and a wider spread of the data points. This semantic distance can be linked to real-life knowledge regarding the modalities of the underlying variable or to distance measures captured in ontologies. In this respect, we defined how to describe the distance value between two concepts in OWL format. We associated this new distance definition with an archetype concept to overcome the cluttered displays issue. Indeed we defined archetypes as individuals representing groups of similar individuals from the dataset. Limiting the visualizations to these archetypes leads to a sparser representation which still provides valuable insight into the data.
The methodology described here was applied to a dataset describing rosebush passport and phenotypic traits but it could have wider use beyond information characterizing organisms or species and beyond plant science. Indeed we could apply the same approach to any incomplete mixed dataset. Moreover, the selection of a representative subset of a population is a widespread problem. It is for instance faced by BRCs willing to build core collections for the species they are conserving. Our technique could provide a complementary methodology to existing ones.
The method is fully functional and has been implemented in Python 3.7. However, some aspects imply future work. The method relies heavily on ontologies; sharing more reference ontologies would reduce the knowledge engineering burden. The design and choice of the pairwise distances in ontologies also have to be studied further so that they remain anchored in real-life information while still scaling up. Taking into account missing data through some kind of confidence in the pairwise distance between individuals, based on the number of variables used to calculate this distance, also deserves further study. Finally, an interactive visualization could improve the overall usability.

Figure 1
The dataset processing pipeline. Based on the list of qualitative variables we define the list of required ontologies (one for each variable). For each ontology, if relevant data are publicly available we retrieve them; otherwise we rely on expert knowledge to build the ontology graph. We introduce distances between concepts in the ontologies based either on real-life distances or on expert knowledge. These ontologies, including the concept-distance information, are used to build a distance matrix between variable modalities for each qualitative variable. Based on the vector of variable values which represents each individual in the dataset we calculate pairwise distances to build a distance matrix between individuals. Individuals are then projected into a coordinates space using multi-dimensional scaling. Individual coordinates are used during the clustering process to build groups. Representative individuals for each group are estimated to define the group archetypes which are used as part of the visualizations.

Figure 12
Heatmaps of the concordance tables between the KMeans clusters for 6 clusters (rows) and the other tested clustering algorithms (columns), semantic distance. The other algorithm corresponds to Birch, Hierarchical Clustering and Gaussian Mixture for the top three heatmaps and to KMedoids and Spectral Clustering for the bottom two.
Figure 13
Heatmap of the concordance table between KMeans clusters for 6 clusters built with Gower's distance (rows) and the semantic distance (columns).