Data quality-aware genomic data integration

Genomic data are growing at an unprecedented pace, along with new protocols, update policies, formats and guidelines, terminologies and ontologies, which are made available every day by data providers. In this continuously evolving universe, enforcing quality on data and metadata is increasingly critical. While many aspects of data quality are addressed at each individual source, we focus on the need for a systematic approach when data from several sources are integrated, as such integration is an essential aspect of modern genomic data analysis. Data quality must be assessed from many perspectives, including accessibility, currency, representational consistency, specificity, and reliability. In this article we review relevant literature and, based on the analysis of many datasets and platforms, we report on methods used for guaranteeing data quality while integrating heterogeneous data sources. We explore several real-world cases that are exemplary of more general underlying data quality problems and we illustrate how they can be resolved with a structured method, sensibly applicable also to other biomedical domains. The overviewed methods are implemented in a large framework for the integration of processed genomic data, which is made available to the research community for supporting tertiary data analysis over Next Generation Sequencing datasets, continuously loaded from many open data sources, bringing considerable added value to biological knowledge discovery.


Introduction
Genomics is going to generate the largest "big data" problem for mankind: between 100 million and 2 billion human genomes are expected to be sequenced by 2025 [114]. High-throughput technologies and, more recently, Next Generation Sequencing [111] have brought increasing amounts of genomic data of multiple types, realizing huge steps towards unravelling human genome mechanisms and applying them for unprecedented personalized medicine outcomes. Prior to data analysis and biological knowledge discovery, data and metadata integration is considered an activity of irrefutable priority, with pressing demands for enhanced methodologies of data extraction, matching, normalization, and enrichment, to allow building multiple perspectives over the genome; these can lead to the identification of meaningful relationships, otherwise not perceivable when using incompatible data representations [107].
Bioinformatics, including genomics in particular, traditionally operates by exploiting the considerable fieldwork on data acquisition, wrangling and analysis from its practitioners. Best practices are accumulated across labs and different projects, shared on forums (e.g., https://www.biostars.org/, http://seqanswers.com/, https://www.researchgate.net/) and collected in the documentation or wiki-guides in code repositories of tools and software. Within these processes, bioinformaticians are mostly concerned with the quality of the experimental data produced by sequencing platforms, for which consolidated pipelines - often composed of many scripts - are available.
In comparison, quality actions that can be performed when aggregating multiple experimental data in systematized ways have received less attention. However, with the emergence of a culture of data FAIRness [122] and of open and sharable science - promoted by initiatives such as FAIRsharing [109] - caring for data standards in both schemata and values becomes increasingly important, in the same way as implementing integration practices that foster data quality (focusing on accuracy, consistency, currency, and reliability [64]). In its 2012 report on quality of data, IBM found that 1 out of 3 business leaders do not trust the information they use to make decisions (https://www.ibmbigdatahub.com/infographic/four-vs-big-data); this ratio is unacceptable in fields like health-care and precision medicine, which are strongly driven by genomic databases and decision methods.

Corresponding author: anna.bernasconi@polimi.it (A. Bernasconi). ORCID(s): 0000-0001-8016-5750 (A. Bernasconi).
Recently, we have observed a trend of initiatives that gather tools and data structures to support interoperability among highly heterogeneous systems, to help bioinformaticians perform a set of curation and annotation operations. These include community-driven efforts such as bio.tools [68] (anchored within ELIXIR, https://www.elixir-europe.org/), service providers (EBI [98]), software suites (Bioconductor [67]), or lists (http://msutils.org/). Specific instances include APIs such as BioPython [34], tailored scripts, and field descriptions to be parsed (Bioschemas.org [60]). By using, e.g., the EDAM ontology [69], single initiatives can build bridges among resources, while conforming to well-established operations, types/formats of data and application domains.
In this fashion, most problems are handled within a single database by means of on-the-fly data integration, driven by community-inspired guidance. On the other hand, a more systematic approach of low-level integration - based upon experience in building solid data warehouses - has also been adopted, helping to reach stable interoperability among imported sources. In these years we have witnessed attempts at this kind of approach at many international centers for genomics (including the Broad Institute - https://www.broadinstitute.org/ - and the Wellcome Sanger Institute - https://www.sanger.ac.uk/ - whose efforts are so far unpublished) as well as in companies (including SciDB, implemented by Paradigm4, https://www.paradigm4.com/try_scidb/). In the context of research applied to real problems of the domain, the data-driven Genomic Computing project (GeCo [27]) has dedicated considerable efforts to integrating sources of data that are open for secondary research use, hence downloadable to a common repository, continuously updated. Such systems provide the advantage of offering users practical work environments. Indeed, biologists and clinicians appreciate ready-to-use repositories, while the know-how of bioinformaticians/developers (on scripting and querying technologies) may not always be at hand.
The emergence of the mentioned positive experiences indicates how data quality can be addressed in general, within the thousands of cooperative studies that are otherwise jeopardized by the poor quality of genomic data integration. Poor quality arises at very diverse levels: protocols, data units and dictionaries, metadata models and terminologies. We propose a step forward in the holistic understanding of genomic data and metadata integration, describing a number of methods that can be practically employed to resolve heterogeneity. While most of the introduced issues and techniques have commonalities with general data integration problems, we instantiate them in the specific genomic data context, providing practitioners with easy-to-relate examples to guide their procedures.
In Section 2 we discuss the state of the art since the earliest works on quality-aware genomic database management [91,16]. In Section 3 we focus on processed data (i.e., the signal extracted from raw genomic datasets) and on metadata (i.e., data description), which is the main driver for interoperability and interconnectedness of different databases; in this context, we present a taxonomy of data integration procedures that can positively affect data quality issues. We interpret integration as a set of steps [13], during which practitioners encounter several heterogeneity loci, which are contexts that cause heterogeneity and that may be addressed during the specific activities of integration.
In Section 4 we describe a collection of problems with related practical examples and solutions, proposed as common practices or specific to our experimented pipelines. In this context, data integration involves: synchronizing the content of a global repository with the data sources, organizing data and corresponding metadata with a unique orthogonal approach, considering interoperability of data descriptions and, more generally, allowing heterogeneous datasets to be used together seamlessly. More pragmatically, we argue that the problem of data quality cannot be addressed as an independent issue. It is entangled with many other aspects regarding data modeling, management, integration and usage. We do not consider quality deriving from original sources, as it is not a space where we can intervene a posteriori. Instead, we propose a novel angle: addressing data quality dimensions while diverse data sources are being integrated together to enable further applications. In conclusion, in Section 5 we reaffirm the need for quality-aware solutions to integration and mention our vision on upcoming approaches dedicated to this matter.

Background
From DNA microarrays [70] to Next Generation Sequencing [111], "quality" in genomics has been usually employed to refer to "quality control" steps on sequences, typically a pre-processing activity aimed at removing adapter sequences, low-quality reads, uncalled bases, contaminants. Instead, in this review we refer to Data Quality (DQ) in the broader sense defined by Wang and Strong [119], usually captured by the expression "fitness for use", i.e., the ability of datasets to meet their users' requirements. DQ is evaluated by means of different quality dimensions (i.e., single aspects or components of a data quality concept [115]). State of the art techniques to solve data quality issues in general databases are summarized in [50], under the name of 'data cleaning'.
In Figure 1 we appreciate the chronological order of publication of the relevant literature. In general, the more foundational works on data quality in genomic/biological databases appeared in the early years between 2003 and 2008, building the first baseline for this subject, while after 2014 we observe more specific contributions. Müller et al. [91] examine the quality of molecular biological entity databases. Within the production of data, the authors identify intrinsic problems that lead to incorrect data, concluding that traditional data cleaning techniques, used successfully in many other domains, do not fit the peculiarities of genomics. While giving a complete review of potentially very dangerous errors in sequence and annotation genomic databases, the discussion leaves aside processed data as well as aspects related to data integration and integrated access to multiple heterogeneous sources.
While in 2005 Martinez and Hammer proposed the conceptual integration of data quality measures inside a model of data [80], the research group led by Berti-Équille focused more on the problems deriving from warehousing genomic data [16,63,89]. Their overall experience is summarized in [88], where they claim that metadata describing data preparation and data quality are not exploited enough for ensuring valid results of downstream data analysis. Along the same lines as the pioneering work of Wang and Strong [119], some works propose measures and conceptual frameworks that take into account user-driven quality requirements; see the work by Missier et al. [87], BioGuideSRS [35], BioDQ [81] and the more recent paper by Veiga et al. [118].
Figure 1: The timeline of publications targeting data quality issues in biological (and more precisely genomic) databases. Red circles represent works describing approaches to resolve duplication; green circles are works on data warehousing or conceptual modeling; blue circles cover expert curation literature; grey circles are for user-driven data quality approaches; black circles are uncategorized works.

A preliminary work by León et al. [76] classifies the data quality properties that are most relevant for genomics; it was then applied concretely to a Crohn's Disease clinical diagnosis case study [97]. The general framework has been described very recently in Pastor et al. [99]. Rajan et al. [104] have recently proposed to build a knowledge base for assessing quality and characterizing datasets in biomedical repositories, thus including also genomics and other translational research data. Other works address data quality on specific kinds of genomic databases (e.g., by Hedeler and Missier [64] for transcriptomics and proteomics, by Etcheverry et al. [49] for Genome Wide Association Studies, and by Gonçalves and Musen [59] for repositories of biological samples). As to specific addressed problems, duplicate detection in biological data was addressed with association rule mining first by Koh et al. [74], then by Apiletti et al. [2,3] and by Müller et al. [90], with a focus on contradicting databases. Recent works cover the prevention of redundancy in big data repositories (UniProt KB in [23] and high throughput sequencing in [52]), providing a comparison with other widely used biological large data repositories.
A number of approaches focus on the quality of primary data archives (i.e., sequence databases such as GenBank [110]), to automatically detect inconsistencies with respect to literature content (see [20,21]) and to provide benchmarks [31], general categorizations of duplicates [32], de-duplication clustering methods [30], as well as insights on characteristics, impacts and related solutions to the problem of duplication in biological databases [29].
Finally, data integration in a quality-aware perspective includes practices of data curation [108] and of service/process curation [58]. Data curation is explored in [117], where the Eagle-i system is developed to facilitate collaborative curation, and in [102,101], as a means to deal with conflicting and erroneous data in UniProtKB.
Unfortunately, things have not changed much since the first contributions in this area: data integration is still responsible for solving many data quality problems in this overarching big data challenge. While the focus until now has largely been on the quality of original data, little attention has been dedicated to processed data and to metadata issues, which are critical during the data curation process. We thus consider the challenges reported in the mentioned works and remodel them into data quality-driven methods that are already implemented in a working integration framework.

Genomics data quality dimensions
The preliminary generation of genomic data follows guidelines and collections of best practices that are gathered during years of practitioners' experience; they are paired with metadata describing the produced datasets. These are submitted to repositories or collected by consortia that coordinate big research projects and are appointed with the responsibility of publishing them on their platforms. Unfortunately, the integrated use of data coming from different data sources is very challenging, as heterogeneity is met at multiple stages of data extraction (e.g., download protocols, update policies), integration (e.g., conceptual arrangement, values and terminologies), and interlinking (e.g., references and annotation).
While integrating genomic datasets, either for ad hoc use in a research study, or for building long-lasting integrated data warehouses, we deal with various complexities that arise during three phases: i) download and retrieval of data from the (potentially multiple) sources; ii) transformation and manipulation, providing fully or partially structured data in machine-readable formats; iii) enrichment, improving the interoperability of datasets.
With heterogeneity locus we refer to an activity or phase within the genomic data production/integration process that exhibits heterogeneity issues, thus undermining the quality of resulting resources. Dividing production from integration, the taxonomy in Figure 2 keeps track of all the phases in which a genomic data user may need to resolve problems related to non-standardized ways of producing data, making it accessible, organizing it, or enhancing its interoperability. Issues may derive from diverse data and process management habits across different groups that work within the same institution; even more so across different ones. In Figure 2 the heterogeneity loci (listed in the central column) are grouped by production and integration phases (on the left) and are related to data quality dimensions (on the right) that are critical in the represented heterogeneity aspects and are described in the following subsections. In the figure, as in the remainder of the paper, we refer to widely used state-of-the-art definitions of data quality dimensions [119,105] as well as to more recent ones [5,9].

Accuracy and validity of generated content
Within production, datasets are generated and then published. Generation includes complex practices and challenges, involving quality issues related to accuracy, i.e., the degree to which produced experimental data correctly and reliably describe real-world represented events [119]. Such aspects have been thoroughly reviewed in previous works [91,64,76]. Much less investigated, instead, are the issues related to metadata authoring (i.e., the preliminary compilation of information) [93]. Until very recently, practitioners and investigators from the biomedical community have not recognized metadata creation as a first-class activity in their work. As a consequence, the accuracy of metadata values is negatively affected and it becomes very hard for many final users to work with them.
As publication paves the way to downstream opportunities for integration and analysis, a growing number of scientific journals require, upon submission, that genomic experimental data are contextually submitted to public data repositories [1] (such as GEO, SRA [73] or ArrayExpress [6]). Unfortunately, metadata instances in the GEO repository suffer from redundancy, inconsistency, and incompleteness [124], especially due to a lightly regulated submission process. Users are allowed to create arbitrary fields that are not predefined by set dictionaries, much of the requested information is unstructured, and the validity of the fields' values (i.e., the degree of their compliance with the syntax - format, type, range - of the corresponding definitions [5]) is not checked. Information for submitting high-throughput sequencing data is listed at https://www.ncbi.nlm.nih.gov/geo/info/seq.html. A wide literature has been produced to capture structured information from GEO a posteriori (e.g., [100,120]). The scenario of alternative repositories, i.e., NCBI BioSample [7] and EBI BioSamples [44], is witnessed in [59]. Once published on public repositories, data become available for a much wider community and are potentially re-utilized in secondary analyses or integrated in other platforms; disorganization in the conveyance of provenance information and descriptions of generation procedures negatively affects 'data lineage' [45].
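A minimal sketch of such a validity check is shown below; the field names and syntax rules are purely illustrative, not the actual GEO submission schema.

```python
import re

# Hypothetical syntax rules for a few common metadata fields; the
# constraints (type, range, format) are illustrative examples only.
RULES = {
    "age": lambda v: v.isdigit() and 0 <= int(v) <= 120,
    "organism": lambda v: bool(re.fullmatch(r"[A-Z][a-z]+ [a-z]+", v)),
    "release_date": lambda v: bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
}

def check_validity(record):
    """Return the list of fields whose values violate their syntax rule."""
    return [f for f, rule in RULES.items()
            if f in record and not rule(record[f])]

sample = {"age": "thirty", "organism": "Homo sapiens",
          "release_date": "2019-12-16"}
print(check_validity(sample))  # ['age'] - the age value is not numeric
```

Such checks, applied at submission time rather than a posteriori, would prevent a large share of the invalid values discussed above.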

Accessibility of open genomic data
Sources display diverse download options including programmatic interfaces (APIs), file transfer protocol (FTP) servers, and simple web interface links (HTTP or HTTPS). According to our analysis of important consortia housing open genomic data: i) ENCODE, GDC, and ICGC provide HTTPS API GET/POST services to retrieve lists of files corresponding to chosen filters and additional services to download the corresponding files one by one; ii) Roadmap Epigenomics, GENCODE, RefSeq, 1000 Genomes, and GWAS Catalog store all files on FTP servers, that can be navigated programmatically; iii) GEO provides a variety of methods (both through its own portal and from alternative interfaces), each concentrated on selected partitions of the entire repository content; iv) GTEx can only be accessed from its HTML website.
Only in some cases is metadata information structured and programmatically available. Sometimes metadata files are associated 1:1 to data files (i.e., each data file has a corresponding metadata file); in these cases they can be downloaded in similar ways as the corresponding data file (e.g., by just adding a parameter in an API call, as in ENCODE, or by calling a similar API endpoint using the same file identifier, as in GDC). In other cases, a single metadata file describes a collection of experiments (e.g., Roadmap Epigenomics) or metadata information needs to be retrieved from a number of different summary text files, where the correspondence between a row and a genomic data file may be obtained using sample IDs (e.g., ICGC or 1000 Genomes). Other times sources have dedicated no effort to systematizing metadata or bringing metadata to a single place; these can only be gathered from descriptions scattered across Web pages. Accessibility measures the ability of genomic data consumers to easily and quickly retrieve datasets [119]; it is a critical aspect in this phase, as very specific modules need to be created for each source, often upon analysis of cumbersome online documentation and understanding of specific parameters of each portal. Moreover, many well-known open-data databases (such as Cistrome [127], Broad Institute's CCLE [56], and COSMIC [116]) require authentication to access their data; these can only be downloaded and not re-distributed, creating a barrier to integration.
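As an illustration of the 1:1 case, the following Python sketch derives a metadata location from a data file identifier; the URL patterns are simplified renditions of the portals' conventions and the file identifiers are hypothetical.

```python
def metadata_url(source, file_id):
    """Derive the metadata location for a data file identifier.

    The URL patterns below are simplified sketches of the respective
    portals' conventions, not authoritative endpoint specifications.
    """
    if source == "ENCODE":
        # ENCODE exposes a machine-readable view of each file object
        # by adding a format parameter to the object's URL
        return f"https://www.encodeproject.org/files/{file_id}/?format=json"
    if source == "GDC":
        # GDC offers a parallel metadata endpoint keyed by the same
        # file identifier used for the data download
        return f"https://api.gdc.cancer.gov/files/{file_id}"
    raise ValueError(f"no metadata access rule for {source}")

print(metadata_url("ENCODE", "ENCFF000AAA"))  # hypothetical identifier
```

One such rule per source is exactly the kind of very specific module mentioned above; the cost of writing and maintaining them is the practical face of the accessibility dimension.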

Currency of retrieved information
Measuring the extent to which data are up to date (the so-called currency [105]) is not trivial, as file version synchronization between integrative solutions and original sources strictly depends on the information about the data update state made available in the specific scenarios. The analyzed sources provide such kind of information in different ways: i) ENCODE, GDC and ICGC store information about last data update and checksums within their metadata; ii) Roadmap Epigenomics is a once-for-all project: it will not be updated (at least in the same distribution), therefore it does not give such information; iii) 1000 Genomes organizes copies of its data, sequenced in different phases of the project, in different folders of the FTP server - update information can be inferred from the paths of the files; iv) GENCODE and RefSeq produce different versions regularly; they are associated with release dates, available in the folder names used on the FTP server; v) GTEx and GWAS Catalog embed the source data version (and subversion) within the file names (e.g., "GTEx_v7_Annotations_SampleAttributesDS.txt" or "gwas_catalog_v1.0-studies_r2019-12-16").
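For sources of kind v), version and release date can be recovered from the file names themselves; the following sketch covers only the two example patterns above and would need per-source extension.

```python
import re

def extract_version(filename):
    """Pull version and release date out of file names such as those
    used by GWAS Catalog or GTEx; the regular expressions cover only
    the naming patterns shown in the examples above."""
    m = re.search(r"_v(\d+(?:\.\d+)?)", filename)
    version = m.group(1) if m else None
    d = re.search(r"_r(\d{4}-\d{2}-\d{2})", filename)
    release = d.group(1) if d else None
    return version, release

print(extract_version("gwas_catalog_v1.0-studies_r2019-12-16"))
# ('1.0', '2019-12-16')
print(extract_version("GTEx_v7_Annotations_SampleAttributesDS.txt"))
# ('7', None)
```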
On the contrary, metadata update information is available only in specific cases. For example, let us consider the case of ENCODE and GDC, which have complex hierarchical metadata structures in JSON. ENCODE centers its model on the Experiment entity, including Biosamples with many Replicates, to which Files belong. GDC is centered on the Patient entity, providing multiple Samples; data are also divided by Project of a certain Tumor Type, for which many Data Types are given. These sources associate an update date to each JSON element representing an entity, such as "Experiment", "Treatment", "Donor". The update date automatically pertains also to the elements contained in the entity (e.g., Experiment.assay, Donor.age, Treatment.pipeline...), allowing a fine-grained definition of the last update of each single metadata unit. For all sources where files are downloaded from an FTP server, the upload date of files can be used as the reference metadata update date.
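A minimal sketch of how such fine-grained last-update timestamps can be derived is given below; the JSON layout (a per-entity 'updated' key and nested dictionaries) is hypothetical, loosely inspired by the ENCODE/GDC hierarchical schemata.

```python
def last_updates(entity, inherited=None, prefix=""):
    """Propagate each entity's update date to its nested elements,
    yielding a last-update timestamp per metadata unit."""
    stamp = entity.get("updated", inherited)
    units = {}
    for key, value in entity.items():
        if key == "updated":
            continue
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # nested entity: its own date, if present, overrides the parent's
            units.update(last_updates(value, stamp, path + "."))
        else:
            units[path] = stamp
    return units

experiment = {
    "updated": "2020-01-10",
    "assay": "ChIP-seq",
    "treatment": {"updated": "2020-03-02", "pipeline": "v2"},
}
print(last_updates(experiment))
# {'assay': '2020-01-10', 'treatment.pipeline': '2020-03-02'}
```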

Representational conciseness and consistency
Transformation is necessary to organize genomic data and the related descriptions into formats that allow conciseness and consistency in the representation of information. These dimensions respectively measure the ability to compactly, yet completely, represent data and the ability to present data in the same format, allowing backward compatibility [119]. When targeting further data manipulation and analysis, these requirements consequently translate into ease of operation, i.e., the extent to which data are easily used and customized [119].
Genomic data organization is a hard task because files have many formats with different semantics (e.g., expression matrices, sets of annotations, sets of peaks measured during an experiment or instead corresponding to a specific reference epigenome...). No collectively accepted standard exists for a general yet basic data unit able to concisely represent very heterogeneous input data types (given that rows and columns can express different conceptual entities, with different levels of detail).
Metadata formats are also various: hierarchical ones (such as JSON, XML, or equally expressive) adhere to in-house conceptual models; tab-delimited formats (TSV, CSV or Excel/Google Spreadsheets) present different semantics for rows and columns; completely unstructured metadata formats, collected from Web pages or other documentation provided by sources, need to be understood case by case.
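A common mitigation is to reduce all formats to flat attribute-value pairs, the lowest common denominator across sources; the sketch below handles the hierarchical and tab-delimited cases, with illustrative key naming.

```python
import csv
import io
import json

def flatten_json(obj, prefix=""):
    """Reduce a hierarchical metadata object to dotted attribute-value pairs."""
    pairs = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            pairs.update(flatten_json(v, key + "."))
        else:
            pairs[key] = str(v)
    return pairs

def flatten_tsv_row(header, row):
    """Turn one row of a tab-delimited metadata table into attribute-value pairs."""
    return dict(zip(header, row))

json_meta = json.loads('{"biosample": {"organism": "Homo sapiens"}}')
print(flatten_json(json_meta))   # {'biosample.organism': 'Homo sapiens'}

tsv = io.StringIO("sample_id\ttissue\nS1\tliver\n")
rows = list(csv.reader(tsv, delimiter="\t"))
print(flatten_tsv_row(rows[0], rows[1]))  # {'sample_id': 'S1', 'tissue': 'liver'}
```

The unstructured case, of course, resists such mechanical treatment and still requires case-by-case parsing.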

Value consistency, uniqueness and specificity
Heterogeneity is present not only in representation formats, but is also evident in values. Normalization activities may involve adding/standardizing genomic coordinates (e.g., from 0-based coordinates to 1-based or vice-versa) and other positional information, adding associated known genomic regions (e.g., genes, transcripts, miRNA) from standard nomenclatures, or formatting into general/source-specific formats, such as ENCODE's narrowPeak or broadPeak standards. A non-exhaustive list of commonly used genomic formats is found at https://genome.ucsc.edu/FAQ/FAQformat.html.
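The coordinate conversion mentioned above is simple but a classic source of off-by-one errors: BED-style coordinates are 0-based and half-open, while GFF/VCF-style ones are 1-based and inclusive, so only the start position shifts.

```python
def bed_to_1based(start, end):
    """Convert a 0-based, half-open interval to 1-based, inclusive."""
    return start + 1, end

def onebased_to_bed(start, end):
    """Convert a 1-based, inclusive interval to 0-based, half-open."""
    return start - 1, end

# A region covering the first 100 bases of a chromosome:
print(bed_to_1based(0, 100))    # (1, 100)
print(onebased_to_bed(1, 100))  # (0, 100)
```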
Metadata that describe datasets in different sources are also often incompatible or incomplete, using various reference ontologies or no terminology at all. The lack of consistency between value domains (i.e., no compliance with semantic rules defined over sets of values [9]) certainly hinders interoperability among sources.
Moreover, as the identity of genomic records is realized using descriptive fields in metadata - usually in addition to internal identifiers - metadata are in charge of handling uniqueness with respect to instances within the same source, ensuring that no exact duplicates exist for the same experimental data record [105]. Uniqueness is certainly a goal within single sources, while in the genomics domain (and the biomedical one more in general) it is accepted that entries representing the same real-world entities are repeated in different sources, provided that linking references are present and records are aligned (as debated in [113]). This activity, improving lineage and interoperability of the database content, is very critical, especially in an application field where resources are typically not well interlinked and information is only present in some databases and with different degrees of value specificity (referred to as level of detail in [105]).
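One simple way to operationalize uniqueness within a single source is to fingerprint the descriptive fields that realize a record's identity, so that exact duplicates surface even when internal identifiers differ; the field choice below is illustrative.

```python
import hashlib

# Illustrative choice of descriptive fields carrying a record's identity
DESCRIPTIVE_FIELDS = ("assay", "biosample", "target", "assembly")

def fingerprint(record):
    """Hash the normalized descriptive fields into a duplicate-detection key."""
    key = "|".join(str(record.get(f, "")).lower() for f in DESCRIPTIVE_FIELDS)
    return hashlib.sha1(key.encode()).hexdigest()

a = {"id": "F001", "assay": "ChIP-seq", "biosample": "liver", "assembly": "hg19"}
b = {"id": "F002", "assay": "ChIP-seq", "biosample": "liver", "assembly": "hg19"}
print(fingerprint(a) == fingerprint(b))  # True: same descriptive identity
```

Near-duplicate detection across sources, where fields differ in specificity, requires the more elaborate clustering and rule-mining methods cited in Section 2.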

Reliability of annotations
Annotation, i.e., structural and functional classification of (sub)sequences, is an across-the-board activity of the genomic data life cycle. Typically, annotating means associating genomic regions with labels from the Gene Ontology [41] (explaining the related molecular function, biological process, cellular component) or with medical concepts related to the sequence (e.g., from UMLS [17]). The process is described in many works [47,63,64], hinting at the related data quality aspects. Annotations are either made by human experts, accurately based on literature evidence and certainly time consuming, or predicted automatically by algorithms that try to infer structural and functional information from similar genes/proteins (worse in terms of accuracy but much less time consuming).
Semantic annotation is instead a typical practice on metadata. As surveyed in Bodenreider [18], ontologies have been widely used in biomedical data management and integration for many years, with the main purpose of improving data interoperability [112]. Many tools are already available to allow semantic annotation with biomedical ontological concepts (see Annotator [71], EBI Zooma (https://www.ebi.ac.uk/spot/zooma/), NIH UMLS MetaMap [4], HeTop [61]). Techniques of text-mining [66] have been put into practice on many sources of biomedical text, including abstracts and experiment descriptions from the Gene Expression Omnibus [57,28,53], so far one of the biggest yet least curated and standardized sources, thus drawing more attention and efforts. The problem of choosing the right ontologies for semantic enrichment is addressed in [96].
However, guidelines to achieve more standard annotation outcomes are still lacking. Reliability [119] of results (i.e., the extent to which annotations can be confidently used to connect and compare datasets) remains a critical aspect of annotation, being dependent on both the algorithm and the acceptance of the ontology in the biomedical community (which itself results from many factors, sometimes hard to measure).

Quality-aware methods for data integration
During the research activity documented in [14] we analyzed about 30 data repository hosts, consortia databases, platforms, and interfaces that integrate heterogeneous datasets. We performed various genomic data excavation sessions with the prospective goal of understanding the most important open data sources to be included in a rich processed data repository. Within this process, we experienced several cases of heterogeneity located in the specific loci depicted in Figure 2 (see pink rectangles of different sizes, marked with labels that characterize Sections 4.1-4.4), necessarily resulting in data quality problems. In the following discussion, we focus on the loci related to integration phases. For each, we first provide paradigmatic real-world instances. Then, we formalize the problem into overarching questions, specifying the data quality dimensions that are addressed at this stage (as listed in the previous section). Finally, we outline methods that are employed to resolve the issue, from the literature and from integration efforts realized in the GeCo project.

Global repository synchronization with data sources
In the following we provide two example problems regarding data synchronization on the widely employed TCGA and ENCODE sources. Two additional examples, based on ICGC and 1000 Genomes, are available in Section 1 of the Supplementary material.

Example 1. Until 2016, TCGA data was available through a data portal that provided metadata only in XML format, using biospecimen supplements and clinical supplements that described, respectively, the biological samples analyzed in the experiments and the patient history, clinical information, and treatments. TCGA has undergone a transition towards the new GDC portal, where the data has, by now, been almost completely transferred. However, there are significant inconsistencies related to metadata. All supplements have been maintained and are still downloadable, but they nowhere fit in the newly described data model, available at https://gdc.cancer.gov/developers/gdc-data-model-0. Instead, an entirely new collection of metadata, available through a programmatic interface, has been defined, divided into four main endpoint groups (Project, Case, File, Annotation). The documentation of available fields is at https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/. The GDC migration is still ongoing; nevertheless, documentation is not consistently updated and it is common to find fields that are already visible in the interface facets (and APIs) but that indeed have null values for all instances in the database. Moreover, not all datasets that were available in the previous portal are now available in the new portal. For these reasons, synchronizing the content of an integrated repository with that of GDC becomes very critical.

Example 2. ENCODE source elements in JSON schemata, used for searching metadata through Elasticsearch (https://www.elastic.co/), are changed very often, as documented in about 90 Changelogs, one for each JSON entity corresponding to a profile.
A complete list of ENCODE's data model entities (i.e., profiles) is at https://www.encodeproject.org/profiles/. However, metadata instances also change their values. For example, to keep track of the change of about 10 attribute-value pairs in the experiment ENCSR635OSG (https://www.encodeproject.org/experiments/ENCSR635OSG/), only a simple comment in the metadata was added (i.e., Submitter comment: "IMPORTANT! Bioreplicate 2 was previously annotated as liver from a 4 year old female. It has now been corrected to be liver from a 32 year old adult male.").

Problem formulation. How can changes to genomic data sources be taken into account and reflected in integrated repositories, guaranteeing 'currency'? How can this be done systematically, overcoming 'accessibility' issues?
Method 1 - Source partitioning. When keeping integrated systems up to date, the main difficulty is to identify data partitioning schemes specific to each source (as discussed in [13]); a partition can be repeatedly accessed, and source files that are modified within the partition (or added to it) can be recognized, selectively avoiding the download of source files that have not changed. Suppose we are interested in downloading a certain updated ENCODE portion (e.g., transcriptomics experiments on human tissue, aligned to reference genome hg19). We produce an API request to the endpoint https://www.encodeproject.org/matrix/, specifying the parameters type = Experiment, replicates.library.biosample.donor.organism.scientific_name = Homo+sapiens, status = released, assembly = hg19, and assay_slims = Transcription.
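As a minimal sketch, the parameters above can be assembled into a query URL for the matrix endpoint; the parameter names come from the text, while the helper code itself is illustrative:

```python
from urllib.parse import urlencode

# Build the ENCODE matrix-endpoint query described in the text.
# Parameter names are taken from the example above; the code that
# assembles them is our own illustration.
BASE = "https://www.encodeproject.org/matrix/"

params = {
    "type": "Experiment",
    "replicates.library.biosample.donor.organism.scientific_name": "Homo sapiens",
    "status": "released",
    "assembly": "hg19",
    "assay_slims": "Transcription",
}

# urlencode percent-encodes values (spaces become '+', matching the
# Homo+sapiens form shown in the text).
query_url = BASE + "?" + urlencode(params)
print(query_url)
```

The resulting URL can then be fetched with any HTTP client to retrieve the matching experiments.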
In 1000 Genomes, as no APIs are available, we instead navigate the FTP server directly and check the most recent release available at ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release; we consequently enter the relevant folder (e.g., 20190312_biallelic_SNV_and_INDEL) and download all chromosome files.

Method 2 - Event-based update.
We periodically check source websites and FTP servers for new data. We use a relational database (called importer_db in the following) to manage the synchronization process between the data sources and our local repository. The Sources table has many Datasets, each of which corresponds to Files (i.e., the genomic region data files). Each Run of the download process checks unique properties of data Files, such as URL, OriginLastUpdate, OriginSize, and Hash, used to compare the local copy of the file with the original one on the data source: new files are stored and processed; missing files (i.e., deprecated on the source) are copied to an archive; matching files whose identifying values differ from the corresponding local values stored in the importer_db are re-downloaded.
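The per-file comparison performed at each Run can be sketched as follows; the record fields mirror the importer_db properties named above, while the function and action names are our own illustration:

```python
# Minimal sketch of the per-Run synchronization decision described
# above. Record fields (url, origin_last_update, origin_size, hash)
# mirror the importer_db file properties; the function name and the
# action labels are hypothetical.

def sync_action(local, remote):
    """Compare a local file record with its counterpart on the source.

    Both arguments are dicts keyed by 'url', 'origin_last_update',
    'origin_size' and 'hash'; either may be None when the file is
    missing on that side.
    """
    if local is None:                        # new on the source
        return "download"
    if remote is None:                       # deprecated on the source
        return "archive"
    for key in ("origin_last_update", "origin_size", "hash"):
        if local.get(key) != remote.get(key):
            return "redownload"              # identifying values differ
    return "keep"                            # local copy is current

local = {"url": "f.bed", "origin_last_update": "2020-01-01",
         "origin_size": 10, "hash": "aaa"}
remote = dict(local, hash="bbb")             # content changed upstream
print(sync_action(local, remote))            # -> redownload
```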
For different sources, ad hoc software modules can be developed to periodically check for changes in the Changelog, schema definitions, and documentation, in search of reasons to update our local copy of the source data. As an example, GEO offers registered users a mechanism to be notified when new data relevant to a previously saved search becomes available (https://www.ncbi.nlm.nih.gov/geo/info/faq.html#notifications). TCGA2BED [46] was realized to handle data acquisition and transformation for the TCGA source, and OpenGDC [26] provides an updated framework to acquire synchronized data also from the GDC portal.

Orthogonal data and metadata organization
Examples. While there is common agreement on the terminology used to define genomic data types (e.g., mutation, copy number variation, chromatin accessibility), data types are typically not rendered using the same machine-readable formats (e.g., there exist both VCF-like and ICGC-like mutation formats (https://docs.icgc.org/submission/guide/icgc-simple-somatic-mutation-format/), and gene expression data may be presented as sample/gene matrices or just as lists of genes with expression values per aliquot). Sometimes formats are defined at "experiment time" to suit particular needs of the data; they are documented in plain-text attachments. An example of the format definition of ENCODE tsv files representing a gene expression matrix is https://www.encodeproject.org/documents/c2bbcf04-9b9d-41aa-883f-bbba9bc45e68/. In such documents, some specifications may further confuse data organization, as matrix cell values are allowed to contain ad hoc formatting semantics (e.g., from an ENCODE format definition document: "The value in the cell contains two strings, one for TPM values and another for FPKM values, separated by underscore; each string contains values for each replicate, separated by colon.").
Many formats also fail to keep representation levels orthogonal; for example, properties that represent values aggregated over a multitude of regions are sometimes displayed as part of single regions, repeated in each of them. In 1000 Genomes variation data, each line expresses one mutation and contains, as a property, the measure of allele frequencies across entire geographic populations (i.e., thousands of samples).
Additionally, data from specific projects are simultaneously provided by different portals, which however reshape them in several ways: the ENCODE portal includes Roadmap Epigenomics data, re-processed using distinct pipelines and with completely different data schemata and metadata; TCGA data appears in both GDC and ICGC with very dissimilar representations both for data (one textual file for each aliquot from a patient, as opposed to one big spreadsheet containing independent lines, each connected to a patient) and for metadata.

Problem formulation.
There is no agreement on a basic genomic data unit for tertiary analysis. A common choice is to prepare one file for each experimental session, whose lines are genomic regions associated with some properties. Other times, data units are huge matrices of patients or samples crossed with genes, miRNAs, or other encoded sequences. Each source and each data type thus needs its own "basic unit". Can genomic data be expressed using a unique model that is general enough to represent all analyzed formats ('concise and consistent representation'), and that also allows 'ease of operation'?

Method -Genomic Data Model and sample identity.
The need for defining a genomic basic data unit is emerging: a single piece of information that contains genomic regions with their properties and is identifiable with an entity that is interesting for downstream analysis (e.g., a patient, a biological sample, a reference epigenome...). Any set of downloaded files - with their input format - should be convertible through a transformation relation into a set of genomic basic data units. We define as transformation relation cardinality the pair X:Y, where X is the cardinality of the set of files from the input source and Y is the cardinality of the output set of "basic units" into which the input is transformed for downstream use in an integrative system; X:Y is expressed as a fraction in lowest terms.
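Since the transformation relation cardinality X:Y is defined as a fraction in lowest terms, it can be computed mechanically; the helper below is a small illustration (the function name is ours, not part of the framework):

```python
from math import gcd

# Reduce a transformation-relation cardinality X:Y to lowest terms,
# as defined above. Helper name is illustrative.
def cardinality(x: int, y: int) -> str:
    d = gcd(x, y)
    return f"{x // d}:{y // d}"

print(cardinality(2, 2))    # e.g., two source files -> two samples, i.e., 1:1
print(cardinality(1, 25))   # e.g., one matrix yielding 25 samples -> 1:25
```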
As a paradigm that more generally subsumes interval-based genomic data representations (see, for example, BEDTools [103] and BEDOPS [94]), an interesting candidate for expressing such a basic unit is the sample of the Genomic Data Model (GDM, [84]). A sample can express heterogeneous DNA features, such as variations (e.g., a mutation in a given DNA position), peaks of binding or expression (i.e., genomic regions with higher read density), or structural properties of the DNA (e.g., break points, where the DNA is damaged, or junctions, where the DNA creates loops). GDM is based on the notion of dataset, i.e., a collection of samples. A sample, in turn, consists of two parts: the region data, describing the characteristics and DNA location of genomic features, and the metadata, describing general properties of the sample in the form of key-value pairs; in GDM format there is one metadata file for each region data file.
Some sources, for example ENCODE, provide a data file for each experimental event. In this case, the transformation has a 1:1 cardinality, i.e., each file produced by ENCODE corresponds to one GDM sample. Other sources include more complex formats, such as MAF, VCF, and gene expression matrices. In these cases, the transformation phase takes care of compiling one single data file for each patient or univocally identified sample in the origin data. The transformation cardinality is thus 1:N, N being the number of patients or biological samples.
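A hedged sketch of the 1:N case: one gene expression matrix is split into one basic unit per aliquot. The input layout and column names below are illustrative, not an actual source format:

```python
# Sketch of the 1:N transformation: one expression matrix
# (genes x aliquots) is split into one region-data unit per aliquot.
# The matrix layout and names are made up for illustration.

matrix = {
    "header": ["gene", "aliquot_A", "aliquot_B"],
    "rows": [
        ["TP53", "3.1", "2.7"],
        ["BRCA1", "0.4", "0.9"],
    ],
}

def split_by_aliquot(m):
    """Return {aliquot: [(gene, value), ...]} - one unit per sample."""
    aliquots = m["header"][1:]
    units = {a: [] for a in aliquots}
    for row in m["rows"]:
        gene, values = row[0], row[1:]
        for aliquot, value in zip(aliquots, values):
            units[aliquot].append((gene, value))
    return units

units = split_by_aliquot(matrix)
print(len(units))   # one input file -> 2 basic units (cardinality 1:2)
```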
Metadata also feature diverse formats in the analyzed sources: i) hierarchical formats (JSON, XML, or equally expressive) require applying a flattening procedure to create key-value pairs - the key results from the concatenation of all JSON/XML elements from the root to the element corresponding to a value; ii) tab-delimited formats (TSV, CSV, or Excel/Google Spreadsheets) strictly depend on the semantics of rows and columns (e.g., 1 row = 1 epigenome, 1 row = 1 biological sample) - they often require pivoting tab-delimited columns into rows (which corresponds to creating key-value pairs); iii) two-column tab-delimited formats (such as GEO's SOFT files) are translated into GDM straightforwardly; iv) completely unstructured metadata formats, collected from Web pages or other documentation provided by sources, need case-specific manual processing.

Table 1 Census of 13 important data sources, reporting for each: the processed data types that can be downloaded (along with metadata), their physical formats, and the semantic cardinality of the transformation relation with respect to the GDM output format [84]. Expressed as X:Y, this ratio represents the number X of data (resp. metadata) units used in the origin source to compose Y data (resp. metadata) file(s) in GDM format. 2 Each reference epigenome is used for many data types, thus many GDM samples; the same epigenome-related metadata is replicated into many samples. 3 In these cases it is difficult to build a numerical relation - many meta are retrieved from the data files themselves, in addition to manually curated information.

Table 1 shows transformation relation cardinalities regarding both data and metadata input formats, targeting the GDM output format. We analyzed different data types in a number of important data sources, which possibly include files with different formats.
Note that, while for descriptive purposes we indicate physical formats (e.g., TSV, TXT, JSON), the indication of cardinalities also embeds semantic information: how many data units are represented in one file. By following the mapping from input sources into the GDM format, we can systematically resolve the heterogeneity of data formats and prepare GDM datasets as sets of GDM samples that are uniform in their schema. The Supplementary material (Section 2) provides additional details on this method.
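The flattening procedure for hierarchical metadata (case i above) can be sketched as follows; the key separator ("__") is an assumption, as the text only prescribes concatenating the path from the root to each value:

```python
# Flattening procedure for hierarchical metadata (JSON/XML): keys
# concatenate the element path from the root. The "__" separator is
# an assumption for illustration.

def flatten(node, prefix=""):
    pairs = {}
    if isinstance(node, dict):
        for key, value in node.items():
            pairs.update(flatten(value, f"{prefix}{key}__"))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            pairs.update(flatten(value, f"{prefix}{i}__"))
    else:
        # Leaf reached: drop the trailing separator to form the key.
        pairs[prefix.rstrip("_")] = node
    return pairs

meta = {"biosample": {"donor": {"organism": "Homo sapiens"}, "age": 32}}
print(flatten(meta))
```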

Metadata interoperability
Examples. Metadata heterogeneity can also be analyzed from other perspectives. From a schema point of view (i.e., how each piece of information is identified and interrelated with others), when searching for disease-related attributes, we find diverse possibilities: "Disease type" in GDC, "Characteristics-tissue" in GEO, "Health status" in ENCODE. From the values point of view, when searching for breast cancer-related information, we find multiple expressions, pointing to comparable samples, e.g., "Breast Invasive Carcinoma" (GDC), "breast cancer ductal carcinoma" (GEO), "Breast cancer (adenocarcinoma)" (ENCODE).
Roadmap Epigenomics expresses the ages of samples using a single column "AGE (Post Birth in YEARS/Fetal in GESTATIONAL WEEKS/CELL LINE CL)", shared by three different classes and, consequently, by three different measure units. Example values for single instances are "Fetus (GW unknown)", "CL", "Unknown, Unknown, 45Y", or "49Y, 59Y, 41Y, 25Y, 81Y", where 5 values are put together to express that the related epigenome is derived from 5 individuals. Also the information about donors and its interrelation with other attributes is confusing. The column "Single Donor (SD) / Composite (C)" discriminates between epigenomes deriving from one or more donors. Yet, in the column "DONOR / SAMPLE ALIAS" (containing identifiers), an epigenome labeled as deriving from a single donor may instead contain multiple IDs. No further explanation is available to clarify the semantics; other dependent columns, such as "sex" and "ethnicity", also become unclear. Paradoxically, one donor, identified by the string H-22772, turns out to be present in two different epigenomes, one derived from lung and one from heart tissue. This problem easily propagates to other sources, as Roadmap Epigenomics experiments are replicated in the ENCODE repository. Here, different experiments present the same external tags (e.g., roadmap-epigenomics:UW H22772).

Problem formulation. How can heterogeneous metadata schemas and values be reconciled into a common representation, so as to help users in querying data straightforwardly ('ease of operation')?

Method -Genomic Conceptual Model for metadata normalization.
In the literature, there are works that use conceptual modeling to better explain the relations between biological entities [63,106,97]. However, conceptual modeling can also serve brilliantly the purpose of organizing metadata from heterogeneous sources into one global view. The Genomic Conceptual Model (GCM, [15]) is an Entity-Relationship model used to describe the metadata of genomic data sources. The main objective of GCM is to recognize a common set of concepts (about 40) that are semantically supported by most genomic data sources, although with very different syntax and forms. GCM is a star schema - inspired by classic data marts [19] - centered on the ITEM entity, representing a genomic basic data unit, such as the GDM elementary sample. The four dimensions of the star describe the biology of the experiment, the used technology, its management aspects, and the extraction parameters for the internal organization of items. A complete integration framework (described in [13]) can be employed to download, transform, clean, and integrate metadata at the schema level, importing them into the relational database gcm_db that physically implements the GCM. Data constraint checks (name existence and value dependencies in [15]) are performed based on a set of manually introduced rules, but also on automatically generated ones, inspired by the works on data cleaning using association rule mining [3] and much in the fashion of [82], which uses rules as a means to generate recommendations for suitable metadata additions to datasets. The conceptual representation of GCM has greatly helped domain users in finding data more easily from a unique query interface, without having to deal with heterogeneous access points, metadata formats, and models, as demonstrated in [10].
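The value-dependency checks mentioned above can be illustrated with a minimal sketch; the rule encoding and the example rule are ours, not taken from [15]:

```python
# Illustrative sketch of value-dependency checks on item metadata:
# each rule states "when attribute A has value V, attribute B must
# have value W". The rule encoding and example are hypothetical.

RULES = [
    # (if-attribute, if-value, then-attribute, expected-value)
    ("disease", "Breast Invasive Carcinoma", "tissue", "breast"),
]

def violations(item, rules=RULES):
    """Return the list of rules that the item violates."""
    out = []
    for a, v, b, w in rules:
        if item.get(a) == v and item.get(b) != w:
            out.append((a, v, b, w, item.get(b)))
    return out

item = {"disease": "Breast Invasive Carcinoma", "tissue": "lung"}
print(violations(item))   # the single dependency above is violated
```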

Large-scale dataset interoperability
Example on data. Within ICGC, gene annotation is not consistent among different data types (e.g., sequence-based gene expression datasets use Ensembl gene IDs [125], like ENCODE gene quantification data and TCGA gene expression quantification, while array-based gene expression datasets use the gene name convention of HGNC [123]). Within annotation databases themselves, data may be incomplete. For example, in GENCODE's comprehensive gene annotation files (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/) not all exon and transcript regions have a corresponding gene region that includes them. While searching for the correct coordinates of a gene, users may alternatively calculate the start as that of its left-most transcript/exon and the stop as that of its right-most transcript/exon, but this procedure may not always be accurate. Such shortcomings are consequently propagated to all processed data sources where the reference gene annotation is used to codify signal data (e.g., ENCODE). Furthermore, secondary sources use different release versions to annotate different files (to date, GENCODE has 34 releases, out of which only 6 are still maintained for the new GRCh38 assembly). This makes it hard to consistently compare files from the same source that have been annotated using different reference sets.
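The fallback computation of gene coordinates described above (left-most transcript/exon start, right-most stop) can be sketched as follows, on made-up intervals:

```python
# Fallback described in the text: when an annotation file lacks the
# gene region, approximate gene coordinates from its transcripts or
# exons (left-most start, right-most stop). Intervals are made up.

def gene_span(features):
    """features: list of (start, stop) for the transcripts/exons of a gene."""
    starts, stops = zip(*features)
    return min(starts), max(stops)

transcripts = [(100, 500), (80, 450), (120, 610)]
print(gene_span(transcripts))   # -> (80, 610)
```

As the text notes, this approximation may not always be accurate, e.g., when untranslated gene portions extend beyond all annotated transcripts.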
Example on metadata. Metadata are affected by the even more complicated issue of ontology misalignment. The CL [86] and EFO [79] ontologies reference the same concepts, but the specific instances in the two ontologies differ in values and schema. NCIT [42] and UBERON [92], both including parts of the human body, also show inconsistencies: while "hypothalamus" is considered a synonym of "BRAIN" in NCIT, it is a sub-concept of "brain" in UBERON (five levels more specific, traversing both the subsumption relationship is_a and the containment relationship part_of). Using ontologies as a base for further semantic annotation, many algorithms still produce a relevant number of inaccurate annotations (see [54,28]), which results in harder work for the downstream integration process.

Problem formulation.
How can datasets understand each other? Can we normalize data with respect to commonly adopted terminologies ('consistency', 'specificity' and 'uniqueness') and confidently exploit the currently available external resources ('reliability')?
Method 1 - Data enrichment. A fruitful approach to annotation is the inclusive one: integrators may add as much information as possible, considering the most accepted resources in the field. For structural and functional annotation of genomic regions and sequences, such as adding gene/transcript/exon identifiers or the biological processes related to a protein, multiple reference databases may be queried (RefSeq, GENCODE, Ensembl, Entrez [78], HGNC), as documented in TCGA2BED [46] and OpenGDC [26], or as performed during the integration of several datasets from Roadmap Epigenomics and transcriptomics data from ENCODE. Large-scale data integration in genomics can be achieved using cross-references (see [55]); its success strictly depends on the correct use of persistent identifiers [85]. See Section 3 of the Supplementary material for more details on this method.
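A minimal sketch of enrichment through cross-references: a record keyed by a persistent Ensembl identifier is joined with identifiers from other databases (the tiny lookup table below is illustrative):

```python
# Data-enrichment sketch: join identifiers across reference databases
# via a cross-reference table keyed on a persistent Ensembl gene ID.
# The tiny table below is illustrative of the idea, not a real export
# from any of the cited databases.

XREF = {
    "ENSG00000141510": {"hgnc_symbol": "TP53", "entrez_id": "7157"},
}

def enrich(record, xref=XREF):
    """Add all known cross-referenced identifiers to a gene record."""
    extra = xref.get(record.get("ensembl_gene_id"), {})
    return {**record, **extra}

region = {"chrom": "chr17", "ensembl_gene_id": "ENSG00000141510"}
print(enrich(region))
```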

Method 2 -Metadata enrichment.
The process of annotating existing structured metadata with ontological terms, their definitions, synonyms, ancestors, and descendants can be performed iteratively: the querying of online annotation systems and the computation of semantic matches are automated, while an expert manually checks the obtained links [12]. This enhancement of metadata (using specialized biomedical ontologies) can be seen as the construction of a knowledge graph of the content of the repository [11]; it is useful to support the search of datasets described by such metadata in a semantically enriched fashion (see the GenoSurf interface [25]). See Section 4 of the Supplementary material for more details on this method.
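The automated part of this annotation step can be sketched as an exact/synonym match against ontology terms; the toy ontology fragment below is illustrative, and the iterative, expert-checked loop of [12] is not reproduced:

```python
# Metadata-enrichment sketch: link a raw metadata value to an
# ontology term by exact label or synonym matching. The ontology
# fragment is toy data for illustration; real pipelines query online
# annotation systems and add expert curation.

ONTOLOGY = {
    "EFO:0000305": {"label": "breast carcinoma",
                    "synonyms": {"breast cancer", "carcinoma of breast"}},
}

def annotate(value, ontology=ONTOLOGY):
    """Return the matching term ID, or None for later manual curation."""
    v = value.lower()
    for term_id, term in ontology.items():
        if v == term["label"] or v in term["synonyms"]:
            return term_id
    return None

print(annotate("Breast Cancer"))   # matched via a synonym
```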

Discussion and outlook
The integration of data and metadata is of growing relevance in biomedical fields (including genomics), because critical decisions in health-care domains - such as precision medicine - depend on it. As individualized predictions become more difficult, they require approaches that combine multiple sources and multiple data types (from genomics, transcriptomics, epigenomics, etc.), possibly complemented with clinical data. Heterogeneity aspects affect many actors and stages of the data life cycle. In such situations, data quality dimensions can adequately guide the analysis of problems and related solutions. We have reviewed works that have contributed to data quality-driven approaches in genomics; even with community-driven approaches that propose on-the-fly data integration, the focus has so far been on the quality of origin data sources and not so much on the overall process that channels data together for subsequent use. Thus, we have introduced a novel perspective: we have shown a taxonomy of integration phases that directly impact the quality of genomic databases and interfaces during data integration; we have detailed the issues related to such phases, providing examples, questions to be addressed, and methods that we experimented with during the creation of a high-quality repository, which inspired the discussions of this review paper.
The repository, currently with more than 250k processed items, results from the effort of the GeCo project. Figure 3 shows the sequential software modules (https://github.com/DEIB-GECO/Metadata-Manager/) used to integrate genomic sources by solving all the analyzed heterogeneity aspects. Phases are recorded in the importer_db: a given dataset is downloaded and periodically synchronized with the origin source, then transformed into the GDM format (achieving an orthogonal data/metadata organization); metadata are cleaned, simplifying redundant attribute names, mapped into the gcm_db (a unique conceptual representation, towards interoperability of metadata), semantically enriched, and checked with respect to constraints. The relational representation is flattened to load datasets into a file-based engine for further biological querying [83].
While the described approaches have been successfully implemented in practical contexts [13], future challenges include applying the proposed solutions to complex contexts such as clinical data and translational medicine, which ultimately will also need to be integrated with genomic data. We are aware of important work that is being conducted in parallel on health data [40,72,24,65,38], also employing the data warehouse paradigm as a guarantor of up-to-date, de-duplicated data within a public network of research centers [37], usually oriented to support analytics [39]. Several works already address data quality for precision medicine [36,43,97], reviewing the use of genomic data in the medical context, whereas this review focuses primarily on issues of quality in genomic data integration (data comparability, metadata definitions, data standards, ...) encompassing all possible uses of genomic data.
In this review, we have shown how resolving quality issues while building a data repository can effectively create usable integrated environments for researchers. Since many of the described approaches may be useful to other researchers - even in dynamic data integration settings - these will be provided through convenient external programmatic access. Starting from this baseline, we envision a data integration process that seamlessly includes the evaluation of quality parameters, towards data and information that are more directly employable in genomic analysis and biological discovery. Predictably, future data integration approaches will increasingly adopt a data quality-aware modus operandi with the following characteristics: i) currency-driven synchronization of sources, ii) concise/orthogonal/common data representations, iii) light and interoperable data descriptions, iv) reliability-tailored dataset linkage. All in all, this review highlights trends in genomic data and information integration, which will ideally guide and improve future efforts and activities.

Funding
This work was supported by the European Research Council Executive Agency under the EU Framework Programme Horizon 2020, ERC Advanced Grant number 693174 GeCo (data-driven Genomic Computing).

Acknowledgements
The author would like to thank Professor Stefano Ceri and Professor Cinzia Cappiello for fruitful discussions and for providing precious suggestions and inspiration during the preparation of the manuscript.

Conflicts of interest
The author declares no conflict of interest.