Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process

Ravi, Sandipp Krishnan; Comlek, Yigitcan; Chen, Wei; Pathak, Arjun; Gupta, Vipul; Umretiya, Rajnikant; Hoffman, Andrew; Pilania, Ghanshyam; Pandita, Piyush; Ghosh, Sayan; Mckeever, Nathaniel; Wang, Liping

arXiv:2402.04146 (stat)

[Submitted on 6 Feb 2024 (v1), last revised 16 Feb 2024 (this version, v2)]

Abstract: With the advent of artificial intelligence (AI) and machine learning (ML), various science and engineering communities have leveraged data-driven surrogates to model complex systems from numerous sources of information (data). This proliferation has led to a significant reduction in the cost and time involved in developing superior systems designed to perform specific functionalities. A high proportion of such surrogates are built by extensively fusing multiple sources of data, be it published papers, patents, open repositories, or other resources. However, not much attention has been paid to the differences in quality and comprehensiveness of the known and unknown underlying physical parameters of the information sources, which could have downstream implications during system optimization. Towards resolving this issue, a multi-source data fusion framework based on Latent Variable Gaussian Process (LVGP) is proposed. The individual data sources are tagged as a characteristic categorical variable that is mapped into a physically interpretable latent space, allowing the development of source-aware data fusion modeling. Additionally, a dissimilarity metric based on the latent variables of LVGP is introduced to study and understand the differences in the sources of data. The proposed approach is demonstrated on and analyzed through two mathematical (representative parabola problem, 2D Ackley function) and two materials science (design of FeCrAl and SmCoFe alloys) case studies. From the case studies, it is observed that, compared to using single-source and source-unaware ML models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems, interpretability regarding the sources, and enhanced modeling capabilities by taking advantage of the correlations and relationships among different sources.
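
The source-tagging idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional latent space, the Gaussian correlation form, the Euclidean dissimilarity, and every name and number below (`lvgp_kernel`, `source_dissimilarity`, `Z`, `phi`) are assumptions chosen for illustration. In an actual LVGP the latent coordinates would be estimated from data rather than fixed by hand.

```python
import numpy as np

def lvgp_kernel(X1, s1, X2, s2, Z, phi):
    """Gaussian correlation over numerical inputs and latent source coordinates.

    X1, X2 : arrays of shape (n1, d) and (n2, d) with numerical inputs
    s1, s2 : integer source labels, each indexing a row of Z
    Z      : (n_sources, 2) latent coordinates, one row per data source
             (hypothetical, hand-picked values in this sketch)
    phi    : (d,) positive scale parameters for the numerical inputs
    """
    dx = X1[:, None, :] - X2[None, :, :]        # (n1, n2, d) input differences
    dz = Z[s1][:, None, :] - Z[s2][None, :, :]  # (n1, n2, 2) latent differences
    return np.exp(-np.sum(phi * dx**2, axis=-1) - np.sum(dz**2, axis=-1))

def source_dissimilarity(Z):
    """Pairwise Euclidean distance between latent source positions.

    Assumed form of a latent-space dissimilarity; the abstract does not
    give the paper's exact definition.
    """
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

# Three hypothetical sources: 0 and 1 sit close together, 2 is far away.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [1.5, 0.8]])
phi = np.array([1.0])

# The same numerical input observed under each of the three sources.
X = np.array([[0.5], [0.5], [0.5]])
s = np.array([0, 1, 2])

K = lvgp_kernel(X, s, X, s, Z, phi)
D = source_dissimilarity(Z)
```

Under these assumptions, sources that lie close together in the latent space are strongly correlated in `K` and so share information during fusion, while a distant source contributes little. The matrix `D` gives a direct, readable summary of which sources behave alike.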

Comments:	27 Pages, 9 Figures, 3 Supplementary Figures, 2 Supplementary Tables
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:2402.04146 [stat.ML]
	(or arXiv:2402.04146v2 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2402.04146

Submission history

From: Yigitcan Comlek
[v1] Tue, 6 Feb 2024 16:54:59 UTC (9,716 KB)
[v2] Fri, 16 Feb 2024 18:17:15 UTC (9,716 KB)
