Abstract
We define a heterogeneous dataset as a set of complex objects, that is, objects described by several data types, such as structured data, images, free text or time series. We envisage that the definition could be extended to other data types. There are currently research gaps in how to deal with such complex data. In our previous work, we proposed an intermediary fusion approach called SMF, which produces a pairwise matrix of distances between heterogeneous objects by fusing the distances between the individual data types. More precisely, SMF aggregates partial distances that we compute separately for each data type, taking uncertainty into consideration. The result is a single fused distance matrix that can be passed to a standard clustering algorithm. In this paper we extend the practical work by evaluating SMF using the k-means algorithm to cluster heterogeneous data. We used a dataset of prostate cancer patients in which objects are described by two basic data types: structured data and time series. We assess the clustering results using external validation against multiple possible classifications of our patients. The results show that the SMF approach can improve the clustering configuration when compared with clustering on an individual data type.
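The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' SMF formula, which the abstract does not give: the weighted mean, the use of NaN to mark a partial distance that could not be computed, and the names `fuse_distances` and `missing_fraction` are all assumptions made for illustration.

```python
import numpy as np


def fuse_distances(partials, weights=None):
    """Fuse per-data-type distance matrices into a single matrix.

    Assumption (not from the paper): each partial matrix is (n, n), scaled
    to [0, 1], and uses NaN where a distance could not be computed.
    Returns the fused matrix and, per pair, the fraction of data types
    that were missing, as a crude stand-in for an uncertainty measure.
    """
    stack = np.stack([np.asarray(p, dtype=float) for p in partials])
    if weights is None:
        weights = np.ones(len(partials))
    w = np.asarray(weights, dtype=float)[:, None, None]

    known = ~np.isnan(stack)                      # which partial distances exist
    weight_sum = (w * known).sum(axis=0)          # total weight of known entries
    weighted = np.where(known, stack, 0.0) * w    # missing entries contribute 0

    # Pairs with no known partial distance stay NaN (0 / 0).
    with np.errstate(invalid="ignore", divide="ignore"):
        fused = weighted.sum(axis=0) / weight_sum

    missing_fraction = 1.0 - known.mean(axis=0)
    return fused, missing_fraction


# Hypothetical toy data: three objects, two data types.
structured = [[0.0, 0.2, 0.8],
              [0.2, 0.0, 0.6],
              [0.8, 0.6, 0.0]]
time_series = [[0.0, 0.4, np.nan],
               [0.4, 0.0, 0.2],
               [np.nan, 0.2, 0.0]]

fused, missing_fraction = fuse_distances([structured, time_series])
```

In this toy example the pair (0, 2) has no time-series distance, so its fused value falls back to the structured distance alone and its `missing_fraction` is 0.5. The fused matrix could then be handed to any clustering routine that accepts precomputed distances.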
Funded by: King Abdulaziz University, Faculty of Computing and Information Technology, Jeddah, Saudi Arabia.
Notes
1. Contact the authors for more information about the data dictionary.
Acknowledgments
We acknowledge support from grant number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council to provide economic, scientific and social researchers and business analysts with secure data services.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Mojahed, A., Bettencourt-Silva, J.H., Wang, W., de la Iglesia, B. (2015). Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF). In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_17
DOI: https://doi.org/10.1007/978-3-319-21024-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)