Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF)

Mojahed, Aalaa; Bettencourt-Silva, Joao H.; Wang, Wenjia; de la Iglesia, Beatriz

doi:10.1007/978-3-319-21024-7_17

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Abstract

We define a heterogeneous dataset as a set of complex objects, that is, objects described by several data types, including structured data, images, free text or time series. We envisage that this definition could be extended to other data types. There are currently research gaps in how to deal with such complex data. In our previous work, we proposed an intermediary fusion approach called SMF, which produces a pairwise matrix of distances between heterogeneous objects by fusing the distances between the individual data types. More precisely, SMF aggregates partial distances that we compute separately for each data type, taking uncertainty into consideration. The result is a single fused distance matrix that can be passed to a standard clustering algorithm to produce a clustering. In this paper we extend the practical work by evaluating SMF using the k-means algorithm to cluster heterogeneous data. We used a dataset of prostate cancer patients in which objects are described by two basic data types: structured data and time-series data. We assess the clustering results using external validation against multiple possible classifications of our patients. The results show that the SMF approach can improve the clustering configuration when compared with clustering on an individual data type.
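The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the SMF formulas, so everything below is an assumption made for illustration: the function name `fuse_distance_matrices`, the use of a weighted mean as the aggregation rule, and the use of the share of missing components as a stand-in for uncertainty are hypothetical choices, not the authors' definitions. Each input matrix is assumed to hold pairwise distances for one data type, already scaled to [0, 1], with `None` marking a distance that could not be computed.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[Optional[float]]]


def fuse_distance_matrices(
    matrices: Sequence[Matrix],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Fuse per-data-type distance matrices into a single matrix.

    Illustrative assumption, not the published SMF definition:
    fused[i][j] is the weighted mean of the available distances, and
    uncertainty[i][j] is the weighted share of data types that did not
    contribute a distance for the pair (i, j).
    """
    if not matrices:
        raise ValueError("at least one distance matrix is required")
    n = len(matrices[0])
    if weights is None:
        weights = [1.0] * len(matrices)
    total_weight = sum(weights)

    fused = [[0.0] * n for _ in range(n)]
    uncertainty = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            weighted_sum = 0.0
            used_weight = 0.0
            for matrix, weight in zip(matrices, weights):
                distance = matrix[i][j]
                if distance is not None:
                    weighted_sum += weight * distance
                    used_weight += weight
            # If no data type provides a distance, the fused value is undefined.
            fused[i][j] = weighted_sum / used_weight if used_weight > 0 else math.nan
            uncertainty[i][j] = 1.0 - used_weight / total_weight
    return fused, uncertainty


# Toy example with three objects and two data types (values are made up).
structured = [[0.0, 0.2, 0.8],
              [0.2, 0.0, 0.6],
              [0.8, 0.6, 0.0]]
time_series = [[0.0, 0.4, None],
               [0.4, 0.0, 0.5],
               [None, 0.5, 0.0]]

fused, uncertainty = fuse_distance_matrices([structured, time_series])
```

In this toy example the pair (0, 2) has no time-series distance, so its fused value relies on the structured distance alone and its uncertainty entry is higher than for the other pairs. A clustering algorithm that accepts pairwise distances could then be run on `fused`, with `uncertainty` available to flag pairs whose fused distance rests on partial information.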

Funded by: King Abdulaziz University, Faculty of Computing and Information Technology, Jeddah, Saudi Arabia.

Notes

1. Contact the authors for more information about the data dictionary.

Acknowledgments

We acknowledge support from grant number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council to provide economic, scientific and social researchers and business analysts with secure data services.

Author information

Authors and Affiliations

Norwich Research Park, University of East Anglia, Norwich, Norfolk, UK
Aalaa Mojahed, Joao H. Bettencourt-Silva, Wenjia Wang & Beatriz de la Iglesia

Corresponding author

Correspondence to Aalaa Mojahed.

Editor information

Editors and Affiliations

IBaI, Leipzig, Germany
Petra Perner

Cite this paper

Mojahed, A., Bettencourt-Silva, J.H., Wang, W., de la Iglesia, B. (2015). Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF). In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_17

DOI: https://doi.org/10.1007/978-3-319-21024-7_17
Published: 01 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)
