Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF)

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Abstract

We define a heterogeneous dataset as a set of complex objects, that is, objects described by several data types, including structured data, images, free text or time series. We envisage that this definition could be extended to other data types. There are currently research gaps in how to deal with such complex data. In our previous work, we proposed an intermediary fusion approach called SMF, which produces a pairwise matrix of distances between heterogeneous objects by fusing the distances between the individual data types. More precisely, SMF aggregates partial distances that we compute separately from each data type, taking uncertainty into consideration. The result is a single fused distance matrix that can be passed to a standard clustering algorithm. In this paper we extend the practical work by evaluating SMF using the k-means algorithm to cluster heterogeneous data. We used a dataset of prostate cancer patients in which objects are described by two basic data types, namely structured and time-series data. We assess the clustering results using external validation against multiple possible classifications of our patients. The results show that the SMF approach can improve the clustering configuration when compared with clustering on an individual data type.
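The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact formulation: the abstract does not give the aggregation formula, so the sketch assumes a weighted average of per-type distance matrices, each scaled by its own maximum, and treats a missing partial distance (marked `None`) as the only source of uncertainty. The names `normalise`, `fuse`, `structured` and `series` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[Optional[float]]]


def normalise(dm: Matrix) -> Matrix:
    """Scale the known entries of one distance matrix into [0, 1] by its maximum."""
    known = [d for row in dm for d in row if d is not None]
    top = max(known, default=0.0)
    if top == 0.0:
        return [[None if d is None else 0.0 for d in row] for row in dm]
    return [[None if d is None else d / top for d in row] for row in dm]


def fuse(dms: Sequence[Matrix], weights: Sequence[float]) -> Tuple[Matrix, List[List[float]]]:
    """Weighted average of per-type distances (an assumed aggregation rule).

    None marks a partial distance that could not be computed. Returns
    (fused, missing), where missing[i][j] is the fraction of the total weight
    whose data type had no distance for the pair (i, j).
    """
    n = len(dms[0])
    scaled = [normalise(dm) for dm in dms]
    total = float(sum(weights))
    fused: Matrix = [[None] * n for _ in range(n)]
    missing = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            num = 0.0
            den = 0.0
            for dm, w in zip(scaled, weights):
                if dm[i][j] is not None:
                    num += w * dm[i][j]
                    den += w
            fused[i][j] = num / den if den > 0 else None
            missing[i][j] = 1.0 - den / total
    return fused, missing


# Toy data: three objects, two data types, one structured distance missing.
structured: Matrix = [[0, 2, 4], [2, 0, None], [4, None, 0]]
series: Matrix = [[0, 1, 10], [1, 0, 5], [10, 5, 0]]
fused, missing = fuse([structured, series], weights=[1.0, 1.0])
```

For the pair (1, 2) only the time-series distance is available, so the fused value falls back to that single scaled distance and `missing[1][2]` records that half of the weight was absent. A clustering algorithm that accepts a precomputed distance matrix could then be run on `fused`.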

Funded by: King Abdulaziz University, Faculty of Computing and Information Technology, Jeddah, Saudi Arabia.

Notes

  1. Contact the authors for more information about the data dictionary.

Acknowledgments

We acknowledge support from grant number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council to provide economic, scientific and social researchers and business analysts with secure data services.

Author information

Corresponding author

Correspondence to Aalaa Mojahed.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Mojahed, A., Bettencourt-Silva, J.H., Wang, W., de la Iglesia, B. (2015). Applying Clustering Analysis to Heterogeneous Data Using Similarity Matrix Fusion (SMF). In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_17

  • DOI: https://doi.org/10.1007/978-3-319-21024-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21023-0

  • Online ISBN: 978-3-319-21024-7

  • eBook Packages: Computer Science, Computer Science (R0)
