Abstract
Diffusion through graphs can be used to model many real-world processes, such as the spread of diseases, social network memes, computer viruses, or water contaminants. Often, a real-world diffusion cannot be directly observed while it is occurring: perhaps it is not noticed until some time has passed, continuous monitoring is too costly, or privacy concerns limit data access. This leads to the need to reconstruct, from partial diffusion data, how the present state of the diffusion came to be. Here, we tackle the problem of reconstructing a diffusion history from one or more snapshots of the diffusion state. This ability can be invaluable for learning when certain computer nodes were infected or which people were the initial disease spreaders, knowledge that helps control future diffusions. We formulate this problem over discrete-time SEIRS-type diffusion models in terms of maximum likelihood. We design methods that are based on submodularity and a novel Prize Collecting Dominating Set Vertex Cover relaxation and that can identify likely diffusion steps with some provable performance guarantees. Our methods are the first to be able to reconstruct complete diffusion histories accurately in real and simulated situations. As a special case, they can also identify the initial spreaders better than the existing methods for that problem. Our results for both meme and contaminant diffusion show that the partial diffusion data problem can be overcome with proper modeling and methods, and that hidden temporal characteristics of diffusion can be predicted from limited data.
References
Agresti A (2002) Categorical data analysis, Wiley series in probability and statistics, 2nd edn. Wiley-Interscience, New Jersey
Ostfeld A et al (2008) The battle of the water sensor networks (BWSN): a design challenge for engineers and algorithms. J Water Resour Plan Manag 134(6):556–568
Barabási A-L, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech: Theory Exp (10):P10008+
Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell 26(9):1124–1137
Buchbinder N, Feldman M, Naor JS, Schwartz R (2012) A tight linear time (1/2)-approximation for unconstrained submodular maximization. In: 2012 IEEE 53rd annual symposium on foundations of computer science, IEEE, pp 649–658
Erdős P, Rényi A (1960) On the evolution of random graphs. In: Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pp 17–61
Feige U (1998) A threshold of ln n for approximating set cover. J ACM 45(4):634–652
Feige U, Goemans M (1995) Approximating the value of two power proof systems, with applications to MAX 2SAT and MAX DICUT. In: Theory of computing and systems, 1995. Proceedings of the third Israel symposium, pp 182–189
Feige U, Mirrokni VS, Vondrak J (2007) Maximizing non-monotone submodular functions. In: Proceedings of the 48th annual IEEE symposium on foundations of computer science. FOCS ’07. IEEE Computer Society, Washington, DC, USA, pp 461–471
Gomez-Rodriguez M, Leskovec J, Schölkopf B (2013) Structure and dynamics of information pathways in online media. WSDM ’13. ACM, New York, pp 23–32
Gupta A, Roth A, Schoenebeck G, Talwar K (2010) Constrained non-monotone submodular maximization: offline and secretary algorithms. CoRR, abs/1003.1517
Hethcote HW (2000) The mathematics of infectious diseases. SIAM Rev 42(4):599–653
Hochbaum DS (2000) Instant recognition of polynomial time solvability, half integrality and 2-approximations. In: APPROX ’00. Springer, Berlin, pp 2–14
Holme P (2013) Epidemiologically optimal static networks from temporal network data. PLoS Comput Biol 9(7):e1003142
IBM ILOG CPLEX Optimizer (2010) http://www.ilog.com/products/cplex/
Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 137–146
Kolmogorov V, Zabih R (2004) What energy functions can be minimized via graph cuts? IEEE Trans Pattern Anal Mach Intell 26:65–81
Lappas T, Terzi E, Gunopulos D, Mannila H (2010) Finding effectors in social networks. In: KDD ’10. ACM Press, New York
Lee J, Mirrokni VS, Nagarajan V, Sviridenko M (2009) Non-monotone submodular maximization under matroid and knapsack constraints. In: Proceedings of the forty-first annual ACM symposium on theory of computing. ACM, New York, pp 323–332
Leskovec J, Adamic LA, Huberman BA (2007) The dynamics of viral marketing. ACM Trans Web 1(1):5
Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. KDD ’05. ACM, New York, pp 177–187
Prakash BA, Chakrabarti D, Faloutsos M, Valler N, Faloutsos C (2011) Threshold conditions for arbitrary cascade models on arbitrary networks. In: Proceedings of the 2011 IEEE 11th international conference on data mining. IEEE Computer Society, Washington, DC, pp 537–546
Prakash BA, Vreeken J, Faloutsos C (2012) Spotting culprits in epidemics: How many and which ones? In: ICDM, pp 11–20
Rossman L (1999) The EPANET programmer’s toolkit for analysis of water distribution systems. In: WRPMD ’99, pp 1–10
Salathé M, Kazandjieva M, Lee JW, Levis P, Feldman MW, Jones JH (2010) A high-resolution human contact network for infectious disease transmission. Proc Natl Acad Sci USA 107(51):22020–22025
Schrijver A (2003) Combinatorial optimization—polyhedra and efficiency. Springer, Berlin
Sefer E, Kingsford C (2011) Metric labeling and semi-metric embedding for protein annotation prediction. In: Research in computational molecular biology. Springer, Berlin, pp 392–407
Sefer E, Kingsford C (2014) Diffusion archaeology for diffusion progression history reconstruction. In: Data mining (ICDM), 2014 IEEE international conference on, pp 530–539
Sefer E, Kingsford C (2015) Convex risk minimization to infer networks from probabilistic diffusion data at multiple scales. In: Data engineering (ICDE), 2015 IEEE 31st international conference on, pp 663–674
Serazzi G, Zanero S (2003) Computer virus propagation models. In: Tutorials of the 11th IEEE/ACM international symposium on modeling, analysis and simulation of computer and telecommunications systems (MASCOTS03). Springer, Berlin
Shah D, Zaman T (2011) Finding rumor sources on random graphs. arXiv:1110.6230
Wolsey LA, Nemhauser GL (1999) Integer and combinatorial optimization. Wiley-Interscience, New Jersey
Yang J, Leskovec J (2011) Patterns of temporal variation in online media. In: ACM international conference on web search and data mining (WSDM), pp 177–186
Zhang Y-Q, Li X, Liang D, Cui J (2015) Characterizing bursts of aggregate pairs with individual Poissonian activity and preferential mobility. IEEE Commun Lett 19(7):1225–1228
Zhu K, Ying L (2015) Source localization in networks: trees and beyond. arXiv preprint arXiv:1510.01814
Acknowledgments
We thank anonymous reviewers for their very useful comments and suggestions. This work has been partially funded by the US National Science Foundation (CCF-1256087, CCF-1319998) and US National Institutes of Health (R21HG006913 and R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow.
Additional information
A preliminary version of this paper appeared in ICDM 2014 [29].
Cite this article
Sefer, E., Kingsford, C. Diffusion archeology for diffusion progression history reconstruction. Knowl Inf Syst 49, 403–427 (2016). https://doi.org/10.1007/s10115-015-0904-x
DOI: https://doi.org/10.1007/s10115-015-0904-x