Skip to main content
Log in

Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark

  • Part 1. Special issue “High Performance Data Intensive Computing” Editors: V. V. Voevodin, A. S. Simonov, and A. V. Lapin
  • Published:
Lobachevskii Journal of Mathematics Aims and scope Submit manuscript

Abstract

The graph anomaly detection problem occurs in many application areas and can be solved by spotting outliers in unstructured collections of multi-dimensional data points, which can be obtained by graph analysis algorithms. We implement the algorithm for the small community analysis and the approximate LOF algorithm based on Locality-Sensitive Hashing, apply the algorithms to a real world graph and evaluate scalability of the algorithms. We use Apache Spark as one of the most popular Big Data frameworks.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. D. Reed and J. Dongarra, “Exascale computing and big data: the next frontier,” Commun. ACM 57 (7), 56–68 (2014).

    Article  Google Scholar 

  2. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets. HotCloud,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, pp. 10–10. https://doi.org/static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf. Accessed 2018.

    Google Scholar 

  3. J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM51, 107–113 (2010).

    Google Scholar 

  4. L. Akoglu, H. Tong, and D. Koutra, “Graph based anomaly detection and description: a survey,” Data Min. Knowl. Disc1ov 29, 626–688 (2015). https://doi.org/arxiv.org/pdf/1404.4679.pdf. Accessed 2018.

    Article  MathSciNet  Google Scholar 

  5. Z. Li, H. Xiong, and Y. Liu, “Detecting blackholes and volcanoes in directed networks,” arXiv:1005. 2179 (2010). https://doi.org/arxiv.org/pdf/1005.2179.pdf. Accessed 2018.

    Google Scholar 

  6. L. Akoglu, M. McGlohon, and C. Faloutsos, “OddBall: Spotting Anomalies in Weighted Graphs,” in Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining PAKDD’10, 2010, Part 2, pp. 410–421. https://doi.org/repository.cmu.edu/cgi/viewcontent.-gi?article=3599&context=compsci. Accessed 2018.

    Chapter  Google Scholar 

  7. M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” in Proceedings of the ACM SIGMOD 2000 International Conference on Management of Data, 2010. https://doi.org/www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf. Accessed 2018.

    Google Scholar 

  8. R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in Proceedings of the 24rd International Conference on Very Large Data Bases (Morgan Kaufmann, 1998), pp. 194–205.

    Google Scholar 

  9. D. T. Lee and C. K. Wong, “Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees,” Acta Inf. 9, 23–29 (1977).

    Article  MathSciNet  MATH  Google Scholar 

  10. H. Koga, T. Ishibashi, and T. Watanabe, “Fast agglomerative hierarchical clustering algorithm using localitysensitive hashing,” Knowledge Inf. Syst. 12, 25–53 (2007).

    Article  Google Scholar 

  11. A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 97–104. https://doi.org/homes.cs.washington.-edu/sham/papers/ml/cover_tree.pdf. Accessed 2018.

    Google Scholar 

  12. R. Weber and S. Blott, “An approximation-based data structure for similarity search,” Technical Report No. 24, ESPRIT Project HERMES No. 9141 (1997).

    Google Scholar 

  13. S. Ramaswamy and K. Rose, “Adaptive cluster-distance bounding for nearest neighbor search in image databases,” IEEE Int. Conf. Image Process. 6, 381–384 (2007). https://doi.org/citeseerx.ist.psu.edu/viewdoc/-download?doi=10.1.1.80.6562&rep=rep1&type=pdf.

    Google Scholar 

  14. Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: efficient indexing for high-dimensional similarity search,” in Proceedings of the VLDB Conference, 2007, pp. 950–961. https://doi.org/www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf.

    Google Scholar 

  15. T. S. Teixeira, G. Teodoro, E. Valle, and J. H. Saltz, “Scalable locality-sensitive hashing for similarity search in high-dimensional, large-scale multimedia datasets,” arXiv:1310. 4136 (2013); http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdfhttps://doi.org/arxiv.org/pdf/1310.4136.pdf.

    Google Scholar 

  16. Z. Yang, W. T. Ooi, and Q. Sun, “Hierarchical, non-uniform locality sensitive hashing and its application to video identification,” in Proceedigns of the IEEE International Conference on Multimedia and Expo ICME, IEEE Cat. No. 04TH8763 (2004), Vol. 1, pp. 743–746. https://doi.org/www.comp.nus.edu.sg/ooiwt/papers/lsh-icme04-final.pdf. Accessed 2018.

    Google Scholar 

  17. V. Stegailov, N. Orekhov, and G. Smirnov, “HPC hardware efficiency for quantum and classical molecular dynamics,” in Proceedigns of the International Conference on Parallel Computing Technologies (Springer, 2015).

    Google Scholar 

  18. G. Smirnov and V. Stegailov, “Efficiency of classical molecular dynamics algorithms on supercomputers,” Math. Models Comput. Simul. 8, 734–743 (2016).

    Article  MathSciNet  Google Scholar 

  19. M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, and M. Zaharia, “Spark sql: Relational data processing in spark,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394. https://doi.org/amplab.cs.berkeley.edu/wpcontent/uploads/2015/03/SparkSQLSigmod2015.pdf. Accessed 2018.

    Google Scholar 

  20. A. Agarkov, T. Ismagilov, D. Makagon, A. Semenov, and A. Simonov, “Performance evaluation of the Angara interconnect,” in Proceedings of the International Conference Russian Supercomputing Days, 2016, pp. 626–639. https://doi.org/www.dislab.org/docs/rsd2016-angara-bench.pdf. Accessed 2018.

    Google Scholar 

  21. P. Erdős and A. Rényi, “On random graphs,” Publ. Math. Debrecen 6, 290–297 (1959). https://doi.org/snap.stanford.edu/class/cs224w-readings/erdos59random.pdf. Accessed 2018.

    MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to A. Semenov.

Additional information

(Submitted by V. V. Voevodin)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Semenov, A., Mazeev, A., Doropheev, D. et al. Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark. Lobachevskii J Math 39, 1262–1269 (2018). https://doi.org/10.1134/S1995080218090184

Download citation

  • Received:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1134/S1995080218090184

Keywords and phrases

Navigation