Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark

Semenov, A.; Mazeev, A.; Doropheev, D.; Yusubaliev, T.

doi:10.1134/S1995080218090184

Part 1. Special issue “High Performance Data Intensive Computing” Editors: V. V. Voevodin, A. S. Simonov, and A. V. Lapin
Published: 08 January 2019

Volume 39, pages 1262–1269 (2018)
Lobachevskii Journal of Mathematics

Abstract

The graph anomaly detection problem occurs in many application areas and can be solved by spotting outliers in unstructured collections of multi-dimensional data points, which can be obtained by graph analysis algorithms. We implement an algorithm for small community analysis and an approximate LOF algorithm based on Locality-Sensitive Hashing, apply both algorithms to a real-world graph, and evaluate their scalability. We use Apache Spark, one of the most popular Big Data frameworks.
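
As a rough illustration of the outlier-spotting step, the following is a minimal single-machine sketch of the exact LOF (Local Outlier Factor) score in plain Python. It is not the authors' Spark implementation: the function names `knn` and `lof_scores`, the toy data, and the brute-force neighbour search are all assumptions made for illustration. An LSH-based approximation, as named in the abstract, would presumably replace the exhaustive search in `knn` with a search restricted to candidates that share a hash bucket.

```python
import math


def knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs.

    Brute force over all other points; ties are broken by index, so exactly
    k neighbours are returned (a simplification of the original definition,
    which keeps all points tied at the k-distance).
    """
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points))
        if j != i
    )
    return dists[:k]


def lof_scores(points, k):
    """Compute a LOF score per point; values well above 1 suggest an outlier.

    Assumes no duplicate points, so reachability distances are never all zero.
    """
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance
    # from the point to its neighbours.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i])
        for i in range(n)
    ]


# Hypothetical toy data: five points in a tight cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(points, k=2)
outlier_index = max(range(len(points)), key=lambda i: scores[i])
```

In this toy data the distant point `(10, 10)` receives by far the largest score, while the clustered points score close to 1. The brute-force search is quadratic in the number of points, which is the cost an approximate neighbour search is meant to avoid.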

Author information

Authors and Affiliations

Scientific Research Centre for Electronic Computer Technology (NICEVT) JSC, Varshavskoe sh. 125, Moscow, 117587, Russia
A. Semenov & A. Mazeev
Moscow Institute of Physics and Technology (State University), Institutskii per. 9, Dolgoprudny, Moscow oblast, 141701, Russia
D. Doropheev
Quality Software Solutions Ltd., Moscow, Russia
T. Yusubaliev

Corresponding author

Correspondence to A. Semenov.

Additional information

(Submitted by V. V. Voevodin)

About this article

Cite this article

Semenov, A., Mazeev, A., Doropheev, D. et al. Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark. Lobachevskii J Math 39, 1262–1269 (2018). https://doi.org/10.1134/S1995080218090184

Received: 28 June 2018
Published: 08 January 2019
Issue Date: November 2018
DOI: https://doi.org/10.1134/S1995080218090184
