research-article

Using VDMS to index and search 100M images

Published: 01 July 2021

Abstract

Data scientists spend most of their time on data preparation rather than on what they know best: building machine learning models and algorithms to solve previously unsolvable problems. In this paper, we describe the Visual Data Management System (VDMS) and demonstrate how it simplifies data preparation and improves efficiency by virtue of being a system designed for the job. To demonstrate this, we use one of the largest publicly available datasets (YFCC100M), with 100 million images and videos plus additional data, including machine-generated tags, for a total of about 12TB of data. VDMS differs from existing data management systems in its focus on supporting machine learning and data analytics pipelines that rely on images, videos, and feature vectors, treating these as first-class citizens. We demonstrate that VDMS outperforms well-known and widely used data management systems by up to ~364x, with an average improvement of about 85x for our use cases, particularly at scale, for an image search engine implementation. At the same time, VDMS simplifies data preparation and data access, and provides functionality that does not exist in the alternatives.

Published in

Proceedings of the VLDB Endowment, Volume 14, Issue 12
July 2021
587 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
