Abstract
Data scientists spend most of their time on data preparation rather than on what they know best: building machine learning models and algorithms to solve previously unsolvable problems. In this paper, we describe the Visual Data Management System (VDMS) and demonstrate how it simplifies the data preparation process and improves efficiency because it is designed for this job. To demonstrate this, we use one of the largest available public datasets (YFCC100M), with 100 million images and videos, plus additional data including machine-generated tags, for a total of about 12TB of data. VDMS differs from existing data management systems in its focus on supporting machine learning and data analytics pipelines that rely on images, videos, and feature vectors, treating these as first-class citizens. For an image search engine implementation, we show that VDMS outperforms well-known and widely used data management systems by up to about 364x, with an average improvement of about 85x across our use cases, and that the gains are most pronounced at scale. At the same time, VDMS simplifies data preparation and data access, and provides functionality that the alternatives lack.
Index Terms
- Using VDMS to index and search 100M images