Using VDMS to index and search 100M images

Authors:
Luis Remis (ApertureData), Chaunté W. Lacewell (Intel Labs)

Proceedings of the VLDB Endowment, Volume 14, Issue 12, pp. 3240–3252. https://doi.org/10.14778/3476311.3476381

Published: 01 July 2021

Abstract

Data scientists spend most of their time on data preparation rather than on what they know best: building machine learning models and algorithms to solve previously unsolvable problems. In this paper, we describe the Visual Data Management System (VDMS) and demonstrate how it simplifies data preparation and improves efficiency by virtue of being a system designed for the job. To demonstrate this, we use one of the largest publicly available datasets (YFCC100M), with 100 million images and videos, plus additional data including machine-generated tags, for a total of about 12TB of data. VDMS differs from existing data management systems in its focus on supporting machine learning and data analytics pipelines that rely on images, videos, and feature vectors, treating these as first-class citizens. We demonstrate that VDMS outperforms well-known and widely used data management systems by up to ~364x, with an average improvement of about 85x for our use cases, particularly at scale, for an image search engine implementation. At the same time, VDMS simplifies data preparation and data access, and provides functionality not available in the alternatives.

Index Terms

1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment, Volume 14, Issue 12
July 2021
587 pages
ISSN: 2150-8097
Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)
Publisher: VLDB Endowment
Qualifiers: research-article