Artifacts Available / v1.1

On Efficient Approximate Queries over Machine Learning Models

Published: 01 December 2022

Abstract

Answering queries over ML predictions has been gaining attention in the database community. The problem is challenging because finding high-quality answers by invoking an oracle, such as a human expert or an expensive deep neural network model, on every single item in the DB and then applying the query can be prohibitively expensive. We develop a novel unified framework for approximate query answering that leverages a proxy to minimize the oracle usage needed to find high-quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework judiciously combines invoking the expensive oracle on data samples with applying the cheap proxy to the DB objects. It relies on two assumptions. Under the Proxy Quality assumption, we develop two algorithms: PQA, which efficiently finds high-quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC, which efficiently returns high-quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets with both query types, PT and RT, demonstrate that our algorithms outperform the state of the art and achieve high result quality with provable statistical guarantees.
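
To make the two query types concrete, the following is a minimal sketch of what a precision target and a recall target mean for an answer set selected by thresholding proxy scores. It is not the paper's PQA, PQE, CSC, or CSE algorithm. The names (`select_by_proxy`, `meets_target`), the toy scores, and the convention that an empty answer has precision 1.0 are all assumptions made for illustration. The sketch also reads every oracle label, which is exactly the cost the framework aims to avoid.

```python
from typing import Dict, Set


def select_by_proxy(proxy_scores: Dict[int, float], threshold: float) -> Set[int]:
    """Return ids whose (hypothetical) proxy score is at least `threshold`."""
    return {obj for obj, score in proxy_scores.items() if score >= threshold}


def precision(answer: Set[int], truth: Set[int]) -> float:
    """Fraction of returned objects that the oracle labels positive."""
    if not answer:
        return 1.0  # assumed convention: an empty answer has no false positives
    return len(answer & truth) / len(answer)


def recall(answer: Set[int], truth: Set[int]) -> float:
    """Fraction of oracle-positive objects that were returned."""
    if not truth:
        return 1.0  # assumed convention: nothing to retrieve
    return len(answer & truth) / len(truth)


def meets_target(answer: Set[int], truth: Set[int], kind: str, target: float) -> bool:
    """Check a Precision-Target ("PT") or Recall-Target ("RT") requirement."""
    if kind == "PT":
        return precision(answer, truth) >= target
    if kind == "RT":
        return recall(answer, truth) >= target
    raise ValueError("kind must be 'PT' or 'RT'")


# Toy data (made up): proxy scores for five objects and the oracle's positives.
proxy_scores = {0: 0.9, 1: 0.8, 2: 0.4, 3: 0.3, 4: 0.1}
oracle_positives = {0, 1, 3}

strict = select_by_proxy(proxy_scores, 0.5)    # high threshold favors precision
loose = select_by_proxy(proxy_scores, 0.25)    # low threshold favors recall

print(precision(strict, oracle_positives), recall(strict, oracle_positives))
print(precision(loose, oracle_positives), recall(loose, oracle_positives))
```

In this toy setting the stricter threshold satisfies a PT query with target 0.9 but misses one positive, while the looser threshold satisfies an RT query with target 0.9 at the cost of one false positive.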
