DOI: 10.1145/3600006.3613154

Halfmoon: Log-Optimal Fault-Tolerant Stateful Serverless Computing

Published: 23 October 2023

ABSTRACT

Serverless computing separates function execution from state management. Simple retry-based fault tolerance can corrupt the shared state with duplicate updates. Existing solutions employ log-based fault tolerance to achieve exactly-once semantics, where every read or write to the external state is associated with a log record for deterministic replay. However, logging is not free: it introduces considerable overhead to stateful serverless applications.
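The duplicate-update problem and the log-based remedy described above can be illustrated with a minimal sketch. It is a generic illustration only, not Halfmoon's or Boki's protocol: the `Store` class, the `deposit_*` functions, the `step_id` key, and the in-memory `log` dictionary are all hypothetical names, and the sketch assumes that a state update and its log record are applied atomically.

```python
class Store:
    """Hypothetical external key-value state shared by functions."""

    def __init__(self):
        self.data = {"balance": 0}


class Crash(Exception):
    """Simulates a function instance failing after its write was applied."""


def deposit_naive(store, amount, crash):
    # The write reaches the store, then the function dies before it can
    # report success, so the platform will retry the whole function.
    store.data["balance"] += amount
    if crash:
        raise Crash()


def deposit_logged(store, log, step_id, amount, crash):
    # Assumption: the update and its log record are applied atomically.
    # On a retry, the record for step_id already exists, so the write is
    # skipped and the logged result is replayed instead.
    if step_id not in log:
        store.data["balance"] += amount
        log[step_id] = store.data["balance"]
    if crash:
        raise Crash()
    return log[step_id]


def run_with_retry(fn):
    """Run fn once with a simulated crash, then retry without one."""
    try:
        return fn(True)
    except Crash:
        return fn(False)


if __name__ == "__main__":
    naive = Store()
    run_with_retry(lambda crash: deposit_naive(naive, 10, crash))
    print(naive.data["balance"])   # 20: the retry applied the deposit twice

    logged, log = Store(), {}
    run_with_retry(lambda crash: deposit_logged(logged, log, "req1/step0", 10, crash))
    print(logged.data["balance"])  # 10: the logged write is applied exactly once
```

In this sketch every write pays for one log record. That per-operation cost is the kind of overhead the abstract refers to.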

We present Halfmoon, a serverless runtime system for fault-tolerant stateful serverless computing. Our key insight is that it is unnecessary to log both reads and writes symmetrically. Instead, it suffices to log either reads or writes, i.e., asymmetrically. We design two logging protocols that enforce exactly-once semantics while providing log-free reads and log-free writes, which are suitable for read-intensive and write-intensive workloads, respectively. We theoretically prove that the two protocols are log-optimal, i.e., no other protocol can achieve lower logging overhead. We provide a criterion for choosing the right protocol for a given workload, and a pauseless switching mechanism to switch protocols for dynamic workloads. We implement a prototype of Halfmoon. Experiments show that Halfmoon achieves 20%–40% lower latency and 1.5–4.0× lower logging overhead than the state-of-the-art solution Boki.

Published in: SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles
October 2023, 802 pages
ISBN: 9798400702297
Proceedings DOI: 10.1145/3600006

Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article

Acceptance rates: SOSP '23 paper acceptance rate 43 of 232 submissions, 19%. Overall acceptance rate 131 of 716 submissions, 18%.
