ABSTRACT
Researchers constantly need reliable data to develop and evaluate AI/ML methods for networks and cybersecurity. While Internet measurements can provide realistic data, such datasets lack ground truth about application flows. We present a dataset of approximately 750 GB comprising approximately 2000 systematically conducted experiments and their resulting packet captures, covering video streaming, video teleconferencing, and cloud-based document editing applications. This curated and labeled dataset contains bidirectional, encrypted traffic with complete ground truth, and it can be widely used to assess and evaluate AI/ML algorithms.
Index Terms
- The DARPA SEARCHLIGHT Dataset of Application Network Traffic