DOI: 10.1145/3589335.3651513
short-paper
Open Access

GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method

Published: 13 May 2024

ABSTRACT

As natural language models like ChatGPT become increasingly prevalent in applications and services, the need for robust and accurate methods to detect their output is of paramount importance. In this paper, we present GPT Reddit Dataset (GRiD), a novel Generative Pretrained Transformer (GPT)-generated text detection dataset designed to assess the performance of detection models in identifying generated responses from ChatGPT. The dataset consists of a diverse collection of context-prompt pairs based on Reddit, with human-generated and ChatGPT-generated responses. We provide an analysis of the dataset's characteristics, including linguistic diversity, context complexity, and response quality. To showcase the dataset's utility, we benchmark several detection methods on it, demonstrating their efficacy in distinguishing between human and ChatGPT-generated responses. This dataset serves as a resource for evaluating and advancing detection techniques in the context of ChatGPT and contributes to the ongoing efforts to ensure responsible and trustworthy AI-driven communication on the internet. Finally, we propose GpTen, a novel tensor-based GPT text detection method that is semi-supervised in nature since it only has access to human-generated text and performs on par with fully-supervised baselines.
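The setting described in the abstract, in which the detector is fitted on human-generated text only, can be illustrated with a generic one-class sketch. This is not GpTen: the abstract does not specify the tensor construction or decomposition, so the sketch below swaps in a plain matrix SVD as a simplified stand-in for a tensor decomposition, and it uses synthetic feature vectors rather than GRiD data. The function names `fit_subspace` and `anomaly_score`, the rank, and the 99% threshold are all assumptions made for illustration.

```python
import numpy as np

def fit_subspace(X_human, rank):
    """Learn a rank-`rank` basis from human-only feature vectors (one row per text)."""
    mean = X_human.mean(axis=0)
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(X_human - mean, full_matrices=False)
    return mean, vt[:rank]

def anomaly_score(X, mean, basis):
    """Reconstruction error of each row after projecting onto the learned basis."""
    centered = X - mean
    reconstructed = centered @ basis.T @ basis
    return np.linalg.norm(centered - reconstructed, axis=1)

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): "human" vectors lie near a 2-D subspace
# of a 10-D feature space; "other" vectors do not share that structure.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
human = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
other = rng.normal(size=(50, 10))

# Fit on human-only training rows; no "other" examples are seen during fitting.
mean, basis = fit_subspace(human[:150], rank=2)

# Threshold chosen from the training scores alone (hypothetical 99% quantile).
threshold = np.quantile(anomaly_score(human[:150], mean, basis), 0.99)

held_out_human = anomaly_score(human[150:], mean, basis)
other_scores = anomaly_score(other, mean, basis)
flagged = other_scores > threshold
```

Texts whose reconstruction error exceeds the threshold are flagged as unlike the human training data. With real text, the rows would come from some feature extraction step, which is outside the scope of this sketch.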

Supplemental Material

shp7122.mp4: supplemental video (MP4, 5.3 MB)

Published in

WWW '24: Companion Proceedings of the ACM on Web Conference 2024
May 2024
1928 pages
ISBN: 9798400701726
DOI: 10.1145/3589335

Copyright © 2024 Owner/Author

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%