short-paper

Open Access

GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method

Authors:
Zubair Qazi, University of California Riverside, Riverside, CA, USA (ORCID: 0009-0005-2009-8388)
William Shiao, University of California, Riverside, Riverside, CA, USA (ORCID: 0000-0001-5813-2266)
Evangelos E. Papalexakis, University of California Riverside, Riverside, CA, USA (ORCID: 0000-0002-3411-8483)

WWW '24: Companion Proceedings of the ACM on Web Conference 2024, May 2024, Pages 842–846
https://doi.org/10.1145/3589335.3651513

Published: 13 May 2024

ABSTRACT

As natural language models like ChatGPT become increasingly prevalent in applications and services, robust and accurate methods to detect their output are of paramount importance. In this paper, we present the GPT Reddit Dataset (GRiD), a novel Generative Pretrained Transformer (GPT)-generated text detection dataset designed to assess how well detection models identify responses generated by ChatGPT. The dataset consists of a diverse collection of context-prompt pairs based on Reddit, with human-generated and ChatGPT-generated responses. We provide an analysis of the dataset's characteristics, including linguistic diversity, context complexity, and response quality. To showcase the dataset's utility, we benchmark several detection methods on it, demonstrating their efficacy in distinguishing between human-generated and ChatGPT-generated responses. This dataset serves as a resource for evaluating and advancing detection techniques in the context of ChatGPT and contributes to the ongoing efforts to ensure responsible and trustworthy AI-driven communication on the internet. Finally, we propose GpTen, a novel tensor-based GPT text detection method that is semi-supervised in nature, since it has access only to human-generated text, and that performs on par with fully-supervised baselines.
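
The abstract does not describe GpTen's internals, so the following is only an illustrative sketch of the general idea it names: fit a low-rank model to human-written text alone, then flag text that the model reconstructs poorly. Everything here is an assumption made for illustration rather than the authors' method: the per-document word co-occurrence slices, the function names, the window size, and the rank. In place of a full tensor decomposition, the sketch uses a truncated SVD of the document-mode unfolding of a (documents x vocabulary x vocabulary) tensor.

```python
import numpy as np


def cooccurrence_slice(tokens, vocab, window=2):
    """Symmetric word co-occurrence counts for one document (one tensor slice)."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    ids = [index[t] for t in tokens if t in index]  # out-of-vocabulary tokens are skipped
    for pos, a in enumerate(ids):
        for b in ids[pos + 1 : pos + 1 + window]:
            counts[a, b] += 1.0
            counts[b, a] += 1.0
    return counts


def build_tensor(docs, vocab, window=2):
    """Stack per-document slices into a (documents x vocab x vocab) tensor."""
    return np.stack(
        [cooccurrence_slice(d.lower().split(), vocab, window) for d in docs]
    )


def fit_human_subspace(tensor, rank):
    """Orthonormal basis (rank x vocab^2) fit to the document-mode unfolding.

    Assumption: truncated SVD stands in for a tensor decomposition here.
    """
    unfolded = tensor.reshape(tensor.shape[0], -1)
    _, _, vt = np.linalg.svd(unfolded, full_matrices=False)
    return vt[:rank]


def anomaly_scores(tensor, basis):
    """Relative reconstruction error per document, between 0 and 1.

    Higher means the document is explained less well by the human-text basis.
    """
    unfolded = tensor.reshape(tensor.shape[0], -1)
    projected = unfolded @ basis.T @ basis
    residual = np.linalg.norm(unfolded - projected, axis=1)
    norm = np.linalg.norm(unfolded, axis=1)
    return residual / np.maximum(norm, 1e-12)


# Toy usage: only human-written documents are used for fitting.
human_docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat slept on the mat",
]
vocab = sorted({w for d in human_docs for w in d.lower().split()})
basis = fit_human_subspace(build_tensor(human_docs, vocab), rank=2)

new_docs = ["the cat sat on the rug", "mat the on rug a dog"]
print(anomaly_scores(build_tensor(new_docs, vocab), basis))
```

In a setting like this, a decision threshold on the score would still have to be chosen, for example from the distribution of scores on held-out human text; the sketch leaves that step out.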

Supplemental Material

shp7122.mp4: supplemental video (mp4, 5.3 MB)

Index Terms

Information systems

Published in
WWW '24: Companion Proceedings of the ACM on Web Conference 2024
May 2024
1928 pages
ISBN: 9798400701726
DOI: 10.1145/3589335
General Chairs: Tat-Seng Chua (National University of Singapore), Chong-Wah Ngo (Singapore Management University)
Proceedings Chair: Roy Ka-Wei Lee (Singapore University of Technology and Design)
Program Chairs: Ravi Kumar (Google), Hady W. Lauw (Singapore Management University)
Copyright © 2024 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 May 2024
Author Tags
benchmark dataset
gpt-text detection
out-of-distribution detection
semi-supervised
tensor decomposition
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%