An Information-Theoretic Approach for Detecting Edits in AI-Generated Text

We propose a method to determine whether a given article was written entirely by a generative language model or perhaps contains edits by a different author, possibly a human. Our process involves multiple tests for the origin of individual sentences or other pieces of text and combining these tests using a method that is sensitive to rare alternatives, i.e., non-null effects are few and scattered across the text in unknown locations. Interestingly, this method also identifies pieces of text suspected to contain edits. We demonstrate the effectiveness of the method in detecting edits through extensive evaluations using real data and provide an information-theoretic analysis of the factors affecting its success. In particular, we discuss optimality properties under a theoretical framework for text editing saying that sentences are generated mainly by the language model, except perhaps for a few sentences that might have originated via a different mechanism. Our analysis raises several interesting research questions at the intersection of information theory and data science.


Introduction

Background and Motivation
Suppose an article initially written by a generative language model (GLM) such as ChatGPT (GPT3.5) undergoes relatively minor changes. For example, a human editor adds, removes, or rephrases certain sentences, as in the example in Figure 1. This work aims to detect the presence of such edits and, as far as possible, to identify the edited parts if they exist.
Figure 1: Left: The GLM ChatGPT is sequentially prompted to generate sections of a Wikipedia-style article titled Welsh Corgi. Right: The composition of the generated text with section titles leads to a so-called GLM-written article. The human editor alters the article in some places. We are interested in detecting the presence of edits if they exist, and their locations.
As can be deduced from Figure 1, we mean "written by a GLM" in a relatively broad sense. The pre-edited article may combine a series of GLM outputs produced in response to different human-written prompts or instructions. The situation above might arise when a human editor wishes to improve the GLM text or to hide the fact that the GLM was involved in the writing process altogether.
The popularity of GLM-assisted writing tools raises interest in detecting text generated by a GLM for several reasons, e.g., to maintain trust and ethical standards in authored material or to study the limitations of GLM technology [LIZ+24, WWANB+23, CJV23, SW22]. The scenario we address is particularly challenging in this context because human edits, if present, are typically scattered across the article in locations that are unknown in advance. Furthermore, edits are associated with relatively short "text atoms" such as sentences, whereas the signal discriminating a GLM from a human in a short piece of text may be faint or nonexistent [ZHR+19, KSK+23, JHN23, Els23]. Metaphorically, we seek to detect a few needles in a haystack, uncertain how many needles there are, if any. This challenge points in the direction of rare (sparse) and weak signal detection problems in statistics [DJ04, CW14, MPL15, DJ15, Ke16, ACY19]. In this work, we propose an approach to detect human edits of mostly GLM text that is based on adapting some of these well-understood tools.
A mirror image of our detection problem is detecting machine text hidden within mostly human text. One interesting motivation for this problem is avoiding undesirable effects of training language models using machine-generated text [OMC+23, ACRL+23, BSR23, DFK24]. We believe that many insights from the analysis in this paper are also relevant to this problem.

Existing Approaches
Existing approaches for discriminating AI text from human text usually focus on detecting relatively large portions of text [ZHR+19, KSK+23, CJV23, CBZ+23], whereas we focus on detecting effects that might emerge at the sentence level. Due to the apparent rarity of the signal underlying the problem and well-understood limitations of binary detection in the presence of rare and weak features [Jin09], it seems that any approach that does not capitalize on the signal's sparsity structure is generally ineffective. For example, machine learning approaches as described in [KAAL23] and [CBZ+23] may be effective if they can learn some lower-dimensional representation of the data. However, such a representation does not exist for rare and weak signals with unknown sparsity [Jin09]. An additional disadvantage of these approaches is their typical lack of transparency, i.e., a limited ability to reason about a method's outcome and thus a limited ability of a user to take actions based on the outcome.
The problem of detecting edits is also related to the "style change" detection problem in mixed authorship studies [BCF+22, Juo08, Sta09, NSF+17, ZJ23]. This problem is considered notoriously challenging due to the weakness of the authorship signal when the mixed text pieces are short [KTS+18, BCF+22]. The problem we consider includes the additional difficulty that changes in authorship across an article, if they exist, are few.

Our Approach
Our approach involves two main steps:
(i) Testing the authorship of every sentence individually with respect to the candidate GLM.
(ii) Combining the multiple tests into a global test of significance against the null hypothesis of purely GLM text, using a method that is sensitive to sparse alternatives.
Specifically, we implement Step (i) using the log-perplexity statistic (a.k.a. negative log-likelihood, log-loss, cross-entropy loss) under a pre-trained large LM that can provide token probabilities.

Contributions and Paper Organization
We describe the method in Section 2. In Section 3, we demonstrate its effectiveness in some realistic scenarios encompassing several text domains. In Section 4, we analyze the method components using tools from information theory, discussing their optimality under a certain theoretical framework that exposes the interesting properties of the problem. Our results promote several open challenges that we discuss in Section 5.

Method Description
In this description and throughout the paper, it is useful to distinguish between two types of language models based on the output they provide.
• (predictive) Language Model (LM): provides a probability distribution over a dictionary of tokens conditioned on an input context. We typically denote such a model by $P$.
• Generative Language Model (GLM): produces sequences of tokens in response to an input context. We typically denote such a model by $G_0$.
Any GLM can be conceptually treated as a LM since it induces a distribution over tokens; we distinguish the two types because in many situations the next-token probabilities of the GLM are inaccessible to the user. For example, the ChatGPT interface [Ope22] does not provide such probabilities, as opposed to the open-source LMs Llama [TLI+23] and Falcon [AAA+23].

Testing individual sentences
We think of a sentence $S$ as a sequence $t_{1:|S|} = (t_1, \ldots, t_{|S|})$ of tokens from a finite dictionary. Given a LM $P$, the (normalized) log-perplexity (LPPT) of $S$ with respect to $P$ is

$$\mathrm{lppt}(S; P) := -\frac{1}{|S|} \sum_{i=1}^{|S|} \log P(t_i \mid t_{1:i-1}) \tag{1}$$

(in (1) and throughout we use the notation $t_{1:0} = \emptyset$). Other names for the LPPT statistic are the negative log-likelihood under $P$, the logarithmic loss with respect to $P$ [BN06], the self-information of $P$ evaluated at $S$ [MF98], and the cross-entropy with a mass distribution at the sequence $S$. We prefer the name LPPT because of the association of perplexity with natural language processing [JM23]. Throughout this paper, we use 2 as the base of the logarithm, hence $\mathrm{lppt}(S; P)$ is measured in bits per token.

An empirical result illustrated in Figure 3 says that under a specific $P$, sentences written by a particular GLM tend to have lower values of $\mathrm{lppt}(S; P)$ than sentences written by humans in a similar context. This observation justifies using the LPPT statistic to test the authorship of individual sentences, i.e., testing against the null

$$H_0(S): \text{"sentence } S \text{ was written by the GLM } G_0\text{."} \tag{2}$$

The right-hand side of Figure 3 illustrates the receiver operating characteristic (ROC) curves of the LPPT test against the null (2) under different datasets. Given a document partitioned into sentences $D = (S_1, \ldots, S_n)$, we summarize the evidence against $H_0(S_i)$ using a P-value:

$$p_i := \Pr\left[\mathrm{lppt}(S; P) \ge \mathrm{lppt}(S_i; P)\right], \qquad S \sim G_0. \tag{3}$$

The evaluation of $p_i$ requires the distribution of $\mathrm{lppt}(S; P)$ for $S \sim G_0$, represented by the red histograms in Figure 3. As it turns out, this distribution is affected by several factors, including the text's domain and the length of every sentence, as we discuss in Section 2.4. We adjust this distribution for sentence length in all P-value evaluations in this paper. The table in Figure 2 shows examples of LPPT values and the corresponding length-adjusted P-values of several sentences from the example in Figure 1.

We note that the non-trivial power of the classifier based on LPPT values observed in each case in Figure 3 suggests that, for long enough documents, it is possible to reliably separate the class of documents written entirely by the GLM from the class of non-GLM documents. Indeed, fix one of the dataset cases in Figure 3 and consider a simple model in which a document is generated by independently sampling sentences from one of the distributions represented by the histograms of this case. Use the likelihood ratio test of the LPPTs of individual sentences to classify a document as GLM or non-GLM. The error probability of this test can be made to vanish rapidly as the number of sentences increases (e.g., this follows from the Chernoff-Stein Lemma [CT06]). This note emphasizes that the problem this paper considers, namely separating the class of documents written by the GLM from the class of documents that contain mostly GLM-written text with some non-GLM edits, is much more challenging.
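The per-sentence test can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes the token probabilities $P(t_i \mid t_{1:i-1})$ have already been obtained from some LM, and it estimates the P-value in (3) empirically from a calibration sample of null LPPT values. The function names and the numbers are hypothetical.

```python
import math


def lppt(token_probs):
    """Normalized log-perplexity in bits per token.

    token_probs[i] is assumed to be the LM probability of token i given
    the preceding tokens of the sentence.
    """
    if not token_probs:
        raise ValueError("sentence must contain at least one token")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def empirical_p_value(value, null_lppts):
    """Estimate Pr[lppt(S; P) >= value] for S ~ G_0 from calibration data.

    The +1 terms keep the estimate strictly positive, so its smallest
    possible value is 1 / (1 + len(null_lppts)).
    """
    exceed = sum(1 for v in null_lppts if v >= value)
    return (1 + exceed) / (1 + len(null_lppts))


# Made-up token probabilities and made-up null LPPT values.
sentence_probs = [0.5, 0.25, 0.125]
null_sample = [1.0, 1.5, 2.0, 2.5, 3.0]

x = lppt(sentence_probs)               # (1 + 2 + 3) / 3 bits per token
p = empirical_p_value(x, null_sample)  # three null values are >= x
print(x, p)
```

Large LPPT values give small P-values, matching the direction of the test: evidence against $H_0(S)$ is an LPPT that is unusually high for a sentence of $G_0$.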

Global Testing using Higher Criticism (HC)
We combine the per-sentence P-values $p_1, \ldots, p_n$, $n = |D|$, of (3) into a single value using HC [DJ04, DJ08, DJ15]:

$$\mathrm{HC}^* := \max_{1 \le i \le n\gamma_0} \sqrt{n}\, \frac{i/n - p_{(i)}}{\sqrt{\frac{i}{n}\left(1 - \frac{i}{n}\right)}}. \tag{4}$$

Here $p_{(1)} \le \ldots \le p_{(n)}$ are the order statistics of the P-values and $\gamma_0 \in (0, 1)$ is a fixed parameter that limits the range of P-values involved. Typically, $\gamma_0$ does not affect the large-sample behavior of $\mathrm{HC}^*$ and is taken to be in the range $(0.1, 0.5)$ [DJ15]; we use $\gamma_0 = 0.25$ in this paper as this choice appears to provide good results. HC is known to be sensitive to departures in a small and unknown set of individual tests. It is therefore useful as an index of discrepancy between the two classes, with large values of $\mathrm{HC}^*$ indicating that the document was edited. This property leads to a binary classifier whose threshold (thr in Figure 2) can be calibrated, e.g., by a held-out dataset.

We can also use HC as a level-$\alpha$ test against the global null

$$H_0: \text{"The document was written entirely by } G_0\text{"} \tag{5}$$

by setting $\mathrm{thr} = \mathrm{HC}^*_{1-\alpha}$, where $\mathrm{HC}^*_{1-\alpha}$ is the $1-\alpha$ quantile of $\mathrm{HC}^*$ under $H_0$ for $\alpha \in (0, 1)$, e.g., $\alpha = 0.05$. We may estimate $\mathrm{HC}^*_{1-\alpha}$ from the data using documents from the null class, i.e., written entirely by $G_0$, if these are sufficiently available. Otherwise, we may simulate critical values under $H_0$ provided some conditions are met. Specifically, when the P-values are independent and uniformly distributed under $H_0$, the asymptotic distribution of $\mathrm{HC}^*$ under $H_0$ as $n \to \infty$ follows that of a maximum Brownian bridge, although it may be significantly stochastically smaller in finite samples [DJ04]. For this reason, it is common to simulate critical values for a test based on $\mathrm{HC}^*$ for specific sample sizes, as illustrated in Figure 4.

In practice, LPPT values of sentences are likely to be dependent since the sentences themselves are. The critical values of $\mathrm{HC}^*$ are known to be relatively unaffected when the P-values experience a form of short-term dependency [DH09]. Under other types of dependency, the test may experience a reduction in power [HJ08, HJ10]. For this reason, if possible, we recommend estimating $\mathrm{HC}^*_{1-\alpha}$ based on complete documents from the null class to improve the test's power.
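A minimal sketch of the HC statistic and of simulated critical values is given below. It assumes one common form of Higher Criticism, in which each term is standardized by $\sqrt{(i/n)(1 - i/n)}$, and it simulates the null with independent uniform P-values; the function names, the number of repetitions, and the seed are illustrative choices, not values taken from the paper.

```python
import math
import random


def hc_star(pvals, gamma0=0.25):
    """Higher Criticism maximized over the smallest gamma0 fraction of P-values."""
    n = len(pvals)
    if n < 2 or not 0 < gamma0 < 1:
        raise ValueError("need n >= 2 and 0 < gamma0 < 1")
    p_sorted = sorted(pvals)
    k = max(1, int(gamma0 * n))  # k < n, so u below never reaches 1
    best = -math.inf
    for i in range(1, k + 1):
        u = i / n
        z = math.sqrt(n) * (u - p_sorted[i - 1]) / math.sqrt(u * (1 - u))
        best = max(best, z)
    return best


def simulate_critical_value(n, alpha=0.05, reps=2000, gamma0=0.25, seed=0):
    """Empirical (1 - alpha) quantile of HC* under independent uniform P-values."""
    rng = random.Random(seed)
    sims = sorted(
        hc_star([rng.random() for _ in range(n)], gamma0) for _ in range(reps)
    )
    idx = min(reps - 1, math.ceil((1 - alpha) * reps) - 1)
    return sims[idx]


thr = simulate_critical_value(n=50)
print("simulated critical value for n = 50:", thr)
```

When enough null documents are available, replacing the simulated uniform P-values with HC values computed on those documents follows the recommendation above and accounts for dependence between sentences.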

Identifying edited sentences
When the HC test rejects $H_0$, the set

$$I^* := \left\{ i : p_i \le p_{(i^*)} \right\}, \qquad i^* := \arg\max_{1 \le i \le n\gamma_0} \sqrt{n}\, \frac{i/n - p_{(i)}}{\sqrt{\frac{i}{n}\left(1 - \frac{i}{n}\right)}}, \tag{6}$$

corresponds to the P-values affecting $\mathrm{HC}^*$ most and thus providing the strongest evidence against $H_0$. This set is known to have interesting optimality properties in the context of feature selection for binary classification [DJ09, DJ08]. We use this set to indicate sentences that we suspect were not written by the GLM; we may want to examine the authorship of these sentences manually or using other means that do not necessarily rely on the LPPT statistic. We summarize the full procedure in Algorithm 1 and illustrate it using an example text in Figure 2. Table 1 shows sentences included in $I^*$ from the article generated in the example in Figure 1.
Algorithm 1 Test whether a document $D$ was written by the language model $G_0$ or not
Input: language model $P$; document $D = (S_1, \ldots, S_n)$; survival function $\bar{F}_{G_0;P}$ of the LPPT of sentences from $G_0$ under $P$; threshold thr (e.g., $\mathrm{thr} = \mathrm{HC}^*_{1-\alpha}$)
# Step I: Testing individual sentences:
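The steps of the algorithm are not listed in full here, so the following is only a sketch assembled from the prose of this section: per-sentence P-values from a survival function, the HC statistic over the smallest $\gamma_0$ fraction of P-values, comparison to a threshold, and the set of suspected sentences when the test rejects. The Gaussian `toy_survival`, the LPPT values, and the threshold are all made up for illustration, and the HC standardization by $\sqrt{(i/n)(1-i/n)}$ is an assumed variant.

```python
import math


def detect_edits(lppts, survival, thr, gamma0=0.25):
    """Return (HC*, rejected, indices of suspected sentences)."""
    n = len(lppts)
    if n < 2:
        raise ValueError("need at least two sentences")
    # Step I: one P-value per sentence.
    pvals = [survival(x) for x in lppts]
    # Step II: Higher Criticism over the smallest gamma0 fraction of P-values.
    order = sorted(range(n), key=lambda i: pvals[i])
    k = max(1, int(gamma0 * n))
    hc, i_star = -math.inf, 1
    for rank in range(1, k + 1):
        u = rank / n
        z = math.sqrt(n) * (u - pvals[order[rank - 1]]) / math.sqrt(u * (1 - u))
        if z > hc:
            hc, i_star = z, rank
    rejected = hc > thr
    # Suspected sentences: P-values at or below the one attaining HC*.
    cutoff = pvals[order[i_star - 1]]
    suspects = sorted(i for i in range(n) if pvals[i] <= cutoff) if rejected else []
    return hc, rejected, suspects


def toy_survival(x, mu=3.0, sigma=1.0):
    """Hypothetical null survival function: Gaussian LPPT with made-up parameters."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))


# Made-up LPPT values; sentences 4 and 7 are unusually hard to predict.
lppts = [2.5, 3.1, 2.9, 3.4, 7.0, 2.7, 3.0, 6.5, 3.2, 2.8, 3.3, 2.6]
hc, rejected, suspects = detect_edits(lppts, toy_survival, thr=1.0)
print(hc, rejected, suspects)
```

In an actual application the survival function would be estimated from GLM-written calibration sentences and the threshold would be calibrated or simulated as described above.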

Adjusting the log-perplexity distribution for sentence's length
Tokens appearing later in the sentence tend to be more reliably predicted than tokens at the beginning, a phenomenon observed in [Sha51]. It follows that the average log-perplexity tends to be smaller for longer sentences, as illustrated in Figure 5. We can therefore attain better sensitivity of the perplexity test by comparing the LPPT of the $i$-th sentence $S_i$ to the distribution of the LPPT of sentences produced by $G_0$ with the same length as $S_i$. Formally, this means replacing the test (3) with

$$p_i := \Pr\left[\mathrm{lppt}(S; P) \ge \mathrm{lppt}(S_i; P) \;\big|\; |S| = |S_i|\right], \qquad S \sim G_0, \tag{7}$$

and thus the survival function $\bar{F}_{G_0;P}$ in Algorithm 1 receives two parameters: the LPPT of $S_i$ and its number of tokens $|S_i|$. In practice, we estimate $\bar{F}_{G_0;P}$ for every possible number of tokens. When the number of data points for calibration is scarce, a curve-fitting estimate is useful since $\bar{F}_{G_0;P}$ appears to vary smoothly with the number of tokens.
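The length adjustment can be sketched with a per-length empirical survival function. This is an illustrative version under simple assumptions: the calibration data are (number of tokens, LPPT) pairs from sentences of $G_0$, P-values use the same add-one empirical estimate for every length, and no curve fitting across lengths is attempted. All names and numbers are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_null_by_length(calibration):
    """Group null LPPT values by sentence length and sort each group."""
    table = defaultdict(list)
    for num_tokens, value in calibration:
        table[num_tokens].append(value)
    return {length: sorted(values) for length, values in table.items()}


def length_adjusted_p_value(value, num_tokens, table):
    """Empirical Pr[lppt >= value] among null sentences with num_tokens tokens.

    Returns None when there is no calibration data for this length.
    """
    null = table.get(num_tokens)
    if not null:
        return None
    exceed = len(null) - bisect_left(null, value)  # null values >= value
    return (1 + exceed) / (1 + len(null))


# Made-up calibration data: longer sentences tend to have lower LPPT.
calibration = [(12, 4.0), (12, 5.0), (12, 6.0), (30, 2.0), (30, 3.0), (30, 4.0)]
table = fit_null_by_length(calibration)

# The same LPPT is unremarkable for a 12-token sentence
# but extreme for a 30-token sentence.
print(length_adjusted_p_value(4.5, 12, table))
print(length_adjusted_p_value(4.5, 30, table))
```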
Another factor affecting the perplexity is the sentence's location within the document.For example, the first sentence in every paragraph appears to have higher perplexity than subsequent sentences.We leave the adjustment of our method to this factor as future work.

Unusually short and long sentences
Our experience shows that the perplexity detector is ineffective for sentences of about 10 tokens or fewer.We excluded such sentences from the process and did not evaluate their P-values.
We found it difficult to estimate the LPPT distribution of sentences of more than 50 tokens since they are very infrequent in our data.We only consider the first 50 tokens when evaluating the LPPT of such sentences.

Generalizing Step I: Testing pieces of text individually
Our method uses sentences as text atoms and considers their LPPT. Natural generalizations of this step include the consideration of other pieces of text, such as paragraphs, as well as detectors that are not necessarily based on perplexity, e.g., probability curvature [MLK+23], word frequencies [MW12], and other features [MSS23].

Generalizing Step II: Inference based on multiple testing
HC is just one approach for testing the global significance of individual tests, motivated by the rare editing model over sentences and the sensitivity of HC to rare effects. Under deviations from this model or due to other considerations, other methods from multiple comparisons and meta-analysis in statistics may be preferable [RJ+12, Ben10, Efr12]. For example, instead of HC, we may combine P-values using Fisher's method

$$F_n := -2 \sum_{i=1}^{n} \ln p_i. \tag{8}$$

$F_n$ is known to be effective in detecting many relatively frequent but potentially very faint effects [ACCP11, Kip24b]. Therefore, $F_n$ can be used when we test $H_0$ of (5) against an alternative specifying that the GLM text has gone through substantial editing.
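Fisher's combination can be sketched in a few lines. The sketch below uses the natural logarithm and the classical fact that, for independent uniform P-values, the statistic follows a chi-squared distribution with $2n$ degrees of freedom, whose survival function has a closed form for even degrees of freedom. Independence is an assumption here and need not hold for sentences of the same document; the function names are illustrative.

```python
import math


def fisher_statistic(pvals):
    """Fisher's combination statistic: -2 times the sum of natural-log P-values."""
    return -2.0 * sum(math.log(p) for p in pvals)


def fisher_p_value(pvals):
    """Survival function of chi-squared with 2n degrees of freedom at the statistic.

    For even degrees of freedom 2n:
    Pr[X >= x] = exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!.
    """
    n = len(pvals)
    half = fisher_statistic(pvals) / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Twenty individually unremarkable P-values add up to a significant global signal.
faint = [0.2] * 20
print(fisher_statistic(faint), fisher_p_value(faint))
```

The example reflects the regime described above: no single P-value is small, yet the accumulation of many faint effects is detected.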
Another alternative to inference based on HC is useful when we are interested in selecting a set of suspected edits with some control over the proportion of falsely reported edits. In this case, applying the Benjamini-Hochberg (BH) false discovery rate (FDR) controlling procedure to the P-values in (3) may be useful [BH95]. We note that the BH procedure is in general less powerful for global testing than HC. Namely, it is possible that while HC correctly finds the body of P-values significant, the BH procedure with an FDR parameter $\alpha$ reports an empty set of P-values with probability at least $1 - \alpha$, for every $\alpha \in (0, 1)$ [Jin03, Kip24b].
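For reference, the BH step-up procedure can be sketched as follows. This is the standard textbook version applied to a made-up list of P-values; the function name is illustrative.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Indices selected by the BH step-up procedure at FDR level alpha."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose ordered P-value is below alpha * rank / n.
        if pvals[idx] <= alpha * rank / n:
            k_max = rank
    return sorted(order[:k_max])


# Made-up sentence P-values: only the first two survive the procedure.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))
```

The selected indices play the role of suspected edits, with the expected proportion of false reports among them controlled at level $\alpha$ under the usual conditions of the procedure.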

Empirical Results
We conducted extensive simulations using publicly available datasets and new datasets that we created.The new datasets and the code for obtaining all the results are available in the link at the end of this paper.
We tried several publicly available LMs for the detection model $P$ in Algorithm 1, including GPT2 with 1.5B parameters [RWC+19], Falcon with 7B parameters [AAA+23], Llama with 7B parameters [TLI+23], and Phi2 with 2.7B parameters [JBA+23]. We only report results with GPT2 1.5B (a.k.a. GPT2xl) and Phi2 since these models attained the highest area under the ROC curve in the binary detection problem of individual sentences for all datasets we considered. We discuss in Section 5 the open challenge of selecting or crafting $P$ with optimal detection properties.
We experimented with data created by the GLMs GPT3-curie and GPT3.5-turbo (ChatGPT) arranged in 5 datasets as we explain in detail below.We tried to generate data using publicly available GLMs not in the GPT family, but they did not produce articles of satisfactory quality.
In the sections below we report the method's performance under different settings and data.

Power analysis using mixed machine and human sentences
We first demonstrate the method's effectiveness using a synthetic dataset of articles of mixed authorship involving GLM and non-GLM text.We generated each article by sampling sentences from the non-GLM article and inserting those into the GLM article at random locations.Since the GLM and non-GLM articles are on the same topic, the mixed article is typically coherent in content hence the situation simulates well a GLM text edited in a few locations.As raw data for mixing, we use the three datasets listed below, in which every entry has two articles under the same title, one written entirely by a GLM and one written by a human or several humans.
• Wikipedia Introductions [Aad23]. Each entry corresponds to a Wikipedia article. The dataset contains the first several sentences of the Introduction of this article as non-GLM text and text generated by GPT3-curie in response to a relevant prompt as a GLM-written article. We excluded entries from this dataset in which the length of the GLM text was less than 15 sentences, resulting in a total of 9,821 remaining entries.
• News Articles [Ana23]. Each entry contains a news article, its highlights as provided by a human annotator, and an article generated by GPT3.5-turbo from these highlights. The dataset has 20,000 entries.
• Scientific Abstracts [Nic23]. Each entry contains the abstract of a scientific research paper and text produced by GPT3.5-turbo in response to a prompt requesting a paragraph of text with similar properties. The dataset has 10,000 entries.
We divided all entries within each dataset into 10 groups of roughly equal size. For a given edit ratio and article length, we report results averaged over 10 iterations in a cross-validation fashion: in Iteration $i$, we simulated edits only in Group $i$, leaving the other groups unaffected and using them to characterize the null LPPT distribution under $G_0$ and to estimate $\mathrm{HC}^*_{1-0.05}$. We arranged the entries in each group randomly and consolidated them into articles according to the prescribed number of sentences, truncating excess sentences. Each article in Group $i$ is then modified by inserting sentences randomly sampled from the corresponding non-GLM articles according to the mixing ratio $\epsilon$. We set $\mathrm{thr}_i$, our estimate of $\mathrm{HC}^*_{1-0.05}$, based on the remaining articles: $\mathrm{thr}_i$ is the 0.95 quantile of the $\mathrm{HC}^*$ values of articles not in Group $i$. We estimate the power by the fraction of articles in Group $i$ whose $\mathrm{HC}^*$ exceeds $\mathrm{thr}_i$. We repeat this process for $i = 1, \ldots, 10$ and report the average across all groups as the power estimate.
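The power estimate described above can be sketched as follows, assuming the HC values of the unedited and edited articles in each group have already been computed. The quantile convention, the function names, and the toy values are illustrative assumptions.

```python
import math


def upper_quantile(values, q):
    """Empirical q-quantile: the ceil(q * m)-th smallest of m values."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]


def cross_validated_power(null_hc, edited_hc, alpha=0.05):
    """Average over groups of the fraction of edited articles exceeding thr_i.

    null_hc[i]   : HC* values of unedited articles in group i
    edited_hc[i] : HC* values of edited articles in group i
    thr_i is the (1 - alpha) quantile of null HC* values outside group i.
    """
    powers = []
    for i, edited in enumerate(edited_hc):
        others = [v for j, group in enumerate(null_hc) if j != i for v in group]
        thr_i = upper_quantile(others, 1 - alpha)
        powers.append(sum(v > thr_i for v in edited) / len(edited))
    return sum(powers) / len(powers)


# Made-up HC* values for three groups.
null_hc = [[k / 10 for k in range(1, 21)] for _ in range(3)]
edited_hc = [[0.5, 3.0, 3.0, 3.0], [3.0, 3.0, 3.0, 3.0], [0.5, 0.5, 3.0, 3.0]]
print(cross_validated_power(null_hc, edited_hc))
```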
The resulting power estimates are shown in Figure 6. It appears that our method has non-trivial power for an editing rate as low as 10% of the sentences, for articles as short as 50 sentences. The power generally increases with the editing rate and the length of the text, and varies between datasets and between the two detection models.

Figure 6: Estimated power at level α = 0.05 in detecting simulated edits versus the number of sentences in an article and the fraction of edited sentences. We compared two detection models (GPT2 and Phi2) across 3 datasets. The estimated power is the true positive rate averaged over 10 splits; the standard error in all cases is less than 0.01. The power generally increases with the fraction of non-GLM sentences and article length.

Realistic edits of Wikipedia-style articles by topic
We created a dataset of Wikipedia-style articles using GPT3.5-turbo by repeatedly prompting this GLM to write article sections, similarly to Figure 1. We used titles of articles from Wikipedia falling into 5 topics, roughly 200 articles per topic. Article titles were randomly selected within each topic, provided they satisfied our inclusion criterion: at least 5 sections within the article. Section titles in the prompts for generating each article are taken from the corresponding Wikipedia article. The number of sentences in each article is between 50 and 300, with an average of about 180.
We simulated realistic edits in this dataset by randomly sampling sentences from the actual Wikipedia article and inserting each under the same section name.Therefore, the resulting article has a realistic structure.Our experience shows that it is very challenging for a human to reliably determine which sentences, if any, were not written by the GLM.
Figure 7 shows the accuracy of our method in this evaluation under two different base models. Each topic's accuracy is evaluated over a randomly chosen test set containing 20% of the articles and their simulated edited versions. We use the other 80% as a training set to evaluate the per-length survival function $\bar{F}_{G_0;P}$ and to calibrate the threshold of $\mathrm{HC}^*$ that maximizes the accuracy. A standard analysis of variance shows that the performance depends significantly on the topic (in all 2 × 3 combinations of detection model and edit ratio, the F-test's P-value is smaller than $10^{-7}$).
Figure 7: Accuracy in detecting the presence of human text within GPT3.5-turbo articles, by topic. We compare the accuracy across 5 topics, 3 edit ratios, and 2 detection models (GPT2 and Phi2).

Comparing to other approaches
We compared the performance of our method to a classifier that only considers the minimal P-value $p_{(1)}$ in (3), as in Bonferroni-type inference. Additionally, we considered a classification approach that may be seen as standard in this challenge [KAAL23]: embedding articles using an LLM and using the embedded representation as features for a trained classifier such as logistic regression. We also tried several trained classification methods using word frequencies as features, but these methods were completely ineffective. We used the dataset described in Section 3.2, with accuracy averaged over 10 splits in a cross-validation fashion: each split is used as a test set and the rest as a training set. Note that the training procedure for HC and the minimal P-value (min P) involves characterizing the null LPPT distribution under $G_0$ and estimating $\mathrm{HC}^*_{1-0.05}$. We report the results of this comparison in Figure 8. This figure shows that our HC-based approach attains the best accuracy in all configurations, significantly surpassing the trained classifier.
Figure 8: Accuracy in detecting the presence of actual Wikipedia sentences planted within Wikipedia-style GPT3.5-turbo articles. Our method (HC) with detection model Phi2, inference based on the minimal P-value (min P), and a logistic regression classifier with features obtained via document embedding using OpenAI's text-embedding-3-small. We used 10 train/test splits and reported the accuracy averaged over all splits. The standard error in all configurations is smaller than 0.02.

Table 2: Detecting a few human edits in Wikipedia-style biographies written by ChatGPT with a perplexity detector based on GPT2. The table shows HC of (4) and the P-value of the HC test based on simulated values as in Figure 4. Also shown are the fraction of edited sentences in every document and its length $n$ in sentences. Values significant at level α = 0.05 are in color: blue for a true positive and red for a false positive. In all cases, Bonferroni's correction $p_{(1)} \times n$ was insignificant, mainly because the number of sentences in the empirical P-value calculations of the LPPT test (7) is small.

Manually edited articles
In Table 2 we report the results of applying our method to 18 articles that were created via the following process: we used the GLM GPT3.5-turbo to write a Wikipedia-style biography of one of the authors in the list according to a prescribed structure involving 4 sections: Early Life, Adulthood, Contributions and Achievements, and Legacy. The GLM wrote each section in response to a separate prompt, similarly to the process illustrated in Figure 1. We used the text written by the GLM and the prescribed subtitles to form a coherent article, which we denote as the pre-edited GLM article. Next, we asked a human editor to modify this article by adding, rephrasing, or removing entire sentences. We applied our method to both the edited and non-edited articles. We used sentences from additional articles created similarly to characterize $\bar{F}_{G_0;P}$ and evaluate the P-values in (7).
Table 2 shows that all edited articles have larger HC values than their non-edited versions. We also report the P-values associated with these HC values under the null of uniformly distributed P-values, obtained via simulations. We also evaluated (number of sentences) × $\min_i p_i$ in each article, which is associated with the significance of the Bonferroni correction applied to the body of sentence P-values. These values turned out to be above 0.05 in all articles. This peculiar behavior of the Bonferroni correction arises because the minimal LPPT P-values in our process are approximately $1/(1 + n_\ell)$, where $n_\ell$ is the number of GLM sentences of length $\ell$ in our calibration data. However, for most lengths, $n_\ell$ is relatively small, on the order of $10^2$. In a future analysis, we may make Bonferroni's correction more useful by increasing our calibration set or by fitting a curve to the tail of the survival function in Figure 5 and verifying the goodness of this fit. We emphasize that the discrimination reflected in Table 2 is obtained without calibrating the threshold of $\mathrm{HC}^*$ for separating the class of edited articles from the class of non-edited ones. Such calibration is expected to increase the accuracy of the procedure.
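The arithmetic behind the Bonferroni observation can be made explicit. The sketch below assumes the smallest attainable empirical P-value is exactly $1/(1 + n_\ell)$ and uses illustrative values of the article length and calibration size; the function names are hypothetical.

```python
def bonferroni_floor(n_sentences, n_calibration):
    """Smallest attainable value of (number of sentences) x (minimal P-value)."""
    return n_sentences / (1 + n_calibration)


def min_calibration_size(n_sentences, alpha=0.05):
    """Smallest calibration size for which the Bonferroni floor is at most alpha."""
    m = 0
    while bonferroni_floor(n_sentences, m) > alpha:
        m += 1
    return m


# With about 100 calibration sentences per length, a 50-sentence article
# can never reach significance at level 0.05 under Bonferroni.
print(bonferroni_floor(50, 100))
print(min_calibration_size(50, alpha=0.05))
```

Under these assumptions, the calibration set would need to grow by roughly an order of magnitude per sentence length before the Bonferroni-corrected minimum could fall below 0.05.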

Information-Theoretic Analysis
We now analyze our method under a theoretical framework of text editing and discuss some factors affecting its success.

Optimality of the Higher Criticism test
A simple mixture model for the generation of an edited document proposes that most sentences are written independently by a GLM $G_0$, except perhaps for a few sentences that are generated by a different mechanism associated with the editor, which we denote here by $G_1$. Importantly, we do not know in advance which sentences were written by each model. Let $\epsilon$ denote the expected fraction of $G_1$ sentences, and let $L_j$ be the distribution of $\mathrm{lppt}(S; P)$ under $S \sim G_j$, for $j \in \{0, 1\}$. The setting described above induces a mixture model for the log-perplexity:

$$\mathrm{lppt}(S_i; P) \sim (1 - \epsilon) L_0 + \epsilon L_1, \qquad i = 1, \ldots, n. \tag{9}$$

Likewise, we have a mixture model for the P-values in (3):

$$p_i \sim (1 - \epsilon)\, \mathrm{Unif}(0, 1) + \epsilon\, Q_i, \qquad i = 1, \ldots, n, \tag{10}$$

where $Q_i$ is a sub-uniform distribution that describes the non-null behavior of the P-values (3). The optimality of HC for mixture models of the form (9) and (10) has been studied in several contexts [DJ04, Jin03, HJ08, CW14, ACW15, MPL15, JK16, DK22, Kip24b]. In particular, when the mixture parameter is calibrated to $n$ as $\epsilon = n^{-\beta}$ for some $\beta \in (1/2, 1)$, and the effect size in $Q_i$ is moderately large, a test based on HC of $p_1, \ldots, p_n$ attains the information-theoretic limit of detection in (10) as $n \to \infty$.
Namely, in a configuration of the calibrating parameters in which there exists a test of asymptotically non-trivial power, there exists a test based on HC that is asymptotically powerful in the sense that its power tends to one while its size tends to zero. The works [DH09, HJ08, HJ10] extended the optimality properties of HC to some situations of dependent individual effects, which go beyond the model (10). One relevant conclusion from these works is that HC is relatively unaffected when the P-values experience a form of short-term dependency, as expected among sentences.

Optimality properties of the perplexity test
The justification for using the LPPT test of (3) is primarily its empirical success in separating GLM from non-GLM sentences, as shown in Figure 3. In what follows, we analyze the power of this test beyond this empirical observation.

Language model as an information source and the asymptotic perplexity
Let $P_a$ be a language model. Sampling a sentence $t_{1:n} = (t_1, \ldots, t_n)$ from $P_a$ is achieved by conditioning the current token probability on the previous tokens and an initial context. Namely,

$$t_i \sim P_a(\,\cdot \mid t_0, t_{1:i-1}), \qquad i = 1, \ldots, n, \tag{11}$$

for some initial state $t_0$ that can represent some initial context, like the text's topic, or a null value. We view $P_a$ as an information source in the sense that it defines a stationary distribution over sequences of tokens from a finite alphabet [Sha48]. When $P_a$ is ergodic, the Shannon-McMillan-Breiman theorem says that the entropy rate $H(P_a)$ is well-defined by the limit

$$H(P_a) := \lim_{n \to \infty} -\frac{1}{n} \log P_a(t_{1:n} \mid t_0), \qquad t_{1:n} \sim P_a, \tag{12}$$

which is independent of $t_0$ [AC88]. In the absence of ergodicity, the limit (12) may still exist but it generally depends on the initial state [KSS77]. We note that Shannon's source coding theory proposes an alternative definition of the entropy rate: the minimum number of expected bits per token needed to represent $t_{1:n}$.

More generally, suppose that we evaluate the LPPT with respect to another stationary probability law $P_b$ defined over the same alphabet as $P_a$. Under some conditions on the laws $(P_a, P_b)$, the limit of $\mathrm{lppt}(t_{1:n}; P_b)$ as $n \to \infty$ exists almost surely and obeys

$$\lim_{n \to \infty} \mathrm{lppt}(t_{1:n}; P_b) = H(P_b; P_a), \qquad t_{1:n} \sim P_a. \tag{13}$$

The term $H(P_b; P_a)$ is called the cross-entropy rate of $P_b$ under the law $P_a$. Relation (13) appears to provide an interesting insight about the LM $P^*$ that maximizes the power of the LPPT for testing $H_0(S)$ of (2) versus a simple alternative

$$H_1(S): \text{"sentence } S \text{ was written by } G_1\text{"} \tag{14}$$

for some information source $G_1$ that represents the effect of editing the sentence $S$. Suppose that $G_0$ and $G_1$ are fixed, i.e., determined by the problem's nature. We seek a base model $P$ for the perplexity detector that maximizes the power of the perplexity test in (3). The companion article [Kip24a] shows that under some reasonable assumptions, the ideal base model $P$ maximizes

$$\Delta(G_1, G_0; P) := D(G_1 \,\|\, P) - D(G_0 \,\|\, P), \tag{15}$$

where $D$ indicates the relative entropy rate of information sources [Gra11]. We discuss possible implications of (15) in Section 5.2 below.
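A small numerical illustration of these quantities follows. It makes a strong simplifying assumption that is not part of the analysis above: the sources are memoryless (i.i.d. tokens), so that entropy rates, cross-entropy rates, and relative entropy rates reduce to their single-letter versions. It also takes the criterion to be the difference $D(G_1 \| P) - D(G_0 \| P)$, which is how the phrase "difference in relative entropy" is read here. All distributions are made up.

```python
import math


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def cross_entropy(p_true, p_model):
    """Expected bits per token when text from p_true is scored by p_model."""
    return -sum(a * math.log2(b) for a, b in zip(p_true, p_model) if a > 0)


def kl(p, q):
    """Relative entropy D(p || q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def delta(g1, g0, p):
    """Difference in relative entropy between the alternative and the null."""
    return kl(g1, p) - kl(g0, p)


g0 = [0.5, 0.25, 0.25]   # toy "GLM" token distribution
g1 = [0.25, 0.25, 0.5]   # toy "editor" token distribution
p = [0.6, 0.2, 0.2]      # toy detection model

# Gap between the expected LPPT of editor text and of GLM text under p.
gap = cross_entropy(g1, p) - cross_entropy(g0, p)
print(gap, entropy(g1) - entropy(g0) + delta(g1, g0, p))
```

In this toy setting the expected LPPT gap decomposes into a term that does not depend on the detection model, $H(G_1) - H(G_0)$, plus the difference in relative entropy, which is the only part that the choice of $P$ can influence.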
5 Avenues for Future Research

Incorporating context
Typically, a sentence written by a GLM depends on the previous sentence or on other context affecting the GLM's state. The effect of the context on the perplexity may be significant due to a potential lack of ergodicity in actual writing or slow convergence of the LPPT to its limiting value, if such convergence occurs. For this reason, incorporating context in the LPPT evaluations may increase the power of the perplexity detector over individual sentences. Denote by (16) the LPPT of a sentence $S = (t_1, \ldots, t_{|S|})$ given a context $C$, in which the token probabilities are evaluated conditionally on $C$. The context $C$ is usually a sequence of tokens, e.g., the sentence preceding $S$, although it may also take other forms, such as the activations of the attention mechanism in transformer-based language models [JM23, Chapter 11]. If the policy determining $C$ is also stationary (e.g., the preceding-sentence policy), we can extend much of the analysis in Section 4 to use (16) instead of (1).
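A context-conditioned log-perplexity can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the context is a token sequence prepended to the model's prefix, the average is taken over the tokens of $S$ only, and `toy_logprob` is a hypothetical stand-in for a real language model's conditional log-probability.

```python
import math
from typing import Callable, Sequence

# Hypothetical model interface: log P(token | prefix).
LogProb = Callable[[str, Sequence[str]], float]

def lppt(sentence: Sequence[str], logprob: LogProb, context: Sequence[str] = ()) -> float:
    """Average negative log-probability of the sentence tokens given context + preceding tokens."""
    prefix = list(context)
    total = 0.0
    for tok in sentence:
        total += logprob(tok, prefix)
        prefix.append(tok)
    # Normalize by |S| only; context tokens are conditioned on but not scored (an assumption).
    return -total / len(sentence)

# Toy model: a bigram table when the previous token is known, else a unigram table.
BIGRAM = {("corgis", "herd"): 0.6, ("corgis", "sleep"): 0.4}
UNIGRAM = {"corgis": 0.2, "herd": 0.1, "sleep": 0.7}

def toy_logprob(token: str, prefix: Sequence[str]) -> float:
    if prefix and (prefix[-1], token) in BIGRAM:
        return math.log(BIGRAM[(prefix[-1], token)])
    return math.log(UNIGRAM[token])

without_context = lppt(["herd"], toy_logprob)
with_context = lppt(["herd"], toy_logprob, context=["corgis"])
print(without_context, with_context)
```

In this toy case the context makes the token "herd" more predictable, so the conditioned log-perplexity is lower than the unconditioned one.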

Maximizing the power of the perplexity detector
Our analysis in Section 4 shows that the power of the log-perplexity detector is proportional to the difference in relative entropy $\Delta(G_1, G_0; P)$ of (15). The information projection principle [CM03, CT06] suggests a direction for searching for a $P$ that maximizes this difference. Informally, suppose that we search for a maximizer $P^*$ within a convex set $\mathcal{P}$ of available LMs. We assume condition (17), namely, that the candidate GLM $G_0$ is closer in relative entropy to the alternative model $G_1$ than any of the LMs in our search space $\mathcal{P}$. This assumption is justified, e.g., because $G_0$ is optimized to mimic the behavior of $G_1$, which represents human writing. This optimization is achieved primarily via log-perplexity minimization, which asymptotically translates to relative entropy minimization [JM23]. Now, the information projection principle yields an inequality that is attained with equality when $P = G_0$, implying that this choice of $P$ is the worst choice over models with the property (17). Specifically, a better choice of $P$ should also consider the relative entropy to the alternative model $G_1$. Characterizing such an alternative model in applications appears to be challenging, although the relative entropy can be approximately evaluated using standard methods, e.g., via the excess binary code length in lossless compression [COR98, GL03].
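A small numerical illustration of this point follows. It is a sketch under simplifying assumptions: the sources are memoryless (i.i.d.) over three symbols, so relative entropy rates reduce to ordinary Kullback-Leibler divergences, and $\Delta(G_1, G_0; P)$ is taken to be $D(G_1 \| P) - D(G_0 \| P)$. The distributions `g0`, `g1`, and `p_alt` are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats, for strictly positive distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def delta(g1, g0, p):
    """Assumed form of the objective: D(g1 || p) - D(g0 || p)."""
    return kl(g1, p) - kl(g0, p)

g0 = [0.5, 0.3, 0.2]      # hypothetical null source (the candidate GLM)
g1 = [0.2, 0.3, 0.5]      # hypothetical alternative source (the editor)
p_alt = [0.6, 0.3, 0.1]   # hypothetical base model that differs from g0

# p_alt is farther from g1 than g0 is, as in the assumption on the search space.
print("D(g1||g0) =", kl(g1, g0), " D(g1||p_alt) =", kl(g1, p_alt))
# Using the GLM itself as the base model gives delta = D(g1 || g0).
print("delta with P = g0:   ", delta(g1, g0, g0))
# A base model other than g0 can give a larger separation.
print("delta with P = p_alt:", delta(g1, g0, p_alt))
```

In this example the base model `p_alt` yields a larger value of the objective than the choice `P = g0`, in line with the observation that the GLM itself need not be the best base model for the detector.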

Assessing the minimal number of edits for detection
The connection this paper draws between detecting edits and detecting rare signals suggests the possibility of estimating the minimal number of edits one must make so that detecting their global presence is possible.
A great deal of literature has discussed the tradeoff between the rarity and the magnitude of individual non-null effects in sparse signal detection [Ing93, Jin03, DJ04, DH09, CW14, MPL15, DK22, Kip24b]. In particular, suppose that we have $n$ asymptotically normal and independent tests in which non-null effects are on the moderate deviation scale. On the P-value scale, this can be written as relation (18), where $\mu_n = \sqrt{2r\log(n)}$ for some signal intensity parameter $r > 0$, $\sigma > 0$, and $\overset{D}{\approx}$ is a form of asymptotic equivalence in distribution that is described in [Kip24b]. Additionally, assume that the proportion of non-null effects is $\epsilon = n^{-\beta}$, $\beta \in (0, 1)$. The asymptotic power of detecting the global significance of the body of P-values experiences a phase transition as $n \to \infty$, described by the curve $\rho(\beta; \sigma)$ of (19). Namely, if $r < \rho(\beta; \sigma)$, any test distinguishing the null hypothesis of uniformly distributed P-values from a situation in which $\approx n^{1-\beta}$ of them obey (18) has asymptotically trivial power. If $r > \rho(\beta; \sigma)$, some tests, such as HC of (4), are asymptotically fully powered in the sense that there exists a sequence of thresholds under which the sum of Type-I and Type-II errors goes to zero. Consequently, (19) describes the way the number of non-null effects $n\epsilon = n^{1-\beta}$ must scale with $n$ to guarantee a non-trivial power of HC or any other global testing procedure. For example, if $\sigma = 1$ and $r \le 1/4$, the curve reduces to $\rho(\beta; 1) = \beta - 1/2$ on $1/2 < \beta \le 3/4$ [DJ04], so detection is possible when the number of edits obeys $n\epsilon = \Omega(n^{1/2-r})$. This description provides an estimate of the number of edits necessary for reliably detecting the presence of any edits, provided one can estimate the signal strength parameters $r$ and $\sigma$ associated with two GLMs from the data. A standing challenge in this context is analyzing the usefulness of this estimate, e.g., by establishing the relevance of the model (10) with perturbations of the form (18) in practice and in real data evaluations.
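The scaling in the example can be made concrete with a short calculation. The sketch below assumes the homoscedastic case $\sigma = 1$ and uses the classical detection boundary of [DJ04] for that case; the general curve $\rho(\beta; \sigma)$ of (19) is not reproduced here. The function names are hypothetical, and the returned edit count is an asymptotic order of magnitude, not a finite-sample guarantee.

```python
import math

def rho_sigma1(beta: float) -> float:
    """Detection boundary for sigma = 1: the smallest intensity r detectable at sparsity beta."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def critical_beta(r: float) -> float:
    """Largest sparsity exponent beta at which intensity r still reaches the boundary."""
    if r <= 0.0:
        raise ValueError("r must be positive")
    if r <= 0.25:
        return 0.5 + r
    if r < 1.0:
        return 1.0 - (1.0 - math.sqrt(r)) ** 2
    return 1.0

def edit_count_scale(n: int, r: float) -> float:
    """Order of magnitude n^(1 - beta*) of the number of edits needed among n sentences."""
    return n ** (1.0 - critical_beta(r))

for r in (0.1, 0.25, 0.5):
    print(r, critical_beta(r), edit_count_scale(400, r))
```

For $r \le 1/4$ the critical exponent is $\beta = 1/2 + r$, which recovers the $n^{1/2 - r}$ scaling of the number of edits; stronger per-sentence effects (larger $r$) allow fewer edits to be detected.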

Figure 3: Discriminating GLM from non-GLM sentences using the log-perplexity (LPPT) statistic (1). Left: histogram by class of LPPT of sentences from the dataset News Articles [Ana23] (top) and Wikipedia Introductions [Aad23] (bottom). Right: the receiver operating characteristic (ROC) of a test based on the LPPT. The area under the ROC curve (AUC) is indicated. In both cases, LPPT is under the language model GPT2 (1.5B).

Figure 4: Simulated critical values for a test of significance level $\alpha$ based on Higher Criticism of $n$ independent P-values. The number of samples in each configuration is 10,000. Bootstrapped 0.95 confidence intervals are indicated.

Figure 5: Adjusting the perplexity test for the number of tokens in a sentence. Left: averaged log-perplexity versus sentence length. The shaded area indicates 2 standard errors. Right: fitted log-perplexity survival functions of GPT2 for several lengths. Based on 20,000 samples from the dataset Wikipedia Introductions [Aad23].
$$\lim_{n \to \infty} \mathrm{lppt}(t_{1:n}; P_b) = H(P_b; P_a) = H(P_a) + D(P_a \| P_b), \tag{13}$$

where $D(P_a \| P_b)$ is the relative entropy rate of $P_b$ to $P_a$ [Gra11, Ch. 7].

Table 1: Sentences from the article Welsh Corgi of Figures 1 and 2 [...] sentences in $I^*$.

    $l_i \leftarrow \mathrm{lppt}(S_i; P)$
    $p_i \leftarrow F_{G_0;P}(l_i)$
    # Step II: Global testing using HC:
    if $\mathrm{HC}^*(p_1, \ldots, p_n) > \mathrm{thr}$, then
        reject $H_0$
        # Step II': Report suspected edits:
        return $\{S_i : p_i \le p_{i^*}\}$ as suspected edits
    else
        do not reject $H_0$
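The procedure can be sketched end to end as follows. This is a minimal illustration under several assumptions: the Higher Criticism statistic is taken in one common form, maximized over the smallest 40% of the P-values (the exact definition in (4) is not reproduced here); P-values come from an empirical survival function fitted to null log-perplexities; and the threshold `thr=3.0` is an arbitrary placeholder rather than a simulated critical value. The log-perplexities are synthetic, and all function names are hypothetical.

```python
import numpy as np

def empirical_survival(null_sample):
    """Return x -> smoothed fraction of null log-perplexities that are >= x."""
    s = np.sort(np.asarray(null_sample, dtype=float))
    def sf(x):
        count_ge = len(s) - np.searchsorted(s, x, side="left")
        return (count_ge + 1.0) / (len(s) + 1.0)   # +1 smoothing avoids zero P-values
    return sf

def higher_criticism(pvals, gamma0=0.4):
    """Return (HC statistic, P-value attaining it); assumed form, needs at least 2 P-values."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    k = max(1, int(gamma0 * n))
    frac = np.arange(1, k + 1) / n
    z = np.sqrt(n) * (frac - p[:k]) / np.sqrt(frac * (1.0 - frac))
    i_star = int(np.argmax(z))
    return float(z[i_star]), float(p[i_star])

def detect_edits(lppts, survival, thr):
    """Per-sentence P-values, global HC test, then report suspected sentences."""
    pvals = np.array([survival(l) for l in lppts])
    hc, p_star = higher_criticism(pvals)
    if hc > thr:                                    # Step II: reject H_0
        suspects = [j for j, pj in enumerate(pvals) if pj <= p_star]   # Step II'
        return True, suspects
    return False, []                                # do not reject H_0

rng = np.random.default_rng(1)
null_lppts = rng.normal(3.0, 0.5, size=5000)        # synthetic GLM-sentence log-perplexities
article = rng.normal(3.0, 0.5, size=100)            # synthetic article of 100 sentences
edited = list(range(0, 100, 10))                    # 10 hypothetical edited sentences
article[edited] = 6.0                               # edits have unusually high log-perplexity

rejected, suspects = detect_edits(article, empirical_survival(null_lppts), thr=3.0)
print(rejected, suspects)
```

In practice, the threshold would be calibrated to the desired significance level, for example by simulating HC under uniformly distributed P-values as in Figure 4, and the survival function would be fitted per sentence length as in Figure 5.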