Text vectorization via transformer-based language models and n-gram perplexities

Škorić, Mihailo

arXiv:2307.09255 (cs)

[Submitted on 18 Jul 2023]

Abstract:As the probability (and thus perplexity) of a text is calculated from the product of the probabilities of its individual tokens, a single unlikely token can significantly reduce the probability (i.e., increase the perplexity) of an otherwise highly probable input, even though it may represent nothing more than a simple typographical error. Moreover, because perplexity is a scalar value that refers to the entire input, information about the probability distribution within the input is lost in the calculation, especially for longer texts: a relatively good text that contains one unlikely token and a text in which every token is equally likely can have the same perplexity value. As an alternative to scalar perplexity, this research proposes a simple algorithm for calculating vector values based on n-gram perplexities within the input. Such representations take the aforementioned aspects into account: instead of a single value, the relative perplexity of each text token is calculated, and these values are combined into a single vector representing the input.
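
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not define how relative perplexity is computed, so the sliding window of n tokens used here is an assumption, as are the function names and the per-token probabilities, which are made up rather than taken from a transformer model. The sketch only shows how two inputs with the same scalar perplexity can yield different n-gram perplexity vectors.

```python
import math


def perplexity(probs):
    """Scalar perplexity: exp of the mean negative log-probability of the tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def ngram_perplexities(probs, n=3):
    """Perplexity of every contiguous window of n tokens (assumed windowing scheme)."""
    if not 1 <= n <= len(probs):
        raise ValueError("n must be between 1 and the number of tokens")
    return [perplexity(probs[i:i + n]) for i in range(len(probs) - n + 1)]


# Hypothetical per-token probabilities; a real pipeline would obtain them
# from a language model instead.
uniform = [0.1] * 8                    # every token equally likely

# One unlikely token among likely ones, chosen so that the product of all
# eight probabilities matches the uniform case.
rare = 0.1 ** 8 / 0.2 ** 7
typo = [0.2] * 4 + [rare] + [0.2] * 3

print(perplexity(uniform), perplexity(typo))   # both approximately 10
print(ngram_perplexities(uniform, n=3))        # flat vector
print(ngram_perplexities(typo, n=3))           # spike around the unlikely token
```

In the second vector only the windows that contain the unlikely token have a high perplexity, so the location of the problem survives, whereas the scalar values of the two inputs are indistinguishable.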

Comments:	10 pages, 6 figures
Subjects:	Computation and Language (cs.CL)
MSC classes:	68T50
Cite as:	arXiv:2307.09255 [cs.CL]
	(or arXiv:2307.09255v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2307.09255

Submission history

From: Mihailo Škorić
[v1] Tue, 18 Jul 2023 13:38:39 UTC (333 KB)
