Analysing the digital transformation of the market for fake documents using a computational linguistic approach

The market for fake documents on the internet has not yet been explored in depth, despite its importance in facilitating many crimes. This research explored the market for fake documents on the anonymous marketplace "White House Market" using a computational linguistic methodology, more specifically textometry. The textual corpus is composed of the ad titles and the sellers' profiles, which were analysed as traces of their online activities. We investigated how these remnants can help answer general questions. What kinds of fake documents are sold? Can we distinguish types of sellers based on their selling activities or profiles? Can we link distinct vendors based on language trace similarities? The free software IRaMuTeQ was used to carry out the analyses. The results show that textometric methods have real potential for classification, highlighting the different products on the market and grouping the sellers according to their offers.


Introduction
Identity documents are required for many everyday activities such as subscribing to a telephone service, taking out a loan from a bank, crossing borders, or buying alcohol. In addition to granting rights to their rightful holder, they can confer trust, authority, benefits, and responsibilities. This makes them highly attractive assets for individuals deprived of such benefits. Document fraud is thus a convenient solution, sometimes the only one, to get past identity checks and controls and access the places or services sought [1]. But identity documents are not the only documents used for obtaining benefits, and they are thus not the only ones affected by forgery. They belong to the broader category of 'secure documents', which also includes travel documents, banknotes, and diplomas, and which can be defined as documents giving a legal or commercial function and value to the holder and allowing the confirmation of their veracity, validity, and authenticity as genuine documents [2]. This makes fake secure documents a hot product for the illicit market. This paper uses 'fake documents' to refer to forgeries of identity documents and other secure documents.
The market for fake documents has extended to online marketplaces. The marketplace chosen for this research is the "White House Market" (WHM) cryptomarket, still active at the time of the study, from January to June 2021. It was then one of the most active cryptomarkets on the Dark Web, until its closure in October 2021, with nearly five hundred thousand users and almost three thousand sellers. We focus on the textual data present in HTML traces collected on the WHM cryptomarket. These traces can be apprehended from several forensic perspectives. The one we focus on is called by Renaut and colleagues (2017) the "language trace". It is the remnant of an action [3,4], namely the writing of an illegal or litigious text by an author, with an informative potential on its source, but also on the illicit activity itself. Language traces may result from illicit acts that can be committed through language, such as threats, defamation, or even apology for terrorism [5]. In this study, we investigate language traces resulting from the publication of illegal ads posted by vendors to reconstruct their activities and gain insight into the online market for fake documents. We investigate how these remnants can help answer general questions: (1) "What kinds of fake documents are sold?", (2) "Can we distinguish types of sellers based on their selling activities or profiles?", (3) "Can we link distinct vendors based on language trace similarities?".
The analysis of 'words as traces' in the forensic context raises many questions about the objectivity, reliability, and reproducibility of the methods used to analyse language traces. Since traces are more often than not considered silent witnesses, considering words as traces is not obvious. From a methodological point of view, this research is thus based on computational linguistics, which, integrated with forensic science, is commonly called "forensic linguistics". Seen as a particular field of applied linguistics, it is defined as "a branch of linguistics which applies in the field of justice techniques from linguistics and phonetics for the analysis of evidence in court" ([5], p. 426, free translation). However, such a definition reduces the scope of the methods to the trial [6], whereas forensic science covers the exploitation of traces more broadly in policing [4]. Indeed, computational linguistic approaches can be exploited for broader forensic purposes, such as authorship attribution [7][8][9], or the recognition and classification of illegal activities [10,11].
In this case study, textometric methods have been selected for the tasks of recognising and classifying illegal activities. Textometry is based on "the lexicon, that is the counting and distribution of words within the texts of a corpus, but also other levels of linguistic and textual description (morphosyntax, textual structures, etc.)" [12]. The main interest of this method is that it includes both a quantitative and a qualitative dimension. Indeed, textometry is based on a statistical analysis of textual data, but it integrates what [12] calls a "back to the text" step, where the scientist evaluates the results of the computational analysis by considering the surrounding context of the detected textual forms.
This paper is structured as follows: first, a review of the existing literature on the online market for fake documents is presented. Then, the research methodology and the different technical aspects are developed. Finally, the results are presented and discussed.

The online market for fake documents
The market for fake documents has extended to online marketplaces [13][14][15]. These online markets form a specific type of 'virtual convergence setting' where offenders (i.e. sellers and buyers) interact and leave traces [16,17]. They can take multiple forms, such as publicly accessible websites (e.g. online shops or platforms) or more private channels of communication, such as private groups on social media or instant messaging apps. Because private settings are more difficult to study due to accessibility and ethical issues, this research focuses on a specific type of public setting: online anonymous markets hosted on the ".onion" darkweb relying on the TOR network, also known as 'cryptomarkets'. A cryptomarket is an online marketplace on the darkweb that is quite similar to regular e-commerce platforms: sellers post their ads, and payments are carried out in cryptocurrencies. These anonymous markets allow users to engage in illegal activities while limiting the risk of being checked by the authorities [18,19].
Beyond accessibility, the choice to analyse a cryptomarket is based on three main reasons. First, online platforms bring together a variety of sellers and buyers, allowing the activity of multiple stakeholders to be analysed as a whole, whereas dedicated online shops selling fake documents appear to be quite rare [20]. Second, the tracking of dedicated online shops involves gathering heterogeneous data, whereas platforms have a unified internal structure. Finally, the choice of monitoring the ".onion" darkweb is adequate since illicit markets on the clear web are known to contain scams, while the darkweb gives a higher level of anonymity.
Indeed, darkwebs such as ".onion", which is recognised as the main one, concentrate illicit activities and in particular illicit markets. They offer a high degree of anonymity for both the managers of the websites and their users. They are not regulated by the DNS system of the ICANN, but are "Special-Use Domain Names" that are auto-regulated and self-authenticating, since they are solely derived from cryptographic keys [21,22]. Moreover, the ".onion" darkweb is settled upon the TOR network, which secures the content of communication through encryption and protects anonymity with the use of multiple intermediary nodes and a dedicated communication process known as "onion routing" to exchange information between computers without directly exchanging identifying information such as IP addresses [23].
[24] formalised the mechanism of the online selling of fake ID documents with a crime script. By analysing 19 sellers found both on the Clear and Dark Web, they identified four main steps:
- "Precondition of potential customers": the arguments put forward by sellers to attract buyers, such as the possibility of travelling.
- "Initiation and entry into the market": buyers can access markets via their browsers, sometimes after viewing advertisements that allow them to choose the seller. An initial contact then takes place between the buyer and the seller.
- "Vendor actualisation and doing of document creation": the buyer pays for the order after having outlined their requirements to the seller, who then proceeds to create the document.
- "Exit scripts of the customer and vendor": once the transaction is done, contact is often broken between the seller and the buyer, except for those trying to build customer loyalty or offering order tracking.
This description of the process outlines two dimensions of investigation about the market. The first is related to the nature of the target of the transaction (i.e. the fake document). The questions are "what types of documents are buyers looking for" and "for what purposes"? The second dimension is related to the means of contact used to enter the market. The questions are "what are the means" and "how can the online settings used be detected and monitored"? Globally, there are still very few specific studies addressing these questions about the market for fake documents. This can probably be explained by the small proportion that fake documents represent among all other illicit products available on cryptomarkets.
According to the study by Baravalle and colleagues [13], which analysed the sale of fake documents on cryptomarkets, these products are much less prevalent than others, such as drugs, which account for 80% of the products for sale on the "Agora" cryptomarket (N = 30,680 product and seller pages collected). By comparing ads for drugs and fake IDs on this platform, they determined that the market for fake IDs was more concentrated, with fewer sellers and ads than for drugs.
In their book, Akhgar and colleagues (2021) consider the fake identity document market within the "fraud and counterfeit" category of products that can be found on the Dark Web, among five other major product types. The description given is limited to: "Fraud and counterfeits - the document fraud, with the online trading of fraudulent, fake, stolen and counterfeited documents and cards, such as fake passports or identification cards and cloned and stolen credit cards or accounts, is emerging and one of the fastest-growing markets, in all types of criminal activities including terrorism. 'Card shops', for example, are one of the speciality markets in the Dark Web." ([25], p. 101).
In Mireault's MSc thesis (2016), fifty websites selling counterfeit documents on the web were analysed to describe their visibility, the products sold, and the sales process. The online stores appear to exploit online forms and emails as their preferred means of communication. They also favour payments by digital currency (e.g. Bitcoin) and international money transfers (Western Union and MoneyGram), which are well-known to be used by scammers. The main types of fake documents detected were driver's licenses, present on 68% of the websites (n = 34), identity cards (28%, n = 14), and student cards (24%, n = 12). Passports, visas, residence and civil status documents were detected on 16% of websites (n = 8). Professional cards, diplomas, and fancy documents were sold on fewer sites (10%, n = 5).
On the darkweb, dedicated online shops selling fake documents appear to be quite rare. Laferrière and Décary-Hétu [20] identified 108 illicit online shops, of which only 6 (5.5%) dealt in fake documents. Many more websites appeared to sell drugs (37%, n = 40) or carding credentials (31%, n = 34). No information on the products sold is detailed in this global study.
Bellido and colleagues (2017) investigated the acquisition mechanisms of fake documents to establish a state of the market. Using keyword searches on the Bing, Yahoo, and Google search engines, as well as a more extensive search for new links contained in previously crawled pages, they obtained a total of 375 URLs, 357 distinct hostnames, and 223 identifiers. They determined the most common ways sellers make themselves visible to their buyers via different web spaces: dedicated videos represent 37% of the means of selling, publications on forums and blogs 27%, hidden TOR sites 19%, dedicated sites 12%, and evaluation and advice sites 5%. The authors also detailed the sales process by first determining the main motivations invoked by sellers to induce customers to buy a fake ID, as well as the main means of contact and ordering. Their results seem to show that, regardless of the distribution medium used, email is consistently found as a means of contact, even if it is not the most frequent. They then conducted a market analysis to see which products are the most sold and at what price. These parameters vary depending on the platforms used, but driver's licences seem to be the most commonly sold and cheapest documents, compared to passports and ID cards. These results are consistent with those found by [15].

Dataset
The data used for this research was collected from the cryptomarket 'White House Market' (WHM). This cryptomarket, online from February 2019 to October 2021, was one of the major cryptomarkets on the Dark Web at the end of the study. Twenty crawls were performed from August 11, 2020 to March 11, 2021. The webpages of the advertisements and the sellers' profiles were extracted for 83'516 distinct ads and 2'519 distinct vendor profiles (see Table 1). All parts of the collection process were based on open-source APIs and our own developments.
The marketplace is organised into sections with subsections. The subsections "Fake Document (Digital)" and "Fake Document (Physical)" are part of the "Services" section. As this study focuses on fake documents, these two subsections are treated separately from the rest, for a total of 1103 advertisements (1.3% of all ads) and 86 vendors (3.4% of all vendors).

Pretreatment
To carry out the textometric analysis, we chose the software IRaMuTeQ, a free tool based on Python and R. It allows multiple statistical analyses and produces visualisations. It was chosen for its ease of use and the textometric methods it provides.
To integrate the data into the software as a corpus (i.e. a set of text units to be analysed), they have to fit a particular format called "Alceste" [26]. First, for the ads, each category is separated from the others and converted into a .txt document (UTF-8 encoding) containing the ad title, category, and vendor's name. Every new text is introduced with four asterisks "****". These are followed by the first piece of information, here the name of the vendor, in the form "*_name1", and then the name of the corresponding category in the same format. These variables are called "illustrative variables", which means that they are not part of the text analysed but are used to filter the dataset. The text submitted to textometric analysis is the title of the ad. The descriptions of the products in the ads were tested in several analyses but did not give results sufficiently relevant to be retained and were thus excluded. The same process is used to prepare the corpus composed of the 86 vendors of fake documents, with their names and date of admission to White House Market as illustrative variables and their profiles as the text for textometric analysis.
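As an illustration, this formatting step can be sketched in Python. The ad records and the variable labels ("vendor_…", "cat_…") below are hypothetical; only the Alceste conventions described above (four introductory asterisks, starred illustrative variables, UTF-8 .txt output) are taken from the text.

```python
# Sketch: convert scraped ads into an Alceste-formatted corpus for IRaMuTeQ.
# The ad data and variable names are hypothetical illustrations.
ads = [
    {"vendor": "vendor1", "category": "fakedoc_digital",
     "title": "USA passport psd template high quality"},
    {"vendor": "vendor2", "category": "fakedoc_physical",
     "title": "fake driver license hologram uv"},
]

def to_alceste(ads):
    """Build an Alceste-formatted corpus string.

    Each text unit starts with four asterisks followed by the starred
    illustrative variables, then the analysed text (the ad title).
    """
    lines = []
    for ad in ads:
        lines.append(f"**** *vendor_{ad['vendor']} *cat_{ad['category']}")
        lines.append(ad["title"])
    return "\n".join(lines) + "\n"

corpus = to_alceste(ads)
# UTF-8 encoding, as required by the software.
with open("listing_fakedoc.txt", "w", encoding="utf-8") as f:
    f.write(corpus)
```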

Table 1
Number of distinct vendors and ads for each section of the cryptomarket. The number of ads is counted based on distinct URLs of the ads, but also with the number of distinct product titles for each product, since the product title might have changed over time.

In Table 2, it is possible to see that only 69 vendors are taken into account for the "vendor_fakedoc" corpus. This can be explained by the fact that 17 vendors do not have any written profile. The corpora containing two categories are called "mixed corpora". Section-specific corpora are used to obtain monothematic sets and avoid replicating the initial structure of the sections [27].
Since most of the texts analysed are written in English, the English dictionary is used. For the other parameters of the software, the default values are used.
All the texts are then lemmatised, i.e. all the forms are reduced, "so that a conjugated verb can be reduced to its infinitive, plural and singular forms, masculine and feminine forms can be grouped together, and, more generally, forms corresponding to the same root with different inflections can be grouped together" ([28], p. 867). The interest of this step is to group the main 'forms' and their derivatives under a single label to obtain a more robust statistical analysis.
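The principle can be illustrated with a toy Python sketch. Real lemmatisers (IRaMuTeQ relies on language-specific dictionaries) cover far more morphology; the tiny lexicon below is purely illustrative.

```python
# Minimal illustration of lemmatisation: mapping inflected forms to a
# single lemma label. This toy lexicon is a hypothetical stand-in for
# the full language dictionaries used by the software.
LEMMAS = {
    "licenses": "license", "licences": "licence",
    "passports": "passport", "cards": "card",
    "selling": "sell", "sold": "sell", "sells": "sell",
}

def lemmatise(tokens):
    """Lowercase each token and replace it by its lemma when known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

print(lemmatise("Passports and cards sold".split()))
# → ['passport', 'and', 'card', 'sell']
```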
The next sections describe the textometric methods used on the corpora.

Descending Hierarchical analysis (DHA)
Marpsat describes DHA as a method that allows one to "give an account of the internal organisation of a discourse" ([26], p. 1). After separating the forms obtained into two categories, the "analysable forms" (i.e. terms of the text taken into account during the analysis) and the "illustrative forms" ([26], p. 2), which have a purely descriptive value for the classes obtained from the analysable forms, the text is cut into segments. These are parts of the text of fixed size, often delimited by punctuation or special characters. These text segments are then grouped together so as to contain enough analysable forms for analysis; they constitute the context of the words. They are created automatically by the software (three lines) [27]. A "lexical table" ([26], p. 2) is then formed with the groups of segments in rows and the analysable forms in columns. The table values contain '1' if the analysable form is present in the segment group and '0' if it is absent. Finally, the DHA is carried out, gathering the groups of segments into classes: the algorithm produces a successive division of the groups, first into two classes, then two more from the largest, and so on. The aim is to obtain clusters based on form frequencies representing the "lexical worlds" [29] of the texts classified. They are traces of the own 'world' (i.e. discourse universe) of the reconstructed class [29], reconstructed solely from the forms (and segments), independently of any semantic interpretation.
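A much-simplified sketch of this pipeline is given below: a binary lexical table (segments × analysable forms) is built, then the largest class is successively divided in two. The segments are hypothetical, and each split here simply picks the form that divides the class most evenly, a toy stand-in for the software's actual splitting criterion.

```python
import numpy as np

# Toy sketch of DHA: binary lexical table, then successive 2-way splits.
segments = [
    "passport psd template", "passport psd template usa",
    "driver license hologram", "driver license uv hologram",
    "selfie scan custom", "custom listing selfie",
]
forms = sorted({w for s in segments for w in s.split()})
# Lexical table: 1 if the form is present in the segment, 0 otherwise.
X = np.array([[1 if f in s.split() else 0 for f in forms] for s in segments])

def split2(rows):
    """Split a list of segment indices into two non-empty classes."""
    sub = X[rows]
    counts = sub.sum(axis=0)
    # Prefer the form splitting the class most evenly; forms present in
    # all or none of the segments cannot split and are penalised.
    score = np.abs(counts - len(rows) / 2.0)
    score[(counts == 0) | (counts == len(rows))] = np.inf
    j = int(np.argmin(score))
    with_form = [r for r, v in zip(rows, sub[:, j]) if v]
    without = [r for r, v in zip(rows, sub[:, j]) if not v]
    return with_form, without

# Descending hierarchy: repeatedly split the largest class.
classes = [list(range(len(segments)))]
while len(classes) < 3:
    classes.sort(key=len, reverse=True)
    classes += split2(classes.pop(0))
print(classes)
```

On these hypothetical segments, the three recovered classes correspond to the three underlying topics (templates, driver's licenses, selfies/scans).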
DHA is performed automatically with IRaMuTeQ and has been applied to the product corpus "listing_fakedoc". The program took into account 1187 of the 1321 texts. One hypothesis that could explain this exclusion of certain texts is that the software performs a pre-selection: texts that are too heterogeneous compared to the rest of the corpus are excluded before the analysis. DHA has also been applied to the mixed corpora (see Table 2) to see which categories of products can be found with this method.
Once the DHA is done, several statistics are automatically computed. The number of occurrences of every studied form (i.e. a bag-of-words model) is used to examine each form in a concordance table. This allows observing the form in its original context (i.e. text segments) with regard to the illustrative variables (i.e. the section to which the product belongs or the vendor, for instance). Ads published in the wrong sections can thus be identified.
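A concordance table is essentially a keyword-in-context (KWIC) view, which can be sketched as follows. The ad titles and illustrative variables are hypothetical.

```python
# Sketch of a concordance (KWIC) table: each occurrence of a form is
# shown with its left and right context and its illustrative variables.
# The titles, vendors, and sections below are hypothetical.
ads = [
    ("vendor1", "Fake Document (Digital)", "usa passport psd template"),
    ("vendor2", "Services", "scan of passport and utility bill"),
]

def concordance(ads, form, width=3):
    """Return (vendor, section, left context, form, right context) rows."""
    rows = []
    for vendor, section, title in ads:
        tokens = title.split()
        for i, tok in enumerate(tokens):
            if tok == form:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((vendor, section, left, tok, right))
    return rows

for row in concordance(ads, "passport"):
    print(row)
```

Filtering such rows by the section variable is what allows spotting, for example, a passport ad published outside the fake document subsections.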
The analysis was finally performed on the "listing_all_without_drugs" corpus. All drug ads were removed because they vastly outnumber the rest of the products. The first five most frequent words of each class created with the DHA were then compared with the classification made with the "listing_fakedoc" corpus.

Specificities and correspondence factor analysis (CFA)
CFA is a complementary analysis to DHA, which allows texts to be associated with variables. The DHA table is projected on the axes defined by chosen variables (e.g. the vendor id), giving a graphical representation of the distance between the different groups according to the analysable forms [30]. CFA performs a statistical analysis (in our case based on a hypergeometric law) on the selected variable.
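For concreteness, the core computation behind a correspondence analysis can be sketched as a singular value decomposition of standardised residuals. The contingency table below (hypothetical groups × forms) is purely illustrative; IRaMuTeQ performs the equivalent computation internally.

```python
import numpy as np

# Sketch of correspondence analysis on a small contingency table.
# Rows: hypothetical vendor groups; columns: counts of three forms,
# e.g. "template", "scan", "hologram". The values are invented.
N = np.array([
    [30.0,  2.0,  1.0],
    [ 3.0, 25.0,  2.0],
    [ 1.0,  4.0, 28.0],
])
P = N / N.sum()                       # correspondence matrix
r = P.sum(axis=1)                     # row masses
c = P.sum(axis=0)                     # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardised residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]          # principal coordinates
print(np.round(row_coords[:, :2], 3))                # positions on axes 1-2
```

Plotting the first two columns of `row_coords` gives the kind of factorial plane on which the distances between groups (and outliers) are read.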
It is generated automatically following the DHA applied to the "listing_fakedoc" corpus, showing the distance between the different classes found by the DHA, and then applied to the mixed corpora. To find groups of vendors based on their catalogue and then based on their profile, CFA was applied several times in succession to the corpora "listing_fakedoc" and "vendor_fakedoc", using the names of the vendors as the variable. Before each new analysis, the vendors furthest from the core group (named "outliers") were removed, until no more outliers were detected. The groups of vendors were finally defined based on their position on the axes. Finally, the outliers were analysed separately to understand what makes them different from the main set of vendors.

Ethical consideration
The collection process relies on open online data gathered with ad hoc web-crawling and web-scraping technologies. The cryptomarket of interest can be considered a public space given its massive number of users and sellers, with data available to every user. Access to the website requires account creation, but anybody can create an account without any condition. To respect privacy, all the vendors' names were anonymised, and no other identifying information was used during the study. All the analyses were based on the texts, and the results are presented so that no link can be established with the virtual identity of the sellers. The vendors' profiles were crawled but are not presented in the results. The collected data is intended exclusively for research purposes and cannot be used in any way that could be harmful to the users, since no personal data is shared.

Classification of fake documents
Three distinct classes were found based on the titles of the ads with the DHA. The dendrogram in Fig. 1 shows that the classes are quite balanced in terms of the percentage of forms: 44.5% for class 1, 25% for class 2, and 30.5% for class 3 (N = 1187 ads).
Class 1 gathers terms linked to documents sold in digital format, with terms like "psd" (which corresponds to the Photoshop format), "template" (i.e. a base that can be modified by the user), or "gimp" (which refers to an image editing tool similar to Photoshop). The presence of the term "passport" is linked to the expression "passport psd template", found in 75 ads. The term "card" is also present, as well as many country names, which may be linked to advertisements offering passports for each particular country. Besides that, most of the terms are not specific to fake documents.
It is harder to find a main topic for the terms gathered in class 2. Nonetheless, terms linked to photos and scans seem to emerge. For instance, the term "selfie" corresponds to an image of a person holding an identity document; this type of photo is increasingly required in online authentication processes. The expression "custom listing" is also present. Custom ads are specific ads created for a specific client, often deleted after the sale is made. A custom ad is usually a personalised ad without a description, resulting from a prior agreement between the seller and the customer [33]. Noticeably, the term "passport" is also present in this class. The term "identity" is present but is not directly linked to the term "card".
For class 3, the main topic is driver's licenses. Ads for driver's licenses are more often than not linked to American driver's licenses (some American states even emerge as the main words). Some terms linked to security features like "secu", "hologram", "uv", or "holo" are also frequently present in this class.
Combined with the dendrogram, the bag-of-word analysis reveals similarities between classes. In particular, the terms in common are "passport", "fake", "license", and "quality".
The CFA (see Fig. 2) confirms that the three classes are well separated from each other. Class 3 is quite well isolated; this is particularly true for the American states, which seem to be very specific to this class. Some terms like "passport" stand between classes 1 and 2 because they appear in text segments from both classes. This can be explained by the fact that some ads propose "passport scan" in class 2, while "passport psd template" is one of the major n-grams of class 1.

Similarity analysis
The analysis detects the terms most frequently used in the ad titles to describe the products, as well as their relationships. The most frequent terms seem to match the types of counterfeit documents (see Fig. 3): "id", "card", "passport", "driver", and "license". Certain terms are very often used together. For instance, the term "id" seems very central and rather generalist, as it leads to different types of documents, not only "id cards". Moreover, the analysis leads to the detection of the different digital forms in which products can be found, like "psd", "template", and "scan".
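The underlying idea of a cooccurrence spanning tree can be sketched as follows: count how often pairs of forms appear in the same title, then keep a maximum spanning tree of the resulting graph (here via Prim's algorithm). The titles are hypothetical, and this is a simplified stand-in for the software's similarity analysis.

```python
from itertools import combinations
from collections import Counter

# Count within-title cooccurrences of forms (titles are hypothetical).
titles = [
    "fake id card usa", "id card template psd", "passport scan id",
    "driver license psd template", "passport template psd",
]
cooc = Counter()
for t in titles:
    for a, b in combinations(sorted(set(t.split())), 2):
        cooc[(a, b)] += 1

def max_spanning_tree(cooc):
    """Prim's algorithm keeping the heaviest edge crossing the cut."""
    nodes = {n for edge in cooc for n in edge}
    tree, seen = [], {next(iter(nodes))}
    while seen != nodes:
        edge, _ = max(
            ((e, w) for e, w in cooc.items() if (e[0] in seen) != (e[1] in seen)),
            key=lambda ew: ew[1],
        )
        tree.append(edge)
        seen.update(edge)
    return tree

tree = max_spanning_tree(cooc)
print(tree)
```

The strongest links (here "psd"-"template") necessarily survive in the tree, which is why the tree view highlights the backbone of the vocabulary.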

Driver's licenses.
A strong link is detected between "license" and "psd", with "driver" as the central term. This analysis highlighted many expressions used to designate driver's licenses, including spelling differences ("licence", "drive") or diminutive forms ("dl"). Most of the terms linked to the two principal ones are American states, consistent with the result found during the DHA analysis. Finally, the term "dl" is mostly associated with security features, like "UV" or "hologram".

ID cards.
The term "id" is frequently linked to the words "card" and "fake". The cluster centered around the term "id" is mostly composed of American states and country names. Interestingly, the cluster containing the term "card" also includes French terms like "conduire" (drive), "identité" (identity), or "carte" (card). The "id" word is also frequently linked to the "psd" term.

Passports.
The term "passport" is also central and linked to 24 other forms. It is frequently linked to "template" and "scan", which gives an indication of the type of counterfeiting. It is interesting to notice that the term "physical" is also linked to "passport". We can also find "biometric" passports, a term indirectly related to "passport" (with "world" and "travelling" between them).

Digital forms of documents.
The first thing to notice is the strong link between "psd" and "template", which is coherent with the observation that those two terms are often used together in the same texts and seem to form a very common format for digital documents. Passports also seem to be frequently linked to the term "scan". Terms like "gimp", "editable", or "xcf" give other information about the format of digital documents.

Fig. 3. Similarity analysis, cooccurrence spanning tree, and clustering of the main terms (N = 322 forms analysed). The size of the nodes and the width of the edges are proportional to the frequency of occurrences.
While digital forms seem to be central for fake documents, one term gives more insight into the context of their potential usage: "selfie". It corresponds to a photo showing a person holding an ID document. Indeed, the digital transformation of services such as neo-banking allows users to validate their accounts digitally, without any physical verification: clients are requested to send a photo of themselves holding their ID document, sometimes together with a piece of paper showing the current date. The identity control process is thus completely digital, which might explain the appearance of new forms of the illicit market for fake documents. In conclusion, the analysis shows that it is possible to detect specific kinds of fake documents that appear to be different from those described in the literature [13,14].

Comparing fake documents with other products
A DHA was performed on every mixed corpus to see if the method allows the discovery of new categories of products. For each class, we can distinguish a main topic that links all the words of the class together. In every mixed corpus, a specific class containing the forms linked to fake documents was also detected, except for the mixed corpus "Fakedoc_drugs". This can be explained by the huge proportion of drug ads compared to fakedoc ads (85'262 vs. 1'321). A similar imbalance is observed with the "fakedoc_defense" mixed corpus, where fake documents predominate (1321 vs. 88). Observing the CFA generated after the analysis, it is possible to notice that, in two cases ("Services" and "Online business"), the class containing terms linked to fake documents is confounded with other classes. This can be explained by the fact that some terms are common among the products proposed in those categories, like "card" (which can fit with "gift card" or "id card", for example). Moreover, the two subsections of fake documents were originally part of the "Services" category, so it makes sense that the proposed products are close. For the online business section, some terms are also semantically close: for example, this category contains a lot of "bank drops" (i.e. accounts that can be used for money laundering or illegal transfers) and credit cards. Terms linked to payment methods, such as "paypal", were also detected in the analysis of the fake document sections (see Table 3).

Detecting fake documents in other sections
The main interest of using a concordance table is to determine whether it is possible to detect fake documents miscategorised in other sections. For this analysis, the first five words of each class (by number of occurrences) found in the DHA analysis of the "listing_fakedoc" corpus were searched in the "listing_all_without_drugs" corpus. It is important to note that the terms studied in this analysis were selected according to their number of occurrences in the corpus; they are thus not necessarily specific to the field of fake documents. Then, every category different from the two fake document subsections (physical/digital) was identified. Table 4 shows all the detected categories.
Twenty-eight categories containing the first five words of our "fake document" classes were identified. The term with the highest diversity is "custom", present in 15 categories; as previously described, this can be explained by the particular usage of this term within the cryptomarket ecosystem. "passport" is nevertheless present in 8 other categories. The term "statement" is the one present in the fewest other categories.
The specificity of the terms can also be analysed through the proportion of their occurrences in the "fakedoc" corpus compared to their total number of occurrences (Table 5).
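This specificity rate is a simple ratio, sketched below. The counts are hypothetical, chosen only so that the rates match the percentages discussed for "statement", "psd", and "hq".

```python
# Sketch of the specificity rate: the share of a term's occurrences that
# falls inside the fake-document corpus. The counts are hypothetical.
occurrences = {                # (in "listing_fakedoc", in full corpus)
    "statement": (58, 60),
    "psd": (435, 500),
    "hq": (13, 100),
}

def specificity(in_fakedoc, total):
    """Percentage of occurrences found in the fake-document corpus."""
    return round(100 * in_fakedoc / total)

rates = {t: specificity(a, b) for t, (a, b) in occurrences.items()}
print(rates)
# → {'statement': 97, 'psd': 87, 'hq': 13}
```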
The terms with the highest rate of occurrences in the "listing_fakedoc" corpus are "statement" (97%) and "psd" (87%), which is consistent with the previous results, in particular concerning the most common format of selling. The terms with the lowest rate of specificity are "hq" (13%) and "dl" (29%). "hq" is an abbreviation of "high quality", an expression that can be used in many contexts other than fake documents. "dl" can be read as "driver's license", but also "download".

Table 3
Number of classes obtained by DHA per corpus and distinction of a class related to fake documents. N indicates the number of analysed ads for every corpus. The first ten words of each class found are also reported.

Grouping sellers

Based on the ad titles
Seven successive CFAs were performed, during which 19 outliers were removed. Outlier ads are mostly written in other languages, such as French or German. Products such as Netflix accounts, Walmart receipts, Apple Store subscriptions, and biometric passports and visas were also detected as very specific selling activities related to outliers. Fig. 4 shows the result of the last CFA, in which no obvious outliers remain visible. Each square on the graph represents a group according to the dimensions selected by the algorithm.
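This iterative exclusion can be approximated numerically. In correspondence analysis, a row's distance from the origin of the full factor space equals the chi-squared distance between its profile and the average profile, so a rough sketch of the procedure is possible without the graphical step. The cutoff of three times the median distance is an assumption standing in for the paper's visual criterion:

```python
import numpy as np

def chi2_row_distances(counts):
    """Chi-squared distance of each row profile from the average profile,
    i.e. each row's distance from the origin in the full CA factor space."""
    counts = np.asarray(counts, dtype=float)
    profiles = counts / counts.sum(axis=1, keepdims=True)
    avg_profile = counts.sum(axis=0) / counts.sum()
    return np.sqrt((((profiles - avg_profile) ** 2) / avg_profile).sum(axis=1))

def remove_outliers(counts, labels, factor=3.0):
    """Repeatedly drop rows whose distance exceeds factor * median distance,
    mimicking the successive-CFA outlier exclusion."""
    counts = np.asarray(counts, dtype=float)
    labels, removed = list(labels), []
    while True:
        counts = counts[:, counts.sum(axis=0) > 0]  # drop now-empty term columns
        dist = chi2_row_distances(counts)
        flagged = dist > factor * np.median(dist)
        if not flagged.any():
            return labels, removed
        removed += [lab for lab, f in zip(labels, flagged) if f]
        labels = [lab for lab, f in zip(labels, flagged) if not f]
        counts = counts[~flagged]
```

Such a numerical criterion would make the exclusion step reproducible, addressing the subjectivity issue discussed later.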
By consulting the catalogue of the vendors from each group, we were able to extract a main topic of products for each group:
- Group A: physical fake documents. This is the group with the fewest vendors, which is consistent with the other analyses showing that digital fake documents are more common than physical ones.
- Group B: "scan", "custom", and "selfie" are the most common products in this group. This result is also consistent with class 2 of the DHA analysis discussed earlier.
- Group C: "psd template". Given the omnipresence of this digital form of product on the market, it is not surprising to find a group comprising mainly the sellers of this type of document.
- Group D: "pack" and other products. This group is more difficult to define in terms of products sold. The term "pack" is present in an important part of the ads, but this group contains some diversity.

Table 4 Presence/absence of the term in a category other than "Fake Document (Physical)" or "Fake Document (Digital)".
We can see that some vendors lie very close to the axes, which can be explained by the proximity of their offer to that of vendors from other groups. For example, vendor 10 (group A) is very close to group C, and by looking at their catalogue, we can see that their offer is mostly composed of "psd template" ads, among other products.
The contribution of each group can be visualised in Table 6. The distribution of the vendors across the groups is balanced.

Based on their profile
The same analysis was performed using the vendors' profile textual descriptions. It revealed two major issues: first, 17 vendors did not have any written profile, so they could not be taken into account; second, after six successive analyses, 29 vendors were excluded as outliers, resulting in a total of 46 vendors (53% of the total) that were not analysed. Reading the profiles did not reveal any insight into the groups formed with the remaining sellers. Nor did the profiles of the outliers reveal anything conclusive to explain their exclusion from the others.

Table 5 Proportion of occurrences in the "fakedoc" corpus compared to the total corpus (except drugs) (N = 15844 titles for the "listing_all_without_drugs" corpus and N = 1187 titles for the "fakedoc" corpus).

Conclusive discussion
Is it possible to set up a classification of fake documents using textual data? The DHA analysis led to a classification of fake documents and highlighted other types of documents than the fake identity documents described in the literature, which distinguishes three main categories: passports, ID cards, and driver's licenses. The highlighting of other products such as utility bills and bank statements, but also of a novel category related to "selfies", shows a greater diversity in the market than expected. The similarity analysis is informative on the most common format for selling the products: the "psd template". Based on the observation that driver's licenses are mostly linked to American state names, it can be hypothesised that the demand for this kind of document is higher. Indeed, driver's licenses are used much more often to check identity in the USA than ID cards or passports. The discovery of the selfie brings to light new issues concerning identity control on the internet. Indeed, many sites today require a photo of the user holding an ID document to access their services. The availability of these selfies therefore offers a new way of evading such controls.
However, during DHA, IRaMuTeQ showed its first limits. The term "id" was absent from the analysis. The assumption made about this fact is that the term was systematically contained in the texts that were not taken into account. Another hypothesis was suggested by Ref. [34], who proposes that the software does not treat every form as a "full form". The major problem is thus that the operator has no control over the forms or texts analysed, which is a real issue from a forensic point of view. To test the hypothesis, the term "id" was replaced with "identity" in the corpus. After another DHA, the term "identity" appeared in the class associated with driver's licenses, with a higher number of occurrences. This finding raises the hypothesis that short words are excluded just like stop words: they may not be taken into account because of their size. However, terms such as "hq" and "dl" were taken into account in the analysis. This observation leads back to the fundamental methodological proposition recalled by Ref. [12]: "back to the text". As it helps to identify the gaps induced by analyses over which the operator's control is limited, it compensates for the "black box" effect inherent in some algorithms. In our study, this problem appears to be specific to the DHA analysis. The modification of the corpus made in the test may not be a viable solution because, depending on the context, this action could be perceived as a modification of the textual trace.
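A lightweight "back to the text" check of this kind can be automated by diffing the raw corpus vocabulary against the forms the software actually reports. The function below is a hypothetical illustration of that check, not an IRaMuTeQ feature:

```python
def missing_forms(titles, reported_forms):
    """Forms present in the raw corpus but silently absent from the tool's
    output -- candidates for the kind of exclusion observed with "id"."""
    corpus_vocab = {word for title in titles for word in title.lower().split()}
    return sorted(corpus_vocab - {form.lower() for form in reported_forms})
```

Running such a diff after each analysis would flag dropped forms systematically, rather than relying on the operator noticing their absence.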
Can fake documents be distinguished from other products? In most of the mixed corpora studied, it was possible to distinguish a specific theme for the classes found with DHA, and to obtain a class containing forms linked to fake documents that was separate from the other classes. The main issue for the comparison is the variation in the sizes of the corpora: if one of the two categories used to create the mixed corpus has many more texts than the other, the second one is hardly detected. Following this, the concordance table led to the detection of forms that can be used in different contexts, and of miscategorised fake documents. Freeing oneself from the sections used by the sellers when selecting the products to analyse is a key issue for analysing online marketplaces. This was not the main aim of this study, but the results show the interest of the tested approaches for evaluating the output of fully automated AI approaches such as deep learning.
Can sellers of fake documents be grouped based on the textual data from the advertisements? Four main groups of vendors were detected. Globally, an important proportion of digital fake documents is observed compared to physical ones. The effort required to make physical documents and the ease of transferring digital ones may explain this result. Indeed, the manufacture of physical fake documents requires know-how, equipment, and materials to produce a document of satisfactory quality. There is also consistency between the results found with the products and with the vendors, which might signify a certain degree of specialisation. The main issue of this analysis is the exclusion of the outliers during the successive CFAs. Indeed, this part of the process is based on a visual analysis of the graphs: the operator decides which vendor is an outlier based on its graphical distance from the main group. In our case, there was always a compact group in the centre, so it was easy to determine the outliers.
Is it possible to find groups of sellers from the analysis of their profiles? This analysis suffers from the subjectivity required for the exclusion of the outliers. Contrary to the corpus of ad titles, the distributions obtained after the successive analyses of the vendors' profiles were more scattered. This led to the exclusion of 29 vendors (34%). Given that 17 further vendors had no profile, 53% of vendors were thus excluded from the analysis. Vendors' profiles should therefore be considered with caution, and further analysis is required to evaluate their informative content. It was indeed impossible to identify a main topic for the groups formed. This can be explained by the fact that every vendor chooses to write whatever they want in their profile, which does not necessarily have a link with what they sell. It could be interesting to apply the method to a corpus of vendors of other types of products, to see whether this is a problem inherent to vendors of fake documents.
Globally, several steps of the methodology required manual work, which entails a certain risk of error. For example, in the concordance table analysis, it would have been difficult to estimate the number of products listed outside the fake document categories for each term studied, due to the high number of occurrences of each word; IRaMuTeQ did not allow an automatic numerical estimate. The size of the corpus is also a limitation for some analyses, such as product classification. The method also requires assessing the construction of the corpus itself, in order to ensure that all forms are taken into account. Finally, some limitations come from the software used. IRaMuTeQ is an easy-to-use software that provides good results for exploratory analysis and relevant global information. However, it does not allow us to go deeper into the details of the data, at least not in an automatic way, and the operator has little control over the forms used. It could therefore be interesting to place it in sequence with other techniques, where it would perform an initial sorting before continuing with more elaborate methods and tools.
The analysis of words as a trace in the judicial context still raises many questions. Indeed, words are more often than not considered subjective and sensitive to much variation and interpretation, an aspect that statistical methods tend to mitigate. But the potential of these methods for investigation and intelligence purposes appears to be very high. This research is intended to be a starting point and, above all, an open door to exploring how the statistical analysis of textual data might help to answer crime analysis questions.

Declaration of competing interest
No conflict of interest.