A Multi-Input Machine Learning Approach to Classifying Sex Trafficking from Online Escort Advertisements

Abstract: Sex trafficking victims are often advertised through online escort sites. These ads can be publicly accessed, but law enforcement lacks the resources to comb through hundreds of ads to identify those that may feature sex-trafficked individuals. The purpose of this study was to implement and test multi-input, deep learning (DL) binary classification models to predict the probability of an online escort ad being associated with sex trafficking (ST) activity and aid in the detection and investigation of ST. Data from 12,350 scraped and classified ads were split into training and test sets (80% and 20%, respectively). Multi-input models that included recurrent neural networks (RNN) for text classification, convolutional neural networks (CNN, specifically EfficientNetB6 or ENET) for image/emoji classification, and neural networks (NN) for feature classification were trained and used to classify the 20% test set. The best-performing DL model included text and imagery inputs, resulting in an accuracy of 0.82 and an F1 score of 0.70. More importantly, the best classifier (RNN + ENET) correctly identified 14 of 14 ads that had classification probability estimates of 0.845 or greater (1.0 precision); precision was 96% for the multi-input model (NN + RNN + ENET) when only the ads associated with the highest positive classification probabilities (>0.90) were considered (n = 202 ads). The models developed could be productionalized and piloted with criminal investigators, as they could potentially increase investigators' efficiency in identifying potential ST victims.


Introduction
Trafficking in persons is one of the most harmful criminal industries internationally. Its prevalence continues to rise each year, and it is currently identified as the second-most profitable illegal trade, after drug trafficking [1]. According to the U.S. Department of State's 2022 Trafficking in Persons Report [2], 1111 federal (or joint federal-local/state) investigations of human trafficking were opened during fiscal year 2021, with the Department of Justice initiating prosecution in 228 cases, the majority of which (221) concerned sex trafficking as opposed to labor trafficking. At the local/state level, 2203 human trafficking offenses were reported by participating jurisdictions [2]. These figures, however, are thought to grossly underestimate the true extent of the problem. For instance, 11,500 cases of human trafficking were reported to the National Human Trafficking Hotline during 2019 [3]. These calls led to the identification of 22,326 victims and survivors, of whom 14,597 (65%) had been sex trafficked, with an additional 1048 (5%) having been subjected to both sex and labor trafficking.
Federal law defines sex trafficking as "the recruitment, harboring, transporting, provision, obtaining, patronizing or soliciting of a person for the purposes of a commercial sex act, in which the commercial sex act is induced, through the use of force, fraud, or coercion, or in which the person induced to perform such an act has not attained 18 years of age" [...] the model went further than prior research by using the model to identify unknown trafficking organizations and assign a risk score. However, this model only examined the text of the ads (and not the emojis or images).
Convolutional neural networks (CNN) have become a commonplace approach for image (and other) analyses. Granizo et al. compared SVM and CNN models to estimate the gender and age of individuals on a known public repository where sex services were advertised [16]. The training data set consisted of labeled images (n = 4096 posts). Accuracy rates for age classification were 80.6% (SVM) vs. 97.3% (CNN) for faces and 82.1% (SVM) vs. 51.4% (CNN) for upper body images.
More recently, Wiriyakun and Kurutach [17,18] utilized a feature selection approach and compared three updated ML models against the original work of Tong et al. [11] with the Trafficking-10k dataset, namely random forest, logistic regression, and linear SVM. These models significantly outperformed Tong et al.'s [11] bag of words approach, with F1 scores of 63.3%, 64.8%, and 61.3%, respectively, as compared to 24.5%. These results are relevant because the authors dichotomized the labels in the Trafficking-10k dataset, which is the approach adopted in the present study. Table 1 provides a summary of the results from related work. As ML methods advance and classification modeling improves, the need to determine the extent to which these higher-performance models apply in different contexts arises. This study attempts to improve on earlier methods via the provision of a high-performance deep learning model that includes text and imagery inputs. This has the potential to alleviate resource constraints placed on law enforcement by creating a model that can identify sex trafficking-related online escort ads with both high accuracy and precision, thus maximizing the efficiency of criminal investigators.

Table 1 (excerpt). Summary of results from related work. Wiriyakun and Kurutach (2021, 2022) [17,18]: Trafficking-10k dataset with the outcome in binary form (original seven labels collapsed); feature selection approach; reported 0.77 accuracy in the best-performing model and F1 scores of 63.3% for the random forest model, 64.8% for logistic regression, and 61.3% for linear SVM.

Data, Software, and Hardware
For this study, 12,350 escort ads were obtained under a user agreement with Marinus Analytics. This agreement provided access to the Trafficking-10k dataset [11]. The data included an identifier, a classification label (more on this later), a title text, and a body text. The title and body were further concatenated for use. Observations included emojis (images) and emoticons (text viewed as images, e.g., :/). Figure 1 shows the first five observations of the original dataset.
The Anaconda distribution of Python 3.8 [19] was used for all models. The TensorFlow library [20] provided support for the machine learning algorithms. Programming code is freely available in the Supplementary Materials. For this preliminary study, processing was performed on a high-end computer with a 14-core Intel i9-12900K Central Processing Unit (CPU) operating at 2.5 GHz, 64 GB of random access memory (RAM), and a single NVIDIA GeForce RTX 3080 Super Graphics Processing Unit (GPU).

Training, Validation, & Test Sets
Data were split randomly into an 80% training set with 9880 observations and a 20% test set with 2470 observations. The training set was further split for hyperparameter tuning, with 80% (7904) used for training and 20% (1976) used for validation. After estimating optimal hyperparameters, the entire training set was used for model fitting.
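The split sizes above can be reproduced with a short sketch. This is an illustration only, not the authors' code: the function name `split_indices`, the seeds, and the use of Python's standard `random` module are assumptions, and the source does not state whether the split was stratified by label.

```python
import random

def split_indices(n, holdout_fraction, seed):
    """Shuffle range(n) and split off a holdout of the given fraction.

    Hypothetical helper: the paper only states that the split was random.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_holdout = round(n * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]

# 12,350 ads -> 80% training / 20% test
train_idx, test_idx = split_indices(12350, 0.20, seed=0)

# Training set -> 80% fitting / 20% validation for hyperparameter tuning
fit_pos, val_pos = split_indices(len(train_idx), 0.20, seed=1)
fit_idx = [train_idx[i] for i in fit_pos]
val_idx = [train_idx[i] for i in val_pos]

print(len(train_idx), len(test_idx), len(fit_idx), len(val_idx))
# -> 9880 2470 7904 1976
```

After tuning, the fitting and validation indices would be recombined (that is, `train_idx` is used as a whole) for the final model fit, as described above.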

Image Creation
Marinus Analytics was unable to provide the original images used in the initial study. To investigate the utility of multi-input models that include image components, Unicode emojis from the concatenated title and body were converted to color images for use in investigating CNNs using spaCy [21]. The extracted emojis for each observation were arranged into a square grid just large enough to contain each sequential image. For example, if there were 100 emojis in the combined text, then the grid was 10 by 10. This grid was then converted to a 224 by 224 image using OpenCV [22] and the Segoe UI Emoji TrueType font. Figure 2 shows the image generation for a single observation.
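The grid-sizing rule can be sketched as follows. The helper names `grid_side` and `grid_positions` are hypothetical, and row-major placement is an assumption; the source only says the emojis were placed sequentially. Rendering with OpenCV and the emoji font is omitted here.

```python
import math

def grid_side(n_emojis):
    """Side length of the smallest square grid holding n_emojis cells.

    Equivalent to ceil(sqrt(n_emojis)), computed with integer arithmetic.
    An ad with no emojis still gets a 1 x 1 (empty) grid in this sketch.
    """
    return max(1, math.isqrt(max(n_emojis - 1, 0)) + 1)

def grid_positions(n_emojis):
    """(row, column) cell for each emoji, assuming row-major order."""
    side = grid_side(n_emojis)
    return [(i // side, i % side) for i in range(n_emojis)]

print(grid_side(100))  # -> 10, matching the 10 by 10 example in the text
print(grid_side(101))  # -> 11, one extra emoji forces a larger grid
print(grid_positions(5))
# -> [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```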

Text Preprocessing
The title and body text were concatenated for use in natural language processing. Text was converted to lowercase, numbers were converted to text, and additional .html tags remaining in the dataset were removed using the "re" library. Emoticons and emojis were mapped to text using the "emot" library. Punctuation was identified by the "string" library in Python and removed. Common English stop words were removed, and text was lemmatized (to account for parts of speech), tokenized (to convert from text to numbers), and padded to the maximum sentence length using the Natural Language Toolkit (NLTK) [23]. Figure 3 shows four examples of pre-processed vs. original text before lemmatization, tokenization, and padding.
In Figure 3, "both" reflects the original concatenated title and body. In observation 0, there are numbers and .html tags that are removed (along with punctuation and capitalization) in the transformed columns "title" and "body". In observations 3 and 4, the emojis are converted to text. Doing so supports text-based modeling.
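A simplified sketch of part of this pipeline, using only the standard library, is shown below. It is not the authors' code: the stop-word list is a tiny illustrative subset, the tag-matching regular expression and the post-padding with zeros are assumptions, and number-to-text conversion, emoji/emoticon mapping ("emot"), and lemmatization (NLTK) are left out.

```python
import re
import string

# Illustrative subset only; the study used a standard English stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "for", "with"}

def clean_text(text):
    """Lowercase, strip markup tags and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # assumed pattern for leftover tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode_and_pad(token_lists, vocab):
    """Convert tokens to ids and right-pad to the longest sequence."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [
        [vocab[t] for t in tokens] + [0] * (max_len - len(tokens))
        for tokens in token_lists
    ]

ads = ["Hello <b>World</b>, new in town!", "New to the city"]
tokens = [clean_text(ad) for ad in ads]
vocab = build_vocab(tokens)
print(tokens)   # -> [['hello', 'world', 'new', 'town'], ['new', 'city']]
print(encode_and_pad(tokens, vocab))  # -> [[1, 2, 3, 4], [3, 5, 0, 0]]
```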

Feature Creation
Eight features were extracted from the data. Title and body emojis were counted and used as separate variables. Derivations of the word "sex" were counted separately for the title and body as well. Proportions of emojis and derivations of the word "sex" were also calculated for the title and the body. These eight variables were included in a four-layer neural network model (64 neurons, 50% dropout, 32 neurons, and 20% dropout with linear activation functions) after min-max scaling. Min-max scaling uses the range of the variable as the denominator and the difference between each observation and the minimum of the variable as the numerator, effectively locating each observation within the variable's range as a proportion between 0 and 1. Neural networks (NN) require scaling to facilitate convergence and improve computational performance as well as accuracy [24].
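Min-max scaling as described above can be written in a few lines. The function names are hypothetical, and fitting the minimum and maximum on the training data before applying them to other data is an assumption about good practice rather than something the source states.

```python
def fit_min_max(train_values):
    """Return the (minimum, maximum) of a feature from the training data."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Scale values to (value - min) / (max - min).

    A constant feature (max == min) is mapped to 0.0 to avoid dividing by zero.
    """
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

# Hypothetical emoji counts for three training ads
train_counts = [0, 5, 20]
lo, hi = fit_min_max(train_counts)
print(apply_min_max(train_counts, lo, hi))  # -> [0.0, 0.25, 1.0]
print(apply_min_max([10], lo, hi))          # -> [0.5]
```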

Label Definition & Processing
The dependent variable for this study was based on the original question from the Tong et al. study [11]: "In your opinion, would you consider this advertisement suspicious of human trafficking?" The possible responses were "Certainly no", "Likely no", "Weakly no", "Unsure", "Weakly yes", "Likely yes", and "Certainly yes". Classification decisions were made by three raters with many years of combined experience in the field of human trafficking. Pairwise agreement among the raters was 83%. The original dataset included images; however, these are no longer available. The best F1 score from the original study was 66.5% with a deep network approach [11].
For this study, the label was collapsed from seven levels to binary, as follows: "Certainly no", "Likely no", "Weakly no", and "Unsure" were combined into the category "Not likely or unsure". The remaining categories were collapsed into "More likely than not". This type of binary coding allows for probabilistic mapping between the classification algorithm and the raters' assessments. Dichotomization of the labels in the Trafficking-10k dataset has been found to be adequate in previous studies [17,18].
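The recoding can be expressed as a simple lookup. The 0/1 coding and the names below are assumptions for illustration; the source describes the two collapsed categories but not how they were encoded.

```python
# Original seven response levels, grouped as described in the text.
NOT_LIKELY_OR_UNSURE = {"Certainly no", "Likely no", "Weakly no", "Unsure"}
MORE_LIKELY_THAN_NOT = {"Weakly yes", "Likely yes", "Certainly yes"}

def collapse_label(label):
    """Map a seven-level rating to 1 (more likely than not) or 0 (otherwise)."""
    if label in MORE_LIKELY_THAN_NOT:
        return 1
    if label in NOT_LIKELY_OR_UNSURE:
        return 0
    raise ValueError(f"Unexpected label: {label!r}")

ratings = ["Certainly no", "Unsure", "Weakly yes", "Certainly yes"]
print([collapse_label(r) for r in ratings])  # -> [0, 0, 1, 1]
```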

Models
Seven models were evaluated to assess the value of the engineered features, the title and body content, the generated images, and combinations of the three. The comparison was structured as a design of experiments (see [25]) that seeks to identify the marginal and additive contributions of each input to the classification metrics as the architecture is modified. Classification based on engineered features leverages a neural network architecture, a common machine learning approach. Text classification (natural language processing, or NLP) uses stacked bidirectional long short-term memory (LSTM) [26] recurrent neural networks (RNN) [27]. Bidirectional LSTMs have proven to be powerful in classifying textual content [28]. Classification models leveraging imagery were based on a convolutional neural network (CNN) architecture, specifically EfficientNetB6 (ENET) [29]. EfficientNet has proven to be a highly accurate architecture for image classification tasks [30]. Multi-input models were combined with the same neural network scaffolding. These types of models have proven to be highly effective in prediction [30,31]. The seven models of this study (see Table 2) are: Model 1, a neural network with decreasing nodes and dropout (64 nodes → 50% dropout → 32 nodes → 20% dropout, linear activation functions); Model 2, a CNN (ENET) flattened and followed by decreasing nodes and dropout (128 nodes → 50% dropout → 64 nodes → 20% dropout); Model 3, an RNN with a size-512 embedding layer and two stacked bidirectional LSTM layers; and Models 4 through 7, every combination of two or more of these three components. Table 2 provides the definitions and abbreviations for all seven models. Each individual model was joined to (or, for multi-input models, concatenated and then joined to) the final architecture, a neural network (16 nodes with ReLU activation → 10% dropout → 8 nodes with ReLU activation → 1 sigmoid node). Figure 4 is the TensorFlow architecture display for the full model: NN, RNN, and ENET.

Hyperparameter Tuning and Loss Metric
The number of epochs for each separate model was set based on training/validation split performance. The optimizer used was "adam" [32], although many others were investigated and often switched during performance testing. Batch size for mini-batch metrics was 32. The loss metric for optimizing model weights (and filters) was binary cross-entropy or log loss, appropriate for binary classification.
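For reference, binary cross-entropy (log loss) averaged over a batch can be computed as in the sketch below. This is a plain-Python illustration of the formula rather than the TensorFlow implementation used in the study; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]

# Lower loss means the predicted probabilities agree better with the labels.
print(binary_cross_entropy(labels, confident_and_right))
print(binary_cross_entropy(labels, confident_and_wrong))
```

An uninformative classifier that always outputs 0.5 has a loss of ln 2 (about 0.693), which is a useful baseline when reading training curves.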

Descriptive Statistics: Dependent Variable (Label)
The distribution of online escort ads across the classification label (dependent variable) categories is shown in Figure 5. After recoding, 4054 observations (32.8%) were coded as likely sex trafficking, with the remaining 8296 (67.2%) coded as not likely or unsure.

Descriptive Statistics: Features
Eight features were generated for inclusion in the classification models. Table 3 provides descriptive statistics for those features. The "average" observation had 20.3 emojis in the title (30.8% of the words) and 113.7 emojis in the body (24.1% of the words). This "average" observation also had 0.2 derivatives of the word "sex" in the title (0.3% of the words) with an additional 0.7 in the body (0.1% of the words). The distributions of these variables are right-skewed, as they are bounded below at zero. The emoji counts in the title and the body show high variability (in both standard deviation and range), which ostensibly might be valuable for algorithmic learning of the classification status.

Model Comparison
The focal point of this study is the ability to provide law enforcement tools to efficiently identify those advertisements that are likely to be associated with sex trafficking (ST). To evaluate model performance, five measures were used: sensitivity (also known as recall), specificity, accuracy, precision (positive predictive value), and the F1 score. In terms of probability, these measures provide different types of information. Sensitivity or recall, P(identified as positive|truly ST), tells us the ability of the classifier to identify sex trafficking (i.e., true positive rate), but if the classifier suggested that all observations are positive, then sensitivity would be a perfect 1.0, while the specificity, P(identified as negative | truly not ST) (i.e., true negative rate), would be a worst-possible 0.0, so the model would not be informative. Accuracy combines both in its calculation and equals the proportion of cases that were accurately classified (i.e., true positives + true negatives) over the full sample (i.e., true positives + true negatives + false positives + false negatives). A model that correctly classifies most of the true positives and true negatives may still be problematic in terms of precision, which is the positive predictive value, or P(truly ST|identified as positive), as lower precision means that many identified positives are likely not to represent ST (i.e., there would be too many false positives). Finally, the F1 score provides a mixture of precision and recall (sensitivity) scaled between 0 and 1 and is calculated as 2 × Precision × Recall/(Precision + Recall).
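The five measures can be computed directly from confusion-matrix counts, as in the following sketch. The counts used in the example are made up for illustration and are not taken from Table 5; the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, precision, and F1 from counts.

    Ratios with a zero denominator are reported as 0.0 in this sketch.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)           # recall, true positive rate
    specificity = ratio(tn, tn + fp)           # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    precision = ratio(tp, tp + fp)             # positive predictive value
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for 200 ads (not the study's results)
print(classification_metrics(tp=60, fp=20, tn=100, fn=20))

# A classifier that labels every ad positive: perfect recall, zero specificity
print(classification_metrics(tp=80, fp=120, tn=0, fn=0))
```

The second call reproduces the degenerate case discussed above: sensitivity is 1.0 and specificity is 0.0, while precision falls to the base rate of positives (0.4 in this made-up sample).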
In the original study by Tong et al. [11], the best model achieved 80% accuracy with an F1 score of 0.67. Comparative metrics for all models from the present study, as well as the Tong et al. model [11], are shown in Table 4. The best-performing models in terms of accuracy are Models 5, 6, and 7. For recall (sensitivity), the NN + RNN (Model 6) slightly outperforms Model 5 (ENET + RNN). The most precise model is Model 3, the RNN model. Model 1 (NN) and Model 2 (ENET) are not robust enough for classification, as both have recall measures below 0.50 and F1 scores below 0.51. Building on the work of Tong et al. [11], we were able to improve accuracy (by 2 percentage points), precision (7 points), and the F1 score (3 points) by using the ENET + RNN model alone without having the original imagery. The NN + RNN + ENET model improved accuracy by 2 percentage points and precision by 6. Model estimation time ranged from 1.333 min (NN) to 32.95 min (NN + RNN + ENET). Table 5 provides the confusion matrix for the complete model (Model 7). The precision, recall, and accuracy metrics suggest a reasonable model.
The precision of models is important for law enforcement, as a lack of precision is associated with wasted effort: precision is the proportion of cases identified as positive that truly are associated with ST. For law enforcement officers, time is limited. Targeting all potential sites simultaneously is infeasible, so some sort of prioritization is required. Thus, we sorted the model predictions by probability (rounded to three decimal places) in descending order, so that the highest estimated probability of ST was ranked first, the second highest was ranked second, etc. We then evaluated the performance of all models at two cutoff points: the highest-ranked probabilities for at least 10 observations and then again the highest-ranked probabilities for at least 50 observations. (Many classification probabilities occur more than once, so the exact number of samples within each group differs.) The results in Table 6 show that the best model, the ENET + RNN multi-input model, was able to correctly identify all 14 of the observations associated with an estimated probability of 0.845 or greater (1.00 precision). The second-best model, the NN + RNN, achieved 96% precision on 23 observations based on a model probability estimate of 0.976 or higher. The complete model (NN + RNN + ENET) achieved 0.9554 precision with 202 observations (a model classification probability estimate of 0.90 or better).
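The ranking procedure described above can be sketched as follows. This is one reading of the description, not the authors' code: the function name is hypothetical, the example probabilities and labels are invented, and the handling of ties (extending the group to every ad that shares the cutoff probability) is an assumption consistent with the note that group sizes differ.

```python
def precision_at_top(probs, labels, min_count):
    """Precision among the top-ranked ads, keeping all ties at the cutoff.

    Probabilities are rounded to three decimals, the cutoff is the
    min_count-th highest value, and every ad at or above it is selected.
    Returns (cutoff probability, number of ads selected, precision).
    """
    rounded = [round(p, 3) for p in probs]
    ranked = sorted(rounded, reverse=True)
    k = min(min_count, len(ranked))
    threshold = ranked[k - 1]
    selected = [y for p, y in zip(rounded, labels) if p >= threshold]
    return threshold, len(selected), sum(selected) / len(selected)

# Invented scores for six ads; label 1 = rated as likely ST
probs = [0.95, 0.95, 0.90, 0.90, 0.90, 0.40]
labels = [1, 1, 1, 0, 1, 0]

# Asking for at least 3 ads returns 5, because three ads tie at 0.90
print(precision_at_top(probs, labels, min_count=3))  # -> (0.9, 5, 0.8)
```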

Discussion
This study has shown that recent advances in deep learning for classification allow us to more accurately and precisely identify online escort ads that may be associated with sex trafficking activity. High-precision models are particularly favored because they avoid wasted effort by investigators with limited time; the complete multi-input model (NN + RNN + ENET) developed here achieved 77% precision (as compared to the original 71% precision reported by Tong et al. [11]), and this increased to almost 96% when only the ads associated with the highest positive classification probabilities (>0.90) were considered. Other model metrics for this complete model were comparable to Tong et al.'s [11], demonstrating that the increased precision was not associated with a trade-off deterioration in other metrics.
These results are based on the analysis of texts, emoticons, and emojis. Unfortunately, the advertisements' photographic images could not be accessed and incorporated into the model. It is, therefore, possible (if not likely) that even better results could be obtained if the images of the ads were available for analysis.
As with any other research study, this one suffered from certain shortcomings. The Trafficking-10k dataset is aging, so the results reported here would need to be replicated using newly harvested data. While manually labeling ads can be time-consuming and expensive, automatic classification based on widely accepted indicators of sex trafficking activity (e.g., movement of sex providers, apparently minor providers) may be performed in its place (see [9]). Further, the multi-input model developed was complex, which yielded greater accuracy but obscured its theoretical underpinnings. Future research could develop and test a more theoretically sound model, then fit neural nets to the residuals to increase the model's accuracy, using expert ratings to annotate the ads.
Although the binary reclassification of the original outcome labels in Tong et al.'s [11] Trafficking-10k dataset could be perceived as a disadvantage, as it would arguably lead to a loss in granularity, our best accuracy score was comparable to those reported by [13], who applied ordinal regression neural network (ORNN) models to this same dataset. The advantage of our binary outcome model is that it estimates the probability of a given online escort ad being associated with ST, which is much easier to interpret than coefficients or estimates from an ordinal model. Such functionality could then be productionalized to allow criminal investigators to identify the ads with the highest probability values. This would allow law enforcement to prioritize such ads, which we have shown to have precision scores as high as 96-100%.

Conclusions
Deep learning binary classification models hold much promise in increasing the efficacy with which law enforcement could identify online escort ads that are potentially associated with ST. Any increase in model performance can translate into a more efficient use of limited public safety resources. A low-cost approach that optimizes identification and investigation efforts could also make productionalized tools more accessible. Multi-input models benefit from the collective strength of each respective model from which they are composed while mitigating individual weaknesses.