Combining Text and Images for Film Age Appropriateness Classification

We combine textual information from a corpus of film scripts and the images of important scenes from IMDB that correspond to these films to create a bimodal dataset (the dataset and scripts can be obtained from https://tinyurl.com/se9tlmr) for film age appropriateness classification, with the objective of improving the prediction of age appropriateness for parents and children. We use state-of-the-art deep learning image feature extractors, including DenseNet, ResNet, and Inception.


Introduction
The question "Is this film appropriate for my children of X years of age?" frequently arises in parents' minds. Until now, the age-appropriateness of films has been determined by censorship bodies in the form of age rating certificates. In the United States and the United Kingdom, these certificates are issued mainly by two organisations: the Motion Picture Association of America (MPAA) in the United States and the British Board of Film Classification (BBFC) in the United Kingdom. The two "censorship" bodies base their ratings on the film content and provide descriptions for each certificate. The different ratings for the US and the UK and their interpretations can be found in Table 1. The BBFC defines its classification as "the process of giving age ratings and content advice to films and other audiovisual content to help children and families choose what's right for them and avoid what's not." The classification is, in principle, based on the content of the films. As a result, we hypothesise that it is possible to use automatic methods to perform the classification. This, in turn, would improve the consistency and productivity of the classification process, among other things. An automatic classifier would also provide insights into differences in the perception of appropriateness across countries and decades: if a machine classifier trained on data from one decade performs differently on data from another decade, we could infer that human perceptions have shifted, since texts and images that the machine judges similar are now rated differently by human classifiers. The contribution of factors such as the country of the censor board, the time the film was produced, and the quantified content of violence or explicit material could also form the basis of various studies in Digital Humanities and Computational Social Science.
While not the main focus of this research, such aspects could be very important to the understanding of the making, reception, and perception of films in different times and cultures.
Previous research indicates that, using the textual content of films alone, it is possible to build classifiers that perform fairly accurately for various aspects of a film [13,9,8]. Mohamed and Ha [10] compiled a dataset of film scripts and their age-appropriateness ratings, developed various classification models, and reported fairly good accuracies (79.1% for the American MPAA and 65.3% for the British BBFC) using TF-IDF values of character-based n-grams as features. In this paper, we investigate whether using image features extracted with state-of-the-art image feature extractors can improve classification performance further. From a human perspective, we know that vision adds more information and should thus improve classification accuracy, a fact also supported by machine vision research [11,15,12]. Our research focuses on whether the use of current state-of-the-art image feature extraction can improve automatic classification models. If it can, we then have further evidence that these image feature extraction methods capture abstract concepts such as age-appropriateness. We add images to Mohamed and Ha's dataset by using the Internet Movie Database (IMDB) to extract images associated with each film. We then use state-of-the-art image feature extractors to extract vectors representing the images, combine these vectors with textual vectors, and investigate the impact of the image feature vectors on the accuracy of the classifiers. The contributions of this paper are: (1) a bi-modal dataset combining images and texts for 17,000 films, and (2) experiments on the effect of combining state-of-the-art image features with textual features for age appropriateness classification. The rest of this paper is organised as follows: section two introduces the data and the methods used in the research; section three outlines the results and provides analysis, examples, and the confusion matrix, followed by the conclusion and plans and suggestions for future work.

Data and Methods
Mohamed and Ha's dataset was created using an INNER JOIN of two resources: film scripts and film certificates. Film scripts were obtained from the website www.springfieldspringfield.com, which unfortunately no longer exists. The files, available in HTML, were converted into text and run through a basic cleaning pipeline that involved transforming the utterances into proper sentences using the Spacy package [6]. Mohamed and Ha also removed non-dialogue elements from the scripts, such as scene descriptions and actor actions, a practice that we follow for two reasons: (1) these elements are not consistent across the film scripts, as many films do not have them, and (2) they are external to the film content proper. These scripts were combined with IMDB certificates, which indicate, for each film, the age for which the film is appropriate. These certificates may vary by country and cut. For example, the film "The Hobbit: The Battle of the Five Armies" has been rated both PG-13 and R in the United States, depending on the cut. We use the main certificate listed on IMDB for each of the UK and the USA, which in the case of this Hobbit film is 12A for the UK and PG-13 for the USA. We then collected IMDB images, which accompany and characterise the main scenes of a film, by downloading the images in the photo gallery of each film. The number of images accompanying each film description on IMDB is limited, and we understand that this limitation will affect the accuracy of prediction. We nonetheless hypothesise that, even with the limited number of images, the combination of text and images will lead to better classification performance, since text alone can be ambiguous. The combination of the two modalities would then contribute to the disambiguation of otherwise difficult-to-interpret textual content, and will thus lead to better classification accuracy.
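The INNER JOIN described above can be sketched as follows. This is a minimal illustration using pandas; the column names (`title`, `script`, `us_cert`) and the toy data are assumptions for the sketch, not the authors' actual schema:

```python
import pandas as pd

# Toy stand-ins for the two resources: film scripts and IMDB certificates.
scripts = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "script": ["dialogue of A ...", "dialogue of B ...", "dialogue of C ..."],
})
certificates = pd.DataFrame({
    "title": ["Film A", "Film C", "Film D"],
    "us_cert": ["PG", "R", "PG-13"],
})

# INNER JOIN: keep only titles that have both a script and a certificate.
dataset = scripts.merge(certificates, on="title", how="inner")
print(dataset)
```

Titles present in only one of the two resources (Film B, Film D) are dropped by the inner join, which is why the final dataset is smaller than either source.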
The IMDB certificates are used as labels, in what is known as distant annotation. The BBFC website explains that it uses two raters for each film and that, when there is a dispute, a third, more experienced rater steps in. This is very similar to human linguistic annotation. We do not know the inter-rater agreement and are thus unable to determine the ceiling of human performance. Similar to [10], we use the following upper bound and baselines. The upper bound: IMDB hosts certificates from 70 countries around the world. The upper bound takes all these certificates as predictor variables and the certificate of the country in question as the target; to predict the UK certificate, for example, we use all the other countries' certificates as features. This method achieves accuracies of 84.7% and 80% for the US and the UK respectively (OtherCts in Table 5). Both experiments were performed using XGBoost, our best classifier for this task. The baselines for this paper are 55.0% and 41.8% for the USA and the UK respectively, representing the majority classes ("R" for the US and "15" for the UK).
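The majority-class baseline can be sketched in a few lines; the toy label list below is an assumption for illustration, while the paper's actual figures (55.0% and 41.8%) come from the full dataset:

```python
from collections import Counter

def majority_class_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Toy US-style labels; in the paper the majority class for the US is "R".
labels = ["R", "R", "PG-13", "R", "PG", "R", "PG-13", "R"]
label, acc = majority_class_baseline(labels)
print(label, acc)  # R 0.625
```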

Dataset Description and Statistics
The dataset comprises 17018 titles. The transcripts of these titles contain a total of 181 million words. USA certificates are available for 8923 titles and British certificates for 10920 titles; 7068 titles have certificates from both countries. The mapping between the UK and the USA ratings is not one-to-one: a classifier that uses the UK ratings to predict the USA ratings would only achieve an accuracy of 80.6% (SingCt in Table 5). For each title in the dataset, we download the images that belong to the title's IMDB gallery, excluding images that are not part of the film itself, for example those whose captions include descriptions such as "X at an event to promote the film Y". A total of 429050 photos have been collected, with an average of 46.94 photos per title. The average number of photos per title for each certificate rating can be found in Table 1. We use the same train (70%), test (20%), and dev (10%) subsets as [10].

a) Texts:
Mohamed and Ha tried a variety of classification methods from both traditional machine learning and artificial neural networks. They concluded that the best setting is to use character n-gram TF-IDF values as features and XGBoost as the classifier, achieving an accuracy of 79.1% when predicting USA certificates and 65.3% when predicting British certificates. We have replicated their experiments and reached the same results using textual features. The next section combines these textual features with image features and also explores the use of images alone in film age appropriateness classification.
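The text-only pipeline can be sketched as follows. The toy scripts and labels are purely illustrative, and a scikit-learn logistic regression stands in for the XGBoost classifier used in the paper, only to keep the sketch self-contained:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy film "scripts" and age ratings (illustrative only).
scripts = [
    "family picnic sunshine laughter puppy",
    "blood revenge gun shooting violence",
    "school friends homework adventure",
    "murder blood knife scream violence",
]
ratings = ["PG", "R", "PG", "R"]

# Character n-gram TF-IDF features, as in Mohamed and Ha's best setting;
# the (2, 4) n-gram range here is an assumption for the sketch.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(scripts, ratings)
print(model.predict(["gun blood violence"]))
```

In the actual experiments the TF-IDF vectors are very high-dimensional, which is one reason tree-boosting classifiers such as XGBoost were preferred.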
b) Images only: Recent advances in machine vision have produced models that almost surpass human performance in image object recognition tasks, specifically the ImageNet challenges. The information needed to distinguish between the 1,000 classes in ImageNet is also often useful for distinguishing between new kinds of objects. Such information can be harvested from the outputs of the penultimate layers of models originally trained to distinguish between all the classes in ImageNet. We use these outputs as our image feature extractors. Specifically, we use NASNetMobile [18], Dense169 [7], InceptionV3 [14], ResNet152V2 [5], and NASNetLarge [18]. These models represent the state-of-the-art in image object recognition (Table 2). Keras implementations of these models are used. For each of the images, we produce a feature vector; we then pool the feature vectors of all the film's images and use the resulting vectors as input for certificate classification. We try mean, median, and max pooling, and find mean pooling to be the best. We also try dimension reduction methods such as PCA as pooling methods, and find that they too are inferior to mean pooling. We also produce an ImageConcat vector, which is the concatenation (stacking the vectors horizontally) of the pooled vectors produced by the individual feature extraction models.
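The pooling and concatenation steps can be sketched with NumPy. Here random arrays stand in for the per-image feature vectors that would come from the penultimate layers of the Keras models; the feature dimensions match the standard InceptionV3 (2048) and DenseNet169 (1664) outputs, while the image count is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-image feature vectors of one film from two extractors:
# 12 images x 2048 features (InceptionV3) and 12 x 1664 (Dense169).
inception_feats = rng.normal(size=(12, 2048))
dense_feats = rng.normal(size=(12, 1664))

# Mean pooling: one vector per film per extractor.
inception_pooled = inception_feats.mean(axis=0)   # shape (2048,)
dense_pooled = dense_feats.mean(axis=0)           # shape (1664,)

# ImageConcat: stack the pooled vectors horizontally.
image_concat = np.concatenate([inception_pooled, dense_pooled])
print(image_concat.shape)  # (3712,)
```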
Film age classification using images may not be as easy as it sounds. The reason is that in an R-rated movie, most of the images may be innocent, the equivalent of PG-rated, and only some may contain violence or explicit references. This poses an even bigger challenge for our experiments, since we use only the set of images provided by IMDB, which, for various reasons, may not contain the most violent or explicit images in the film. It is thus useful to check the per-category accuracy of using only the images on a balanced dataset. To build our balanced dataset, we first choose titles for which we have at least 40 images. We then build a balanced training set of 450 titles per rating, choosing 40 random images for each title. Similarly, from the titles that have at least 40 images in the test set, we choose 150 random titles per rating and pick 40 random images for each; these form our test set. We perform this experiment only for the USA certificates, and only with three ratings: PG, PG-13, and R. The two other categories have not been used because the small number of films in those categories makes it impossible to balance them. In experiment ImagePool, we pool the feature vectors of all the film's images first and then classify the pooled vectors, while in ImageIndividual, we classify individual images, count how many times the images belonging to a film have been classified as each rating, and take the rating with the highest count as the predicted rating for the film.
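The ImageIndividual voting scheme can be sketched as follows; the per-image predicted labels below are made up for illustration:

```python
from collections import Counter

def film_rating_by_vote(image_predictions):
    """ImageIndividual: each image is classified separately, and the most
    frequent predicted rating becomes the film's predicted rating."""
    return Counter(image_predictions).most_common(1)[0][0]

# Hypothetical per-image predictions for one film's 10 IMDB images.
predictions = ["PG", "PG-13", "PG-13", "R", "PG-13",
               "PG-13", "PG", "R", "PG-13", "PG-13"]
print(film_rating_by_vote(predictions))  # PG-13
```

This illustrates why a mostly-innocent R-rated film can be outvoted into a milder category: a few violent images are easily swamped by many innocuous ones.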
c) Text and Images combined: For each title, the character-based n-gram TF-IDF vector and the image vector are concatenated into a single vector and fed into XGBoost. Other classification algorithms, such as Random Forests and Logistic Regression, have also been experimented with, but their results are inferior to those of XGBoost. While TF-IDF is not usually thought of as comparable to word embeddings, Mohamed and Ha's experiments show that in this specific case, word embeddings (from BERT and ELMo) were not as good as this traditional method; in the experiments we ran, word embeddings likewise did not produce good results. The use of XGBoost was also beneficial in other ways. Since the TF-IDF vector is very large, corresponding to the vocabulary size of X words, neural network implementations in Keras and PyTorch did not scale well, unlike XGBoost and similar algorithms that can deal with a large number of textual features.
d) Evaluation metrics: For the balanced image experiment, we use the standard precision, recall, f-measure, and overall accuracy. For the other experiments, we use the standard metrics of accuracy and the Area Under the Curve of the Receiver Operating Characteristic (AUC), which incorporates the trade-off between the true positive rate and the false positive rate. Two settings for the evaluation of accuracy are used: strict accuracy (Acc in Table 5) is the normal accuracy, while under relaxed accuracy (RelaxAcc), a prediction that is either the same as, or only one age rating higher or lower than, the true certificate is considered correct. While relaxed accuracy is in common use in machine learning, it is especially important in the context of film ratings due to the differences among countries; the relaxed evaluation thus mirrors the state of the dataset.
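Relaxed accuracy can be sketched as follows, using the ordered list of MPAA ratings; the toy predictions are illustrative:

```python
# MPAA ratings ordered from youngest audience to oldest.
RATINGS = ["G", "PG", "PG-13", "R", "NC-17"]
INDEX = {r: i for i, r in enumerate(RATINGS)}

def relaxed_accuracy(y_true, y_pred):
    """A prediction counts as correct if it is at most one rating away
    from the true certificate."""
    hits = sum(abs(INDEX[t] - INDEX[p]) <= 1 for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = ["PG", "R", "PG-13", "G"]
y_pred = ["PG-13", "R", "R", "R"]  # off by 1, exact, off by 1, off by 3
print(relaxed_accuracy(y_true, y_pred))  # 0.75
```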
There has been previous work on combining texts and images for downstream tasks. Chen and Zhuge [2] combine text and image information to generate a multimodal summary comprising images and their captions. Rafkind et al. [11] combine text and image features to classify images in bioscience literature. Taniguchi et al. [15] and Sakaki et al. [12] combine text and image classifiers to identify the gender of Twitter users: they classify the images first, then pool the image classifications to classify the users, whereas we pool the image feature vectors first. We tested the former method (classification then pooling) and found it inferior to pooling first (for US certificates with images only, classifying individual images first yields 59% accuracy compared to 62% for classifying pooled vectors). Generating captions from images has also attracted attention recently [17,4,3]. Ailem et al. [1] learn textual and visual representations jointly, which leads to competitive performance on pairwise word similarity and image/caption retrieval tasks.

Results & Analysis
Tables 3 and 4 present the results of the two experiments on a balanced dataset of the categories PG, PG-13, and R. We can see that when the data is balanced, it is easiest to classify PG, then R, then PG-13. This may be because PG-13 is a confusable category that has elements of both PG and R. In a PG film, one does not expect to see images of violence, gore, or sex, making the category more consistent, whereas in R films innocent images may also be found; hence the easier classification of PG vs. R films. To give some examples of the classifications assigned by our image classifier versus the true category, Figure 1 shows a number of images predicted as PG-13 together with the true category of the film they come from. For example, the third image in the first row, which comes from the R-rated film "Courage Under Fire" (1996), has been classified as PG-13. From a human perspective, the image does not show any violence or explicit material.
Table 5 shows the results of our experiments, and Figure 2 shows the confusion matrices. When image feature vectors are combined with text vectors, the performance of the classifiers approaches or surpasses that of using ratings from one country to predict those of another (SingCt in the table). Around 95% or more of the predictions are within one rating of the correct ones. Despite its incomplete nature, visual data, in the form of extracted feature vectors, does help improve the accuracy of age rating certificate prediction when combined with TF-IDF. Only InceptionV3 shows statistically significant improvements in prediction accuracy for both the USA and the UK. Other image feature extraction models provide statistically significant improvements for either the USA (Dense169) or the UK (NASNetMobile, ResNet152V2, NASNetLarge, and ImageConcat). Using visual data alone, ImageConcat provides the best results for both countries. Given that the certificate categories follow a certain order with respect to age appropriateness, we have experimented with regression models such as Random Forest regression and XGBoost regression, and found them not to be as good as the classification models (73.1% vs 81.1% for the USA and 58.1% vs 68.1% for the UK). We have also tried an ordinal regression method [16], which does not assume that the distances between two consecutive classes are constant (as normal regression methods do), to take advantage of the fact that age-appropriateness is progressive, i.e. films suitable for a 12-year-old should also be suitable for a 15-year-old. The results are slightly worse than those reported here with regard to accuracy (79.2% vs 81.1% for the USA, 67.8% vs 68.1% for the UK), but slightly higher with regard to relaxed accuracy (97.4% vs 97.0% for the USA and 97.0% vs 95.2% for the UK).

Conclusion and future work
We have conducted experiments with the target of predicting the age rating of films from images, and from combinations of text and images. Our experiments included ones on a general corpus as well as limited experiments on a balanced subset geared towards examining the errors produced by the classifier. Our results indicate that the combination of images and text is better than either alone, reaching an accuracy comparable to that of using ratings from one country to predict the ratings in another, despite the fact that we use only a very limited subset of the images that could potentially be used for such a task.
Our future work will focus on two aspects: (1) investigating the use of the whole video and audio of a film in age rating classification. We believe that with such an amount of data, we can produce results that are on par with, if not more accurate than, those produced by censorship bodies. Once we reach the point where we can quantify the distribution of such material in a film, we will (2) conduct computational social science analyses of the distribution of sex and violence in films and its relationship to cultural and country-based differences, using not only the textual and audiovisual data but also the reports provided by parents on film contents. Both future directions are related to our desire to conduct responsible Computational/Digital Humanities research.