Future Generation Computer Systems

Social media data represent an important resource for behavioral analysis of the aging population. This paper addresses the problem of age prediction from Twitter data, where prediction is treated as a classification task. For this purpose, we devise a model based on a Convolutional Neural Network (CNN) that relies on language-related features and social media specific metadata. More specifically, we introduce two features that have not been previously considered in the literature: the content of URLs and the content of hashtags appearing in tweets. We also employ distributed representations of words and phrases present in tweets, hashtags and URLs, pre-trained on appropriate corpora, in order to exploit their semantic information in age prediction. We show that our CNN-based classifier, when compared with baseline models, yields an improvement of up to 12.3% for the Dutch dataset, 9.8% for the English1 dataset, and 6.6% for the English2 dataset in the micro-averaged F1 score.


Introduction
In the digital era, social media has become a ubiquitous part of daily life: users constantly interact with Facebook, Twitter, Snapchat and other social media platforms, sharing their experiences and opinions on various topics. The availability of many social media datasets (e.g., Twitter, public Facebook pages and blogs) offers social scientists the opportunity to study psychological and social questions at an unprecedented scale [1]. For instance, social media has been employed in stock market prediction [2], oil price prediction [3], health monitoring [4], disaster management [5], forecasting box-office revenues for movies [4], inferring national mood throughout the day [6], and measuring behavioral risk factors [7], among others.
On the other hand, the open access of many social media platforms has made it possible for people of every age to become authors and readers without any formal restriction. This has created an ideal environment for online predators to gain access to sensitive user-related information, which puts the internet activities of many vulnerable communities (e.g., kids, teenagers, females) at risk. Therefore, automatic identification of age groups from social media posts would offer an edge to crime prevention as well as to many other activities, e.g., online tutoring, personalized advertisement, content personalization, and (intelligent) plagiarism detection to identify, for instance, whether homework was performed by the student or by another person. Besides, over the last decade, several large-scale corporations, e.g., Amazon and Apple, have invested massively in human factors research that seeks to comprehend consumer behavior and predict consumer retention based on the user's activity and registered profile [8,9], where age group plays a key role in inferring the community membership of the user. Nevertheless, in the absence of supporting biographical information, as is the case in many social media platforms, inferring age attributes from textual posts only is rather challenging. Sociolinguistic theory advocates a strong connection between language use and age, or more generally between discourse and identity, where users employ language patterns to construct their identity [10]. In this context, age is considered a social and fluid variable that is shaped by the societal context, the culture, individual experiences and social roles [11].
In the area of computational linguistics, there is an inherent interest in determining latent attributes of an author, including categorical age. A variety of published work has focused on linguistic analysis of author age in online and offline texts, much of which relies on lexical and contextual clues, such as topic, genre or n-gram patterns. For instance, [12] analyzed online behavior associated with blogs (which usually have more comprehensive content than tweets) and found that behavior ( etc.) could effectively be used in binary age classifiers. [13] identified a set of writing and speech patterns, referred to as age-grading, that change over time as a person learns a language and develops socially. This suggests that the assumption of text-based age prediction is tenable, although no universally accepted solution has emerged. In particular, it is widely debated whether age patterns can be elicited from short social media posts. [14] argued that at least 10000 words per author are needed (in [15], an estimate of 5000 words per author is given) in order to infer reasonable age patterns. This order of magnitude is unavailable for most microblog posts, e.g., Twitter and Facebook messages, where a high proportion of noisy terms is expected as users shorten their messages, which, in turn, challenges standard natural language processing tools [16,17]. In an attempt to tackle the above challenge, several works have investigated fine-grained specialized natural language processing tools that deal with microblog posts, the construction of large-scale corpora that include a variety of noisy terms and abbreviations, and, on that basis, sound machine learning architectures for age categorization and prediction.
[18] developed pre-processing techniques that normalize orthographic modifications as well as Twitter-specific elements (@-usernames and #hashtags), translating noisy Twitter texts into standard English; these were found to work well in classification problems. [19] used the pre-processing techniques of [18] and [20] in order to identify types of creative lexical transformations resulting in out-of-vocabulary (OOV) tokens in Twitter messages, which are then employed to differentiate between users' biographical attributes. A multi-corpus approach was investigated by [21], which included a transcribed telephone speech corpus and posts from a breast-cancer online forum. A domain adaptation approach was then used to train a model on these corpora and separate the global features from the corpus-specific features that are associated with age. [22] constructed predictive lexica from a dataset of Facebook users who agreed to share their status updates and reported their age and gender. Age prediction has been widely considered as a classification problem to which machine learning techniques are applied. For instance, [23] considered age as a latent attribute of a Twitter user and employed SVM classification to estimate the age class.
Moreover, the author profiling tasks of the computational linguistics competitions PAN 2013 and PAN 2014 [24,25] have attracted growing interest in the research community, where the participating teams proposed methodologies that vary widely in terms of pre-processing, choice of feature set and classification methods. In particular, in PAN 2014, the age prediction task was defined as a challenging multi-class prediction problem with five classes (18-24, 25-34, 35-49, 50-64, 65+). Notable observations from these works include the relative lack of predictive utility of n-gram based models, as well as the high level of accuracy achieved by a group using class similarity based features [26].
In summary, the preceding highlights two key findings. First, eliciting age from an author's posts is tenable from both sociological and computational linguistics perspectives. Second, the optimal configuration of a generic estimation architecture remains an open issue, which motivates further work on this matter. Previous work [27][28][29][30][31][32][33][34] has identified several features based on both language and social media specific metadata that are relevant for the age prediction task, including bag-of-words, linguistic features, stylometric features, profile features (e.g., background color, profile image), social network, and preferences (e.g., liked tweets). Similarly, many approaches based on multi-class classification have been proposed, such as SVMs [23,35], logistic regression [12,27] and Naive Bayes [36].
As [28] pointed out, the problem of age estimation is very challenging due to the inherent variability of human language together with the unstructured text of tweet messages, which raises the question of finding appropriate cues that elicit categorical age and the subsequent reasoning. The above studies reveal at least three key limitations and challenges. First, the variety of contexts and discourses poses serious challenges to the interoperability of linguistic features from one study to another. For instance, the application of the comprehensive WWBP linguistic database [29], which was developed using a Facebook dataset, to Twitter has been shown to be unsatisfactory. Second, although many of these studies acknowledge the importance of metadata in social media posts, they do not explicitly make use of the content of this metadata. Third, the variety of estimation architectures, ranging over the type of pre-processing, the number and type of features employed, and the machine learning or estimation algorithms, testifies to the need for further research on the issue. This motivates the current paper, which focuses on Twitter data and builds on previous work on age prediction by relying on language-related features and social media metadata to classify users into age groups, thus considering age as a categorical variable. In particular, the following contributions are highlighted.
1. Motivated by its sound theoretical convergence properties and good performance in text-mining related applications [37][38][39][40], we adopt a CNN-based classification model that integrates heterogeneous features for age-category classification.
2. We introduce a new feature set based on the content of hashtags and URLs. To the best of our knowledge, there is no previous work using the content of hashtags and URLs for age prediction. Although they have been considered as social media specific features (i.e., their number has been included in [27,30]), their content has not been considered. We contend that hashtags and URLs in tweets are indicative of a user's age since they reflect the user's interests and activities [41]. We propose a novel method to derive relevant features from hashtags and URLs and incorporate these into our CNN-based classification model.
3. In order to tackle the lack of scalability and interoperability, we introduce enhanced semantics through pre-trained word embeddings. Unsupervised learning of distributed representations (word embeddings) obviates the need for careful feature engineering, and such representations are richer in semantic information than standard bag-of-words [42][43][44]. We propose to employ word embeddings for tweet texts, for the title text of the pages referred to by the URLs in the tweets, and for the words most frequently co-occurring with each hashtag in the tweet. Furthermore, we pre-train word embeddings on different corpora to take the context into account. More specifically, the word embedding vectors used for tweet texts and hashtags are pre-trained on a large collection of tweets, and those used for URLs are pre-trained on blogs/news.
4. We compare our model with some state-of-the-art estimation algorithms (SVM, logistic regression, random forest) in order to demonstrate the feasibility and good performance of our proposal.
We employ three existing datasets in two different languages (i.e., English and Dutch) to test the validity of our novel features and classification method against the versatile classification models SVM, Logistic Regression, and Random Forest as baselines, and show that our CNN-based model outperforms the baselines significantly. While our approach is easily extendable to other demographic variables such as gender and ethnicity, it is also adaptable to other related problems in social media mining, ranging from online mental health surveillance to social spammer detection and to assisting recommendation systems in finding similar users on social media. The rest of the paper is organized as follows. In Section 2, we discuss related work. Section 3 describes our approach to extracting features for age prediction of Twitter users and our CNN-based model. Section 4 explains our experimental setup and results compared with baseline approaches. Finally, in Section 5, we present our conclusions.

Related work
In the context of Twitter-based age estimation/prediction, one can distinguish at least two streams of research. The first, sometimes referred to as age annotation, deals with the construction of datasets that reliably relate an author's posts to his or her categorical age. The second emphasizes the design and implementation of the software architecture that enables age estimation from a pre-processed dataset.
Indeed, labeled demographic data, including age data in particular, are not systematically collected by Twitter when users set up new accounts. Even when age is occasionally reported by users, it is often not correct, which stresses the need to use external resources or direct/indirect system query-based approaches. For instance, [31] employed the Twitter API to identify Twitter accounts that had tweets about birthdays mentioning the person's age (e.g., ''Happy XX birthday YY'') or individuals who sent birthday wishes (e.g., ''Wishing @xxxxxx a happy XX birthday''). This ultimately enabled linking the underlying Twitter user with the mentioned age. [27] inferred age estimates by adjoining LinkedIn profiles for youth who tweeted about a particular grade level in school. [45] inferred the age attribute by looking at first names cross-referenced with baby name frequency data from the Social Security Administration. [46] advocated the use of proxies and/or associated metadata in order to derive demographic information of Twitter users. [47] put forward a distributed digital social research platform, referred to as the collaborative online social media observatory (COSMOS), that provides on-demand analytics, including the age attribute, from the Twitter stream by correlating it with other datasets and events. Other work focused on the profile picture of the Twitter user: assuming the picture is genuine and includes a clear face portrait, [48] used Face++, a free facial recognition service that can estimate a user's age within a 10-year span.
In the second stream of research related to software architecture for age prediction, the developed approaches vary according to type of pre-processing, input features, choice and number of age categories, supervision versus non-supervision scheme, type of supervision algorithm and validation strategy employed.
Pre-processing of Twitter messages makes it possible to filter out the abundant noise present in social media, to normalize orthographic modifications, and to translate the various slang and SMS-like vocabulary into semantically meaningful text. Input features employed for age prediction range from linguistic cues to network information (e.g., number of friends, ratio of followers to friends) and user profile information (e.g., background picture, text color). Possibly due to the difficulty of processing network information in real time and the unreliability of profile information, together with advances in linguistics that distinguish the language use of childhood, adolescence and adulthood, the large majority of reported works employ linguistic features. In this respect, one of the most notable works was carried out as part of the World Well-Being Project (WWBP) [29], which advocated an open-vocabulary analysis framework that links a series of individual words, phrases, and topics emerging from open text. The authors of [29] showed clear distinctions across four age categories (ages 13-18, 19-22, 23-29, 30-65), highlighting (I) the greater use of emoticons and slang among younger groups and (II) the developmental progression of individuals through different life stages (e.g., school, college, career, marriage, children, family).
Social media content, including Twitter, also exhibits medium-specific features (i.e., metadata) whose use differs with age, as is the case for sharing links and/or images and for tagging/hashtag use. [34] show that incoming communications from an individual's strong ties are more revealing of the individual's identity, but that this relationship only holds for publicly visible aspects of the identity. The authors of [49] studied blog posts and showed that there is no trend in image sharing, but that there is a gentle increase in the usage of URLs in posts with respect to age: apart from an inexplicable peak at the age of 24, link sharing increases with age, with users older than 35 posting the most. This result on URLs is also supported by [50] and [51], who continued research on blogging and found that the sharing of links increases with age. On the other hand, the work in [27] demonstrated a sharp rise in the use of links for Dutch Twitter users in their 20s, which stagnates in their 30s. They associated this finding with information sharing and impression management. [12] showed that the use of links and images in their blog data varies across all ages. In the case of hashtags, [27] found that hashtags are used more often by older Twitter users: low usage in the teens, a steep climb in the 20s, and high, continuous use from then on up to the oldest category of participants (over 60 years of age). According to these authors, hashtags are, similarly to links, connected to the sharing of information, and older tweeters are apparently more concerned with information sharing than younger users. Younger people seem to display a certain kind of online identity, something older people are less concerned with [52].
Computational work on age prediction has exploited these differences in the use of URLs and hashtags across age groups, but has not considered the content associated with them, which also reflects age-related use. For example, [49] researched the behavior of two groups on Instagram. First, they found that the adult group (25-39) displays a wide and diverse range of interests in topics: arts/photos/design, locations, mood/emotion, nature, social/people. Second, the majority of the teens' (13-19) hashtags concern mood/emotion and follow/like. [41] concluded that hashtags are an important feature to discriminate age, since older adults above 67 mainly use hashtags related to politics and leisure on Twitter, while people below 55 mainly use hashtags in the context of work-related activities and technology. Table 1 summarizes the key results in terms of age prediction from online social media platforms (blogs, Facebook, and Twitter), highlighting the main features, the approach employed and the level of accuracy obtained. It is evident from Table 1 that all approaches in the previous research considered a bag-of-words (BoW) representation for features. Accuracy results on blogs are typically higher than on Twitter because more data are available per user. The highest accuracy (94.13%) was reported by [56] on the ICWSM 2009 blog dataset, utilizing a Naive Bayes model trained on content words, slang words and stylistic features. For Twitter users, [27] report a micro-averaged F1 score of 86.32% for classification into three age categories on their own dataset of 2,494 users. These results were obtained by Logistic Regression trained on unigram features. Table 2 presents a summary of previous work on the use of Twitter metadata for demographic attribute prediction. These approaches utilize Twitter metadata attributes such as friends/followers ratio, profile image, background color, etc. along with tweet content. However, only count values of hashtags and URLs were included, ignoring the content thereof.
It is evident from the literature review tables (ref. Tables 1 and 2) that, while previous research has exploited a rich set of features such as linguistic, stylometric, and social network features, attention to the following predictors of the age attribute is missing: (a) the content of hashtags and URLs embedded in posts (especially on Twitter), and (b) the use of richer semantic models (such as word embeddings) instead of a BoW representation.

Data
In order to classify a user into an age group, we need a dataset with (a) Twitter user ids and (b) the corresponding ages. A major problem for the age prediction task on Twitter is the limited availability of validated data annotated with the age of users. We use three datasets: two in English and one in Dutch. [27] sampled Dutch Twitter users in the fall of 2012. They employed external annotators to annotate the chronological age using information available through tweets, the Twitter profile and external social media profiles such as Facebook and LinkedIn. In total, over 3000 Twitter users were annotated. However, not all of these Twitter profiles are currently active, leaving us with the profile information of 2150 users, which we have included in our Dutch dataset. For the English corpus, we used the datasets from [30] and [46]. The dataset from [46] was created by applying pattern matching rules to the profile descriptions of Twitter users. Through a process of iterative testing and refinement, they derived three rules for age extraction using variations of the following phrases:
1. I am X years old
2. Born in X
3. X years old
where X can be a (typically) two-digit number or a date of the form DD/MM/YY or DD.MM.YYYY. Applying these three age-extraction rules to each user description field in the order presented above, [46] collected a dataset of 1470 users, of which we could find 1074 still active. The dataset from [30] was created by capturing self-reported and congratulatory birthday announcements and wishes using the search parameters ''Happy nth Birthday''. Birthday tweets for ages 13 to 50 were collected on August 22, 2014, September 29, 2014, April 2, 2015, and June 21, 2015. Each birthday tweet was manually reviewed to determine whether a user could be identified from the birthday message, to determine whether the declared age seemed reasonable (rather than a joke exaggerating the age of the user for comedic effect), and to exclude ''celebrity'' users whose content feed may be curated for promotional and endorsement reasons. Out of the 3184 labeled users, we could only find 1794 Twitter profiles active at the time of writing this paper.
More information about the three datasets can be found in Table 3. Despite the availability of these three age-labeled datasets from [27], [46], and [30], a direct comparison of our results with theirs on our re-created datasets is limited owing to (a) the differences in size, since some users' data are no longer available because of account inactivation, changes to privacy settings, etc., and (b) the unavailability of the actual tweets used by the original authors, since the published datasets contain only Twitter user ids and ages. In the rest of the paper, we refer to our re-created datasets from [27], [46], and [30] as Dutch, English1, and English2, respectively.
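The three profile-description rules attributed to [46] can be illustrated with a minimal Python sketch. The regular expressions below are our own assumption of what "variations of the following phrases" might look like; the exact patterns used in [46] are not given in the text, and the rule names and the function `extract_age_mention` are hypothetical.

```python
import re

# X: a two-digit number, or a date written DD/MM/YY or DD.MM.YYYY (as stated in the text).
X = r"(?P<x>\d{1,2}[./]\d{1,2}[./]\d{2,4}|\d{2})"

# Rules in the order given in the text; earlier rules take precedence.
# The concrete regexes are illustrative assumptions, not the patterns of [46].
RULES = [
    ("i_am_x_years_old", re.compile(r"\bI\s+am\s+" + X + r"\s+years\s+old\b", re.IGNORECASE)),
    ("born_in_x", re.compile(r"\bborn\s+in\s+" + X + r"\b", re.IGNORECASE)),
    ("x_years_old", re.compile(r"\b" + X + r"\s+years\s+old\b", re.IGNORECASE)),
]


def extract_age_mention(description):
    """Return (rule_name, matched X) for the first rule that fires, else None."""
    for name, pattern in RULES:
        match = pattern.search(description)
        if match:
            return name, match.group("x")
    return None
```

The sketch only returns the matched string X; converting a birth date into an age would additionally require a reference date, which the text does not specify.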
For each user, we collected the 200 most recent tweets. Although the Twitter API allows collection of up to the 3200 most recent tweets, prior studies have shown that examining more than 100 to 200 posts per user provides minimal gain in model performance when predicting user demographics [22,67]. Table 4 shows the distribution of the numbers of hashtags, URLs, and media in the tweets for each dataset in the different age categories. It is evident from the table that people in the age categories 20+ and 40+ frequently use hashtags in their tweets: more than half of all tweets from these age categories contain at least one hashtag. Similarly, almost one third of these users cite at least one URL in their tweets. With an average length of 8.5 words per tweet in our dataset, ignoring hashtags and URLs in tweets for age prediction therefore results in a loss of important information. Twitter allows up to 4 media files (photos, videos, or animated GIFs) to be included in a tweet. However, while some users actively make use of this feature, others do not use media in their tweets. We also notice that the age group of a user is not correlated with the number of media included in the tweet.

Feature engineering
We identified two broad categories of features for detecting age, namely language features and Twitter-specific features. Many of these features have individually been explored in the literature [27][28][29][30][31][32][33][34]. They are mainly derived from tweet contents (tweet text) and meta-information, such as the Twitter user profile, network, and activity, and have been used in conjunction with a supervised machine learning framework to create a model for age detection. Previous research has explored SVM, Random Forest, and Logistic Regression methods with these features. However, we show that our novel features engineered from hashtags and URLs, together with the transformation of features into distributed representations of words and phrases, incorporated in our CNN-based model, produce better results on age prediction than the state of the art.
All proposed features are investigated (see Section 4.2) using the features analysis technique to determine the best combination of features with the highest discriminative power. The proposed features are discussed in detail in the following subsections.

Language features
• Features from pre-trained lexica: Researchers in sociolinguistics have derived lexicons of words and phrases that correlate with different age groups. We use two such resources: EMNLP2014 [22], and WWBP [29]. We use logit transformed values from these lexicons.
• Sentiment scores as features: average number of tweets with positive/negative/neutral sentiment.

Twitter-specific features
• Tweet Features: earliest, latest, and average timestamp from among all 200 tweets of a user, number of geolocatable tweets of a user, number of tweets favorited, number of tweets which are in-reply-to, number of tweets which are re-tweets, number of user mentions, number of tweets with media (photo, video, or animated GIF), average number of media files per tweet.
• Twitter user profile features: account creation date, listedcount, verified or not, geo-enabled or not, status-count.
• Twitter social network features: number of friends, number of followers, ratio of the number of friends to the number of followers, number of friends or followers with directed tweet exchanges, number of friends that are also followers.
• Twitter hashtag features (new): words most frequently co-occurring with the hashtags used in the user's tweets (described in more detail in Section 3.3).
• Twitter URL features (new): words from the titles of the pages pointed to by the URLs in the user's tweets (described in more detail in Section 3.3).
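As an illustration of how some of the count-based Twitter-specific features listed above might be aggregated, the following Python sketch operates on tweets represented as dictionaries. It assumes the field names of Twitter API v1.1 tweet objects (`in_reply_to_status_id`, `retweeted_status`, `entities`, `extended_entities`); the function name and the handling of missing fields are our own assumptions and do not reproduce the authors' implementation.

```python
def tweet_metadata_features(tweets, friends_count, followers_count):
    """Aggregate a few of the Twitter-specific counts for one user.

    `tweets` is a list of dicts shaped like Twitter API v1.1 tweet objects;
    only the keys accessed below are assumed to be present.
    """
    n = len(tweets)
    # Number of attached media files (photo, video, GIF) per tweet.
    media_per_tweet = [
        len(t.get("extended_entities", {}).get("media", [])) for t in tweets
    ]
    return {
        "n_replies": sum(1 for t in tweets if t.get("in_reply_to_status_id") is not None),
        "n_retweets": sum(1 for t in tweets if "retweeted_status" in t),
        "n_user_mentions": sum(
            len(t.get("entities", {}).get("user_mentions", [])) for t in tweets
        ),
        "n_tweets_with_media": sum(1 for m in media_per_tweet if m > 0),
        "avg_media_per_tweet": sum(media_per_tweet) / n if n else 0.0,
        # Guard against accounts with no followers (assumption: ratio set to 0).
        "friends_to_followers": friends_count / followers_count if followers_count else 0.0,
    }
```

In the approach described in the text, such raw counts would subsequently be normalized before being merged with the other feature vectors.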

Method to extract features from tweet text, hashtags, and URLs
Our approach to including URL and hashtag content as novel features uses a deep learning approach that combines word embeddings and a convolutional neural network architecture in order to extract useful patterns. A general skeleton of the approach is described in Fig. 1. We hypothesize that such features can play a role in the age detection task from Twitter data since they carry semantic information that reflects the interests of Twitter users, which change with age. More specifically:
1. We extract the 200 most recent tweets, after discarding retweets, for each user to ensure a large enough dataset. We name this set the TweetTextSet.
2. We expand all URLs and hashtags in the TweetTextSet as follows:
Avoiding the computationally expensive option of finding basis vectors for the span of the word embedding vectors of HashtagCooccurringSet, we choose these three vectors as capturing the semantic space associated with the user's hashtags, which represents the user's interests and activities.
5. We fetch Twitter metadata (e.g., tweet timestamps, see Section 3.2.2 for more details) for each user.
6. We extract additional features such as linguistic, stylometric, and Twitter-specific features (see Section 3.2.2 for more details) from the metadata and TweetTextSet. These include the average number of media files in the user's 200 most recent tweets, and the average sentiment score of these tweets. We create a normalized vector of these additional features. For English tweets, we use Carnegie Mellon's TweetNLP suite to tokenize the tweets and to assign part-of-speech (POS) tags. For Dutch tweets, we use Frog for tokenization and POS tagging. For sentiment score evaluation per tweet, we use NLTK's implementation of the VADER model [68].
7. We incorporate the word embeddings from TweetTextSet and URLTitlesSet and the additional features vector, as described above, into a Convolutional Neural Network based model for classification (see Section 3.4 for further information).
Our approach to including URL and hashtag content as novel features in combination with word embeddings is summarized in Algorithms 1 and 2. To illustrate the feature construction process, let us consider the example of Twitter Id 123xxxyy from one of the datasets.
• We present in Table 5 the profile information, some of the user's 200 most recent tweets, and the tweet-specific metadata we fetch using the Twitter API. In this example, we show only 4 of the user's 200 most recent tweets, which have 2 hashtags and 1 URL altogether: #BikeToWork, #MothersDay, and the URL https://www.blog.google/topics/ai/ai-principles/.
• Next, for each of these 2 hashtags, we collect 1000 tweets from a time-window of [9.11.2017 -29.11.2017]. Table 6 shows some of these tweets.
• Following the process in Algorithm 2, we find the terms that most frequently co-occur with these hashtags in tweets. These are shown in bold-face letters in Table 6. For the hashtag #MothersDay, the most frequently co-occurring terms are: {another, mother's day, event, mum, mom}. For the hashtag #BikeToWork, they are: {health, work, kids, week, bike, office}. Together, for the two hashtags, the set {mom, event, work, office} represents the most frequently co-occurring terms.
• Next, we look up these terms in a word-embedding model pre-trained on a Twitter dataset (footnote 9) and obtain a set of vectors $H_{1 \ldots M}$; the min, max, and avg vectors are then computed as outlined in Section 3.3(4).
• We fetch the title of the page pointed to by the URL https://www.blog.google/topics/ai/ai-principles, which is ''AI at Google: our principles''. We look up these terms in a word-embedding model pre-trained on GoogleNews and obtain a set of vectors $U_{1 \ldots L}$.
• For each of the 200 recent tweets, after pre-processing and normalizing the length of a tweet to 30 terms, we look up the terms in a pre-trained word-embedding model trained
Footnote 9: https://github.com/loretoparisi/word2vec-twitter
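The hashtag step of the example above, finding the terms that co-occur most frequently with a hashtag, can be sketched as follows. Algorithm 2 itself is not reproduced in this section, so this is only a plausible reading under stated assumptions: the tokenizer, the small stop-word list, the once-per-tweet counting and the function name `cooccurring_terms` are all hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real system would use a fuller resource.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "for", "in", "on", "is", "my", "with", "at"}


def cooccurring_terms(hashtag, tweets, top_k=5):
    """Return the top_k terms appearing most often in tweets that contain `hashtag`.

    `hashtag` is expected to include the leading '#'.
    """
    tag = hashtag.lower()
    counts = Counter()
    for text in tweets:
        tokens = re.findall(r"#?\w+", text.lower())
        if tag not in tokens:
            continue
        # Count each term once per tweet; skip hashtags and stop words.
        terms = {tok for tok in tokens if not tok.startswith("#") and tok not in STOPWORDS}
        counts.update(terms)
    return [term for term, _ in counts.most_common(top_k)]
```

The returned terms would then be looked up in the pre-trained embedding model to obtain the vectors from which min, max, and avg are computed.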

Convolutional neural network model
Fig. 2 illustrates our model, inspired by [37,69] and based on a convolutional neural network (CNN), for predicting the age category of a Twitter user. Our model is innovative in two aspects:
• Two separate input channels receive the word embeddings of TweetTextSet and URLTitlesSet, and separate convolution filters of various sizes are used for each. This design is intended to prevent the model from learning false associations between tweet words and words from the titles of webpages. The output of each convolutional layer is passed through the non-linear activation function ReLU. A pooling layer then aggregates the convolutional feature map by taking the maximum over each filter's outputs. Thus, the two output vectors after max-pooling represent the features extracted from tweet texts and URLs for age-category prediction.
• Since CNNs require fixed-size homogeneous inputs, we propose another design innovation in order to utilize additional features: the two vectors described above are concatenated with (a) a vector representing the additional features (Section 3.2.2), (b) the three vectors min, max, and avg calculated from the word embeddings of HashtagCooccurringSet, and (c) a vector of features from pre-created lexicons after normalizing values to logits (-1 to +1). This concatenated vector is then fully connected to the output layer in a soft-max setup. Since this vector is very large, we use dropout for regularization.
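The merge step described in the second bullet can be sketched in NumPy as a single forward pass: concatenation of all feature vectors, optional dropout, and a fully connected soft-max output. This is a minimal illustration and not the authors' Keras implementation; the function name `classify`, the weight matrix `W`, the bias `b`, and the use of inverted dropout are assumptions.

```python
import numpy as np


def classify(tweet_feat, url_feat, extra_feat, hashtag_vecs, lexicon_feat, W, b,
             drop_rate=0.0, rng=None):
    """Concatenate all feature vectors, optionally apply dropout, return class probabilities."""
    z = np.concatenate([tweet_feat, url_feat, extra_feat, *hashtag_vecs, lexicon_feat])
    if drop_rate > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(z.shape) >= drop_rate
        z = z * keep / (1.0 - drop_rate)  # inverted dropout, used at training time only
    logits = W @ z + b
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()
```

Here `tweet_feat` and `url_feat` stand for the two max-pooled CNN outputs, and `hashtag_vecs` for the three vectors min, max, and avg.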
The details of the layers of the CNN architecture are as follows. The convolution layer comprises multiple filters of fixed length which are convolved with the input sentence matrix to extract discriminative word-sequence patterns useful for classification. The convolution operation is defined as:

$$c_i = \sum_{k,j} \left(S_{[i:i+h-1]}\right)_{k,j} \, F^{m}_{k,j}$$

where S is the input sentence matrix, S_{[i:i+h-1]} denotes its columns i through i+h−1, h is the filter width, and F^m_{k,j} are the mth filter's coefficients. c_i is the value of the learned feature at position i. The entire convolution of the mth filter with the input tweet produces n − h + 1 values, which are concatenated to produce a vector c ∈ R^{n−h+1}. The vectors c are then aggregated over all m filters into a feature map matrix C ∈ R^{m×(n−h+1)}.
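The sliding-window convolution described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: stride 1, no padding, and one k × h filter per feature; the names `convolve_filter` and `feature_map` and all numeric values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def convolve_filter(S: Matrix, F: Matrix) -> List[float]:
    """Slide one k x h filter F over a k x n sentence matrix S (stride 1)."""
    k, n = len(S), len(S[0])
    h = len(F[0])
    if len(F) != k:
        raise ValueError("filter and sentence matrix must share dimension k")
    if h > n:
        raise ValueError("filter width h must not exceed sentence length n")
    # Each output value sums the element-wise product over an h-token window.
    return [
        sum(S[a][i + b] * F[a][b] for a in range(k) for b in range(h))
        for i in range(n - h + 1)
    ]


def feature_map(S: Matrix, filters: List[Matrix]) -> Matrix:
    """Stack the outputs of all filters, one row per filter."""
    return [convolve_filter(S, F) for F in filters]


# Toy input: k = 2 embedding dimensions, n = 4 tokens (made-up values).
S = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]
filters = [
    [[1.0, 1.0], [0.0, 0.0]],   # sums adjacent values of the first dimension
    [[0.0, 0.0], [1.0, -1.0]],  # differences adjacent values of the second
]
C = feature_map(S, filters)
```

With n = 4 and h = 2, each filter yields n − h + 1 = 3 values, so C has one row of length 3 per filter.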

(c) Max Pooling
The output of the convolutional layer is passed through a non-linear activation function such as hardTanh, sigmoid, or ReLU. The pooling layer then aggregates each filter's feature map by taking the maximum over its elements. The resulting vector is C_pooled ∈ R^{m×1}.
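As a small illustration of this step, the sketch below applies ReLU element-wise and then keeps one maximum per filter. The helper names `relu` and `relu_max_pool` and the feature-map values are hypothetical.

```python
from typing import List


def relu(x: float) -> float:
    """Rectified linear unit: negative values are clipped to zero."""
    return x if x > 0.0 else 0.0


def relu_max_pool(C: List[List[float]]) -> List[float]:
    """Apply ReLU element-wise, then keep the maximum of each filter's row."""
    return [max(relu(v) for v in row) for row in C]


# Toy feature map with m = 3 filters and 3 positions each (made-up values).
C = [[3.0, 5.0, 7.0], [-1.0, 1.0, -1.0], [-2.0, -0.5, -3.0]]
c_pooled = relu_max_pool(C)
```

The pooled vector has one entry per filter regardless of the input length, and a filter whose responses are all negative contributes 0 after ReLU.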

URL titles sentences CNN (Model 2):
(a) URL Titles Sentences Matrix (Input)
The title sentences of all URLs in all of a given user's tweets are represented by the horizontal concatenation of the k-dimensional word embeddings of their n constituent tokens. These word embeddings are pre-trained on a large corpus of blogs, news posts, and generic text in the given language (English or Dutch). This generates a matrix S ∈ R^{k×n}, which is input to the convolutional neural network model.
(b) Convolutional Layer and Max Pooling
These layers are identical to those of the tweet sentences CNN.

Hashtag vectors
From the HashtagCooccurringSet, comprising the words that most frequently co-occur with the hashtags in the 200 most recent tweets of a given user, we derive three embedding vectors min, max, and avg from the word-embedding model pre-trained on generic text (English or Dutch), as explained in Section 3.3. This generates three vectors V_1, V_2, and V_3 ∈ R^k, which are input to the concatenation layer (the merge layer of Keras, footnote 11).

Concatenation Layer and Dropout
The outputs of max pooling from (i) the tweet sentences CNN and the URL titles CNN, (ii) the three hashtag vectors, and (iii) the other features (stylometric, social network features, etc.) are concatenated. Since the resulting vector is high-dimensional, we use the dropout method proposed by [70] to avoid overfitting. Each dimension is randomly set to 0 according to a Bernoulli distribution B(p), where p is a hyper-parameter. We complement this regularization with L2-regularization of the softmax parameters. After dropout, the vector is passed on to the softmax layer.
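The dropout step can be sketched as follows. The sketch assumes that p is the probability of zeroing a dimension (the text does not say whether p denotes the drop or the keep probability; the two coincide for p = 0.5) and it omits any rescaling of the surviving values. The function name `dropout` and the input values are hypothetical.

```python
import random
from typing import List, Optional


def dropout(
    vector: List[float], p: float, rng: Optional[random.Random] = None
) -> List[float]:
    """Zero each dimension independently with probability p (training only)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    # One Bernoulli draw per dimension decides whether it is kept or zeroed.
    return [0.0 if rng.random() < p else v for v in vector]


# Toy concatenated feature vector (made-up values), fixed seed for repeatability.
concat = [0.7, 1.2, 0.0, 3.1, 0.4, 2.2]
dropped = dropout(concat, p=0.5, rng=random.Random(0))
```

At prediction time no dimensions are dropped, which corresponds to calling the function with p = 0.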

Softmax
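The output from the concatenation layer, C_concat ∈ R^m, is used for softmax regression, which returns the class ŷ ∈ {1, ..., K} with the largest probability, i.e.,

$$\hat{y} = \arg\max_j P(y = j \mid x, w, a) = \arg\max_j \frac{e^{x^{\top} w_j + a_j}}{\sum_{k=1}^{K} e^{x^{\top} w_k + a_k}}$$

where x is the input vector, w_j denotes the weight vector of class j, and a_j the bias of class j.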
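A minimal sketch of softmax prediction is given below, assuming the standard softmax regression form with one weight vector and one bias per class. The function name `softmax_predict` and the toy weights are hypothetical; subtracting the largest score before exponentiation is a numerical-stability choice that does not change the probabilities.

```python
import math
from typing import List, Tuple


def softmax_predict(
    x: List[float], W: List[List[float]], a: List[float]
) -> Tuple[int, List[float]]:
    """Return the most probable class (numbered 1..K) and all probabilities."""
    scores = [
        sum(xi * wi for xi, wi in zip(x, w_j)) + a_j for w_j, a_j in zip(W, a)
    ]
    top = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    y_hat = max(range(len(probs)), key=probs.__getitem__) + 1
    return y_hat, probs


# Toy example with K = 3 age classes and a 2-dimensional input (made-up values).
x = [1.0, 2.0]
W = [[0.5, -0.5], [0.0, 1.0], [1.0, 0.0]]
a = [0.0, 0.0, 0.5]
y_hat, probs = softmax_predict(x, W, a)
```

Here the class scores are −0.5, 2.0, and 1.5, so the second class is selected.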

CNN training details
In order to train the above model, each dataset was split into 85% for training and 15% for validation. The batch size for training the CNN was kept at 200 tweets. Since tweets differ in length, we limited each tweet to a maximum of 30 words (excluding emoticons, hashtags, and URLs). Tweets shorter than 3 words or longer than 30 words were discarded from further processing. While the average length of a tweet in our datasets was small, some users have availed themselves of the feature, introduced by Twitter in November 2017, that supports longer tweets (up to 280 characters). Since the limit of 30 words covers 100% of our collected tweets, we normalize the length of each tweet to 30 words. Tweets shorter than 30 words were padded with a special PAD token. Since each tweet differs in content from the others, even those by the same user, we carefully adjusted the sizes of the kernel masks so as not to learn spurious features. In this way, we ensured that the convolution kernel masks did not move across different tweets. We limited URL titles to 25 words. If a URL title was shorter, we again padded it with the special PAD token; if a title was longer than 25 words, we ignored the remaining words. Since search engines typically display the first 50-60 characters of a title, most web pages do not have titles exceeding 25 words. Also, in our two datasets, we found that 93% of titles are shorter than 25 words. Similarly, we found that only 0.8%, 1.2%, and 2.5% of the words from tweets and URLs were not available in the pre-trained dictionaries for the English1, English2, and Dutch datasets, respectively. For the Dutch dataset, the number of words not available in the pre-trained word-embedding resource was thus about double that for the English datasets. This is indicative of a lack of adequately sized resources for Dutch.
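The length normalization described above can be sketched as follows, assuming tweets and titles are already tokenized into word lists. The token string `__PAD__` and the function names are placeholders chosen for illustration; only the limits (3, 30, and 25 words) come from the text.

```python
from typing import List, Optional

PAD = "__PAD__"  # placeholder spelling of the special padding token


def normalize_tweet(
    tokens: List[str], min_len: int = 3, max_len: int = 30
) -> Optional[List[str]]:
    """Discard tweets outside [min_len, max_len]; pad the rest to max_len."""
    if len(tokens) < min_len or len(tokens) > max_len:
        return None
    return list(tokens) + [PAD] * (max_len - len(tokens))


def normalize_title(tokens: List[str], max_len: int = 25) -> List[str]:
    """Truncate long URL titles to max_len words and pad short ones."""
    clipped = list(tokens[:max_len])
    return clipped + [PAD] * (max_len - len(clipped))


tweet = normalize_tweet(["just", "landed", "in", "oulu"])
title = normalize_title(["AI", "at", "Google:", "our", "principles"])
```

Because every tweet occupies exactly 30 positions, convolution windows can be arranged so that they never span two different tweets.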
The following choice of hyper-parameters was made: for both datasets, we use ReLU as non-linear activation function, filter windows of sizes 3, 4, 5, 6 with 128 feature maps each, and mini-batch size of 200. The loss is minimized using the Adam optimizer [71]. For regularization, we use the dropout method proposed by [70] after the max pooling layer with p = 0.5. In addition, we complement this method of regularization with L2-Regularization of softmax parameters.

Experiments and results
In order to show the effectiveness of our novel features, we conduct both regression and classification (into pre-defined age-category bins) experiments. Below we define various combinations of features as baseline features to compare the performance of regression and classification models on the datasets. The feature representation for tweet text, URL titles, and HashtagCooccurringSet in the baseline setting is bag-of-ngrams. While these features already improve the results, further improvement is obtained by using distributed representations of words incorporated in our novel 2-channel CNN model.

Feature analysis
Following [30], we use Cohen's d measure to show the effectiveness of our features in the classification of age groups of Twitter users. We first convert chi-square values into a correlation coefficient r using the formula $r = \sqrt{\chi^2 / N}$. This value is then converted into a Cohen's d effect size using the formula $d = 2r / \sqrt{1 - r^2}$. The p-value threshold was chosen as 0.001. Table 7 shows the top predictive features for each age category in all three datasets. The plus sign (+) shows the direction of association.
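The two conversions can be checked with a short worked example. The sketch assumes the standard forms r = sqrt(chi-square / N) and d = 2r / sqrt(1 − r²), with N taken to be the number of observations; the function names and the numbers are illustrative and are not values from Table 7.

```python
import math


def chi2_to_r(chi2: float, n: int) -> float:
    """Convert a chi-square statistic into a correlation coefficient r."""
    if n <= 0 or chi2 < 0:
        raise ValueError("need n > 0 and chi2 >= 0")
    return math.sqrt(chi2 / n)


def r_to_cohens_d(r: float) -> float:
    """Convert a correlation coefficient r into a Cohen's d effect size."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


# Made-up example: chi-square of 40 observed over N = 1000 users.
r = chi2_to_r(40.0, 1000)
d = r_to_cohens_d(r)
```

For these numbers r is 0.2 and d is roughly 0.41.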

Linear regression (ridge) with baseline features
We use ridge regression to predict age as a continuous variable with different feature sets (see baseline features above). Feature sets B1 and B2 include the content of hashtags and URLs, respectively, in their bag-of-ngrams representation. Since the features discussed above are very high-dimensional, we used principal component analysis (PCA) to reduce the number of features. Using 10% of the data as a validation sample, we set the regularization parameter for ridge regression. As the goodness-of-fit statistic, we used R-squared (R^2). The results of the regression are shown in Table 8.
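For reference, the R-squared statistic can be computed as in the sketch below, which uses the usual definition 1 − SS_res / SS_tot. The function name `r_squared` and the ages are made up for illustration and are not taken from Table 8.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R-squared is undefined when all targets are equal")
    return 1.0 - ss_res / ss_tot


# Made-up true and predicted ages for four users.
ages_true = [20.0, 30.0, 40.0, 50.0]
ages_pred = [22.0, 28.0, 41.0, 49.0]
score = r_squared(ages_true, ages_pred)
```

Here the residual sum of squares is 10 and the total sum of squares is 500, giving a score of 0.98.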

Classification results
In order to evaluate the performance of the novel 2-channel CNN model with the new features proposed on the above datasets, we compare its results with those obtained from (a) a Support Vector Machine (SVM) classifier (with linear kernel), (b) logistic regression, and (c) random forest, using various combinations of features. We also use the additional language and social media specific features that have been utilized in previous works. While the feature representation for SVM, random forest, and logistic regression is bag-of-ngrams, for the CNN we used distributed representations to capture semantic correlations between words in a dense semantic vector space induced by word-embedding models. The feature sets used in our CNN model are denoted W1-W6 (cf. Tables 9-11).

Micro-averaged precision and recall are computed as:

$$\text{Precision}_\mu = \frac{\sum_{ag} TP_{ag}}{\sum_{ag} \left(TP_{ag} + FP_{ag}\right)}, \qquad \text{Recall}_\mu = \frac{\sum_{ag} TP_{ag}}{\sum_{ag} \left(TP_{ag} + FN_{ag}\right)}$$

where TP_ag, FP_ag, and FN_ag are the true positives, false positives, and false negatives for age group class ag, respectively. The micro-averaged F1 score is defined as the harmonic mean of Precision_µ and Recall_µ:

$$F1_\mu = \frac{2 \cdot \text{Precision}_\mu \cdot \text{Recall}_\mu}{\text{Precision}_\mu + \text{Recall}_\mu}$$
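The micro-averaged F1 score can be computed from per-class counts as in the following sketch, which pools true positives, false positives, and false negatives over all age groups before taking the harmonic mean. The function name `micro_f1`, the age-group labels, and the counts are hypothetical and are not results from the paper.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-class (TP, FP, FN) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Made-up confusion counts per age group: (TP, FP, FN).
counts = {
    "18-24": (50, 10, 5),
    "25-40": (30, 5, 10),
    "40+": (10, 5, 5),
}
f1 = micro_f1(counts)
```

In this example the pooled counts are TP = 90, FP = 20, and FN = 20, so precision, recall, and F1 all equal 90/110.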

Discussion
It is evident from Table 7 that lexical features from tweet text, the words most frequently co-occurring with hashtags, and words from URL titles are indeed discriminative for age-group classification. For example, the number of occurrences of the word ''family'' is much higher among the older age groups than among the younger ones. Similarly, slang words (such as ''lol'') are more frequently observed in tweets of the younger age groups. The number of words with non-standard spellings was found to be negatively correlated with age among the older age group of Twitter users, while it was positively correlated in the younger group. For the millennials (ages between 18 and 40), the number of user mentions in tweets was observed to be much higher than among other groups, suggesting that this generation practices engaging online social communication. This is further evidenced by their friends-to-followers ratio: higher values of the ratio indicate their proclivity to be part of a social network. Use of hashtags and URLs is prevalent across all age categories (ref. Table 4), but the 'topics' indicated by the hashtags or URLs differ among age groups. For example, the word 'music' was more frequent in the HashtagCooccurringSet of younger users, whereas the word 'news' was more prominent among the older groups.
The average number of media files per tweet of a user was not found to be discriminative for age-category classification (ref. Table 7). Very few users availed themselves of the feature of attaching media to their tweets, and such users come from all age groups rather than being restricted to a particular one (ref. Table 4). Similarly, we found that the average sentiment score of all 200 tweets of a given user does not help in classification. The distribution of the number of tweets with positive, negative, and neutral sentiment is almost the same across all age groups (ref. Table 4). Sentiment scores were evaluated for each tweet, and while the average sentiment score across the 200 most recent tweets of a user may give insight into the personality of that user, it does not characterize the behavior of his/her group. Encouraged by the findings of the feature importance evaluation, we carried out ridge regression experiments (ref. Table 8) and classification experiments (ref. Tables 9-11) on the three datasets: Dutch [27], English1 [46], and English2 [30]. We show the effectiveness of various features in each of these models. It is observed that all approaches yield higher performance on the English1 dataset, which can be attributed to its much smaller size (almost half) compared to the other two datasets. Further, there are differences in the method of data collection: the English1 dataset relied on self-declared age values in profile descriptions, which were manually verified, while the English2 dataset was created by applying pattern-matching rules to congratulatory tweets, and the creation of the Dutch dataset relied on external sources such as Facebook or LinkedIn. Also, the creation of the Dutch dataset may introduce a sampling bias, since its creators selected users who are in the same social subnetwork.
From Tables 9-11 it is observed that older age groups yield lower accuracy across all three datasets. On the other hand, better performance is observed for the 0-20 age group (Dutch dataset), the 18-40 age group (English1 dataset), and the 18-24 age group (English2 dataset). Clearly, our supervised machine learning approach yields better results when more training data is available: Table 3 shows that these age groups represent the largest proportions of the respective datasets, whereas for older age groups the number of data samples is much smaller across all datasets.
As can be seen from the experimental results utilizing baseline features (Section 4.1) in the tables (A, B, and C in Tables 9-11), including the words most frequently co-occurring with hashtags in the user's tweets in the bag-of-words model (column 2) actually degrades the performance of age prediction. Hashtags are used to index keywords or topics so as to categorize tweets and to allow people to easily follow the topics they are interested in. In this experiment, we attempt to capture the topics a user is interested in by finding hashtag-relevant words from other tweets that include the same hashtag. In order to overcome the problem of topic drift, we only use tweets in a window of −10 to +10 days from the tweet containing the hashtag. Based on the hypothesis that a person's age is correlated with the topics he/she is interested in, we expected to see an improvement in accuracy. However, we notice that bringing in hashtag-relevant words for all hashtags from all 200 recent tweets introduces too much noise: while false negatives decrease, resulting in improved recall, false positives increase, resulting in poorer precision. A similar observation is made about including words from the URL titles (column 3). Utilizing linguistic, stylometric, and Twitter-specific features along with 1- to 3-grams from tweet texts improves precision and recall over the basic BoW model. This confirms the sociolinguistic hypothesis that linguistic and stylometric features serve as indicators of a person's age. Finally, exploiting predictive lexica that are pre-trained for age, such as EMNLP2014 [22] and WWBP [29], helps improve the accuracy slightly. Although such pre-created lexica have the potential to improve the accuracy of age prediction, we observe that, because they were trained on Facebook data and our data is from Twitter, they fail to achieve the expected gains (since the discourse styles of Facebook and Twitter differ).

Table 9
Results on Dutch dataset. B1-B6 and W1-W6 are explained in the text.

Table 10
Results on English1 dataset. B1-B6 and W1-W6 are explained in the text.
Table 11
Results on English2 dataset. B1-B6 and W1-W6 are explained in the text.

Part (D) of Tables 9-11 shows the results of our experiments using the CNN as the classification model in combination with word embeddings for tweet words/phrases. We find that feeding word embeddings into our CNN model improves on the BoW baseline (compare column 1 of A, B, and C with that of D in each of the tables), since the CNN learns complex features associating different dimensions of the word-embedding vectors of a tweet word sequence. In the case of URL titles, replacing them with a sequence of corresponding word-embedding vectors and selecting features using convolutional filters improves the performance of the system (compare column 3 of A, B, and C with that of D in each of the tables). We also notice that, instead of directly using HashtagCooccurringSet words for classification, utilizing the three vectors min, max, and avg derived from the word embeddings of HashtagCooccurringSet improves both precision and recall (compare column 2 of A, B, and C with that of D in each of the tables). Since much less noise is introduced than when all words are included, both false positives and false negatives decrease, improving precision and recall. Finally, as with the SVM baseline model, including linguistic, stylometric, and Twitter-specific features and using external lexica helps improve the accuracy further.
Overall, using our CNN-based architecture along with novel features improves the micro-F1 score by 12.3%, 9.8% and 6.6% for Dutch, English1 and English2 datasets, respectively when compared against the best results of SVM, Random Forest, and Logistic Regression models employing bag-of-ngrams representation of baseline features.

Conclusion
In this paper, we proposed a novel way to include features derived from hashtags and URLs in tweets for age prediction of Twitter users. We show that using distributed representations incorporated into a convolutional neural network improves the accuracy over the baseline bag-of-words model. Augmenting these features with features derived from URLs and hashtags further improves precision and recall. We examined the effect of adding novel features incrementally and conclude that our model outperforms the baseline by 12.3%, 9.8%, and 6.6% in micro-averaged F1 score for the Dutch [27], English1 [46], and English2 [30] datasets, respectively.
Present-day social media platforms facilitate effective social communication by offering several metadata features that users may employ to make their messages more meaningful. The proposed method presents a way to include information from URLs and hashtags in the analytics of social media messages. While the evaluation of the accuracy of our approach is limited by the amount of labeled data available for age demographics, as future work we plan to utilize this approach for the prediction of other demographic information such as gender and ethnicity. Another limitation of the proposed work is its partial reliance on the use of language to identify age. Research has shown that, since public messaging in social media may reveal significant information about a person, some users modulate their communication strategies to preserve privacy [72]. However, language can still reveal the identity of users when they engage in one-to-one communication on a public forum, for example when talking with a close friend [73], or in parents' messages to or about their children that violate the children's privacy [74]. Our approach of discarding retweets but preserving reply-to messages helps capture the linguistic signature of an individual. Further, we capture the interest profile of an individual as manifested by the hashtags and URLs in his/her tweets.