Text categorization methods for automatic estimation of verbal intelligence

https://doi.org/10.1016/j.eswa.2012.02.173

Abstract

In this paper we investigate whether conventional text categorization methods suffice to infer different verbal intelligence levels. This research goal relies on the hypothesis that the vocabulary speakers use reflects their verbal intelligence levels. Automatic estimation of users’ verbal intelligence in a spoken language dialog system may help to define an optimal dialog strategy by improving the system’s adaptation capabilities. The work is based on a corpus containing descriptions (i.e. monologs) of a short film by test persons with different educational backgrounds, together with the verbal intelligence scores of the speakers. First, a one-way analysis of variance was performed to compare the monologs with the film transcription and to demonstrate that the vocabulary used by test persons differs across verbal intelligence levels. Then, for the classification task, the monologs were represented as feature vectors using the classical TF–IDF weighting scheme. The Naive Bayes, k-nearest neighbors and Rocchio classifiers were tested. We describe and compare these classification approaches, define the optimal classification parameters and discuss the classification results obtained.

Highlights

  • Probably the first report of experiments attempting to automatically predict verbal intelligence.
  • This work has shown that verbal intelligence may be recognized by computers through language cues.
  • Some of the most popular text categorization methods have been successfully applied to this task.
  • Class-based feature selection approach enables better performance and lower computational cost.
  • Word lemmatisation, typically successful for TC tasks, has not proven to be helpful for this one.

Introduction

Next-generation spoken language dialog systems (SLDSs), developed to provide users with required information and/or to help them accomplish certain goals, are expected to deal with difficult tasks and to react to a wide range of situations and problems. They should help users feel free and comfortable when interacting with them, and they should be user-friendly and easy to use. Adapting SLDSs to their users may increase the systems’ communicative competence and improve their acceptability (Fig. 1).

Next-generation SLDSs may change the level of the dialog depending on the user’s experience. For example, a spoken dialog system that provides guidance and support for installing software may try to estimate whether the user is an expert or a novice in this field. Based on this information, suitable words and explanations may be generated. For an inexperienced user, these explanations may be very detailed and free of specialized vocabulary; for an expert, in contrast, the system may provide only a sequence of important steps or information about more difficult operations. From the beginning of the dialog, an SLDS may analyse the user’s speech, behavior and requests, as well as any difficulties that arise. When deciding on the best response to a user, the dialog manager may change words and sentence structures based on information about the user’s cognitive processes. Its responses may thus become more helpful and the user-friendliness of the system may improve. For this purpose it is necessary to identify differences in the language use of people with different educational backgrounds and different abilities to analyse situations and solve problems.

The ability to use language for accomplishing certain goals is called verbal intelligence (VI) (Cianciolo and Sternberg, 2004; Goethals et al., 2004). In other words, verbal intelligence is “the ability to analyse information and to solve problems using language-based reasoning” (Logsdon, 2011). Automatic verbal intelligence estimation may help dialog systems to choose the level of communication and to be simpler, more useful and more effective.

Fig. 2 explains the adaptation process of spoken dialog systems based on verbal intelligence estimation in more detail. When a user talks to the system, all j spoken utterances of that user are analysed to determine verbal intelligence. This means that the intelligence level is re-estimated at each turn based on features extracted from the new spoken utterances and from all the phrases uttered at previous turns. In Fig. 2 the SLDS has three different dialog scenarios corresponding to users with a higher, an average and a lower verbal intelligence. At the beginning of the dialog, the system uses the scenario corresponding to users with an average verbal intelligence. At the following turns, the system may switch to one of the alternative dialog scenarios.
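The turn-by-turn re-estimation described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration only: the scenario labels, the function `run_adaptation`, and in particular `toy_estimator`, which is a hypothetical placeholder based on the number of distinct words and is not the estimation method of this work.

```python
from typing import Callable, List

# Hypothetical labels for the three dialog scenarios of Fig. 2.
SCENARIOS = ("lower", "average", "higher")


def run_adaptation(utterances: List[str],
                   estimate_level: Callable[[List[str]], str]) -> List[str]:
    """Return the scenario the system uses at each turn."""
    history: List[str] = []
    scenario = "average"      # the dialog starts with the average scenario
    used: List[str] = []
    for utterance in utterances:
        used.append(scenario)                # scenario active at this turn
        history.append(utterance)            # keep all utterances so far
        scenario = estimate_level(history)   # re-estimate from full history
    return used


def toy_estimator(history: List[str]) -> str:
    """Placeholder estimator (assumption): counts distinct words."""
    distinct = len({w.lower() for u in history for w in u.split()})
    if distinct >= 8:
        return "higher"
    if distinct <= 3:
        return "lower"
    return "average"


turns = ["yes",
         "the film shows a long sleep experiment with volunteers",
         "ok"]
scenarios_used = run_adaptation(turns, toy_estimator)
```

In a real system, `estimate_level` would be replaced by a trained classifier operating on features extracted from the accumulated utterances.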

The automatic estimation of users’ verbal intelligence may help SLDSs to control the flow of dialogs more effectively, to engage users in the interaction and to be more attentive to human needs and preferences. For training machine learning algorithms, we need to identify as many language features as possible that reflect differences in the language use of people with different verbal intelligence. In this work we investigate to what extent the vocabulary of test persons reflects their levels of verbal intelligence when they all describe the same event and explain their thoughts and feelings about it. The investigation is based on a corpus containing descriptions of a short film along with the corresponding intelligence scores of the speakers (Zablotskaya, Walter, & Minker, 2010).

The paper is structured as follows. In Section 2 we describe the corpus which was used for the experimental research. Section 3 describes our initial efforts at defining film-related features which could be useful for distinguishing test persons with different verbal intelligence. Section 4 describes typical TF–IDF approaches and explains the details of the feature selection process for the monologs. In Section 5 we describe and compare the Naive Bayes, k-nearest neighbors and Rocchio classifiers. Classification results are presented and discussed in Section 6. Finally, Section 7 presents conclusions and future work.

Section snippets

Corpus description

For the data acquisition in Zablotskaya et al. (2010), a short film was shown to German native speakers. It described an experiment on how long people could go without sleep. The test persons were asked to imagine that they had met an old friend and wanted to tell him about this film. Our goal was to record everyday speech of the kind used when talking to relatives and friends. The resulting corpus consists of 56 descriptions of the short film (i.e. monologs), amounting to 3.5 h of audio data.

Modelling verbal intelligence by using film derived features

To analyse the vocabulary of people with different verbal intelligence when describing the same event, we first decided to compare the monologs with the film transcription. Fig. 4 shows excerpts from the film and from one of the monologs.

For the comparison, the following features were extracted:

  • Number of reused words – number of words which a test person “reused” from the film. For our example in

Text categorization solutions

The film-derived features presented in the previous section proved to be good predictors of verbal intelligence. In particular, some of them suggested that test persons belonging to different verbal intelligence classes may be distinguished by word or lemma patterns, regardless of the order of these words and lemmas in the monologs.

This result led us to the main hypothesis that we investigate in this work: is it possible to solve the problem of inferring the corresponding level of verbal

Vector space classification

As stated above, the vector space model represents each document as a vector with one real-valued component (i.e. a TF–IDF weight) for each term. Therefore, we need text classification methods that can operate on real-valued vectors. In this section we introduce those that have been tested so far.
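As an illustration of this representation, the following minimal sketch builds one sparse TF–IDF vector per tokenized document. The exact weighting variant is not given in this excerpt, so the common form tf(t, d) · log(N / df(t)) is an assumption, as are the function name `tfidf_vectors` and the toy tokens.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Map each tokenized document to a sparse {term: weight} vector.

    Assumed weighting: raw term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Toy example with two tiny "monologs" (illustrative tokens only).
example = tfidf_vectors([["film", "sleep", "sleep"],
                         ["film", "experiment"]])
```

With this weighting, a term that occurs in every document (here "film") receives weight zero, which is why such terms contribute nothing to distinguishing the classes.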

A number of classifiers have been used to classify text documents, including regression models, Bayesian probabilistic approaches, Nearest Neighbors approaches, the Rocchio algorithm, decision trees,

Experimental set-up

Our main goal is to identify the algorithm that best computes class boundaries and reaches the highest classification accuracy. To compare the performance of the different approaches in our experiments, a Leave-One-Out cross-validation (LOO-CV) method was used. The idea of this method is to use N − 1 observations for training (where N is the number of data points) and only 1 data point for testing. This procedure is repeated N times, so that each observation is used once as the testing data.
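The LOO-CV procedure can be sketched as follows. The classifier plugged in here is a simplified centroid-based (Rocchio-style) classifier with cosine similarity; this variant, the function names and the toy data are assumptions for illustration and do not reproduce the configuration or results of the experiments.

```python
import math
from typing import Callable, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rocchio_predict(train_x: List[List[float]], train_y: List[str],
                    x: List[float]) -> str:
    """Assign x to the class whose centroid is most similar to it."""
    centroids = {}
    for label in set(train_y):
        members = [v for v, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    # sorted() makes tie-breaking deterministic.
    return max(sorted(centroids), key=lambda lab: cosine(centroids[lab], x))


def loo_accuracy(xs: List[List[float]], ys: List[str],
                 predict: Callable[[List[List[float]], List[str],
                                    List[float]], str]) -> float:
    """Train on N - 1 points, test on the held-out one, repeat N times."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy data: four 2-dimensional vectors in two hypothetical classes.
toy_x = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_y = ["low", "low", "high", "high"]
toy_accuracy = loo_accuracy(toy_x, toy_y, rocchio_predict)
```

Because `loo_accuracy` takes the prediction function as a parameter, other classifiers such as k-nearest neighbors could be compared under the same protocol by passing a different `predict`.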

Baseline approach: class-based vs corpus-based feature selection

As

Conclusions and future work

This work showed that verbal intelligence may be recognized by computers through language cues. The achieved classification accuracy can be deemed satisfactory, and it was obtained for a number of classes high enough to enable integration into SLDSs. To our knowledge, this is the first report of experiments attempting to automatically predict verbal intelligence.

Some of the most popular TC algorithms were applied to this task: NB, Rocchio and kNN. NB models are typically expected to perform

Acknowledgement

This work is partly supported by the DAAD (German Academic Exchange Service).

Parts of the research described in this article are supported by the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” funded by the German Research Foundation (DFG).

For this work, Fernando was granted a fellowship by the Caja Madrid foundation.

References (33)

  • Miao, Y.-Q., et al. (2011). Pairwise optimized Rocchio algorithm for text categorization. Pattern Recognition Letters.
  • Tan, S. (2005). Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Systems with Applications.
  • Vinciarelli, A. (2005). Application of information retrieval techniques to single writer documents. Pattern Recognition Letters.
  • Baeza-Yates, R. A., et al. (1999). Modern information retrieval.
  • Bi, Y., et al. (2007). Combining multiple classifiers using Dempster’s rule for text categorization. Applied Artificial Intelligence.
  • Cianciolo, A. T., et al. (2004). Intelligence: a brief history.
  • Dumais, S., et al. Inductive learning algorithms and representations for text categorization.
  • Goethals, G., et al. (2004). Encyclopedia of leadership. No. v. 1 in encyclopedia of leadership.
  • Hall, P., Park, B. U., & Samworth, R. J. (2008). Choice of neighbor order in nearest-neighbor classification. Annals of...
  • Hui, G. G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). Using kNN model-based approach for automatic text. In...
  • Ittner, D. J., Lewis, D. D., & Ahn, D. D. (1995). Text categorization of low quality images. In Proceedings of...
  • Jianliang, Y., et al. (2004). Application of iterative-kNN based on kNN and automatic retrieval in automatic categorization. Journal of The China Society For Scientific and Technical Information.
  • Joachims, T. Text categorization with support vector machines: Learning with many relevant features.
  • Kupietz, M., Belica, C., Keibe, H., & Witt, A. (2010). The German reference corpus DeReKo: A primordial sample for...
  • Lewis, D. D., et al. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research.
  • Logsdon, A. (2011). Learning disabilities...