Analyzing Spoken and Written Discourse: A Role for Natural Language Processing Tools

The Palgrave Handbook of Applied Linguistics Research Methodology

Abstract

This chapter provides a general overview of research methods used to analyze spoken and written discourse. It then focuses on how natural language processing (NLP) tools that measure lexical, syntactic, rhetorical, and cohesion features of text can be used to examine such discourse. Specifically, the chapter reviews how NLP tools have been used in previous discourse studies, introduces freely available tools, describes the output these tools produce, and outlines statistical methods for analyzing and interpreting that output.
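To make the idea of text-level indices concrete, the following is a minimal sketch of the kind of lexical and cohesion measures such tools report. It is not the implementation of any tool discussed in the chapter: the function names, the regex tokenizer, the punctuation-based sentence splitter, and the overlap measure are all simplifying assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive split on ., ! and ?; real tools typically use trained segmenters."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def type_token_ratio(tokens):
    """Lexical diversity: unique word types divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def adjacent_overlap(sentences):
    """Crude local cohesion: mean share of a sentence's word types
    that also occur in the immediately preceding sentence."""
    scores = []
    for prev, curr in zip(sentences, sentences[1:]):
        prev_types = set(tokenize(prev))
        curr_types = set(tokenize(curr))
        if curr_types:
            scores.append(len(prev_types & curr_types) / len(curr_types))
    return sum(scores) / len(scores) if scores else 0.0


def describe(text):
    """Return a small dictionary of illustrative indices for one text."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "ttr": type_token_ratio(tokens),
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "adjacent_overlap": adjacent_overlap(sentences),
    }
```

Calling `describe` on each text in a corpus would yield one row of indices per text, which could then be passed to whatever statistical analysis a study calls for. Note that the raw type-token ratio is sensitive to text length, so comparisons across texts of different lengths would need a length-corrected measure.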

Author information

Corresponding author

Correspondence to Scott A. Crossley.

Copyright information

© 2018 The Author(s)

About this chapter

Cite this chapter

Crossley, S.A., Kyle, K. (2018). Analyzing Spoken and Written Discourse: A Role for Natural Language Processing Tools. In: Phakiti, A., De Costa, P., Plonsky, L., Starfield, S. (eds) The Palgrave Handbook of Applied Linguistics Research Methodology. Palgrave Macmillan, London. https://doi.org/10.1057/978-1-137-59900-1_25

  • DOI: https://doi.org/10.1057/978-1-137-59900-1_25

  • Publisher Name: Palgrave Macmillan, London

  • Print ISBN: 978-1-137-59899-8

  • Online ISBN: 978-1-137-59900-1

  • eBook Packages: Social Sciences (R0)
