Dropped personal pronoun recovery in Chinese SMS*

CHRIS GIANNELLA; RANSOM WINDER; STACY PETERSEN

doi:10.1017/S1351324917000158

Published online by Cambridge University Press: 30 May 2017


Affiliation (all authors): Department of Human Language Technology, The MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102, USA
E-mails: cgiannella@mitre.org, rwinder@mitre.org, spetersen@mitre.org


Abstract

In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres such as Short Message Service (SMS) messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of the task in which only the person number of each dropped pronoun is identified. After applying a word segmenter, we used a linear-chain conditional random field to predict which words were at the start of an independent clause. Then, using the independent clause start information, as well as lexical and syntactic information, we applied a conditional random field or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person number of the dropped pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese SMS messages. Our approaches substantially outperformed a rule-based approach based partially on rules developed by Chung and Gildea (2010, Effects of Empty Categories on Machine Translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. pp. 636–45). Our approaches also outperformed (though by a considerably smaller margin) a machine-learning approach based closely on work by Yang, Liu, and Xue (2015, Recovering Dropped Pronouns from Chinese Text Messages. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics. pp. 309–13). Features derived from parsing largely did not help our approaches. We conclude that, given independent clause start information, the parse information we used was largely superfluous for identifying dropped personal pronouns.
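The per-word prediction task described in the abstract can be pictured as sequence labeling: each word receives a label saying whether a dropped personal pronoun immediately precedes it and, if so, its person number. The following is a minimal sketch of that label encoding only, not the authors' implementation. The label names (`NONE`, `P1`, `P2`, `P3`), the `*P2*` placeholder notation, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical label inventory: no dropped pronoun, or first/second/third person.
NONE = "NONE"
PERSON_LABELS = ("P1", "P2", "P3")


def is_placeholder(token: str) -> bool:
    """True for assumed placeholder tokens such as '*P2*'."""
    return token.startswith("*") and token.endswith("*") and token[1:-1] in PERSON_LABELS


def encode(annotated: List[str]) -> Tuple[List[str], List[str]]:
    """Split an annotated token list into surface words and per-word labels.

    A placeholder marks a dropped pronoun immediately before the next word;
    that word gets the placeholder's person label, all other words get NONE.
    A placeholder with no following word is ignored in this sketch.
    """
    words: List[str] = []
    labels: List[str] = []
    pending = NONE
    for token in annotated:
        if is_placeholder(token):
            pending = token[1:-1]
        else:
            words.append(token)
            labels.append(pending)
            pending = NONE
    return words, labels


def restore(words: List[str], labels: List[str]) -> List[str]:
    """Re-insert placeholders before every word whose label is not NONE."""
    restored: List[str] = []
    for word, label in zip(words, labels):
        if label != NONE:
            restored.append(f"*{label}*")
        restored.append(word)
    return restored


if __name__ == "__main__":
    # "(you) eaten yet?" with the second-person subject dropped.
    words, labels = encode(["*P2*", "吃", "了", "吗"])
    print(words, labels)
    print(restore(words, labels))
```

Under this encoding, a sequence model or a per-word classifier only has to emit one label per segmented word, and `restore` turns predicted labels back into text with the recovered person markers.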

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 6, November 2017, pp. 905–927

DOI: https://doi.org/10.1017/S1351324917000158
Copyright: © Cambridge University Press 2017

Footnotes

Also affiliated with the Department of Linguistics, Georgetown University, 3700 O Street NW, Washington, DC, USA.

We are thankful for the assistance provided by our MITRE colleagues. Dr Sichu Li annotated, in efficient and professional fashion, a large subset of the SMS we downloaded from the National University of Singapore. Dr John Prange, Mr Rob Case, and Mr Rod Holland provided valuable feedback on a presentation we gave describing our preliminary research findings. We are also thankful for the assistance provided by our colleagues at other institutions. Professor Nianwen (Bert) Xue at Brandeis University, Boston, USA shared his thoughts and expertise on Chinese dropped pronoun detection, at an early stage of our research. Professor Derek F. Wong and Mr Junwen Xing at the University of Macau, Macau, SAR PRC applied their word segmenter to the National University of Singapore corpus.

References

Baran, E., Yang, Y., and Xue, N. 2012. Annotating dropped pronouns in Chinese newswire text. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA). pp. 2795–9.

Cai, S., Chiang, D., and Goldberg, Y. 2011. Language-independent parsing with empty elements. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 212–6.

Chen, C., and Ng, V. 2013. Chinese zero pronoun resolution: some recent advances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1360–5.

Chen, T., and Kan, M.-Y. 2013. Creating a live, public short message service corpus: the NUS SMS corpus. Language Resources and Evaluation 47 (2): 299–335. doi: 10.1007/s10579-012-9197-9.

Chen, C., and Ng, V. 2014. Chinese zero pronoun resolution: an unsupervised approach combining ranking and integer linear programming. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, Palo Alto, CA USA, Association for the Advancement of Artificial Intelligence Press. pp. 1622–8.

Chung, T., and Gildea, D. 2010. Effects of empty categories on machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 636–45.

Edington, E., and Onghena, P. 2007. Randomization Tests, 4th ed. Boca Raton, FL, USA: CRC Press, Taylor & Francis Group. ISBN: 978-1-58488-589-4.

Grosz, B., Joshi, A., and Weinstein, S. 1995. Centering: a framework for modeling the local coherence of discourse. Computational Linguistics 21 (2): 203–25.

Huang, C. T. J. 1989. Pro-drop in Chinese: a generalized control theory. In Jaeggli, O. and Safir, K. (eds.), Studies in Natural Language and Linguistic Theory: The Null Subject Parameter, vol. 15, pp. 185–214. Netherlands: Springer. doi: 10.1007/978-94-009-2540-3_6.

Kawahara, D., and Kurohashi, S. 2005. Zero pronoun resolution based on automatically constructed case frames and structural preference of antecedents. In Su, K.-Y., Tsujii, J., Lee, J.-L., and Kwong, O. Y. (eds.), Lecture Notes in Computer Science, vol. 3248, pp. 12–21. Berlin Heidelberg: Springer. doi: 10.1007/978-3-540-30211-7_2.

Kong, F., and Zhou, G. 2010. A tree kernel-based unified framework for Chinese zero anaphora resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 882–91.

Kong, F., and Zhou, G. 2013. A clause-level hybrid approach to Chinese empty element recovery. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), Palo Alto, CA USA, Association for the Advancement of Artificial Intelligence Press. pp. 2113–9.

Lafferty, J., McCallum, A., and Pereira, F. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), Burlington, MA USA, Morgan Kaufmann. pp. 282–9.

Levy, R., and Galen, A. 2006. Tregex and tsurgeon: tools for querying and manipulating tree data structures. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA). pp. 2231–4.

Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 55–60.

McCallum, A. 2002. Accessed July 16, 2013. http://mallet.cs.umass.edu.

Rahman, A., and Ng, V. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Palo Alto, CA USA, Association for Computational Linguistics. pp. 1051–60.

Rao, S., Ettinger, A., Daume, H. III, and Resnik, P. 2015. Dialogue focus tracking for zero pronoun resolution. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Palo Alto, CA USA, Association for Computational Linguistics. pp. 494–502.

Sasano, R., and Kurohashi, S. 2011. A discriminative approach to Japanese zero anaphora resolution with large-scale lexicalized case frames. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 758–66.

Seki, K., Fujii, A., and Ishikawa, T. 2002. A probabilistic method for analyzing Japanese anaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1–7.

Wang, L., Wong, D., Chao, L., and Xing, J. 2012. CRFs-based Chinese word segmentation for micro-blog with small-scale data. In Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, Stroudsburg, PA USA, Association for Computational Linguistics. pp. 51–7.

Xue, N., and Yang, Y. 2011. Chinese sentence segmentation as comma classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 631–5.

Xue, N., and Yang, Y. 2013. Dependency-based empty category detection via phrase structure trees. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 1051–60.

Xue, N., Xia, F., Huang, S., and Kroch, A. 2000. The bracketing guidelines for the Penn Chinese Treebank (3.0). Technical Report No. IRCS-00-08, University of Pennsylvania Institute for Research in Cognitive Science. http://repository.upenn.edu/ircs_reports/39/.

Yang, W., Dai, R., and Cui, X. 2008. Zero pronoun resolution in Chinese using machine learning plus shallow parsing. In Proceedings of the IEEE International Conference on Information and Automation, New York, NY USA, Institute of Electrical and Electronics Engineers. pp. 905–10.

Yang, Y. 2014. Reading between the lines: recovering implicit information from Chinese texts. Ph.D. Dissertation, Department of Computer Science, Brandeis University.

Yang, Y., Liu, Y., and Xue, N. 2015. Recovering dropped pronouns from Chinese text messages. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 309–13.

Yang, Y., and Xue, N. 2010. Chasing the ghost: recovering empty categories in the Chinese Treebank. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Beijing, P.R. China, Tsinghua University Press. pp. 1382–90.

Yeh, A. 2000. More accurate tests for the statistical significance of result differences. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 947–53.

Yeh, C.-L., and Chen, Y.-C. 2007. Zero anaphora resolution in Chinese with shallow parsing. Journal of Chinese Language and Computing 17 (1): 41–56.

Zhao, S., and Ng, H.T. 2007. Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA USA, Association for Computational Linguistics. pp. 541–50.
