Exploring Methods for Predicting Important Utterances Contributing to Meeting Summarization
Abstract
1. Introduction
- Are multimodal and multiparty features useful in predicting important utterances?
- Which type of model performs better: models based on hand-crafted features or deep learning models?
- Is the proposed model suitable for selecting the important utterances used to visualize summaries of group discussion videos?
2. Related Work
2.1. Multimodal Multiparty Corpus Studies
2.2. Text, Speech, and Meeting Summarization
2.3. Deep Learning and Multimodal Features
2.4. User Studies on Meeting Summarization
3. MATRICS Corpus
3.1. Corpus Collection Experiment
3.1.1. Participants
3.1.2. Experimental environment and tasks
- Booth planning for a school festival: The participants were instructed to discuss and create a plan for a small booth intended to sell food or drinks at a school festival. The participants were given a map that indicated the location of other booths, as well as possible places for opening their own booth. They also had a document that showed the distribution of visitors’ ages and the number of visitors by time of day. The participants were instructed to review these documents for five minutes before starting the discussion. Then, based on the data in the documents, they discussed where to open their booth and the type of goods they would sell, within 20 minutes.
- Travel planning for foreign friends: The participants were instructed to create a two-day travel plan for foreign friends visiting Japan on a vacation. The discussion time allowed was 20 minutes, and no time was granted for thinking individually.
- Celebrity guest selection: The participants were asked to pretend that they were the executive committee members for a school festival, and were choosing a celebrity guest for the festival. Their discussion task was to decide the ranked order of 15 celebrities by considering cost and audience attraction. For the first five minutes, each participant was requested to read the instructions and decide alone (that is, without interacting with other members) the celebrity order. Subsequently, the participants were engaged in a discussion to determine the ranked order as a group.
3.2. Analyzed Data
- Head acceleration: An IMU (ATR-Promotions: WAA-010) was attached to the back of each participant’s head, more specifically, to a cap worn by each participant. These sensors measure head acceleration and angular velocity along the x, y, and z axes at 30 fps. The measured data were sent over Bluetooth to a server machine, which received and saved them with a timestamp. By applying the angular velocities of the three axes to the equation $\omega = \sqrt{\omega_x^2 + \omega_y^2 + \omega_z^2}$, we calculated the head composite angular velocity of each participant. Here, $\omega_x$, $\omega_y$, and $\omega_z$ are the per-frame angular velocities about the x, y, and z axes, respectively (a computation sketch follows this list).
- Video: Two video cameras (SONY HDR-CX630V) were set to record an overview of the communication from opposite directions. In addition, four web cameras (Logicool HD Pro Webcam C920t) were placed in the center of the table to record close-up front face images of each participant. The images had a resolution of 1280 × 720 and frame rate of 30 fps. The distance between a web camera and each participant was approximately 1 m. We obtained head position and rotation data by applying the close-up face images to a vision-based face tracker (FaceAPI: https://www.seeingmachines.com/). We used head pose data to create a face direction classification model that estimated four directions of the face (forward participant, right participant, left participant, and his/her memo). The classification accuracy of the model was 89.6%. We used this model to classify the head-gaze direction. The classification results were double-checked manually and corrected if necessary.
- Audio: All participants wore a hands-free headset microphone (Audio-technica HYP-190H) to record their speech individually. The speech input from each microphone was sent to a PC via an audio interface and recorded in four channels using recording software. The WAV files were sampled at 44.1 kHz. In addition, using the Praat audio analysis tool (http://www.fon.hum.uva.nl/praat/), the speech intensity and pitch were computed every 10 ms during an utterance, and the speech rate was measured for each utterance.
- Transcription: Transcriptions of automatically detected utterances were obtained through automatic speech recognition (ASR), whereas manually segmented utterances were transcribed by hand. The utterance segmentation methods are explained in more detail in Section 3.3.
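The composite angular velocity defined above is simply the Euclidean norm of the three gyro axes. A minimal NumPy sketch of this computation is given below; the array names and shapes are illustrative assumptions, not the authors’ actual pipeline:

```python
import numpy as np

def composite_angular_velocity(gyro: np.ndarray) -> np.ndarray:
    """Per-frame composite head angular velocity.

    gyro: array of shape (n_frames, 3) holding (omega_x, omega_y, omega_z)
    sampled by the IMU at 30 fps.
    Returns an array of shape (n_frames,) with
    omega = sqrt(omega_x**2 + omega_y**2 + omega_z**2).
    """
    return np.sqrt((gyro ** 2).sum(axis=1))

# Example: 600 frames (20 s at 30 fps) of synthetic gyro readings.
gyro = np.random.default_rng(0).normal(size=(600, 3))
omega = composite_angular_velocity(gyro)
```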
3.3. Analysis Units
3.3.1. Automatically detected utterances
3.3.2. Manually segmented utterances
3.4. Annotating Important Utterances to be Included in a Meeting Summary
4. Nonverbal Models for Important Utterance Prediction
4.1. Defining Hand-Crafted Features
- SP/OT: Features for speaker and other participants.
- PR: Features with respect to the ranked order of utterance frequency.
- CO: Features for behavior co-occurrence patterns.
4.1.1. Features for Speaker and Other Participants’ Behaviors (SP/OT)
- Number of attention shifts: The number of attention shifts of the participant during his/her speech. This feature is normalized by utterance duration. The feature value for others is defined as the average number of attention shifts of the other participants.
- Amount of attention received from participants: Frequency of receiving attention from at least two participants in the group during the speech. The feature value for others is computed as the average amount of received attention of other participants.
- Proportion of attention to others: The ratio of the time during which the speaker gazes at any other participant. It is defined only for the speaker.
- Proportion of attention to speaker: The average value of the percentage of time during which the speaker is gazed at, calculated for the other three participants. This feature is defined only for other participants.
- Proportion of attention to Rank1/2/3/4: The ratio of the time during which the participant gazes at the Rank 1/2/3/4 participant.
- Proportion of attention to his/her memo: The ratio of the time during which the participant gazes at his/her notes.
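To make the proportion features concrete, the sketch below derives the proportion of attention to others and to the memo from a per-frame head-gaze label sequence. The label scheme follows the four-way classification in Section 3.2, but the array layout and variable names are illustrative assumptions:

```python
import numpy as np

# Per-frame head-gaze labels for one participant during one utterance:
# "forward", "right", "left" (the three other participants) or "memo".
gaze = np.array(["forward", "memo", "left", "left", "right", "memo"])

# Proportion of attention to others: fraction of frames spent gazing
# at any other participant.
to_others = np.isin(gaze, ["forward", "right", "left"]).mean()

# Proportion of attention to his/her memo.
to_memo = (gaze == "memo").mean()

print(to_others, to_memo)  # approx. 0.667 and 0.333 for this sequence
```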
4.1.2. Features with Respect to the Ranked Order of Utterance Frequency (PR)
4.1.3. Features for Behavior Co-Occurrence Patterns (CO)
- Visual attention: Looking at Rank1, Rank2, Rank3, or Rank4, or looking down at his/her memo.
- Binary judgment of head motion: To binarize the head movement data, the composite head angular velocities are divided into two clusters, moving and not moving. We use the EM algorithm for clustering (see the sketch after this list).
- Speaking state: If a given participant is currently speaking, that time frame is labeled as a speaking state.
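A minimal sketch of the moving/not-moving binarization, using scikit-learn’s GaussianMixture (which is fitted via the EM algorithm) with two components as described above; the helper name is our own:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def binarize_head_motion(omega: np.ndarray) -> np.ndarray:
    """Cluster composite head angular velocities into two groups
    (moving / not moving) with a 2-component Gaussian mixture,
    which scikit-learn fits using the EM algorithm."""
    x = omega.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    labels = gmm.predict(x)
    # Relabel so that 1 always denotes the higher-velocity ("moving") cluster.
    moving = int(np.argmax(gmm.means_.ravel()))
    return (labels == moving).astype(int)
```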
4.1.4. Feature Selection by Statistical Tests
4.2. Deep Neural Networks
4.2.1. Structure of Nonverbal Unimodal Models
- 3D-CNN
- 2D-CNN
- AlexNet-based CNN
4.2.2. Nonverbal Unimodal Models
4.2.3. Nonverbal Multimodal Model
5. Verbal Models
5.1. Verbal Hand-Crafted Features
- Hand-crafted verbal features (HC_V): We defined 12 linguistic features by referring to a study on meeting summarization [2]. We used the following features: number of words, number of nouns, number of new nouns, average/variance/maximum/minimum of tf-idf, cosine similarity between the entire meeting and the target utterance, cosine similarity between the five preceding utterances and the target utterance, and the numbers of frequently appearing unigrams, bigrams, and trigrams in the utterance (a computation sketch follows this list).
- Bag-Of-Words (BOW) features: Bag of words to represent an utterance.
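As an illustration, a few of the HC_V features can be computed with scikit-learn as sketched below; the toy utterances, preprocessing, and exact feature definitions are assumptions for illustration, not the authors’ implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

utterances = [
    "we should open the booth near the gate",
    "the gate area gets the most visitors",
    "let us sell crepes then",
]

vec = TfidfVectorizer()
X = vec.fit_transform(utterances)        # one tf-idf vector per utterance

i = 1                                    # index of the target utterance
weights = X[i].toarray().ravel()
weights = weights[weights > 0]           # tf-idf weights of the words used

features = {
    "num_words": len(utterances[i].split()),
    "tfidf_avg": weights.mean(),
    "tfidf_var": weights.var(),
    "tfidf_max": weights.max(),
    "tfidf_min": weights.min(),
    # Cosine similarity between the entire meeting and the target utterance.
    "sim_meeting": cosine_similarity(
        vec.transform([" ".join(utterances)]), X[i])[0, 0],
}
print(features)
```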
5.2. Verbal Model using Deep Learning (V Model)
6. Model Evaluation (and Verbal-Nonverbal Fusion Models)
6.1. Overview of Evaluation Method
- LU: To define the longest utterances in each meeting, we sorted the utterances by duration and then set the threshold at which the F-measure was highest. As a result, the longest 44% of utterances were selected as important utterances.
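One way to realize this baseline is to sweep a duration threshold over all observed durations and keep the value that maximizes the F-measure against the annotated labels. A minimal sketch, with illustrative variable names:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_duration_threshold(durations: np.ndarray,
                            labels: np.ndarray) -> float:
    """Return the duration threshold that maximizes the F-measure when
    every utterance at least that long is predicted as important."""
    best_t, best_f = 0.0, -1.0
    for t in np.unique(durations):
        pred = (durations >= t).astype(int)
        f = f1_score(labels, pred)
        if f > best_f:
            best_t, best_f = t, f
    return best_t
```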
6.2. Evaluation of Hand-Crafted Feature Models
- SP/OT: SP/OT features only
- PR: PR features only
- CO: CO features only
- SP/OT + PR: Union of SP/OT and PR features
- SP/OT + CO: Union of SP/OT and CO features
- PR + CO: Union of PR and CO features
- NV-ALL: Union of SP/OT, PR, and CO features
- HC_V: HC_V features only
- BOW: BOW features only
- V_ALL: Union of HC_V and BOW features
- HC_V-SP/OT: Early fusion model of the best hand-crafted nonverbal model (SP/OT) and HC_V (see the fusion sketch after this list).
- BOW-SP/OT: Early fusion model of SP/OT and BOW.
- V_ALL-SP/OT: Early fusion model of SP/OT, HC_V, and BOW.
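Early fusion here means concatenating the per-utterance feature vectors of the component models before training a single classifier. A minimal sketch with synthetic data; the classifier choice and feature dimensionalities are placeholders, not the authors’ configuration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_spot = rng.random((100, 13))        # SP/OT nonverbal features
X_hcv = rng.random((100, 12))         # HC_V verbal features
y = rng.integers(0, 2, 100)           # important / unimportant labels

# Early fusion: concatenate the feature vectors, then train one classifier.
X_fused = np.hstack([X_spot, X_hcv])
clf = SVC().fit(X_fused, y)
```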
6.3. Evaluation of Deep Learning Models
6.4. Comparison between Two Approaches
6.5. Performance using Manually Segmented and Transcribed Data
6.6. Discussion
6.6.1. Characteristics of Deep Learning Models
6.6.2. Toward Meeting Summarization
7. Multimodal Meeting Browser
7.1. System Design
7.2. Conducting User Experiment
7.2.1. Hypotheses and conditions
- H1: The multimodal meeting browser allows the users to understand the content of the discussion better than the text-based meeting browser.
- H2: The multimodal meeting browser allows the users to understand the role of each participant better than the text-based meeting browser.
- H3: The users’ impressions of the multimodal browser are better than those of the text-based browser.
7.2.2. Task
7.2.3. Procedure
7.3. Results
8. Conclusions and General Discussion
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Appendix A
Metric | Test | Comparison | Result
---|---|---|---
Prec | ANOVA | | F(1.608, 24.123) = 19.687, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
Rec | ANOVA | | F(2.326, 34.893) = 40.803, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
F1 | ANOVA | | F(1.453, 21.794) = 36.295, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
Acc | ANOVA | | F(1.726, 25.887) = 29.378, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
Metric | Test | Comparison | Result
---|---|---|---
Prec | ANOVA | | F(2.457, 36.858) = 19.869, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
Rec | ANOVA | | F(2.834, 42.51) = 12.691, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
F1 | ANOVA | | F(2.147, 32.202) = 18.524, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
Acc | ANOVA | | F(2.539, 38.086) = 23.908, p < 0.05
 | Post-hoc test | Proposed models vs. baseline |
 | | Handcrafted feature vs. deep learning |
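The fractional degrees of freedom in these tables indicate a sphericity-corrected repeated-measures ANOVA (e.g., Greenhouse-Geisser). For illustration only, such a corrected one-way repeated-measures ANOVA over per-fold scores can be run with the pingouin library; the library choice, data layout, and all numbers below are assumptions, not the authors’ code or data:

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Long-format toy data: one F-measure per (fold, model) pair.
rng = np.random.default_rng(0)
models = ["LU", "SP/OT", "NV"]
rows = [{"fold": s, "model": m,
         "f1": 0.6 + 0.05 * i + rng.normal(scale=0.02)}
        for s in range(16) for i, m in enumerate(models)]
df = pd.DataFrame(rows)

# One-way repeated-measures ANOVA; correction=True applies the
# Greenhouse-Geisser sphericity correction, yielding fractional dfs.
aov = pg.rm_anova(data=df, dv="f1", within="model",
                  subject="fold", correction=True)
print(aov)
```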
References
- Murray, G.; Carenini, G. Summarizing Spoken and Written Conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Honolulu, HI, USA, 25–27 October 2008; Association for Computational Linguistics: Stroudsburg, PA, USA, 2008; pp. 773–782.
- Xie, S.; Hakkani-Tur, D.; Favre, B.; Liu, Y. Integrating prosodic features in extractive meeting summarization. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Merano, Italy, 13–17 December 2009; pp. 387–391.
- Wang, L.; Cardie, C. Focused Meeting Summarization via Unsupervised Relation Extraction. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Seoul, Korea, 5–6 July 2012; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 304–313.
- Aran, O.; Gatica-Perez, D. One of a Kind: Inferring Personality Impressions in Meetings. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, Sydney, Australia, 9–13 December 2013; ACM: New York, NY, USA, 2013; pp. 11–18.
- Nicolaou, M.A.; Gunes, H.; Pantic, M. Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space. IEEE Trans. Affect. Comput. 2011, 2, 92–105.
- Hinton, G.E.; Osindero, S.; Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554.
- Le, Q.V. Building high-level features using large scale unsupervised learning. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8595–8598.
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
- Bengio, Y.; Lamblin, P.; Popovici, D.; Larochelle, H. Greedy Layer-wise Training of Deep Networks. In Proceedings of the 19th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; MIT Press: Cambridge, MA, USA, 2006; pp. 153–160.
- Pan, J.; Sayrol, E.; Giro-i-Nieto, X.; McGuinness, K.; O’Connor, N.E. Shallow and Deep Convolutional Networks for Saliency Prediction. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 598–606.
- Sainath, T.N.; Weiss, R.J.; Senior, A.W.; Wilson, K.W.; Vinyals, O. Learning the speech front-end with raw waveform CLDNNs. In Proceedings of the INTERSPEECH-2015, Dresden, Germany, 6–10 September 2015; pp. 1–5.
- Golik, P.; Tüske, Z.; Schlüter, R.; Ney, H. Convolutional neural networks for acoustic modeling of raw time signal in LVCSR. In Proceedings of the INTERSPEECH-2015, Dresden, Germany, 6–10 September 2015; pp. 26–30.
- Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, New York, NY, USA, 6–9 June 2016; ACM: New York, NY, USA, 2016; pp. 281–284.
- Nojavanasghari, B.; Gopinath, D.; Koushik, J.; Baltrušaitis, T.; Morency, L.-P. Deep Multimodal Fusion for Persuasiveness Prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; ACM: New York, NY, USA, 2016; pp. 284–288.
- Carletta, J.; Ashby, S.; Bourban, S.; Flynn, M.; Guillemot, M.; Hain, T.; Kadlec, J.; Karaiskos, V.; Kraaij, W.; Kronenthal, M.; et al. The AMI Meeting Corpus: A Pre-announcement. In Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction, Edinburgh, UK, 11–13 July 2005; Springer-Verlag: Berlin/Heidelberg, Germany, 2006; pp. 28–39.
- Burger, S.; MacLaren, V.; Yu, H. The ISL meeting corpus: The impact of meeting type on speech style. In Proceedings of the International Conference on Spoken Language Processing, Denver, CO, USA, 16–20 September 2002; pp. 301–304.
- Sanchez-Cortes, D.; Aran, O.; Jayagopi, D.B.; Schmid Mast, M.; Gatica-Perez, D. Emergent leaders through looking and speaking: From audio-visual data to multimodal recognition. J. Multimodal User Interfaces 2013, 7, 39–53.
- Litman, D.; Paletz, S.; Rahimi, Z.; Allegretti, S.; Rice, C. The Teams Corpus and Entrainment in Multi-Party Spoken Dialogues. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–4 November 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1421–1431.
- Koutsombogera, M.; Vogel, C. Modeling Collaborative Multimodal Behavior in Group Dialogues: The MULTISIMO Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., et al., Eds.; European Language Resources Association (ELRA): Paris, France, 2018.
- Janin, A.; Baron, D.; Edwards, J.; Ellis, D.; Gelbart, D.; Morgan, N.; Peskin, B.; Pfau, T.; Shriberg, E.; Stolcke, A.; et al. The ICSI Meeting Corpus. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), Hong Kong, China, 6–10 April 2003; Volume 1, pp. I-364–I-367.
- Oertel, C.; Cummins, F.; Edlund, J.; Wagner, P.; Campbell, N. D64: A corpus of richly recorded conversational interaction. J. Multimodal User Interfaces 2013, 7, 19–28.
- Otsuka, K.; Yamato, J.; Takemae, Y.; Murase, H. Quantifying Interpersonal Influence in Face-to-face Conversations Based on Visual Attention Patterns. In Proceedings of the CHI’06 Extended Abstracts on Human Factors in Computing Systems, Montreal, QC, Canada, 22–27 April 2006; ACM: New York, NY, USA, 2006; pp. 1175–1180.
- Basu, S.; Choudhury, T.; Clarkson, B.; Pentland, A. Towards measuring human interactions in conversational settings. In Proceedings of the IEEE Int’l Workshop on Cues in Communication (CUES 2001) at CVPR 2001, Kauai, HI, USA, 9 December 2001.
- Dong, W.; Lepri, B.; Cappelletti, A.; Pentland, A.S.; Pianesi, F.; Zancanaro, M. Using the Influence Model to Recognize Functional Roles in Meetings. In Proceedings of the 9th International Conference on Multimodal Interfaces, Nagoya, Japan, 12–15 November 2007; ACM: New York, NY, USA, 2007; pp. 271–278.
- Bales, R.F. Personality and Interpersonal Behavior; Holt, Rinehart & Winston: Oxford, UK, 1970.
- Rienks, R.; Zhang, D.; Gatica-Perez, D.; Post, W. Detection and Application of Influence Rankings in Small Group Meetings. In Proceedings of the 8th International Conference on Multimodal Interfaces, Banff, AB, Canada, 13 November 2006; ACM: New York, NY, USA, 2006; pp. 257–264.
- Hung, H.; Jayagopi, D.B.; Ba, S.; Odobez, J.-M.; Gatica-Perez, D. Investigating Automatic Dominance Estimation in Groups from Visual Attention and Speaking Activity. In Proceedings of the 10th International Conference on Multimodal Interfaces, Chania, Greece, 20–22 October 2008; ACM: New York, NY, USA, 2008; pp. 233–236.
- Jayagopi, D.B.; Hung, H.; Yeo, C.; Gatica-Perez, D. Modeling Dominance in Group Conversations Using Nonverbal Activity Cues. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 501–513.
- Escalera, S.; Pujol, O.; Radeva, P.; Vitrià, J.; Anguera, M.T. Automatic Detection of Dominance and Expected Interest. EURASIP J. Adv. Signal Process. 2010, 2010, 491819.
- Lepri, B.; Subramanian, R.; Kalimeri, K.; Staiano, J.; Pianesi, F.; Sebe, N. Connecting Meeting Behavior with Extraversion—A Systematic Study. IEEE Trans. Affect. Comput. 2012, 3, 443–455.
- Staiano, J.; Lepri, B.; Subramanian, R.; Sebe, N.; Pianesi, F. Automatic Modeling of Personality States in Small Group Interactions. In Proceedings of the 19th ACM International Conference on Multimedia, Scottsdale, AZ, USA, 28 November–1 December 2011; ACM: New York, NY, USA, 2011; pp. 989–992.
- Jayagopi, D.; Sanchez-Cortes, D.; Otsuka, K.; Yamato, J.; Gatica-Perez, D. Linking Speaking and Looking Behavior Patterns with Group Composition, Perception, and Performance. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, Santa Monica, CA, USA, 22–26 October 2012; ACM: New York, NY, USA, 2012; pp. 433–440.
- Radev, D.R.; Jing, H.; Styś, M.; Tam, D. Centroid-based Summarization of Multiple Documents. Inf. Process. Manag. 2004, 40, 919–938.
- Carbonell, J.; Goldstein, J. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 335–336.
- Gong, Y.; Liu, X. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, LA, USA, 9–12 September 2001; ACM: New York, NY, USA, 2001; pp. 19–25.
- Carenini, G.; Murray, G.; Ng, R. Methods for Mining and Summarizing Text Conversations; Synthesis Lectures on Data Management; Morgan & Claypool Publishers LLC: Williston, VT, USA, 2011; Volume 3, pp. 1–130.
- Wan, S.; McKeown, K. Generating Overview Summaries of Ongoing Email Thread Discussions. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004.
- Isonuma, M.; Fujino, T.; Mori, J.; Matsuo, Y.; Sakata, I. Extractive Summarization Using Multi-Task Learning with Document Classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; Association for Computational Linguistics: Copenhagen, Denmark, 2017; pp. 2101–2110.
- Cao, Z.; Wei, F.; Dong, L.; Li, S.; Zhou, M. Ranking with Recursive Neural Networks and Its Application to Multi-document Summarization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; AAAI Press: Menlo Park, CA, USA, 2015; pp. 2153–2159.
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Volume 2: Short Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 427–431.
- Cheng, J.; Lapata, M. Neural Summarization by Extracting Sentences and Words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1: Long Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 484–494.
- Wang, L.; Cardie, C. Domain-Independent Abstract Generation for Focused Meeting Summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013; Volume 1: Long Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 1395–1405.
- Singla, K.; Stepanov, E.; Bayer, A.O.; Carenini, G.; Riccardi, G. Automatic Community Creation for Abstractive Spoken Conversations Summarization. In Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, 7 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 43–47.
- Murray, G. Abstractive Meeting Summarization as a Markov Decision Process. In Proceedings of the Advances in Artificial Intelligence, Abbotsford, BC, Canada, 2–5 June 2015; Barbosa, D., Milios, E., Eds.; Springer: Cham, Switzerland, 2015; pp. 212–219.
- Zhao, Z.; Pan, H.; Fan, C.; Liu, Y.; Li, L.; Yang, M.; Cai, D. Abstractive Meeting Summarization via Hierarchical Adaptive Segmental Network Learning. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; ACM: New York, NY, USA, 2019; pp. 3455–3461.
- Maskey, S.; Hirschberg, J. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Proceedings of the INTERSPEECH-2005, Lisbon, Portugal, 4–8 September 2005; pp. 621–624.
- Waibel, A.; Bett, M.; Finke, M.; Stiefelhagen, R. Meeting browser: Tracking and summarizing meetings. In Proceedings of the DARPA Broadcast News Workshop, Pittsburgh, PA, USA, 8–11 February 1998; pp. 281–286.
- Galley, M. A Skip-chain Conditional Random Field for Ranking Meeting Utterances by Importance. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, 22–23 July 2006; Association for Computational Linguistics: Stroudsburg, PA, USA, 2006; pp. 364–372.
- Murray, G.; Renals, S.; Carletta, J. Extractive Summarization of Meeting Recordings. In Proceedings of the INTERSPEECH-2005, Lisbon, Portugal, 4–8 September 2005; pp. 593–596.
- Koumpis, K.; Renals, S. Automatic Summarization of Voicemail Messages Using Lexical and Prosodic Features. ACM Trans. Speech Lang. Process. 2005, 2.
- Murray, G. Using Speech-Specific Characteristics for Automatic Speech Summarization. Ph.D. Thesis, University of Edinburgh, Edinburgh, UK, 2007.
- Zhu, X.; Penn, G.; Rudzicz, F. Summarizing Multiple Spoken Documents: Finding Evidence from Untranscribed Audio. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2–7 August 2009; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; Volume 2, pp. 549–557.
- Murray, G.; Renals, S.; Carletta, J.; Moore, J. Evaluating automatic summaries of meeting recordings. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 33–40.
- Erol, B.; Lee, D.S.; Hull, J. Multimodal summarization of meeting recordings. In Proceedings of the 2003 International Conference on Multimedia and Expo, ICME ’03, Baltimore, MD, USA, 6–9 July 2003; Volume 3, pp. 25–28.
- Li, H.; Zhu, J.; Ma, C.; Zhang, J.; Zong, C. Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1092–1102.
- Gatica-Perez, D.; McCowan, I.A.; Zhang, D.; Bengio, S. Detecting Group Interest-level in Meetings. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, 23 March 2005.
- Wrede, B.; Shriberg, E. Spotting “Hot Spots” in Meetings: Human Judgments and Prosodic Cues. In Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003—INTERSPEECH 2003), Geneva, Switzerland, 1–4 September 2003; pp. 2805–2808.
- Wang, X.; Liu, Y.; Sun, C.; Wang, B.; Wang, X. Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 27–31 July 2015; Volume 1: Long Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 1343–1353.
- Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.-P. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1: Long Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 873–883.
- Shen, Y.; Huang, X. Attention-Based Convolutional Neural Network for Semantic Relation Extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2526–2536.
- Poria, S.; Cambria, E.; Gelbukh, A.F. Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-level Multimodal Sentiment Analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 2539–2544.
- Murray, G.; Carenini, G.; Ng, R. Generating and Validating Abstracts of Meeting Conversations: A User Study. In Proceedings of the 6th International Natural Language Generation Conference, Trim Castle, Ireland, 7–9 July 2010.
- Hsueh, P.-Y.; Moore, J.D. Improving Meeting Summarization by Focusing on User Needs: A Task-oriented Evaluation. In Proceedings of the 14th International Conference on Intelligent User Interfaces, Sanibel Island, FL, USA, 8–11 February 2009; ACM: New York, NY, USA, 2009; pp. 17–26.
- Tucker, S.; Whittaker, S. Have a Say over What You See: Evaluating Interactive Compression Techniques. In Proceedings of the 14th International Conference on Intelligent User Interfaces, Sanibel Island, FL, USA, 8–11 February 2009; ACM: New York, NY, USA, 2009; pp. 37–46.
- Costa, P.T.; McCrae, R.R. Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI); Psychological Assessment Resources: Lutz, FL, USA, 1992.
- Nihei, F.; Nakano, Y.I.; Hayashi, Y.; Hung, H.-H.; Okada, S. Predicting Influential Statements in Group Discussions Using Speech and Head Motion Information. In Proceedings of the 16th International Conference on Multimodal Interaction, Istanbul, Turkey, 12–14 November 2014; ACM: New York, NY, USA, 2014; pp. 136–143.
- Vahdatpour, A.; Amini, N.; Sarrafzadeh, M. Toward Unsupervised Activity Discovery Using Multi-dimensional Motif Detection in Time Series. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, CA, USA, 11–17 July 2009; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2009; pp. 1261–1266.
- Fan, Y.; Lu, X.; Li, D.; Liu, Y. Video-based Emotion Recognition Using CNN-RNN and C3D Hybrid Networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; ACM: New York, NY, USA, 2016; pp. 445–450.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
- Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 439–448.
- Nguyen, L.S.; Frauendorfer, D.; Mast, M.S.; Gatica-Perez, D. Hire me: Computational Inference of Hirability in Employment Interviews Based on Nonverbal Behavior. IEEE Trans. Multimed. 2014, 16, 1018–1031.
- Cao, Z.; Wei, F.; Li, S.; Li, W.; Zhou, M.; Wang, H. Learning Summary Prior Representation for Extractive Summarization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 27–31 July 2015; Volume 2: Short Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 829–833.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1: Long and Short Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
Type of Utterance | Number of Utterances |
---|---|
Important utterance (positive) | 5268 |
Unimportant utterance (negative) | 10,245 |
Total | 15,513 |
Type of Utterance | Number of Utterances |
---|---|
Important utterance (positive) | 3789 |
Unimportant utterance (negative) | 5150 |
Total | 8939 |
Category | Speaker | Others
---|---|---
Visual attention | |
Head motion | |
Speech | |
Total | 21 (= 8 for attention + 3 for head motion + 10 for speech) | 17 (= 8 for attention + 3 for head motion + 6 for speech)
Feature Category | Num. of Features Satisfying the Two Conditions |
---|---|
SP/OT (38 features) | 13 |
PR (144 features) | 38 |
CO (153 features) | 45 |
Total | 96 |
Category | Examples |
---|---|---
SP/OT | Speaker feature | Others feature
PR | Speaker feature | Others feature
CO | Component of co-occurrence pattern | Co-occurrence pattern
Layer | Parameter | 2D-CNN | 3D-CNN | AlexNet-Based CNN
---|---|---|---|---
Input layer | Size | Depends on modality | Depends on modality | Depends on modality
Convolution layer | Kernel size | Depends on modality | Depends on modality | Depends on modality
 | Num. of kernels | C1: 32, C2: 32 | C1: 32, C2: 32 | C1: 24, C2: 64, C3: 96, C4: 96, C5: 64
 | Activation function | ReLU | ReLU | ReLU
Pooling layer | Filter size | Depends on modality | Depends on modality | Depends on modality
Fully connected layer | Num. of neurons | FC1: 128, FC2: 128 | FC1: 128, FC2: 128 | FC1: 256, FC2: 256
 | Activation function | ReLU | ReLU | ReLU
 | Dropout | P1-FC1: 0.25, FC1-FC2: 0.5, FC2-softmax: 0.5 | P1-FC1: 0.25, FC1-FC2: 0.5, FC2-softmax: 0.5 | FC1-FC2: 0.5, FC2-softmax: 0.5
Input | Input Vector Size | Convolution Kernel Size | Pooling Filter Size |
---|---|---|---|
SS | 750, 32, 4, 1 | 5, 3, 4 | 2, 2, 1 |
HS | 450, 15, 4, 1 | 3, 3, 4 | 2, 2, 1 |
SI | 1500, 1, 4, 1 | 10, 1, 4 | 2, 2, 1 |
HP | 450, 3, 4, 2 | 3, 3, 4 | 2, 2, 1 |
Layer | Parameter | Configuration
---|---|---
Input layer | Size | 28, 219, 1
Convolution layer | Kernel size | 3, 219
 | Num. of kernels | C1: 32, C2: 32
 | Activation function | C1: ReLU, C2: ReLU
Pooling layer | Filter size | 2, 1
Fully connected layer | Num. of neurons | FC1: 128, FC2: 128
 | Activation function | FC1: ReLU, FC2: ReLU
 | Dropout | P1-FC1: 0.25, FC1-FC2: 0.5, FC2-softmax: 0.5
Category | Model | Precision | Recall | F-Measure | Accuracy
---|---|---|---|---|---
Baseline | LU | 0.552 | 0.715 | 0.623 | 0.707
Verbal models | HC_V | 0.634 | 0.566 | 0.598 | 0.742
 | BOW | 0.638 | 0.533 | 0.581 | 0.739
 | V_ALL | 0.644 | 0.572 | 0.606 | 0.747
Nonverbal models | SP/OT | 0.668 | 0.750 | 0.707 | 0.789
 | PR | 0.655 | 0.703 | 0.678 | 0.773
 | CO | 0.599 | 0.678 | 0.636 | 0.736
 | SP/OT+PR | 0.668 | 0.732 | 0.698 | 0.785
 | SP/OT+CO | 0.670 | 0.744 | 0.705 | 0.788
 | PR+CO | 0.656 | 0.698 | 0.676 | 0.773
 | NV-ALL | 0.669 | 0.720 | 0.694 | 0.784
Category | Model | Precision | Recall | F-Measure | Accuracy
---|---|---|---|---|---
Baseline | LU | 0.552 | 0.715 | 0.623 | 0.707
Best nonverbal model | SP/OT | 0.668 | 0.750 | 0.707 | 0.789
Verbal and nonverbal models | HC_V-SP/OT | 0.680 | 0.619 | 0.648 | 0.772
 | BOW-SP/OT | 0.658 | 0.568 | 0.610 | 0.753
 | V_ALL-SP/OT | 0.665 | 0.584 | 0.622 | 0.759
Network Structure | Model | Precision | Recall | F-Measure | Accuracy
---|---|---|---|---|---
Baseline | LU | 0.552 | 0.715 | 0.623 | 0.707
2D-CNN | HS | 0.598 | 0.714 | 0.651 | 0.740
 | SS | 0.703 | 0.781 | 0.740 | 0.814
 | SI | 0.654 | 0.771 | 0.708 | 0.784
 | HP | 0.555 | 0.638 | 0.594 | 0.703
 | NV | 0.670 | 0.789 | 0.725 | 0.797
AlexNet-based CNN | HS | 0.657 | 0.630 | 0.643 | 0.763
 | SS | 0.702 | 0.789 | 0.743 | 0.814
 | SI | 0.703 | 0.749 | 0.726 | 0.808
 | HP | 0.618 | 0.640 | 0.629 | 0.744
 | NV | 0.709 | 0.830 | 0.765 | 0.827
3D-CNN | HS | 0.668 | 0.666 | 0.667 | 0.774
 | SS | 0.695 | 0.821 | 0.753 | 0.817
 | SI | 0.696 | 0.797 | 0.743 | 0.813
 | HP | 0.601 | 0.696 | 0.645 | 0.740
 | NV | 0.732 | 0.842 | 0.783 | 0.841
Verbal model | V | 0.731 | 0.750 | 0.741 | 0.822
Network Structure | Model | Precision | Recall | F-Measure | Accuracy
---|---|---|---|---|---
Baseline | LU | 0.552 | 0.715 | 0.623 | 0.707
3D-CNN | HS | 0.668 | 0.666 | 0.667 | 0.774
 | SS | 0.695 | 0.821 | 0.753 | 0.817
 | SI | 0.696 | 0.797 | 0.743 | 0.813
 | HP | 0.601 | 0.696 | 0.645 | 0.740
 | NV | 0.732 | 0.842 | 0.783 | 0.841
Verbal and nonverbal fusion model | V-NV | 0.761 | 0.786 | 0.773 | 0.843
Category | Model | Precision | Recall | F-Measure | Accuracy
---|---|---|---|---|---
Baseline | LU | 0.552 | 0.715 | 0.623 | 0.707
Models based on hand-crafted features | V_ALL | 0.644 | 0.572 | 0.606 | 0.747
 | SP/OT | 0.668 | 0.750 | 0.707 | 0.789
 | HC_V-SP/OT | 0.680 | 0.619 | 0.648 | 0.772
Models based on deep learning | V | 0.731 | 0.750 | 0.741 | 0.822
 | NV | 0.732 | 0.842 | 0.783 | 0.841
 | V-NV | 0.761 | 0.786 | 0.773 | 0.843
Category | Model | Precision | Recall | F-Measure | Accuracy
---|---|---|---|---|---
Baseline | LU | 0.585 | 0.828 | 0.686 | 0.678
Models based on hand-crafted features | V_ALL | 0.686 | 0.728 | 0.707 | 0.744
 | SP/OT | 0.729 | 0.720 | 0.725 | 0.767
 | HC_V-NV | 0.743 | 0.757 | 0.750 | 0.785
Models based on deep learning | V | 0.765 | 0.806 | 0.785 | 0.813
 | NV | 0.777 | 0.844 | 0.809 | 0.831
 | V-NV | 0.807 | 0.847 | 0.827 | 0.850
Num. of Words | Tuple of POS Tags | Specific Example
---|---|---
3 | P, V, N | Visit Tsukiji. Tsukiji/it/te
 | P, C, N | Well then, Tokyo. ja/Tokyo/de
4 | P*2, N*2 | Pancake or fried chicken. panke-ki/ka/karaage/ka
5 | P, AUX, V, N*2 | Do we go to the Imperial Palace? ko-kyo/iku/n/desu/ka?
Num. of Words | Tuple of POS Tags | Specific Example
---|---|---
1 | Int., F, N, Adv., V, C, Adj. | um (Aa). right (So-desune). zoo (Do-butsuen).
2 | Int., N | Mount Fuji, I see. Fujisan/naruhodo
3 | Int., P, N | Um, popcorn. A/poppuko-n/ka
4 | P, AUX, N*2 | That sounds good. yosa/ge/desu/kedo
Category | Role Description | Cor (Simple, Multimodal) | Cor (Simple, Text)
---|---|---|---
Orienter | A person who orients the group by introducing the agenda | 0.91 | 0.82
 | A person who defines goals and procedures | 0.82 | 1.00
 | A person who keeps the group focused and on track and summarizes the most important arguments and group decisions | 0.82 | 0.57
Giver | A person who provides factual information and answers to questions | −0.14 | 0.86
 | A person who states his/her beliefs and attitudes about an idea | 0.39 | 0.00
 | A person who expresses personal values and offers factual information | 0.82 | 0.28
Seeker | A person who requests information | 0.67 | 0.17
 | A person who requests clarifications | 0.66 | 0.66
Follower | A person who does not actively participate in the interaction | 0.38 | 0.91
Attacker | A person who deflates the status of others | −0.44 | −0.17
 | A person who expresses disapproval | 0.92 | 0.56
 | A person who attacks the group or the problem | 0.00 | 0.30
Gate Keeper | A person who is the moderator within the group | 0.25 | −0.24
 | A person who encourages and facilitates participation | 0.82 | 0.82
 | A person who regulates the flow of communication | 0.82 | 0.66
Protagonist | A person who takes the floor | 0.82 | 0.38
 | A person who drives the conversation | 0.75 | 0.66
 | A person who assumes a personal perspective and asserts his/her authority | 0.44 | −0.22
Supporter | A person who shows a cooperative attitude, manifesting understanding, attention, and acceptance of others | −0.67 | 0.50
 | A person who provides technical and relational support | 0.03 | 0.69
Neutral Role | A person who passively accepts the ideas of others | 0.38 | 0.30
 | A person who serves as an audience in a group discussion | 0.67 | 0.61
Questionnaire Item | Multimodal | Text | t-test |
---|---|---|---|
Ease of use | 4.1 | 3.1 | t(18) = 1.945, p < 0.1 |
Ease of search | 3.7 | 3.1 | t(18) = 0.868, n.s. |
Efficiency in finding all relevant information | 4.0 | 3.5 | t(18) = 0.921, n.s. |
General task comprehension | 4.3 | 3.4 | t(18) = 2.242, p < 0.05 |
Task success | 3.7 | 2.7 | t(18) = 2.224, p < 0.05 |
Task difficulty | 3.0 | 2.6 | t(18) = 0.802, n.s. |
Perceived pressure | 2.7 | 2.5 | t(18) = 0.418, n.s. |
Usefulness of the browser | 4.5 | 3.5 | t(18) = 2.301, p < 0.05 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).