People increasingly express their emotions in diverse forms, and multimodal sentiment analysis has emerged in response. Visual information can be extracted from either images or videos; how to effectively measure the sentiment it conveys, and how to combine it with the textual modality to carry out the multimodal sentiment analysis task, are questions well worth exploring in this field. This paper begins with a survey of the relevant multimodal datasets, giving a detailed overview of several classical datasets, and then reviews feature extraction methods for each modality, modality fusion techniques, and recent deep-learning-based approaches. Finally, an outlook on future research is offered, providing guidance for selecting well-matched datasets and for designing multimodal sentiment analysis systems.
The Institute for Creative Technologies Multi-Modal Movie Opinion (ICT-MMMO) dataset was created by Wöllmer et al. [4] in 2013. It comprises 370 online English-language movie review videos collected from two websites, YouTube and ExpoTV. Each video is annotated with one of five labels: strongly positive, weakly positive, neutral, strongly negative, or weakly negative.
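As a minimal illustration of how this five-class label scheme might be consumed in practice, the following Python sketch maps the ICT-MMMO labels onto an ordinal scale for use as classifier targets. The numeric encoding and the "label" field name are hypothetical choices made for illustration; they are not part of the dataset's official release.

# Minimal sketch: encode ICT-MMMO's five sentiment labels as ordinal
# targets. The -2..2 scale and the "label" field are assumptions made
# for illustration, not taken from the dataset release.
ICT_MMMO_LABELS = {
    "strongly negative": -2,
    "weakly negative": -1,
    "neutral": 0,
    "weakly positive": 1,
    "strongly positive": 2,
}

def encode_labels(reviews):
    # Each review is assumed to be a dict carrying a textual "label".
    return [ICT_MMMO_LABELS[review["label"]] for review in reviews]

# Hypothetical usage with three annotated reviews:
sample = [
    {"label": "neutral"},
    {"label": "strongly positive"},
    {"label": "weakly negative"},
]
print(encode_labels(sample))  # -> [0, 2, -1]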
References
Zhang, Y.Z., Rong, L., Song, D.W. and Zhang, P. (2020) A Survey of Multimodal Sentiment Analysis. Pattern Recognition and Artificial Intelligence, 33, 426-438. (in Chinese)
Morency, L.-P., Mihalcea, R. and Doshi, P. (2011) Towards Multimodal Sentiment Analysis: Harvesting Opinions from the Web. Proceedings of the 13th International Conference on Multimodal Interfaces, Alicante, 14-18 November 2011, 169-176.
Perez-Rosas, V., Mihalcea, R. and Morency, L.-P. (2013) Utterance-Level Multimodal Sentiment Analysis. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Volume 1, 973-982.
Wöllmer, M., Weninger, F., Knaup, T., Schuller, B., Sun, C., Sagae, K. and Morency, L.-P. (2013) YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context. IEEE Intelligent Systems, 28, 46-53. https://doi.org/10.1109/MIS.2013.34
Park, S.S., et al. (2016) Multimodal Analysis and Prediction of Persuasiveness in Online Social Multimedia. ACM Transactions on Interactive Intelligent Systems, 6, Article 25. https://doi.org/10.1145/2897739
Zadeh, A., Zellers, R., Pincus, E. and Morency, L.-P. (2016) MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv:1606.06259.
Zadeh, A.A.B., Liang, P.P., Poria, S., Cambria, E. and Morency, L.-P. (2018) Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Volume 1, 2236-2246. https://doi.org/10.18653/v1/P18-1208
Yu, W., et al. (2020) CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotation of Modality. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020, 3718-3727. https://doi.org/10.18653/v1/2020.acl-main.343
Dash, A.K., Rout, J.K. and Jena, S.K. (2016) Harnessing Twitter for Automatic Sentiment Identification Using Machine Learning Techniques. Proceedings of 3rd International Conference on Advanced Computing, Networking and Informatics, Vol. 44, 507-514. https://doi.org/10.1007/978-81-322-2529-4_53
Vinodhini, G. and Chandrasekaran, R.M. (2016) A Comparative Performance Evaluation of a Neural Network-Based Approach for Sentiment Classification of Online Reviews. Journal of King Saud University - Computer and Information Sciences, 28, 2-12. https://doi.org/10.1016/j.jksuci.2014.03.024
Kaibi, I. and Nfaoui, E.H. (2019) A Comparative Evaluation of Word Embeddings Techniques for Twitter Sentiment Analysis. 2019 International Conference on Wireless Technologies, Embedded and Intelligent Systems (WITS), Fez, 3-4 April 2019, 1-4. https://doi.org/10.1109/WITS.2019.8723864
Ahuja, R., Chug, A., Kohli, S., Gupta, S. and Ahuja, P. (2019) The Impact of Features Extraction on the Sentiment Analysis. Procedia Computer Science, 152, 341-348. https://doi.org/10.1016/j.procs.2019.05.008
Mohey, D. (2016) Enhancement Bag-of-Words Model for Solving the Challenges of Sentiment Analysis. International Journal of Advanced Computer Science and Applications, 7, 244-252. https://doi.org/10.14569/IJACSA.2016.070134
Poria, S., Cambria, E., Hussain, A. and Huang, G.-B. (2015) Towards an Intelligent Framework for Multimodal Affective Data Analysis. Neural Networks, 63, 104-116. https://doi.org/10.1016/j.neunet.2014.10.005
Piana, S., Staglianó, A., Odone, F., Verri, A. and Camurri, A. (2014) Real-Time Automatic Emotion Recognition from Body Gestures.
Noroozi, F., Corneanu, C.A., Kaminska, D., Sapinski, T., Escalera, S. and Anbarjafari, G. (2018) Survey on Emotional Body Gesture Recognition.
Yakaew, A., Dailey, M. and Racharak, T. (2021) Multimodal Sentiment Analysis on Video Streams Using Lightweight Deep Neural Networks. Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods, 4-6 February 2021, 442-451. https://doi.org/10.5220/0010304404420451
Song, K., Yao, T., Ling, Q., et al. (2018) Boosting Image Sentiment Analysis with Visual Attention. Neurocomputing, 312, 218-228. https://doi.org/10.1016/j.neucom.2018.05.104
Wang, R.W. and Meng, X.R. (2020) A Review of Research on Image Sentiment Analysis. Documentation, Information & Knowledge, No. 3, 119-127. (in Chinese)
Zhu, X.L. (2019) Research on Joint Image-Text Sentiment Analysis Based on Attention Mechanism. Master's Thesis, Southeast University, Nanjing. (in Chinese)
You, Q.Z., Jin, H.L. and Luo, J.B. (2017) Visual Sentiment Analysis by Attending on Local Image Regions. Thirty-First AAAI Conference on Artificial Intelligence, 31, 231-237. https://doi.org/10.1609/aaai.v31i1.10501
Mittal, N., Sharma, D., Joshi, M.L., et al. (2018) Image Sentiment Analysis Using Deep Learning. In: Proceedings of the 2018 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE, Piscataway, 684-687. https://doi.org/10.1109/WI.2018.00-11
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, 4-9 December 2017, 5998-6008.
Andayani, F., Theng, L.B., Tsun, M.T.K. and Chua, C. (2022) Hybrid LSTM-Transformer Model for Emotion Recognition From Speech Audio Files. IEEE Access, 10, 36018-36027. https://doi.org/10.1109/ACCESS.2022.3163856
Heusser, V., Freymuth, N., Constantin, S. and Waibel, A. (2019) Bimodal Speech Emotion Recognition Using Pre-Trained Language Models.
Jing, D., Manting, T. and Li, Z. (2021) Transformer-Like Model with Linear Attention for Speech Emotion Recognition. Journal of Southeast University, 37, 164-170.
Sakatani, Y. (2021) Combining RNN with Transformer for Modeling Multi-Leg Trips. ACM WSDM WebTour 2021, Jerusalem, 12 March 2021, 50-52.
Monkaresi, H., Hussain, M.S. and Calvo, R.A. (2012) Classification of Affects Using Head Movement, Skin Color Features and Physiological Signals. 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Seoul, 14-17 October 2012, 2664-2669. https://doi.org/10.1109/ICSMC.2012.6378149
Cai, G. and Xia, B. (2015) Convolutional Neural Networks for Multimedia Sentiment Analysis. In: Li, J., Ji, H., Zhao, D. and Feng, Y., Eds., Natural Language Processing and Chinese Computing, Vol. 9362, Springer International Publishing, Nanchang, 159-167. https://doi.org/10.1007/978-3-319-25207-0_14
Dobrisek, S., Gajsek, R., Mihelic, F., Pavesic, N. and Struc, V. (2013) Towards Efficient Multi-Modal Emotion Recognition. International Journal of Advanced Robotic Systems, 10, 53. https://doi.org/10.5772/54002
Wöllmer, M., Weninger, F., Knaup, T., Schuller, B., Sun, C., Sagae, K. and Morency, L.-P. (2013) YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context. IEEE Intelligent Systems, 28, 46-53. https://doi.org/10.1109/MIS.2013.34
Siddiquie, B., Chisholm, D. and Divakaran, A. (2015) Exploiting Multimodal Affect and Semantics to Identify Politically Persuasive Web Videos.
Mansoorizadeh, M. and Charkari, M. (2010) Multimodal Information Fusion Application to Human Emotion Recognition from Face and Speech. Multimedia Tools and Applications, 49, 277-297. https://doi.org/10.1007/s11042-009-0344-2
Lin, J.-C., Wu, C.-H. and Wei, W.-L. (2012) Error Weighted Semi-Coupled Hidden Markov Model for Audio-Visual Emotion Recognition. IEEE Transactions on Multimedia, 14, 142-156. https://doi.org/10.1109/TMM.2011.2171334
Zeng, Z., Hu, Y., Liu, M., Fu, Y. and Huang, T.S. (2006) The Training Combination Strategy of Multi-Stream Fused Hidden Markov Model for Audio-Visual Affect Recognition. Proceedings of the 14th Annual ACM International Conference on Multimedia, Santa Barbara, 23-27 October 2006, 65. https://doi.org/10.1145/1180639.1180661
Sebe, N., Cohen, I., Gevers, T. and Huang, T.S. (2006) Emotion Recognition Based on Joint Visual and Audio Cues. 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, 20-24 August 2006, 1136-1139. https://doi.org/10.1109/ICPR.2006.489
Song, M., Bu, J., Chen, C. and Li, N. (2004) Audio-Visual-Based Emotion Recognition—A New Approach. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, 1020-1025. https://doi.org/10.1109/CVPR.2004.1315276
Al-Azani, S. and El-Alfy, E.-S.M. (2020) Enhanced Video Analytics for Sentiment Analysis Based on Fusing Textual, Auditory and Visual Information. IEEE Access, 8. https://doi.org/10.1109/ACCESS.2020.3011977
Corradini, A., Mehta, M., Bernsen, N.O., Martin, J.C. and Abrilian, S. (2005) Multimodal Input Fusion in Human-Computer Interaction. In: Data Fusion for Situation Monitoring, Incident Detection, Alert and Response Management, IOS Press, Tsakhkadzor, 223-234.
Iyengar, G., Nock, H.J. and Neti, C. (2003) Audio-Visual Synchrony for Detection of Monologues in Video Archives. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, 6-10 April 2003, V-772. https://doi.org/10.1109/ICME.2003.1220921
Liu, B. (2019) Sentiment Analysis: Mining Opinions, Sentiments, and Emotions (Chinese Translation). China Machine Press, Beijing, 149-156.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G. and Hassabis, D. (2016) Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529, 484-489. https://doi.org/10.1038/nature16961
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. and Manzagol, P.-A. (2010) Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 11, 3371-3408.
Sak, H., Senior, A. and Beaufays, F. (2014) Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. Proceedings of Interspeech 2014, Singapore, 14-18 September 2014, 338-342. https://doi.org/10.21437/Interspeech.2014-80
Pal, S., Ghosh, S. and Nag, A. (2018) Sentiment Analysis in the Light of LSTM Recurrent Neural Networks. International Journal of Synthetic Emotions, 9, 33-39. https://doi.org/10.4018/IJSE.2018010103
Tang, D., Qin, B. and Feng, X. (2016) Effective LSTMs for Target-Dependent Sentiment Classification. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, December 2016, 3298-3307.
Basiri, M.E., Nemati, S. and Abdar, M. (2020) An Attention-Based Bidirectional CNN-RNN Deep Model for Sentiment Analysis. Future Generation Computer Systems, 115, 279-294. https://doi.org/10.1016/j.future.2020.08.005
Letarte, G., Paradis, F. and Giguère, P. (2018) Importance of Self-Attention for Sentiment Analysis. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, November 2018, 267-275. https://doi.org/10.18653/v1/W18-5429
Li, W., Qi, F. and Tang, M. (2020) Bidirectional LSTM with Self-Attention Mechanism and Multi-Channel Features for Sentiment Classification. Neurocomputing, 387, 63-77. https://doi.org/10.1016/j.neucom.2020.01.006
Xu, Q., Zhu, L. and Dai, T. (2020) Aspect-Based Sentiment Classification with Multiattention Network. Neurocomputing, 388, 135-143. https://doi.org/10.1016/j.neucom.2020.01.024
Cao, R., Ye, C. and Zhou, H. (2021) Multimodel Sentiment Analysis with Self-Attention. Proceedings of the Future Technologies Conference (FTC), Volume 1, 16-26. https://doi.org/10.1007/978-3-030-63128-4_2
Shenoy, A. and Sardana, A. (2020) Multilogue-Net: A Context Aware RNN for Multi-Modal Emotion Detection and Sentiment Analysis in Conversation. The 58th Annual Meeting of the Association for Computational Linguistics, Seattle, 5-10 July 2020, 19-28. https://doi.org/10.18653/v1/2020.challengehml-1.3
Mai, S., Hu, H. and Xing, S. (2019) Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, July 2019, 481-492. https://doi.org/10.18653/v1/P19-1046
Chauhan, D., Poria, S., Ekbal, A., et al. (2018) Contextual Inter-Modal Attention for Multi-Modal Sentiment Analysis. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, October-November 2018, 3454-3466.
Kim, T. (2020) Multi-Attention Multimodal Sentiment Analysis. ICMR’20 Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, 8-11 June 2020, 436-441.
Liang, P.P., Kolter, J.Z., Morency, L.-P., et al. (2019) Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, 28 July-2 August 2019, 6558-6569.