BiMER: Design and Implementation of a Bimodal Emotion Recognition System Enhanced by Data Augmentation Techniques


DİKBIYIK E., DEMİR Ö., DOĞAN B.

IEEE Access, vol.13, pp.64330-64352, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 13
  • Publication Date: 2025
  • DOI Number: 10.1109/access.2025.3559339
  • Journal Name: IEEE Access
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Page Numbers: pp.64330-64352
  • Keywords: Bimodal emotion recognition, data augmentation, IEMOCAP, intermediate fusion, real-time emotion recognition
  • Marmara University Affiliated: Yes

Abstract

Accurately recognizing and interpreting emotions is important in human-computer interaction. This study addresses emotion recognition on both speech and text data using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. First, the problems of a limited number of records and an unbalanced class distribution were addressed: a subset consisting of the improvised recordings in IEMOCAP was used, and data augmentation methods were applied to both the speech and the text data. Using the datasets balanced through augmentation, unimodal emotion recognition experiments were performed with models developed for Speech Emotion Recognition (SER) and Textual Emotion Recognition (TER). Subsequently, the features obtained from these two modalities were combined by intermediate fusion to achieve more comprehensive and more accurate emotion recognition, yielding the Bimodal Emotion Recognition (BiMER) system. The ResNet50-CRNN+AT model, which achieved the highest accuracy among the three models developed for SER, forms the speech modality of BiMER, while the Bidirectional Encoder Representations from Transformers (BERT) model used for TER forms the text modality. Supported by data augmentation, which improved the robustness and generalization ability of the model, BiMER reached 88.33% accuracy. Finally, the BiMER system was implemented as a real-time web application using the Flask framework, and the application's ability to recognize emotions interactively through its user interface was tested.
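
The intermediate fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature sizes, the random weights, the four-emotion label set, and the function name `fuse_and_classify` are all assumptions chosen for illustration. In the actual system the two feature vectors would come from the ResNet50-CRNN+AT speech model and the BERT text model; here they are random placeholders. The idea shown is only that the two unimodal feature vectors are concatenated before a shared classification layer, rather than combining raw inputs (early fusion) or final predictions (late fusion).

```python
import math
import random

# Assumed label set for illustration; the abstract does not list the classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(speech_feat, text_feat, weights, bias):
    """Intermediate fusion: concatenate unimodal features, then classify.

    weights has one row per emotion; each row has len(speech_feat) + len(text_feat)
    entries. This single linear layer stands in for whatever fusion head is used.
    """
    fused = list(speech_feat) + list(text_feat)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Placeholder features standing in for the outputs of the speech and text models.
rng = random.Random(0)
speech_feat = [rng.uniform(-1, 1) for _ in range(8)]  # hypothetical size
text_feat = [rng.uniform(-1, 1) for _ in range(6)]    # hypothetical size

# Untrained, random classifier parameters (illustration only).
fused_dim = len(speech_feat) + len(text_feat)
weights = [[rng.uniform(-0.5, 0.5) for _ in range(fused_dim)] for _ in EMOTIONS]
bias = [0.0] * len(EMOTIONS)

probs = fuse_and_classify(speech_feat, text_feat, weights, bias)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because the weights are random, the printed label carries no meaning; the sketch only shows where the two modalities meet in the pipeline.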