ISSN 2071-8594

Russian Academy of Sciences


Gennady Osipov

A. A. Moskvin, A. G. Shishkin. Application of deep learning methods to recognize the emotional state of a person in a video image


In this paper, a model based on deep neural networks is developed and implemented that determines, in real time and with limited computing resources, the emotional state of a person from a video sequence containing both the voice signal of the person whose state is to be determined and a frontal view of his face. Visual information is represented by 16 consecutive frames of 96x96 pixels, and voice information by 140 characteristics for a sequence of 37 windows. Based on experimental studies, a model architecture using convolutional and recurrent neural networks was developed. For 7 classes corresponding to different emotional states (neutral state, anger, sadness, fear, joy, disappointment and surprise), the recognition accuracy is 59%. Studies have shown that using audio information together with visual information increases recognition accuracy by 12%. The created system is flexible with respect to the selection of parameters and the narrowing or expanding of the number of classes, and it can easily add, accumulate and use information from other external devices for further development and improvement of classification accuracy.
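The multimodal pipeline described above (a convolutional stream over 16 frames of 96x96 pixels, an audio stream of 140 features over 37 windows, recurrent aggregation, and a 7-class output) can be illustrated by a minimal NumPy sketch. This is purely schematic and not the authors' implementation: the average-pooling stand-in for the CNN, the hidden size, the random weights, and the concatenation-based fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Visual stream: 16 consecutive grayscale frames of 96x96 pixels.
frames = rng.standard_normal((16, 96, 96))

# Toy stand-in for the CNN: 8x8 average pooling per frame
# (96/8 = 12, so each frame becomes a 12x12 = 144-dim feature vector).
visual_feats = frames.reshape(16, 12, 8, 12, 8).mean(axis=(2, 4)).reshape(16, -1)

# Audio stream: 140 characteristics for each of 37 analysis windows.
audio_feats = rng.standard_normal((37, 140))

def rnn_last_state(seq, hidden=32, rng=rng):
    """Minimal vanilla RNN; returns the final hidden state of the sequence."""
    w_in = rng.standard_normal((seq.shape[1], hidden)) * 0.01
    w_h = rng.standard_normal((hidden, hidden)) * 0.01
    h = np.zeros(hidden)
    for x in seq:
        h = np.tanh(x @ w_in + h @ w_h)
    return h

# Fuse the two modality summaries and classify into the 7 emotion classes.
fused = np.concatenate([rnn_last_state(visual_feats), rnn_last_state(audio_feats)])
w_out = rng.standard_normal((fused.shape[0], 7)) * 0.01
probs = softmax(fused @ w_out)
print(probs.shape, round(float(probs.sum()), 6))  # (7,) 1.0
```

In a trained system the pooling step would be a learned convolutional network, the RNN weights would be fitted on labeled data, and fusion could be learned rather than a plain concatenation; the sketch only shows how the stated tensor shapes flow from the two modalities to a 7-way probability vector.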


Keywords: artificial neural networks, deep learning, emotion recognition, video, speech signal.

PP. 3-14.

DOI 10.14357/20718594190201

