Speech Emotion Recognition using Deep Learning | Matlab



This project presents speech emotion recognition from the speech signal based on feature analysis and a neural-network (NN) classifier. Automatic speech emotion recognition (SER) plays an important role in HCI systems for measuring people's emotions; the approach of linking expressions to a group of basic emotions (anger, disgust, fear, happiness, sadness, and surprise) has long dominated psychology. The recognition system involves speech emotion detection, feature extraction and selection, and finally classification. The selected features are those that distinguish the largest number of samples accurately, and an NN classifier based on discriminant analysis is used to classify the six expressions. The simulated results show that filter-based feature extraction with this classifier gives better accuracy at lower algorithmic complexity than other speech emotion recognition approaches.


  • Recognizing emotions is a key capability for building socially aware systems, and therefore an important part of human-computer interaction (HCI). Emotion recognition can play an important role in various fields such as healthcare (mood profiles), education (tutoring), and security and defense (surveillance). Speech emotion recognition (SER) has enormous potential given the ubiquity of speech-based devices. However, it is important that SER models generalize well across different conditions and settings, showing robust performance. Conventionally, emotion recognition systems are trained with supervised learning, and generalization is usually improved by training on a variety of samples with diverse labels. State-of-the-art models for standard computer vision tasks use thousands of labeled samples, and automatic speech recognition (ASR) systems are trained on several hundred hours of transcribed data. Labels for emotion recognition, by contrast, are collected through perceptual evaluations in which multiple raters annotate each sample by listening to or watching the stimulus. This evaluation procedure is cognitively intense and expensive. Therefore, standard benchmark datasets for SER contain a limited number of sentences with emotional labels, often collected from a limited number of evaluators. This limitation severely affects the generalization of the systems.
  • An alternative approach to increasing the generalization of the models is to build robust models. An effective way to achieve this goal is with a neural network trained with auxiliary tasks, where the auxiliary tasks correspond to the reconstruction of feature representations at various layers of the neural-network classifier.

System Analysis

Existing method

  • Principal Component Analysis
  • Geometric methods.
  • SVM classification


Drawbacks:

  • Low discriminatory power and high computational load
  • Geometric-based methods rely only on geometric features, such as distances between speech-signal measurements

Proposed method

The proposed system performs speech emotion recognition on transform-domain features through textural analysis and an NN classifier.


Advantages:

  • Robustness to noise and varying recording conditions
  • Low complexity
  • High discriminatory power

Block diagram:

(Block diagram of the proposed speech emotion recognition system)


Feature extraction obtains information from the given audio in the form of spectral features such as spectral flux, entropy, roll-off, centroid, spread, and energy entropy. These features are extracted from the input audio and used for comparison and classification of audio signals by the neural network.
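As a rough illustration of the spectral descriptors named above, the sketch below computes the centroid, spread, entropy, and roll-off of one windowed audio frame. This is only a minimal Python/NumPy sketch of the standard textbook definitions (the project itself is implemented in MATLAB); a real SER pipeline would window overlapping frames and aggregate statistics over many of them.

```python
import numpy as np

def spectral_features(frame, fs):
    """Basic spectral descriptors of a single audio frame (sketch only)."""
    mag = np.abs(np.fft.rfft(frame))              # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)   # bin center frequencies
    p = mag / (mag.sum() + 1e-12)                 # normalize as a distribution

    centroid = (freqs * p).sum()                            # spectral centroid
    spread = np.sqrt(((freqs - centroid) ** 2 * p).sum())   # spectral spread
    entropy = -(p * np.log2(p + 1e-12)).sum()               # spectral entropy
    rolloff = freqs[np.searchsorted(np.cumsum(p), 0.85)]    # 85% roll-off point
    return centroid, spread, entropy, rolloff

# Toy input: one 32 ms frame of a 440 Hz tone, Hann-windowed to limit leakage.
fs = 8000
t = np.arange(0, 0.032, 1 / fs)
frame = np.sin(2 * np.pi * 440 * t) * np.hanning(t.size)
centroid, spread, entropy, rolloff = spectral_features(frame, fs)
```

For a narrowband tone like this, the centroid lands near the tone frequency, while noisier or more "textured" signals raise the spread and entropy, which is what makes these features useful for discriminating emotional speech.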

Audio signal processing

Audio signal processing is a subfield of signal processing that is concerned with the electronic manipulation of audio signals. Audio signals are electronic representations of sound waves: longitudinal waves which travel through air, consisting of compressions and rarefactions. The energy contained in audio signals is typically measured in decibels.
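To make the decibel remark concrete, the snippet below measures the level of a digital signal in dB relative to full scale (dBFS), using the usual 20·log10 of the RMS amplitude. This is a generic illustration, not part of the project's MATLAB code; a full-scale sine wave sits at about -3 dBFS.

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)      # full-scale 440 Hz sine

rms = np.sqrt(np.mean(x ** 2))       # root-mean-square amplitude
level_db = 20 * np.log10(rms)        # level in dBFS, about -3.01
```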

Discrete Wavelet Transform (DWT)

The discrete wavelet transform (DWT) was developed to apply the wavelet transform to the digital world. Filter banks are used to approximate the behavior of the continuous wavelet transform. The signal is decomposed with a high-pass filter and a low-pass filter. The coefficients of these filters are computed using mathematical analysis and are made available to you. See Appendix B for more details about these computations.


LP d: Low Pass Decomposition Filter

HP d: High Pass Decomposition Filter

LP r: Low Pass Reconstruction Filter

HP r: High Pass Reconstruction Filter

The wavelet literature provides the filter coefficients in tables; the Daubechies wavelet filters are one example. These filters depend on a parameter p called the number of vanishing moments.



Types of Neural Networks:

  • Artificial Neural Network
  • Probabilistic Neural Networks
  • General Regression Neural Networks

DTREG implements the most widely used types of neural networks:

  a) Multilayer Perceptron Networks (also known as multilayer feed-forward networks),
  b) Cascade Correlation Neural Networks,
  c) Probabilistic Neural Networks (PNN),
  d) General Regression Neural Networks (GRNN).

Other types of neural networks include:

  1. Radial Basis Function Networks,
  2. Functional Link Networks,
  3. Kohonen Networks,
  4. Gram-Charlier Networks,
  5. Hebb Networks,
  6. Adaline Networks,
  7. Hybrid Networks.

The Multilayer Perceptron Neural Network Model

The following diagram illustrates a perceptron network with three layers:

(Diagram: a three-layer perceptron network)

This network has an input layer (on the left) with three neurons, one hidden layer (in the middle) with three neurons and an output layer (on the right) with three neurons.

There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used to represent the N categories of the variable.
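The 3-3-3 network described above can be sketched as a single forward pass: each layer multiplies its inputs by a weight matrix, adds a bias, and applies an activation. The sketch below is a minimal Python/NumPy illustration with random placeholder weights (not trained values, and not the project's MATLAB code); the softmax output assigns a probability to each of three classes, standing in for three emotion categories.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer shapes match the diagram: 3 inputs -> 3 hidden -> 3 outputs.
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)   # input -> hidden weights
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)   # hidden -> output weights

x = np.array([0.2, -1.0, 0.5])                  # one 3-feature input vector
h = sigmoid(x @ W1 + b1)                        # hidden-layer activations
logits = h @ W2 + b2                            # raw output scores
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the 3 classes
predicted = int(np.argmax(probs))               # index of the winning class
```

Training would adjust W1, b1, W2, b2 by backpropagation so that the winning class matches the labeled emotion; only the inference step is shown here.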

Hardware Requirements

  • Standard PC
  • 4 GB of RAM
  • 500 GB hard disk

Software Required

  • MATLAB 7.5 and above versions



