ISSN 2071-8594

Russian Academy of Sciences


Gennady Osipov

S. D. Protserov, A. G. Shishkin. Segmentation of Noisy Speech Signals


One of the most important problems in digital speech signal processing is determining which parts of an input acoustic signal contain speech and which contain background noise or silence. This problem arises in many practical applications, such as speech analysis in voice command systems, transmission of speech over a network, and automated speech recognition. However, most existing systems for automated speech analysis cannot solve this problem efficiently when the signal-to-noise ratio is too low. Moreover, their parameters have to be tuned separately for different noise levels, which prevents fully automated segmentation of noisy speech signals. In this paper we design a system for automated segmentation of speech signals distorted by additive noise of different types and intensities. The system is based on three different convolutional neural network models and can efficiently determine speech and silence segments in noisy signals across a wide range of noise intensities and noise types.
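For readers unfamiliar with the terms, the signal-to-noise ratio in decibels is conventionally defined as SNR = 10 log10(P_speech / P_noise), where P denotes mean signal power. The following is a minimal sketch of how additive noise can be scaled to reach a chosen SNR. It is an illustration only and is not taken from the paper: the function names, the 200 Hz tone standing in for speech, and the white-noise background are all assumptions made for this example.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, target_snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, noisy):
    """Measure the SNR of a mixture given the clean reference signal."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))


# Hypothetical toy data: a 200 Hz tone stands in for speech,
# Gaussian white noise stands in for the background.
rng = random.Random(0)
rate = 8000
speech = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(rate)]

noisy = mix_at_snr(speech, noise, target_snr_db=5.0)
measured = measure_snr_db(speech, noisy)
```

Lower values of `target_snr_db` give a noisier mixture; repeating the call with several values and several noise recordings yields signals covering a range of noise intensities and types.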


speech signal, convolutional neural network, segmentation, digital signal processing

PP. 75-85.

DOI 10.14357/20718594210107

