Using Context-based Statistical Models to Promote the Quality of Voice Conversion Systems

Document Type: Research Article


1 Corresponding Author, Department of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran (e-mail:

2 Department of Product and Services, Tamin Telecom Co. (3G mobile operator), Tehran, Iran (e-mail:


This article examines methods for improving the performance of GMM-based voice conversion systems, taking the GMM method as the baseline. Current methods use a single conversion function to convert all speech units, and the spectral smoothing that results from statistical averaging reduces the quality of the converted speech. In this paper, we first introduce the GMM2 method, in which several GMM models are used to model each phoneme. Furthermore, when matching the clusters of each state, we apply an LMR conversion before the Dynamic Time Warping algorithm to bring the parameters of the two speakers' corresponding states into closer correspondence. Another cause of quality reduction in voice conversion systems is that the speech signal parameters are not estimated precisely enough. To overcome this problem, the Generalized Harmonic Model is introduced in place of the sinusoidal harmonic model used in GMM2, giving another method called GMM3. Finally, we present the GMM4 method, whose objective is to improve system performance when only limited data and a restricted number of demi-syllables are available for training the conversion functions.
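
As a rough illustration of the alignment step mentioned above, the following is a minimal sketch of classic Dynamic Time Warping between two short sequences of feature frames. It is not the authors' implementation: the function name `dtw_path`, the Euclidean frame distance, and the toy one-dimensional frames are assumptions made for this example, and the LMR conversion that the abstract applies before alignment is not shown.

```python
import math


def dtw_path(source, target):
    """Align two sequences of feature frames with classic DTW.

    Each frame is a sequence of floats; all frames must share one dimension.
    Returns the total alignment cost and the list of aligned (i, j) index pairs.
    """
    n, m = len(source), len(target)

    # cost[i][j] is the cheapest cumulative cost of aligning the first i
    # source frames with the first j target frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between frames (an assumption of this sketch).
            d = math.dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # source advances alone
                cost[i][j - 1],      # target advances alone
                cost[i - 1][j - 1],  # both sequences advance together
            )

    # Backtrack from the last cell to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy example: the second source frame is stretched over two target frames.
total_cost, pairs = dtw_path([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
print(total_cost, pairs)
```

In a voice conversion setting, the aligned index pairs would typically supply the paired source and target frames on which a conversion function is trained. Real systems would use multidimensional spectral parameters rather than the scalar frames used here.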

