The performance of automatic speech recognition (ASR) systems is adversely affected by variations in speakers, audio channels, and environmental conditions. Making these systems robust to such variations remains a major challenge. One of the main sources of inter-speaker variation is the difference in Vocal Tract Length (VTL). Vocal Tract Length Normalization (VTLN) is an effective method for coping with this variation. In this method, the speech spectrum of each speaker is frequency-warped according to a warping factor specific to that speaker. In this paper, we first implemented the common search-based method for obtaining the appropriate warping factor in an HMM-based Persian continuous speech recognition system. Then, in view of the computational cost of the search-based method, we proposed a linear regression process that estimates the warping factor from the scores generated by our gender detection system. Experimental results on a Persian conversational speech database show an improvement of about 0.54 percent in word recognition accuracy, as well as a significant reduction in the computational cost of estimating the warping factor, compared to the search-based approach.
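As a rough illustration of the two estimation strategies described above, the following Python sketch contrasts a grid search over candidate warping factors with a linear regression from a gender detection score. It is not the authors' implementation. The piecewise-linear warping function, the 0.88 to 1.12 range, the 0.85 cutoff ratio, and all function names are assumptions chosen for illustration, and `score_fn` is a hypothetical stand-in for the recognizer's likelihood of an utterance warped with a given factor.

```python
def warp_frequency(f, alpha, f_max=8000.0, cut_ratio=0.85):
    """Piecewise-linear warp (assumed form): scale by alpha below a cutoff,
    then interpolate linearly so that f_max still maps to f_max."""
    f_cut = cut_ratio * f_max
    if f <= f_cut:
        return alpha * f
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (f - f_cut)


def search_warp_factor(score_fn, candidates):
    """Search-based estimate: evaluate every candidate factor and keep the
    best-scoring one. This costs one scoring pass per candidate."""
    return max(candidates, key=score_fn)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_warp_factor(gender_score, slope, intercept, lo=0.88, hi=1.12):
    """Regression-based estimate: a single evaluation of the fitted line,
    clipped to the assumed valid range of warping factors."""
    return min(hi, max(lo, slope * gender_score + intercept))


# Hypothetical candidate grid: 0.88, 0.90, ..., 1.12
candidates = [0.88 + 0.02 * i for i in range(13)]
```

In this sketch, the regression line would be fitted once on training speakers, pairing each speaker's gender detection score with the warping factor found by the search. At test time `predict_warp_factor` replaces the per-candidate scoring passes with a single arithmetic evaluation, which is the source of the cost reduction the abstract reports.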
Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.