[Information Technology] [2015.10] Model-Based Approaches to Robust Speech Recognition in Diverse Environments



This is a PhD thesis from Darwin College, University of Cambridge, UK (author: Yongqiang Wang), 231 pages in total.





Model-based approaches are a powerful and flexible framework for robust speech recognition. This framework has been extensively investigated during the past decades and has been extended in a number of ways to handle distortions caused by various acoustic factors, including speaker differences, channel distortions and environment noise. This thesis investigated model-based approaches to robust speech recognition in diverse conditions and proposed two extensions to this framework.

Many speech recognition applications will benefit from distant-talking speech capture. This avoids problems caused by using hand-held or body-worn equipment. However, due to the large speaker-to-microphone distance, both background noise and reverberant noise will significantly corrupt speech signals and negatively impact speech recognition accuracies. This work will propose a new model-based scheme for those applications in which only a single distant microphone is available. To compensate for the influence of previous speech frames on the current speech frame in reverberant environments, extended statistics are appended to the standard acoustic model to represent the distribution of a window of contextual clean feature vectors at the Gaussian component level. Given these statistics and the reverberant noise model parameters, the standard Vector Taylor series (VTS) expansion is extended to compensate the acoustic model parameters for the effect of reverberation and background noise. A maximum likelihood (ML) estimation algorithm is also developed to estimate the reverberant noise model parameters. Adaptive training of acoustic model parameters on data recorded in multiple reverberant environments is also proposed. This allows a consistent ML framework to estimate both the reverberant noise parameters and the acoustic model parameters. Experiments are performed on an artificially corrupted corpus and a corpus recorded in real reverberant environments. It is observed that the proposed model-based schemes significantly improve the model robustness to reverberation for both clean-trained and adaptively-trained acoustic models.

As the speech signals are usually affected by multiple acoustic factors simultaneously, another important aspect in the model-based framework is the ability to adapt canonical models to the target acoustic condition with multiple acoustic factors in a flexible manner. An acoustic factorisation framework has been proposed to factorise the variability caused by different acoustic factors. This is achieved by associating each acoustic factor with a distinct factor transform. In this way, it enables factorised adaptation, which gives extra flexibility for model-based approaches. The second part of this thesis proposes several extensions to acoustic factorisation. It is first established that the key to acoustic factorisation is to keep the factor transforms independent of each other. Several approaches are discussed to construct such independent factor transforms. The first one is the widely used data constrained approach, which solely relies on the adaptation data to achieve the independence attribute. The second, transform constrained approach utilises partial knowledge of how acoustic factors affect the speech signals and relies on different forms of transforms to achieve factorisation. Based on a mathematical analysis of the dependence between ML estimated factor transforms, the third approach explicitly enforces the independence constraint, thus it is not relying on balanced data or particular forms of transforms. The transform constrained and the explicit independence constrained factorisation approaches are applied to the speaker and noise factorisation for speech recognition, yielding two flexible model-based schemes which can use the speaker transforms estimated in one noise condition in other unseen noise conditions. Experimental results on artificially corrupted corpora demonstrate the flexibility of these schemes and also illustrate the importance of the independence attribute to factorisation.
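
For readers unfamiliar with the "standard VTS expansion" that the abstract takes as its starting point, the following is a minimal sketch of first-order VTS compensation for a single diagonal-covariance Gaussian. It is not the thesis's reverberant extension. It assumes the commonly used log-spectral mismatch function y = x + h + log(1 + exp(n - x - h)) for additive noise n and a fixed channel offset h, and it omits the cepstral (DCT) transform and dynamic features for brevity. The function name `vts_compensate` and its arguments are hypothetical, chosen only for illustration.

```python
import math

def vts_compensate(mu_x, var_x, mu_n, var_n, mu_h):
    """First-order VTS compensation of one diagonal Gaussian.

    Assumed per-dimension mismatch function (log-spectral domain):
        y = x + h + log(1 + exp(n - x - h))
    mu_x, var_x: clean-speech mean and variance
    mu_n, var_n: additive-noise mean and variance
    mu_h:        channel offset, treated as deterministic
    """
    mu_y, var_y = [], []
    for mx, vx, mn, vn, mh in zip(mu_x, var_x, mu_n, var_n, mu_h):
        d = mn - mx - mh  # noise level relative to channel-shifted speech
        # Numerically stable log(1 + exp(d))
        softplus = max(d, 0.0) + math.log1p(math.exp(-abs(d)))
        # g = dy/dx = 1 / (1 + exp(d)) at the expansion point, computed stably
        if d >= 0.0:
            e = math.exp(-d)
            g = e / (1.0 + e)
        else:
            g = 1.0 / (1.0 + math.exp(d))
        # Zeroth-order term gives the compensated mean
        mu_y.append(mx + mh + softplus)
        # First-order term: dy/dx = g, dy/dn = 1 - g
        var_y.append(g * g * vx + (1.0 - g) ** 2 * vn)
    return mu_y, var_y

# Usage: speech and noise at the same level in one dimension
mu_y, var_y = vts_compensate([1.0], [2.0], [1.0], [4.0], [0.0])
```

When the noise mean is far below the speech mean, the compensated Gaussian stays close to the clean one. When the noise dominates, it approaches the noise Gaussian. The reverberant scheme summarised in the abstract additionally conditions on a window of preceding clean frames, which this sketch does not attempt to model.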

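  1. Introduction
  2. Speech Recognition Systems
  3. Acoustic Model Adaptation and Robustness
  4. Robustness in Reverberant Environments
  5. Acoustic Factorisation Framework
  6. Speaker and Noise Factorisation
  7. Experiments on Reverberant Robustness
  8. Experiments on Acoustic Factorisation
  9. Conclusion
    Appendix A Derivation of the Mismatch Functions
    Appendix B Implications of the Maximum Assumption
    Appendix C Maximum Likelihood Estimation for fCAT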


