Some state-of-the-art automatic speech processing methods and beyond


Chao Zhang


2020-12-03 09:30:00 ~ 2020-12-03 11:00:00


Zoom online meeting (Zoom ID: 995 199 56603, Password: 282267)


Ying Wen


Since the resurgence of artificial neural networks for speech recognition in 2011, deep learning has reshaped almost every field of automatic speech processing. This talk covers a set of our recent work spanning several such fields, in which the proposed methods are all deep-learning-based and achieve state-of-the-art performance. First, an approach is introduced that integrates the traditional noisy-source-channel-model-based method with the more recent attention-based sequence-to-sequence method for automatic speech recognition (ASR). Next, when applying ASR to a dialogue or a meeting, the task of speaker diarisation is often necessary: finding "who spoke when" in a long audio stream with multiple active speakers. To this end, a discriminative neural clustering approach is presented that performs supervised clustering with a Transformer model. Further, for understanding a dialogue or meeting, emotion recognition is often important, and we introduce a novel model structure targeting the fusion of time-synchronous and time-asynchronous multimodal feature representations for this purpose. On the generation side, a speech synthesis method is discussed that leverages cross-utterance text information to improve prosody modelling. Finally, both to deliver a better understanding of how the human brain achieves robust speech recognition and to obtain insights for improving future ASR systems, an approach that creates bidirectional connections between artificial and brain neural networks is demonstrated, extending the scope of automatic speech technologies.


Chao Zhang received his B.E. and M.S. degrees from the Department of Computer Science and Technology at Tsinghua University in 2009 and 2012, and his Ph.D. degree in Information Engineering under the direction of Prof. Phil Woodland (FREng) from the Cambridge University Engineering Department in 2017. He is currently a Research Associate at Cambridge University and an advisor at JD.com in speech and language processing. Chao is a co-author of the HTK speech recognition toolkit and developed its C-based generic deep learning modules and Python-based pipelines. As a key member of the Cambridge speech team, Chao won a series of international speech recognition project evaluations and challenges, including IARPA Babel 2013, DARPA BOLT 2014, and the ASRU 2015 MGB Challenge, often building the most important systems. Chao has published more than 50 papers in speech conferences and journals and has received multiple paper awards, including best student paper awards at NCMMSC 2011, ICASSP 2014, and ASRU 2019, and a best paper nomination at ASRU 2015. Besides various speech processing tasks, his recent work also covers text generation, emotion recognition, multimodal intelligence, recommendation systems, large-scale optimisation, and brain science.

© John Hopcroft Center for Computer Science, Shanghai Jiao Tong University

Email: jhc@sjtu.edu.cn  Tel: 021-54740299