Bo Li
Research Areas
Authored Publications
Sort By
Preview abstract
Speech data from different domains has distinct acoustic and linguistic characteristics. It is common to train a single multidomain model such as a Conformer transducer for speech recognition on a mixture of data from all domains. However, changing data in one domain or adding a new domain would require the multidomain model to be retrained. To this end, we propose a framework called modular domain adaptation (MDA) that enables a single model to process multidomain data while keeping all parameters domain-specific, i.e., each parameter is only trained by data from one domain. On a streaming Conformer transducer trained only on video caption data, experimental results show that an MDA-based model can reach similar performance as the multidomain model on other domains such as voice search and dictation by adding per-domain adapters and per-domain feed-forward networks in the Conformer encoder.
View details
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems
Chao Zhang
IEEE Spoken Language Technology Workshop (2022)
Preview abstract
Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. This EP model strongly affects latency, but is subject to computational constraints, which limits prediction accuracy. We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This allows flexibility during inference to produce a low-cost prediction or a higher quality prediction if ASR computation is ongoing. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 130ms (33.3% reduction), and 90th percentile latency by 160ms (22.2% reduction), without regressing word-error rate. For continuous recognition, WER improves by 10.6% (relative).
View details
Preview abstract
Multilingual end-to-end automatic speech recognition models are attractive due to its simplicity in training and deployment. Recent work on large-scale training of such models has shown promising results compared to monolingual models. However, the work often focuses on the structure of multilingual models themselves in a single-pass decoding setup. In this work, we investigate second-pass deliberation for multilingual speech recognition. Our proposed deliberation is multilingual, i.e., the text encoder encodes hypothesis text from multiple languages, and the deliberation decoder attends to encoded text and audio from multiple languages without explicitly using language information. We investigate scaling different components of the multilingual deliberation model, such as the text encoder and deliberation decoder, and also compare scaling the second-pass deliberation decoder and the first-pass cascaded encoder. We show that deliberation improves the average WER on 9 languages by 4% relative compared to the single-pass model in a truly multilingual setup. By increasing the size of the deliberation model up to 1B parameters, the average WER improvement increases to 9%, with up to 14% for certain languages.
View details
Joint Unsupervised and Supervised Training for Multilingual ASR
Yu Zhang
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2022), pp. 6402-6406
Preview abstract
Self-supervised training has been showing promising gains in pretraining models and facilitating the downstream finetuning for speech recognition. Effective self-supervised losses designed for large-scale unlabeled data can help learn the useful latent structures. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. However, the pretrained checkpoint selection is known to be tricky and tedious, and pure finetuning can cause catastrophic forgetting of the learnt representations. To address these concerns, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We apply our method to a challenging multilingual automatic speech recognition (ASR) task and validate its performance on the public dataset \textit{Multilingual LibriSpeech} (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art (SOTA) methods by 10\%, and beat the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER of all languages outperforms monolingual baselines by 33.3\%, and the state-of-the-art 2-stage XLSR by 32\%. On low-resource language like Polish, our WER is less than half of the monolingual WER baseline and even beats the supervised transfer learning method using external supervision.
View details
Preview abstract
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models, and pre-training on unsupervised data can help improve speech recognition quality for many African languages.
View details
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
Rami Botros
Ruoming Pang
David Johannes Rybach
James Qin
Quoc-Nam Le-The
Anmol Gulati
Cal Peyser
Chung-Cheng Chiu
Emmanuel Guzman
Jiahui Yu
Qiao Liang
Wei Li
Yu Zhang
Interspeech (2021) (to appear)
Preview abstract
On-device end-to-end (E2E) models have shown improvementsover a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. Furthermore, to further improve decoder latency we introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
View details
Learning Word-Level Confidence for Subword End-to-End ASR
David Qiu
Yu Zhang
Liangliang Cao
Deepti Bhatia
Wei Li
Ke Hu
ICASSP (2021)
Preview abstract
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). Although prior works have proposed training auxiliary confidence models for ASR systems, they do not extend naturally to systems that operate on word-pieces (WP) as their vocabulary. In particular, ground truth WP correctness labels are needed for training confidence models, but the non-unique tokenization from word to WP causes inaccurate labels to be generated. This paper proposes and studies two confidence models of increasing complexity to solve this problem. The final model uses self-attention to directly learn word-level confidence without needing subword tokenization, and exploits full context features from multiple hypotheses to improve confidence accuracy. Experiments on Voice Search and long-tail test sets show standard metrics (e.g., NCE, AUC, RMSE) improving substantially. The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
View details
Confidence Estimation for Attention-based Sequence-to-Sequence Models for Speech Recognition
David Qiu
Yu Zhang
Philip C. Woodland
Liangliang Cao
ICASSP (2021)
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
Jiahui Yu
Wei Han
Anmol Gulati
Chung-Cheng Chiu
Ruoming Pang
ICLR 2021
Preview abstract
Streaming automatic speech recognition (ASR) aims at emitting each recognized word shortly as they are spoken, while full-context ASR encodes an entire speech sequence before decoding texts. In this work, we propose a unified framework, Universal ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. More importantly, we show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation. Universal ASR framework is network-agnostic, and can be applied to recent state-of-the-art convolution-based and transformer-based end-to-end ASR networks. We present extensive experiments on both research dataset LibriSpeech and mega-scale internal dataset MultiDomain with two state-of-the-art ASR networks ContextNet and Conformer. Experiments and ablation studies demonstrate that Universal ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR.
View details
FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
Jiahui Yu
Chung-Cheng Chiu
Wei Han
Anmol Gulati
Ruoming Pang
ICASSP 2021
Preview abstract
Streaming automatic speech recognition (ASR) aims to output each hypothesized word as quickly and accurately as possible. However, reducing latency while retaining accuracy is highly challenging. Existing approaches including Early and Late Penalties~\cite{li2020towards} and Constrained Alignment~\cite{sainath2020emitting} penalize emission delay by manipulating per-token or per-frame RNN-T output logits. While being successful in reducing latency, these approaches lead to significant accuracy degradation. In this work, we propose a sequence-level emission regularization technique, named FastEmit, that applies emission latency regularization directly on the transducer forward-backward probabilities. We demonstrate that FastEmit is more suitable to the sequence-level transducer~\cite{Graves12} training objective for streaming ASR networks. We apply FastEmit on various end-to-end (E2E) ASR networks including RNN-Transducer~\cite{Ryan19}, Transformer-Transducer~\cite{zhang2020transformer}, ConvNet-Transducer~\cite{han2020contextnet} and Conformer-Transducer~\cite{gulati2020conformer}, and achieve 150-300ms latency reduction over previous art without accuracy degradation on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, meanwhile reduces 90th percentile latency from 210 ms to only 30 ms on LibriSpeech.
View details