Liangliang Cao
Scientist at Apple Inc.
llcao[at]apple.com
[LinkedIn], [Google Scholar], [DBLP], [arXiv]

I am a principal scientist at Apple in Cupertino, California. Previously I worked as a scientist/engineer at Google, Yahoo!, and IBM, as well as an adjunct associate professor at Columbia University and UMass. Before that, I was a Ph.D. student at UIUC, a master's student at CUHK, and an undergraduate at USTC. During my Ph.D. study, I interned at Kodak, Microsoft, and NEC Labs. I feel very fortunate to have learned from many fantastic colleagues and mentors at these companies and universities.
I have extensive experience integrating cutting-edge research into products. I am a recipient of the ACM SIGMM Rising Star Award, and I won 1st place in the ImageNet LSVRC Challenge in 2010. In 2016, I co-founded a startup named Switi Inc and served as its CTO. After the startup was acquired, I worked as the tech lead for Google Cloud speech modeling and then for Cloud vision modeling. I also helped Google Cloud win one of the largest contracts in the history of Cloud AI.
Here is my (outdated) CV.
Essays
- Can content generation AI become the next Web search?
- Peering into the future of speech and visual recognition
- Two ways of iterating AI systems
- Memory of my Ph.D. Advisor Prof. Thomas Huang
Recent Papers
Most of my recent papers are available on arXiv. If you are looking for papers published before 2019, see here.
On vision-language models and 3D vision
- "RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture" [arXiv], [demo]
- "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness" [arXiv]
- "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens" [arXiv]
- "Exploiting Category Names for Few-Shot Classification with Vision-Language Models" [arXiv]
On speech foundation models
- "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition", IEEE Journal of Selected Topics in Signal Processing 2022. [arXiv]
- "Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition" [arXiv]
- "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition", ICASSP'22 [arXiv]
- "Residual Energy-Based Models for End-to-End Speech Recognition", INTERSPEECH'21 [arXiv]
- "Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction", INTERSPEECH'21 [arXiv]
- "Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition", ICASSP'21 [arXiv]
- "Non-Streaming Model Distillation on Unsupervised Data", ICASSP'21 [arXiv]
- "Targeted Universal Adversarial Perturbations", INTERSPEECH'21 [arXiv]
- "Bridging the Gap Between Streaming and Non-Streaming ASR", INTERSPEECH'21 [arXiv]
- "RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions", SLT'21 [arXiv]
- "Learning Word-Level Confidence For Subword End-to-End ASR", ICASSP'21 [arXiv]
- "Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models", ICASSP'20 [paper][dataset]