Liangliang Cao
Scientist at Apple Inc.
[LinkedIn], [Google Scholar], [DBLP], [arXiv]
I am a principal scientist at Apple in Cupertino, California. Previously, I worked as a scientist/engineer at Google, Yahoo!, and IBM, and as an adjunct associate professor at Columbia University and UMass. Before that, I studied at UIUC as a Ph.D. student, at CUHK as a master's student, and at USTC as an undergraduate. During my Ph.D. studies, I interned at Kodak, Microsoft, and NEC Labs. I feel very fortunate to have learned from many fantastic colleagues and mentors at these companies and universities.
I have extensive experience integrating cutting-edge research into products. I received the ACM SIGMM Rising Star Award and won 1st place in the ImageNet LSVRC Challenge in 2010. In 2016, I co-founded a startup named Switi Inc. and served as its CTO. After the startup was acquired, I worked as the tech lead for Google Cloud speech modeling and then as the tech lead for Cloud vision modeling. I also helped Google Cloud win one of the largest contracts in the history of Cloud AI.
Here is my (outdated) CV.
- Can content generation AI become the next Web search?
- Peering into the future of speech and visual recognition
- Two ways of iterating AI systems
- Memory of my Ph.D. Advisor Prof. Thomas Huang
Most of my recent papers are available on arXiv. If you are looking for papers published before 2019, see here.
On vision-language models and 3D vision
- "RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture" [arXiv], [demo]
- "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness" [arXiv]
- "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens" [arXiv]
- "Exploiting Category Names for Few-Shot Classification with Vision-Language Models" [arXiv]
On speech foundation models
- "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition", IEEE Journal of Selected Topics in Signal Processing 2022. [arXiv]
- "Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition" [arXiv]
- "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition", ICASSP'22 [arXiv]
- "Residual Energy-Based Models for End-to-End Speech Recognition", INTERSPEECH'21 [arXiv]
- "Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction", INTERSPEECH'21 [arXiv]
- "Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition", ICASSP'21 [arXiv]
- "Non-Streaming Model Distillation On Unsupervised Data", ICASSP'21 [arXiv]
- "Targeted Universal Adversarial Perturbations", INTERSPEECH'21 [arXiv]
- "Bridging the gap between streaming and non-streaming ASR", INTERSPEECH'21 [arXiv]
- "RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions", SLT'21 [arXiv]
- "Learning Word-Level Confidence For Subword End-to-End ASR", ICASSP'21 [arXiv]
- "Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models", ICASSP'20 [paper][dataset]