Liangliang Cao
Scientist at Apple Inc.
llcao[at]apple.com
[LinkedIn], [Google Scholar], [DBLP], [arXiv]

I am a principal scientist at Apple in Cupertino, California. Previously I worked as a scientist/engineer at Google, Yahoo!, and IBM, as well as an adjunct associate professor at Columbia University and UMass. Before that, I was a Ph.D. student at UIUC, a master's student at CUHK, and an undergraduate at USTC. During my Ph.D. studies, I interned at Kodak, Microsoft, and NEC Labs. I feel very fortunate to have learned from many fantastic colleagues and mentors at these companies and universities.
I am a recipient of the ACM SIGMM Rising Star Award, and I won 1st place in the ImageNet LSVRC Challenge in 2010. In 2016, I co-founded a startup named Switi Inc and served as its CTO. After the startup was acquired, I worked as the tech lead for Google Cloud speech modeling and then for Cloud vision modeling. In 2019, I helped Google Cloud win one of the largest contracts in the history of Cloud AI.
I am currently on the editorial board of IEEE TPAMI, as well as a regular reviewer of several CV/ML/Multimedia conferences. Here is my (outdated) CV.
News
- A new essay: The Struggles of New Bing: Insights for AI Products in the GenAI Era.
- The ImageNet Adversarial Text Regions (ImageNet-Atr) dataset is available. It is similar to the ImageNet eval set but challenging for typical CLIP models. For example, the Open-CLIP B-16 model trained on the LAION dataset achieves a top-1 zero-shot accuracy of only 29.4%.
Recent Projects
Instruction Tuning
- "Instruction-Following Speech Recognition", 2023 [arXiv]
On vision-language models and 3D vision
I used to be the Tech Lead at Google Cloud Vision and launched state-of-the-art vision-language models to enterprise customers.
- "RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture", ACM Multimedia 2023 [arXiv], [demo]
- "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness" [arXiv], [dataset]
- "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens" [arXiv]
- "Exploiting Category Names for Few-Shot Classification with Vision-Language Models" [arXiv]
On speech foundation models
I used to lead the Google Cloud Speech Modeling team and launched 10+ end-to-end ASR models to production.
- "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition", IEEE Journal of Selected Topics in Signal Processing 2022 [arXiv]
- "Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition" [arXiv]
- "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition", ICASSP'22 [arXiv]
- "Residual Energy-Based Models for End-to-End Speech Recognition", INTERSPEECH'21 [arXiv]
- "Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction", INTERSPEECH'21 [arXiv]
- "Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition", ICASSP'21 [arXiv]
- "Non-Streaming Model Distillation On Unsupervised Data", ICASSP'21 [arXiv]
- "Targeted Universal Adversarial Perturbations", INTERSPEECH'21 [arXiv]
- "Bridging the Gap Between Streaming and Non-Streaming ASR", INTERSPEECH'21 [arXiv]
- "RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions", SLT'21 [arXiv]
- "Learning Word-Level Confidence For Subword End-to-End ASR", ICASSP'21 [arXiv]
- "Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models", ICASSP'20 [paper][dataset]