Liangliang Cao

Scientist at Apple Inc.
Personal email: llcao[at]
[LinkedIn], [Google Scholar], [DBLP], [arXiv]



I am a principal scientist at Apple in Cupertino, California. Previously I worked as a scientist/engineer at Google, Yahoo!, and IBM, and as an adjunct associate professor at Columbia University and UMass. Before that, I studied at UIUC as a Ph.D. student, at CUHK as a master's student, and at USTC as a bachelor's student. During my Ph.D. studies, I interned at Kodak, Microsoft, and NEC Labs. I feel very fortunate to have learned from many fantastic colleagues and mentors at these companies and universities.

I was a recipient of the ACM SIGMM Rising Star Award. I won 1st place in the ImageNet LSVRC Challenge in 2010. In 2016, I co-founded a startup named Switi Inc and worked as the CTO. After the startup was acquired, I worked as the tech lead for Google Cloud speech modeling and then the tech lead for Cloud vision modeling. In 2019, I helped Google Cloud win one of the largest contracts in the history of Cloud AI.

I am currently on the editorial board of IEEE TPAMI, and I am a regular reviewer for several CV/ML/Multimedia conferences. Here is my CV (updated June 2024).


  • Apple Intelligence was announced at WWDC'24! It was a great experience to serve as a modeling lead and engineering lead supporting a number of AI features.
  • The new paper and code for "Ferret: Refer and Ground Anything Anywhere at Any Granularity" are available.
  • The ImageNet Adversarial Text Regions (ImageNet-Atr) dataset is available. It is similar to the ImageNet eval set but challenging for typical CLIP models. For example, the Open-CLIP B-16 model trained on the LAION dataset reaches a top-1 zero-shot accuracy of 29.4% on it.

Recent Essays

Recent Papers

Instruction tuning

  • "Instruction-Following Speech Recognition", 2023 [arXiv]

Vision-language models and 3D genAI

I used to be the Tech Lead at Google Cloud Vision, where I launched state-of-the-art vision-language models for enterprise customers.
  • "Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day", ICLR 2024 [arXiv]
  • "Ferret: Refer and Ground Anything Anywhere at Any Granularity", ICLR 2024 [arXiv, code]
  • "RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture", ACM Multimedia 2023 [arXiv], [demo]
  • "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness" [arXiv], [dataset]
  • "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens", EMNLP 2023 [arXiv]
  • "Exploiting Category Names for Few-Shot Classification with Vision-Language Models" [arXiv]

Speech foundation models

I used to lead the Google Cloud Speech Modeling team and launched 10+ end-to-end ASR models to production.
  • "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition", IEEE Journal of Selected Topics in Signal Processing 2022 [arXiv]
  • "Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition" [arXiv]
  • "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition", ICASSP'22 [arXiv]
  • "Residual Energy-Based Models for End-to-End Speech Recognition", INTERSPEECH'21 [arXiv]
  • "Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction", INTERSPEECH'21 [arXiv]
  • "Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition", ICASSP'21 [arXiv]
  • "Non-Streaming Model Distillation On Unsupervised Data", ICASSP'21 [arXiv]
  • "Targeted Universal Adversarial Perturbations", INTERSPEECH'21 [arXiv]
  • "Bridging the Gap between Streaming and Non-streaming ASR", INTERSPEECH'21 [arXiv]
  • "RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions", SLT'21 [arXiv]
  • "Learning Word-Level Confidence For Subword End-to-End ASR", ICASSP'21 [arXiv]
  • "Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models", ICASSP'20 [paper][dataset]