Liangliang Cao

Scientist at Apple Inc.
Personal email: llcao[at]
[LinkedIn], [Google Scholar], [DBLP], [arXiv]


I am a principal scientist at Apple in Cupertino, California. Previously, I worked as a scientist/engineer at Google, Yahoo!, and IBM, as well as an adjunct associate professor at Columbia University and UMass. Before that, I studied at UIUC as a Ph.D. student, at CUHK as a master's student, and at USTC as an undergraduate. During my Ph.D. studies, I interned at Kodak, Microsoft, and NEC Labs. I feel very fortunate to have learned from many fantastic colleagues and mentors at these companies and universities.

I was a recipient of the ACM SIGMM Rising Star Award and won 1st place in the ImageNet LSVRC Challenge in 2010. In 2016, I co-founded a startup named Switi Inc. and served as its CTO. After the startup was acquired, I worked as the tech lead for Google Cloud speech modeling and then for Cloud vision modeling. In 2019, I helped Google Cloud win one of the largest contracts in the history of Cloud AI.

I am currently on the editorial board of IEEE TPAMI, as well as a regular reviewer of several CV/ML/Multimedia conferences. Here is my (outdated) CV.


Recent Projects

Instruction tuning

  • "Instruction-Following Speech Recognition", 2023 [arXiv]

Vision-language models and 3D genAI

I used to be the Tech Lead at Google Cloud Vision, where I launched state-of-the-art vision-language models for enterprise customers.
  • "Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day" [arXiv]
  • "Ferret: Refer and Ground Anything Anywhere at Any Granularity" [arXiv, code]
  • "RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture", ACM Multimedia 2023 [arXiv], [demo]
  • "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness" [arXiv], [dataset]
  • "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens" [arXiv]
  • "Exploiting Category Names for Few-Shot Classification with Vision-Language Models" [arXiv]

Speech foundation models

I used to lead the Google Cloud Speech Modeling team and launched 10+ end-to-end ASR models to production.
  • "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition", IEEE Journal of Selected Topics in Signal Processing 2022. [arXiv]
  • "Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition" [arXiv]
  • "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition", ICASSP'22 [arXiv]
  • "Residual Energy-Based Models for End-to-End Speech Recognition", INTERSPEECH'21 [arXiv]
  • "Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction", INTERSPEECH'21 [arXiv]
  • "Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition", ICASSP'21 [arXiv]
  • "Non-Streaming Model Distillation on Unsupervised Data", ICASSP'21 [arXiv]
  • "Targeted Universal Adversarial Perturbations", INTERSPEECH'21 [arXiv]
  • "Bridging the gap between streaming and non-streaming ASR", INTERSPEECH'21 [arXiv]
  • "RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions", SLT'21 [arXiv]
  • "Learning Word-Level Confidence For Subword End-to-End ASR", ICASSP'21 [arXiv]
  • "Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models", ICASSP'20 [paper][dataset]