Liangliang Cao
Scientist at Apple Inc. | llcao[at]apple.com
[LinkedIn], [Google Scholar], [DBLP], [arXiv]

I am a principal scientist at Apple in Cupertino, California. Previously I worked as a scientist/engineer at Google, Yahoo!, and IBM, as well as an adjunct associate professor at Columbia University and UMass. Before that, I studied at UIUC as a Ph.D. student, at CUHK as a master's student, and at USTC as a bachelor's student. During my Ph.D. study, I interned at Kodak, Microsoft, and NEC Labs. I feel very fortunate to have learned from many fantastic colleagues and mentors at these companies and universities.
I have extensive experience integrating cutting-edge research into products. I was a recipient of the ACM SIGMM Rising Star Award and won first place in the ImageNet LSVRC Challenge in 2010. In 2016, I co-founded a startup named Switi Inc and served as its CTO. After the startup was acquired, I worked as the tech lead for Google Cloud speech modeling and then for Cloud vision modeling. I also helped Google Cloud win one of the largest contracts in the history of Cloud AI.
Here is my (outdated) CV.
Essays
- Can content generation AI become the next Web search?
- Peering into the future of speech and visual recognition
- Two ways of iterating AI systems
- Memory of my Ph.D. Advisor Prof. Thomas Huang
Papers
Most of my recent papers are available on arXiv. If you are looking for papers published before 2019, see here.
- Preprint: "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens" [arXiv]
- Preprint: "Exploiting Category Names for Few-Shot Classification with Vision-Language Models" [arXiv]
- IEEE JSTSP'22: "BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition", IEEE Journal of Selected Topics in Signal Processing. [arXiv]
- Preprint: "Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition" [arXiv]
- ICASSP'22: "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition" [arXiv]
- INTERSPEECH'21: "Residual Energy-Based Models for End-to-End Speech Recognition" [arXiv]
- INTERSPEECH'21: "Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction" [arXiv]
- ICASSP'21: "Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition" [arXiv]
- ICASSP'21: "Non-Streaming Model Distillation On Unsupervised Data" [arXiv]
- INTERSPEECH'21: "Targeted Universal Adversarial Perturbations" [arXiv]
- INTERSPEECH'21: "Bridging the Gap Between Streaming and Non-Streaming ASR" [arXiv]
- SLT'21: "RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions" [arXiv]
- ICASSP'21 "Learning Word-Level Confidence For Subword End-to-End ASR" [arXiv]
- ECCV'20: Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions [paper]
- MICCAI'20: Deep Active Learning for Effective Pulmonary Nodule Detection [paper]
- ICASSP'20 "Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models" [paper][dataset]
- MICCAI'19 "3DFPN-HS2: 3D Feature Pyramid Network Based High Sensitivity and Specificity Pulmonary Nodule Detection" [paper]
- CVPR'19 "Automatic adaptation of object detectors to new domains using self-training" [code] [paper] [project]
- TPAMI'19 "Focal Visual-Text Attention for Memex Question Answering" [code and dataset] [paper]