Vision-language models (VLMs) offer a promising path for large-scale microscopy image analysis: linking observations to emerging disease mechanisms and identifying novel biomarkers, ultimately advancing scientific discovery and precision health. However, the lack of diverse benchmarks across microscopy modalities, scales, and states limits systematic evaluation and robust model development toward this goal.

Introducing Micro-Bench, our first effort to address this limitation. We use Micro-Bench to evaluate both embedding and auto-regressive VLMs and find:

😔 All models have high error rates (including GPT-4).
🫠 Specialized fine-tuning erodes previously encoded biomedical knowledge (yup, base CLIP knows some biology).

We use these insights to revisit a simple technique that makes biomedical fine-tuned models robust (no extra training involved!). A rough sketch of the idea is at the end of this post.

Come to our poster session @ NeurIPS if you'd like to discuss more:
🔬 https://lnkd.in/g8VsJTiJ
🦠 West Ballroom A-D, #5405 (11:00 am)

Thanks to all the amazing people behind this project: Jeff Nirschl, James Burgess, Sanket Gupte, Yuhui Zhang, Alyssa Unell, Serena Yeung
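The post doesn't spell out the technique, so here is a hedged sketch of one well-known recipe that matches the "no extra training" description: weight-space interpolation between a base checkpoint and its fine-tuned counterpart (as in WiSE-FT, Wortsman et al., 2022). The function name, the tiny stand-in models, and the alpha value are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch, NOT the paper's implementation: linearly mix the
# weights of a base model and its fine-tuned counterpart. No gradient
# steps are involved, which is what "no extra training" typically means.
import torch
import torch.nn as nn


def interpolate_state_dicts(base_sd, finetuned_sd, alpha=0.5):
    """Linearly interpolate two state dicts with identical keys/shapes.

    alpha = 0.0 keeps the base weights (generalist knowledge intact);
    alpha = 1.0 keeps the fine-tuned weights (specialist knowledge);
    intermediate values often retain much of both.
    """
    assert base_sd.keys() == finetuned_sd.keys()
    return {
        key: (1.0 - alpha) * base_sd[key] + alpha * finetuned_sd[key]
        for key in base_sd
    }


# Tiny stand-in modules so the sketch runs end to end; in practice these
# would be a base CLIP checkpoint and its biomedical fine-tune.
base = nn.Linear(4, 2)
finetuned = nn.Linear(4, 2)

merged = nn.Linear(4, 2)
merged.load_state_dict(
    interpolate_state_dicts(base.state_dict(), finetuned.state_dict(), alpha=0.5)
)
```

In practice, alpha would be swept on held-out generalist and specialist tasks to pick a trade-off that keeps the base model's biomedical knowledge while preserving the fine-tune's gains.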