Scott Emmons

I am a research scientist at Google DeepMind focused on AI safety and alignment. I completed my PhD at UC Berkeley’s Center for Human-Compatible AI, advised by Stuart Russell. I previously cofounded far.ai, a 501(c)(3) research nonprofit that incubates and accelerates beneficial AI research agendas.

I am interested in both the theory and practice of AI alignment. I have helped characterize how RLHF can lead to deception when the AI sees more than the human, develop multimodal attacks and benchmarks for open-ended agents, and use mechanistic interpretability to find evidence of learned look-ahead in a chess-playing neural network.

Curriculum Vitae

scott at scottemmons dot com

Publications

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah
arXiv, 2025
[arXiv] [BibTeX]
An Approach to Technical AGI Safety and Security
Rohin Shah, ..., Scott Emmons^*, ..., and Anca Dragan
arXiv, 2025
[arXiv] [BibTeX]
Observation Interference in Partially Observable Assistance Games
Scott Emmons^*, Caspar Oesterheld^*, Vincent Conitzer, and Stuart Russell
International Conference on Machine Learning, 2025
[arXiv] [BibTeX]
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristobal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, Tony Tong Wang, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, and Ethan Perez
International Conference on Learning Representations, 2025
[code] [arXiv] [BibTeX]
The Partially Observable Off-Switch Game
Andrew Garber^*, Rohan Subramani^*, Linus Luu^*, Mark Bedaywi, Stuart Russell, and Scott Emmons
AAAI Conference on Artificial Intelligence, 2025
[arXiv] [BibTeX]
Obfuscated Activations Bypass LLM Latent-Space Defenses
Luke Bailey^*, Alex Serrano^*, Abhay Sheshadri^*, Mikhail Seleznyov^*, Jordan Taylor^*, Erik Jenner^*, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons
arXiv, 2024
[project page] [code] [arXiv] [BibTeX]
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
Leon Lang^*, Davis Foote^*, Stuart Russell, Anca Dragan, Erik Jenner, and Scott Emmons^*
Neural Information Processing Systems, 2024
[arXiv] [BibTeX]
A StrongREJECT for Empty Jailbreaks
Alexandra Souly^*, Qingyuan Lu^*, Dillon Bowen^*, Tu Trinh^†, Elvis Hsieh^†, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons^‡, Olivia Watkins^‡, and Sam Toyer^‡
Neural Information Processing Systems, 2024
[code] [arXiv] [BibTeX]
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, and Stuart Russell
Neural Information Processing Systems, 2024
[project page] [code] [arXiv] [BibTeX]
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey^*, Euan Ong^*, Stuart Russell, and Scott Emmons
International Conference on Machine Learning, 2024
[project page] [code] [arXiv] [BibTeX]
ALMANACS: A Simulatability Benchmark for Language Model Explainability
Edmund Mills, Shiye Su, Stuart Russell, and Scott Emmons
arXiv, 2023
[code] [arXiv] [BibTeX]
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan^*, Chan Jun Shern^*, Andy Zou^*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks
International Conference on Machine Learning, 2023
[project page] [code] [arXiv] [BibTeX]
For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria
Scott Emmons, Caspar Oesterheld, Andrew Critch, Vincent Conitzer, and Stuart Russell
International Conference on Machine Learning, 2022
[code] [arXiv] [BibTeX]
RvS: What is Essential for Offline RL via Supervised Learning?
Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, and Sergey Levine
International Conference on Learning Representations, 2022
[code] [arXiv] [BibTeX]
An Empirical Investigation of Representation Learning for Imitation
Xin Chen^*, Sam Toyer^*, Cody Wild^*, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H. Wang, Ping Luo, Stuart Russell, Pieter Abbeel, and Rohin Shah
Neural Information Processing Systems, 2021
[code] [arXiv] [BibTeX]
Sparse Graphical Memory for Robust Planning
Scott Emmons^*, Ajay Jain^*, Michael Laskin^*, Thanard Kurutach, Pieter Abbeel, and Deepak Pathak
Neural Information Processing Systems, 2020
[video] [code] [arXiv] [BibTeX]
Concurrency and Reachability in Treelike Temporal Networks
Eun Lee, Scott Emmons, Ryan Gibson, James Moody, and Peter J. Mucha
Physical Review E, 2019
[arXiv] [BibTeX]
A Map Equation with Metadata: Varying the Role of Attributes in Community Detection
Scott Emmons and Peter J. Mucha
Physical Review E, 2019
[code] [arXiv] [BibTeX]
Global Redundancy Resolution via Continuous Pseudoinversion of the Forward Kinematic Map
Kris Hauser and Scott Emmons
IEEE Transactions on Automation Science and Engineering, 2018
[project page] [code] [preprint] [BibTeX]
MOOC Visual Analytics: Empowering Students, Teachers, Researchers, and Platform Developers of Massively Open Online Courses
Scott Emmons, Robert Light, and Katy Börner
Journal of the Association for Information Science and Technology (JASIST), 2017
[project page] [code] [preprint] [BibTeX]
Post-Processing Partitions to Identify Domains of Modularity Optimization
William H. Weir, Scott Emmons, Ryan Gibson, Dane Taylor, and Peter J. Mucha
Algorithms, 2017
[code] [arXiv] [BibTeX]
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
Scott Emmons, Stephen Kobourov, Mike Gallant, and Katy Börner
PLoS ONE, 2016
[project page] [code] [arXiv] [BibTeX]

Open-Source Software

imitation: Clean Imitation Learning Implementations
Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, and Stuart Russell
arXiv, 2022
[code] [arXiv] [documentation] [BibTeX]

Leadership

Sysadmin and Internship Manager, Center for Human-Compatible AI.
Berkeley, CA, August 2019 to May 2024.
[about]
Cofounder and President, far.ai.
Berkeley, CA, February 2022 to July 2023.
[about]

Service

Volunteer Teacher, Shanti Bhavan Children's Project.
Tamil Nadu, India, Summer 2017.
[about] [Netflix documentary]
Volunteer Teacher, Sunflower County Freedom Project.
Mississippi Delta, Summer 2016.
[about]