Owain Evans

Owain Evans
Owain Evans
Alma mater	Massachusetts Institute of Technology (PhD); Columbia University (BA);
Known for	AI alignment research; TruthfulQA benchmark; Reversal curse; Emergent misalignment;
	Scientific career
Fields	Artificial intelligence, AI safety, machine learning
Institutions	Truthful AI; Center for Human Compatible AI, UC Berkeley; Future of Humanity Institute, University of Oxford;
Website	https://owainevans.github.io/

Owain Rhys Evans is a British artificial intelligence researcher who works on AI alignment and machine learning safety. He founded Truthful AI, a research group based in Berkeley, California, and is an affiliate of the Center for Human Compatible AI (CHAI) at the University of California, Berkeley.^[1] His research addresses AI truthfulness, emergent behaviors in large language models, and the alignment of AI systems with human values.^[2]

Education

Evans earned a Bachelor of Arts in philosophy and mathematics from Columbia University in 2008 and a PhD in philosophy from the Massachusetts Institute of Technology in 2015. His doctoral research focused on Bayesian computational models of human preferences and decision-making.^[3]

Career

After completing his doctorate, Evans held positions at the Future of Humanity Institute (FHI) at the University of Oxford, first as a postdoctoral research fellow and later as a research scientist.^[4]^[2] While at FHI, he co-authored a survey of machine learning researchers on timelines for human-level AI, published in the Journal of Artificial Intelligence Research.^[5] The survey was reported on by Newsweek, New Scientist, the BBC, and The Economist.^[6]^[7]^[8]^[9] He was also among the co-authors of a 2018 report on the potential for misuse of AI technologies, published by researchers at Oxford, Cambridge, and other institutions.^[10]^[11]

Since 2022, Evans has been based in Berkeley, where he founded Truthful AI, a non-profit research group that studies AI truthfulness, deception, and emergent behaviors in large language models.^[1]

Research

Evans's early work examined challenges in inverse reinforcement learning when human behavior is irrational or biased, proposing methods for AI systems to infer preferences from imperfect human demonstrations.^[12] He co-developed TruthfulQA (2021), a benchmark that tests whether language models give truthful answers rather than repeating common misconceptions. Initial evaluations found that larger models were not more truthful, suggesting that scaling alone does not improve factual accuracy.^[13]^[14] The benchmark has since been used by AI developers to evaluate large language models.^[2]^[15] He also co-authored a paper proposing design and governance strategies for building AI systems that do not deceive or hallucinate.^[16]

In 2023, Evans and collaborators described the "reversal curse", showing that language models trained on a fact in one direction (e.g. "A is B") often cannot answer the corresponding reverse query ("B is A"). The paper was published at ICLR 2024.^[17]^[18] His group also developed a benchmark for evaluating situational awareness in language models, presented at NeurIPS 2024.^[19]

In 2025, Evans and colleagues published a study in Nature on what they termed "emergent misalignment": fine-tuning a language model on a narrow task (writing insecure code) caused it to produce unrelated harmful outputs without explicit instruction to do so.^[20]^[21]^[22] The findings prompted follow-up research by OpenAI, Anthropic, and Google DeepMind.^[23]^[24] Later that year, Evans and collaborators (including researchers at Anthropic) reported that hidden behavioral traits can transfer between language models through training data, even when those traits are not explicitly present in the data, a phenomenon they called "subliminal learning".^[25]

Public engagement

In November 2025, Evans delivered the Hinton Lectures, a keynote lecture series on AI safety co-founded by Geoffrey Hinton and the Global Risk Institute.^[26]^[27]^[28]

References

^ ^a ^b "About us". TruthfulAI. Retrieved 14 February 2026.
^ ^a ^b ^c Ough, Tom (November 2024). "Looking Back at the Future of Humanity Institute". Asterisk. Retrieved 14 February 2026.
^ Evans, Owain Rhys (2015). Bayesian Computational Models for Inferring Preferences (PhD). Massachusetts Institute of Technology. Retrieved 14 February 2026.
^ Davey, Tucker (8 October 2018). "Cognitive Biases and AI Value Alignment: An Interview with Owain Evans". Future of Life Institute. Retrieved 14 February 2026.
^ Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018). "When Will AI Exceed Human Performance? Evidence from AI Experts". Journal of Artificial Intelligence Research. 62: 729–754.
^ Bort, Ryan (31 May 2017). "Will AI Take Over? Artificial Intelligence Will Best Humans at Everything by 2060, Experts Say". Newsweek. Retrieved 14 February 2026.
^ Revell, Timothy (31 May 2017). "AI will be able to beat us at everything by 2060, say experts". New Scientist. Retrieved 14 February 2026.
^ Gray, Richard (19 June 2017). "How long will it take for your job to be automated?". BBC. Retrieved 14 February 2026.
^ Cross, Tim (2018). "Human obsolescence". The Economist.
^ "Global AI experts sound the alarm in unique report". University of Cambridge. 2018. Retrieved 14 February 2026.
^ Naughton, John (25 February 2018). "Don't worry about AI going bad – the minds behind it are the danger". The Observer. Retrieved 14 February 2026.
^ Evans, Owain; Stuhlmüller, Andreas; Goodman, Noah D. (2016). "Learning the Preferences of Ignorant, Inconsistent Agents". Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. Retrieved 14 February 2026.
^ Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.
^ Naughton, John (2 October 2021). "The truth about artificial intelligence? It isn't that honest". The Observer. Retrieved 14 February 2026.
^ Gertner, Jon (18 July 2023). "Wikipedia's Moment of Truth". The New York Times Magazine. Retrieved 14 February 2026.
^ Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William (2021). "Truthful AI: Developing and governing AI that does not lie". arXiv:2110.06674 [cs.CY].
^ Berglund, Lukas; Tong, Meg; Kaufmann, Max; Balesni, Mikita; Stickland, Asa Cooper; Korbak, Tomasz; Evans, Owain (2024). "The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"". Proceedings of the International Conference on Learning Representations (ICLR).
^ Hern, Alex (6 August 2024). "Why AI's Tom Cruise problem means it is 'doomed to fail'". The Guardian. Retrieved 14 February 2026.
^ Laine, Rudolf; Chughtai, Bilal; Evans, Owain; et al. (2024). "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs". Advances in Neural Information Processing Systems (NeurIPS).
^ Betley, Jan; Warncke, Niels; Sztyber-Betley, Anna; Tan, Daniel; Bao, Xuchan; Soto, Martín; Srivastava, Megha; Labenz, Nathan; Evans, Owain (14 January 2026). "Training large language models on narrow tasks can lead to broad misalignment". Nature. 649: 584–589. arXiv:2502.17424. doi:10.1038/s41586-025-09937-5. Retrieved 14 February 2026.
^ Nolan, Beatrice (4 March 2025). "Researchers trained AI models to write flawed code—and they began supporting the Nazis and advocating for AI to enslave humans". Fortune. Retrieved 14 February 2026.
^ Ahuja, Anjana (2 September 2025). "How AI models can optimise for malice". Financial Times.{{cite news}}: CS1 maint: deprecated archival service (link)
^ Ornes, Stephen (13 August 2025). "The AI Was Fed Sloppy Code. It Turned Into Something Evil". Quanta Magazine. Retrieved 14 February 2026.
^ Hall, Peter (18 June 2025). "OpenAI can rehabilitate AI models that develop a "bad boy persona"". MIT Technology Review. Retrieved 14 February 2026.
^ Hasson, Emma R. (29 August 2025). "Subliminal Learning Lets Student AI Models Learn Unexpected (and Sometimes Misaligned) Traits from Their Teachers". Scientific American. Retrieved 14 February 2026.
^ "The Hinton Lectures Return" (Press release). AI Safety Foundation. 7 October 2025. Retrieved 14 February 2026 – via PR Newswire.
^ Kirkwood, Isabelle (7 October 2025). "The Hinton Lectures return as AI's safety cracks widen". BetaKit. Retrieved 14 February 2026.
^ The Hinton Lectures 2025 – Night 1 – AI Agents: Risks and Opportunities. The AI Safety Foundation. 10 November 2025. Retrieved 14 February 2026 – via YouTube.

External links

Personal website
Owain Evans publications indexed by Google Scholar

[TruthfulAIAbout-1] "About us". TruthfulAI. Retrieved 14 February 2026.

[Asterisk2024-2] Ough, Tom (November 2024). "Looking Back at the Future of Humanity Institute". Asterisk. Retrieved 14 February 2026.

[MITThesis-3] Evans, Owain Rhys (2015). Bayesian Computational Models for Inferring Preferences (PhD). Massachusetts Institute of Technology. Retrieved 14 February 2026.

[FLI2018-4] Davey, Tucker (8 October 2018). "Cognitive Biases and AI Value Alignment: An Interview with Owain Evans". Future of Life Institute. Retrieved 14 February 2026.

[GovAI2018-5] Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018). "When Will AI Exceed Human Performance? Evidence from AI Experts". Journal of Artificial Intelligence Research. 62: 729–754.

[Newsweek2017-6] Bort, Ryan (31 May 2017). "Will AI Take Over? Artificial Intelligence Will Best Humans at Everything by 2060, Experts Say". Newsweek. Retrieved 14 February 2026.

[NewScientist2017-7] Revell, Timothy (31 May 2017). "AI will be able to beat us at everything by 2060, say experts". New Scientist. Retrieved 14 February 2026.

[BBC2017-8] Gray, Richard (19 June 2017). "How long will it take for your job to be automated?". BBC. Retrieved 14 February 2026.

[Economist2018-9] Cross, Tim (2018). "Human obsolescence". The Economist.

[CamAI2018-10] "Global AI experts sound the alarm in unique report". University of Cambridge. 2018. Retrieved 14 February 2026.

[Guardian2018-11] Naughton, John (25 February 2018). "Don't worry about AI going bad – the minds behind it are the danger". The Observer. Retrieved 14 February 2026.

[AAAI2016-12] Evans, Owain; Stuhlmüller, Andreas; Goodman, Noah D. (2016). "Learning the Preferences of Ignorant, Inconsistent Agents". Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. Retrieved 14 February 2026.

[TruthfulQA2021-13] Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.

[Guardian2021-14] Naughton, John (2 October 2021). "The truth about artificial intelligence? It isn't that honest". The Observer. Retrieved 14 February 2026.

[NYT2023-15] Gertner, Jon (18 July 2023). "Wikipedia's Moment of Truth". The New York Times Magazine. Retrieved 14 February 2026.

[TruthfulAI2021-16] Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William (2021). "Truthful AI: Developing and governing AI that does not lie". arXiv:2110.06674 [cs.CY].

[ReversalCurse2024-17] Berglund, Lukas; Tong, Meg; Kaufmann, Max; Balesni, Mikita; Stickland, Asa Cooper; Korbak, Tomasz; Evans, Owain (2024). "The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"". Proceedings of the International Conference on Learning Representations (ICLR).

[Guardian2024-18] Hern, Alex (6 August 2024). "Why AI's Tom Cruise problem means it is 'doomed to fail'". The Guardian. Retrieved 14 February 2026.

[SAD2024-19] Laine, Rudolf; Chughtai, Bilal; Evans, Owain; et al. (2024). "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs". Advances in Neural Information Processing Systems (NeurIPS).

[Nature2025-20] Betley, Jan; Warncke, Niels; Sztyber-Betley, Anna; Tan, Daniel; Bao, Xuchan; Soto, Martín; Srivastava, Megha; Labenz, Nathan; Evans, Owain (14 January 2026). "Training large language models on narrow tasks can lead to broad misalignment". Nature. 649: 584–589. arXiv:2502.17424. doi:10.1038/s41586-025-09937-5. Retrieved 14 February 2026.

[Fortune2025-21] Nolan, Beatrice (4 March 2025). "Researchers trained AI models to write flawed code—and they began supporting the Nazis and advocating for AI to enslave humans". Fortune. Retrieved 14 February 2026.

[FT2025-22] Ahuja, Anjana (2 September 2025). "How AI models can optimise for malice". Financial Times.{{cite news}}: CS1 maint: deprecated archival service (link)

[Quanta2025-23] Ornes, Stephen (13 August 2025). "The AI Was Fed Sloppy Code. It Turned Into Something Evil". Quanta Magazine. Retrieved 14 February 2026.

[MITTechReview2025-24] Hall, Peter (18 June 2025). "OpenAI can rehabilitate AI models that develop a "bad boy persona"". MIT Technology Review. Retrieved 14 February 2026.

[SciAm2025-25] Hasson, Emma R. (29 August 2025). "Subliminal Learning Lets Student AI Models Learn Unexpected (and Sometimes Misaligned) Traits from Their Teachers". Scientific American. Retrieved 14 February 2026.

[PRNewswire2025-26] "The Hinton Lectures Return" (Press release). AI Safety Foundation. 7 October 2025. Retrieved 14 February 2026 – via PR Newswire.

[BetaKit2025-27] Kirkwood, Isabelle (7 October 2025). "The Hinton Lectures return as AI's safety cracks widen". BetaKit. Retrieved 14 February 2026.

[28] The Hinton Lectures 2025 – Night 1 – AI Agents: Risks and Opportunities. The AI Safety Foundation. 10 November 2025. Retrieved 14 February 2026 – via YouTube.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]