Jun 17, 2024

Sycophancy to subterfuge: Investigating reward tampering in language models

Posted in categories: cybercrime/malcode, robotics/AI

New Anthropic research: Investigating Reward Tampering.

Could AI models learn to hack their own reward system?

In a new paper, we show they can, by generalization from training in simpler settings.

Empirical evidence that serious misalignment can emerge from seemingly benign reward misspecification.