Lifeboat Foundation: Safeguarding Humanity



Jun 17, 2024

Sycophancy to subterfuge: Investigating reward tampering in language models

New Anthropic research: Investigating Reward Tampering.

Could AI models learn to hack their own reward system?

In a new paper, we show they can, by generalization from training in simpler settings.

Sycophancy to Subterfuge: Investigating…

Empirical evidence that serious misalignment can emerge from seemingly benign reward misspecification.
