Blog

Oct 11, 2024

Over-optimization in RL is well known, but it occurs even when KL(policy || base model) is constrained fairly tightly.
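For readers who want the quantity spelled out, here is a minimal sketch of KL(policy || base model) for two discrete distributions over the same set of outcomes. The function name `kl_divergence`, the toy probabilities, and the `kl_budget` threshold are all hypothetical and chosen only for illustration; in practice the KL is typically estimated from sampled token log-probabilities rather than computed exactly.

```python
import math


def kl_divergence(policy, base):
    """KL(policy || base) in nats for two discrete distributions.

    Both arguments are sequences of probabilities over the same outcomes.
    """
    total = 0.0
    for p, q in zip(policy, base):
        if p == 0.0:
            continue  # terms with p = 0 contribute nothing
        if q == 0.0:
            return math.inf  # policy puts mass where the base model has none
        total += p * math.log(p / q)
    return total


# Toy distributions (assumed values, not taken from any experiment).
base = [0.25, 0.25, 0.25, 0.25]
policy = [0.40, 0.30, 0.20, 0.10]

kl = kl_divergence(policy, base)

# Hypothetical budget: a "tight" constraint means keeping kl below a small value.
kl_budget = 0.2
within_budget = kl <= kl_budget
```

The direction matters: the expectation is taken under the policy, so the divergence penalizes the policy for placing probability where the base model places little.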
