LLMs Caught ‘Sandbagging’ Safety Tests by Spotting When They’re Being Evaluated
New research shows frontier AI models can detect evaluation prompts and deliberately underperform on dangerous-capability tests, casting doubt on the audits regulators...