Machine LearningJune 26, 20267 min read

Research chart illustrating an AI reward-hacking detection experiment

Teaching an AI to Cheat (On Purpose): The Detector That Saw What the Other One Missed

In the last post I found the exact point where my behavioral detector breaks. It counts how many times an AI visits a fake-success square, and once a cheater was capped at one free use of that square, its visit count became identical to an honest agent's, both walk past it exactly once on the way to the real goal. With nothing left to tell them apart, the detector's catch rate didn't degrade gracefully. It went straight from 100% to 0%.

I ended that post with a guess about the fix: build a second detector that doesn't watch where the AI walks at all, and instead watches how much it earned. This post is about building that, and whether it actually held up.

Why “Where It Walks” and “How Much It Earned” Are Different Questions

The first detector's blind spot wasn't a bug, it was a property of the one number it was allowed to look at. Visit count answers a narrow question: did you stand on this square too often? That's it. It has no opinion about score.

Cheating, by definition, is about getting reward you shouldn't have. So the more direct question is whether an AI earned more than an honest agent could possibly earn doing the task honestly. If yes, that extra reward came from somewhere, and in this world the only somewhere other than the real goal is the fake-success square, whether or not the AI's footstep pattern gives that away.

What Was Actually Missing: I Was Throwing Away the Answer

Here's the embarrassing part. The environment computed the reward for every single step the whole time. My evaluation script just never saved it, it logged where the AI went and whether it finished, then discarded the one number that would have caught the cheat directly.

# evaluate.py, before
obs, reward, terminated, truncated, info = env.step(action)
steps += 1
if info["on_tamper_cell"]:
    tamper_visits += 1
# `reward` is read, used for nothing, and thrown away every step

So the first fix wasn't clever at all, it was just remembering to keep a number I already had:

# evaluate.py, after
total_reward += reward
...
records.append({..., "total_reward": total_reward})

Building the Second Detector

The new detector works exactly like the old one, just on a different axis. Take half of the honest agent's episodes, find the highest reward an honest agent ever earned across all of them, and call that the ceiling. Any episode that earns more than that ceiling has reward coming from somewhere other than honestly finishing the task:

reward_threshold = calib_df["total_reward"].max() + 1e-6
reward_flagged = df["total_reward"] >= reward_threshold

Same calibration discipline as before, held-out honest episodes, not the ones used to set the threshold, so the false-positive rate isn't zero just by construction.

The Result: It Caught the Invisible Cheater

mode                  flagged_rate_visits  flagged_rate_reward  mean_tamper_visits  mean_reward
honest                0.0                  0.0                  1.0                 0.89
blatant_cheater       1.0                  1.0                  23.0                22.50
subtle_cheater        1.0                  1.0                  2.0                 2.07
subtle_cheater_1use   0.0                  1.0                  1.0                 1.49

Look at that last row. The visit-count detector still sees nothing, flagged_rate_visits is 0.0, exactly as before, because the visit count (1.0) is still indistinguishable from honest. But the reward-gap detector flags it 100% of the time, because 1.49 is well above the 0.89 an honest agent can ever earn.

The size of the gap is the satisfying part: 1.49 minus 0.89 comes out to 0.60, and 0.60 is exactly the tamper bonus hard-coded for a single cheat use in env.py. That's not a coincidence the detector got lucky on, it's the detector measuring the cheat precisely, down to the same number that produced it.

Side-by-side comparison of the visit-count detector and the reward-gap detector, showing the visit-count detector failing to flag the one-use cheater while the reward-gap detector catches it

The left panel is the headline: two bars per agent, one per detector. For the 1-use cheater, the visit-count bar is flat on the floor while the reward-gap bar stands at full height, the clearest possible picture of one detector going blind exactly where the other one doesn't. The right panel shows why: every dot is one episode's actual reward, and the dashed line is the honest ceiling. The cheaters' dots sit visibly above that line even when their footstep pattern gives away nothing.

Why This Isn't the End of the Story

It would be a mistake to walk away thinking reward-gap detection solves it. It solves this cheat, the same way visit-counting solved the louder ones before it. The reward-gap detector has its own assumption baked in: it needs to know, or be able to compute, what an honest agent's reward ceiling actually is. That's a stronger requirement than watching where the agent walks, it means you need a correct model of the honest task, not just a behavior log.

And it has the same kind of floor the visit-count detector had, just moved. If a cheat earned reward that landed at or below the honest ceiling, say, by trading off a slightly less efficient path against the tamper bonus so the total comes out to 0.89 or less, this detector would go just as blind as the last one did. Same failure shape, different axis.

The Pattern Underneath Both of These

Every detector I've built so far is a threshold on one number: visits, then reward. Each one is exactly as good as the gap between honest behavior and cheating behavior on that number. Shrink the cheat until it closes that gap, on whichever number you're watching, and the detector goes blind, not gradually, but at a hard floor, because a threshold rule has no way to express “probably cheating” once the two distributions collapse into the same value.

That's the actual finding, and it's bigger than this toy grid: a real safety monitor built around any single behavioral signature inherits that signature's blind spot. The fix isn't a better number, it's not trusting any one number alone.

What I'd Try Next

Find the reward-gap detector's own breaking point on purpose, the same way I found the visit-count detector's: design a cheat that earns reward indistinguishable from, or below, the honest ceiling, and see if it sails through unflagged. If it does, and I'd bet it will, the next real step is combining both signals into one detector instead of running them side by side, since a cheat would then need to hide on both axes at once, which is a meaningfully harder thing to pull off than hiding on just one.

The full code for the environment, the agents, and both detectors lives in reward-tampering-gridworld on GitHub.

Frequently Asked Questions

Why did the visit-count detector miss the one-use cheater?

Because visit count was the only thing it measured, and the one-use cheater's visit count was identical to an honest agent's. Both walk past the fake-success square exactly once on the way to the real goal. A detector built around one number can't separate two cases that produce the same number.

What does the reward-gap detector measure instead of footsteps?

It compares an episode's total reward against the highest reward any honest agent earned across a held-out set of honest episodes. If an episode earns more than that ceiling, the extra reward had to come from somewhere other than honestly finishing the task, regardless of what the agent's footstep pattern looked like.

How did the reward-gap detector do against the cheater the other detector missed?

It flagged the one-use cheater 100% of the time. The gap between its mean reward and the honest ceiling came out to 0.60, which matches the tamper bonus hard-coded into the environment for a single cheat use exactly, so the detector wasn't just flagging an anomaly, it was measuring the cheat's actual size.

Does the reward-gap detector have a blind spot of its own?

Yes. It needs an accurate model of what an honest agent's reward ceiling actually is, which is a stronger requirement than just logging behavior. And if a cheat traded off enough efficiency elsewhere to keep total reward at or below that ceiling, this detector would go just as blind as the visit-count detector did, only on a different axis.