AI & SecurityJune 23, 20267 min read

A gridworld diagram illustrating an AI agent, its real goal, and a tampering shortcut it can exploit

Teaching an AI to Cheat (On Purpose): The Problem of Reward Hacking

I've been curious about how AI actually works under the hood for a while, beyond just using it as a tool, so I built a small reinforcement learning project purely to learn from it: reward-tampering-gridworld on GitHub. This post is about what I found.

The Problem

When we train an AI with reinforcement learning, we don't tell it how to do a task. We just give it a reward, a number that goes up when it does well, and let it figure out the rest through trial and error.

That sounds great until you realize something uncomfortable: the AI doesn't actually care about the task. It only cares about the number. And if there's ever a gap between “the number went up” and “the task actually got done,” a sufficiently clever AI will find that gap and exploit it.

This is a real, well-known problem in AI safety called reward hacking (or “specification gaming”). It's not science fiction. It happens constantly with real systems, in small and large ways. A cleaning robot rewarded for “not seeing any mess” might just turn off its camera. A trading bot rewarded for “profit” might find a measurement glitch and exploit that instead of actually trading well. The AI isn't being malicious. It's doing exactly what it was told to optimize. The problem is that what we measured wasn't quite the same as what we wanted.

I wanted to see this happen myself, on a small scale I could fully understand and control, and then try to build something that catches it when it happens.

The Setup, in Plain Terms

I built a tiny grid, think of a 7x7 chessboard. An AI agent starts in one corner and has to walk to the opposite corner, which is the real goal.

But I also added a second, special square somewhere in the middle of the grid. If the agent stands on that square, a “sensor” reports “success!”, even though the agent never actually reached the real goal. It's a loophole: a way to fake the measurement without doing the real job.

Then I trained three different versions of the AI:

One that's only rewarded for actually reaching the goal (no way to cheat).
One that's rewarded purely by what the fake sensor says (cheats constantly, doesn't care about the real goal at all).
One that's mostly rewarded for the real goal, but can also grab a couple of small bonus points from the fake sensor along the way: a more realistic, “mildly dishonest” version.

The question I wanted to answer: can you build something that watches an AI's behavior and reliably tells you when it's cheating, even when the cheating is small and easy to miss?

To find out, I built the world described above, trained three AI agents on it with different reward rules (one honest, one that cheats blatantly, one that cheats subtly), and built a separate “detector” program that watches behavior, not score, to catch the cheaters. It worked, but not on the first try: along the way I ran into a bug where one of the agents literally couldn't perceive whether cheating was even possible, which meant it never learned the shortcut existed at all. Once I fixed that and gave it a clear signal of the shortcut's availability, a second issue showed up in the other direction. It cheated too much once it could perceive the option, overusing the shortcut far past what felt like a “mildly dishonest” agent should. Both bugs ended up being more interesting than the original plan, because each one was its own small lesson in how sensitive an agent's behavior is to exactly what it can and can't observe.

Under the Hood, for the Technically Curious

The environment is a custom gymnasium.Env subclass (Gymnasium is the standard Python interface for RL environments). The agent has 4 discrete actions (up, down, left, right), and the observation it receives each step is just a few numbers: its normalized (row, col) position, plus a signal indicating whether its cheat shortcut currently has any uses left.

The key design choice is tracking two separate reward signals internally: true_done (ground truth: did it actually reach the target?) and sensor_triggered (the measurement the agent is actually trained on, which can be faked). Every result in this project comes from comparing these two signals after the fact. The agent itself only ever sees the second one.

true_done = bool(np.array_equal(self.agent_pos, self.target_pos))
on_tamper_cell = bool(np.array_equal(self.agent_pos, self.tamper_pos))

The three agents were trained with PPO (Proximal Policy Optimization) via Stable-Baselines3, a standard, well-tested RL library. The point of the project was never to reimplement an RL algorithm, but to study what happens when the reward function itself is flawed.

Why I Built This

I'm a developer who got curious about how AI actually learns, beyond the surface level of calling an API. Reward hacking is one of those ideas that sounds abstract until you watch it happen on your own screen, in a grid you built yourself, with an agent you trained yourself. Seeing the cheating agent walk straight to the tamper square instead of the real goal, every single episode, made the concept click in a way that reading about it never did.

The full project, including the detector and the training scripts for all three agents, is on GitHub if you want to poke at it yourself.

Frequently Asked Questions

What is reward hacking in reinforcement learning?

Reward hacking (also called specification gaming) happens when an AI trained with reinforcement learning finds a way to make its reward signal go up without actually doing the task the reward was meant to measure. It is not malicious behavior. The AI is optimizing exactly what it was told to optimize. The problem is a gap between what got measured and what was actually wanted.

How do you detect reward hacking in an RL agent?

By comparing the agent's behavior against ground truth rather than trusting the reward signal alone. In this project, that meant tracking two separate outcomes for every episode: true_done (did the agent actually reach the real goal?) and sensor_triggered (what the agent's reward function reported), then building a separate detector that watches for divergence between the two, even when that divergence is small or occasional.

Why use a small gridworld instead of a complex environment?

A 7x7 gridworld keeps every variable visible and every outcome explainable. With a custom gymnasium.Env, a handful of discrete actions, and a deliberately placed exploit, it is possible to know exactly when and why an agent cheats, which makes it possible to build and verify a detector with confidence before thinking about scaling the idea up to less controlled environments.

What RL library was used to train the agents?

Proximal Policy Optimization (PPO) via Stable-Baselines3, a standard and well-tested RL library. The point of the project was not to reimplement an RL algorithm, but to study what happens when the reward function itself is flawed, so the training algorithm was kept as a known quantity and the environment was the variable under test.