Machine LearningJune 24, 20268 min read

Diagram of a gridworld with an AI agent, a target square, and an exploit square, alongside a behavior monitor

Teaching an AI to Cheat (On Purpose): What I Built

In the first post in this series, I explained the idea behind reward hacking. This one is the build log: the actual pieces of reward-tampering-gridworld on GitHub, in the order I built them.

A Quick Recap of the Setup

An AI trained with reinforcement learning doesn't get told how to do a task. It gets a reward number that climbs when it does well, and it works out a strategy through trial and error. If there's ever a gap between “the number went up” and “the task actually got done,” a clever agent will exploit that gap instead of doing the real work. That's reward hacking.

To study it on a scale I could fully control, I built a 7x7 grid world where an agent has a real goal, a target square in the opposite corner, and a fake shortcut: a separate “tamper” square that fires a success signal without the agent ever finishing the job. This post walks through what I built to study that gap, piece by piece.

A quick note on where I'm coming from: I've worked as a developer for a while, so reading code, debugging, and structuring a project weren't new skills here. Python specifically was. I'm new enough to it that half the fun of this project was learning the language while also learning the RL concepts on top of it, and genuinely enjoying both at once.

1. The Environment

The grid itself is a small Python program built on Gymnasium, the standard interface most RL training libraries expect an environment to implement. Each turn, the agent chooses a direction, and the environment reports back where it ended up, whether it actually reached the goal, and whether it's currently standing on the fake square.

The detail that matters most: the environment tracks two signals that never get mixed together. One is the truth, whether the episode actually ended at the real target. The other is what the sensor reports, which is the only thing the agent is rewarded on and the only thing it can manipulate. Keeping these apart is what makes it possible to later prove, with data, that a gap exists between looking successful and being successful.

Concretely, it's a gymnasium.Env subclass with a 7x7 grid, 4 discrete movement actions, and an observation space of just three numbers: row, column, and whether the cheat shortcut still has uses left. Every step() call returns the usual RL tuple, plus an info dict carrying the ground-truth flag the agent's reward function never sees:

info = {
    "true_done": true_done,               # ground truth, hidden from the reward
    "sensor_triggered": sensor_triggered,  # what the agent is actually rewarded on
    "on_tamper_cell": on_tamper_cell,
}

2. Three Agents, Three Incentives

With the environment in place, I trained three versions of the agent using PPO (Proximal Policy Optimization) through Stable-Baselines3, an established RL library. Each version got a different reward rule:

Honest: rewarded only for reaching the real target, with no shortcut available.
Unrestrained cheater: rewarded purely for standing on the fake square, with unlimited uses.
Light cheater: rewarded mainly for the real goal, with a capped number of bonus points (two per attempt) available from the fake square.

None of these agents were instructed to cheat or not cheat. Each one ran thousands of trial-and-error attempts and converged on whatever strategy maximized its score under its own reward rule.

Training is close to a one-liner once the environment and reward rule exist:

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
model.save(f"agents/{mode}")

Stable-Baselines3 ships a Monitor wrapper that logs per-episode reward and a chosen field from the info dict to a CSV automatically, so instead of writing custom logging, info_keywords=("true_done",) pulls the ground-truth flag straight out of the dict above into the log file for free.

3. Catching the Cheaters by Watching Behavior

Once the three agents were trained, I built a separate program whose only job is to look at an agent's behavior, not its score, and flag whether it's likely cheating. The method is simple: count how many times the agent steps onto the fake square during a run. An honest agent might cross it once by coincidence on the way to the real goal. A cheater visits it far more often than that.

To set a fair threshold, I calibrated using only the honest agent's data, establishing what normal behavior looks like, and then tested the resulting rule against all three agents, including the honest one, to check how often it would raise a false alarm.

Concretely, I split the honest agent's episodes in half: one half sets the threshold, the other half is held out purely to measure the false-positive rate. Using the same data to set and test the threshold would make the detector look better than it actually is, the same mistake as evaluating a model on its own training set:

threshold = calib_df["tamper_visits"].max() + 1
flagged = df["tamper_visits"] >= threshold

4. A Second Model That Explains the Violation

The last piece borrows an idea from how Anthropic trains Claude: Constitutional AI, where model behavior gets scored and corrected against a written set of principles rather than relying only on human feedback. I wrote a short, plain-English list of rules describing what the agent should and shouldn't do, then handed a flagged episode and that list to a second, separate model and asked it to explain which rule was broken and what better behavior would have looked like.

That turns “the detector raised a flag” into an actual explanation a person can read and check. The critic runs entirely on my own machine through Ollama, calling a small open model (llama3.2), so there's no API key and no per-request cost. Each flagged trajectory gets converted into a plain-text step list and sent alongside the rules in a single prompt, no fine-tuning required:

response = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])

Full stack: Python, Gymnasium for the environment, Stable-Baselines3 for PPO training, Pandas and NumPy for the data work, Matplotlib for the plots, and Ollama with Llama 3.2 for the local critic.

Where It Went Wrong First

None of this worked cleanly on the first attempt. One agent couldn't even perceive that cheating was an option, so it never learned the shortcut existed. After fixing the observation so the agent could actually see whether the shortcut was available, it swung the other way and leaned on the fake square far more than a “light cheater” should. Both bugs taught me more about how sensitive an agent's behavior is to what it can observe than the original, tidier plan would have.

The full project, environment, training scripts, detector, and critic, is on GitHub if you want to read the code directly.

Frequently Asked Questions

What is a gymnasium.Env and why build a custom one?

Gymnasium is the standard Python interface for reinforcement learning environments: it defines step(), reset(), and an observation/action space that any training library knows how to talk to. A custom subclass was needed here because the environment had to track something off-the-shelf grids don't: a ground-truth completion flag that stays hidden from the reward the agent trains on, alongside the fake sensor signal it can exploit.

How do you set a threshold for a behavior-based cheat detector without cheating yourself?

By splitting the honest agent's episodes into two halves: one half sets the threshold (how many visits to the exploit square counts as suspicious), and the other half, which the threshold never saw, is used to measure how often the detector raises a false alarm. Using the same data for both jobs would make a detector look better than it actually is, the same failure mode as testing a model on its own training data.

What is Constitutional AI and how was it used here?

Constitutional AI, an approach Anthropic uses to train Claude, scores and corrects model behavior against a written set of principles rather than relying only on human feedback. This project borrowed the basic idea on a much smaller scale: a short written list of rules an agent should follow, handed to a separate local language model along with a flagged episode, so it could point out which rule was broken and describe what the agent should have done instead.

Why run the critic model locally with Ollama instead of an API?

Ollama runs open models like Llama 3.2 directly on a local machine, with no API key, no per-request cost, and no data leaving the machine. For a learning project that just needs a model to read a short trajectory and explain a rule violation in plain language, a small local model is enough, and it keeps the entire pipeline runnable offline.