Vibe Coding Warcraft II

Introduction

Image Credit: Lost in Translation (2003) directed by Sofia Coppola. © 2003 Focus Features / Tohokushinsha Film Corporation / American Zoetrope.

The name of this blog means artisanal code and is a nod to the idea that one day in the not too distant future programming as we’ve known it for the last 70 years will probably be as quaint as carding your own wool. That said, I still believe we will need “programmers” for many years. Now, how do I arrive at that conclusion?

First, I am going to assume that we humans are still the customers in the future, i.e. the ones telling the machines what we want. Given this assumption, we are still going to “program” but it will be at the level of providing a specification. The more precise the specification, the more likely you are to get the desired outcome. You will obviously still be able to get a working game through a simple prompt like “Create a Tetris game”, but if you look at all the tens of thousands of intentional design decisions that go into something like BF6 it’s pretty clear to me that we will still be needed.

Second, if you’re thinking “well okay, maybe we’ll need programmers, but maybe only a tenth of them”, let me offer this quote from a recent interview in The Economist with Bret Taylor (chairman of OpenAI):

And what do you do with cost savings? Well, you can pass it on to your shareholders, which is valuable, but I’m a capitalist and you know, if your competitors have access to the same technology, someone will find a way to reinvest that money to gain a competitive edge.

So I hope – and believe – that these technologies will make it possible for us to do more, rather than doing what we currently do at a lower cost.

I’m in a leadership role and feel an obligation to build a solid understanding of this new technology, so that I can help guide my colleagues through the transition. I’ve already used vibe coding to produce a number of smaller tools, but I wanted to do a bit of deliberate practice and see what I can do in a regular weekend (which actually turned out to only be Saturday) in the hours I could squeeze in between other activities.

The target I’ve set for my weekend vibe coding session is to recreate a bit of Warcraft II using the newly released Codex 5.3 model. I could have tried creating an original game, but my employment at EA would have made it impossible to write about, and I also don’t want to have to spend time thinking about design choices. These are my priorities for the weekend:

  • I DO want to learn to set up my project in a way that each task can be executed from a clean agent context.
  • I DO want to learn to create tasks that are written clearly enough that the agent can execute them without needing further clarification.
  • I DO NOT want to try to “one shot” the outcome as one giant task. I still want to be an active part of the process.
  • I DO NOT want to set up my own “Gas Town” that leaves the orchestration in the hands of a separate agent.

Friday, February 13th

7:40 PM – I don’t want to start tomorrow morning with a completely clean slate, so I am going to do a bit of prep work tonight. The first thing I do is ask ChatGPT to create a markdown file describing the Warcraft II binary map format (PUD):

Create a markdown file describing the Warcraft 2 PUD map format in enough detail that I can implement a reader

This takes a bit of time, but I eventually get a neatly organized markdown file containing all the necessary information.

Next, I find the background tilesets for Warcraft II on the internet and convert them to .png format. I could have asked my agent to do this as well, but it was a 30 second job to do it myself. Still, it feels like a missed opportunity. I will also need a Warcraft II PUD file to test with. I briefly consider buying a copy (again) of Warcraft II before I decide to find a fan-made map instead.

The last item of the day is to pull together a structured specification of all the Warcraft II units:

The following page contains a structured overview of all Warcraft II units: https://wowpedia.fandom.com/wiki/Warcraft_II_units Take this information and convert it to two markdown files based on which faction each unit belongs to: Horde and Alliance.

The first generated files fail to include the properties of each unit so I try again:

I also want ALL properties of each unit to be included in the md files. Also, if the names don't exactly match the names from the PUD spec, update the two new md files so that they match.

This time I get exactly what I want.

8:21 PM – It’s time to call it a day.

Saturday, February 14th

9:22 AM – The main challenge with using tools like Codex is that by default these tools completely lack long-term memory and have limitations on their short-term memory (the context window). For a small project like mine, the standard way of addressing this is to add a number of markdown files which describe various aspects of the project. I’m leaving the following files in the root of my repository:

  • AGENTS.md – This file describes how the coding agent should approach work in general. I will share some of the more important aspects below.
  • CURRENT_STATE.md – This file includes a list of the functionality we’ve implemented so far.
  • DECISIONS.md – This file describes why things are the way they are.
  • TASKS.md – This file includes a list of completed and non-completed tasks. I will share what this looked like after my session further down.
  • CHANGELOG.md – This is a complement to the git revision history in an easily accessible format. This is mostly for me.

My first version of AGENTS.md contains the following important sections:

  • Build rules
    • Agent rule: if your change touches >3 files or changes behavior, update CURRENT_STATE.md.
    • If you made/reversed a tradeoff, add a bullet to DECISIONS.md.
    • If you reached a milestone, append to CHANGELOG.md.
  • Session protocol
    • Read: AGENTS.md, CURRENT_STATE.md, DECISIONS.md, TASKS.md, and relevant files in Docs/.
    • Restate goal + touched files.
    • Implement in small steps; keep cargo run working.
    • Run cargo fmt + cargo clippy ... before finishing.
    • Update long-term state files as required.

As I mentioned previously, I am going to execute each task from an empty agent context. My hope is that each context will only need the following four prompts:

  • Read AGENTS.md and CURRENT_STATE.md before doing anything
  • Plan the next task
  • Execute
  • Run git add and git commit with a commit message reflecting what we've done so far

The agent is acting very much like Guy Pearce’s character in the 2000 movie Memento, who has lost his short-term memory and relies on reading notes from his past self and leaving notes for his future self.

Leaving notes for your future self

9:44 AM – With my memory scaffolding in place, I am ready to start working. My plan is to use Rust for this project since it makes a lot of the potential runtime issues in my normal working language, C++, compile errors instead. I have Codex install the full Rust toolchain as well as the SDL2 library which I will use for rendering and input. This works fine, but I get a preview of one of the other pitfalls with agentic AI: its bias for action. It is constantly running ahead and trying to do more than I’m asking it to. This is going to be a constant theme over the next few hours. I used the agent to set up my memory scaffolding and another example of its eagerness to help is that it has already prepopulated my TASKS.md with a lot of items. These superficially look like the right thing, but I am going to ignore them and be specific. This highlights another risk: it is very easy to turn yourself into a meat-puppet who uncritically accepts everything the agent suggests.

A Codex session

10:13 AM – I have my Rust project and toolchain set up. The first task I ask the agent to complete is to init SDL, create a window, and add an event loop. This works fine except for a few snags getting SDL2 to link properly (which the agent sorts out itself).

A skeleton application is up and running

11:07 AM – In my previous vibe coding projects I have not been as structured in my approach to the long-term memory issue and have worked in a more ad hoc manner. It takes some time to structure the TASKS.md list. I am used to just doing stuff, but now I need to plan ahead much more and focus much more on being precise. This feels unfamiliar, and I find myself overthinking design choices, which is the opposite of what vibe coding is supposed to enable.

The next task I give the agent is to wrap the SDL renderer and introduce a classic present + wait-for-vsync game loop. Not much to show visually yet except that the window now clears to blue. It is 11:19 and I am taking a break. I have spent approximately two hours on the project so far.

2:44 PM – I am back at it. The next task I ask the agent to complete is to create a Game class to hold our game logic and give it a step function to update the simulation and a render function to draw the current state of the game. This completes without issues.
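A minimal sketch of what such a split between simulation and drawing might look like. The `Canvas` trait, the `tick` counter, and `RecordingCanvas` are hypothetical names chosen for illustration and are not taken from the actual project; the real code presumably draws through the SDL wrapper instead.

```rust
/// Minimal drawing surface so the game logic does not depend on SDL directly.
/// (Hypothetical trait, introduced only for this sketch.)
pub trait Canvas {
    fn clear(&mut self, rgb: (u8, u8, u8));
}

/// Holds all simulation state.
pub struct Game {
    /// Number of simulation steps executed so far.
    pub tick: u64,
}

impl Game {
    pub fn new() -> Self {
        Game { tick: 0 }
    }

    /// Advance the simulation by one step. Only this function mutates state.
    pub fn step(&mut self) {
        self.tick += 1;
    }

    /// Draw the current state. Takes `&self`, so rendering cannot change the simulation.
    pub fn render(&self, canvas: &mut dyn Canvas) {
        // Nothing to draw yet, so just clear to blue.
        canvas.clear((0, 0, 255));
    }
}

/// Test double that records the last clear color instead of drawing anything.
pub struct RecordingCanvas {
    pub last_clear: Option<(u8, u8, u8)>,
}

impl Canvas for RecordingCanvas {
    fn clear(&mut self, rgb: (u8, u8, u8)) {
        self.last_clear = Some(rgb);
    }
}
```

Keeping `render` read-only is one way to make the step function the single place where game state changes, which also makes the logic testable without opening a window.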

2:56 PM – I ask the agent to create a Map class to hold the game map. The task includes knowing about the four Warcraft II settings: Winter, Wasteland, Swamp, and Summer. I also ask it to provide a constructor that builds a dummy map with a checkerboard pattern. The bias for action shows its face again: the agent skips the planning phase and goes straight to execution. I update my AGENTS.md file with an explicit instruction to not do this going forward. Apart from this, the task is completed without issue.
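As an illustration, a map type with the four settings and a checkerboard constructor could be sketched roughly as follows. The type and method names (`Tileset`, `Map::checkerboard`, `tile_at`) and the use of `u16` tile indices are assumptions made for this sketch, not the project's actual API.

```rust
/// The four Warcraft II settings mentioned in the task.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tileset {
    Summer,
    Winter,
    Wasteland,
    Swamp,
}

/// A rectangular grid of tile indices stored in row-major order.
pub struct Map {
    pub tileset: Tileset,
    pub width: usize,
    pub height: usize,
    tiles: Vec<u16>,
}

impl Map {
    /// Build a dummy map that alternates between tile 0 and tile 1.
    pub fn checkerboard(tileset: Tileset, width: usize, height: usize) -> Self {
        let mut tiles = Vec::with_capacity(width * height);
        for y in 0..height {
            for x in 0..width {
                tiles.push(((x + y) % 2) as u16);
            }
        }
        Map { tileset, width, height, tiles }
    }

    /// Return the tile index at (x, y), or None if the position is outside the map.
    pub fn tile_at(&self, x: usize, y: usize) -> Option<u16> {
        if x < self.width && y < self.height {
            Some(self.tiles[y * self.width + x])
        } else {
            None
        }
    }
}
```

A checkerboard is a convenient debug pattern because any off-by-one error in the row-major indexing shows up immediately as stripes instead of squares.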

Before I commit, I add a fifth step to my own development loop:

Are there any MEANINGFUL tests I could add to the code at this point? Don't make anything up just for the sake of it, but if there are things that make sense to test and which are amenable to testing, list them.

I get a list of reasonable tests and approve them for implementation.

3:40 PM – I ask the agent to implement initial rendering functionality for our debug map. This works well with the exception of more action bias. The agent added a bunch of tests without asking me. I once again update AGENTS.md to explicitly forbid this.

Rendering of the debug map

4:10 PM – I run into more bias-for-action shenanigans. When I ask the agent to plan the next task, instead of picking from TASKS.md it chooses to pick up one of its own long-term priorities, which were part of the generated CURRENT_STATE.md file as a forward-looking option. Luckily, the agent doesn’t execute since I’ve now explicitly forbidden this. I close the session and update AGENTS.md with yet more explicit instructions to only pick the next task from TASKS.md.

The next task I actually wanted done was to add a minimap. With the new guardrails in place, the task is completed without issue.

Minimap in place

4:34 PM – I ask the agent to queue up all the input events that happened since the last present() call and pass them on to the Game step function. I also ask it to implement functionality to scroll the map with the arrow keys as well as being able to click on the minimap. All of this completes without issues. It’s 4:58 PM and I’m taking a break to make dinner for my family. So far I have spent approximately four hours on the project.
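The input handling described above can be sketched along these lines. Everything here is hypothetical: the `InputEvent` enum stands in for whatever the project translates SDL events into, the camera is measured in whole tiles, and a minimap click is assumed to have already been converted to a map tile coordinate.

```rust
/// Hypothetical, SDL-independent input events collected between two present() calls.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InputEvent {
    ArrowLeft,
    ArrowRight,
    ArrowUp,
    ArrowDown,
    /// A minimap click, already converted to the map tile under the cursor.
    MinimapClick { tile_x: i32, tile_y: i32 },
}

/// Take every event queued since the last frame, leaving the queue empty.
pub fn drain_frame_events(queue: &mut Vec<InputEvent>) -> Vec<InputEvent> {
    std::mem::take(queue)
}

/// The visible part of the map, measured in tiles.
pub struct Camera {
    pub x: i32,
    pub y: i32,
    pub view_w: i32,
    pub view_h: i32,
    pub map_w: i32,
    pub map_h: i32,
}

impl Camera {
    /// Keep the view inside the map, even if the map is smaller than the view.
    fn clamp_to_map(&mut self) {
        self.x = self.x.clamp(0, (self.map_w - self.view_w).max(0));
        self.y = self.y.clamp(0, (self.map_h - self.view_h).max(0));
    }

    /// Apply one frame's worth of events in the order they arrived.
    pub fn apply(&mut self, events: &[InputEvent]) {
        for event in events {
            match *event {
                InputEvent::ArrowLeft => self.x -= 1,
                InputEvent::ArrowRight => self.x += 1,
                InputEvent::ArrowUp => self.y -= 1,
                InputEvent::ArrowDown => self.y += 1,
                InputEvent::MinimapClick { tile_x, tile_y } => {
                    // Center the view on the clicked tile.
                    self.x = tile_x - self.view_w / 2;
                    self.y = tile_y - self.view_h / 2;
                }
            }
            self.clamp_to_map();
        }
    }
}
```

Batching events per frame keeps the step function deterministic: given the same starting state and the same event list, it always ends up in the same place, which is what makes this kind of logic easy to test.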

Scrolling the map to a different location

6:27 PM – I will try to do one more thing before I call it a day. I will see if I can get the agent to use the information I asked ChatGPT to collect about the PUD map format to create a parser. I was originally planning on trying to find a PUD file online, but instead I decide to buy (again!) a copy of Warcraft II (the original Battle.net edition) and get the file from there. I select the classic map “Death in the Middle” to work with. Here’s a screenshot from Battle.net.

Death in the Middle © 2019 Blizzard Entertainment. All rights reserved

I ask the agent to implement the parser logic and replace my dummy map with one loaded from the PUD file on disk. This task is completed without issue. Before I test the project, I ask the agent to add some tests to check that the map loading is working. The agent dutifully does this, but I now run into reward hacking for the first time. The agent noticed that the PUD file actually failed to load because some of the tile indices used were outside the range of the tileset I had provided. In an effort to be helpful and make sure the map loading test succeeded, it replaced the PUD file on disk with a synthetic one that it knew didn’t have any errors. To the agent’s credit, it didn’t try to hoodwink me about what it was doing but told me it was hacking the test to make it pass.
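For readers unfamiliar with the format: community documentation generally describes a PUD file as a sequence of sections, each consisting of a 4-byte ASCII tag, a little-endian 32-bit length, and the payload. A minimal section reader under that assumption might look like the sketch below. The type and function names are hypothetical, and the layout should be checked against your own format description before relying on it.

```rust
/// One raw section: a 4-byte tag plus its payload (assumed PUD layout).
#[derive(Debug, PartialEq)]
pub struct Section<'a> {
    pub tag: [u8; 4],
    pub data: &'a [u8],
}

#[derive(Debug, PartialEq)]
pub enum PudError {
    /// Fewer than 8 bytes were left where a section header was expected.
    TruncatedHeader,
    /// The header announced more payload bytes than the file contains.
    TruncatedData,
}

/// Split a byte buffer into sections without interpreting their contents.
pub fn read_sections(bytes: &[u8]) -> Result<Vec<Section<'_>>, PudError> {
    let mut sections = Vec::new();
    let mut pos = 0usize;
    while pos < bytes.len() {
        let header = bytes.get(pos..pos + 8).ok_or(PudError::TruncatedHeader)?;
        let tag = [header[0], header[1], header[2], header[3]];
        let len = u32::from_le_bytes([header[4], header[5], header[6], header[7]]) as usize;
        pos += 8;
        let end = pos.checked_add(len).ok_or(PudError::TruncatedData)?;
        let data = bytes.get(pos..end).ok_or(PudError::TruncatedData)?;
        sections.push(Section { tag, data });
        pos = end;
    }
    Ok(sections)
}

/// Decode a payload of little-endian u16 values, such as a grid of tile indices.
/// A trailing odd byte, if any, is ignored.
pub fn read_u16_le(data: &[u8]) -> Vec<u16> {
    data.chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}
```

Returning an error for truncated input, rather than silently accepting it, matters here: a loader that fails loudly is what exposed the out-of-range tile indices in the first place.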

I want to see what the map actually looks like so I ask the agent to temporarily remove the error checking for invalid tile indices, set them to zero instead, and add a comment about it in the relevant documents. The task is completed without issue and the map is now loaded.
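A temporary workaround of this kind could be as small as the following sketch. The function name and the `tileset_len` parameter are hypothetical; returning the number of replaced tiles is an addition of this sketch so the workaround stays visible rather than silent.

```rust
/// TEMPORARY: replace every tile index outside the tileset with 0.
/// Returns how many indices were replaced so the caller can log it.
pub fn zero_invalid_tiles(tiles: &mut [u16], tileset_len: u16) -> usize {
    let mut replaced = 0;
    for tile in tiles.iter_mut() {
        if *tile >= tileset_len {
            *tile = 0;
            replaced += 1;
        }
    }
    replaced
}
```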

Death in the Middle Borked

Looking at the rendered image, it is clear that the missing tile indices are all located in the border regions between different types of tiles (wood, grass, dirt, etc.). After a bit more research, it turns out that, to save memory, Warcraft II doesn’t store 32 x 32 pixel tiles but internally uses 8 x 8 pixel tiles. The indices in the PUD file refer to an indirection table, hard-coded in the original client, in which each record contains the 4 x 4 indices of the smaller tiles that make up one full tile. It is 7:57 PM and implementing this is non-trivial, so I decide to call it a day. So far I’ve spent approximately five hours on the project (I also left to pick up my daughter at the gym during the last session).
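To make the indirection concrete, here is a rough sketch of composing one 32 x 32 tile from a record of 4 x 4 small-tile indices. The row-major record layout, the single-byte pixels, and all names are assumptions for illustration; the real table may carry extra information per entry (flip flags, for example) that this sketch ignores.

```rust
pub const MINI: usize = 8; // side of a small tile in pixels
pub const MINIS_PER_SIDE: usize = 4; // small tiles per side of a full tile
pub const TILE: usize = MINI * MINIS_PER_SIDE; // 32 pixels

/// One indirection record: 16 small-tile indices, assumed row-major.
pub type TileRecord = [u16; MINIS_PER_SIDE * MINIS_PER_SIDE];

/// One 8 x 8 small tile with one byte per pixel (a simplification).
pub type MiniTile = [u8; MINI * MINI];

/// Assemble the 32 x 32 pixels of a full tile, or None if the record
/// refers to a small tile that does not exist.
pub fn compose_tile(record: &TileRecord, minis: &[MiniTile]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; TILE * TILE];
    for (slot, &mini_index) in record.iter().enumerate() {
        let mini = minis.get(mini_index as usize)?;
        let cell_x = slot % MINIS_PER_SIDE;
        let cell_y = slot / MINIS_PER_SIDE;
        // Copy the small tile row by row into its cell of the full tile.
        for row in 0..MINI {
            let dst = (cell_y * MINI + row) * TILE + cell_x * MINI;
            out[dst..dst + MINI].copy_from_slice(&mini[row * MINI..(row + 1) * MINI]);
        }
    }
    Some(out)
}
```

With this scheme a border tile between, say, grass and dirt does not need its own 32 x 32 image; it can reuse small tiles that already exist and only needs a new 16-entry record.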

Conclusion

The first thing I want to acknowledge is that the result itself is not particularly impressive. If I had written the code myself in my weapon of choice, C++, I would probably have gotten further in the five hours I spent on this (we should deduct some time though, since I was also taking notes for this blog entry). The first point I want to make, though, is that a lot of the time was spent waiting for the agent to do its job. Even if nothing else changes, this will only get faster. The second point I want to make is that the quality of the code is good and as far as I can tell the agent didn’t create a single bug. I am not a Rust expert, but inspecting the code, it looks solid. The third and final point I want to make is that if I had given the agent the full list of tasks in advance, given it free rein to do whatever it wanted inside a sandbox, and set up an orchestration layer which initiated each task in sequence, I could have walked away and come back later in the day to a completed project.

I have also reflected a bit on why creating the tasks felt relatively challenging. I have come to the conclusion that when it comes to coding, I am used to thinking in whatever language it is I am working in (C++, Python, Scheme, etc.). At this point in the journey (i.e., complete n00b), I feel a bit lost in translation. Eventually, with practice, I hope this will come more naturally.

To pick up the loaded rifle that was “My first version of AGENTS.md”, this is what the final version of the session protocol looks like:

  • Read: AGENTS.md, CURRENT_STATE.md, DECISIONS.md, TASKS.md, and relevant files in Docs/.
  • Restate goal + touched files.
  • Planning precedence when asked to “plan the next task”:
    • Explicit user instruction.
    • First unchecked item under ## Next (MVP vertical slice) in TASKS.md.
  • CURRENT_STATE.md is context only and must not override task selection from TASKS.md.
  • Before presenting any plan, include:
    • Selected task: <exact unchecked TASKS.md line>.
    • If the selected task is not from TASKS.md, stop and ask for confirmation.
  • DO NOT START EXECUTING TASKS BEFORE I EXPLICITLY TELL YOU TO
  • DO NOT ADD TESTS BEFORE I EXPLICITLY TELL YOU TO
  • Implement in small steps; keep cargo run working.
  • Run cargo fmt + cargo clippy ... before finishing.
  • Update long-term state files as required.
  • Ignore any git status outside the project
  • NEVER DISABLE TESTS OR TRICK THE TESTS TO MAKE THEM COMPLETE

Coding agents waiting for a task to be assigned

As you can see, there are a lot of explicit instructions to avoid action bias and reward hacking. I might continue with this project at some other point if I feel like it, but it’s more likely that I will apply what I’ve learned to one of my own projects that I’m more interested in. Finally, this is the list of tasks the agent completed.
