How to win silver in the CodeCup

May 18, 2023

Every year CodeCup organizes a contest for a game like chess, go, or Othello. But instead of playing the games yourself, you create a bot which plays the game! And instead of a well-known game like chess, an unfamiliar game is chosen.

This year we had to create a bot for Entropy, where Order and Chaos compete, each with different move possibilities. Last year (Spaghetti) I finished at place 9 of 56, quite a good result. Yet for Entropy I exceeded my expectations by reaching place 2 of 75. In this article, I will show how I managed to do this. Besides being a record of my efforts, I hope this article can guide you in future contests!


The game Entropy

The eternal battle between Order and Chaos

In Entropy, the players take turns trying to achieve their competing goals. Order tries to create patterns and Chaos tries to prevent those patterns. The board starts empty in every match, but Chaos grabs a random chip from the bag every turn and places it on an empty field.


Example gameplay
Source: CodeCup

After Chaos places a chip, it is Order's turn. Order can choose to move one chip orthogonally or to skip their turn. For example, in the picture above Order can move the blue chip on Dd only to Dc, Cd, Bd or Ad. After that, Chaos takes a turn again. Once the board is full, the match ends and the points are counted.


Point calculation

Order scores points for the length of every orthogonal palindrome (a pattern of colors the same in both directions). All palindromes are counted, even those inside or overlapping with other palindromes. So in the picture above we have one palindrome of length 2 (white‑white), three palindromes of length 3 (twice white‑red‑white, once red‑white‑red) and a palindrome of length 5 (white‑red‑white‑red‑white). That makes a total of 2+3+3+3+5=16 points for this line. Order aims to maximize the point total, Chaos to minimize it.

See the full rules or an example game for more details.

Deep dive: how to efficiently calculate points

To calculate the points, you just have to check all palindromes. Easier said than done, right? But this first insight helps: all palindromes are by definition symmetric around their center position. This center position is the middle of a field for odd-length palindromes, and between two fields for even-length palindromes. So we can just try all center positions and see how long each palindrome is:

points = 0
//Even-length palindromes
//The palindrome's center is just after the field referred to
//E.g. center = 2 implies that the center is between field 2 and 3
for center in [0, 1, 2, 3, 4, 5] {
    for length in [2, 4, 6] {
        left = center + 1 - length / 2
        right = center + length / 2
        //Left and right must be a valid field
        if left < 0 || right > 6 {
            break
        }
        //Left and right must have a filled field
        if colors[left] == empty || colors[right] == empty {
            break
        }
        //The colors of left and right must match
        if colors[left] != colors[right] {
            break
        }
        points += length
    }
}

//Odd-length palindromes
//The palindrome's center is the middle of the field referred to
for center in [1, 2, 3, 4, 5] {
    //The center field must hold a chip, but any color works, so we only check that it is filled
    if colors[center] == empty {
        continue
    }
    for length in [3, 5, 7] {
        left = center - (length - 1) / 2
        right = center + (length - 1) / 2
        //Left and right must be a valid field
        if left < 0 || right > 6 {
            break
        }
        //Left and right must have a filled field
        if colors[left] == empty || colors[right] == empty {
            break
        }
        //The colors of left and right must match
        if colors[left] != colors[right] {
            break
        }
        points += length
    }
}

That is a good implementation already, but we can do even better by using SIMD instructions. These allow us to compute all palindromes in just a few steps, without loops. To do this, we need to change our code:

  • to always execute the same thing, without any conditional branches; and
  • to only load array data according to the index of the main loop.
points = 0
//Even-length palindromes
for center in [0, 1, 2, 3, 4, 5] {
    still_running = true
    for half_length in [0, 1, 2] {
        //Move the fields half_length to either side
        // (We interpret the color array as a 64-bit number)
        //Corresponds with indexing colors[left] and colors[right]
        left = colors << (half_length * 8)
        right = colors >> (half_length * 8 + 8)
        //We replace the ifs and breaks with a boolean
        still_running = still_running &&
            center >= half_length && center + half_length + 1 <= 6 &&
            left[center] != empty && right[center] != empty &&
            left[center] == right[center]
        //If the palindrome is still running, add the points
        points += (still_running * 0xff) & (half_length * 2 + 2)
    }
}

//And similar for odd-length palindromes

We now remove the inner loop by duplicating the code three times, and that makes the code SIMD-ready. The compiler can convert this to SIMD instructions in theory, but does not do this well enough here. By making good use of the Intel Intrinsics Guide, we manually make the code run for all center positions at once. Writing this type of code is hard, so it is useful to keep a simple version for (automatically) comparing results. From there, there are only a few things left to improve. For example, can you think of a way to combine the even and odd palindrome calculation with SIMD? Hint: make the even/odd loops as similar as possible, and see if you can handle the differences outside of the loop.

Now your code can calculate palindromes millions of times per second. A very smart trick I did not think of, but other contestants did: there are not that many possible combinations of chips on each line (7⁷ ≈ 800,000 combinations, or 8⁷ ≈ 2,000,000 combinations if you include empty fields). Therefore, you can just compute the points for all possible combinations in less than a second. Subsequently, you only look up the points in a table. In addition, this is easy to do for other properties you calculate on a line as well!
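As an illustration of that trick (my own sketch, not any contestant's actual code): encode a line of 7 fields as a base-8 number, score every possible line once with the straightforward loop, and store the results in a table.

// Illustrative sketch of the lookup-table trick (not any contestant's actual code).
// A line of 7 fields is encoded as a base-8 number: 0 = empty, 1..7 = chip colors.
#include <cstdint>
#include <vector>

// Score one line with the straightforward palindrome scan from above.
int score_line(const int colors[7]) {
    int points = 0;
    for (int center = 0; center < 7; ++center) {
        // Even-length palindromes, centered between `center` and `center + 1`.
        for (int half = 1; half <= 3; ++half) {
            int left = center + 1 - half, right = center + half;
            if (left < 0 || right > 6) break;
            if (colors[left] == 0 || colors[right] == 0) break;
            if (colors[left] != colors[right]) break;
            points += 2 * half;
        }
        // Odd-length palindromes, centered on `center` (which must hold a chip).
        if (colors[center] == 0) continue;
        for (int half = 1; half <= 3; ++half) {
            int left = center - half, right = center + half;
            if (left < 0 || right > 6) break;
            if (colors[left] == 0 || colors[right] == 0) break;
            if (colors[left] != colors[right]) break;
            points += 2 * half + 1;
        }
    }
    return points;
}

// Build the table once at startup: index = sum over fields of colors[i] * 8^i.
std::vector<uint8_t> build_line_table() {
    std::vector<uint8_t> table(1 << 21);               // 8^7 = 2^21 ≈ 2,000,000 entries
    for (int index = 0; index < (1 << 21); ++index) {
        int colors[7];
        for (int i = 0, rest = index; i < 7; ++i, rest >>= 3)
            colors[i] = rest & 7;
        table[index] = static_cast<uint8_t>(score_line(colors));
    }
    return table;
}

Scoring a full board then costs 14 table lookups (7 rows and 7 columns), and the same table layout works for other per-line inputs as well.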

For making a bot, most aspects of the game are “standard”: it is a two-player perfect-information zero-sum game. The branching factor and game length are manageable too. However, there are several interesting aspects to consider when making a bot. First, the game is asymmetric: Chaos and Order have completely different moves. That is not too hard to handle, except that it can lead to duplicated code if you are not careful (I was not that careful, unfortunately). Secondly, the game has an element of chance through the random chips drawn from the bag. This has a large impact, since the commonly used minimax and Monte Carlo tree search techniques become much less efficient when handling this correctly.

How to handle chance?

Expectimax and Ballard's star-minimax

One of the starting points for writing a bot is the minimax algorithm. This looks at all your available moves to choose the best one. For each move, it looks at all the following moves by the opponent and chooses the best one for them (worst for you). At some point, you stop, because you do not have enough computation time to evaluate the complete game. So you stop after looking ahead some number of moves, and use an evaluation function to say how good each game situation is (e.g. how many points have been scored so far).

Minimax. Source: Introduction to AI, University of Helsinki, unmodified, CC BY-NC-SA 4.0

Expectimax is the extension of minimax to include “chance nodes”. Whenever we have to draw a random chip from the bag, we just try all possible colors. There is no “best chip color”, of course, but we combine the results by adding them — weighted by the probability of drawing each color. That single change is all we need!
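A minimal sketch of such a chance node (all names are placeholders standing in for whatever your bot already has, not my actual code):

// Expectimax chance node: the value is the expectation over the chip colors
// still in the bag, weighted by their draw probability.
struct State;                                                   // game position (assumed)
int remaining_chips(const State& s, int color);                 // chips of this color left in the bag (assumed)
double chaos_node_value(const State& s, int drawn_color, int depth);  // Chaos places the drawn chip (assumed)

double chance_node_value(const State& state, int depth) {
    int total = 0;
    for (int color = 1; color <= 7; ++color)
        total += remaining_chips(state, color);

    double expected = 0.0;
    for (int color = 1; color <= 7; ++color) {
        int count = remaining_chips(state, color);
        if (count == 0) continue;                               // this color is used up
        double probability = double(count) / total;
        expected += probability * chaos_node_value(state, color, depth - 1);
    }
    return expected;
}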

Unfortunately, if we stop here, we lose all the advantages of common minimax extensions, particularly alpha-beta pruning. I consider alpha-beta pruning an essential improvement, since it allows you to search more moves ahead. The key to alpha-beta pruning is that every time we ask for the value of a move, we also supply α — the best value already achieved by Order — and β — the best value already achieved by Chaos. If we encounter a move worse than α or β (depending on whose turn it is), we know that the moves leading to this situation are suboptimal. We stop evaluating this situation and save some time.

Ballard combined these two minimax extensions into *-minimax (also see this slide deck). *-minimax computes a lower bound and upper bound for chance nodes, and updates these bounds as the results for each possible chip color come in. If the upper bound goes below α, or the lower bound above β, we can stop evaluating the node. This works, but is not as good as alpha-beta pruning. That is because a bad move is obvious immediately when using alpha-beta pruning (i.e. without chance nodes). But with chance nodes, a bad move can be compensated by a very good move. Because we do not know this in advance, we mostly have to muddle on.

Mostly, because there are a few tricks to apply. Ballard himself proposed “Star2”, where you look at each color just enough to get an upper bound (upper bounds on chip colors translate into a higher α when evaluating other colors). That is a decent point to stop when making your own bot. There are a few other tricks (StarETC and ChanceProbCut) I applied, but these only have a minor effect.
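To make the bound bookkeeping concrete, here is a hedged sketch of the cutoff logic in a chance node. V_MIN and V_MAX are bounds that hold for any position (e.g. zero and the maximum reachable score); the names and the omitted per-child window tuning are my own simplifications, not Ballard's exact formulation, and Star2's probing step is only indicated by a comment.

// Star-minimax-style chance node (simplified sketch): keep a running lower and
// upper bound on the expected value and cut off once they cross alpha or beta.
// draw_probability() and min_node_value() are placeholders for the bot's own code.
struct State;
double draw_probability(const State& s, int color);             // assumed helper
double min_node_value(const State& s, int drawn_color, int depth,
                      double alpha, double beta);               // Chaos node (assumed)

const double V_MIN = 0.0;     // assumed: no position is worth less than this
const double V_MAX = 300.0;   // assumed: no position is worth more than this

double star_chance_node(const State& state, int depth, double alpha, double beta) {
    double expected = 0.0;            // exact contribution of the colors searched so far
    double probability_left = 1.0;    // probability mass of the colors not yet searched
    for (int color = 1; color <= 7; ++color) {
        double p = draw_probability(state, color);
        if (p == 0.0) continue;
        // Star2's "probing" would first compute a cheap upper bound for this color
        // by trying only one Chaos move; here we go straight to the full search.
        double value = min_node_value(state, color, depth - 1, alpha, beta);
        expected += p * value;
        probability_left -= p;
        // Bounds on the final expectation, assuming the unsearched colors are
        // worth V_MIN (lower bound) or V_MAX (upper bound).
        double lower = expected + probability_left * V_MIN;
        double upper = expected + probability_left * V_MAX;
        if (upper <= alpha) return upper;   // cannot become relevant for Order anymore
        if (lower >= beta) return lower;    // cannot become relevant for Chaos anymore
    }
    return expected;
}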

Finally, you can search moves in a smarter order: if you search good moves first, you can increase α/decrease β and thus save more time. In particular, I first try the move stored in the transposition table (if found), then the killer heuristic and then all moves sorted with the evaluation function.
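A sketch of that ordering with placeholder names (shown for an Order node; for Chaos you would sort the static evaluation the other way around):

#include <algorithm>
#include <vector>

struct State;
struct Move { int from = -1, to = -1; };                        // placeholder move representation
bool operator==(const Move& a, const Move& b) { return a.from == b.from && a.to == b.to; }
double quick_eval(const State& s, const Move& m);               // cheap static evaluation (assumed)

// Transposition-table move first, then the killer move, then the rest sorted by
// a quick evaluation, so that cutoffs happen as early as possible.
void order_moves(const State& state, std::vector<Move>& moves,
                 const Move* tt_move, const Move* killer) {
    auto rank = [&](const Move& m) {
        if (tt_move && m == *tt_move) return 2;   // best move from a previous search
        if (killer && m == *killer) return 1;     // killer heuristic
        return 0;
    };
    std::stable_sort(moves.begin(), moves.end(), [&](const Move& a, const Move& b) {
        if (rank(a) != rank(b)) return rank(a) > rank(b);
        return quick_eval(state, a) > quick_eval(state, b);
    });
}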

Deep dive: more thoughts on lower and upper bounds

Competitor Tomek Czajka posted a good idea on the CodeCup forums, which I have not seen described anywhere before. To appreciate this idea, we must first look at alpha-beta pruning from another angle. Instead of interpreting α and β as the best known moves, we can say they are query bounds. As usual, we ask the alpha-beta algorithm for the value v(s) of a situation s. However, we only care about the exact value if α < v(s) < β. Otherwise, we just require the algorithm to output a value x such that v(s) ≤ x ≤ α or β ≤ x ≤ v(s). So in exchange for spending less time when v(s) is outside our query bounds, we get a less precise answer and fewer guarantees (only a lower or upper bound).

Visual representation of the alpha-beta output in situations s₁, s₂ and s₃: with α < v(s₁) < β the output f(s₁, α, β) equals v(s₁) exactly, while the output for s₂ is only an upper bound of v(s₂) and the output for s₃ is only a lower bound of v(s₃).

With this view, we can also run the alpha-beta algorithm (run a query) with different α and β than our best-known moves. Generally, running a query with a small window (= β - α) is faster. For example with aspiration windows, we make a guess for the value v(s) and just try α and β close to that value. If our guess was correct, we save a lot of time. If not, we at least get a bound on the real value. Going to an extreme, we have null windows where β = α + ε. This just lets you know if the value v(s) is below or above α. It is used in principal variation search (as guess-and-validate) and MTD(f) (as binary search).
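As an illustration (placeholder names again), an aspiration-window loop around a plain alpha-beta search could look like this:

struct State;
double alphabeta(const State& s, int depth, double alpha, double beta);  // assumed search entry point

// Aspiration windows: search a narrow window around a guess of the value and
// only widen it when the search fails outside the window.
double aspiration_search(const State& state, int depth, double guess) {
    double window = 2.0;                                   // assumed initial half-width, in points
    while (true) {
        double alpha = guess - window, beta = guess + window;
        double value = alphabeta(state, depth, alpha, beta);
        if (value > alpha && value < beta) return value;   // exact value, we are done
        guess = value;                                     // we only got a bound; re-center...
        window *= 2.0;                                     // ...and widen the window
    }
}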

Coming back to *-minimax, Tomek Czajka had the insight that running the algorithm with α = β = -∞ will always give a lower bound on the value, and α = β = ∞ will always give an upper bound. Therefore, when you run these queries for all possible chip colors, you have both a lower and upper bound for all colors. This should result in a much smaller window when determining the exact value of each possible color.
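My reading of the idea, as a hedged sketch (not Tomek's code): query each color twice with collapsed windows to get cheap bounds, and only then search the exact value inside the much smaller interval.

struct State;
double min_node_value(const State& s, int drawn_color, int depth,
                      double alpha, double beta);               // Chaos node (assumed)

struct Bounds { double low, high; };

// With alpha = beta = -infinity, every Order node cuts off after its first move,
// so the result is a lower bound; with alpha = beta = +infinity, every Chaos node
// cuts off after its first move, so the result is an upper bound. Both queries
// are much cheaper than an exact search.
Bounds cheap_bounds(const State& state, int color, int depth) {
    const double INF = 1e9;
    double low  = min_node_value(state, color, depth, -INF, -INF);
    double high = min_node_value(state, color, depth, +INF, +INF);
    return {low, high};
}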

Let us compare this idea to Ballard's Star2. In Star2 we do “probing”: we try only one move of Chaos with a drawn color chip. Because Chaos could have chosen other moves, and Chaos wants to minimize the points, this results in an upper bound on the value. Tomek's approach will do the same with α = β = ∞, but additionally tries only one Chaos move deeper in the search as well. So it can potentially be faster, with the risk of a weaker bound. The α = β = -∞ query has no equivalent in Star2, but leads to choosing only one move for Order, and that way produces a lower bound on the value.

I think this idea is genius, and potentially allows a large speedup. Thinking a bit more myself, we can generalize the idea of searching and lower/upper bounds. Currently, *-minimax searches through the game tree depth-first. Star2 and Tomek's idea change this a bit, but still mostly work depth-first. Instead, you can also look at the game tree as a whole. Maybe a chance-node variant of MT-SSS* or Proof-number search is possible?

Deep dive: on playing against weaker opponents

In the competition format of this year's CodeCup, you play a game against every opponent. To determine the final ranking, you sum up all your points scored as Order, and subtract all points scored by the opponent when playing as Chaos. As a result, it is important to not just win, but also to win by a large point margin. In particular, this means we can and must play better against some of the weaker opponents.

What do I mean by that? Well, by using the search algorithm above, we already take advantage of the mistakes the opponent makes. However, we assume that the opponent plays strongly in response. As a result, we do not take large risks, since we could lose more than we gain. Against weaker opponents, those risks do pay off, making them essential to reach a high ranking in the competition. I could see in the test competitions that I already did very well as Chaos without this stratagem, only leaving points on the table when playing as Order. That is why I only applied the following strategy for Order.

So let us detect weaker opponents. At first, I found this hard to do: the number of points scored so far was an unreliable indicator throughout most of a match. The same holds for the outcome of the search algorithm. It seemed to me that chance played too large a role. But then I figured out a way to separate chance from good decisions: first I calculate the top move and its score with my search algorithm, as if I am playing as the opponent (i.e. as Chaos), using the same color chip the opponent drew from the bag. Then I compare that with the score corresponding to the opponent's actual move. The difference between these scores is due to skill, not luck. If you are familiar with multi-armed bandits: I am trying to estimate the opponent's regret. Summing these differences up during the match results in a very reliable indicator of opponent strength. When the opponent's strength is low enough, I switch to the optimistic algorithm below. If that turns out to be a mistake, my program switches back automatically later on.

[Figure: after Order's move Fb → Fd, Chaos draws a random chip; the best reply Ee evaluates to 77.2 points, while the opponent's actual move Ff evaluates to 82.0 points, a difference of 4.8 points.]

Evaluating the strength of the opponent is difficult because it typically includes the effect of random colors.
After I (Order) move, the opponent (Chaos) draws a random chip from the bag.
Comparing the opponent's move with the best move using the same chip color is a reliable strength indicator.
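A hedged sketch of this bookkeeping (placeholder names, not my exact code):

#include <utility>

struct State;
struct Move { int from = -1, to = -1; };                        // placeholder move representation
std::pair<Move, double> best_chaos_reply(const State& s, int drawn_color);       // my search, playing as Chaos (assumed)
double score_after_chaos_move(const State& s, int drawn_color, const Move& m);   // search value of a given reply (assumed)

double accumulated_regret = 0.0;   // running estimate of how many points the opponent has given away

// Called after every move the opponent (Chaos) makes, using the chip they actually drew.
void update_opponent_strength(const State& before_move, int drawn_color, const Move& opponent_move) {
    double best = best_chaos_reply(before_move, drawn_color).second;
    double played = score_after_chaos_move(before_move, drawn_color, opponent_move);
    // Chaos minimizes, so anything above the best reachable score is skill lost, not bad luck.
    accumulated_regret += played - best;
}

bool opponent_is_weak() {
    const double THRESHOLD = 15.0;   // assumed tuning constant
    return accumulated_regret > THRESHOLD;
}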

Now we have to decide how to take more risks and play more optimistically. The ideal approach is to adapt my normal search algorithm to assume the opponent plays randomly. However, this is too slow, as good plans often take three or more Order moves — unlike the typical two moves needed against stronger opponents. So I settled on the next best thing, even more optimistic: I ignore the chips Chaos could potentially place. I only try moving chips — only Order moves — and hope the opponent does not block these moves by placing a chip on the wrong spot. Except at the end of the game, this works out very well! Moreover, it even works decently against the weakest opponents that do not play randomly. Nevertheless, to be safe, I added a small preference to play point-scoring moves earlier rather than later.
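A minimal sketch of this optimistic search (illustrative names only):

#include <algorithm>
#include <vector>

struct State { /* board, bag, chips, score ... omitted in this sketch */ };
struct Move { int from = -1, to = -1; };                        // placeholder move representation
double evaluate(const State& s);                                // evaluation function (assumed)
std::vector<Move> order_chip_moves(const State& s);             // all moves of Order's chips (assumed)
State apply_order_move(const State& s, const Move& m);          // assumed

// Optimistic search: only Order's chip moves are considered and Chaos' chip
// placements are ignored entirely, as if the opponent never interferes.
double optimistic_value(const State& state, int depth) {
    double best = evaluate(state);                              // skipping the turn is always allowed
    if (depth == 0) return best;
    for (const Move& move : order_chip_moves(state))
        best = std::max(best, optimistic_value(apply_order_move(state, move), depth - 1));
    return best;
}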

This technique regularly scores 20-30 points more than the normal search algorithm, thus contributing to my high ranking. Ideally, my normal algorithm alone would always play well enough, but that was just not sufficient for Entropy and this competition format.

Estimating the future with reinforcement learning

Q-learning on steroids

We use our search algorithm (*-minimax) to decide what is the best move, and to estimate the value of the current game state. To do this well, we need a good evaluation function. The evaluation function gives a value to each game situation: assuming optimal play, what is the expected final game score? We can completely create this function by hand, but doing so is hard. Reinforcement learning can help us here!

Still, there is much you have to decide yourself. First, you need to decide which inputs to use. This is a large topic, so see the next section for more on this. Second, you need to decide the structure of your function: e.g. a linear function or a neural network. More complex functions can be more accurate, but I found that a linear function works very well in practice (neural networks did not do better when I tried to apply them). Linear functions also have many practical advantages: they are fast, they are simple to validate, you do not necessarily need your own code to train them (just use linear regression in R) and their predictions are easy to understand.
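For concreteness, this is all a linear evaluation function amounts to (a minimal sketch; the actual inputs follow in a later section):

#include <numeric>
#include <vector>

// Linear evaluation: predicted final score = w0*f0 + w1*f1 + ... + wn*fn.
// Including a constant 1.0 feature gives the function an intercept term.
double evaluate(const std::vector<double>& features, const std::vector<double>& weights) {
    return std::inner_product(features.begin(), features.end(), weights.begin(), 0.0);
}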

Then we have to decide what the output y for the evaluation function should be, so we can actually train the function. Ideally, it is the expected number of points assuming optimal play. We do not know that value, but there are multiple alternatives from reinforcement learning theory. I chose Q-learning (another good choice is TD-learning). In Q-learning, you run your search algorithm to compute a y. In essence, your search function handles the short-term value — which move to choose and the direct consequences — before calling the evaluation function for the long-term value. So you use the evaluation function, even while training it! We call this process, of using your previous evaluation function to build a new one, “bootstrapping”.

Q-learning. Source: Reinforcement learning, Richard S. Sutton and Andrew G. Barto, unmodified, CC BY-NC-ND 2.0
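In code, generating one training example could look roughly like this (a sketch with placeholder names; the search evaluates its leaves with the current weights, which is the bootstrapping step):

#include <vector>

struct State { /* board, bag, score ... omitted in this sketch */ };
std::vector<double> extract_features(const State& s);                 // the inputs from the next section (assumed)
double search_value(const State& s, int depth,
                    const std::vector<double>& weights);              // search using the current evaluation (assumed)

struct TrainingExample { std::vector<double> features; double target; };

// Multi-step Q-learning target: let the search handle the next few moves and
// bootstrap from the current evaluation function at its leaves; the search
// result becomes the regression target for the current position.
TrainingExample make_example(const State& state, const std::vector<double>& current_weights) {
    return { extract_features(state), search_value(state, /*depth=*/4, current_weights) };
}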

I implemented a few improvements on top of default Q-learning, which improve the resulting evaluation function or speed up the training. First, I used multi-step Q-learning: instead of only looking one move ahead as with normal Q-learning, I let the search algorithm look two or more moves ahead. This does not always work in reinforcement learning, but works well in games because players act adversarially. The second modification is on which game states I train the evaluation function. Normally, you only train on the states you actually encounter when playing. However, when using the TreeStrap algorithm, you train on many of the game states you encounter in the search algorithm as well. Instead of a single training step, you have thousands of (related) training steps for free, combining into a single really effective training step! A second benefit is that the search algorithm actually visits these states, so it is useful to predict them well.

The last thing to consider in reinforcement learning is your training curriculum: from which example matches do we want to learn? You need many matches of the game; the matches from the test competitions alone are not enough, so you will need self-play: your bot playing against itself or against other bots you have. The matches need to be varied enough to learn from, but still representative of how you will play during the competition. In games other than Entropy, this can mean you need to play suboptimally sometimes, just to get this needed variation. For Entropy this was easy: drawing random color chips from the bag causes enough variation between matches already. Only the first few turns lack this variation, so a small number of random moves was useful there.

What is important in Entropy?

The inputs for my evaluation function

As I said above, it is important to decide what the inputs of your evaluation function should be. This can simply be the state of the game, but that makes your evaluation function do all the hard work (and makes a linear function impossible). In practice, with little computation time available, it is better to calculate useful stuff based on the game state. Compared to e.g. AlphaGo, I have far less computation power available, both in the actual competition and while training. So we use domain knowledge about Entropy to punch above our weight!

It is hard to give general advice, but here are a few guidelines to strive towards:

For Entropy, I created the following inputs:


Example game state for the evaluation function.

Currently, it is Order's turn to play with 46 points scored so far. My evaluation function assigns 76.9 points to this game state, so 30.9 points can still be scored on average. The search algorithm gives the more accurate result of 75.4 points.

The corresponding best move is to move the orange chip on Fb to Fd, thereby unsurprisingly executing the largest threat of 6 points.


Potential palindromes of length 4/5/6/7: how many palindromes of length 4/5/6/7 can be created by one chip (considering all colors and all positions for this chip). High numbers mean there are many mid-term and long-term opportunities for Order to score points.

Example: one palindrome of length 4 can be created, six palindromes of length 5, none of length 6/7.


The largest threat: the largest number of points that can be scored by moving a single chip. This is the best input I have, because it conceptually allows me to look one extra move ahead nearly for free.

Example: the largest threat scores 6 points.


The sum of other threats, so excluding the largest threat.

Example: the remaining threats score 7 points.


The number of holes of size one or two: the areas of empty fields surrounded by chips or the sides of the board. Holes usually are good opportunities for Chaos to dump chips with an undesired color.

Example: there are three holes of size one, none of size two.


The number of chips on the outer two rings of fields. These fields are hard to reach, so it is useful for Order to fill them early on.

Example: there are 18 chips on the outer ring (green), 10 chips on the ring inside it (blue).


The sum of the Manhattan distances to the center field. Works similarly to the “outer rings” input, but also considers corners.

Example: the distances to the center field sum up to 117.


The color imbalance: how many chips remain of the least played color. A high value in the last few turns often indicates that Order can forcibly create some palindromes. This input is a bit weird, because the players cannot influence its value at all. Still, it helps balance lucky and unlucky color draws earlier on.

Example: the two least played colors — magenta and black — each have four chips remaining.


The 2-turn mobility: how many fields each chip can reach in two turns, summed up. A high value means that Order can move their chips to many places, making it more likely to score points.

Example: the number of fields reachable in two turns sums up to 88.


The non-contributing 2-turn mobility: the sum of chips' 2-turn mobility, but only for chips that do not currently contribute any points. Also useful, since Order does not want to lose the points they currently have.

Example: the number of fields reachable in two turns without losing points sums up to 18.
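Summarizing, the evaluation function sees something like the following feature set per position (the field names are mine and purely illustrative, not my actual code):

// One feature vector per position; each field corresponds to one input above.
struct Features {
    int potential_palindromes_4, potential_palindromes_5,
        potential_palindromes_6, potential_palindromes_7;
    int largest_threat;                      // points scored by the best single chip move
    int sum_of_other_threats;                // all remaining threats combined
    int holes_of_size_1, holes_of_size_2;
    int chips_on_outer_ring, chips_on_second_ring;
    int manhattan_distance_sum;              // summed distance of all chips to the center field
    int color_imbalance;                     // chips remaining of the least played color
    int two_turn_mobility;                   // fields reachable by all chips within two turns
    int non_contributing_two_turn_mobility;  // same, but only for chips not scoring points now
};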

The neural network that did not work

A challenge for the reader

So, I also tried using a neural network instead of a linear function. After all, neural networks are used with much success in many domains, with AlphaGo as a prime example. With the same inputs, I hoped to play a bit better too. I think I did everything right: I was using the Adam optimizer already for the linear function, I used one or two hidden layers, I tried different numbers of nodes and activation functions (ReLU and tanh), I standardized the inputs, and probably much more. However, it did not work out; it played worse than the linear function.

Still, I would like to be proved wrong, and learn how to work with neural networks in the future. If you are interested, please try it! The details on the dataset and challenge are below.

There are a few thoughts I have, but did not investigate. The first is that the training process is unstable because I have per-turn weights. The second is that maybe I just need much more data.

Challenge details: do better than my linear function!

Challenge dataset (zipped CSV's)

The dataset contains the following fields:

  • test_train: “train” or “test”, indicating if this observation belongs to the training or testing set.
  • gameid: a unique id per played game. Can be used to group the different turns of a single game.
  • state: describes the state of the game. The first character (C or O) indicates whose turn it is. “_” indicates an empty field, 1-7 indicate placed chips (each number is a different color).
  • turn: the turn number.
  • minimax (training set only): the result of the search algorithm, using my own evaluation function.
  • heuristic: the score assigned to this game state by my evaluation function.
  • outcome: the outcome of the game, as it was played.
  • points: the current amount of points.
  • The other fields correspond to the inputs from the previous section.

The goal is to predict the minimax value as well as possible on the test set, using the same inputs I used, or (if you like) also using the current number of points or the raw board state. Designing new inputs is not part of the challenge. The method should be reasonably fast, as the search algorithm will call it often.

I typically evaluate on turn 51, where my function achieves an R² of 89.3%. A regression directly on the dataset achieves an R² of 89.8%. Does your method do better? Then please contact me so we can see how well it works in actual play. If you send a method that works better than what I have (not only on R²), I learned something interesting, we had a good discussion, or you otherwise impress me, I will send you a packet of stroopwafels!

Contact me at , also if you just want to talk about this topic!

Competition result

Silver, but how well will I do next year?

Competition results
Source: CodeCup

I ended up in second place, a big achievement for me! I hoped to win, of course, but this result is better than I have gotten in any other year. To be fair, how much time you spend improving your bot is also crucial, and there is some luck involved in thinking of good ideas. Likewise, if the opponents also find good ideas and put in more effort, you will drop a few places. Nonetheless, I hope that reading this page can help you do better in the future.

Have you gotten interested in creating a bot too? Then you can join me in next year's CodeCup competition!


Feel free to contact me at

Appendix: Feedback from the CodeCup forums

Some ideas and missed opportunities

When the competition finishes, there usually is a good discussion on the forum about the strategies everyone used. Here is a grab bag of things I found interesting:

Appendix: How to deal with variation and uncertainty

Embracing statistics

When repeatedly playing the game, there will be some variation in the outcome. When estimating playing strength of your bots, and of those of the other competitors, you need to take this variation into account. So it's always useful to compute a standard deviation and 95% confidence interval, showing you whether you might have gotten (un)lucky or if the difference definitely exists.
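A small self-contained sketch of that computation (using the normal approximation):

#include <cmath>
#include <vector>

struct Interval { double mean, low, high; };

// Mean and 95% confidence interval (normal approximation) for a set of match scores.
Interval confidence_interval_95(const std::vector<double>& scores) {
    const double n = static_cast<double>(scores.size());
    double mean = 0.0;
    for (double s : scores) mean += s;
    mean /= n;
    double variance = 0.0;
    for (double s : scores) variance += (s - mean) * (s - mean);
    variance /= (n - 1.0);                               // sample variance
    double standard_error = std::sqrt(variance / n);     // standard deviation of the mean
    return { mean, mean - 1.96 * standard_error, mean + 1.96 * standard_error };
}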

Using the competition results, I have modelled the playing strength of each bot. This model is imperfect, but a decent approximation, showing that the ranking could easily have been different.

Estimated playing strength.
Crashes and timeouts are ignored, thus ranking competitors differently.
Regression model: pair of matches outcome = strength[player1] - strength[player2] + error.

This has been a useful tool leading up to the final competition, together with plots of the strength playing as Order or Chaos (not shown). For example, I estimate that my bot plays relatively strongly as Chaos, even making it the strongest Chaos bot (with large uncertainty though). Nevertheless, for the final competition we can do a bit better, since the final competition has been re-run: same bots, but new matches. With that, I calculate that I could have ended up higher than Tapani Utriainen with a 22% chance, even though we had ~200 points difference! Therefore, I recommend playing more matches in future competitions to counteract this uncertainty.

Appendix: on ideas, mistakes and making decisions

Fast experimentation helps

In general it is useful to gather ideas from the literature. Another useful resource is the Chess Programming Wiki. But once you have some ideas, how do you evaluate them?

I found that it is important to focus on quickly evaluating ideas. So instead of fully programming a fast new input based on some idea, I first make a slow, easy-to-program, mostly-correct version. Then I run this on my database of representative matches, and check how this changes the regression R² with some R code. This is just indicative, since the R² is not important by itself, but it lets me see the potential of my new input. After that, I can make a correct and fast version, and let it train my evaluation function. But even here, I run my training program with a lower time limit per turn, so I can evaluate quickly. The training program also spends some time playing matches against the old version. A bit slower, but it directly shows you how your new idea really performs. Only the final version runs for hours to get the best quality results.

Another thing that I cannot stress enough: look at your data! By making plots, you see unexpected relations, spot wrongly computed values, and find opportunities for improvement.