Pushing Into Uncharted Territory
So far, I’ve been using tiny graphs to ensure things are working as expected. But eventually we’ve got to get out of the wading pool! Here’s my first step in dipping my toes in the ocean.
We’ll increase the size of the game graph to the Cube graph, which has 8 vertices and 12 edges.
I chose this graph for the next step because:
It’s got an even number of edges, so each player has the same number of turns (6 each)
It’s small enough to be brute forced, but large enough to try strategies that might scale to larger graphs
It’s small enough to be easily human-playable
This graph has 48 automorphisms, but only 6 survive for the fixed vertex variant. Of the 3¹² = 531,441 possible terminal states, there are only 89,694 unique terminal states. This is still small enough to adjudicate all of them brute-force style. But I want to try something different here because brute force won’t work for anything much bigger than this.
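If you want to check these counts yourself, here is a minimal Python sketch (not part of the game or training code) that reproduces them using networkx and Burnside’s lemma. It assumes the two players’ fixed vertices are an antipodal pair of the cube, which is consistent with the 6 surviving automorphisms quoted above.

import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

G = nx.cubical_graph()                         # 8 vertices, 12 edges
edges = [tuple(sorted(e)) for e in G.edges()]

# All automorphisms of the cube graph
autos = list(GraphMatcher(G, G).isomorphisms_iter())
print(len(autos))                              # 48

# Automorphisms that fix both players' vertices (assumed here to be a vertex
# and its antipode, i.e. the unique vertex at graph distance 3 from it)
v1 = 0
dist = nx.single_source_shortest_path_length(G, v1)
v2 = next(v for v, d in dist.items() if d == 3)
fixed = [a for a in autos if a[v1] == v1 and a[v2] == v2]
print(len(fixed))                              # 6

# Burnside's lemma: the number of 3-colourings of the 12 edges, counted up to
# these symmetries, is the group average of 3^(number of edge orbits)
def num_edge_orbits(sigma):
    seen, orbits = set(), 0
    for e in edges:
        if e in seen:
            continue
        orbits += 1
        cur = e
        while cur not in seen:
            seen.add(cur)
            cur = tuple(sorted((sigma[cur[0]], sigma[cur[1]])))
    return orbits

print(sum(3 ** num_edge_orbits(a) for a in fixed) // len(fixed))   # 89694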
This is what the game board looks like.
Cube fixed vertex Tangled game graph.
We could compute and store the quantum annealing ground truths for all 89,694 unique terminal states here, but that will soon stop being viable as the graphs get bigger. We need a new approach to training our agents that can scale. For the agents trained using simulated annealing adjudications, we can just use an online simulated annealer that computes a result for each terminal state as it comes up and then throws it away. But for the agent trained using quantum annealing, we should try to be more clever, because QC calls are expensive. I’ve thought about this for a while, and I don’t have a completely rock-solid way to do it, so here’s what I’ve decided to implement:
Step 1: Pre-train an AlphaZero agent to a satisfactory performance level using a simulated annealing adjudicator
Step 2: Next, fine-tune the agent with the following changes:
If the simulated annealing adjudicator returns a score for a terminal state whose absolute value is between 0.25 and 0.75 (i.e. close to the draw lines at ±0.5):
Check if that terminal state or any of its automorphisms is a key in an adjudication dictionary; if it is, return the already computed quantum annealing result
If it isn’t, create a dictionary entry whose key is the terminal state and whose value is the result of adjudication using the quantum annealer; return that value
The idea here is to first pre-train the AlphaZero agent on rewards that are ‘mostly correct’ (the simulated annealing ones) at low compute cost. After this is done, we fine-tune on the quantum annealing results by only sending terminal states whose adjudication values are likely to differ between quantum and simulated annealing. I am guessing this will happen preferentially for scores near the draw lines at score = ±0.5, which motivates the ‘absolute value is between 0.25 and 0.75’ criterion. Storing the results of the quantum anneals in a look-up table means we only have to call the D-Wave hardware once per unique terminal state.
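Here is a minimal sketch of what that fine-tuning adjudicator might look like. The names sa_score, qa_score, and canonical_form are hypothetical stand-ins for the real simulated annealing adjudicator, the D-Wave call, and the reduction of a terminal state to a canonical representative under the 6 surviving automorphisms; only the caching logic is the point here.

# Sketch of the fine-tuning adjudication step described above
qa_cache = {}   # canonical terminal state -> quantum annealing score

def adjudicate(terminal_state, sa_score, qa_score, canonical_form):
    score = sa_score(terminal_state)

    # Only escalate states near the draw lines at score = +/- 0.5
    if not (0.25 <= abs(score) <= 0.75):
        return score

    key = canonical_form(terminal_state)           # same key for every automorphic copy
    if key not in qa_cache:
        qa_cache[key] = qa_score(terminal_state)   # one D-Wave call per unique state
    return qa_cache[key]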
This will miss a lot of terminal states where the agent has been misinformed about the reward, but I hope it will give a noticeable performance boost for the quantum-trained agent.
If this strategy works here, we can then apply it to the type of game graph where the gap between any classical adjudicator and the quantum annealer is likely in the quantum supremacy regime.
Creating a Lookup Table For The Cube Graph
For the Cube graph, there are 10,958 states where the simulated annealing adjudicator reports a score whose absolute value is between 0.25 and 0.75. That’s a lot: about 12% of all unique terminal states are near the draw lines! I sent all of these into the quantum annealing adjudicator and found that 4,196 of the 10,958 (about 38%) were mis-adjudicated by the simulated annealer. That’s also a lot! At least for this graph, these numbers imply that (a) there are a lot of states near the draw lines and (b) a lot of those states are wrongly adjudicated by simulated annealing. From the perspective of using these quantum annealing results to improve an AlphaZero agent, this process found quantum annealing ground truths that differ from the simulated annealing results for 4,196/89,694 ~ 5% of the unique terminal states. I’d expect this improvement in reward quality to show up as an improvement in the performance of an AlphaZero agent fine-tuned on these results.
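As a sketch of how these fractions can be computed, assuming sa_scores and qa_scores are dictionaries mapping the 10,958 near-draw-line states to their respective scores, and assuming ‘mis-adjudicated’ means the implied win/loss/draw outcome flips between the two adjudicators:

def outcome(score, draw_line=0.5):
    # Map a raw score to a game result; the sign convention (which player a
    # positive score favours) is an assumption here
    if score > draw_line:
        return 'red wins'
    if score < -draw_line:
        return 'blue wins'
    return 'draw'

flipped = [s for s in sa_scores if outcome(sa_scores[s]) != outcome(qa_scores[s])]
print(len(flipped), len(flipped) / len(sa_scores))   # 4,196 and ~0.38 for the Cube graph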
Competition Results
I followed the same procedure as for the barbell graph, training AlphaZero agents on both simulated annealing and the limited lookup-table results found above. The neural nets for both were identical in architecture to the one I used for the barbell graph, but with some of their hyperparameters increased, giving 223,437 trainable parameters. Note that since there are only 89,694 unique terminal states for this game graph, this is likely overkill. But I think it’s fine, as what we’re looking for here is evidence that the ~ 5% of terminal states where the simulated and quantum annealing adjudications disagree is enough to cause performance differences between the agents. Here are the hyperparameters I used.
self.nn_args = dotdict({
    'lr': 0.0001,
    'dropout': 0.1,
    'epochs': 10,
    'batch_size': 64,
    'num_channels': 32,   # 223,437 trainable parameters
    'num_res_blocks': 5,
    'first_out_layer_size': 64,
    'second_out_layer_size': 128
})
I set up a round-robin tournament using the same nine agents as in the barbell graph case:
Random play [Random]
MCTS with 10 rollouts using simulated annealing [MCTS 10 SA]
MCTS with 100 rollouts using simulated annealing [MCTS 100 SA]
MCTS with 10 rollouts using quantum annealing [MCTS 10 QA]
MCTS with 100 rollouts using quantum annealing [MCTS 100 QA]
AlphaZero trained using simulated annealing with 10 rollouts using simulated annealing [AlphaZero 10 SA]
AlphaZero trained using simulated annealing with 100 rollouts using simulated annealing [AlphaZero 100 SA]
AlphaZero trained using quantum annealing with 10 rollouts using quantum annealing [AlphaZero 10 QA]
AlphaZero trained using quantum annealing with 100 rollouts using quantum annealing [AlphaZero 100 QA]
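The tournament itself is just a double round robin, with every ordered pair of distinct agents playing a fixed number of games. As a rough sketch (agents, play_game, and the default game count are hypothetical stand-ins, not the actual tournament code):

from itertools import permutations

def round_robin(agents, play_game, games_per_pairing=50):
    # agents: dict mapping agent name -> playable agent object
    # play_game(red, blue): plays one game and returns 'red', 'blue', or 'draw'
    # games_per_pairing: placeholder default; the real per-pairing count isn't shown here
    results = {}   # (red_name, blue_name) -> [red wins, blue wins, draws]
    for red_name, blue_name in permutations(agents, 2):
        tally = [0, 0, 0]
        for _ in range(games_per_pairing):
            winner = play_game(agents[red_name], agents[blue_name])
            tally[{'red': 0, 'blue': 1, 'draw': 2}[winner]] += 1
        results[(red_name, blue_name)] = tally
    return results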
Here are the results:
Results of the round-robin tournament, showing each agent’s results against each other agent, playing as both red and blue. The way to read this is that the blue-colored cells give the W/L/D record for the blue-colored agent, and the red-colored cells give the W/L/D record for the red-colored agent.
Summary of results, ranked in descending Elo order. The number in the agent name is the number of online rollouts used per action selection. The edge for red is interesting, as it implies that going first provides an advantage. The top four agents are essentially equivalent to each other, producing mostly draws (although the ranking seems solid, in that a higher-ranked agent reliably does better, even if only marginally, than those immediately below it).
Some Thoughts …
As the number of possible terminal states explodes with graph size, the fraction of states that we will be able to adjudicate using the quantum annealer will go to zero. This means we need a way to figure out which states are the most meaningful to adjudicate. I think the reason the QA results aren’t much better than the SA results here is that the fraction of states we were able to adjudicate using QA was lower.
There are too many draws in the current incarnation of Tangled. Good agents (at least on small graphs) seem to be able to find ways to always draw (and the density of states seems to peak massively around score ~ zero). I would like to reduce the number of draws. There is an easy way to do this: reduce the number defining the ‘draw line’ from 0.5 to something smaller. In the limit where this goes to zero, there will be no draws at all. I think that when two good players play each other, the resultant score will always be close to zero.
I think this means that if we shrink the draw line down to the limit where there are no draws, the most meaningful states will be the ones closest to zero score.
Now which of these should we choose to adjudicate using the quantum annealer? Maybe if we train an AlphaZero agent using simulated annealing to the point where we lose patience with the training process, during self-play it will seek out the ‘right’ kinds of terminal states: the ones you reach when both players are playing well, at least as judged by the simulated annealing adjudicator. Once we are in the final phases of training, we could select terminal states whose scores are very close to zero (say within ~ 0.01 of zero), send those to the quantum annealer, and store the results in a lookup table. I think I only have about 100,000 calls to the QC left, so that would be only 100,000 states, which is ~ zero as a percentage of total states, but maybe they are ‘special’ in the sense that good agents preferentially seek them out. Maybe?
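Here is a sketch of what that selection step might look like, grafted onto the tail end of training. Everything named here (the stream of self-play terminal states, the two adjudicators, the canonicalization) is a hypothetical placeholder; the point is just the near-zero filter and the budget cap.

def build_near_zero_lookup(terminal_states, sa_score, qa_score, canonical_form,
                           qa_budget=100_000, near_zero=0.01):
    # terminal_states: stream of terminal states generated during late-stage self-play
    # qa_budget: roughly the number of QC calls left
    # near_zero: what counts as 'very close to zero' for the SA score
    qa_lookup = {}   # canonical terminal state -> quantum annealing score
    for state in terminal_states:
        if len(qa_lookup) >= qa_budget:
            break                              # stay within the remaining QC call budget
        if abs(sa_score(state)) < near_zero:
            key = canonical_form(state)        # collapse automorphic duplicates
            if key not in qa_lookup:
                qa_lookup[key] = qa_score(state)
    return qa_lookup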