I’ve gotten to a major milestone in this project. I’ve trained and tested my first quantum-trained AlphaZero agent! It appears to behave the way I was hoping it would.

The Game Graph I Used Here

For this run, I used the barbell graph, which in my esoteric naming convention is graph 19. I chose this one because it’s the smallest graph (6 vertices, 7 edges) where some of the terminal states’ adjudications differ between simulated and quantum annealing (there are 3⁷ = 2,187 unique terminal states; 133 of these (about 6%) give different answers for simulated annealing and hardware-based fast quantum annealing).

Here’s the barbell graph, showing the locations of the red and blue vertices (we are using the fixed vertex Tangled variant).

Recap: How to Play Tangled

To recap how Tangled works, the two players (red and blue) take turns selecting edges of the graph and coloring them one of gray, green, or purple. Red goes first. As there are an odd number of edges in this graph, red gets one more turn than blue (red chooses 4 edges, blue chooses 3), giving red a significant advantage. In the tests that follow, when pitting one agent against another, I ensure both play red half the time and blue half the time.

Once all the edges are colored, the game is over and the winner is determined. What each player is attempting to do is have more influence than their opponent, where more influence means more of the vertices prefer to align with their colored vertex. For example in the barbell graph, if the edges (0, 1), (0, 2), (2, 5), and (3, 5) are colored green (ferromagnetic coupling) and the edges (3, 4) and (4, 5) are colored purple (antiferromagnetic coupling), then when the system is measured, the energetically preferred states are (1,1,1,1,-1,1) and (-1,-1,-1,-1, 1, -1). You can see that the red vertex (vertex 0) is the same as four other vertices in both preferred states, while the blue vertex (vertex 4, the odd one out) differs from all five other vertices in both. So in this case, red has much more influence (four aligned vs zero aligned) and wins the game. The ground truth for this adjudication is a hardware-based fast quantum anneal (anneal time of 5 nanoseconds) using a D-Wave quantum computer (data here collected from the Advantage 2.6 machine). See this post for details.

Recap: How to Train AlphaZero Agents For Tangled

It’s the same as regular AlphaZero (see the series of blog posts/videos starting here on the subject for deep dives), except we can vary which adjudicator is used to evaluate terminal states and provide the reward signal. The idea is that agents trained on adjudicators that return ground truth (hardware-based quantum annealing) should perform much better than agents trained on approximations that often return the incorrect reward (in this case, simulated annealing returns the wrong answer for about 6% of the terminal states).

The Experimental Set-up

I adjudicated all 2,187 unique terminal states for this game graph, using 1,000,000-sample simulated annealing and 100,000-sample 5 nanosecond anneal time using the D-Wave Advantage 2.6 system in fast anneal mode. For the hardware anneal, I used the parallel embedding technique described here with no shim or gauge transforms.

I then trained two AlphaZero agents. The first was trained using the pre-computed simulated annealing terminal state adjudications (‘lookup_type’: ‘SA’), and the second was trained using the pre-computed quantum annealing adjudications (‘lookup_type’: ‘QA’). Both were trained using the same run parameters, which are shown here:

    args = dotdict({
        'fixed_token_game': True,    # if True, red/blue have fixed token locations; if False, players choose vertices
        'lookup_type': 'SA',   # either 'SA' or 'QA' -- only relevant if using 'lookup_table' adjudicator
        'numIters': 35,    # number of self-play/nn training/competition cycles
        'numEps': 480,     # Number of complete self-play games to simulate during a new iteration; number of training
        # examples = numEps*# moves per game*# symmetries; make this a multiple of num_workers
        'tempThreshold': 5,
        # Training uses temp=1 if episodeStep < tempThreshold, and thereafter uses temp=0; set to E-2 for all but last
        # 2 moves, E-4 for all but last 4 moves, etc.
        'updateThreshold': 5,  # During arena playoff, new neural net will be accepted if > threshold of Elo gain
        'maxlenOfQueue': 5000000,  # Number of game examples to train the neural networks
        'numMCTSSimsSelf': 100,  # Moves for MCTS to simulate during self-play; higher means higher game quality
        'numMCTSSimsComp': 10,  # Moves for MCTS to simulate during competitive play
        # each traverses a tree down to a leaf. once it gets there it calls the neural net to establish an initial
        # policy and value for the node. if the leaf is a terminal state, it evaluates the value by adjudication
        'arenaCompare': 592,  # Half the number of games to play during arena play to determine if new net will be
        # accepted; small Elo differences need large numbers (1000-3000 total games) to be sure; make this a multiple
        # of num_workers
        'cpuct': 2.0,  # constant for polynomial upper confidence bound (larger=more exploration), default=2
        'initial_elo': 1000,   # assuming starting from baseline of equivalence to 10-rollout MCTS
        # 'checkpoints': './results/graph_' + str(graph_args.graph_number) + '_parallel_fixed_token_results/',
        'load_model': False,  # set to true if you want to start training from a previous checkpoint
        'numItersForTrainExamplesHistory': 8,
        'num_workers': 16   # number of cpu workers; test different values for your setup to find optimal number
    })

The neural net I used here is very small (only 34,188 trainable parameters — I figured you probably don’t need a huge net for this game graph), but the same overall architecture as I’ve been using (stacked res blocks with policy and value heads). Here are the run parameters and the model architecture:

class AlphaZeroNet(nn.Module):
    def __init__(self, board_size, action_size, verbose=False):
        super().__init__()   # necessary to initialize nn.Module
        self.nn_args = dotdict({
            'lr': 0.0001,       # probably don't modify these unless you want to try a specific type of neural net
            'dropout': 0.1,     # start by modifying the structure of the net to be what you want and then change
            'epochs': 10,       # these parameters as appropriate
            'batch_size': 64,
            'num_channels': 16,   # 34,188 trainable parameters
            'num_res_blocks': 2,
            'first_out_layer_size': 32,
            'second_out_layer_size': 64
        })

        self.board_x, self.board_y = board_size
        self.action_size = action_size

        self.criterion_pi = nn.KLDivLoss(reduction='batchmean')  # Use KLDivLoss for probabilities
        self.criterion_v = nn.MSELoss()

        # Initial input processing; generally recommended to set bias=False if followed by batch norm
        self.conv1 = nn.Conv1d(self.board_y, self.nn_args.num_channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm1d(self.nn_args.num_channels)

        # Residual tower
        self.res_blocks = nn.ModuleList([
            ResBlock1D(self.nn_args.num_channels) for _ in range(self.nn_args.num_res_blocks)
        ])

        # Calculate size after residual tower
        self.flat_size = self.nn_args.num_channels * self.board_x

        # Policy head with BatchNorm
        self.policy_fc1 = nn.Linear(self.flat_size, self.nn_args.first_out_layer_size, bias=False)
        self.policy_bn1 = nn.BatchNorm1d(self.nn_args.first_out_layer_size)
        self.policy_fc2 = nn.Linear(self.nn_args.first_out_layer_size, self.action_size, bias=True)

        # Value head with BatchNorm
        self.value_fc1 = nn.Linear(self.flat_size, self.nn_args.first_out_layer_size, bias=False)
        self.value_bn1 = nn.BatchNorm1d(self.nn_args.first_out_layer_size)
        self.value_fc2 = nn.Linear(self.nn_args.first_out_layer_size, self.nn_args.second_out_layer_size, bias=False)
        self.value_bn2 = nn.BatchNorm1d(self.nn_args.second_out_layer_size)
        self.value_fc3 = nn.Linear(self.nn_args.second_out_layer_size, 1, bias=True)

        self.dropout = nn.Dropout(self.nn_args.dropout)

        self.optimizer = optim.Adam(self.parameters(), lr=self.nn_args.lr)

        self.to(device)   # Move the model to the GPU

        if verbose:
            print(self)
            print('number of trainable parameters: ', sum(p.numel() for p in self.parameters() if p.requires_grad))

During gameplay, AlphaZero and pure MCTS agents both require online adjudication in order to choose moves (I guess technically you could use AlphaZero online without any MCTS rollouts by just using the bare neural net policy, but I didn’t do that here). To ensure that the non-quantum agents did not have access to the quantum annealing ground truth except for final game state adjudication, I did the following. For the ‘classical’ agents, during online move selection, terminal state adjudications used the simulated annealing results. For the quantum agents, online move selection was done using the quantum annealing results. When the game is ended, the adjudication always uses the quantum results.

Intuitively, this should lead to the agents constrained to use the (sometimes incorrect) simulated annealing results performing worse than the quantum versions that always see the correct results.

Enter The Kumite!!!

I set up a round-robin tournament using the following nine agents:

Random play [Random]
MCTS with 10 rollouts using simulated annealing [MCTS 10 SA]
MCTS with 100 rollouts using simulated annealing [MCTS 100 SA]
MCTS with 10 rollouts using quantum annealing [MCTS 10 QA]
MCTS with 100 rollouts using quantum annealing [MCTS 100 QA]
AlphaZero trained using simulated annealing with 10 rollouts using simulated annealing [AlphaZero 10 SA]
AlphaZero trained using simulated annealing with 100 rollouts using simulated annealing [AlphaZero 100 SA]
AlphaZero trained using quantum annealing with 10 rollouts using quantum annealing [AlphaZero 10 QA]
AlphaZero trained using quantum annealing with 100 rollouts using quantum annealing [AlphaZero 100 QA]

Let’s go!!!!!!

Results

In each agent vs. agent competition, I played 96 games with the first agent as red, and 96 games with the first agent as blue, for a total of (9 choose 2) * 96 * 2 = 6,912 games played. Here are the results:

My First Quantum Agent

The Game Graph I Used Here

Recap: How to Play Tangled

Recap: How to Train AlphaZero Agents For Tangled

The Experimental Set-up

Enter The Kumite!!!

Results

© 2025 Snowdrop Quantum Applications Corporation

My First Quantum Agent

The Game Graph I Used Here

Recap: How to Play Tangled

Recap: How to Train AlphaZero Agents For Tangled

The Experimental Set-up

Enter The Kumite!!!

Results

Pushing Into Uncharted Territory

Analysis of Adjudication of Terminal States for Small Graphs

© 2025 Snowdrop Quantum Applications Corporation