
DeepMind new research: AlphaZero’s black box opened | TechNews Technology New Report

by admin

Chess has always been a laboratory for AI. Seventy years ago, Alan Turing wanted to create a chess-playing machine capable of self-learning and continuous improvement. Deep Blue, born at the end of the 20th century, was the first machine to defeat a human world champion, but it relied on experts to hand-code chess knowledge into it. AlphaZero, which appeared in 2017, is a neural-network-driven reinforcement learning machine, realizing Turing's dream.

AlphaZero's heuristics require no hand-crafted design, and it never watches humans play; it is trained entirely through self-play. Did it actually learn human chess concepts along the way? This is the interpretability problem of neural networks.

Recently, Demis Hassabis, one of the minds behind AlphaZero, collaborated with DeepMind colleagues and Google Brain researchers to look for evidence of human chess concepts inside the AlphaZero network, showing when and where those concepts appear during training, and also finding that AlphaZero's style differs from humans'. The paper was published in PNAS.

AlphaZero learned human chess concepts during training

AlphaZero's network architecture consists of a residual network (ResNet) backbone plus separate policy and value heads. The ResNet is a stack of network blocks with skip connections. Training proceeds in iterations: AlphaZero starts from a neural network with randomly initialized parameters, repeatedly plays against itself, learns to evaluate positions, and retrains on the data it generates.

To determine the extent to which the AlphaZero network encodes human chess concepts, the study used sparse linear probing to map the network's internal representations, as they change over training, onto human-understandable concepts.

First, each concept is defined as a user-specified function, shown in orange in Figure 1. As a probe, a generalized linear function g is trained to approximate the concept c; the quality of the approximation indicates how well a given layer (linearly) encodes that concept. For each concept, the process is repeated for every layer of every network in the sequence produced during training.
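The paper does not publish probe code; below is a minimal sketch of what a sparse (L1-regularized) linear probe might look like, assuming a layer's activations are available as a NumPy matrix. All names here are illustrative, not DeepMind's actual pipeline:

```python
import numpy as np

def fit_sparse_probe(X, y, lam=5.0, iters=200):
    """Fit an L1-regularized linear probe by coordinate descent.

    X: (n_positions, n_activations) activations from one network layer.
    y: (n_positions,) values of the concept function c on those positions.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            # residual with feature j's own contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            # soft-thresholding is what keeps the probe sparse
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return w

def r2_score(y, y_pred):
    """Coefficient of determination of the probe's predictions."""
    ss_res = ((y - y_pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

The r² of the fitted probe is the kind of score plotted per layer and per checkpoint in the what-when-where diagrams below.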

▲ Figure 1: The AlphaZero network (blue) explores human-encoded chess concepts. (Source: PNAS, the same below)

A simple concept function asks whether I or my opponent still have a bishop ♗:
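As an illustrative sketch (not the paper's code), such a binary concept function might be written over a FEN position string, where uppercase letters are White's pieces and lowercase are Black's:

```python
def has_bishop(fen, mine=True):
    """Concept c: does the given side still have a bishop?

    Only the piece-placement field of the FEN string is inspected;
    'mine' is taken to mean White here, purely for illustration.
    """
    placement = fen.split()[0]
    return ('B' if mine else 'b') in placement
```

For the standard starting position, the function returns True for both sides.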

Of course, there are chess concepts far more complex than this example, such as piece mobility: a function can be written to compare how freely my pieces and my opponent's pieces can move. In the experiments, the concept functions are pre-specified, encapsulating the domain knowledge of chess.
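As a toy stand-in for such a mobility concept (the paper's actual function compares legal-move counts; this sketch only counts ray squares of the sliding pieces and ignores checks, captures, and the other piece types):

```python
def board_from_fen(fen):
    """Expand the FEN piece-placement field into an 8x8 grid of characters."""
    grid = []
    for row in fen.split()[0].split('/'):
        squares = []
        for ch in row:
            squares.extend(['.'] * int(ch) if ch.isdigit() else [ch])
        grid.append(squares)
    return grid

def slider_mobility(grid, white=True):
    """Count empty squares reachable by one side's sliding pieces (B, R, Q)."""
    diag = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
    orth = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    dirs = {'B': diag, 'R': orth, 'Q': diag + orth}
    total = 0
    for r in range(8):
        for c in range(8):
            piece = grid[r][c]
            if piece == '.' or piece.isupper() != white:
                continue
            for dr, dc in dirs.get(piece.upper(), []):
                rr, cc = r + dr, c + dc
                while 0 <= rr < 8 and 0 <= cc < 8 and grid[rr][cc] == '.':
                    total += 1
                    rr, cc = rr + dr, cc + dc
    return total

def mobility_concept(fen):
    """Concept c: my sliding-piece mobility minus my opponent's."""
    grid = board_from_fen(fen)
    return slider_mobility(grid, True) - slider_mobility(grid, False)
```

In the starting position the score is 0 by symmetry; an advantage in open lines shows up as a positive value for White.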


Next comes probe training. The researchers used 10^5 naturally occurring positions from the ChessBase dataset as the training set, trained the sparse regression probe g on the activations at depth d of the network, and used it to predict the value of concept c. Comparing the networks from different training steps of the AlphaZero learning cycle, and the scores of different concept probes at different layers of each network, reveals when and where a network learned a concept.
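Conceptually, this sweep is a double loop over training checkpoints and network depths, fitting one probe per cell. A toy sketch with synthetic "activations" (the names, data, and ordinary-least-squares probe here are illustrative, not the paper's pipeline):

```python
import numpy as np

def probe_r2(X, y):
    """Fit a least-squares linear probe and return its r^2 score."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    pred = Xb @ w
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def what_when_where(activations, concept):
    """activations: {(step, layer): (n, d) array} -> {(step, layer): r^2}."""
    return {key: probe_r2(X, concept) for key, X in activations.items()}

# Toy demo: later checkpoints encode the concept with less noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=100)       # hidden quantity the concept measures
concept = signal.copy()
acts = {}
for step, noise in [(0, 2.0), (8_000, 0.5), (32_000, 0.05)]:
    acts[(step, 6)] = np.stack(
        [signal + noise * rng.normal(size=100), rng.normal(size=100)], axis=1)
grid = what_when_where(acts, concept)
```

In the real study the grid's axes are the ResNet blocks and the training checkpoints; in this toy demo, r² at the "32k-step" cell approaches 1 while the "step-0" cell stays low.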

Finally, a what-when-where diagram is obtained for each concept, visualizing three things: what concept is computed, where in the network it is computed, and when during training it appears (Figure 2).

▲ Figure 2: The concepts shown, from A to H, are "total score evaluation", "have I been checked?", "threat assessment", "can I capture the opponent's queen?", "will the opponent checkmate me with this move?", "material score evaluation", "material score", and "do I have pawns?".

As panel C shows, as AlphaZero becomes stronger, the "threats" concept function and the AlphaZero representations detectable by linear probes become less and less correlated.

The what-when-where diagram includes two baselines that the detection method needs: one is regression on the raw input, shown as layer 0, and the other is regression on a network with random weights, shown as training step 0. Against these baselines, it can be concluded that changes in regression accuracy are driven entirely by changes in the network's representations.

Furthermore, many of the what-when-where plots show the same pattern: regression accuracy stays low across all layers early in training, rises rapidly until around 32k steps, increases with network depth up to a point, and then plateaus and remains constant in the later layers. So all concept-related computation happens relatively early in the network; the remaining residual blocks presumably select moves or compute features outside the concept set.

With more training, many human-defined concepts can be predicted from AlphaZero representations with high accuracy.

For more advanced concepts, the researchers found that AlphaZero's mastery develops later. The concepts that are significantly different from zero by 2k steps are "material" and "space"; more complex concepts such as "king_safety", "threats", and "mobility" are clearly non-zero only at 8k steps, and show real growth only after 32k steps. This is consistent with the sharp rise in r² shown in the what-when-where plots in Figure 2.


The most notable feature of most what-when-where plots is that regression accuracy rises quickly with depth at the start of the network and then plateaus or declines. This means the current concept set only probes the earlier layers of the network; understanding the later layers will require new concept-detection techniques.

AlphaZero's opening strategy differs from humans'

After establishing that AlphaZero learns human chess concepts, the researchers examined its understanding of opening theory, since opening choices reflect an understanding of the related concepts. Here AlphaZero differs from humans: AlphaZero narrows its options over time, while humans expanded theirs.

Figure 3A shows the historical evolution of human preferences for White's first move. Early on, 1.e4 dominated; over time, opening choices became more balanced and flexible. Figure 3B shows how AlphaZero's opening preferences evolve with training: it initially weighs all options roughly equally and then gradually narrows the range.

▲ Figure 3: Comparison of AlphaZero's and humans' first-move preferences over training steps and historical time.

This is in sharp contrast to the evolution of human knowledge, which gradually expanded outward from 1.e4, while AlphaZero clearly favors 1.d4 late in training. The preference should not be over-interpreted, however, because self-play training is based on fast games with considerable randomness injected to encourage exploration. The reason for the discrepancy is unclear, but it reflects fundamental differences between humans and artificial neural networks: human historical data emphasizes the collective knowledge of master players, whereas AlphaZero's data includes its own beginner-level play and a single evolutionary trajectory.

So when the AlphaZero network is trained from scratch many times, does it converge on preferences for particular openings?

The study found that in many cases preferences are not stable across training runs, and AlphaZero's opening strategies are highly diverse. Take the classic Ruy Lopez (commonly known as the "Spanish opening"): early in training, AlphaZero playing Black preferred the typical line, namely 1.e4 e5 2.Nf3 Nc6 3.Bb5.


▲ Figure 4: The Ruy Lopez opening.

Across different training runs, AlphaZero gradually converges on one of 3...Nf6 or 3...a6. In addition, different versions of AlphaZero each showed a strong preference for certain moves, and these preferences were established early in training. This shows that variety in chess play exists not only between humans and machines, but also between different training runs of AlphaZero.

The process of AlphaZero acquiring knowledge

How do these opening-strategy results relate to AlphaZero's conceptual understanding? The study found that the what-when-where diagrams of various concepts have obvious inflection points that coincide with significant changes in opening preferences; in particular, the concepts of material and mobility appear directly related to opening strategy.

The material concept is mainly learned between 10k and 30k steps, while piece mobility is gradually incorporated into AlphaZero's value head over the same period. A basic understanding of piece values appears to precede the understanding of piece mobility. AlphaZero then incorporates this theory into its opening preferences between roughly 25k and 60k steps.

The authors summarize the evolution of chess knowledge in the AlphaZero network as follows: first the network discovers basic playing strength; then basic knowledge explodes within a short time window, mainly around mobility-related concepts; finally, in a refinement stage, the network's opening strategy is perfected over hundreds of thousands of steps. Although total training takes a long time, certain basic abilities emerge rapidly within a relatively short period.

Former world chess champion Vladimir Kramnik was also invited to review the findings, and his observations are consistent with the process described above.

In conclusion, this research demonstrates that many human chess concepts can be reconstructed from AlphaZero's learned representations, and details which concepts the network learns, when during training it learns them, and where in the network they are computed. It also shows that AlphaZero's style of play is not the same as humans'. Now that neural networks can be understood in terms of human-defined chess concepts, the next question is: can neural networks learn knowledge beyond what humans know?

(This article is reproduced with the authorization of Leifeng.com; the source of the first picture: Pixabay)
