An international team of scientists from Russia, France and Germany has developed a new reinforcement learning algorithm (Bayes- UCBVI) with the participation of researchers from the School of Computer Science, the HSE Center for Artificial Intelligence and the AIRI Institute of

An international team of scientists from Russia, France and Germany has developed a new reinforcement learning algorithm (Bayes-UCBVI) with the participation of researchers from the School of Computer Science, the HSE Center for Artificial Intelligence and the AIRI Institute of Artificial Intelligence. This is the first Bayesian algorithm with a proof of mathematical validity and has been successfully tested in the practice of Atari games.

The results were announced at the ICML-2022 meeting. Reinforcement learning is a type of machine learning. Compared with classic machine learning, the key feature of this method is the constant interaction of the agent (algorithm) with the environment, which receives feedback in the form of rewards and punishments from the environment. The goal of the agent is to maximize the amount of rewards given to him by the "correct" interaction.

proxy should not just try to find out the right approach based on your current understanding of the environment. He also has to explore the environment: looking for new opportunities to get greater rewards. Therefore, a dilemma arises: research or use known data.

The problem of choosing between exploring the environment and using existing knowledge is one of the main problems in building an effective reinforcement learning algorithm. The Bayes-UCBVI algorithm developed by the researchers runs in an optimistic paradigm, where the agent double-checks the value of actions he rarely performs.

Optimistic principle causes the agent to choose any action for one of two reasons: Either he is not trying to do too much, or he is very sure it is good. This is what ensures that the agents do research on the environment.

"Let's imagine there is a coffee shop near your home. Every morning you buy your favorite coffee and pastries there. But there is a café nearby and you think: If there are some buns that taste better and the coffee is more fragrant? The next morning, you'll face a dilemma: explore a new coffee shop or go to a trustworthy place where you can determine the results.

You decide to explore a new place, but the coffee tastes bad. But you've tried it once and don't know: maybe the last batch of coffee beans just didn't succeed. Based on the principle of optimism, you'll give this cafe at least one chance," explains Daniil Tyapkin, an employee of the International Random Algorithm and Multivariate Data Analysis Lab and AIRI.

researchers point out that while theoretically effective, the optimistic principle is difficult to use to create practical reinforcement learning algorithms that are suitable for complex environments (such as computer game ) or to control real robots. The algorithms proposed by scientists make it possible to bridge the gap between theory and practice.

The author team first proposed the generalization of this algorithm and tested it on 57 Atari games. “This is the first algorithm with theoretical and practical significance,” said Alexei Naumov, one of the authors and head of the International Laboratory of Random Algorithms and Multidimensional Data Analysis. — The mature results of Bayes-UCBVI play an important role in the development of machine learning, which unite the community of theorists and practitioners. Using this algorithm in practice will significantly speed up the process of learning artificial intelligence . ”