Machine Heart Editorial Department
Is today's popular generative adversarial network (GAN) just a special case of adversarial curiosity? In a recent blog post, Jürgen Schmidhuber reiterated this claim. He says that a 1990 article of his described in detail a reinforcement learning and planning system based on two recurrent neural networks (RNNs), a controller and a world model, and that the same article introduced several concepts that are now well known in the ML field.
On the last day of 2020, LSTM inventor and deep learning veteran Jürgen Schmidhuber published a blog post reviewing work on planning and reinforcement learning with recurrent world models that he published 30 years ago.
He said that his 1990 article "Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments" (hereinafter the FKI-126-90 report) introduced several concepts that are now widely used, including planning with recurrent neural networks (RNNs) as world models, high-dimensional reward signals (also used as inputs to the neural controller), deterministic policy gradients for RNNs, and artificial curiosity and intrinsic motivation in neural networks (NNs).
FKI-126-90 Report address: http://people.idsia.ch/~juergen/FKI-126-90ocr.pdf
In the 2010s, these concepts became popular as the cost of computing fell. Since 2015, Jürgen and colleagues have extended them further, addressing planning in abstract concept spaces and learning to think.
In addition, agents equipped with adaptive recurrent world models may even offer simple accounts of consciousness and self-awareness.
The following is Jürgen Schmidhuber’s blog content:
In February 1990, I published the FKI-126-90 report (revised in November), which introduced several concepts that later became widely known in the field of machine learning.
This report describes a system for reinforcement learning and planning based on two recurrent neural networks (RNNs): a controller and a world model. The controller tries to maximize cumulative expected reward in an initially unknown environment, while the world model learns to predict the consequences of the controller's actions. The controller can use the world model to plan ahead through rollouts, selecting actions that maximize predicted cumulative reward. This integrated architecture for learning, planning and feedback was published before Rich Sutton proposed DYNA. The FKI-126-90 report also cites earlier work on system identification with feedforward neural networks. This approach inspired much follow-up research, not only in 1990-91 but also in recent years.
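The following is a minimal, hedged sketch of the planning-by-rollout idea, not the report's original algorithm: a controller RNN proposes candidate action sequences, a world-model RNN predicts their consequences and rewards, and the agent commits to the first action of the rollout with the highest predicted return. All module names, shapes, and the exploration noise are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, hidden = 8, 2, 32

class ControllerRNN(nn.Module):
    """Maps observations to actions while keeping a recurrent state."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden)
        self.act = nn.Linear(hidden, act_dim)
    def forward(self, obs, h):
        h = self.rnn(obs, h)
        return torch.tanh(self.act(h)), h

class WorldModelRNN(nn.Module):
    """Predicts the next observation and reward from (obs, action)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + act_dim, hidden)
        self.next_obs = nn.Linear(hidden, obs_dim)
        self.reward = nn.Linear(hidden, 1)
    def forward(self, obs, action, h):
        h = self.rnn(torch.cat([obs, action], -1), h)
        return self.next_obs(h), self.reward(h), h

def plan_by_rollout(controller, world_model, obs0, horizon=10, n_candidates=16):
    """Roll the controller out inside the world model and return the first
    action of the candidate rollout with the best predicted return."""
    best_return, best_action = -float("inf"), None
    with torch.no_grad():
        for _ in range(n_candidates):
            obs, hc, hm = obs0, torch.zeros(1, hidden), torch.zeros(1, hidden)
            total, first_action = 0.0, None
            for t in range(horizon):
                action, hc = controller(obs, hc)
                action = action + 0.1 * torch.randn_like(action)  # noise for diverse candidates
                obs, reward, hm = world_model(obs, action, hm)
                total += reward.item()
                if t == 0:
                    first_action = action
            if total > best_return:
                best_return, best_action = total, first_action
    return best_action

controller, world_model = ControllerRNN(), WorldModelRNN()
action = plan_by_rollout(controller, world_model, torch.zeros(1, obs_dim))
```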
Another innovation of 1990 is the high-dimensional reward signal. Traditional RL focuses on one-dimensional reward signals, but humans have millions of information sensors for perceiving different kinds of pain and pleasure. To my knowledge, the FKI-126-90 report was the first RL paper to consider multi-dimensional, vector-valued pain and reward signals coming from many different sensors, where cumulative values are predicted for each sensor separately rather than collapsed into a single scalar overall reward. Compare the functions later called general value functions. Unlike previous adaptive critics, the one proposed in FKI-126-90 is multi-dimensional and recurrent.
In addition, unlike in traditional RL, these reward signals are also fed back into the controller network as inputs, so that it can learn to perform actions that maximize cumulative reward. This is also related to meta-learning.
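As a concrete illustration of this point, here is a hedged sketch (assumed shapes and names, not the 1990 code): the reward is a vector with one component per "pain" or "pleasure" sensor, and the previous step's full reward vector is concatenated with the observation as ordinary controller input.

```python
import torch
import torch.nn as nn

obs_dim, reward_dim, act_dim, hidden = 8, 4, 2, 32

class RewardAwareController(nn.Module):
    def __init__(self):
        super().__init__()
        # input = observation + full reward vector from the previous step
        self.rnn = nn.GRUCell(obs_dim + reward_dim, hidden)
        self.act = nn.Linear(hidden, act_dim)
    def forward(self, obs, reward_vec, h):
        h = self.rnn(torch.cat([obs, reward_vec], -1), h)
        return torch.tanh(self.act(h)), h

ctrl = RewardAwareController()
h = torch.zeros(1, hidden)
obs = torch.zeros(1, obs_dim)
reward_vec = torch.tensor([[0.0, -0.3, 0.0, 0.1]])  # e.g. pain/pleasure per sensor
action, h = ctrl(obs, reward_vec, h)
```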
Can these techniques be applied to the real world? The answer is yes. My former postdoctoral colleague Alexander Gloye-Förster led the FU-Fighters team of the Free University of Berlin to the 2004 RoboCup world championship in the fastest league. The robots that won the championship used neural networks to plan ahead, in line with the concepts proposed in the FKI-126-90 report.
In 2005, Alexander and his team also showed how these concepts can be used to build self-healing robots: they built the first resilient machines using continuous self-modeling, able to recover automatically after certain kinds of unexpected damage.
The FKI-126-90 report also laid the basis for RNN-based deterministic policy gradients. The section "Augmenting the Algorithm by Temporal Difference Methods" combines temporal difference methods based on dynamic programming with a gradient-based predictive world model to compute the weight changes of a separate control network. More than two decades later, DeepMind used a variant of this approach.
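Here is a hedged, self-contained sketch of the core idea behind such deterministic policy gradients, an illustrative reconstruction rather than the report's exact algorithm: because the world model is differentiable, the predicted cumulative reward of an imagined rollout can be backpropagated through it to update the controller's weights directly.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, hidden = 8, 2, 32
controller = nn.GRUCell(obs_dim, hidden)
act_head = nn.Linear(hidden, act_dim)
world_model = nn.GRUCell(obs_dim + act_dim, hidden)
obs_head, reward_head = nn.Linear(hidden, obs_dim), nn.Linear(hidden, 1)

# only the controller's parameters are updated; the world model stays fixed here
params = list(controller.parameters()) + list(act_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

obs = torch.zeros(1, obs_dim)
hc, hm = torch.zeros(1, hidden), torch.zeros(1, hidden)
predicted_return = torch.zeros(1, 1)
for t in range(10):                               # imagined rollout inside the model
    hc = controller(obs, hc)
    action = torch.tanh(act_head(hc))
    hm = world_model(torch.cat([obs, action], -1), hm)
    obs, reward = obs_head(hm), reward_head(hm)
    predicted_return = predicted_return + reward

loss = -predicted_return.mean()                   # maximize predicted cumulative reward
opt.zero_grad()
loss.backward()                                   # gradients flow through the differentiable world model
opt.step()
```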
Finally, the FKI-126-90 report also introduced artificial curiosity through generative adversarial neural networks. While interacting with the world, humans learn to predict the consequences of their actions. At the same time, humans are curious and design experiments to obtain new data from which they can learn more. To build curious artificial agents, the FKI-126-90 report and another study of mine, "A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers," proposed a novel kind of active unsupervised or self-supervised learning with intrinsic motivation. The method is based on a minimax game in which one neural network minimizes an objective function that another neural network maximizes. I now call this contest between two unsupervised adversarial neural networks "adversarial artificial curiosity," to distinguish it from the variants of artificial curiosity and intrinsic motivation that have emerged since 1991.
How does adversarial artificial curiosity work? The controller NN (probabilistically) generates outputs that may influence the environment. The world model NN predicts the environment's reactions to the controller's outputs. The world model uses gradient descent to minimize its error, thereby becoming a better predictor. In a zero-sum game, however, the controller tries to find outputs that maximize the world model's error: the model's loss is the controller's gain. The controller is therefore motivated to invent new outputs or experiments that yield data the world model still finds surprising, until the model becomes familiar with them and eventually bored.
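A minimal sketch of this zero-sum game follows, under illustrative assumptions (a toy environment and simple feedforward nets instead of the original RNNs; the real environment is treated as non-differentiable, so the controller's gradient flows through the model): the world model M minimizes its prediction error, while the controller C is rewarded for actions whose consequences M predicts badly.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, hidden = 8, 2, 32

def environment(obs, action):
    # stand-in for the unknown environment's response to the controller's output
    return torch.tanh(obs.roll(1, dims=-1) + 0.5 * action.sum(-1, keepdim=True))

C = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim), nn.Tanh())
M = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(), nn.Linear(hidden, obs_dim))
opt_C = torch.optim.Adam(C.parameters(), lr=1e-3)
opt_M = torch.optim.Adam(M.parameters(), lr=1e-3)

obs = torch.randn(64, obs_dim)
for step in range(500):
    action = C(obs)
    next_obs = environment(obs, action).detach()   # no gradient through the real environment

    # controller step: maximize the world model's prediction error (seek surprise)
    error = ((M(torch.cat([obs, action], -1)) - next_obs) ** 2).mean()
    opt_C.zero_grad()
    (-error).backward()
    opt_C.step()

    # world model step: minimize its prediction error on the same data (zero-sum game)
    error = ((M(torch.cat([obs, action.detach()], -1)) - next_obs) ** 2).mean()
    opt_M.zero_grad()
    error.backward()
    opt_M.step()
```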
That is, in 1990 we already proposed self-supervised neural networks that are both generative and adversarial (to use the post-2014 terminology), generating experimental outputs and novel data, for the general cases of static patterns and pattern sequences as well as for RL. In fact, the popular generative adversarial networks (GANs, 2010-2014) are an application of adversarial curiosity in which the environment simply returns 1 or 0 depending on whether the controller's current output is in a given set. It is also important to note that adversarial curiosity, GANs, and adversarial Predictability Minimization (PM, 1991) are very different from other early adversarial machine learning settings, which neither involved unsupervised NNs nor modeled data nor used gradient descent.
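To make the claimed correspondence concrete, here is a hedged illustration: a tiny GAN on 1-D data, annotated with the adversarial-curiosity roles. The generator plays the controller; the discriminator plays the world model predicting the environment's binary response ("is this output in the given set of real data?"). The data distribution and network sizes are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))                 # controller / generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # world model / discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 2.0          # the "given set": samples from N(2, 0.5)
    fake = G(torch.randn(64, 4))

    # world model / discriminator: predict the environment's 1-or-0 response
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # controller / generator: produce outputs that make that prediction fail
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```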
As I have pointed out frequently since 1990, the weights of a neural network should be regarded as its program. Some believe that the purpose of a deep NN is to learn useful internal representations of the observed data; there is even an international conference on learning representations, ICLR. But in reality, the NN learns a program (the mapping encoded in its weights or parameters) that computes such representations from input data. The output of a typical NN is differentiable with respect to its own program; that is, a simple program generator can compute directions in program space along which better programs may be found. Much of my research since 1989 has exploited this fact.

The controller/model (C/M) planner proposed in the report focuses on naive millisecond-by-millisecond planning, trying to predict and plan every little detail of the future. Even today, this is still the standard approach in many RL applications, such as Go and chess. However, my 2015 paper "On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models" focuses on abstract (e.g., hierarchical) planning and reasoning [PLAN4-5]. Guided by algorithmic information theory, it describes an RNN-based AI (RNNAI) that can be trained on a never-ending sequence of tasks, some provided by users and others invented by the RNNAI itself in a curious, playful way to improve its RNN-based world model.
Unlike the system proposed in the FKI-126-90 report, the RNNAI [PLAN4] learns to actively query its model for abstract reasoning, planning, and decision-making; in essence, the RNNAI learns to think [PLAN4]. The ideas of [PLAN4-5] can be applied to many scenarios in which one RNN-like system exploits the algorithmic information of another. They also shed light on concepts such as mirror neurons [PLAN4].
In a recent paper co-authored with David Ha (2018) [PLAN6], we proposed a world model that can be trained quickly in an unsupervised way to learn compressed spatial and temporal representations. Using the features extracted by the world model as inputs to the agent, we could train a very compact and simple policy to solve the task at hand. Our approach achieved state-of-the-art results in several environments.
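A rough sketch of that pipeline under simplified assumptions (the encoder, recurrent memory, and dimensions below are placeholders, not the paper's exact architecture): a visual observation is compressed to a latent code z, a recurrent model carries a memory state h, and a deliberately tiny linear controller maps the concatenation of z and h to an action.

```python
import torch
import torch.nn as nn

z_dim, h_dim, act_dim = 32, 256, 3

encoder = nn.Sequential(                          # stand-in for the vision model ("V")
    nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(z_dim))
memory = nn.GRUCell(z_dim + act_dim, h_dim)       # stand-in for the recurrent memory model ("M")
controller = nn.Linear(z_dim + h_dim, act_dim)    # the compact policy ("C")

obs = torch.randn(1, 3, 64, 64)                   # one RGB frame
h = torch.zeros(1, h_dim)

z = encoder(obs)                                  # compress the observation
action = torch.tanh(controller(torch.cat([z, h], -1)))
h = memory(torch.cat([z, action], -1), h)         # update the memory for the next step
```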
Finally, what does all of this have to do with the two seemingly elusive concepts of "consciousness" and "self-awareness"? The first deep learning machine I proposed in 1991 [UN0-UN3] already modeled several aspects of consciousness. It uses unsupervised learning and predictive coding to compress observation sequences. A "conscious chunker RNN" attends to the unexpected events that a lower-level "subconscious automatiser RNN" fails to predict; the chunker RNN learns to "understand" them by learning to predict them. The automatiser RNN then uses the neural knowledge distillation procedure proposed in 1991 to compress and absorb the chunker RNN's formerly "conscious" insights and behaviors, thus making them "subconscious."
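The following is a hedged sketch of just the distillation step in that chunker/automatiser setup (illustrative modules and shapes, not the 1991 code): the automatiser RNN is periodically trained to reproduce the chunker RNN's predictions, so knowledge is compressed from the slow, "conscious" level down into the fast, "subconscious" one.

```python
import torch
import torch.nn as nn

in_dim, hidden = 8, 64
automatiser = nn.GRU(in_dim, hidden, batch_first=True)
automatiser_head = nn.Linear(hidden, in_dim)
chunker = nn.GRU(in_dim, hidden, batch_first=True)
chunker_head = nn.Linear(hidden, in_dim)
opt = torch.optim.Adam(
    list(automatiser.parameters()) + list(automatiser_head.parameters()), lr=1e-3)

seq = torch.randn(1, 20, in_dim)                  # an observation sequence

# the chunker's predictions over the sequence serve as distillation targets
with torch.no_grad():
    ch_out, _ = chunker(seq)
    targets = chunker_head(ch_out)

# distillation step: the automatiser learns to reproduce the chunker's predictions
au_out, _ = automatiser(seq)
loss = ((automatiser_head(au_out) - targets) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```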
Now let's return to the predictive world model of the controller interacting with its environment, discussed above. It efficiently encodes the growing history of actions and observations through predictive coding [UN0-UN3][SNT], automatically creating feature hierarchies: lower-level neurons correspond to simple feature detectors (perhaps similar to those found in mammalian brains), while higher-level neurons typically correspond to more abstract features, refined where necessary.
Like any good compressor, the world model will learn to identify regularities shared by existing internal data structures and to generate prototype encodings, compact representations, or symbols (not necessarily discrete) for frequently occurring observation subsequences, reducing the overall storage required. In particular, a compact self-representation or self-symbol arises as a natural by-product of this data compression, because one thing is present in all of the agent's actions and sensory inputs: the agent itself.
To encode the entire data history efficiently through predictive coding, the agent will create internal subnetworks whose activation patterns represent the agent itself [CATCH][FKI-126-90]. Whenever this representation is activated by the controller's planning mechanism (described in the FKI-126-90 report) or by the more flexible controller queries of the 2015 paper, the agent thinks about itself, becomes aware of itself and its possible futures, and tries, through interaction with the environment, to create a future with as little pain and as much joy as possible. That is why I have long claimed that we already had simple, conscious, self-aware, emotional artificial agents thirty years ago.
Original link: http://people.idsia.ch/~juergen/world-models-planning-curiosity-fki-1990.html#PLAN4