Top international journal Science Robotics publishes the latest research from Zhu Songchun's team: real-time bidirectional human-robot value alignment


Heart of the Machine column

Author: Zhu Songchun's team

Today (July 14), the top international academic journal Science Robotics published the latest research from Zhu Songchun's team: real-time bidirectional human-robot value alignment. The authors include Yuan Luyao and Gao Xiaofeng of UCLA, Zheng Zilong of the Beijing Institute for General Artificial Intelligence, and Zhu Yixin of Peking University's Institute for Artificial Intelligence.

Paper address: https://www.science.org/doi/10.1126/scirobotics.abm4183

This paper proposes an explainable artificial intelligence (XAI) system, elaborates a computational framework by which machines can understand human values in real time, and demonstrates how robots can communicate with human users in real time to complete a series of complex human-machine collaboration tasks. Zhu Songchun's team has long worked on explainable artificial intelligence, and this is the team's second paper on the topic published in Science Robotics. The research spans cognitive reasoning, natural language processing, machine learning, robotics, and other disciplines, and is a concentrated expression of the team's cross-disciplinary results.

In this era of human-machine coexistence, what should ideal human-machine collaboration look like if machines are to serve humans well? We can take collaboration in human society as a reference: in human teamwork, shared values and goals are the basis for concerted effort and efficient cooperation within a team. Most current machine intelligence, by contrast, is data-driven (and in many cases data cannot be obtained) and unilaterally accepts human instructions (which cannot be given when humans' ability to observe the situation is limited).

To solve these problems, and to enable machines to carry out better "autonomous" exploration, we need machines to learn to "read" human values. We therefore propose "real-time bidirectional value alignment": humans give the AI feedback again and again, gradually teaching it to "read" human values, so that the machine's "values" become consistent with the human's.

This problem is known as value alignment: how do we ensure that the value an artificial intelligence realizes while executing a task is consistent with the value the user cares about?

It can be said that value alignment is the basis for reaching consensus (common ground) in human-machine collaboration, and it has great research value. It is also an important direction for future development: it is the key to enabling machines to achieve "autonomous intelligence" and a necessary step toward general artificial intelligence. In view of this, the team of Zhu Songchun, director of the Beijing Institute for General Artificial Intelligence, has long been working in this direction.

1. Research background

What should ideal human-computer collaboration look like? In the early days of artificial intelligence, Norbert Wiener, the father of cybernetics, stated the premise of human-machine collaboration: "If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere... we had better be quite sure that the purpose put into the machine is the purpose which we really desire."

In recent years, a series of research advances has shown that efficient human-machine collaboration relies on the team having consistent values, goals, and understanding of the task state. This requires humans to efficiently establish a team-wide consensus on the task through communication with machines, and each team member to make behavioral decisions that other partners can readily understand. In most cases, communication between teammates is two-way: each member plays two roles, listener and expresser. Such two-way value alignment determines whether communication in human-machine collaboration succeeds, that is, whether the robot can accurately infer the user's value goals and effectively explain its own behavior. If either condition is not met, mutual incomprehension and misjudgment among teammates is likely to lead to collaboration failure. Therefore, if artificial intelligence is to serve human society well, it must play both roles when interacting with humans.

From the listener's perspective, traditional artificial intelligence algorithms (such as inverse reinforcement learning, IRL) can combine interaction data with machine learning to learn the user's value goals in a specific task, that is, to recover the reward function behind the user's observed behavior in that task. However, in many practical and important applications (such as the military and medical fields), data acquisition is often very expensive. These machine learning methods rely on large datasets and cannot cope with real-time, interactive human-machine collaboration scenarios.
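To make the contrast concrete, here is a minimal, self-contained sketch of the idea behind IRL. It assumes a reward that is linear in task features (an assumption of this illustration, not a detail taken from the paper) and nudges the weights until the learner's feature counts match the expert's demonstrations, which is why such methods need many demonstrations.

```python
import numpy as np

# Illustrative IRL-style update under a linear-reward assumption: move the
# weights toward features the expert visits more often than the learner does.
# In full IRL the learner's feature counts would be recomputed from a
# re-solved policy after every update; they are held fixed here for brevity.
def irl_weight_update(w, expert_features, learner_features, lr=0.1):
    grad = expert_features.mean(axis=0) - learner_features.mean(axis=0)
    return w + lr * grad

# Hypothetical per-demonstration feature counts, e.g. (speed, resources, area).
expert = np.array([[0.9, 0.1, 0.4],
                   [0.8, 0.2, 0.5]])
learner = np.array([[0.5, 0.5, 0.5]])

w = np.zeros(3)
for _ in range(100):
    w = irl_weight_update(w, expert, learner)
print(w)  # weights drift toward the expert's emphasis on the first feature
```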

From the expresser's perspective, explainable artificial intelligence (XAI) was introduced to promote consensus between humans and machines. Current XAI systems usually emphasize explaining how the model reached its decisions. However, no matter how much the user actively inputs or interacts, it affects only the machine's process of "generating explanations", not its process of "making decisions". This is one-way value-goal alignment, which we call "static machine, dynamic user" communication: only the user's understanding of the machine or the task changes during the collaboration.

2. Research methods

To achieve two-way alignment of value goals between humans and machines, a human-value-led, "dynamic machine, dynamic user" communication model is needed. In this new model, in addition to revealing its decision-making process, the robot adjusts its behavior in real time according to the user's value goals, enabling the machine and the human user to cooperate toward a series of shared goals. To capture user information instantly, we replace the traditional data-driven machine learning method with learning through communication: the machine gives reasonable explanations based on its inferred estimate of the user's value goals. This cooperation-oriented human-machine collaboration requires the machine to have a theory of mind (ToM), the ability to understand the mental states of others (including emotions, beliefs, intentions, desires, pretense, knowledge, and so on). Theory of mind was first studied in psychology and cognitive science and has since been generalized to artificial intelligence. It is particularly important in multi-agent and human-computer interaction settings, because each agent must understand the states and intentions of the other agents (including humans) to perform tasks well, and its decisions in turn affect the other agents' judgments. Designing a system with theory of mind means not only explaining its decision-making process, but also understanding humans' cooperation needs, so as to form a human-centered, human-machine-compatible collaborative process.

To build an AI system with these capabilities, we designed a "human-machine collaborative exploration" game in which a user must cooperate with three reconnaissance robots to complete exploration missions and maximize team benefit. The game has two key settings: 1) only the reconnaissance robots can directly interact with the game world, and the user cannot directly control their behavior; 2) the user chooses a value goal at the start of the game (for example: minimize exploration time, collect more resources, explore a larger area), and the robot team must infer this value goal through human-robot interaction. This setup mimics real-world human-machine collaboration tasks, since many AI systems must operate autonomously in hazardous environments (such as after a nuclear power plant leak) under the supervision of a human user.

To successfully complete the game, the robots need to master both "listening" and "speaking" to achieve two-way value alignment. First, the robots need to extract useful information from human feedback, infer the user's value function (a function describing the user's goals), and adjust their strategies accordingly. Second, the robots need to effectively explain what they "have done" and "plan to do" based on their current value inference, so that users can tell whether the robots hold the same value function as they do. Meanwhile, the user's task is to direct the reconnaissance robots to the destination and maximize the team's payoff. The user's evaluation of the robots is therefore also a two-way process: the user must infer the reconnaissance robots' value function on the fly, check whether it is consistent with their own, and, if not, select appropriate instructions to adjust the robots' goals. Ultimately, if the system works well, the scout robots' value function should be consistent with that of the human user, and the user should have a high degree of trust in the robot system operating autonomously.
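As a rough illustration of the "listening" side, the sketch below maintains a belief over a few candidate value functions and updates it with Bayes' rule after each accept/reject response. The linear utilities, the Boltzmann-style acceptance model, and all numbers are our own simplifying assumptions, not the paper's model.

```python
import numpy as np

def update_belief(belief, thetas, proposal_features, accepted, beta=2.0):
    """Bayes update under an assumed Boltzmann-rational user: proposals with
    higher utility under the true value function are more likely accepted."""
    utils = np.array([float(np.dot(t, proposal_features)) for t in thetas])
    p_accept = 1.0 / (1.0 + np.exp(-beta * utils))
    likelihood = p_accept if accepted else 1.0 - p_accept
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Hypothetical candidate value functions over [time, devices, area, resources].
thetas = [np.array([0.7, 0.1, 0.1, 0.1]),     # time-focused
          np.array([0.1, 0.1, 0.1, 0.7]),     # resource-focused
          np.array([0.25, 0.25, 0.25, 0.25])]  # balanced
belief = np.ones(3) / 3  # uniform prior

# The user rejects a resource-heavy detour; belief shifts away from the
# resource-focused candidate.
belief = update_belief(belief, thetas, np.array([-0.2, 0.0, 0.0, 0.9]),
                       accepted=False)
print(belief.round(3))
```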

Figure 1. Overview of the human-computer value alignment process.

Figure 1 introduces the two-way value alignment process in the game. During the interaction there are three value goals: θ*, the user's true value; θ̂, the robot's estimate of the user's value (in the game, the reconnaissance robots have no values of their own, so they act on their estimate of the human user's value); and θ̃, the user's estimate of the robot's value. From these three value goals, two alignments arise: θ̂ → θ*, in which the robot learns the user's value from the feedback the user gives; and θ̃ → θ̂, in which the user comes to understand the robot's value from the explanations and interactions the robot provides. Eventually, the three value goals converge at the user's true value θ*, and the human-machine team forms mutual trust and collaborates efficiently.

The XAI system proposed in this article aims to jointly solve the following two problems:

1. How can a machine accurately estimate the intention of a human user during instant interaction and feedback?

2. How do machines explain themselves so that human users can understand the machine's behavior and provide useful feedback to help the machine make value adjustments?

In the proposed system, the robot proposes task plans and asks the human user for feedback (accepting or rejecting each proposal), then infers from that feedback the real human value intention behind the task goals. In a collaborative game, if the user knows the robot is actively learning their value goals, the user tends to provide more useful feedback to promote value alignment.

In particular, each message conveys two levels of meaning: (1) semantic information based on value goals, and (2) pragmatic information based on the differences between alternative ways of explaining. Exploiting both levels, the XAI system pursues value alignment in a multi-round, real-time manner, enabling efficient human-computer interaction in a teamwork task with a large search space. To align the robot's value goals with the user's, the XAI system generates explanations that reveal the robot's current estimate of the user's value and justify its proposed plans. At each step of the interaction, the robot provides customized explanations to avoid verbosity, for example by omitting already-known information and emphasizing important updates. After receiving the robot's explanations, users send feedback indicating how satisfied they are with the latest suggestions and explanations, and the robot uses this feedback to continuously update the form and content of its explanations.
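The following toy sketch illustrates this customization idea: drop facts that were already communicated and foreground updated value estimates. The function and field names are our own illustrative choices, not the system's actual interface.

```python
def customize_explanation(current_facts, already_told, changed_estimates):
    """Keep only facts not yet shared; prepend highlighted weight updates."""
    new_facts = [f for f in current_facts if f not in already_told]
    highlights = [f"UPDATE: '{goal}' now weighted {w:.2f}"
                  for goal, w in changed_estimates.items()]
    return highlights + new_facts

told = {"device cleared at (3, 4)"}
facts = ["device cleared at (3, 4)", "new resource found at (7, 2)"]
changes = {"explore area": 0.15}  # hypothetical revised estimate
for line in customize_explanation(facts, told, changes):
    print(line)
```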

To evaluate the performance of this XAI system, we invited human users to a series of experiments examining whether two-way human-machine value alignment succeeded. We used three types of explanations and randomly assigned each user to one of three groups. The experimental results show that the proposed XAI system can effectively achieve real-time two-way value alignment in collaborative tasks: the robot can infer the human user's value and adjust its value estimate in a way the user can understand. Furthermore, diverse explanations prove necessary for improving both the machines' decision-making performance and their social intelligence. The goal of cooperative artificial intelligence is to reduce human cognitive load and assist in completing tasks. We believe that proactively inferring human value goals in real time and promoting human understanding of the system will pave the way for human-machine cooperation with general intelligence.

3. Game settings

As shown in Figure 2, the cooperative game we designed involves a human commander and three reconnaissance robots. The goal of the game is to find a safe path on an unknown map from the base (in the lower-right corner) to the destination (in the upper-left corner). The map is a partially observable 20×20 grid; each cell may contain a different device, visible only after a scout robot approaches it.
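A toy sketch of such a partially observable grid follows: cells are revealed only when a scout comes within sensing range. The grid size matches the description above; the sensing radius and starting position are our illustrative assumptions.

```python
import numpy as np

SIZE = 20   # 20x20 grid, as described above
SENSE = 2   # assumed sensing radius (illustrative)

def reveal(visible, robot_pos, sense=SENSE):
    """Mark the cells within the scout's sensing square as visible."""
    r, c = robot_pos
    r0, r1 = max(0, r - sense), min(SIZE, r + sense + 1)
    c0, c1 = max(0, c - sense), min(SIZE, c + sense + 1)
    visible[r0:r1, c0:c1] = True
    return visible

visible = np.zeros((SIZE, SIZE), dtype=bool)
visible = reveal(visible, (19, 19))  # a scout near the bottom-right base
print(int(visible.sum()), "cells revealed")  # 9
```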

In the game, the human commander and the reconnaissance robots are structurally interdependent. On the one hand, the commander relies on the robots to explore dangerous areas and defuse explosives; on the other hand, the robots rely on the commander's feedback to better understand the goals of the current mission.

Figure 2: User interface of the Recon Exploration game. From left to right: the legend panel displays the legend for the game map. The value function panel displays the value function for this game; the scout robots do not know this function, and the user cannot modify it. The central map displays the current map information. The score panel displays the user's current score; the overall score is the sum of the per-goal scores weighted by the value function. The status panel displays the system's current status. The proposal panel displays the scout robots' current mission-plan proposals, each of which the user can accept or reject. The explanation panel displays the explanations provided by the scout robots.

Besides finding a path, we gave the reconnaissance robots an additional set of goals: 1) reach the destination as quickly as possible; 2) investigate suspicious devices on the map; 3) explore a larger area; and 4) gather resources. Game performance is measured by how well the scout robots accomplish these goals, weighted by the goals' relative importance, where the weights constitute the human user's value function. For example, if the human commander cares more about timeliness than about acquiring resources, the robots should ignore some resources along the way to reach the destination sooner. (Note that this value function is revealed only to the human user at the beginning of the game, not to the scout robots. Figure 3 summarizes the human-machine interaction process.)
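A small worked example of this weighted scoring, with goal names and numbers of our own invention:

```python
# Hypothetical value function (weights) and per-goal scores; the overall
# score is the weighted sum described above.
weights = {"reach_quickly": 0.5, "investigate_devices": 0.2,
           "explore_area": 0.1, "gather_resources": 0.2}
scores = {"reach_quickly": 80, "investigate_devices": 40,
          "explore_area": 60, "gather_resources": 30}

overall = sum(weights[g] * scores[g] for g in weights)
print(overall)  # 0.5*80 + 0.2*40 + 0.1*60 + 0.2*30 = 60.0
```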

Without knowing the human commander's value orientation, the robot scout team must quickly infer it. At every step, each member of the team proposes a next action plan, from which the human commander chooses. To assist the commander's decision, the scout team explains the rationale behind the proposed courses of action. Combining the commander's feedback, the past interaction history, and the current map, the team adjusts its estimate of the commander's current values and acts accordingly.

Figure 3: Design of the Recon Exploration game. Timeline (A) shows the events within one game round, from the robots receiving environmental signals to their next move. Timelines (B) and (C) depict the mental-state changes of the robot and the user, respectively.

4. Real-time two-way value alignment model

To estimate the human commander's value function during communication, we integrate two levels of theory of mind into the computational model. Level-1 theory of mind encodes the cooperation assumption: given a cooperative human commander, robot proposals that the commander accepts are more likely to be consistent with the correct value function. Level-2 theory of mind further incorporates the user's pedagogical behavior into the model: feedback that brings the robot closer to the commander's true value is more likely to be chosen than other feedback. Modeling the commander's pedagogical inclination requires this higher level of theory of mind. Combining the two levels, we write the human commander's decision function as a distribution parameterized by the value function and develop a new learning algorithm for it.
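As a rough sketch of how the two levels might combine into one decision function, the snippet below scores each feedback option by its task utility (level-1 cooperation) and by how much it would shrink the robot's estimation error (level-2 pedagogy), then normalizes with a softmax. The functional form and temperatures are our assumptions, not the paper's algorithm.

```python
import numpy as np

def feedback_distribution(utilities, gaps, beta1=2.0, beta2=1.0):
    """Assumed decision function over feedback options: level-1 cooperation
    favors high-utility options; level-2 pedagogy favors options that most
    reduce the gap between the robot's estimate and the true value."""
    logits = beta1 * np.asarray(utilities) - beta2 * np.asarray(gaps)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Two hypothetical feedback options: the first is slightly better for the
# task, the second is far more informative for teaching the robot.
print(feedback_distribution(utilities=[0.6, 0.5], gaps=[0.8, 0.1]).round(3))
# The pedagogical term makes the second, more teachable option more likely.
```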

It is worth noting that a comparable but distinct approach to our human-machine collaboration framework is inverse reinforcement learning, which aims to recover the underlying reward function from pre-recorded expert demonstrations in a passive learning setting. In contrast, in our setting the reconnaissance robots are designed to learn interactively from the sparse supervision given by a human commander. More importantly, our design requires the robots to infer the human commander's value proactively and in real time during the mission. In addition, to complete the cooperation, the reconnaissance robots must not only quickly understand the commander's intentions but also clarify the basis of their own decisions to ensure smooth communication with the commander throughout the game. Overall, the robots' task is to align values by inferring the human user's mental model, proactively making proposals, and evaluating the user's feedback. This requires the machines to build complex mental models of human users and to update those models on the fly.

5. Summary

The XAI system proposed in this paper successfully demonstrates the feasibility of the bidirectional human-machine value alignment framework. From the listener's perspective, robots in all three explanation groups aligned with the user's values quickly, correctly ranking the relative importance of at least 60% of the goals by the time the game was 25% complete. From the expresser's perspective, by providing appropriate explanations the robot can convey its intentions and help humans better perceive its values: when the machine provided "full explanations", human and robot values became unified once the game was about 50% complete, whereas with only "brief explanations" the game had to reach about 75% completion before values were unified.
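One simple way to quantify "correctly ranking the importance of goals" is sketched below, under our own metric choice (not necessarily the paper's): the fraction of goal pairs that the estimated weights order the same way as the true weights.

```python
from itertools import combinations

def ranking_agreement(true_w, est_w):
    """Fraction of goal pairs ordered identically by both weight vectors."""
    pairs = list(combinations(range(len(true_w)), 2))
    agree = sum((true_w[i] - true_w[j]) * (est_w[i] - est_w[j]) > 0
                for i, j in pairs)
    return agree / len(pairs)

true_w = [0.50, 0.25, 0.10, 0.15]  # hypothetical true value function
est_w = [0.40, 0.20, 0.15, 0.25]   # robot's mid-game estimate
print(ranking_agreement(true_w, est_w))  # 5 of 6 pairs agree ~= 0.83
```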

From these two perspectives we obtained convincing evidence that the bidirectional value alignment process was achieved. Specifically:

1. By receiving feedback from humans, the robot gradually updates its value function to be consistent with the human's values;

2. By continuously interacting with the robot, human users gradually develop an accurate perception of the system's capabilities and intentions. Even though the robot system's values were not yet unified with the human user's in the first half of the game, the user's ability to assess the robot's values could still improve.

Eventually, when the robot's values stabilize, so does the user's estimate of them. The convergence of the robot's estimate of the user's value toward the user's true value, together with the convergence of the user's estimate of the robot's value toward the robot's current value, forms a bidirectional value alignment anchored by the user's true values.

Overall, we propose a bidirectional human-machine value alignment framework and verify its feasibility with an XAI system. The system demonstrates that when theory of mind is integrated into the machine's learning module and appropriate explanations are provided to the user, humans and robots can achieve mental-model alignment through real-time interaction. By promoting the formation of shared mental models between humans and machines, our computational framework offers a new answer to this paper's core question: what should ideal human-machine collaboration look like?

In this game task, our work focuses on modeling the mind with values and intentions at its core. Aligning these values greatly helps humans and machines establish common ground for task-oriented collaboration and scale to more complex scenarios and tasks. Our work is therefore a first step toward more general mental-model alignment in human-machine collaboration. In future work, we plan to explore which factors can further enhance human users' trust (for example, allowing counterfactual queries to the robot), verify the impact of "alignment" on task performance, and apply our system to environments with more complex tasks and value functions.
