

Reprinted from AI Technology Review

Author | Li Mei

Editor | Chen Caixian


Information capacity = amount of information / amount of data

In today's data-driven artificial intelligence research, the information provided by single-modal data can no longer meet the needs of improving machine cognitive capabilities. Just as humans use vision, hearing, smell, touch, and other sensory information to perceive the world, machines also need to simulate human synesthesia to raise their cognitive level.

At the same time, with the explosion of multimodal spatiotemporal data and the growth of computing power, researchers have proposed a large number of methods to meet these increasingly diverse demands. However, current multimodal cognitive computing is still limited to imitating human surface-level abilities and lacks a theoretical basis at the cognitive level. Faced with more complex intelligent tasks, the intersection of cognitive science and computing science has become inevitable.

Recently, Professor Li Xuelong of Northwestern Polytechnical University published the article "Multi-Modal Cognitive Computing" in the journal SCIENTIA SINICA Informationis (Chinese Science: Information Science). Based on "information capacity", the paper establishes an information-transfer model of the cognitive process, puts forward the view that "multimodal cognitive computing can improve the information extraction capability of machines", and theoretically unifies the various tasks of multimodal cognitive computing.

Li Xuelong believes that multimodal cognitive computing is one of the keys to achieving general artificial intelligence and has broad application prospects in fields such as "Vicinagearth Security". The article explores a unified cognitive model shared by humans and machines, offering inspiration for research that advances multimodal cognitive computing.


Citation: Xuelong Li, "Multi-Modal Cognitive Computing," SCIENTIA SINICA Informationis, DOI: 10.1360/SSI-2022-0226

Li Xuelong is a professor at Northwestern Polytechnical University. He focuses on the intelligent acquisition, processing, and management of high-dimensional data and the relationships among them, and applies this work in systems such as "Vicinagearth Security". In 2011 he was elected an IEEE Fellow, and he was the first scholar from mainland China elected to the Executive Council of the Association for the Advancement of Artificial Intelligence (AAAI).

AI Technology Review summarized the key points of the article "Multimodal Cognitive Computing" and had an in-depth conversation with Professor Li Xuelong on this topic.

1 Machine cognitive ability lies in the information utilization rate

Based on information theory, Li Xuelong proposed that multimodal cognitive computing can improve the machine's information extraction ability, and modeled this view from a theoretical perspective (as follows).


First of all, we need to understand how humans extract event information.

In 1948, Claude Shannon, the founder of information theory, proposed the concept of "information entropy" to represent the uncertainty of a random variable: the smaller the probability of an event, the greater the amount of information its occurrence provides. That is to say, for a given cognitive task T, the amount of information brought by the occurrence of event x is inversely related to its probability p(x):

I(x) = −log p(x)

Information is transmitted with various modalities as carriers. Assuming the event space X is a tensor over perceptual modality (m), space (s), and time (t), the amount of information an individual obtains from the event space can be defined as:

I(X) = Σ_m Σ_s Σ_t a(m, s, t) · I(x(m, s, t)), where a(m, s, t) is the attention allocated to modality m at location s and time t.

Human attention within a given range of time and space is limited (assume it totals 1). Thus, when the event space changes from a single modality to multiple modalities, humans do not need to constantly readjust their attention; instead, they focus it on unknown event information so as to obtain the maximum amount of information:

max I(X), subject to Σ_m Σ_s Σ_t a(m, s, t) = 1

This shows that the more modalities the event space contains, the more information the individual can obtain, and the higher the cognitive level.
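To make these relations concrete, here is a minimal Python sketch, assuming made-up event probabilities, modality names, and attention weights (none of which come from the paper): it computes the Shannon self-information of each event and the attention-weighted information an observer obtains from a small multimodal event space.

```python
import math

# Hypothetical event space: per-modality events with assumed probabilities p(x).
# Self-information follows Shannon: I(x) = -log2 p(x), measured in bits.
event_space = {
    "vision": [0.5, 0.25],   # two visual events and their probabilities
    "audio":  [0.125],       # one audio event
    "touch":  [0.125],       # one tactile event
}

def self_information(p):
    """Shannon self-information of an event with probability p, in bits."""
    return -math.log2(p)

# Limited attention: weights over (modality, event index) that sum to 1.
attention = {
    ("vision", 0): 0.4,
    ("vision", 1): 0.2,
    ("audio", 0): 0.2,
    ("touch", 0): 0.2,
}
assert abs(sum(attention.values()) - 1.0) < 1e-9  # the attention budget is 1

# Attention-weighted information obtained from the multimodal event space.
total_info = sum(
    w * self_information(event_space[m][i]) for (m, i), w in attention.items()
)
print(f"Information obtained: {total_info:.3f} bits")
```

Rare (low-probability) events carry more self-information, so shifting the limited attention budget toward them raises the total obtained.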


So, for machines, does obtaining a larger amount of information bring them closer to the human cognitive level?

Not necessarily. To measure the cognitive ability of a machine, Li Xuelong, drawing on the concept of "information capacity", expressed the process of extracting information from the event space as follows, where D is the amount of data in the event space X.

Information capacity = I(X) / D

Thus, the cognitive ability of a machine can be defined as its ability to obtain the maximum amount of information from a unit of data. In this way, the cognitive learning of humans and machines is unified as a process of improving information utilization.
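Continuing the toy illustration, the information-capacity ratio can be computed for two hypothetical acquisition strategies; the numbers below are placeholders chosen only to show that the definition rewards extracting more information per unit of data.

```python
def information_capacity(information_bits, data_bytes):
    """Information capacity: information obtained per unit of data."""
    return information_bits / data_bytes

# Hypothetical comparison: one high-volume single modality vs. a more compact
# multimodal combination that captures complementary cues (made-up numbers).
single_modal = information_capacity(information_bits=12.0, data_bytes=4096)
multi_modal  = information_capacity(information_bits=18.0, data_bytes=3072)

print(f"single-modal capacity: {single_modal:.5f} bits/byte")
print(f"multimodal capacity:   {multi_modal:.5f} bits/byte")
# A higher ratio means more information extracted per unit of data,
# which is the sense in which machine cognitive ability is measured here.
```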

So, how can we improve machines' utilization of multimodal data and thereby improve their multimodal cognitive computing capabilities?

Just as human cognitive improvement is inseparable from association, reasoning, induction, and deduction about the real world, improving the cognitive ability of machines also requires starting from the three corresponding aspects of association, generation, and collaboration, which are also the three basic tasks of multimodal analysis today.

2 The three main lines of multimodal cognitive computing

The three tasks of multimodal association, cross-modal generation, and multimodal collaboration each emphasize different aspects of processing multimodal data, but their common core is to use as little data as possible to maximize the amount of information obtained.


Multimodal association


How do contents from different modalities correlate and correspond to one another at the spatial, temporal, and semantic levels? Answering this is the goal of the multimodal association task and the prerequisite for improving information utilization.


The alignment of multimodal information at the spatial, temporal, and semantic levels is the basis of cross-modal perception, and multimodal retrieval is the real-life application of that perception. For example, relying on multimedia search technology, we can enter words or phrases to retrieve relevant video clips.


Figure Note: Multimodal alignment diagram
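As a sketch of how such text-to-video retrieval is commonly organized (the embedding functions below are random placeholders standing in for trained encoders, not models from the paper), the query and candidate clips are mapped into a shared embedding space and ranked by cosine similarity:

```python
import numpy as np

def embed_text(query: str) -> np.ndarray:
    """Placeholder text encoder; a real system would use a trained model."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

def embed_video(clip_id: str) -> np.ndarray:
    """Placeholder video encoder producing embeddings in the same space."""
    rng = np.random.default_rng(abs(hash(clip_id)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

def retrieve(query: str, clip_ids: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank video clips by cosine similarity to the text query."""
    q = embed_text(query)
    scored = [(cid, float(q @ embed_video(cid))) for cid in clip_ids]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

if __name__ == "__main__":
    clips = ["clip_001", "clip_002", "clip_003", "clip_004"]
    print(retrieve("a dog catching a frisbee on the beach", clips))
```

The retrieval quality hinges entirely on how well the two encoders align the modalities in the shared space, which is exactly the association problem described above.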


Inspired by the human cross-sensory perception mechanism, AI researchers have applied computable models to cross-modal perception tasks such as lip reading and missing-modality generation, further assisting cross-modal perception for people with disabilities. In the future, the main application scenarios of cross-modal perception will no longer be limited to perceptual substitution for people with disabilities, but will combine more deeply with human cross-sensory perception to raise the level of human multisensory perception.


Nowadays, multimodal digital content is growing rapidly, and application demands for cross-modal retrieval are becoming ever richer, which presents new opportunities and challenges for multimodal association learning.


Cross-modal generation


When we read a plot in a novel, the corresponding picture naturally appears in our minds. This is a manifestation of the human ability for cross-modal reasoning and generation.


Similarly, in multimodal cognitive computing, the goal of cross-modal generation is to give machines the ability to generate entities in an unknown modality. From the perspective of information theory, this task becomes one of improving machine cognitive ability over multimodal information channels. There are two routes: one is to increase the amount of information, that is, cross-modal synthesis; the other is to reduce the amount of data, that is, cross-modal conversion.


The cross-modal synthesis task aims to enrich existing information while generating new modal entities, thereby increasing the amount of information. Taking text-based image generation as an example, early approaches mainly relied on entity association and often depended heavily on retrieval libraries. Today, image generation technology is mainly based on generative adversarial networks and can already produce realistic, high-quality images. However, face image generation remains very challenging, because at the information level even slight changes of expression may convey a very large amount of information.
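To illustrate the GAN-based text-to-image route mentioned above, the following PyTorch sketch shows the basic shape of a text-conditioned generator and discriminator. The layer sizes, image resolution, and text-embedding dimension are arbitrary assumptions for illustration, not the models discussed in the paper.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Maps a noise vector plus a text embedding to an image tensor."""
    def __init__(self, noise_dim=100, text_dim=128, img_channels=3):
        super().__init__()
        self.img_channels = img_channels
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512), nn.ReLU(),
            nn.Linear(512, img_channels * 32 * 32), nn.Tanh(),
        )

    def forward(self, noise, text_emb):
        x = torch.cat([noise, text_emb], dim=1)
        return self.net(x).view(-1, self.img_channels, 32, 32)

class TextConditionedDiscriminator(nn.Module):
    """Scores whether an image looks real and matches the text embedding."""
    def __init__(self, text_dim=128, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_channels * 32 * 32 + text_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, image, text_emb):
        x = torch.cat([image.flatten(1), text_emb], dim=1)
        return self.net(x)

# Minimal usage: generate fake images and score them against the text condition.
G, D = TextConditionedGenerator(), TextConditionedDiscriminator()
noise = torch.randn(4, 100)
text_emb = torch.randn(4, 128)        # stand-in for a learned text encoder output
fake_images = G(noise, text_emb)
scores = D(fake_images, text_emb)     # higher score = judged more real/matching
print(fake_images.shape, scores.shape)
```

Training would alternate discriminator and generator updates with an adversarial loss, which is omitted here for brevity.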


At the same time, converting complex modalities into simpler ones and finding more concise forms of expression can reduce the amount of data and improve the ability to obtain information.


Figure Note: Common cross-modal conversion tasks


As a model combining the two major technologies of computer vision and natural language processing, cross-modal conversion can greatly improve online retrieval efficiency, for example by generating a brief natural language description for a lengthy video, or generating related audio signals for a piece of video content.


Currently, the two mainstream generative models, the VAE (variational autoencoder) and the GAN (generative adversarial network), each have their own strengths and weaknesses. Li Xuelong believes that the VAE relies heavily on assumptions while the GAN suffers from poor interpretability, so the two need to be combined sensibly. It is particularly important to note that the challenge of multimodal generation lies not only in generation quality, but also in the semantic and representational gap between different modalities. How to perform knowledge reasoning across this semantic gap is a difficulty that remains to be solved.


Multimodal collaboration


Induction and deduction play an important role in the human cognitive mechanism. We can summarize, integrate, and jointly interpret multimodal perceptions such as sight, hearing, smell, and touch, and use them as the basis for decision-making.


Similarly, multimodal cognitive computing also requires coordinating data from two or more modalities so that they cooperate to complete more complex multimodal tasks and improve accuracy and generalization. From the perspective of information theory, its essence is the mutual fusion of multimodal information to achieve information complementarity, which amounts to optimizing attention.


First of all, modal fusion addresses the differences among multimodal data caused by data format, spatio-temporal alignment, noise interference, and so on. At present, rule-based fusion methods include serial fusion, parallel fusion, and weighted fusion, while learning-based fusion methods include attention-mechanism models, transfer learning, and knowledge distillation.
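The sketch below contrasts the two families on a toy example: rule-based weighted fusion with fixed coefficients versus a learned attention-based fusion module. The feature dimensions, modality names, and weights are arbitrary placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

def weighted_fusion(features: dict[str, torch.Tensor], weights: dict[str, float]) -> torch.Tensor:
    """Rule-based fusion: a fixed weighted sum of per-modality feature vectors."""
    return sum(weights[m] * f for m, f in features.items())

class AttentionFusion(nn.Module):
    """Learning-based fusion: weights are predicted from the features themselves."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar score per modality feature

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        stacked = torch.stack(features, dim=1)            # (batch, M, dim)
        attn = torch.softmax(self.score(stacked), dim=1)  # normalized over modalities
        return (attn * stacked).sum(dim=1)                # (batch, dim)

batch, dim = 2, 64
feats = {"vision": torch.randn(batch, dim),
         "audio":  torch.randn(batch, dim),
         "text":   torch.randn(batch, dim)}

fixed = weighted_fusion(feats, {"vision": 0.5, "audio": 0.3, "text": 0.2})
learned = AttentionFusion(dim)(list(feats.values()))
print(fixed.shape, learned.shape)
```

The rule-based variant is simple and predictable, while the attention-based variant can adapt its weighting to each input, at the cost of needing training data.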


Secondly, after multimodal information fusion is completed, the modal information needs to be learned jointly, helping the model mine the relationships between modal data and establish auxiliary or complementary connections between modalities.


Through joint learning, on the one hand, the performance of individual modalities can be improved, as in applications such as vision-guided audio, audio-guided vision, and depth-guided vision; on the other hand, tasks that were difficult to accomplish with a single modality can be tackled. Complex affective computing, audio-matched face modeling, and audio-visually guided music generation are all future directions for multimodal cognitive computing.


3 Opportunities and Challenges

In recent years, deep learning technology has greatly promoted the development of multimodal cognitive computing in theory and engineering. However, nowadays, application demands are becoming more diversified, and data iteration speed is accelerating, which poses new challenges for multimodal cognitive computing and brings many opportunities.


We can look at improving machine cognitive capabilities at four levels:


At the data level, traditional multimodal research separates data collection and calculation into two independent processes, which has disadvantages. The human world is composed of continuous analog signals, while machines process discrete digital signals, and their conversion process will inevitably lead to information deformation and loss.


In this regard, Li Xuelong believes that intelligent optoelectronics, represented by optical neural networks, can offer solutions. If computation on multimodal data can be completed at the sensing stage, the machine's information-processing efficiency and level of intelligence will be greatly improved.


At the information level, the key to cognitive computing is the processing of high-level semantics in information, such as positional relationships in vision, image style, and musical emotion. At present, multimodal tasks are limited to interactions between simple goals and scenarios and cannot understand deep logical or subjective semantics. For example, a machine can generate an image of a flower blooming on the grass, but it cannot understand the common sense that flowers and plants wither in winter.


Therefore, building a bridge between complex logical and affective semantic information across different modalities, and establishing a machine-specific measurement system, is a major trend for future multimodal cognitive computing.


At the level of the fusion mechanism, how to optimize a multimodal model composed of heterogeneous components is a current difficulty. At present, multimodal cognitive computing mostly optimizes the model under a single unified learning objective. This strategy lacks targeted adjustment for the model's heterogeneous components, leaving existing multimodal models substantially under-optimized; progress is needed on several fronts, including multimodal machine learning and optimization theory.
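One common remedy in practice, shown here only as a hedged sketch of a generic technique rather than a method prescribed by the paper, is to give each heterogeneous component its own optimization settings, for example separate learning rates per parameter group:

```python
import torch
import torch.nn as nn

# Hypothetical multimodal model with heterogeneous branches (sizes are arbitrary).
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(2048, 256),
    "audio_encoder":  nn.Linear(128, 256),
    "fusion_head":    nn.Linear(256, 10),
})

# Instead of one uniform setting, assign each component its own learning rate
# (values are placeholders) so that no branch is left under-optimized.
optimizer = torch.optim.AdamW([
    {"params": model["vision_encoder"].parameters(), "lr": 1e-4},
    {"params": model["audio_encoder"].parameters(),  "lr": 5e-4},
    {"params": model["fusion_head"].parameters(),    "lr": 1e-3},
])

x_vis, x_aud = torch.randn(8, 2048), torch.randn(8, 128)
logits = model["fusion_head"](model["vision_encoder"](x_vis) + model["audio_encoder"](x_aud))
loss = logits.pow(2).mean()   # dummy loss, for illustration only
loss.backward()
optimizer.step()
print(loss.item())
```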


At the task level, a machine's cognitive learning method varies with the task. We need to design learning strategies driven by task feedback to improve the ability to solve multiple related tasks.


In addition, given the drawbacks of the current "bystander" learning paradigm, in which machines learn about the world only from images, text, and other given data, we can draw on the research results of cognitive science. For example, embodied AI is a potential solution: agents interact with the environment multimodally and evolve continuously, forming the ability to solve complex tasks.

4 Dialogue with Li Xuelong

AI Technology Review: In artificial intelligence research, why should we pay attention to multimodal data and multimodal cognitive computing? What benefits and obstacles does the growth of multimodal data bring to model performance?


Li Xuelong: Thank you for your question. The reason why we pay attention to and study multimodal data is, on the one hand, because artificial intelligence is essentially data-dependent, and the information that a single modal data can provide is always very limited, while multimodal data can provide multi-level and multi-view information under the same task; on the other hand, it is because the objective physical world is multimodal, and the research on many practical problems cannot be separated from multimodal data, such as searching pictures through text, listening to sound and identifying objects, etc.


We analyze multimodal problems from the perspective of cognitive computing, starting from the essence of artificial intelligence, and by building a multimodal analysis system that can simulate human cognitive patterns, we hope that machines can perceive the surrounding environment intelligently like humans.


However, complex and interleaved multimodal information also brings a great deal of noise and redundancy, increasing the burden of model learning and in some cases making multimodal performance worse than single-modal performance, which poses greater challenges for model design and optimization.


AI Technology Review: From the perspective of information theory, what are the similarities between human cognitive learning and machine cognitive learning? What is the guiding significance of research on human cognitive mechanisms for multimodal cognitive computing? If there is a lack of understanding of human cognition, what difficulties will multimodal cognitive computing face?


Li Xuelong: Aristotle held that people's understanding of things begins with the senses, while Plato held that what is derived from the senses cannot be called knowledge.


Humans receive a large amount of external information from birth and gradually establish a cognitive system of their own through perception, memory, reasoning, and so on. The learning ability of machines is achieved by training on large amounts of data, mainly to find correspondences between perception and human knowledge. In Plato's sense, what machines learn is not knowledge. In our article, we invoked the concept of "information capacity" and tried to establish a cognitive connection between humans and machines starting from the ability to extract information.


Humans transmit multimodal information to the brain through perceptual channels such as sight, hearing, smell, taste, and touch, jointly stimulating the cerebral cortex. Psychological research has found that the combined action of multiple senses can produce cognitive learning modes such as "multisensory integration", "synaesthesia", "perceptual reorganization", and "perceptual memory". These human cognitive mechanisms have greatly inspired multimodal cognitive computing, for example multimodal collaboration, multimodal association, and cross-modal generation, and they have also given rise to typical machine analysis mechanisms such as local sharing, long short-term memory, and attention.


At present, the human cognitive mechanism is in fact not well understood. Lacking guidance from research on human cognition, multimodal cognitive computing will fall into the trap of data fitting, and we cannot judge whether a model has learned the knowledge people actually need. This is also a point of controversy in artificial intelligence today.


AI Technology Review: What evidence supports the view that "multimodal cognitive computing can improve the machine's information extraction ability" from the perspective of information theory?


Li Xuelong: This question can be answered from two aspects. First, multimodal information can improve the performance of a single modality on different tasks. A great deal of work has verified that adding sound information significantly improves the performance of computer vision algorithms, for example in target recognition and scene understanding. We also built an environmental camera and found that fusing multimodal information from sensors such as temperature and humidity can improve the camera's imaging quality.


Second, the joint modeling of multimodal information makes more complex intelligent tasks possible. For example, we did the work "Listen to the Image", encoding visual information into sound so that blind people can "see" the scene in front of them, which also demonstrates that multimodal cognitive computing helps machines extract more information.


AI Technology Review: In the multimodal association task, what are the interconnections between alignment, perception and retrieval?


Li Xuelong: The relationship among the three is in fact rather complex; in the article I only gave some preliminary views of my own. The premise for information from different modalities to become associated is that they jointly describe the same or a similar objective reality, but this association is difficult to determine when external information is complex or noisy. This requires first aligning the information from the different modalities and determining their correspondences; then, on the basis of alignment, perception from one modality to another can be realized.


This is like when we only see a person's lips moving yet feel as if we can hear what they are saying; this phenomenon rests on the correlation and alignment of visemes and phonemes. In real life, we have further applied such cross-modal perception to applications like search, retrieving product images or video content through text and realizing computable multimodal association applications.


AI Technology Review: Recently popular models such as DALL-E are examples of cross-modal generation. They perform well on text-to-image tasks, but their generated images still have great limitations in semantic relevance, interpretability, and so on. How do you think this problem should be solved? Where is the difficulty?


Li Xuelong: Generating images from text is an "imagination" task. People see or hear a sentence, understand its semantic information, and then rely on the memories in their brain to imagine the scene that fits best, creating a "sense of picture". At present, DALL-E is still at the stage of using statistical learning to fit and summarize large-scale datasets, which is what deep learning does best.


However, to really learn human "imagination", one also needs to consider human cognitive patterns in order to achieve "high-level" intelligence. This requires the cross-fusion of neuroscience, psychology, and information science, which is both a challenge and an opportunity. In recent years, many teams have produced excellent work in this regard. Exploring the computational theory of human cognitive models through multidisciplinary cross-fusion is also one of the directions our team is working on, and I believe it will bring new breakthroughs for "high-level" intelligence.


AI Technology Review: In your research work, how did you draw inspiration from cognitive science? What research do you pay special attention to in cognitive science?


Li Xuelong: "Ask how a channel can stay so clear: because fresh water keeps flowing in from its source." I often observe and think about interesting phenomena in my daily life.


Twenty years ago, I was browsing a web page with pictures of Jiangnan landscapes. When I clicked on the music on the page, I suddenly felt immersed in the scene. At that moment I began to think about the relationship between hearing and vision from a cognitive perspective. In the process of studying cognitive science, I learned about the phenomenon of "synaesthesia" and, combining it with my own research direction, completed an article entitled "Visual Music and Musical Vision". That was also the first time "synaesthesia" was introduced into the information field.


Later, I offered the first cognitive computing course in the information field and founded the IEEE SMC Cognitive Computing Technical Committee, trying to break down the boundary between cognitive science and computing science. At that time I also gave a definition of cognitive computing, which is the description on the technical committee's homepage today.


In 2002, I proposed the concept of "Information Capacity", the information-providing ability of a unit of data, in an attempt to measure the cognitive ability of machines. I was also honored to win the 2020 Tencent Science Exploration Award under the theme of "Multimodal Cognitive Computing".


To this day, I continue to follow the latest progress on synesthesia and perception. In nature there are many modalities beyond the five human senses, and there may even be potential modalities that are not yet understood. For example, quantum entanglement may suggest that the three-dimensional space we live in is just a projection of a higher-dimensional space. If that is true, then our means of detection are also limited. Perhaps these potential modalities can be exploited to allow machines to approach or even surpass human perception.


AI Technology Review: On the question of how to better combine human cognition with artificial intelligence, you proposed building a modal interaction network with the "metamodal" at its core. Could you introduce this view? What is its theoretical basis?


Li Xuelong: "Metamodal" is a concept derived from cognitive neuroscience. It refers to a type of brain organization that, when performing a certain function or representational operation, makes no specific assumptions about the sensory category of the input information yet still performs well.
The metamodal idea is not a sudden whim; it is essentially a hypothesis and conjecture formed by cognitive scientists after synthesizing phenomena and mechanisms such as cross-modal perception and neuronal plasticity. It also inspires us to construct efficient learning architectures and methods across different modalities to achieve more general modal representation capabilities.


AI Technology Review: What are the main applications of multimodal cognitive computing in the real world? Could you give an example?


Li Xuelong: Multimodal cognitive computing is a field very close to practical applications. Our team has done cross-modal perception work that encodes visual information into sound signals to stimulate the primary visual cortex; it has been applied to assist people with disabilities, helping blind people "see" the outside world. In daily life we also use multimodal cognitive computing technologies all the time; for example, short-video platforms fuse voice, image, and text tags to recommend videos that users may be interested in.


More broadly, multimodal cognitive computing is also widely used in the Vicinagearth Security scenarios mentioned in the article, such as intelligent search and rescue. Drones and ground robots collect all kinds of data such as sound, images, temperature, and humidity; these data need to be fused and analyzed from a cognitive perspective, and different search-and-rescue strategies carried out according to the situation on site. There are many similar applications, such as intelligent inspection and cross-domain remote sensing.


AI Technology Review: You mentioned in the article that multimodal tasks are currently limited to interactions between simple goals and scenarios, and that progress becomes difficult once deeper logical semantics are involved. So, is this an opportunity for a revival of symbolic artificial intelligence? What other feasible approaches could improve machines' ability to process high-level semantic information?


Li Xuelong: Russell believed that much of the value of knowledge lies in its uncertainty. Learning knowledge requires warmth, and the ability to interact with and receive feedback from the outside world. Most of the research we see at present is single-modal, passive, and oriented to given data, which can meet the research needs of some simple goals and scenarios. However, deeper logical or subjective semantics need to be fully explored in spatio-temporal, multi-dimensional contexts, with the support of more modalities and active interaction.


To achieve this goal, research approaches and methods may draw more on cognitive science. For example, some researchers have introduced the "embodied experience" hypothesis from cognitive science into the field of artificial intelligence, exploring new learning problems and tasks in settings where agents actively interact with the outside world and take in multiple modalities of information, and they have obtained some gratifying results. This also shows the role and positive significance of multimodal cognitive computing in connecting artificial intelligence with cognitive science.


AI Technology Review: Intelligent optoelectronics is also one of your research directions. You mentioned in the article that intelligent optoelectronics can bring exploratory solutions for the digitization of information. What can intelligent optoelectronics do in the perception and computation of multimodal data?


Li Xuelong: Optical and electrical signals are the main ways people understand the world. Most of the information humans receive every day comes from vision, and, going deeper, visual information mainly comes from light. The five human senses of sight, hearing, smell, taste, and touch likewise convert light, sound waves, pressure, odors, and other stimuli into electrical signals for higher-level cognition. Therefore, optics and electronics are the main channels through which humans perceive the world. In recent years, with the help of various advanced optoelectronic devices, we have been able to sense much more information beyond visible light and audible sound waves.


It can be said that optoelectronic devices are the front line of human perception of the world. The intelligent optoelectronics research we are engaged in is committed to exploring the integration of optoelectronic sensing hardware with intelligent algorithms: introducing physical priors into algorithm design, using algorithmic results to guide hardware design, forming mutual feedback between "sensing" and "computing", expanding the boundary of perception, and ultimately imitating or even surpassing human multimodal perception.


AI Technology Review: What research work are you doing in the direction of multimodal cognitive computing? What are your future research goals?


Li Xuelong: Thank you for the question. I currently focus mainly on multimodal cognitive computing in Vicinagearth Security. Security in the traditional sense usually refers to urban security, but human activity space has now expanded to low altitude, the ground, and underwater. We need to establish a three-dimensional security system in this vicinagearth space to carry out a series of practical tasks such as cross-domain detection and autonomous unmanned systems.


A big problem facing Vicinagearth Security is how to intelligently process the large amounts of multimodal data generated by different sensors, for example enabling machines to understand, from a human perspective, the targets observed by drones and ground monitoring equipment. This involves multimodal cognitive computing, as well as its combination with intelligent optoelectronics.


In the future, I will continue to study the application of multimodal cognitive computing in Vicinagearth Security, hoping to connect data acquisition with data processing and to make rational use of "positive-incentive noise" (Pi-Noise) in order to build a Vicinagearth Security system supported by multimodal cognitive computing and intelligent optoelectronics.


Reference link:

https://www.sciengine.com/SSI/doi/10.1360/SSI-2022-0226
