
Source of this article: brainnews

Computer vision is the "eye" of artificial intelligence and the core technology for perceiving the objective world. Since the beginning of the 21st century, the field of computer vision has flourished: theories and methods have emerged in large numbers, and remarkable results have been achieved on many core problems. To further promote the development of computer vision, CCF-CV organized RACV 2019 and invited many senior experts in the field to discuss the current state and future trends of selected topics. We have organized the discussion content by topic from the meeting record, preserving the arguments in their original form as much as possible. We hope this will stimulate brainstorming, generate a series of inspiring views and ideas, and promote the sustained development of the field of computer vision.

This issue's topic is "The development trend of computer vision in the next 5-10 years". The experts held in-depth discussions covering the history of computer vision, the limitations of existing research, future research directions, and research paradigms for vision.

Theme organizers: Lin Zhouchen, Liu Risheng, Kan Meina

Discussion time: September 27, 2019

Speakers: Zha Hongbin, Chen Xilin, Lu Huchuan, Liu Yebin, Zhang Guofeng

Attending guests [in speaking order]: Xie Xiaohua, Lin Zhouchen, Lin Wu, Shan Shiguang, Hu Zhanyi, Ji Rongrong, Wang Yizhou, Wang Jingdong, Wang Tao, Yang Ruigang, Zheng Weishi, Jia Yunde, Lu Jiwen, Wang Liang

Text compilation: Kan Meina, Liu Risheng

Opening: Shan Shiguang, Lin Zhouchen

Shan Shiguang: Last time, at the standing committee meeting of the Computer Vision Committee, at the initiative of Academician Tan, RACV tried a relatively small-scale format focused mainly on discussing future directions and open problems. This time, RACV hopes that the speakers will not present their own past work, but will instead share their views and judgments on each topic. Everyone can speak freely and argue without reservation. We will keep notes and recordings, but the final text will be published only after everyone confirms it.

Lin Zhouchen: The hope for RACV is that everyone will have some in-depth discussions and challenge each other. The first theme is the development trend of CV in the next 5-10 years. I hope that our seminar, especially on this topic, will be something like the Dartmouth Conference and produce some new ideas.

Invited topic talks


1. Zha Hongbin

The development trend of CV in the next five or ten years is difficult to predict; thinking too far ahead easily leads one astray. So today I will mainly talk, from the perspective of my own understanding, about what we should do next.

First of all, what is computer vision? I have given a relatively strict definition here: using computer technology to simulate, emulate, and implement the visual functions of biological organisms. But this definition does not fully settle the matter. It combines the two concepts of computer and vision without saying what computer and vision are. Everyone agrees on what a computer is. But what is vision? In fact, there is no universally accepted definition in the field of computer vision.

Let us first look at what research is currently done in the field of computer vision, starting from the keywords of this year's ICCV tracks. The largest areas are deep learning; recognition; segmentation, grouping, and shape; and so on. Are these areas vision? One could equally call them image processing, analysis, and understanding. The key question is: are we really doing vision? This is worth thinking about again. Take face recognition as an example. Face recognition systems can now process huge numbers of face images and videos, distinguishing hundreds of thousands or millions of people. They achieve this with big-data-driven methods, learned offline. However, in practical applications the recognition algorithms cope rather poorly with lighting, occlusion, and so on. Now look back at what capabilities human face recognition has. We have a strong faculty for recognizing faces, but we can only recognize a small number of them, such as relatives, friends, and colleagues. Beyond a certain range, it is difficult to recognize the faces of strangers: we can see that they are different, but cannot tell who is who. Second, people conduct active sample learning in real-life situations. The reason we know our relatives is that we live with them daily and establish various relationships; we actively select samples to learn from and use features at different levels. So although we recognize only small numbers of faces, we have a strong ability to resist interference. This, I think, is the difference between human face recognition and current machines: face recognition in human vision has its own distinctive characteristics, and it handles visual processing tasks in real environments well.

So what factors should visual processing in real environments consider? Besides intelligent machines such as computers and robots, there are two other key parts. The first is establishing connections with the external world and interacting with the environment through the visual interface; the second is that, when we talk about vision, the perceptual mechanisms of organisms provide us with much to build on. What we have to deal with is the openness of the real environment and the complexity of the three-dimensional world: many dynamic changes in the scene and a diversity of hierarchical structures.


On the other hand, what is the perceptual mechanism of organisms? It is a learning process, but a flexible one, not the fixed, offline way we learn now. In current machine learning, training is training and testing is testing; in human learning the two are not strictly divided. Biological perception has structural flexibility and requires hierarchical processing. It also has initiative, actively learning according to its purposes and tasks. At the same time, what we deal with in daily life is time-series data, processed incrementally. From this perspective, future computer vision research needs to integrate the characteristics of the real environment with the perceptual mechanisms of organisms. That would be closer to the original meaning of the word "vision".

So what can we consider? First and foremost, a learning problem. Deep learning is used a great deal nowadays, but it is only one part of pattern recognition, and there is still much room to mine for vision research. In other words, when we consider machine learning in computer vision, we should look not only at depth but also at network width, structural reconfigurability, and structural flexibility. We must study and understand different structural levels and bring the connections between different modules into the network. This is how the human brain is organized, building up from low-level visual features: it has many different functional structures inside, and these functional structures are plastic. Secondly, in addition to the commonly mentioned recognition functions, we should implement cognitive mechanisms such as memory and attention through learning. Some work has already been done in this regard; in the future, these mechanisms may be integrated into our systems as a core goal of computer vision. In addition, we should consider selecting the required samples for autonomous learning through interaction with the environment. Structural flexibility in the learning method should therefore be one of our goals.

Another point is that current computer vision still lacks the handling of dynamic scenes. Much of our work is done in static scenes; face recognition, for example, is also a static-scene task. Although we sometimes use videos, we do not deeply account for the dynamic characteristics of the whole scene. Tracking, detection, analysis, and behavior recognition and understanding of dynamic targets are being worked on, but they have not yet reached a systematic level. We should also focus more on things like the positioning of mobile sensors and the reconstruction and understanding of three-dimensional dynamic scenes. So I think dynamic vision is another important research direction for the future.

There is also active vision. Active vision combines perception with movement and control to form a closed loop. For a long time there was a research topic in computer vision called visual servoing, which sought to combine control and perception well. Part of perception serves task purposes, and part is perception itself; that is, we consider implementing perceptual functions from an active-control perspective to improve the adaptive ability of the perception system. Transfer learning, continual learning, and lifelong learning can all be applied. In addition, common sense, consciousness, and motivation, and the relationships among them, should also be considered. This means we need to raise vision to a conscious, controllable process.


If we combine the temporal aspect mentioned above with dynamic processing, we should pay more attention to online learning. We should not rely entirely on today's offline learning with only labeled data. Instead, we should predict and learn in a dynamic environment based on the characteristics of motion and of the dynamic data streams themselves. This can be combined with the aforementioned memory and attention mechanisms to ultimately realize an unsupervised online learning system. In this way, the characteristics and changes of the real environment can be taken into account and a new body of theory can be formed. Compared with today's deep learning and image processing, analysis, and understanding, such a theory would be closer to the concept of vision we are talking about.
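
To make this point about online, label-free learning concrete, here is a minimal sketch of a streaming prototype learner that updates on every incoming sample, with no separate training and testing phases. All names, dimensions, and the learning-rate schedule are illustrative assumptions, not anything from the talk:

```python
import numpy as np

def online_update(prototypes, counts, x, lr_floor=0.01):
    """Assign a streaming sample to its nearest prototype, then nudge
    that prototype toward it: no labels, no separate test phase."""
    k = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    counts[k] += 1
    lr = max(1.0 / counts[k], lr_floor)  # learning slows but never stops
    prototypes[k] += lr * (x - prototypes[k])
    return k

# Toy usage: features arrive one at a time, as from a video stream.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(5, 128))   # 5 hypothetical visual concepts
counts = np.ones(5)
for _ in range(1000):
    x = rng.normal(size=128)             # stand-in for a frame feature
    online_update(prototypes, counts, x)
```

The floor on the learning rate is one simple way to keep such a system permanently plastic, which is the property the talk emphasizes.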

2. Chen Xilin

Predicting the future is a very risky thing, so I can only treat this as an assigned essay topic. I prefer to look at it from a historical perspective. First, let us review the development history of computer vision. I divide the past few decades into the following stages. The first I call the enlightenment stage, whose iconic events were L. Roberts' 1963 work Machine Perception of Three-Dimensional Solids and the hand-eye project that Minsky arranged for several undergraduates in the summer of 1966. At this stage, estimates of computer vision were too optimistic; it was thought too easy and soluble soon, as S. Papert wrote in his report: "The summer vision project is an attempt to use our summer workers effectively in the construction of a significant part of a visual system." The important revelation of the enlightenment stage is that the problem is much harder than imagined.

The second stage began in the early 1970s; I call it reconstructionism, represented by D. Marr's computational framework for vision. The framework is well explained in Marr's book "Vision: A Computational Investigation into the Human Representation and Processing of Visual Information". Its core is to recover a three-dimensional representation of all objects. The basic pipeline is: image → primal sketch → observer-centered 2.5D sketch → object-centered 3D representation. This process looks beautiful, but there are two problems: first, whether such a process is necessary; second, if everything is to be restored to three dimensions, whether that is realistic in terms of perceptual measurement or computation. I personally believe the role of three dimensions in computer vision is limited. Work at this stage also led to reflection and debate on computer vision research in the early 1990s; if you are interested, see the discussion articles in CVGIP: Image Understanding, Volume 53, Issue 1 (1991).

The third stage I call classificationism: never mind whether the cat is black or white, as long as it catches mice. Face recognition, multi-class object recognition, and the like all became popular at this stage. Researchers used all kinds of methods, from invariant operators (such as SIFT and HOG) to classifiers (such as SVM and AdaBoost). This stage advanced the recognition problem considerably, but the last mile always seemed to be missing.

The latest stage I call the brute-force scaling stage, whose core is the revival of connectionism, thanks to ever-cheaper data and computing resources. This kind of method seems to solve all sorts of classification problems well. But behind these methods there is much that research still needs to pursue and reflect on. In the past we all looked for an elegant method, like aiming carefully at a target and hitting it at the lowest cost. Today's methods are more like saturation bombardment, and we seem to have entered such an era.

So what will the future hold? Judging from this history, computer vision, after decades of development, has entered the era of the barbarians. What does that mean? Today we speak of the popularity of artificial intelligence, yet almost all the examples used to showcase artificial intelligence are related to computer vision, and much so-called computer vision research today amounts to training a model with deep learning. That is the era of barbarians. So what is the problem with the barbarian era? Consider a related piece of history: the Roman Empire, which was destroyed by barbarians. Rome (more precisely, the Western Roman Empire) lasted about 500 years from founding to destruction. After the Western Roman Empire fell, there later arose the so-called Holy Roman Empire, which, in the words of Yuval Harari's "Sapiens: A Brief History of Humankind", was neither holy nor an empire. In its day, the Roman Empire pursued beauty in everything: the Colosseum, the aqueducts, the roads that all led to Rome. Early computer vision researchers likewise pursued beauty every day, mathematical beauty, physical beauty, just like the Roman Empire. Now the field really is like the Roman Empire: we have met the barbarians. Who are these barbarians? Deep learning. Just as the Romans faced barbarians who cared about wealth, we too face the question of how to choose in computer vision research. Of course, history can be surprisingly consistent: after occupying Rome the barbarians accomplished little; later came the Holy Roman Empire, and later still the Renaissance. Computer vision research, in my view, also needs a Renaissance. What would ours be? Our field is in exactly such a period, one that needs thought rather than a blind turn toward deep learning. Some research is already moving into a stage of pure brute force, like fighting a war by counting tanks and cannons, relying on data scale and GPU computing power. Where do we need to go next? That is what must be thought about in this barbarian era.

Predicting the next five to ten years is a hugely risky business, so I can only offer, through the history above and my own thinking, some possibilities for the future.

First, a future trend worth attention is the move from recognition to understanding; in the words of the ancients, from knowing what to knowing why. Computer vision has made significant progress in recognition over the past decade, but today's recognition is far from what we expect. For example, if you teach a system to recognize a cup, it will not think the cup has any relationship with water, nor that the cup has any other function; it is pure rote learning. Today's recognition is far from explainable. Speaking of interpretability, I think interpretability in computer vision should mean explaining conclusions rather than explaining network behavior; the former is more valuable. How can all this be explained? It should rely on some form of logical relationship, expressible through language, with language playing the role of a bridge. The language here is related to natural language yet different from it: it can be independent of our natural language, being the language in which the machine itself understands the world. In other words, we re-encode the objects in the world and then establish connections between objects, and between objects and the environment. With such relationships, from basic attributes to objects to the environment, it becomes possible to go from knowing what to knowing why. So I think the most important future trend is from recognition without knowledge support to understanding that requires knowledge support, or from purely bottom-up recognition to a broader computer vision with feedback and reasoning, inspired by knowledge. This is also the research direction I have paid special attention to in recent years.

Second, a trend worth attention is the limited demand for a sense of space. Why do animals need vision? Mainly for two needs: first, to find food and avoid being eaten by predators, that is, recognition; second, to avoid accidental harm (falls, collisions, and so on) caused by misjudging space. The most important role of vision is to solve these two things. So why speak of a limited demand for spatial sense? Our three-dimensional sense of space needs to be very accurate only at close range. At greater distances, most of the time we do not care about exact spatial locations but about relations such as occlusion and ordering. Moreover, trying to represent all objects in full three dimensions is difficult both in computational cost and in practical realization. Consider: an object one meter away can be reconstructed very accurately; for objects a hundred meters away or more, maintaining the same quantization accuracy in depth becomes a problem. The demand for spatial sense is thus limited, but I think this question is very important, especially for the near future.
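
This point about depth quantization can be made concrete with the standard stereo-vision error model, where depth uncertainty grows roughly with the square of distance: δZ ≈ Z²·δd/(f·b). A quick calculation, with the focal length, baseline, and disparity error as assumed illustrative values rather than figures from the talk:

```python
def stereo_depth_error(Z, f_px=1000.0, baseline_m=0.10, disp_err_px=0.5):
    """Stereo depth uncertainty: dZ ~ Z**2 * e_d / (f * b)."""
    return (Z ** 2) * disp_err_px / (f_px * baseline_m)

for Z in (1.0, 10.0, 100.0):
    print(f"Z = {Z:>5.0f} m  ->  depth error ~ {stereo_depth_error(Z):8.3f} m")
# 1 m -> 0.005 m;  10 m -> 0.5 m;  100 m -> 50 m: the quadratic blow-up
# is why exact metric depth only matters, and is only available, up close.
```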

The third trend worth attention is the combination of different modalities, the so-called unity of intelligence. A person's intelligence is inseparable from sharp ears and keen eyes. Modality here is not limited to audio and vision; it can also include different two-dimensional and three-dimensional visual sensing information. Biological perception has never relied on a single modality alone. One problem that multimodal perception must solve is alignment and causality across modalities. When information obtained from multiple modalities coexists, spatio-temporal alignment is a major challenge. Another issue related to alignment is causality. Although we want causality, most of the time what we get is only correlation. Two phenomena may both be caused by a third factor, just as the discharge between clouds produces both lightning and thunder: the two are correlated, but it is by no means the lightning that causes the thunder. In most cases I prefer to explore associations rather than causality; especially in data-driven models, it is hard to find causality without the underlying mechanism. But the combination and association of different modalities is an important trend in future computer vision research.
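
The cloud-discharge example can be reproduced numerically: a hidden common cause makes two signals almost perfectly correlated even though neither causes the other. A minimal simulation, with all numbers illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
discharge = rng.normal(size=10_000)                 # hidden common cause
lightning = discharge + 0.1 * rng.normal(size=10_000)
thunder = discharge + 0.1 * rng.normal(size=10_000)

r = np.corrcoef(lightning, thunder)[0, 1]
print(f"corr(lightning, thunder) = {r:.3f}")
# ~0.99, yet neither variable causes the other: both inherit the
# correlation from the shared cause, which is the speaker's point.
```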

The fourth trend needing attention is active vision. "Active" means incorporating feedback mechanisms into the visual system, so that choice becomes possible. If vision exists only in a passive, stand-alone form, the accuracy, resolution, and processing capability required for perception all grow exponentially. Thanks to active selection mechanisms, biological vision strikes a good balance among field of view, resolution, three-dimensional perception, and energy consumption. When computer vision research is no longer about verifying a single function, this balance of biological vision must also be considered in computer vision systems, realizing a closed loop from perception to response to behavior. From passive perception to active perception: this is an important trend from algorithms to systems. Vision's "seeing", "responding", and "acting" form a broad computer vision system, and through active "behavior" and exploration, the unity of "mind" and "body" is achieved. This is crucial for applied vision systems; for example, a pre-trained service robot can improve its overall intelligence through active exploration of a new environment. So I think this is an important trend for future vision application systems.

I have not said which specific algorithms are important. One thing I do want to say concerns deep learning: I think deep learning will become a basic component, like the registers, flip-flops, memory, and even CPUs in today's computers. As for trends, continuing the division above, computer vision will enter a knowledge-centric stage. With deep learning in widespread use, computer vision systems will no longer handle only single tasks. Active vision will play an important role in complex visual tasks. Through active response and exploration, the visual system will build and refine the correlational (causal) relationships through which it observes the world and understands the spatio-temporal relations and physical properties of objects in space. This is my personal prediction on today's question.

3. Lu Huchuan

The two previous speakers have already made some of the points I wanted to make, so some of mine may overlap with theirs.

From a theoretical perspective, I think the theory of deep learning seems to have stagnated somewhat. Specifically, looking at the development of backbones, there is basically nothing new in network structure design. On the other hand, some areas remain popular and are developing quickly. For example, the combination of natural language processing (NLP) and vision has made a lot of progress in recent years; the practical demands of chatbots and related technologies have driven great progress in VQA and similar tasks. In particular, the combination of graph-based methods and vision may become increasingly hot. Take the knowledge graph as an example: with some prior knowledge and some knowledge graphs, one may understand images or videos better. For example, given an image containing a cat and a fish tank, where the cat hugs the tank with its paws and stares into it: if we know the relationship between cats and fish from the knowledge graph, we can describe the scene well, a cat that wants to eat the fish in the tank, and thus better understand the relationships between targets in images or videos. So I think the combination of graph-based methods and vision will develop further in the next few years.
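
As a toy illustration of this kind of graph-assisted understanding, here is a sketch (the triples and function names are hypothetical) that chains one hop through a hand-built knowledge graph to relate two detected objects via an intermediate the pixels never show:

```python
# A hand-built toy knowledge graph of (subject, relation, object) triples.
KG = [
    ("cat", "preys_on", "fish"),
    ("fish", "lives_in", "fish_tank"),
]

def explain(detected):
    """Chain one hop through the graph to relate two detected objects
    via an unseen intermediate (the fish inside the tank)."""
    return [
        f"{s1} [{r1}] {o1} [{r2}] {o2}"
        for s1, r1, o1 in KG
        for s2, r2, o2 in KG
        if s1 in detected and o2 in detected and o1 == s2
    ]

print(explain({"cat", "fish_tank"}))
# ['cat [preys_on] fish [lives_in] fish_tank']
```

A detector alone outputs "cat" and "fish tank"; the graph supplies the fish, and with it the intent, which is precisely the extra understanding the prior knowledge buys.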

Second, I think three-dimensional vision will continue to develop rapidly. It has been emerging over the past couple of years and has become more popular now, no longer limited to three-dimensional scene reconstruction; recently some excellent work on detection and segmentation based on three-dimensional vision has appeared. Demand from embedded devices and mobile phones is growing: Huawei phones already have three cameras on the back, and even more (of the three, one is ultra-wide-angle, one is wide-angle, and one is a high-precision camera; the different resolutions can better imitate human vision). Since the world humans observe is itself three-dimensional, these large-scale applications on mobile devices will drive the further development of three-dimensional vision.

Third, when deep learning first arose, we usually said that handcrafted features had various disadvantages while deep learning was end-to-end. In fact, the network structures of deep learning are themselves handcrafted. Now that network architecture search (NAS) has risen, I think there may be more improvement here: conventional operations and modules can be integrated into the search so that the network structure is continuously optimized rather than designed by hand. I expect more progress in this direction over the next few years, including in the compression and pruning of network structures.
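
A minimal caricature of the NAS idea mentioned here: instead of hand-designing a network, sample candidate architectures from a fixed space of operations and keep the one that scores best. Everything in this sketch (the op list, the random scoring stand-in for actual training) is an illustrative assumption:

```python
import random

OPS = ["conv3x3", "conv5x5", "depthwise3x3", "maxpool3x3", "skip"]

def sample_architecture(n_layers=6):
    """One candidate: an operation chosen per layer from a fixed space."""
    return [random.choice(OPS) for _ in range(n_layers)]

def evaluate(arch):
    """Placeholder for 'train briefly, return validation accuracy'."""
    return random.random()  # random stand-in, not a real fitness signal

candidates = [sample_architecture() for _ in range(100)]
best = max(candidates, key=evaluate)
print("best candidate:", best)
```

Real NAS systems replace both pieces, with learned or evolutionary samplers and weight-sharing proxies for evaluation, but the search-over-ops structure is the same.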

Fourth, after the rise of deep learning we saw a large number of datasets born, all annotated with ground truth. Driven by them, deep networks achieved fairly good performance. At present, performance on most datasets is basically saturated, yet still far from the real problem. On the other hand, human understanding of the world is largely the result of small-sample learning, which differs from the current big-data-driven paradigm. So can we combine today's big-data-driven approach with human participation? There are now many papers studying active human participation, or human-in-the-loop learning, which combines people's active ground-truth annotation to guide rapid learning and even lift performance to a higher level.
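
A minimal human-in-the-loop sketch of the kind described here, using uncertainty sampling: the model repeatedly asks the "human" (played by a synthetic oracle) to label only the sample it is least sure about. Data, dimensions, and the oracle are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))            # unlabeled pool (synthetic)
w_true = rng.normal(size=16)
y_true = (X @ w_true > 0).astype(int)      # the oracle plays the human

labeled = list(range(20))                  # tiny seed set
model = LogisticRegression()
for rnd in range(5):
    model.fit(X[labeled], y_true[labeled])
    proba = model.predict_proba(X)[:, 1]
    query = int(np.argmin(np.abs(proba - 0.5)))  # least confident sample
    if query not in labeled:
        labeled.append(query)              # the "human" supplies one label
    print(f"round {rnd}: pool accuracy = {model.score(X, y_true):.3f}")
```

The point is the economics: each human label is spent where the model is most uncertain, rather than on yet another easy example.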

Fifth, video understanding has been developing for several years, and recently both the demand and the depth of the work have grown. Since image-based tasks may hit a ceiling after reaching a certain level, with no more tricks available, more and more attention is turning to video understanding of all kinds, including video summarization, video scene classification, advertisement recognition, platform logo recognition, and so on. With so many applications, I think this area will see great development in the next few years.

In terms of topics, I think there will be more application areas in the future. As Teacher Chen said, the era of barbarians has arrived: people are very enthusiastic about vision research, not only in academia; industry also has enormous demand. I therefore think deep-learning-based vision will develop in depth across industries. For example, a couple of days ago a company raised a requirement: given a shoe print, identify which criminal suspect stepped there, that is, footprint identification. Further, they want to use the footprint to determine what the upper of the shoe looks like and what brand it is, then search and compare against a database using these clues, and finally search videos for the suspect, that is, the person wearing such shoes. Step by step, this forms a chain of vision problems from source to end, and industry demand for such in-depth development is practically unlimited. Many things in vision that I had never anticipated are constantly improving. Two days ago I attended an industrial robot exhibition and saw a robot picking up parcels. We all know couriers deliver huge numbers of packages of every kind. After a truckload of packages arrives, can a robot sort them? At the exhibition there was such a robot: it automatically identifies what kind of package it is and what its three-dimensional surface is like. Because every package sits at a completely different angle, the robot adjusts its arm to the normal direction of the package's three-dimensional surface in order to pick it up by suction. Under the practical demands of different industries, I believe vision technologies such as segmentation and three-dimensional modeling will quickly develop in depth across many sectors.

In addition, I think there will be great progress in medical imaging. Medical imaging work today is mostly about detecting various diseases. Yesterday I talked with a medical unit that provides a large platform whose ultimate goal is to comprehensively judge a patient's disease based on information from different modalities: not only medical imaging but also other examination results. This is really cross-modal fusion, including image annotation, medical-case annotation, and so on, all of which will bind medical imaging and vision ever more closely together.

Finally, 5G is not only fast with large capacity; it actually brings a broader prospect to computer vision and AI, especially for autonomous vehicles. Several speakers just mentioned three-dimensional maps. After communicating with China Mobile, we found that their high-precision maps can be transmitted in real time over 5G bandwidth, down to centimeter-level detail of road curbs. So I think 5G+AI will bring huge opportunities to the development of vision-related fields. The above is my understanding of visual development trends for the next 5-10 years.

4. Liu Yebin

I will mainly talk about some ideas around the development of three-dimensional vision, virtual reality, and artificial intelligence. Virtual reality has been developing relatively steadily since it became popular in 2016. The trend of three-dimensional vision is to reconstruct visual information and provide three-dimensional content for virtual reality; this is three-dimensional reconstruction. Conversely, three-dimensional virtual reality can generate a great deal of data through realistic rendering to serve vision problems: many vision problems are data-driven, and how do we get data? More and more of it is obtained through three-dimensional engines. The research objects of computer vision fall into several types: outdoor scenes, indoor scenes, human faces, hands, as well as medical and biological objects. People are the core object of computer vision, so I will mainly use the human as the object of visual research to illustrate the development trend of computer vision.

From the perspective of research on people, virtual reality has three goals, the three I's: Immersion, Interaction, and Imagination. All involve the relationship between virtual humans (AI, machines, and so on) and real people. First, virtual humans should be visually realistic in appearance. In the future, virtual humans, whether physical robots or ones stored in computers, will tend to approach real people, making interaction friendlier; this goal is essentially three-dimensional reconstruction of the human body. The second element is human-machine interaction: virtual humans must be able to perceive the behavior of real people, including gesture recognition, behavior recognition, emotion, and other forms of understanding. Finally, the virtual human needs to respond to the scene intelligently, handling the next step according to your behavior, to ensure that a truly lifelike virtual human is generated.

Overall, intelligent modeling technology for virtual reality is listed among the eight key common technologies in the New Generation Artificial Intelligence Development Plan, which focuses on breakthroughs in behavior modeling for intelligent virtual objects, improving the sociality, diversity, and interactive realism of intelligent objects' behavior in virtual reality, and achieving the organic combination and efficient interaction of virtual reality and augmented reality technologies with artificial intelligence. The focus in this definition is behavior modeling: behavior must approach human intelligence to achieve interactive realism. For modeling the human body, the current goals are: first, accurate reconstruction; second, large-scale capture; third, portability (reconstruction from a single image, feasible on mobile phones); fourth, speed sufficient for interactive response; fifth, a major current trend, modeling results that carry semantic information, that is, semantic modeling, including clothing, faces, hair, and so on; and sixth, intelligent generation, that is, reconstruction results that can be animated realistically. Existing three-dimensional reconstruction technology has difficulty meeting all six requirements, so there is still much research to do around these goals.

One of the main purposes of human body reconstruction is holographic communication. Here is a demonstration of Microsoft's holoportation system, which achieves real-time, dynamic three-dimensional reconstruction of the human body under multiple cameras. The drawback of this system is that it requires active light, which raises system complexity and hurts real-time performance and convenience. Realizing real-time, high-precision three-dimensional dynamic reconstruction is also a future academic research trend. The single-depth-camera real-time reconstruction we developed has good speed and convenience, but its accuracy needs improvement. Three-dimensional human reconstruction from a single image, although its quality is not yet perfect, is in my view a very practical technical trend: from a single image we can easily reconstruct a three-dimensional model, and this will certainly shine in the future. Dynamic three-dimensional human reconstruction can even be achieved from a single RGB surveillance camera. It can be seen that three-dimensional reconstruction now outputs semantic information, and is replacing the traditional two-dimensional recognition problem of computer vision as a development trend.

The clothing industry accounts for 6% of GDP, and digital clothing is a very important application area for computer vision. Here is a showcase of some of our latest work: from a single video, even an online video, relatively high-quality three-dimensional modeling of clothing can be achieved through semantic modeling, applicable to VR and AR. It is achieved by decoupling the human body from the clothing and adding semantic information, including the decoupling of lighting and texture. This can support future applications such as changing body shapes and augmented-reality simulation; on the right is a reconstruction from an Internet video in which the color of the clothing can be changed, and so on. I think the trend for portable real-time three-dimensional reconstruction is to move gradually from low-level modeling, including voxels and meshes, to high-level modeling, including component-level reconstruction, separation of physical information, perception of physical dynamics, and extraction of feature spaces. With this high-dimensional information, models can be generated intelligently, respond to the environment, and be controlled and predicted. This includes research done in graphics, such as physical constraints that let a person move on, or even climb, virtual objects, introducing physical and intelligent responses into augmented reality.

Finally, some dynamic three-dimensional reconstruction problems of broader significance. For example, in medicine, three-dimensional perception of the surgical field is a modeling problem for non-rigid, complex dynamic scenes. This video shows liver surgery: the liver's shape can be tracked dynamically, and a 3D CT scan can be mapped onto it in real time in the dynamic scene, assisting medical care and surgery. There is also three-dimensional reconstruction of animal behavior in the life sciences; I think animals are a very big application area for future vision. We call it computational behavior, also called neurobehavior: it studies the mapping between behavior and neural activity by collecting and analyzing animal behavior data. Behavioral science on humans is very difficult because people differ greatly in their genes.

For animals, however, it is possible to make every mouse genetically identical, and in pigs and monkeys it is easier to control other varying factors, so this helps medicine, including gene regulation. There are related articles in Nature sub-journals, Nature Methods, and neuroscience venues. Many problems remain, including interaction of groups of subjects in natural environments, non-rigid capture, high-level semantic detection, three-dimensional recovery under mutual occlusion, and time-series analysis; many studies have been published in Nature. The research trend in three-dimensional reconstruction of animal behavior is to let animals live more freely in the experimental environment while being recorded, and to discover behavioral differences early after drug intervention. There are many such studies, including extracting higher-dimensional features. We are doing some of this research ourselves: here are four piglets, two of which have ALS. We shoot from multiple views, hoping to reconstruct the piglets' three-dimensional movements and, from the reconstructed motion, identify the behavioral characteristics of the ALS piglets, which will help future gene regulation and drug therapy.

5. Zhang Guofeng

Several teachers have already predicted development trends for the next 5-10 years from the perspective of computer vision in general. I would like to give my own views on the next 5-10 years from the angles I am familiar with: three-dimensional vision and AR.

My research direction is mainly SLAM, so I will first offer some trend forecasts from the SLAM perspective. We all know that visual SLAM depends heavily on features. SLAM technology will inevitably develop from low-level features such as points, lines, and surfaces toward high-level features such as semantics, text, and objects. There is already work extracting motion patterns, such as human gait patterns for robots and unmanned vehicles, to further improve the stability of positioning.


There is also a trend toward multi-sensor fusion. Each sensor has its advantages and disadvantages, and the best approach is to fuse their information. For example, with the popularity of depth cameras, some mobile phones now carry them, along with Wi-Fi, Bluetooth, geomagnetic signals, and so on; fusing these signals can certainly improve positioning stability. More sensor types will appear, such as the event cameras and polarization cameras that have emerged in recent years, and I believe new sensors will keep emerging over the next 5-10 years. Through multi-sensor fusion, SLAM will become ever more accurate and robust.
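
The simplest instance of the fusion described here is inverse-variance weighting of two independent measurements of the same quantity, which is also the core of a Kalman update. A sketch with assumed noise figures (the scenario and numbers are illustrative, not from the talk):

```python
def fuse(z1, var1, z2, var2):
    """Inverse-variance weighting: the optimal linear fusion of two
    independent noisy measurements of the same quantity."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    z = (w1 * z1 + w2 * z2) / (w1 + w2)
    return z, 1.0 / (w1 + w2)

# e.g. a visual position fix (accurate) fused with a Wi-Fi fix (coarse)
z, var = fuse(z1=10.2, var1=0.04, z2=11.0, var2=1.0)
print(f"fused position = {z:.2f} m, variance = {var:.3f}")
# The fused variance (~0.038) is below either input: each added sensor
# can only tighten the estimate, which is why fusion improves stability.
```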

Another trend is that, with the advent of the 5G era, SLAM will develop toward a combination of cloud and device. For example, high-precision maps are now built in the cloud and support dynamic updates. This naturally raises the questions of how to tightly couple SLAM on mobile devices with high-precision maps in the cloud, how to use semantic map information for better positioning, and how different devices can collaborate on SLAM.

We are now in the era of deep learning, and for SLAM there is already much learning-based work. I believe more will emerge, such as learning better features, or learning better strategies to escape the dilemma of hand-written rules in SLAM; there is also end-to-end pose estimation work that has done well. Another very important direction is the fusion of semantic information, for example how to better integrate structural information with semantics, the way the human eye sees the world. I think this is a future development trend.

The above concerns SLAM. As for three-dimensional reconstruction, Teacher Liu has already discussed a lot, especially the reconstruction of dynamic scenes, so I will just add a little. I think portable, mobile RGB-D sensors, such as depth sensors based on structured light and ToF, will become more and more popular, and new sensors will appear that help achieve real-time, efficient three-dimensional reconstruction. What is reconstructed is not only geometry and texture but also materials, semantics, and so on. Photo/video-based three-dimensional reconstruction will also progress in the next few years, for example achieving higher geometric accuracy and texture quality, obtaining finer-grained semantics, and exploiting the computing power of distributed platforms for more efficient reconstruction.

In terms of three-dimensional scanning of large-scale scenes, videos or photos taken by cameras can currently be used for three-dimensional reconstruction of urban scenes, usually from aerial footage captured by drones. If depth sensors (such as LiDAR) are further combined, higher-precision scene reconstruction should be achievable. Combined with the computing power of distributed platforms, reconstructing a complete three-dimensional map of an entire city, or even the entire Earth, will not be a problem. Of course, reconstructing a static scene is not too difficult; what is harder is reconstructing dynamic objects and dynamically updating the scene, because the real world is not static but constantly changing. I think relatively low-cost approaches such as multi-sensor fusion may in the future achieve dynamic updates of four-dimensional scene maps. Object models obtained through the three-dimensional scanning mentioned above can also be registered into a three-dimensional map of the real world to realize the sharing and transmission of three-dimensional information.


Next, I want to talk about the relationship between recognition and reconstruction, which will become much more deeply integrated in the next 5-10 years. At present, three-dimensional reconstruction is basically bottom-up and does not make full use of prior knowledge; in the next 5-10 years top-down methods may emerge, such as recognizing first and then reconstructing, or doing both simultaneously. Recognition can provide higher-level structural priors; in turn, reconstruction can help recognize objects better, so the two will be more closely integrated in the future. This also requires integrating deep learning with geometric optimization algorithms, finally building a three-dimensional scene representation that combines geometric appearance and semantic information, and is structured and dynamically updated.

In addition, since I have long worked on AR applications, I would like to talk about the coordinated development of AR/VR, AI, and three-dimensional vision. AR is mainly an application of AI and three-dimensional vision. If the three develop closely together, I believe Earth-scale digitization of the real world can be achieved in the next five to ten years. The picture on the left is the Cyberverse digital-reality technology recently proposed by Huawei: it scans the real world with sensors such as cameras and LiDAR, builds high-precision maps, and then uses them to achieve accurate indoor and outdoor positioning and navigation as well as various AR effects. Cyberverse is not an entirely new concept; Magic Leap proposed something similar in 2018, the Magicverse, aiming to continuously integrate the large-scale physical and digital worlds. As shown in the picture on the right, the Magicverse consists of several layers of two main types: the base layer (the physical world and its corresponding digital world) and the spatial application layers. At the bottom of the base layer is the physical world; upon it a corresponding digital world is constructed; and above that sit the spatial application layers, including mobility, energy and water, health and wellness, communications, entertainment, and so on.

To realize such a digitized real world, the most critical point is to digitize the physical world three-dimensionally, that is, how to collect, build, and update high-precision maps. This will inevitably develop toward multimodal, multi-sensor acquisition and fusion, because each sensor has its advantages and disadvantages and they must complement each other. The hardest question may be how to update dynamically. I believe crowdsourced collection and updating is an effective way to achieve this goal at low cost and high frequency. Besides three-dimensional geometry, high-precision maps should also include semantic information, so semantic information extraction is also very important, and it must serve different applications, such as positioning, AR/VR display, and behavior analysis. This requires extracting semantic information at different granularities: as large as an entire shopping mall, down to a single store, and smaller still, an individual product. In addition to three-dimensional digitization of the physical world, we also need to digitize human behavior, such as movement behavior, consumption behavior, and social behavior.

With people's behavior and the three-dimensional world digitized in this way, and combined with SLAM and AR technologies, we can realize Earth-scale AR applications. Of course, we first need to solve the problem of tightly coupling the high-precision map in the cloud with SLAM on the device, so as to achieve long-term, large-scale precise positioning and high-quality fusion of the virtual and the real. A loosely coupled mode has defects: error accumulates quickly and stability is not good enough. With the tightly coupled approach, we can achieve decimeter-level and even centimeter-level positioning and navigation both indoors and outdoors.

In addition, we know the 5G era is coming soon. Current AR computation happens mainly on the device, such as mobile phones and AR glasses. In the future, with 5G, much of the computation can be placed in the cloud or at the edge, and the computing requirements on devices will be relatively relaxed: devices will mainly provide data acquisition, connectivity, and display. With cloud computing power, high-quality AR effects become possible, such as highly realistic physics simulation, accurate occlusion and virtual-real interaction, accurate lighting estimation, and movie-grade photorealistic rendering and virtual-real fusion. In the 5G era, with very fast transmission on one hand and cloud computing power on the other, future applications may not even need to be pre-installed: opening an app will be as convenient as entering a URL in a browser or switching channels on a TV.

The above are my views on future development trends in three-dimensional vision and AR, for your reference.

Expert discussion

Xie Xiaohua

I feel we are ignoring one thing: hardware development. For example, we used to do a lot of work on super-resolution, but as soon as HD cameras came out, much of that work was wasted. Will there be a major breakthrough in visual sensors in the next ten years? If so, some of the work just mentioned may no longer be necessary.

Lin Zhouchen

I want to ask what kind of computing system is suitable for computer vision. Everything we do now is based on the von Neumann architecture, but human visual processing is very different from it. On a new computing platform, I think it is worth discussing whether many computer vision problems could be solved better or more efficiently. Also, I favor active vision and online learning. I think current visual systems share one problem: everyone starts from scratch, so with limited energy each of us can only do a very simple task. In the future we could run a project like a wiki, to which the whole world contributes. Everyone would build one unified system together, and that system could use all the data on the network and evolve by itself. Then everyone could use this system, which would solve the problem of everyone's systems endlessly learning from scratch, because a single person can only ever do a small part.

Lin Wu

I want to talk about benchmarks, the evaluation system for AI or CV. Much of our research is driven by benchmarks. The trend of CV now is fusion, synergy, and so on; in the future we may need a new evaluation system to assess the state of CV. It may not be necessary to achieve particularly high accuracy on one specific recognition or segmentation problem; rather, we should also assess understanding, analysis, interpretability, and so on, so that we can evaluate the robustness of an AI or CV system and how human-like it is, instead of reducing everything to a classification or reconstruction problem. I think this is a problem we need to discuss and work out very concretely.

Shan Shiguang

We are discussing what level vision can reach in ten years, but we have not clearly defined how to measure progress in visual intelligence in general, for example saying today's visual intelligence is at 60 points and in ten years we can reach 80: there is no clear standard. What are visual understanding and image understanding? How are they defined? When we do face recognition, it is very clear: the recognition rate on a certain database is the yardstick. But for general vision we seem to have no such standard.

In addition, from the standpoint of benchmarks, there are two types of human vision: general vision and specialized vision. For example, we ordinary people cannot read medical images but professional doctors can, while all of us have general visual abilities. Are the implementation paths of these two types of vision the same or different?

Another point: in ten years we may have digitized the Earth, as mentioned earlier, but this may not be simple digitization such as map-building. How will it help us do vision? I think it resembles a "proving ground" for visual intelligence testing: many of our systems could be tested in this "proving ground". For example, many autonomous driving systems use synthetic simulation data for preliminary training. When we have a good digital simulation of the Earth, we will have a good visual "proving ground" where both training and testing can be done.

In addition, should we work on visual common sense? Everyone talks about knowledge; I think a knowledge system without common sense is a castle in the air. We must first have visual common sense; only with common sense can there be so-called understanding. I am not sure this is correct, and I think it can be discussed.

Chen Xilin

Regarding the evaluation of understanding, we can consider how humans do it. For knowledge humanity has already consolidated, we do have benchmarks and test questions. But for knowledge still being explored, there are no test questions: the knowledge individuals acquire eventually forms a publicly recognized intersection, which then gradually expands. So I personally believe that in future research aimed at understanding, benchmarks cannot be absent, but they cannot be the only thing. If benchmarks drove the development of computer vision over the past 30 years, today they may have become a factor constraining it. I often argue with students about this; some students think work without an evaluable dataset is not research. Research on true intelligence may have no benchmark: there is no smartest, only smarter. For tasks like scene understanding, one machine may discover 100 relations and another 300, and the latter's understanding may surpass the former's; if the former's relations are a strict subset of the latter's, the latter certainly understands better. Of course, more often the two are complementary, just as anyone can be someone else's teacher in some respect.

The second issue is general vision versus specialized vision. My view is that the so-called specialized vision of medical interpretation actually goes far beyond vision itself; it is not just vision. The doctor's judgment is knowledge-based logical reasoning built on visual phenomena. So I do not quite agree with that point of view.

Hu Zhanyi

I have been studying biological vision for more than ten years. Vision is by no means only perception; vision includes cognition. Take the specific problem of visual object recognition alone: about one third of the human cerebral cortex is involved. Of course, the fact that a certain area of the cerebral cortex participates in visual problems does not make that area visual cortex. Most higher cortical areas in the brain process information fused from multiple senses to make cognitive decisions and plan behavior. So visual problems involve the whole brain, including cortex and subcortical tissue, and are by no means completed solely by the brain's visual cortex. The visual cortex refers to cortex that mainly processes visual information; many cortical areas participate in processing visual information without being visual cortex.

Let me make my first point. Human vision and computer vision are different. To fully explain the brain's processing mechanisms for human vision is, I think, no less difficult than understanding the origin of the universe. I have studied biological vision for about fifteen or sixteen years. As far as I know, neuroscience understands the V1 area of vision relatively clearly; V2 is not very clear, let alone V4 and IT behind it, or higher cortex such as the prefrontal cortex (PFC). Visual processing essentially involves many areas of the cerebral cortex. So in studying computer vision, I think we need to figure out what computer vision is and what its core scientific problems are; we cannot lump everything together. This deserves careful discussion. In the next five to ten years, should we mainly study visual perception or visual cognition? Studying visual cognition could take ten thousand years. I do not study computer vision much at the moment; I mainly focus on biological vision, so perhaps what I say is wrong, but I think being more focused would let us achieve more.

We are discussing directions for computer vision research over five to ten years, not specific algorithms. Ten years ago we did not know deep learning could reach today's heights. We want to discuss which directions are worth studying. I personally think three directions deserve attention: 1. computer vision grounded in neurophysiology, which I estimate will be a major direction within five to ten years; 2. video understanding; 3. vision research related to national strategies with Chinese characteristics, such as satellite data understanding (the global strategy) and deep-sea underwater visual information processing (the deep-sea strategy).

Ji Rongrong

It has been about 10 years since I finished my PhD. I think computer vision has advanced far beyond the other directions I was studying at the time, such as natural language understanding and information retrieval, and an important reason is the gains brought by deep learning. On the other hand, our systems are too big and too heavy. Can we make them smaller, with lower overhead? There are several dimensions here. The one everyone thinks of immediately is making the system small, so it fits on the edge and on embedded devices. The second is making the system faster than real time: autonomous driving, and devices on the terminal, may require processing data much faster than real time.

The third point: today we mostly build single-point systems. Each camera runs a complete closed loop of functions, spending a great deal of computation on repetitive work. Could future vision systems coordinate at large scale, going from points to a surface? That is, could they go from specialized to general? Right now each model solves only one task: to do object recognition you use a recognition model, to do semantic segmentation you use a segmentation model. I don't think the human brain draws such sharp boundaries. This one-model-per-task arrangement is too resource-consuming. Is there a more flexible mechanism in which network structure can be combined in different forms, for example a shared backbone under a family of models, with heads on top for recognition, segmentation, retrieval, and understanding? That would reduce the total computation. At this stage of evolution, the human brain handles many tasks efficiently and in parallel while occupying very little storage; on three bowls of rice a day we complete the amount of computation these machine systems burn through.
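As a concrete illustration of that shared-backbone idea, here is a minimal sketch in PyTorch (an assumed framework; the class, head designs, and dimensions are illustrative, not any speaker's actual system): one feature extractor feeds several lightweight task heads, so recognition, segmentation, and retrieval share most of the computation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskNet(nn.Module):
    """One shared backbone feeding several lightweight task heads."""
    def __init__(self, num_classes=1000, num_seg_classes=21, embed_dim=128):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Shared feature extractor: everything up to (not including) pool/fc.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Task-specific heads that all reuse the same features.
        self.cls_head = nn.Linear(512, num_classes)         # recognition
        self.seg_head = nn.Conv2d(512, num_seg_classes, 1)  # coarse segmentation
        self.embed_head = nn.Linear(512, embed_dim)         # retrieval embedding

    def forward(self, x):
        feat = self.backbone(x)            # B x 512 x H/32 x W/32
        vec = self.pool(feat).flatten(1)   # B x 512
        return {
            "cls": self.cls_head(vec),
            "seg": self.seg_head(feat),
            "embed": nn.functional.normalize(self.embed_head(vec), dim=1),
        }

net = MultiTaskNet()
outputs = net(torch.randn(2, 3, 224, 224))
print({name: tuple(t.shape) for name, t in outputs.items()})
```

The design choice is simply that the backbone is computed once per image, while each additional task costs only a small head rather than a full network.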

The fourth point: our systems now "eat" data, and eat far too much of it. Humans really do not need so much data to learn; we reuse what we know. For example, to recognize a fire truck I only need to add a few special components to my concept of a car. We are very smart about this, while our current systems squander these hard resources. So we should also explore better mechanisms for the consumption of computing resources and training data.

Finally, from my own feelings: in the past we watched others lead the field forward. I especially hope that in the next five to ten years the development of computer vision will be led by Chinese scholars, because we now have a huge market, and the market has technical pain points we can see immediately. These pain points are all around us; we should solve them ourselves rather than leaving them to others. So I think there is a great deal that Chinese computer vision scholars should do in the next five to ten years.

Lin Zhouchen

Making systems ever smaller to fit on phones is, I think, the wrong direction. The vision system of the future should grow bigger, not smaller, with all computation placed on the cloud via 5G; that is a future trend. We should build one large system on the cloud that can handle diverse problems. Expecting a small system to solve all kinds of problems is, I think, essentially impossible; we should build a system as complex as the human brain, able to solve all kinds of problems, and such a system can only live on the cloud. The phone then draws as much or as little computation as it needs, instead of squeezing everything onto a small handset.

Ji Rongrong

I don't think that is completely right. Some lightweight computation can be done on the device and heavier computation on the cloud. Moreover, computing on the terminal turns the data to be transmitted from heavyweight into lightweight: where you used to send images, you can now send only features; where you used to send whole frames, you now send only the regions of interest. Using a phone merely as a camera is wasteful; a phone is actually a good computing device.
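A rough sketch of that edge/cloud split, assuming PyTorch and torchvision (the model choice and byte counts are illustrative assumptions only): the device runs a lightweight backbone and ships a compact feature vector to the cloud instead of the raw frame.

```python
import torch
import torchvision.models as models

# The device extracts a compact feature; only the feature goes over the network.
backbone = models.mobilenet_v3_small(weights=None)  # small enough for a phone
backbone.classifier = torch.nn.Identity()           # expose the 576-d feature
backbone.eval()

frame = torch.randn(1, 3, 224, 224)  # stand-in for a captured camera frame
with torch.no_grad():
    feature = backbone(frame)

raw_bytes = frame.numel() * 4     # float32 frame tensor: ~600 KB
feat_bytes = feature.numel() * 4  # 576-d float32 feature: ~2.3 KB
print(f"frame {raw_bytes} B -> feature {feat_bytes} B "
      f"(~{raw_bytes / feat_bytes:.0f}x smaller payload)")
```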

Lin Zhouchen

We are not contradicting each other. What I oppose is trying to solve every problem on the phone. You started by saying you wanted to make the network on the phone small, but the smaller it gets, the weaker its capability.

Hu Zhanyi

On this question, let me make a suggestion. How much impact 5G has on computer vision is really the question of small terminals versus large ones. If the 5G network is fast enough, the terminal can simply be very small, doing no processing locally and sending everything straight to the cloud. I think the impact of 5G on computer vision deserves to be understood well.

Wang Yizhou

There is no contradiction between what you two said; in dedicated tasks you must compress, because processing is tied to the task and only has to meet the task's needs. Vision is an ill-defined problem: as a concept, "vision" is too big, but limiting it to images makes it too small. So how do we hold this ground? If we lose it, we lose it on the complexity of the problem. The ground is now occupied by deep learning, and however beautiful our theories or results, we cannot compete head-on. Where did we lose? Has deep learning actually solved vision? Vision is not merely a learning problem. As I said, vision can be very large: it can be a cognitive problem, top-down, bottom-up, spanning many tasks. The visual problems we define are not complex enough, and our systems are not complex enough. So we need to increase the complexity of both the system and the tasks, while keeping each specific task compact and fit for purpose. To take the ground back, I think we must raise complexity in these two respects. But vision is ultimately not only a visual problem; it is the problem of completing tasks in which vision plays the leading role. So whether CVPR will still exist down the road, or still be the venue everyone rushes to, is not certain.

Shan Shiguang

A question worth discussing is: how do we delineate computer vision from machine learning? Will we have to concede defeat in the next few years — is a given problem one of computer vision, or of machine learning? I think we younger people are genuinely confused. For instance, is there any problem that machine learning cannot handle and that only computer vision theory and methods can solve?

Chen Xilin

Many things are now filed under machine learning. Compare a machine learning textbook from 30 years ago with a pattern recognition textbook from 30 years ago, then compare today's machine learning and pattern recognition books, and you will see the difference.

Hu Zhanyi

I think machine learning is a means; it can be used in computer vision and in natural language processing. In that sense it is no different from mathematics as a tool for pattern recognition. I am more idealistic: one question is the method; the other is which scientific problems need to be solved.

Wang Jingdong

On the question Teacher Shan just raised: computer vision is so popular now, and much of it can be done with machine learning — AlexNet, for example, also tackles a visual problem — but there is no need to worry at all. I have done machine learning myself. Earlier I worked on acceleration and large-scale problems, and I did them in Matlab; how could that prove they were truly large-scale problems? So there is really no need to worry about this.

Let me take up the question just discussed: where does computer vision go in 5 to 10 years? We now face a situation where job openings in vision are very few this year, even though vision has been hot for 8 years, since 2012. How do we keep moving forward? People outside the field hold very high expectations for CV, such as surpassing humans; frankly that claim is unreliable — it does not surpass humans at all. Yet people who do not do computer vision always think we should deliver on it. At this stage, vision risks ending up like neural networks back then, a rat crossing the street, with everyone saying computer vision people brag — though in truth it is not we who brag, it is others bragging on our behalf. If vision is to keep moving forward, scientific research is one problem; the other is how to keep earning attention and truly build systems that work. Although we have done well in many respects, honestly many systems do not really work yet. And should computer vision be solved purely from the vision side? Multimodality is a good direction: vision alone still leaves big problems in surveillance systems. Direction-wise, I am quite optimistic about multimodality.

Wang Tao

There are many trends in the future development of computer vision; I feel one of the most important should be active vision. ImageNet-style competitions can recognize many objects, but plain image classification is of little use in real scenes. What really works is technology like face recognition: detect the object first, then recognize it. Why has face recognition succeeded while general image classification remains immature? Given an image, you must analyze different regions at different granularities. For example, taking a photo in this venue, to count heads we first detect the people; but to identify the projector, we must first locate it in the image. Moreover, the projector's information has many levels: to know the brand you must look closely at the logo, but to know how to operate it you must identify its various interfaces and their functions. The recent ImageNet and ActivityNet action recognition competitions are, I feel, still run as image classification contests, and classification alone cannot be used in practice. Why? Because it does not recognize actively, the way a person does: you have to find the person, and the frames in which the action actually occurs, before you can recognize it. So I feel taking the initiative is very important.

Second, there must be hierarchy. Hierarchy means not only recognizing basic elements but also structuring the relationships between levels. Our experiments found that learning everything jointly works very poorly, but if we split it in two — first fix the decoder and learn the encoder, then fix the encoder and learn the decoder — the system can be learned (see the sketch after this paragraph). Learning should build up like building blocks: first recognize basic things such as faces, cups, and flowers, and then, given a photo, work out the relationships between the objects.
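A minimal sketch of that alternating scheme, assuming PyTorch (the toy encoder/decoder and the reconstruction objective are illustrative stand-ins, not the speaker's actual system): each phase updates one half while the other is frozen.

```python
import torch
import torch.nn as nn

# Toy encoder/decoder pair trained with a reconstruction objective.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 8, 3, padding=1))
decoder = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 3, 3, padding=1))
loss_fn = nn.MSELoss()

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train_phase(trainable, frozen, steps=100):
    """Update `trainable` while `frozen` stays fixed."""
    set_trainable(trainable, True)
    set_trainable(frozen, False)
    opt = torch.optim.Adam(trainable.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(4, 3, 32, 32)           # stand-in for a real batch
        loss = loss_fn(decoder(encoder(x)), x)  # reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()

train_phase(encoder, frozen=decoder)  # phase 1: fix the decoder, learn the encoder
train_phase(decoder, frozen=encoder)  # phase 2: fix the encoder, learn the decoder
```

Note that gradients still flow through the frozen half; only its parameters stop updating, which is what makes the two-phase schedule stable.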

Third, how should we do research? Visual research is broad; to succeed you must target specific applications. Face recognition systems are mature, yet they cannot simply be reused to recognize pedestrians for autonomous driving — applications have to be conquered one type at a time, since different scenarios involve different data and different properties. So for specific applications, I think deep-learning-based active vision, combined with hierarchical fusion and reasoning, should be a better trend going forward.

Hu Zhanyi

I think active vision is very important, but it cannot make huge progress in 5-10 years. It involves high-level feedback knowledge from biology, and feedback is hard to advance in the short term. On the purpose of vision: in 1994 CVGIP organized a special issue and there was a debate. From 1994 to now, one can say active vision has made no real progress. There is a great deal of feedback in the biological nervous system, but we do not know what that feedback carries. If neuroscience can hardly offer revelations here, it will be difficult for computer vision to realize active vision. That is my personal opinion.

Wang Tao

I think earlier active vision failed because the research methods were wrong and the technology of the time was limited.

Hu Zhanyi

There are two kinds of recurrent connections: one is inhibition within the same layer, the other is feedback from higher levels. In biological vision everyone knows there is massive top-down feedback, but it is not clear what that feedback is. So, to my understanding, it will be difficult to draw significant progress from biological vision within 3-5 years.

Wang Yizhou

Let me add, coming back to learning: learning is the core of vision. Vision is in some sense a pseudo-problem, while learning is the eternal, essential problem; without learning, whether vision exists hardly matters. Rather than computer vision, we might call it computational visual intelligence. Vision is a kind of intelligence, and the core of intelligence is learning how to acquire knowledge; feedback is just one link in learning and reasoning. What is learning — simple pattern recognition, or something more advanced? That may be where learning goes next. Give it a common name, meta-learning; in the context of computer vision, call it meta-cognition. The core is learning; without learning nothing works.

Yang Ruigang

I think machine vision and biological vision should be allowed to differ; machine vision does not necessarily have to imitate biological vision. For example, for the global view I want a wide shot, and for local detail a close-up; but if you have a camera that captures a billion pixels at once, or one that records the light field, then the distinction between active and passive vision disappears, at least for two-dimensional images. Such hardware differences matter, and I think gigapixel cameras are coming soon.

Chen Xilin

Let me add one sentence: the initiative here is not only about resolution; its essence is exploring through active "behavior" so as to make the most of limited resources.

Yang Ruigang

You are talking about active exploration in perception; there is also the kind that changes neither the environment nor the objects.

Chen Xilin

Even without making any changes — say, viewing from one angle and then another — the light field camera does not solve the problem: we cannot obtain the light field behind an object.

Yang Ruigang

A light field camera array could.

Lin Zhouchen

Yang Ruigang means simply collecting all the information; that mechanism is still somewhat different.

Wang Yizhou

Active vision has a stopping problem and a selection problem. All the information is there; deciding when to stop and which piece to pick is the crux. Active vision is not photographing everything — you have to choose.

Yang Ruigang

The selection problem is certainly in there, but active vision now inevitably involves robots and other issues, which goes beyond the scope of computer vision.

Wang Yizhou

So don't cling to computer vision alone — that is my point.

Hu Zhanyi

There are two concepts in active vision. The first is exploration and attention; without them there is no initiative. The second is memory. Active vision is a biological concept; in computer vision the notion is too broad.

Cha Hongbin

I think we can compare active vision with deep learning here. The problem with deep learning is that it requires an annotated database, with data curated in advance. When a visual system works in real scenes, it must select the samples useful to it. Combining sample selection with strategies such as viewpoint selection, structure reconstruction, and computational optimization can effectively exercise initiative, without requiring people to collect all the data and feed it in; a sketch of the sample-selection part follows.
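A minimal sketch of that sample-selection loop, assuming PyTorch (the toy classifier and the entropy criterion are illustrative assumptions, not the speaker's method): the system queries labels only for the samples it is least certain about, rather than having everything annotated up front.

```python
import torch

def select_uncertain(model, unlabeled, k=32):
    """Return indices of the k samples the model is least certain about."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled), dim=1)
        # Predictive entropy as the uncertainty measure.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.topk(k).indices

# Usage: query labels only for unlabeled[idx], add them to the training set,
# retrain, and repeat -- the basic active-selection loop.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
unlabeled = torch.randn(1000, 3, 32, 32)
idx = select_uncertain(model, unlabeled, k=32)
print(idx.shape)  # torch.Size([32])
```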

Zheng Weishi

Learning is very important to vision, and benchmarks, while driving progress, have also constrained the current development of computer vision. With ReID reaching 97%, people cannot think of what to do next, yet the problem itself is not solved: the databases are too limited, and what was collected does not reflect the full problem — pedestrian occlusion and all sorts of other issues. With limited data, learning may not solve things completely. Could learning under limited data draw inspiration from 3D? If we capture a person's full 3D information and a pedestrian's entire behavior, we can remove these confounds in the open environment and then reconstruct them. For example, we could build a proving ground, a virtual test range, and this is very important: whatever system we build, we have to test it, and if we test only on limited or one-sided data, we may be caught short when deploying in real life. So if we can embed 3D into today's 2D-image-dominated computer vision, that extra dimension may drive the whole field forward over the next 3 to 5 years.

Why do 3D? Another reason is data privacy, now debated all over the world. The privacy problem of data collection is becoming ever more serious; with a virtual proving ground, the privacy issue does not exist at all. In the future, face recognition, pedestrian recognition, and even behavior recognition may be seriously constrained by law, so we need to think from the 3D perspective, from this other dimension. Can it broaden the development of computer vision? That is my view.

Jia Yunde

We started doing vision very early. Back then it was a small river that had been flowing for many years; then it suddenly rained hard, and now a flood (deep learning) has come. In five years this flood may have passed, but I think the river will still be there — after all, the pathway from the retina to the visual cortex is extremely efficient. So what should our Chinese laboratories study in five or ten years? It must lie within that river.

I am optimistic about two directions. The first is three-dimensional vision: 3D vision will never be scorching hot or freezing cold, but it will keep going steadily. The second is the video understanding Teacher Hu mentioned. Several teachers also spoke of multimodality — like watching a movie, glancing at the picture for a while and at the subtitles for a while, each helping to understand the other; it seems quite hot now. We used to face the data-semantics gap; next, a gap between recognition results and consciousness will appear, and once a gap appears it becomes a hot topic, because so much subjectivity gets added inside. I think video understanding should be a hot direction.

Lu Jiwen

I think we are using a great deal of machine learning now; next I would rather do some dedicated work moving from machine learning to machine reasoning. For example, given an image, you know at a glance how the scene will develop, but even a strong network cannot do that. I think computer vision often performs well precisely because of how we define the problem — the task can basically be solved within that definition. Now we may need to look for computer vision tasks that better describe, or better match, human visual abilities. Detection, segmentation, retrieval, and recognition are currently separate visual tasks, and such isolated tasks are still a bit simple. Of course some teachers may disagree — variations of them can be harder — but much of the time human vision does not work like this. So an important question for computer vision is how to find tasks that better match human visual tasks, neither too hard nor too easy. I think we should spend more time thinking about and discussing what such a task would be; I do not yet know what it is.

Wang Liang

The main purpose of this topic is to hear new insights from our domestic vision experts. Today I heard you talk about many aspects; on any given aspect there may be a trend, and while opinions differ, there are also commonalities. The aim of this session is that through in-depth discussion we sort out the development trends the community most agrees on; differing opinions are fine, and the discussion benefits from their intersection. I think this kind of discussion is good: let us talk through the trends and let ideas collide and spark. In vision research it would be strange if everyone held exactly the same view of where the field is going, and it is genuinely hard to distill a single clear trend. Why? Anyone who wants to do good work should hold views somewhat different from others'; if everyone thinks alike, it is hard to keep going. So these exchanges are more about inspiring new ideas in ourselves, or finding more reasons and grounding for our own ideas, and then carrying on. Through such discussions, I hope our future results at international conferences can carry more of our own character. So far, what do we compete on when writing papers? Mostly performance improvements on database tests — improving on someone else's method, running an experiment, and reporting a percentage point gained. We rarely argue that our idea is different from others', and that this difference yields real effect when applied somewhere. At first your results may be poor and may not easily impress anyone, but when your idea is genuinely different, many people will follow your lead. So my point is: in the future, let us not compete only on numbers on these databases, but have more good ideas.

Shan Shiguang

Could we propose a review mechanism that evaluates only ideas and methodological principles, and does not judge results on benchmarks or databases?

Chen Xilin

The 1994 CVGIP special issue Teacher Hu just mentioned proposed three aspects needing improvement that year. Only one has truly been realized: benchmarks. That discussion said work in our field lacked comparison, and various comparison datasets were created as a result. So when I said earlier that benchmarks have driven the progress of computer vision research over the past 30 years, I was dating it from that discussion.

Cha Hongbin

I agree with you. Looking back over so many years of computer vision research, perhaps since those articles came out we have hardly seen any new ideas or new theories. Before that, a hundred schools of thought contended and there seemed to be many new proposals; after the benchmarks arrived, everyone did the same thing, and in the end the whole field became less active.

Hu Zhanyi

Since we are studying computer vision, I suggest everyone read Marr's book.

Lu Huchuan

On the benchmark question: I think the very existence of benchmarks is at least what distinguishes computer vision from pure machine learning, and they have played the historical role they should have. The main criticism now is that each benchmark is single-task — such a benchmark is simply not human-like. If someone designed a more complex, multidimensional benchmark, it might drive the next era of development, perhaps even toward human-like learning and recognition. I do not think benchmarks are inherently problematic: when people are educated from childhood, they are also taught what this is and what that is, yet a person is a comprehensive agent. If benchmarks evolve to higher dimensions, they may yield better outcomes.

Yang Ruigang

There are too many benchmarks now. Which benchmarks matter and which do not? The same goes for the competitions attached to them. In a sense, announcing "we won first place in the world again" may mean only ten teams took part. Is there a better quantitative mechanism — a benchmark for benchmarks?

Wang Jingdong

A big problem with benchmarks now is that many people cannot work on them — take ImageNet: many groups cannot afford to. From a researcher's perspective this means papers may not get published, which is bad. But from another angle, benchmarks are very important. Vision has many tasks, and another important purpose of our field is training students; as a vehicle for training students, vision may differ from other fields, such as Multimedia. I think Multimedia is very good for training students, but it has a big disadvantage: it lacks benchmarks. From that perspective benchmarks are still needed. Yet reviewers now expect ever-larger datasets, which is very challenging for many groups, especially in universities; nowadays perhaps only a few companies are strong enough. That is my view on benchmarks.

Wang Yizhou

I suggest PRCV open a separate track to encourage innovation — a track that does not look at performance.
