
Instructor | Chen Chao, R&D Director of Jizhijia (Zhidongxi Open Course)
Courseware | Follow the Zhidongxi Open Course public account and reply with the keyword "Intel 03" to get the courseware.
Replay | https://appoSCMf8kb5033.h5.xeknow.com/st/8Nl2ftuYV
Introduction:
On April 7, Jizhijia R&D Director Chen Chao gave a live lecture in the Zhidongxi Open Course on the theme "Innovation and Practice of Visual Perception and Positioning Technology for Logistics Robots". This was also the third lecture in the Intel AI Top 100 Innovation Incentive Program series. In this lecture, Mr. Chen Chao gave an in-depth explanation of the visual perception and positioning requirements of logistics robots, the intelligent forklift pallet detection algorithm, using OpenVINO to accelerate the deployment of the pallet detection algorithm, and Jizhijia's VXSLAM positioning technology and practice.
This article is a text-and-image compilation of the lecture:
Text:
Good evening, I am Chen Chao from Jizhijia. I am very happy to share with you today the innovation and practice of visual perception and positioning technology for logistics robots. This sharing covers 4 aspects:

1. Requirements for visual perception and positioning of logistics robots
2. Analysis of Jizhijia's smart forklift pallet detection algorithm
3. Using OpenVINO to accelerate pallet detection algorithm deployment
4. AMR positioning technology and practice based on VXSLAM
Requirements for visual perception and positioning of logistics robots
For ordinary consumers, the most familiar part of logistics is e-commerce express delivery. In fact, a complete logistics scenario includes factory logistics, warehousing logistics, express logistics and commercial logistics. Beyond e-commerce, almost every physical industry has a complete logistics system and needs to solve its own storage and transportation problems for raw materials and semi-finished products.
Logistics is a very large industry with many links, which gives rise to many smart logistics scenarios, such as smart sorting, smart handling, smart forklifts, smart warehousing and smart factories. Many e-commerce backend warehouses are a typical smart picking scenario, where shelf-to-person robots move shelves to a fixed location for manual picking. Intelligent handling is more common on factory production lines, distributing materials and raw materials. Sorting robots are mainly used in postal and express delivery systems for package delivery and sorting. Smart forklifts are mainly used for short-distance transfer and transportation of relatively heavy items, as well as in smart warehousing and smart factories. Behind these many different logistics scenarios is the support provided by robots in various forms.

What are the requirements for the perception and positioning of these robots? First of all, because logistics scenarios are very diverse, they place different types of requirements on robots: robotic-arm robots need to identify and locate goods; roller robots need to dock goods to designated assembly lines; and there are also handling robots, unmanned forklifts, and so on. Different robots complete their own specific tasks.
Next is the forklift. A forklift has a very strong load capacity, but it is also relatively dangerous. In a forklift's working environment, whether the fork tines hit a person or goods, it will be a serious accident, so a highly reliable algorithm is needed to ensure the robot's safety.
Regarding stability, compared with other robots, logistics robots may have 7×24 stability requirements. In the sorting scene of Wuhan's postal system, hundreds of machines of different types work closely together to deliver packages. We can imagine that in such a dense scene, if a robot malfunctions and locks up certain areas, it will paralyze a large part of the work area and cause orders to pile up.
The upper right corner shows a Toyota production line in Japan, where robots dock parts; the accuracy requirements there are very high.
Analysis of Jizhijia’s smart forklift pallet detection algorithm
In addition to the high reliability and safety requirements mentioned above, smart forklifts also have some unique features. There are many types of forklift pallets: European standard, national standard, and non-standard ones, some customized by customers. They are made of different materials, such as plastic, iron, and wood, and have different shapes, including single-hole, double-hole and multi-hole designs, so we need to adapt to multiple types of targets. In terms of accuracy, the fork tines of the forklift are actually relatively thick; compared with the insertion hole, the margin left for positioning is very small, which requires our detection algorithm to have very high accuracy.

Forklifts use depth cameras. For point cloud processing, there are now some typical methods, such as the early PointNet and VoxelNet, and the later Point-GNN, etc. Here we focus on PV-RCNN. Generally speaking, voxel-based methods have relatively high computational efficiency but lower positioning accuracy, while point-based methods have a variable receptive field and generally higher accuracy. PV-RCNN combines the advantages of both: it first uses 3D voxels for initial positioning, then associates them with keypoints for precise positioning. Specifically, it contains two large branches. The first performs 3D voxel processing on the raw point cloud to obtain candidate 3D boxes. The other samples the raw points down to sparse keypoints, associates them by indexing with the multi-level features extracted by the voxel branch, fuses them, and then refines the result. The point-cloud-based method is the most direct for us, but in actual use the labeling cost of 3D point cloud data for pallets is very high, perhaps more than 10 times that of image labeling, so we still use an image-based method.
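As a rough sketch of the voxelization step that voxel-based detectors such as PV-RCNN start from, the following illustrative helper (not Jizhijia's implementation) groups raw points into a voxel grid and summarizes each cell by its centroid:

```python
# Illustrative voxelization sketch: points are (x, y, z) tuples and
# voxel_size is the edge length of a cubic voxel cell.
from collections import defaultdict

def voxelize(points, voxel_size=0.1):
    """Group raw point-cloud points into voxel cells keyed by integer indices."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)

def voxel_centroids(voxels):
    """Summarize each voxel by the centroid of the points it contains."""
    return {k: tuple(sum(c) / len(pts) for c in zip(*pts))
            for k, pts in voxels.items()}
```

In a real detector, each voxel summary would then be fed to a 3D convolutional backbone that proposes the candidate boxes mentioned above.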

The upper left corner of the picture above is the obtained depth map. After converting it into an image, you can use well-known two-stage methods such as Faster-RCNN and R-FCN, or one-stage methods such as YOLO and RetinaNet. Here we introduce a typical one, FSAF. We know that scale is a relatively big problem in object detection. To solve it, mainstream detectors adopt the FPN structure and, in anchor-based methods, arrange anchors on different layers. You then need to set the anchor positions, sizes and aspect ratios in advance, and the specific settings rely on heuristics: to put it simply, the size of each target box is compared by IoU with the anchor boxes on each feature layer, and the layer with the largest IoU is taken as the detection layer, which is actually not an optimal result. FSAF focuses on feature selection. It builds an anchor-free branch on top of RetinaNet that outputs a detection heat map, and this heat map guides the anchor-based branch, so the choice of which layer and scale should detect a target of a given size can be optimized through learning and training. A great advantage of FSAF is that it is a modular method: it can be directly integrated with many mainstream detection algorithms and improve their performance to a certain extent.
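The heuristic IoU-based level assignment described above can be sketched in plain Python. Function names and the square-anchor simplification here are illustrative, not from any particular detector's code:

```python
# Assign each ground-truth box to the pyramid level whose canonical anchor
# size, centered on the target, gives the highest IoU.

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def assign_level(gt_box, anchor_sizes):
    """Pick the feature level whose square anchor best overlaps the target."""
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    best_level, best_iou = -1, -1.0
    for level, s in enumerate(anchor_sizes):
        anchor = (cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2)
        v = iou(gt_box, anchor)
        if v > best_iou:
            best_level, best_iou = level, v
    return best_level
```

FSAF's contribution is precisely to replace this hand-written rule with a selection learned from the anchor-free branch's output.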

Finally, let's take a look at our detection method. Mainstream object detection is based on bounding boxes: to put it simply, it densely samples the entire image with boxes, whether through anchors or other means, and then regresses the target category and location. Later, some keypoint-based methods were born, because anchor-based methods require careful anchor design, embed many priors in advance, and carry many hyperparameters. There are two main types of keypoint-based methods: CornerNet, based on the top-left and bottom-right corners, and CenterNet, based on the center point of the target, which first detects the center and then regresses its attributes, including its boundaries. Here we combine the characteristics of the two methods and use 5 points to describe the target. As mentioned earlier, we have very high accuracy requirements; in the image's field of view, a pallet is not actually a true rectangular frame but a quadrilateral, so we can achieve higher accuracy by detecting 4 additional points. Concretely, a heat map of target center points is first generated through an encoder-decoder network, where the heat represents the probability of a target. After obtaining the probabilities, we screen the candidates and then regress the keypoints.
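The center-point screening step can be illustrated with a minimal heat-map peak extractor. This is a simplified stand-in for what CenterNet-style detectors do with max pooling, not the production code:

```python
# Keep cells that are local maxima in their 3x3 neighbourhood and exceed a
# confidence threshold; keypoint regression would then happen at each peak.

def heatmap_peaks(heat, threshold=0.5):
    """Return (row, col, score) for each local maximum above threshold."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heat[r][c]
            if v < threshold:
                continue
            neighbours = [heat[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(v >= n for n in neighbours):
                peaks.append((r, c, v))
    return peaks
```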
This method has some characteristics. First of all, it is a one-stage method: there is no RPN process and no information blocking, and front-end features can be passed directly to the final detection layer. We believe this type of method may have a higher performance ceiling in the future. In addition, since it is a non-bounding-box method, there is no NMS operation, so it is faster. Moreover, because it adopts an encoding-decoding form, we can detect directly on high-resolution images, so unlike general object detection, which may run on low-resolution inputs or FPN outputs, it also achieves higher accuracy.
Finally, there is another branch that I think is also very important. It supports a key goal of the entire pallet detection algorithm design: high reliability. How is it done? We create a Verification branch that accepts both the output of the network detection and a pallet prior model. Because a database of pallet models is prepared in advance, we can judge the similarity between the pallet models in the database and our detection results to determine the confidence of the detection. As mentioned just now, in practical applications it is better to have a lower detection rate and miss some targets than to let the forklift fork the wrong objects and cause damage. This is an improvement in network performance achieved by combining domain knowledge.
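A toy sketch of the verification idea might look like the following. The model names, the single spacing feature, and the tolerance are all invented for illustration; the real branch compares the full detected geometry against the prior model:

```python
# Compare a measured pallet feature (here, hole-center spacing) against a
# database of known pallet models; reject detections that match nothing.

PALLET_MODELS = {
    "euro_double_hole": {"hole_spacing_mm": 527.0},   # invented values
    "national_standard": {"hole_spacing_mm": 480.0},
}

def verify_detection(measured_spacing_mm, tolerance_mm=15.0):
    """Return the best-matching model name, or None to reject the detection."""
    best, best_err = None, tolerance_mm
    for name, model in PALLET_MODELS.items():
        err = abs(measured_spacing_mm - model["hole_spacing_mm"])
        if err < best_err:
            best, best_err = name, err
    return best
```

Returning `None` here implements the "better to miss than to fork the wrong object" policy described above.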
Using OpenVINO to accelerate pallet detection algorithm deployment
First, train the network model on the server side, then optimize and convert the model with OpenVINO, and finally use OpenVINO's inference engine to deploy it on the robot side on a Core i3 computing platform.
Some friends may not be familiar with OpenVINO, so here is a brief introduction. OpenVINO is an acceleration and optimization framework for deep learning and traditional vision launched by Intel. It supports most of Intel's current computing platforms, including CPUs and integrated graphics. It provides a unified API that allows our algorithms to be compiled and optimized once and quickly ported across platforms. Let's focus on the model optimization and inference engine parts of the deep learning side. The model optimizer receives a trained model and converts it into OpenVINO's internal IR format. The inference engine then applies hardware instruction-level optimizations; it is a very lightweight and efficient deployment framework that enables fast model inference on our embedded platform. The deep learning side also provides many pre-trained models and, more importantly, many very practical tools: for example, the calibration tool can perform quantization, which we will talk about later, along with model performance analysis and visualization tools. On the traditional vision side, there are instruction-set optimizations for the familiar OpenCV and OpenVX, as well as the underlying media SDKs needed to build a complete visual processing pipeline.
Next, let's talk about model optimization. The model optimizer is one of the core parts of the entire OpenVINO framework. It currently supports almost all mainstream frameworks; we use TensorFlow, but it also supports PyTorch, Caffe, etc. It can analyze the topological structure of the model, perform some optimization operations, and then convert it.

Let's focus on the model optimization techniques, looking at the picture on the right. Many networks contain a BN layer, which is essentially just bias and scale operations; executing it as a separate layer is not computationally efficient. The model optimizer can automatically merge these layers according to the network topology, folding BN layers forward or moving them backward. As you can see, the structure of the network is greatly simplified, and more importantly, inference speed is greatly improved.
There are also some stride optimization operations. For example, structures such as ResNet contain many 1×1 convolutions whose stride may be greater than 1. For these operations, we can move the stride forward. Look at the example in the lower right corner of the picture above: the input is 56×56, followed by two branches in which operations with kernel_size 1 but stride 2 are performed; many of the preceding computations are in fact discarded. We can move the stride forward, constructing a stride-2 operation on the earlier branch, so that the input feature map is reduced early on, and the overall efficiency improvement is considerable.
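A back-of-the-envelope cost model shows why moving the stride forward helps. The layer shapes below are illustrative, loosely based on the 56×56 example, and the count ignores whether the graph transformation is exactly equivalent in every branch structure:

```python
# A 1x1 stride-2 conv throws away 3 out of every 4 activations computed by
# the layer before it; moving the stride to that earlier layer roughly
# quarters the earlier layer's cost.

def conv_macs(h, w, cin, cout, k, stride):
    """Multiply-accumulate count of a conv layer on an h x w input."""
    oh, ow = h // stride, w // stride
    return oh * ow * cin * cout * k * k

# Original: 3x3 stride-1 conv on 56x56, then 1x1 stride-2 conv.
before = conv_macs(56, 56, 64, 64, 3, 1) + conv_macs(56, 56, 64, 128, 1, 2)
# Optimized: stride moved into the 3x3 conv; the 1x1 conv runs at 28x28.
after = conv_macs(56, 56, 64, 64, 3, 2) + conv_macs(28, 28, 64, 128, 1, 1)
```

With these illustrative shapes, the optimized graph needs less than a third of the multiply-accumulates of the original.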

Finally, let's take a look at the inference engine, which can be optimized for different platforms. Let's focus on INT8 inference. We know that INT8 has many advantages over ordinary FP16: it is more energy efficient, delivers more output per instruction, has higher bandwidth, and consumes less memory.
We can use the calibration tool to quantize models. The specific process is to first run the trained model on a calibration dataset and record each layer's activations in float32. Next, every layer that could be converted to INT8 is converted and inference is run again, collecting the resulting accuracy drops. For some layers, conversion may cost a point of accuracy, which is unacceptable; other layers show almost no change, and those we convert.
After applying these quantization criteria to the model, we can run a 32-bit model with INT8 inference. Different platforms differ, but this operation will probably improve performance by about 30%. On top of the model optimization and inference engine described above, we also need to build a pipeline for the entire detection process. We use some of Intel's underlying media processing libraries mentioned before, such as the Media SDK for encoding and decoding, plus OpenCV for front-end preprocessing and back-end post-processing, with our inference engine (IE) part in the middle. Compared with traditional manual optimization methods such as OpenCL, using OpenVINO improved our overall performance by 4 times.
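A minimal sketch of the calibration idea, not OpenVINO's actual implementation: record the activation range of a layer on calibration data, derive a symmetric scale, quantize to int8, and measure the round-trip error that would drive the keep-or-revert decision per layer:

```python
# Symmetric post-training INT8 quantization of one layer's activations.

def calibrate_scale(activations):
    """Scale mapping the observed float range onto [-127, 127]."""
    peak = max(abs(a) for a in activations)
    return peak / 127.0 if peak else 1.0

def quantize(activations, scale):
    return [max(-127, min(127, round(a / scale))) for a in activations]

def dequantize(values, scale):
    return [v * scale for v in values]

# Toy "calibration run": measure the worst-case round-trip error.
acts = [0.5, -1.0, 0.25, 0.9]
scale = calibrate_scale(acts)
restored = dequantize(quantize(acts, scale), scale)
max_err = max(abs(a - b) for a, b in zip(acts, restored))
```

In the real flow, the analogue of `max_err` is the per-layer accuracy drop on the calibration set, and layers whose drop is unacceptable stay in float.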

Finally, let's look at some of our deployment cases. The picture on the left is a forklift deployed during the epidemic in the warehouse of a major international pharmaceutical manufacturer in Europe. The goods are all medicines; we can imagine that the safety requirements for the forklift are very high, and the medicines are very expensive. The picture on the right is a project at a gypsum board factory. You can see there are many types of pallets here; the top one is double-hole. Strictly speaking, it is not even a pallet: the customer used local materials, cutting plasterboard sheets to look like pallets. There was also some stretch film on them, which sags, so this project posed a very big challenge to our positioning accuracy. In the end, we met the customer's requirement with 5 mm positioning accuracy and successfully deployed.
AMR positioning technology and practice based on VXSLAM
First, a brief introduction to AMR. AMR is the abbreviation of autonomous mobile robot. Compared with the AGV we are familiar with, there are some differences, the most important being the positioning and navigation method. Traditional AGVs mainly navigate along guide lines on the ground, such as ribbons, magnetic strips or wires. Later, there were schemes that place QR codes on the ground, and schemes that mount reflective panels on the walls.
But in any case, the navigation path of a traditional AGV is generally relatively fixed. In contrast, AMR is mainly based on SLAM positioning; it does not require modifying the environment's infrastructure and can localize after building an environment map in advance. These different positioning and navigation methods lead to the following differences: an AGV is more suitable for repeating the same tasks and fits traditional businesses, while an AMR can be flexibly reconfigured, whether in routes or tasks, and is more suitable for agile businesses. For example, in a typical manufacturing scenario with many operating points, frequent business changes, and humans and machines mixed together, AMR may be the more suitable solution.

As for SLAM technology, traditional SLAM includes laser SLAM, visual SLAM, and even ultrasonic SLAM and bumper-strip SLAM, but whatever the flavor, it is essentially a state estimation problem: a robot equipped with sensors moves through the environment and estimates the change in its own position by observing landmark points; at the same time, all the landmarks are stitched together to build a map. That is SLAM.
Traditional visual SLAM has some good solutions, such as RTAB-Map, LSD-SLAM, VINS-Fusion, etc. In recent years, with the introduction of deep learning, the performance of traditional SLAM has improved to a certain extent. A representative example is HF-Net, which is divided into two parts: offline mapping and online positioning. In the offline mapping part, HF-Net processes the input image sequence and extracts local features, here SuperPoint, and then SfM builds a map of the entire scene, mainly by triangulating the feature points.
On the other hand, HF-Net also extracts global features and aggregates them, mainly for later place recognition. In the online positioning stage, given the image from the robot, HF-Net first extracts its global features and selects candidate locations in the scene through the place recognition module. The other branch extracts local features and performs 2D-to-3D matching between the local features' positions and candidate 3D points to recover the pose. The HF-Net approach currently works well for some large scenes and scenes with changes, such as large lighting changes.
Currently there are some successful VSLAM cases. A representative one is Google's VPS, which can perform indoor and outdoor positioning and navigation based on a single image taken by a phone camera. The other is Facebook, which in recent years has gathered well-known talent in the VSLAM field worldwide, including through the acquisition of Zurich Eye/Oculus, etc., to create a visual positioning solution with relatively stable results.

VSLAM still faces many challenges, the most important of which may be dynamic environments. The picture above is a typical warehouse environment; you can see the upper right corner is filled with cartons. Depending on working hours or tasks, these stacked boxes undergo huge changes: when the robot comes to move things there may be three boxes, but when it comes back there may be none. In addition, there are lighting changes. Usually the lights are on over the production line, but work points sometimes change with scheduling, so the AMR needs to pass through areas where the lights are off. Also, in warehouses or factories, some areas have huge skylights or side windows, and changes in sunlight affect positioning as well. On the other hand, there is the problem of positioning accuracy in open scenes: in many warehouses there are very few fixed landmarks, only a few large pillars, and the goods change over time. Accurate positioning in such an empty scene is also a difficult problem.

Finally, let's talk about the VXSLAM solution. First, a word on the name: it fuses heterogeneous sensors along another dimension. The first and most basic sensor is the camera; in addition there are the IMU, odometry (Odom), and laser.
The most important goal of our entire system design is high robustness, consistent with the forklift work mentioned earlier. The picture above shows the framework of the entire algorithm, the most important parts being mapping, positioning and map management. First, the mapping part receives the images and other data input by the sensors and performs BA optimization to build the map. What is special is that we have an object estimation module that removes dynamic targets, estimates the positions of static targets, and adds them to the map. In addition, the place recognition module is used for loop closure correction. In the map management module, we maintain various heterogeneous maps, such as the feature maps just mentioned; from another perspective, the map is multi-layered, which will be introduced later. In the positioning stage, the robot's pose is estimated based on the established map and results from the place recognition module.
Let's first talk about heterogeneous maps. Everyone knows that laser SLAM, especially single-line laser SLAM, is relatively mature in robotics and has many advantages, so the first thing to fuse is the laser's grid map. There are actually many problems in this fusion: the two heterogeneous sensors have different frame rates, different data types, and different characteristics. The laser is globally consistent with very good accuracy, but jitters locally; vision is locally very accurate and smooth, but suffers from problems such as scale drift. We use a joint optimization method to integrate the laser grid map and the visual feature map well, improving the robustness and accuracy of fused positioning.
The other map is the object map. Based on the OpenVINO platform, it runs an object detection and pose estimation module. People and temporarily placed targets are removed to prevent interference with mapping and positioning. For static targets, such as fixed heavy equipment on the production line, pillars, walls, fire hydrants, etc., pose estimation is performed and they are added to the map.
Target-based positioning accuracy is not very high, but the improvement in robustness is very large. In many cases, especially in dynamic scenes, we can estimate the pose from just one target, or form some partial constraints on the full pose. This is crucial to improving the robustness of the robot's overall positioning.
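The intuition behind fusing a globally consistent but locally jittery laser estimate with a locally smooth visual estimate can be conveyed with a toy inverse-variance weighting example. The real system performs joint graph optimization; this sketch only shows the weighting idea, with invented numbers:

```python
# Inverse-variance fusion: the estimate with smaller variance gets more weight.

def fuse(est_a, var_a, est_b, var_b):
    """Fuse two scalar estimates; returns (fused value, fused variance)."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    fused = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)
    return fused, fused_var

def fuse_xy(laser_xy, laser_var, vision_xy, vision_var):
    """Fuse each coordinate of a 2D position independently (a simplification)."""
    return tuple(fuse(l, laser_var, v, vision_var)[0]
                 for l, v in zip(laser_xy, vision_xy))
```

Note that the fused variance is always smaller than either input variance, which is one way to see why combining the two maps improves positioning accuracy.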

There are also multi-layer maps. I have just mentioned several important challenges maps may face, the first being the dynamic environment, which includes changes in layout and in lighting. In a warehouse, the lighting in the morning, at noon and at night is very different; in addition, in different operating areas the placement of goods may change over time. So we create a multi-layered map. The picture above shows maps created for the same area in different time periods; we can put them together for joint optimization. Throughout the map, we constructed a static reference layer. In the fused positioning stage, we can decide which map layer to rely on based on the degree of matching and some priors.

Finally, let's look at a customer site. The left side of the picture above is one of its production lines. You can see many material boxes piled on both sides of the line. Every morning, robots transport these boxes here; as the operation progresses, they are transported to other locations for replenishment, so this scene is highly dynamic. The upper image in the middle is the laser map we built, the lower one is the visual map, and the red line is the visual positioning result.
Based on reliable and stable sensing and positioning technology, AMR robots can be deployed rapidly in 1-3 months and improve production line logistics efficiency by 2-3 times. With high return on investment, high reliability and agility, we believe AMR will find wider applications in the logistics field.
Finally, to summarize: we take AI technology as the core to achieve highly reliable AMR perception and positioning. We have repeatedly emphasized that reliability and intelligence are our two most important technical goals. Based on this, we hope to create efficient and stable intelligent logistics robot solutions that empower global companies to achieve intelligent upgrades in logistics.
