The whole event was packed with hard-core content and a lot on show.
FSD, the Dojo supercomputer, and the Tesla Bot were the three highlights of the presentation.
However, what we are mainly concerned about is Tesla's new progress in FSD.
Previously, most companies followed Tesla's lead and took the progressive route to autonomous driving.
But with different capabilities and different understandings of autonomous driving, Tesla and a number of those companies have since parted ways.
Tesla has stuck to the pure-vision route, while other companies have gone all in on lidar. Even so, for most people Tesla remains the bellwether.
They are curious too: what progress has Tesla's pure-vision autonomous driving actually made? Before getting into the progress, let's first go over the basic framework of Tesla FSD.
Let's start with the scale of testing. In 2021, about 2,000 cars were in the FSD Beta test; this year that number has grown to 160,000.
Since last year, FSD has gone through 35 version updates and trained 75,000 neural network models (roughly one every 8 minutes).
Currently, FSD Beta can, to a certain extent, navigate from one point to another, automatically recognizing traffic lights, passing through intersections, making turns, and so on.
Let’s take a look at the basic Tesla FSD framework.
Everything is handled by single-vehicle intelligence: the environment model is generated by neural networks running on the car's cameras, and the vehicle is then planned and controlled based on that model.
This is a multi-camera neural network. From the images it receives, the system works out where things sit in physical-world coordinates.
In effect, it infers them. What we perceive as a picture is, to the camera, just a two-dimensional grid of pixels that has to be converted and encoded.
Then, through continued model training, the system learns to identify what the objects in the image are, such as trees, walls, and cars.
Of course, it identifies far more than that: there are also various semantic layers, covering lanes, traffic lights, stop lines, and so on.
After identifying objects, the system obtains their states and coordinates and predicts their subsequent movement.
These are very hard things for conventional computer vision to handle, so Tesla keeps digging into cutting-edge AI (including language-modeling techniques), pulling the most advanced methods from other fields and integrating them.
Even so, many objects still cannot be detected or identified accurately, which is where data annotation comes in; Tesla already has its own automatic annotation system.
In addition, Tesla uses its own simulation system to generate images, trains the model through the data-engine pipeline, and then deploys it in the car to see whether it works.
When a failure occurs, the team analyzes it, corrects the label, and adds the data to a large training set. Repeating this loop systematically drives problems down.
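To make that loop concrete, here is a minimal sketch of how one turn of such a data-engine iteration could be wired up. The structure and every name in it are illustrative assumptions of mine, not Tesla's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Clip:
    frames: list            # raw camera frames mined from the fleet
    label: dict = None      # annotation, filled in by auto-labeling

@dataclass
class TrainingSet:
    clips: List[Clip] = field(default_factory=list)

def data_engine_iteration(model, failure_clips: List[Clip], dataset: TrainingSet,
                          auto_label: Callable, train: Callable):
    """One turn of the loop: auto-label mined failures, grow the set, retrain."""
    for clip in failure_clips:
        clip.label = auto_label(clip)      # automatic annotation instead of manual labels
    dataset.clips.extend(failure_clips)    # failure cases become new training data
    return train(model, dataset)           # the retrained model goes back to the fleet
```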
To train these new, larger neural networks, Tesla has expanded its training infrastructure by 40-50% this year, bringing its U.S. training clusters to about 14,000 GPUs.
At the same time, Tesla has developed its own AI compiler to support the new operations these networks require and map them efficiently onto Tesla's underlying hardware.
Currently, Tesla's inference engine can distribute the execution of a single neural network across the two independent systems-on-chip of the autopilot computer, which essentially behave like two separate computers inside the same unit.
In other words, the same neural network is computing on both FSD chips, so the end-to-end latency of the new system has to be strictly controlled.
To that end, Tesla has written a lot of new code, all of which runs through the new networks to generate the vector space.
A model of the surroundings is then built around the car, and the planning system draws the driving trajectory from it. Through this combination of neural-network models, Tesla FSD is iterating rapidly.
The unprotected left turn is a classic hard problem in autonomous driving.
When making decisions and planning, an autonomous vehicle has to juggle many variables: it must sort out how they relate to one another and then work out the most reasonable way to proceed.
Here is a very representative scenario: an unprotected left turn while a pedestrian is crossing.
This essentially boils down to a joint planning problem over the ego trajectory and the trajectories of all the other agents.
The system has to work out the relationships between all of these objects in a very short time and then derive the most reasonable strategy for getting through.
And don't forget that the system also has to predict the subsequent movement of these objects, so the number of possible interaction combinations explodes.
The computation involved is huge, yet the planner needs to make a decision every 50 milliseconds.
Many companies cannot handle this scenario well, including dedicated driverless robotaxi operators. So how does Tesla do it?
Tesla uses a framework called Interaction Search to reason over this set of moving objects.
Its state space covers the ego vehicle's motion state, the motion states of the other agents, multi-modal predictions of their future behavior, and all the static entities in the scene.
Using a set of candidate motion trajectories, it can examine different interaction decisions in the scene, and new variables can be added to refine the decision further.
Take the intersection case just mentioned as an example: the search starts from a series of visual measurements.
As noted above, the amount of computation involved is enormous, so Tesla built a lightweight neural network for trajectory generation. The result: the running time per candidate action drops to about 100 microseconds, compared with 1-5 milliseconds before.
This improvement is very obvious.
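To give a feel for why a learned generator is so much cheaper than optimizing each candidate from scratch, here is a toy sketch in the same spirit: one cheap forward pass per candidate maneuver. The dimensions, maneuver set, and cost function are placeholders of my own, not Tesla's network.

```python
import torch
import torch.nn as nn

class TrajectoryGenerator(nn.Module):
    """Toy MLP mapping (ego state, candidate maneuver) to a short waypoint sequence."""
    def __init__(self, state_dim=8, num_maneuvers=4, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_maneuvers, 64),
            nn.ReLU(),
            nn.Linear(64, horizon * 2),           # (x, y) per future step
        )

    def forward(self, state, maneuver_onehot):
        out = self.net(torch.cat([state, maneuver_onehot], dim=-1))
        return out.view(-1, self.horizon, 2)

gen = TrajectoryGenerator()
state = torch.randn(1, 8)                          # toy ego state
costs = []
for m in range(4):                                 # expand each candidate maneuver
    onehot = torch.eye(4)[m].unsqueeze(0)
    traj = gen(state, onehot)                      # one cheap forward pass per candidate
    costs.append(traj.abs().sum())                 # placeholder cost; real scoring is richer
best = int(torch.stack(costs).argmin())            # pick the cheapest candidate
```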
Beyond that, Tesla also has a trajectory-scoring scheme aimed at improving comfort for FSD users.
The message is: FSD should not just get the maneuver done, it should do it well, so that the car drives as smoothly as a human driver would.
For this, Tesla runs two neural networks that reinforce each other.
One comes from FSD Beta and estimates how likely the behavior predicted for the next few seconds is.
The other comes from human driving data and scores the FSD system's performance against it, which helps Tesla keep optimizing the FSD experience.
Tesla said: "The coolest thing about this architecture is that it allows us to create a cool fusion between data-driven methods, without that much manual cost, but the verification and verification of results is still based on reality." The simple part of
is: Tesla's autonomous driving planning decision-making time is shorter, more capable, and better experience.
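As a rough illustration of how several such signals might be folded into a single score per candidate trajectory, here is a hedged sketch. The signal names, shapes, and weights are my own assumptions, not Tesla's actual cost terms.

```python
import torch

def score_trajectories(trajs, human_likeness_net, takeover_net):
    """trajs: (N, T, 2) candidate waypoint sequences; both nets map that to (N,) scores."""
    human_like = human_likeness_net(trajs)       # how close each candidate is to human driving
    takeover = takeover_net(trajs)               # how likely a driver would be to intervene
    rough = trajs.diff(dim=1).diff(dim=1).norm(dim=-1).mean(dim=1)  # second differences: crude comfort proxy
    return human_like - takeover - 0.1 * rough   # higher is better; weights are arbitrary

# Example with stand-in scoring functions:
cands = torch.randn(4, 10, 2)
best = score_trajectories(cands,
                          lambda t: -t.abs().mean(dim=(1, 2)),
                          lambda t: t.var(dim=1).mean(dim=1)).argmax()
```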
All of Tesla's perception rests on its 8 cameras.
Tesla gets image information from the cameras and then uses algorithms to achieve lidar-like 3D perception (going from image space to vector space).
The current FSD UI renders only part of this vector space:
Even so, it still cannot match a real lidar in accuracy.
For example, when a vehicle passes by, the object may fail to be recognized at all, a problem Tesla still found hard to solve as of last year:
Companies using lidar can detect objects and capture their motion easily and accurately.
But that is not easy to do with vision alone.
Tesla uses occupancy to capture the shape and occlusion of the 3D scene from vision. The occupancy network shown below is the result: what you see is the network's raw output from the system.
(This 3D model does not appear in the UI Tesla currently pushes to users, but it is cool nonetheless.)
Specifically, the occupancy network takes the video streams from all 8 cameras as input and directly produces a unified volumetric representation of the space around the vehicle, predicting for every 3D location the probability that it is occupied.
At the same time, drawing on the temporal context of the input video, it can also predict obstacles that are momentarily occluded.
At each location it also produces a set of semantic labels, which are color-coded in the visualization.
The result is the model shown in the figure below:
It can also predict occupancy flow, that is, how the occupied space is moving.
Because it reasons about occupancy rather than object categories, it does not need to distinguish static from dynamic objects, and it can even model randomly moving ones.
This network now runs on every Tesla FSD computer and is very efficient: using Tesla's accelerator, it runs roughly every 10 milliseconds.
So that, in brief, is part of how Tesla replaces lidar with pure vision.
Beyond occupancy itself, the network can also output road-surface information, such as the shape of the road (its slope, for instance) and road-surface semantics.
This will be of great help to system control.
Applied to a concrete case: in this picture, the 3D geometry of the ramp is predicted well. With that as input, the downstream system can decide whether it needs to slow down.
If this were handed to a company that relies on high-precision maps, this step would be trivial.
The HD-map provider has already encoded this road information, including slope, curve curvature and so on, into the map, so when the vehicle reaches the spot it can plan and control based on data it knows in advance.
So this is where Tesla has made good progress toward replacing high-precision maps.
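As a toy illustration of how predicted slope could feed control when there is no map prior to lean on, here is a tiny sketch. The thresholds and scaling factors are invented for the example.

```python
def target_speed(base_speed_mps: float, slope_ahead_rad: float) -> float:
    """Reduce the target speed when the predicted slope ahead is steep (toy thresholds)."""
    if abs(slope_ahead_rad) > 0.10:      # ~5.7 degrees, arbitrary cutoff
        return base_speed_mps * 0.6
    if abs(slope_ahead_rad) > 0.05:
        return base_speed_mps * 0.8
    return base_speed_mps

# e.g. a 15 m/s cruise speed drops to 9 m/s if the network predicts a steep crest ahead
print(target_speed(15.0, 0.12))
```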
How does the occupancy network get there? First, image data from the cameras is extracted and rectified, then RegNets and BiFPNs extract image features, and a set of 3D position queries is built. All the image features carry their own keys and values.
Through these keys and values, the system can work out what object lies ahead, even one that is partially occluded.
An attention module then produces high-dimensional spatial features, and these features are kept consistent over time by using the vehicle's own odometry to account for its motion.
These spatio-temporal features are passed through a deconvolutional network to produce the final occupancy and occupancy flow on a fixed-size voxel grid.
That fixed resolution, however, may not be precise enough for planning and control.
To get higher resolution, Tesla also generates per-voxel feature maps that can be treated as coordinates, and feeds these together with 3D point queries into an MLP (multi-layer perceptron) to obtain the occupancy and semantics of any point.
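Here is a heavily simplified sketch of that pipeline shape as a toy PyTorch module. Every size, and the specific use of learned voxel queries plus a point-wise MLP decoder, is an illustrative assumption of mine rather than Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class ToyOccupancyNet(nn.Module):
    def __init__(self, feat_dim=64, grid=8):
        super().__init__()
        self.grid = grid
        self.queries = nn.Parameter(torch.randn(grid ** 3, feat_dim))   # one query per coarse voxel
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.occ_head = nn.Linear(feat_dim, 1)                          # coarse occupancy logit per voxel
        self.point_mlp = nn.Sequential(                                 # implicit decoder for arbitrary points
            nn.Linear(feat_dim + 3, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, cam_tokens, points):
        """cam_tokens: (B, N, feat_dim) multi-camera features; points: (B, P, 3) queries in [0, 1)."""
        q = self.queries.unsqueeze(0).repeat(cam_tokens.size(0), 1, 1)
        voxel_feat, _ = self.attn(q, cam_tokens, cam_tokens)            # cross-attend image features into voxels
        coarse_occ = self.occ_head(voxel_feat).sigmoid()                # (B, grid^3, 1) coarse grid
        idx = (points.clamp(0, 0.999) * self.grid).long()               # which coarse voxel each point falls in
        flat = idx[..., 0] * self.grid ** 2 + idx[..., 1] * self.grid + idx[..., 2]
        per_point = torch.gather(voxel_feat, 1,
                                 flat.unsqueeze(-1).expand(-1, -1, voxel_feat.size(-1)))
        fine_occ = self.point_mlp(torch.cat([per_point, points], dim=-1)).sigmoid()
        return coarse_occ, fine_occ                                     # coarse grid + refined per-point occupancy

net = ToyOccupancyNet()
tokens = torch.randn(1, 100, 64)        # stand-in for fused multi-camera features
pts = torch.rand(1, 5, 3)               # five arbitrary 3D query points
coarse, fine = net(tokens, pts)
```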
Many people may be dizzy when they see this. Let’s take a look at this case:
The Tesla is driving along, and the bus ahead shows up as a red, roughly L-shaped block. As the car gets closer and the bus starts to move, the front of the bus turns from red to blue.
As time goes on, the whole bus turns blue, and you can even see the network precisely predicting the curvature of the bus as it pulls out to the left.
For a traditional object-detection network this is a very awkward case; it would end up fitting the bend with one or two cuboids.
The occupancy network, by contrast, only has to care about which space is occupied, so it can model that curvature accurately.
Add to that the recognition of curved road surfaces and the related semantics mentioned above.
Finally, the occupancy network is trained on a large auto-labeled dataset.
In addition, Tesla is also looking at other neural networks, such as NeRF (Neural Radiance Fields).
A quick explanation: a NeRF, or neural radiance field, is a technique for reconstructing a 3D scene from multiple images.
Let's go straight to an example: the shelf in front of you. By training a neural network on multiple images of it, you can reconstruct the shelf as a 3D scene and even render new views that were never in the original images.
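For readers who want to see the core mechanic, here is a compact sketch of standard NeRF-style volume rendering (generic NeRF math, nothing Tesla-specific). A full NeRF would learn the densities and colors with an MLP queried at samples along each camera ray.

```python
import torch

def render_ray(densities, colors, deltas):
    """densities: (S,) volume density per sample; colors: (S, 3); deltas: (S,) sample spacing."""
    alpha = 1.0 - torch.exp(-densities * deltas)                          # opacity of each sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                               # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                    # composited RGB for the pixel

# Toy usage: 16 random samples along one ray, 0.1 spacing
rgb = render_ray(torch.rand(16), torch.rand(16, 3), torch.full((16,), 0.1))
```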
Obviously, this technology is very appealing to Tesla.
Tesla is considering folding some of its capabilities into occupancy-network training.
Here is a demo they made of rendering the 3D world for autonomous driving.
But none of this is easy, and here, as elsewhere in the talk, Tesla invited people to join its autonomous driving team.
With a powerful model in hand, the next step is training it, and that requires a huge amount of video to learn from.
See this picture? It is not a glitch or TV static; it is video, about 140 million frames of it.
That is an enormous amount of data, and training on it takes a very long time; Tesla wants the training to go faster.
This is also why Tesla wants to build its own supercomputer.
Tesla has 3 supercomputers, a total of 14,000 GPUs, of which 10,000 are used for training, and another 4,000 are used for automatic annotation.
All of this video sits in a distributed video cache with a capacity of 30 PB.
These datasets are not static either: videos are constantly cycling in and out of the cluster every day, and the system handles 400,000 video instantiations per second.
Tesla has also done a lot of work to optimize video model training:
The result: with all of this accumulation and optimization, Tesla's occupancy-network training is now 2.3x faster.
Early on, Tesla performed lane instance segmentation in 2D image space with a very simple neural network that could only recognize a limited set of lane types.
That relatively simple lane modeling is fine for highly structured roads.
Now Tesla wants a system that copes with far more complex road conditions, one that not only generates the full set of lane instances but also how they connect.
Passing through an intersection is a good example.
A fairly common problem for assisted-driving vehicles is that inside an intersection the road itself is perfectly normal, yet there are no lane markings to guide the car.
That is exactly where Tesla needs to improve assisted driving, so it built its own lane neural network.
It consists of three components:
Take the map input: although it is just an ordinary map, not a high-precision one, it still provides a lot of basic attribute information,
such as lane topology, lane counts, the navigation route, and so on.
Which is worth spelling out: Tesla FSD does use maps, but ordinary navigation maps, not high-precision maps.
So there is really no need to keep asking whether Tesla uses maps for autonomous driving.
This information is encoded into a dense tensor and fed to the network, whose output is predicted as text in a special language Tesla developed itself. Call it the Language of Lanes; Tesla uses it to encode how the lanes connect to one another.
What does that look like in practice? See the picture:
This is the final lane network:
In short, this gives Tesla high-definition spatial understanding and a longer effective perception range without high-precision maps or lidar.
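To make the "lane language" idea concrete, here is a toy sketch of autoregressive token decoding. The vocabulary, the tiny GRU decoder, and the greedy loop are stand-ins of my own, since the talk did not spell out the actual model, which also attends to image features and the map tensor.

```python
import torch
import torch.nn as nn

VOCAB = ["<start>", "<end>", "node", "fork", "merge", "continue"]   # invented lane-token vocabulary

class LaneDecoder(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, len(VOCAB))

    def forward(self, tokens, hidden=None):
        x, hidden = self.rnn(self.embed(tokens), hidden)
        return self.head(x[:, -1]), hidden            # logits for the next lane token

decoder = LaneDecoder()
seq, hidden = [VOCAB.index("<start>")], None
for _ in range(20):                                    # greedy autoregressive decoding
    logits, hidden = decoder(torch.tensor([[seq[-1]]]), hidden)
    nxt = int(logits.argmax())
    seq.append(nxt)
    if VOCAB[nxt] == "<end>":
        break
lane_tokens = [VOCAB[i] for i in seq]                  # the lane graph, emitted one token at a time
```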
Predicting the future behavior of surrounding objects & path planning
Think of human drivers: when we drive, we subconsciously make these kinds of predictions all the time, watching the movements of the traffic participants around us (pedestrians, other vehicles, and so on) and then deciding on the next control action (accelerate, brake, stop).
Tesla shared two very good cases here that help show what it has done in this area.
First:
The Tesla is driving along normally when a car runs the red light and turns left across its path.
Throughout, the Tesla predicts the different actions that car might take next, and decides its own maneuver according to each of those possibilities.
The second:
There is a red light ahead, and a car has stopped for some reason while still quite far from the light.
Instead of mechanically stopping behind that car, the Tesla changes lanes in advance and moves into another lane.
It is a nicely judged maneuver; Tesla FSD is getting more human-like, and that deserves credit.
Tesla is building a real-time system, so it needs to maximize the frame rate of the perception stack so that Autopilot can react quickly to a changing environment; every millisecond of inference latency matters.
To that end, the neural network runs in two stages: the first determines the positions of objects present in 3D space; the second pulls out feature tensors at those 3D positions, attaches additional vehicle data, and performs the rest of the processing.
This lets the network concentrate its compute on the most important regions, giving better performance at a fraction of the latency cost.
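Here is a tiny illustration of the second-stage idea: gather features only at the proposed 3D locations instead of running the heavy head over the whole dense grid. The shapes and the linear "head" are arbitrary stand-ins, not Tesla's implementation.

```python
import torch

def sparse_second_stage(feature_grid, proposals, head):
    """feature_grid: (C, X, Y, Z) dense features; proposals: (K, 3) integer voxel indices."""
    x, y, z = proposals.T
    sparse_feats = feature_grid[:, x, y, z].T     # (K, C): features only where stage one proposed
    return head(sparse_feats)                     # heavy processing on K << X*Y*Z locations

grid = torch.randn(16, 32, 32, 8)                 # toy dense feature volume
props = torch.randint(0, 8, (5, 3))               # five proposed 3D locations
out = sparse_second_stage(grid, props, torch.nn.Linear(16, 4))
```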
Put together, Tesla's Autopilot vision stack can predict not only geometry and motion but also all kinds of semantics, which makes driving safer.
The FSD lane network described above is already running in the car, and Tesla has done a lot of work around it:
And it is far from the only network on board: there are also moving-object networks, occupancy networks, traffic-control and road-sign networks, path-planning networks, and more.
More new networks may well be added in the future.
The picture below is a visualization of the neural networks running in a Tesla:
It is quite a striking sight.
This stack really does amount to a digital brain for autonomous driving.
At the same time, Tesla has done a great deal of work to optimize latency:
So many neural networks need enormous amounts of data to feed them. Next, let's look at Tesla's progress in automatic data labeling.
Tesla has a labeling framework that supports all of these different types of networks.
Take the lane network as an example: to train it well and generalize it to all kinds of places, it needs data from a million trips or more, covering tens of millions of intersection crossings.
Raw data is not a problem for Tesla, since there are plenty of sources; the real challenge is turning all of it into a training set.
In the past year, Tesla has tried many data annotation methods:
Nowadays, Tesla is using new automatic annotation technology, and the efficiency is greatly improved.
Previously, labeling 10,000 trips by hand would have taken about 5 million hours of manual work; now the system completes it in just 12 hours. How?
Let's unpack it. The process breaks down into three steps:
Step one: recover high-precision trajectories and structure from the car's cameras using visual-inertial odometry. All features, including the ground, are inferred from the video by neural networks and then tracked and reconstructed in vector space.
Step two: multi-trip reconstruction, which is the core part. You can see how the trip shown earlier is reconstructed, matched against other trips through the same place, and then refined together.
Human analysts then come in to finalize the labels. Every step is fully parallelized on the cluster, so the whole process usually takes only about an hour.
The last step: automatically label new trips, simply by running the same multi-trip matching engine between the pre-built reconstruction and each new trip.
That is far cheaper than rebuilding everything from scratch.
It is also the key to the pipeline's scalability: it scales easily as long as compute and trip data are available.
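To illustrate the flavor of that matching step, here is a basic rigid-registration routine (classic Kabsch alignment, chosen by me purely as an illustration; it is not Tesla's actual matcher). It fits the rotation and translation that map feature points from a new trip onto corresponding points in the pre-built reconstruction, after which labels can be transferred in the shared frame.

```python
import numpy as np

def rigid_align(new_pts, recon_pts):
    """new_pts, recon_pts: (N, 3) corresponding 3D feature points from the trip and the reconstruction."""
    mu_a, mu_b = new_pts.mean(0), recon_pts.mean(0)
    H = (new_pts - mu_a).T @ (recon_pts - mu_b)     # cross-covariance of the centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                                  # optimal rotation (Kabsch)
    if np.linalg.det(R) < 0:                        # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_b - R @ mu_a
    return R, t                                     # rotation and translation into the shared frame

# Toy usage with synthetic correspondences:
pts = np.random.rand(50, 3)
R_true = np.eye(3)
R_est, t_est = rigid_align(pts, pts @ R_true.T + np.array([1.0, 2.0, 0.5]))
```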
Of course, automatic annotation is not used only for lanes; planning, the occupancy network, and much more are auto-labeled in the same way.
For autonomous driving, road testing matters a great deal because it exposes the system to real scenes.
But simulation is just as important a way of obtaining autonomous-driving data, and it can supply plenty of data that is hard to get in the real world.
The catch is that 3D scenes are notoriously slow to build. The simulation scene shown here, for example, took two weeks to construct; with the new tools, Tesla can build a comparable scene in 5 minutes, roughly 1,000 times faster.
First, Tesla feeds the automatically generated ground-truth labels into its simulation world-creation tool:
A road mesh is then generated and filled in along the lane-graph topology, carrying important road information such as slope, the surface materials of the intersection, and so on:
Lane lines are drawn onto the road:
Details are filled in next: plants and buildings are generated, complete with the visual occlusion they cause:
Then traffic signals are introduced:
Then road signs and guide markings:
Next, vehicles, pedestrians, and other traffic participants are added:
With just a few clicks, Tesla can create whatever simulation environment it wants:
Tesla can now easily generate simulations of most of San Francisco's streets:
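Purely as an illustration of what a declarative, procedurally built scene spec could look like, here is a small sketch that layers the scene in the order just described. The class and layer names are hypothetical, not Tesla's tool.

```python
from dataclasses import dataclass, field

@dataclass
class SimScene:
    lane_graph: dict                        # seeded from the auto-labeled lane network
    layers: list = field(default_factory=list)

    def add(self, layer_name, **params):
        self.layers.append((layer_name, params))
        return self                         # allow chaining in generation order

scene = (SimScene(lane_graph={"nodes": [], "edges": []})
         .add("road_mesh", slope=True, materials="asphalt")
         .add("lane_lines")
         .add("vegetation_and_buildings", occlusion=True)
         .add("traffic_signals")
         .add("signs_and_guide_markings")
         .add("traffic_participants", vehicles=20, pedestrians=8))
```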
To summarize.
Tesla showed a lot of content this year on FSD.
This AI Day showed us what Tesla has done to replace lidar and high-precision maps.
At the same time, with new neural networks coming online, Tesla FSD is becoming smarter and more human-like in its driving.
Tesla said: "The FSD software can currently handle traffic conditions in regions all over the world. If local regulatory policies allow it, we could launch the FSD Beta globally by the end of this year."
Two years ago Tesla removed the millimeter-wave radar, and now it is moving on the ultrasonic sensors as well.
Tesla really is committed to the pure-vision road to autonomous driving.
And its pure-vision program is advancing step by step toward the goals it has set.
One last question: can Tesla achieve autonomous driving with pure vision alone? I can't answer that for now.
All I can offer is the situation as I see it today: while most companies have turned to the lidar camp, Tesla remains a lone pioneer on the pure-vision route, giving the industry another reference point for how autonomous driving might be realized.
Not everyone dares to bet their future on an uncertain tomorrow, but Tesla does, and that courage to innovate is admirable.
OK, that's the FSD progress from this year's Tesla AI Day.
If you found it useful, a like, a share, and a follow would be a big help to my work.