[New Zhiyuan Introduction] To give video-call users a better experience, and to win over more AR and VR users to the metaverse, Meta's AI R&D team recently developed an AI model that handles virtual backgrounds better.

2025/06/26 14:29:38 hotcomm

Editor: Yuan Xie Layan


Since the beginning of the COVID-19 pandemic, most people have become accustomed to remote video calls with friends, colleagues, and family, and many have used virtual backgrounds during those chats.

Changing the background during a video call gives users control over how their surroundings appear, reduces environmental distractions, protects privacy, and can even make them look more energetic on camera.


But sometimes the virtual background does not behave as the user expects. Most people have seen the virtual background cover their face when they move, or fail to find the boundary between a hand and the table.

Recently, Meta used an improved AI model to segment images, optimizing the AR effects of the background-blur and virtual-background features in its products. The model can better distinguish the different parts of a photo or video.

Researchers and engineers from Meta AI, Reality Labs, and other Meta departments formed a cross-departmental team and recently developed a new image segmentation model, which is already used in real-time video calls on platforms such as Portal, Messenger, and Instagram, and in Spark AR's augmented reality applications.

The team also optimized a two-person image segmentation model, which has been deployed on Instagram and Messenger.

How AI improves virtual backgrounds

In optimizing image segmentation, the team faced three major challenges:

1. The AI must learn to recognize people in many different conditions: dark environments, varied skin tones, skin tones close to the background color, unusual poses (such as bending over to tie shoelaces, or stretching), partial occlusion, and motion.

2. Edges must look smooth, stable, and coherent. These properties are rarely discussed in the research literature, but user feedback studies show they strongly affect people's experience of background effects.

3. The model must run flexibly and efficiently on billions of smartphones around the world, not only on the small number of state-of-the-art phones equipped with the latest processors.

Moreover, the model must support a variety of aspect ratios, so it works equally well on laptops, on Meta's Portal video-calling devices, and on phones in both portrait and landscape mode.


Example virtual backgrounds processed by Meta's AI model: a head-and-shoulders image on the left and a full-body image on the right.

Challenges of a real-world person segmentation model

The concept of image segmentation is easy to understand, but high-precision person segmentation is hard to achieve. To get good results, the model must be extremely consistent and have extremely low latency.

Incorrectly segmented output produces distracting visual glitches for users of virtual backgrounds. Worse, segmentation errors can unintentionally expose parts of the user's real physical environment.

For these reasons, a segmentation model must exceed 90% accuracy before it can ship in a real product. The standard accuracy measure is Intersection over Union (IoU): the ratio of the overlap between the predicted segmentation and the ground truth to their union.
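As a minimal sketch of the metric just described, IoU between two binary person masks can be computed like this (the example masks are illustrative, not Meta's data):

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over Union between two binary person masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, truth).sum() / union

# Two 4x4 masks with 3 foreground pixels each, overlapping on 2:
# intersection = 2, union = 4, so IoU = 0.5
pred  = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
truth = np.array([[1, 1, 0, 0],
                  [0, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0]])
print(iou(pred, truth))  # 0.5
```

A score of 1.0 means the prediction matches the ground truth exactly; the 90% product threshold mentioned above corresponds to an IoU of 0.9.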

Due to the huge variety of usage scenarios, closing the last 10% of segmentation quality is far harder than everything that came before.

Meta's software engineers found that once IoU reaches 90%, the measurable image metrics tend to saturate, and temporal consistency and spatial stability become hard to improve further.

To overcome this obstacle, Meta developed a video-based measurement system that, together with several other metrics, addresses this additional difficulty.

Developing AI training and measurement strategies for real-world applications

An AI model can only learn from the data it is given. To train a high-precision segmentation model, it is not enough to feed it large numbers of videos of users sitting upright in bright rooms; the samples must be as varied as the real world.

Meta AI used its own ClusterFit model to mine usable data from massive samples spanning different genders, skin tones, ages, body postures, movements, complex backgrounds, and multi-person scenes.

Metrics computed on static images do not accurately reflect how well a model handles dynamic video in real time, because real-time models usually need a tracking mode that depends on temporal information. To measure real-time quality, Meta AI designed a quantitative video evaluation framework that computes per-frame metrics as the model predicts each picture.

Unlike the idealized settings of research papers, Meta's person segmentation model is judged by massive numbers of everyday users. If the output shows jagged, warped, or otherwise unsatisfying effects, it is useless no matter how far other metrics exceed the benchmark.

So Meta AI asked its own product users directly how they rated the segmentation effects. The result: rough and blurred edges hurt the user experience the most.

In response, Meta AI added a new boundary IoU metric to the video evaluation framework. Once the ordinary IoU exceeds 90% and is nearly saturated, boundary IoU is the metric that deserves more attention.

Insufficient temporal consistency also produces visual artifacts at the edges of the mask, which again hurts the user experience. Meta AI measures temporal consistency in two ways.

First, Meta researchers assume that two immediately adjacent frames are almost identical, so any difference between the model's predictions for them indicates temporal inconsistency in the final output.

Second, the researchers looked at the foreground motion between two immediately adjacent frames. Optical flow in the foreground can warp the model's prediction for frame N forward to frame N+1; the researchers then compared this warped prediction with the actual prediction for frame N+1. The differences computed by both methods are expressed as IoU scores.
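The second measurement can be sketched with a simplifying assumption: instead of a dense optical-flow field, a single (dy, dx) translation warps the frame-N mask before comparing it with frame N+1. This is only an illustration of the idea, not Meta's actual pipeline.

```python
import numpy as np

def temporal_consistency(prev_pred, next_pred, flow=(0, 0)):
    """IoU between the frame-N mask warped by a (dy, dx) flow and
    the frame-N+1 mask. A real system warps per pixel using a dense
    optical-flow field; one translation keeps the sketch simple."""
    dy, dx = flow
    warped = np.roll(np.roll(prev_pred.astype(bool), dy, 0), dx, 1)
    nxt = next_pred.astype(bool)
    union = (warped | nxt).sum()
    return 1.0 if union == 0 else (warped & nxt).sum() / union

# A mask that shifts one pixel right between frames is perfectly
# consistent once warped by the matching flow
a = np.zeros((5, 5), bool)
a[1:4, 1:3] = True
b = np.roll(a, 1, 1)
print(temporal_consistency(a, b, flow=(0, 1)))  # 1.0
```

A score below 1.0 after warping means the model's predictions changed in ways the scene motion does not explain, which is what appears on screen as flicker.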

Meta AI evaluated the model on 1,100 videos drawn from more than 100 demographic subgroups across roughly 30 categories, covering all combinations of gender and skin tone on the Fitzpatrick scale.

The analysis showed that Meta's AI model achieves similarly high accuracy across all demographic subgroups, at a confidence level above 95%. The IoU differences between categories are mostly within about 0.5 percentage points: excellent, reliable performance.


IoU results for Meta's AI model on videos of different skin tones and genders

Optimizing the model

Architecture

Meta researchers used FBNet V3 as the backbone of the optimized model, paired with a decoder structure that fuses multiple layers of the same spatial resolution.

The researchers designed an architecture with a heavier encoder and a lightweight decoder, which performs better than a fully symmetric design. The resulting architecture was found via neural architecture search and is highly optimized for on-device speed.


Semantic segmentation model architecture. Green rectangles represent convolutional layers; black circles represent the fusion points between layers.

Data learning

Researchers used a large offline PointRend model to generate pseudo ground-truth labels for unannotated data, increasing the amount of training data. They also used teacher-student semi-supervised training to reduce bias in the pseudo-labels.
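The pseudo-labeling step can be sketched as follows: the teacher's soft mask is thresholded into labels, and low-confidence pixels are marked as "ignore" so the student never trains on the teacher's uncertain predictions. The thresholds and ignore convention here are illustrative assumptions, not Meta's published values.

```python
import numpy as np

def pseudo_label(teacher_probs, lo=0.3, hi=0.7):
    """Turn a teacher model's soft foreground mask into pseudo
    ground truth: confident pixels become 0/1 labels, uncertain
    pixels become -1 ('ignore') and are masked out of the
    student's loss. Thresholds are illustrative."""
    labels = np.full(teacher_probs.shape, -1, dtype=np.int8)
    labels[teacher_probs >= hi] = 1   # confidently foreground
    labels[teacher_probs <= lo] = 0   # confidently background
    return labels

probs = np.array([[0.95, 0.50],
                  [0.10, 0.80]])
print(pseudo_label(probs))
```

Training the student only where the label is not -1 is one common way such schemes keep teacher bias out of the student, which matches the stated goal above.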

Aspect-ratio-dependent resampling

Traditional deep learning pipelines resample each image into a small square before feeding it to the neural network. This resampling distorts the image, and since each frame can have a different aspect ratio, the amount of distortion varies too.

The presence and degree of this distortion cause the network to learn unstable low-level features, a limitation that is amplified in segmentation applications.

As a result, if most training images are in portrait proportions, the model performs much worse on real-world images and videos with other aspect ratios.

To solve this, the team adopted Detectron2's aspect-ratio-dependent resampling, which groups images with similar aspect ratios and resamples each group to a common size.
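The grouping step can be sketched like this: each image shape is assigned to the bucket whose aspect ratio is closest, and every image in a bucket is then resized to that bucket's target shape, so no image is stretched far from its native proportions. The bucket list is an illustrative assumption, not Detectron2's exact configuration.

```python
from collections import defaultdict

def group_by_aspect_ratio(shapes,
                          buckets=((16, 9), (4, 3), (1, 1), (3, 4), (9, 16))):
    """Assign each (h, w) image shape to the (h, w) bucket whose
    width/height ratio is closest, so each group can be resampled
    to one common size with minimal distortion."""
    groups = defaultdict(list)
    for h, w in shapes:
        ratio = w / h
        best = min(buckets, key=lambda b: abs(b[1] / b[0] - ratio))
        groups[best].append((h, w))
    return dict(groups)

shapes = [(1920, 1080), (640, 480), (512, 512), (1080, 1920)]
print(group_by_aspect_ratio(shapes))
```

Batching within a bucket also keeps tensor shapes uniform, which is why detection frameworks use the same trick for efficiency as well as accuracy.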


Left: the baseline image, distorted by its irregular aspect ratio; right: the improved image after processing by the AI model.

Customized boundary padding

Aspect-ratio-dependent resampling requires padding images of similar aspect ratios to a common size, but the commonly used zero padding produces artifacts.

Worse, as network depth increases, the artifacts spread into other areas. The usual remedy is replicate padding, which copies the border pixels outward.

A recent study shows that reflect padding in convolutional layers can further improve model quality by minimizing artifact propagation, though at a corresponding cost in latency. Examples of the artifacts and of how they are removed are shown below.
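The three padding modes just discussed are easy to compare on a 1-D signal with numpy: zero padding inserts a hard step at the border that deeper layers can mistake for a real edge, while replicate ("edge") and reflect padding continue the signal smoothly.

```python
import numpy as np

row = np.array([4.0, 5.0, 6.0])

# Zero padding: abrupt 6 -> 0 step at the border (artifact source)
print(np.pad(row, 2, mode="constant"))  # [0. 0. 4. 5. 6. 0. 0.]

# Replicate padding: repeats the border value
print(np.pad(row, 2, mode="edge"))      # [4. 4. 4. 5. 6. 6. 6.]

# Reflect padding: mirrors the signal around the border
print(np.pad(row, 2, mode="reflect"))   # [6. 5. 4. 5. 6. 5. 4.]
```

In a convolutional network the same choice appears as the layer's padding mode (for example, PyTorch's `Conv2d` accepts `padding_mode="zeros"`, `"replicate"`, or `"reflect"`), with reflect costing slightly more than zeros, matching the latency trade-off described above.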


Tracking

Temporal inconsistency causes prediction differences between consecutive frames, producing flicker, which badly damages the user experience.

To improve temporal consistency, the researchers designed a scheme called "Mask Detection". The model receives the three channels of the current frame (YUV) plus a fourth channel.

For the first frame, the fourth channel is simply an empty (all-zero) matrix; for every subsequent frame, it holds the prediction from the previous frame.

The researchers found that this fourth-channel tracking strategy significantly improves temporal consistency. They also borrowed ideas from state-of-the-art tracking models, such as CRVOS and transform-invariant CNNs, to make the segmentation more stable over time.
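Assembling the four-channel input described above can be sketched as follows; the function name and array layout are assumptions for illustration, not Meta's actual API.

```python
import numpy as np

def make_model_input(yuv_frame, prev_mask=None):
    """Stack the current YUV frame (H, W, 3) with the previous
    frame's predicted mask as a fourth channel. The first frame
    gets an all-zero channel, matching the 'empty matrix'
    described in the article."""
    h, w, _ = yuv_frame.shape
    if prev_mask is None:
        prev_mask = np.zeros((h, w), dtype=yuv_frame.dtype)
    return np.concatenate([yuv_frame, prev_mask[..., None]], axis=-1)

frame = np.random.rand(4, 4, 3).astype(np.float32)
first = make_model_input(frame)  # (4, 4, 4), mask channel all zero
later = make_model_input(frame, np.ones((4, 4), np.float32))
print(first.shape, later[..., 3].mean())
```

Feeding the previous prediction back in gives the network a strong prior on where the person was a moment ago, which is what damps frame-to-frame flicker.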


"Mask Detection" method flow chart

Boundary Cross-entropy

Constructing smooth, clear boundaries is crucial for AR segmentation. Besides the standard cross-entropy loss used for segmenting images, researchers must also consider a boundary-weighted loss.

Researchers found that the interior of an object is easier to segment, which is why the authors of the UNet model and most of its later variants recommend trimap-weighted losses to improve model quality.

However, trimap weighting has a limitation: the trimap computes the boundary region only from the ground truth, so it is insensitive to misjudgments elsewhere; it is an asymmetric weighted loss.

Inspired by Boundary IoU, the researchers extracted boundary regions from both the ground truth and the predictions, and applied cross-entropy loss within those regions. Models trained with this boundary cross-entropy clearly beat the baseline.
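A symmetric boundary-weighted cross-entropy along these lines can be sketched with numpy. The band extraction, the weight value, and the use of a weight map rather than a hard region mask are illustrative assumptions; the key point, matching the text, is that the band comes from the prediction as well as the ground truth, so errors on either side of the edge are penalized.

```python
import numpy as np

def boundary_band(mask, width=1):
    """Dilate minus erode: the band of pixels near the mask edge.
    np.roll wraps at borders, which is fine for a sketch."""
    m = mask.astype(bool)
    grow, shrink = m.copy(), m.copy()
    for _ in range(width):
        grow = grow | np.roll(grow, 1, 0) | np.roll(grow, -1, 0) \
                    | np.roll(grow, 1, 1) | np.roll(grow, -1, 1)
        shrink = shrink & np.roll(shrink, 1, 0) & np.roll(shrink, -1, 0) \
                        & np.roll(shrink, 1, 1) & np.roll(shrink, -1, 1)
    return grow & ~shrink

def boundary_cross_entropy(probs, truth, weight=5.0, eps=1e-7):
    """Per-pixel binary cross-entropy, up-weighted on the union of
    the predicted and ground-truth boundary bands (unlike a
    truth-only trimap, which ignores mispredicted boundaries)."""
    p = np.clip(probs, eps, 1 - eps)
    t = truth.astype(float)
    bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))
    band = boundary_band(probs > 0.5) | boundary_band(truth)
    w = np.where(band, weight, 1.0)
    return (w * bce).sum() / w.sum()
```

Because the band follows the prediction too, a confident but misplaced edge raises the loss even when the ground-truth trimap would have missed it.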

In addition to making the boundary regions of the output mask sharper, the new model has a lower false-positive rate.

With the new AI model, Meta's virtual background processing gains new capabilities and becomes more stable and more versatile. These optimizations improve the quality and coherence of background filters, and thereby the effects delivered in Meta's products.

For example, the optimized segmentation model can identify multiple people and full bodies in a scene, including full-body portraits partly hidden behind sofas, desks, or dining tables.

Beyond video calls, this technology can add new dimensions to AR and VR by blending virtual environments with people and objects in the real world, an application that matters especially for building the metaverse and creating immersive experiences.

Reference: https://ai.facebook.com/blog/creating-better-virtual-backdrops-for-video-calling-remote-presence-and-ar/
