
Selected from GitHub. Author: Ting-Chun Wang et al. Compiled by Heart of Machines. Contributors: Liu Xiaokun, Wang Shuting.

While research on image-to-image synthesis is in full swing, NVIDIA has made a big move: together with MIT CSAIL, it has developed a direct video-to-video translation system. The system can synthesize realistic street-scene videos at 2K resolution from semantic-segmentation-mask videos, realistic face videos from sketch videos, and realistic dancing videos from pose sequences. Even more impressively, with a segmentation mask as input, simply recoloring the mask lets the system turn the trees in a street scene into buildings. The project is now open source.

  • Project page: https://tcwang0509.github.io/vid2vid/
  • Project repository: https://github.com/NVIDIA/vid2vid


Introduction

Simulating and re-rendering the dynamic visual world is essential for building intelligent agents. Beyond purely scientific interest, learning to synthesize continuous visual experiences has wide applications in computer vision, robotics, and computer graphics. For example, in model-based reinforcement learning, a video synthesis model that approximates visual dynamics allows agents to be trained with less real-world experience. With a learned video synthesis model, one can generate realistic videos without explicitly specifying scene geometry, materials, light transport, and their dynamics, which is tedious but necessary when using standard graphics rendering techniques.

Video synthesis takes many forms, including future video prediction and unconditional video synthesis. In this article, the authors study a new form: video-to-video synthesis. At its core, the goal is to learn a mapping function that converts an input video into an output video. To the best of current knowledge, although image-to-image synthesis is being studied intensively, a general-purpose solution for video-to-video synthesis has not yet been explored. The authors note that the method proposed in this paper is inspired by earlier special-purpose video synthesis methods.

The authors cast video-to-video synthesis as a distribution matching problem: the goal is to train a model such that, given an input video, the conditional distribution of the synthesized videos approximates the conditional distribution of real videos corresponding to that input. They use the generative adversarial learning framework to carry out this modeling.
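In schematic form, this distribution matching can be written as a conditional GAN objective; the notation below (source video s, real video x, synthesis network F, discriminator D) is an illustrative sketch rather than the paper's exact formulation:

```latex
% Schematic conditional-GAN objective for distribution matching:
% F maps a source video s to a synthesized video, and D tries to
% distinguish real pairs (s, x) from synthesized pairs (s, F(s)).
\min_{F} \max_{D} \;
  \mathbb{E}_{(s,x)}\!\left[\log D(s, x)\right]
  + \mathbb{E}_{s}\!\left[\log\left(1 - D\!\left(s, F(s)\right)\right)\right]
```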

Given paired input and output videos, the authors learn to map the input videos into the output domain. With carefully designed generator and discriminator networks and a new learning objective, the method learns to synthesize high-resolution, temporally coherent, photorealistic videos. The authors further extend the method to multimodal video synthesis: given the same input, the model can produce videos with different appearances.
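To make the training pattern concrete, here is a minimal PyTorch sketch of sequential, frame-by-frame conditional video synthesis trained with an adversarial loss. All module names, layer sizes, and training details are illustrative assumptions; this is not the authors' architecture, only the general recipe of conditioning each output frame on the corresponding source frame and the previously generated frame.

```python
# Illustrative sketch only: a toy conditional video GAN, NOT the vid2vid model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGenerator(nn.Module):
    """Generates frame t from the source frame s_t and the previous output x_{t-1}."""
    def __init__(self, src_ch=3, out_ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(src_ch + out_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, out_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, src_t, prev_out):
        return self.net(torch.cat([src_t, prev_out], dim=1))

class FrameDiscriminator(nn.Module):
    """Scores whether a (source frame, output frame) pair looks real."""
    def __init__(self, src_ch=3, out_ch=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(src_ch + out_ch, hidden, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, 1, 4, stride=2, padding=1),
        )

    def forward(self, src_t, frame_t):
        return self.net(torch.cat([src_t, frame_t], dim=1))

def train_step(gen, disc, g_opt, d_opt, src_video, real_video):
    """src_video, real_video: tensors of shape (T, C, H, W) for one clip."""
    T = src_video.shape[0]
    prev = torch.zeros_like(real_video[0:1])  # blank previous frame at t = 0
    fake_frames = []
    for t in range(T):                        # generate the clip sequentially
        prev = gen(src_video[t:t + 1], prev)
        fake_frames.append(prev)
    fake_video = torch.cat(fake_frames, dim=0)

    # Discriminator update: real pairs -> 1, synthesized pairs -> 0.
    d_real = disc(src_video, real_video)
    d_fake = disc(src_video, fake_video.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: try to fool the discriminator on every frame.
    g_fake = disc(src_video, fake_video)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

The actual system described in the paper is considerably richer (coarse-to-fine generators, multi-scale and video discriminators, flow prediction); the sketch only shows why conditioning on previous outputs, rather than generating each frame independently, is what makes the output temporally coherent.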

The authors conducted extensive experimental validation on several datasets, on the task of converting a sequence of segmentation masks into photorealistic videos. Both quantitative and qualitative results show that the footage synthesized by this method looks more realistic than that of strong baselines. They further demonstrate that the method can generate realistic 2K-resolution videos up to 30 seconds long, and that it gives users flexible, high-level control over the generated video. For example, a user can easily replace buildings with trees in a street-scene video. In addition, the authors extend the method to future video prediction, and the results show that it outperforms existing systems. The code, models, and additional results are available on the authors' website.


Figure 1: Results on Cityscapes. Top left: input. Top right: pix2pixHD. Bottom left: COVST. Bottom right: the method proposed in this paper.

Paper: Video-to-Video Synthesis


Paper Address: https://tcwang0509.github.io/vid2vid/paper_vid2vid.pdf

Abstract: We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (for example, a sequence of semantic segmentation masks) to an output photorealistic video that accurately depicts the content of the source video. The corresponding image problem, image-to-image synthesis, is a popular research topic, whereas video-to-video synthesis is rarely addressed in the literature. Without modeling temporal dynamics, directly applying existing image synthesis methods to an input video usually leads to temporally incoherent videos of low visual quality. In this paper, we propose a novel video-to-video synthesis approach under the generative adversarial network framework. Through carefully designed generator and discriminator architectures, combined with a spatio-temporal adversarial objective, we generate high-resolution, temporally coherent, photorealistic videos for a variety of input video formats, including segmentation masks, sketches, and pose maps. Experimental results on multiple benchmarks show that our approach is superior to strong baselines. In particular, our model can synthesize 2K-resolution street-scene videos up to 30 seconds long, significantly outperforming the current best video synthesis methods. Finally, we apply our method to future video prediction, and the results surpass several state-of-the-art systems.
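As a rough sketch of how such a spatio-temporal objective is typically composed (the exact terms and weights used in the paper may differ), one can combine a conditional image discriminator D_I on individual frames, a conditional video discriminator D_V on short clips of consecutive frames and their flow, and a flow estimation loss L_W with weight lambda_W:

```latex
% Illustrative spatio-temporal objective (not necessarily the paper's exact equation):
% D_I scores single output frames, D_V scores short clips of consecutive
% frames together with optical flow, and L_W supervises flow estimation.
\min_{F}\;\Big( \max_{D_I} \mathcal{L}_{I}(F, D_I) \;+\; \max_{D_V} \mathcal{L}_{V}(F, D_V) \Big)
  \;+\; \lambda_{W}\, \mathcal{L}_{W}(F)
```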

Experiments


Table 1: Comparison of video-to-video synthesis methods on the Cityscapes street-scene dataset.


Table 2: Ablation study. The authors compare the proposed method against three variants: one without the background-foreground prior, one that uses an unconditional video discriminator, and one without flow warping.
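For context, the "flow warping" ablated here refers to warping the previously generated frame with an estimated optical flow field so that unchanged content can be reused in the next frame. Below is a minimal, self-contained PyTorch sketch of such a warp; the flow convention and normalization are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch: warp the previous frame with a dense optical flow field.
import torch
import torch.nn.functional as F

def warp_with_flow(prev_frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """prev_frame: (N, C, H, W); flow: (N, 2, H, W) giving per-pixel (dx, dy) offsets."""
    n, _, h, w = prev_frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                                   # (N, 2, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                    # (N, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)

# Example: a zero flow field returns (approximately) the original frame.
frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
warped = warp_with_flow(frame, flow)
```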


Table 3: Comparison of future video prediction methods on the Cityscapes dataset.


Figure 2: Results on Apolloscape. Left: pix2pixHD. Middle: COVST. Right: the method proposed by the authors. The input semantic segmentation mask video is shown in the lower-left corner.
