Monocular-to-3D Virtual Try-On using Deep Residual U-Net
Hasib Zunair1
1Concordia University, Montreal, QC, Canada.

COMP 6381, Fall 2021
Digital Geometric Modelling
[Paper]
[Video]
[Slides]
Here are some results of our method on out-of-distribution images. Given the reference person image (left) and the target clothing image (middle), our method reconstructs the 3D try-on mesh (right) with the clothing changed and the person's identity retained.

TL;DR: Res-M3D-VTON is a pipeline for monocular-to-3D virtual try-on (VTON) of fashion clothing. It uses residual learning to synthesize the correct clothing parts, preserve clothing logos, and reduce artifacts, ultimately producing better textured 3D try-on meshes.

Abstract

3D virtual try-on aims to synthetically fit a target clothing image onto a 3D human shape while preserving realistic details such as the pose and identity of the person. Existing methods heavily depend on annotated 3D shapes and garment templates, which limits their practical use. While 2D virtual try-on is an alternative, it ignores 3D body information and cannot fully represent the human body. Recently, M3D-VTON was proposed to generate textured 3D try-on meshes from only 2D images of the person and clothing by formulating the 3D try-on problem as 2D try-on plus depth estimation. However, the synthesis model in the M3D-VTON pipeline uses a plain U-Net architecture, which we hypothesize is insufficient to synthesize body parts and model the complex relation between the front and back of clothing from the 2D clothing image alone, ultimately leading to unrealistic 3D try-on results. We improve on this by introducing residual units into the existing synthesis model. Studying their effect shows that they improve the 2D try-on outputs, mainly by differentiating between the front and back of the clothing, preserving clothing logos, and reducing artifacts, which in turn yields better textured 3D try-on meshes. Benchmarking our method on the MPV3D dataset shows that it significantly outperforms previous work.

Comparison of 2D and 3D try-on mesh outputs with recent state-of-the-art M3D-VTON.


Method

Overview of the proposed framework (left) with an illustration of a plain unit and its residual counterpart (right). The left image is taken from the M3D-VTON paper. This is an overview of the 3D virtual try-on pipeline that we build on (left); it involves several components, the major ones being monocular prediction, depth refinement, and texture fusion.

The monocular prediction module produces the warped clothing, a person segmentation, and double (front/back) depth maps, which give a base 3D shape. The depth refinement module produces refined depth maps that capture the warped clothing details as well as the high-frequency details that the previous module over-smooths. The texture fusion module merges the warped clothing with the unchanged person parts to output the 2D try-on result. After obtaining the 2D try-on image and depth maps, we unproject the front-view and back-view depth maps to 3D point clouds and triangulate them with screened Poisson reconstruction. Since the try-on image and depth maps are spatially aligned, the try-on image can be used to color the front side of the mesh. For the back texture, the image is inpainted using the fast marching method, filling the face area with the surrounding hair color, and is then mirrored to texture the back side of the mesh. This achieves the monocular-to-3D conversion, producing a reconstructed 3D try-on mesh with the clothing changed and the person's identity retained.
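The reconstruction step can be summarized in a few lines. The sketch below is ours, not the original M3D-VTON code: it assumes an orthographic unprojection of the front/back depth maps, uses Open3D for screened Poisson reconstruction, and uses OpenCV's fast-marching inpainting (cv2.INPAINT_TELEA) for the back texture; the function names, resolutions, and parameters are illustrative.

```python
# Minimal sketch of the monocular-to-3D step (our illustration, not the
# original M3D-VTON implementation). Assumes orthographic unprojection and
# Open3D / OpenCV as the reconstruction and inpainting backends.
import cv2
import numpy as np
import open3d as o3d

def unproject(depth, flip_z=False):
    """Lift an HxW depth map to an (H*W)x3 point set on the pixel grid."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    zs = -depth if flip_z else depth  # back view faces the opposite direction
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3).astype(np.float64)

def reconstruct_mesh(front_depth, back_depth):
    """Fuse front/back point clouds and run screened Poisson reconstruction."""
    points = np.concatenate(
        [unproject(front_depth), unproject(back_depth, flip_z=True)]
    )
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.estimate_normals()
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
    return mesh

def back_texture(tryon_img, face_mask):
    """Inpaint the face region with fast marching, then mirror for the back side."""
    filled = cv2.inpaint(tryon_img, face_mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
    return cv2.flip(filled, 1)  # horizontal mirror
```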

We improve the texture fusion module, since it combines all the previous outputs to produce the final 2D try-on result, and errors in this step adversely affect the final textured 3D try-on mesh. We do this by adding residual connections, shown on the right.
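To make the change concrete, the sketch below contrasts a plain U-Net double-convolution unit with its residual counterpart in PyTorch. The channel sizes, normalization layers, and the 1x1 shortcut are illustrative assumptions rather than the exact configuration of our synthesis model.

```python
# Illustrative comparison of a plain U-Net unit and a residual unit.
import torch
import torch.nn as nn

class PlainUnit(nn.Module):
    """conv-norm-relu x2, as in a standard U-Net block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class ResidualUnit(nn.Module):
    """Same convolution stack, plus an identity (or 1x1) shortcut."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))
```

Swapping each plain unit for its residual counterpart keeps the overall U-Net topology of the texture fusion module unchanged while giving every block an identity path for gradients and low-level detail.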

 [GitHub] Demo on custom images


Extensive results for 2D virtual try-on

Visual comparison of 2D try-on outputs with M3D-VTON.
Here we show examples of the final try-on outputs compared to previous work. In many cases the baseline model is unable to differentiate between the front and back of the clothing, tends to change the skin color of the person, and fails to preserve the logo of the clothing image. We attribute this to the limited capacity of the plain U-Net architecture employed in the baseline. In comparison, the proposed method generates realistic try-on results that differentiate between the front and back of the clothing and preserve the clothing logo, while also reducing artifacts in non-target body parts such as skin.


Paper and Supplementary Material

H. Zunair
Monocular-to-3D Virtual Try-On using Deep Residual U-Net.
COMP 6381 Digital Geometric Modeling Project Paper, Fall 2021.
(Report)




Acknowledgements

We thank Dr. Tiberiu Popa for useful discussions during the development of this project. We also thank Concordia University and Compute Canada for providing computational resources and support that contributed to these research results. Some of the computing for this project was performed on the Graham cluster. This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.