We want to construct novel views from one or more source images. I.e., given these images, can we move the camera?
You have a couple of options:
One way of representing 3D structure is depth. Assuming there is no ground-truth depth available for your images, maybe you can guess depth somehow. Getting good depth estimates from a single image is an entire subfield by itself! ("single image depth estimation")
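For instance, here's a hedged sketch of "guess depth somehow" using a pretrained monocular depth network. It assumes the intel-isl/MiDaS torch.hub entry point and a placeholder `source.png`; note that MiDaS outputs relative inverse depth, not metric depth.

```python
import cv2
import torch
import torch.nn.functional as F

# Pretrained monocular depth model via torch.hub (assumes the intel-isl/MiDaS
# hub entry point; any single-image depth estimator would do here).
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("source.png"), cv2.COLOR_BGR2RGB)  # placeholder path
batch = transforms.dpt_transform(img)  # resize + normalize into a (1, 3, h, w) batch

with torch.no_grad():
    pred = midas(batch)  # relative inverse depth, shape (1, h, w)
    depth = F.interpolate(
        pred.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze()  # back to the source image resolution
```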
Recent work like SynSin learns this end-to-end by building a differentiable renderer into the pipeline; the renderer is how the 3D prior gets incorporated.
An interesting approach would be to see if we can do the same thing without this renderer. If so, we'd need to represent the 3D structure in some other way in order to feed it to the second part of the pipeline.
This bit is more flexible :)
The main idea of SynSin is that a differentiable renderer lets the image-to-point-cloud feature mapping be learned efficiently, end to end. Using a point cloud (via a depth network) also gives you an explicit 3D representation of the scene.
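Sketched in code, the forward pass is roughly: predict depth and per-pixel features, lift the features into a point cloud, rigidly transform it into the target view, splat it with a differentiable point renderer, and refine the result. This is only a conceptual sketch; the sub-modules passed in below (depth_net, feature_net, refine_net, render_point_cloud) are placeholders, not SynSin's actual implementations.

```python
import torch
import torch.nn as nn


def unproject(depth, K):
    """Lift a depth map (B, 1, H, W) to camera-space points (B, H*W, 3)."""
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # (H*W, 3)
    rays = pix @ torch.inverse(K).transpose(-1, -2)  # broadcasts to (B, H*W, 3)
    return rays * depth.reshape(B, -1, 1)


def transform_points(points, T):
    """Apply rigid transforms T (B, 4, 4) to points (B, N, 3)."""
    R, t = T[:, :3, :3], T[:, :3, 3]
    return points @ R.transpose(1, 2) + t.unsqueeze(1)


class SynSinStyleSynthesis(nn.Module):
    """Conceptual SynSin-style forward pass; all sub-modules are stand-ins."""

    def __init__(self, depth_net, feature_net, refine_net, render_point_cloud):
        super().__init__()
        self.depth_net = depth_net        # image -> per-pixel depth
        self.feature_net = feature_net    # image -> per-pixel features
        self.refine_net = refine_net      # splatted features -> RGB
        self.render = render_point_cloud  # differentiable point renderer (e.g. soft splatting)

    def forward(self, src_img, K, T_src_to_tgt):
        depth = self.depth_net(src_img)            # (B, 1, H, W)
        feats = self.feature_net(src_img)          # (B, C, H, W)
        pts = unproject(depth, K)                  # point cloud in the source camera frame
        pts = transform_points(pts, T_src_to_tgt)  # "move the camera"
        # The renderer is the differentiable bit: gradients from the final
        # image loss flow through it into both the depth and feature nets.
        splatted = self.render(pts, feats.flatten(2).transpose(1, 2), K)
        return self.refine_net(splatted)
```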
Pretty cool!
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs (2018)
Can generate images of KITTI and Cityscapes!
Monocular Neural Image Based Rendering with Continuous View Control (2019)
From the paper: "In this paper, we propose a novel learning pipeline that determines the output pixels directly from the source color but forces the network to implicitly reason about the underlying geometry."
So the pipeline has to do two things: reason about the scene geometry implicitly, while still taking the output pixel values directly from the source image (see the sketch below for the general flavour).
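Concretely, the flavour is something like this: lift each target pixel into 3D with a (predicted) depth value, project it into the source camera using the known relative pose, and bilinearly sample the source colours at that location, so everything stays differentiable and the output pixels really do come straight from the source image. This is a generic depth-based warping sketch under my own conventions, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F


def warp_source_to_target(src_img, tgt_depth, K, T_tgt_to_src):
    """Backward-warp source colours into the target view.

    src_img:      (B, 3, H, W) source colours
    tgt_depth:    (B, 1, H, W) depth predicted for the *target* view
    K:            (B, 3, 3) camera intrinsics
    T_tgt_to_src: (B, 4, 4) pose taking target-camera points into the source camera
    """
    B, _, H, W = tgt_depth.shape
    device, dtype = tgt_depth.device, tgt_depth.dtype

    # Pixel grid in the target view, lifted to 3D with the predicted depth.
    v, u = torch.meshgrid(
        torch.arange(H, device=device, dtype=dtype),
        torch.arange(W, device=device, dtype=dtype),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], -1).reshape(-1, 3)  # (H*W, 3)
    pts = (pix @ torch.inverse(K).transpose(-1, -2)) * tgt_depth.reshape(B, -1, 1)

    # Move the points into the source camera and project them to pixel coords.
    R, t = T_tgt_to_src[:, :3, :3], T_tgt_to_src[:, :3, 3]
    pts_src = pts @ R.transpose(1, 2) + t.unsqueeze(1)
    proj = pts_src @ K.transpose(1, 2)
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)  # avoid divide-by-zero

    # Normalise to [-1, 1] and sample source colours (fully differentiable).
    grid = torch.stack([uv[..., 0] / (W - 1), uv[..., 1] / (H - 1)], -1) * 2 - 1
    return F.grid_sample(src_img, grid.view(B, H, W, 2), align_corners=True)
```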
A lot of these works cite Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox, who tackled single-image novel view synthesis with a CNN back in 2016.
KITTI: in total there are 18,560 images for training and 4,641 for testing. Training pairs are constructed by randomly selecting the target view from among the 10 nearest frames of the source view; the relative transformation is obtained from the global camera poses.
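A sketch of that pairing scheme, assuming one camera-to-world pose matrix per frame (the exact pose convention and the definition of "nearest" are my assumptions, not the paper's):

```python
import random
import numpy as np


def sample_training_pair(poses, src_idx, num_nearest=10):
    """Pick a target view among the temporally nearest frames of the source view.

    poses: sequence of (4, 4) global camera-to-world matrices, one per frame.
    Returns (tgt_idx, T_src_to_tgt).
    """
    n = len(poses)
    # The `num_nearest` frames closest in time to the source frame.
    candidates = sorted((i for i in range(n) if i != src_idx),
                        key=lambda i: abs(i - src_idx))[:num_nearest]
    tgt_idx = random.choice(candidates)
    # Relative transform from the global poses: points in the source camera
    # frame re-expressed in the target camera frame.
    T_src_to_tgt = np.linalg.inv(poses[tgt_idx]) @ poses[src_idx]
    return tgt_idx, T_src_to_tgt
```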
Self-supervised Single-view 3D Reconstruction via Semantic Consistency
Another way to do 3D reconstruction is to predict the depth from input images. This is a pretty common approach.