Implementation details

Both MDM and CDM adopt a linear beta schedule with 1000 time steps for the diffusion process. MDM is trained to predict the clean target x0, whereas CDM is trained to predict the added noise.
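To make this setup concrete, below is a minimal sketch of the shared forward process and the two training objectives, assuming a standard DDPM formulation; the beta endpoints (1e-4, 0.02) and all function names are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

T = 1000                                   # number of diffusion time steps
betas = torch.linspace(1e-4, 0.02, T)      # linear beta schedule (assumed endpoints)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product, \bar{alpha}_t

def q_sample(x0, t, noise):
    """Forward process: draw x_t ~ q(x_t | x_0) for a batch of time indices t."""
    a = alpha_bars[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

# MDM regresses the clean target x0; CDM regresses the added noise.
def loss_mdm(model, x0, c, t):
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, c, t), x0)     # x0-prediction objective

def loss_cdm(model, x0, c, t):
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, c, t), noise)  # noise-prediction objective
```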

For sampling, both MDM and CDM accelerate inference with DDIM. Specifically, MDM employs 2 sampling steps, while CDM uses 100 sampling steps. The value of eta is set to 0 for both MDM and CDM, which makes the sampling deterministic.
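The following is a minimal sketch of deterministic DDIM sampling (eta = 0) over a sub-sampled step schedule, handling both parameterizations (x0 prediction for MDM, noise prediction for CDM). The update rule is standard DDIM; the function signature and schedule construction are assumptions.

```python
import torch

@torch.no_grad()
def ddim_sample(model, c, shape, alpha_bars, num_steps, predicts_x0):
    """DDIM sampling with eta = 0; num_steps = 2 for MDM, 100 for CDM."""
    T = alpha_bars.shape[0]
    ts = torch.linspace(T - 1, 0, num_steps).long()      # sub-sampled time schedule
    x = torch.randn(shape)
    for i, t in enumerate(ts):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else torch.tensor(1.0)
        out = model(x, c, t.expand(shape[0]))
        if predicts_x0:                                   # MDM: model outputs x0
            x0_hat = out
            eps = (x - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()
        else:                                             # CDM: model outputs noise
            eps = out
            x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # eta = 0 removes the stochastic term, so the update is deterministic
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return x
```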

Dataset

We use the image rectangling dataset DIR-D [1], whose samples are triplets consisting of a stitched image (𝑥), a mask (M𝑥), and a rectangling label (𝑥R), where the mask indicates the white border area within the stitched image. Owing to the synthesis strategy of [1], the stitched images exhibit diverse scenes, irregular boundaries, and accurate labels. More specifically, training pairs are produced in the following two ways:

  • Real stitched images with synthesized rectangular labels. Images from the UDIS-D dataset [2] are stitched by ELA [3] to produce a large number of real stitched images. The corresponding rectangular images are then computed with the algorithm of He et al. [4]. After careful manual filtering, the first part of the training set is obtained.
  • Synthesized stitched images from real rectangular images. From the first step, various mesh deformations and their inverse deformations can be collected. The inverse deformations are then leveraged to transform real images sampled from the MS-COCO dataset [5] into stitched images. Likewise, a strict manual verification process is carried out to form this part of the training set.

In total, DIR-D provides 5,839 training images and 519 testing images, each of size 384×512. We train the diffusion models on this dataset.
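For reference, a minimal sketch of loading the (𝑥, M𝑥, 𝑥R) triplets with PyTorch is shown below; the folder names are assumptions about the on-disk layout, not the dataset's documented structure.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class DIRD(Dataset):
    def __init__(self, root, split="training"):
        # "input" = stitched images x, "mask" = M_x, "gt" = rectangling labels x_R
        # (assumed folder names)
        self.dirs = {k: os.path.join(root, split, k) for k in ("input", "mask", "gt")}
        self.names = sorted(os.listdir(self.dirs["input"]))
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        name = self.names[i]
        x = self.to_tensor(Image.open(os.path.join(self.dirs["input"], name)))
        m = self.to_tensor(Image.open(os.path.join(self.dirs["mask"], name)))
        x_r = self.to_tensor(Image.open(os.path.join(self.dirs["gt"], name)))
        return x, m, x_r  # each image is 384x512
```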

[1] Deep rectangling for image stitching: a learning baseline, CVPR 2022

[2] Unsupervised deep image stitching: Reconstructing stitched features to images, TIP 2021

[3] Parallax-tolerant image stitching based on robust elastic warping, TMM 2018

[4] Rectangling panoramic images via warping, TOG 2013

[5] Microsoft COCO: Common objects in context, ECCV 2014

Network Architecture for MDM

In Table 1, we present the network architecture for MDM.

Within the Encoder, each ResnetBlock fuses the noised feature x (rectangling motion fields at the current up/down-sampled resolution) with the conditions c (the stitched image and stitching mask from the dataset) via channel-wise concatenation. The fused features are then passed through a spatial attention block and a down-sampling layer that halves the resolution and doubles the channel count.
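The sketch below illustrates one such encoder stage under the description above: a ResnetBlock that consumes x concatenated with c, followed by spatial self-attention and a stride-2 convolution that halves resolution and doubles channels. All module definitions are illustrative assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(8, in_ch), nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch), nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # residual projection

    def forward(self, h):
        return self.block(h) + self.skip(h)

class SpatialAttention(nn.Module):
    def __init__(self, ch, heads=4):
        super().__init__()
        self.norm = nn.GroupNorm(8, ch)
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)

    def forward(self, h):
        b, c, H, W = h.shape
        q = self.norm(h).flatten(2).transpose(1, 2)       # (B, H*W, C) tokens
        out, _ = self.attn(q, q, q)                       # self-attention over pixels
        return h + out.transpose(1, 2).view(b, c, H, W)   # residual connection

class EncoderStage(nn.Module):
    def __init__(self, x_ch, c_ch, out_ch):
        super().__init__()
        self.res = ResnetBlock(x_ch + c_ch, out_ch)       # fuse x with conditions c
        self.attn = SpatialAttention(out_ch)
        # stride-2 conv: halve H and W, double the channel count
        self.down = nn.Conv2d(out_ch, out_ch * 2, 3, stride=2, padding=1)

    def forward(self, x, c):
        h = self.attn(self.res(torch.cat([x, c], dim=1)))
        return self.down(h)
```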

In the Intermediate layer, the channel count and resolution remain unchanged, and bottleneck features are extracted.

The Decoder mirrors the Encoder. Features and conditions are first processed by two ResnetBlocks, followed by a spatial attention block; an up-sampling layer then doubles the feature resolution and halves the channel count.

Network Architecture for CDM

In Table 2, we present the network architecture for CDM. The network takes a 7-channel input and produces a 3-channel output. The input is the concatenation of the tensors x and c, where x is the noised rectangling image with 3 channels, and c comprises the stitched image and corresponding mask from the dataset, totaling 4 channels (3 RGB channels and 1 mask channel). The network follows the same design as MDM but differs in the number of blocks in the Encoder, Intermediate, and Decoder stages.
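A minimal sketch of how the 7-channel CDM input is assembled per the description above (3-channel noised image plus a 4-channel condition); the function and variable names are illustrative.

```python
import torch

def cdm_input(noised_image, stitched_image, mask):
    # noised_image:   (B, 3, H, W)  noised rectangling image x
    # stitched_image: (B, 3, H, W)  condition: RGB stitched image
    # mask:           (B, 1, H, W)  condition: white-border mask
    return torch.cat([noised_image, stitched_image, mask], dim=1)  # (B, 7, H, W)

x = torch.randn(2, 3, 384, 512)
c_img = torch.rand(2, 3, 384, 512)
c_mask = torch.rand(2, 1, 384, 512)
assert cdm_input(x, c_img, c_mask).shape == (2, 7, 384, 512)
```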