## Implementation details

Both MDM and CDM adopt a linear beta schedule with 1,000 time steps for the diffusion process. MDM is trained to predict the clean target x0, whereas CDM predicts the added noise. For sampling, both models accelerate inference with DDIM: MDM uses 2 sampling steps, CDM uses 100, and eta is set to 0 in both cases, making sampling deterministic.

## Dataset

We use the image rectangling dataset DIR-D [1], whose samples are triplets consisting of a stitched image (x), a mask (M_x), and a rectangling label (x_R), where the mask marks the white border area within the stitched image. Owing to the synthesis strategy of [1], the stitched images cover diverse scenes, have irregular boundaries, and come with accurate labels. More specifically, the training pairs are produced in the following two ways:
For DIR-D, a total of 5,839 training images and 519 testing images are prepared, each of size 384×512, and we train the diffusion models on this dataset.

[1] Deep rectangling for image stitching: A learning baseline. CVPR 2022.
[2] Unsupervised deep image stitching: Reconstructing stitched features to images. TIP 2021.
[3] Parallax-tolerant image stitching based on robust elastic warping. TMM 2018.
[4] Rectangling panoramic images via warping. TOG 2013.
[5] Microsoft COCO: Common objects in context. ECCV 2014.

## Network Architecture for MDM

In Table 1, we present the network architecture of MDM. Within the encoder, each ResNet block combines the noised feature x (up-/down-sampled rectangling motion fields) with the condition c (stitched images and stitching masks from the dataset). The concatenated input then passes through a spatial attention block and a down-sampling layer that halves the resolution and doubles the channel count. In the intermediate layer, the channel count and resolution remain unchanged while bottleneck features are extracted. The decoder mirrors the encoder: features and conditions are first processed by two ResNet blocks, followed by a spatial attention block; finally, the feature resolution is doubled while the channel count is halved.

## Network Architecture for CDM

As shown in Table 2, the CDM network takes a 7-channel input and produces a 3-channel output. The input comprises the tensors x and c, where x is the noised rectangling image (3 channels) and c consists of the stitched image and its mask from the dataset (4 channels: 3 RGB plus 1 mask). The network is similar to MDM but differs in the number of blocks in the encoder, intermediate, and decoder stages.
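As a concrete illustration of the 7-channel input layout described above, the sketch below concatenates the noised image and the condition along the channel axis. The array names, the use of NumPy, and the random placeholder data are assumptions for illustration; the actual model operates on batched training tensors.

```python
import numpy as np

# Illustrative assembly of the 7-channel CDM input:
#   x -- noised rectangling image, 3 channels
#   c -- condition: stitched RGB image (3 channels) + border mask (1 channel)
# The DIR-D resolution 384x512 is used; all names are hypothetical.
H, W = 384, 512
x_noised = np.random.randn(3, H, W).astype(np.float32)  # noised target image
stitched = np.random.rand(3, H, W).astype(np.float32)   # stitched RGB condition
mask = np.zeros((1, H, W), dtype=np.float32)            # 1 where the white border is

cdm_input = np.concatenate([x_noised, stitched, mask], axis=0)
print(cdm_input.shape)  # (7, 384, 512): 3 noised + 3 RGB + 1 mask
```

MDM's encoder combines x and c the same way, except that its x holds motion fields rather than image channels.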
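The diffusion settings given in the implementation details (a linear beta schedule over 1,000 steps, DDIM sampling with eta = 0) can be sketched as follows. The schedule endpoints (1e-4 and 0.02) are the standard DDPM defaults and are an assumption, since the text only states that the schedule is linear; the function names are likewise illustrative. The sketch uses the noise parameterization (as in CDM) and notes how an x0-predicting network (as in MDM) fits the same update.

```python
import numpy as np

# Linear beta schedule over 1,000 diffusion steps. The endpoints 1e-4 and
# 0.02 are the common DDPM defaults (an assumption; only "linear" is stated).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, t, t_prev):
    """One deterministic DDIM update (eta = 0, so no fresh noise is added).

    eps_pred is the network's noise prediction (the CDM case). A network
    that predicts x0 directly (the MDM case) plugs into the same update via
    eps = (x_t - sqrt(alpha_bar_t) * x0) / sqrt(1 - alpha_bar_t).
    t_prev = -1 denotes the final step to the clean sample.
    """
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    # x0 estimate implied by the current sample and the noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # With eta = 0 the move toward t_prev is fully deterministic.
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_timesteps(num_steps, T=1000):
    """Evenly strided timesteps, e.g. MDM's 2 steps or CDM's 100 steps."""
    return list(range(0, T, T // num_steps))[::-1]
```

A full sampler would loop `ddim_step` over consecutive pairs from `ddim_timesteps`, appending -1 as the last `t_prev`; for example, `ddim_timesteps(2)` yields `[500, 0]` for MDM's two-step schedule.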