Document Image Rectification using Stable Diffusion Transformer

Kumari, Pooja; Das, Sukhendu

Pooja Kumari, Sukhendu Das; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 3396-3405

Abstract

Document images captured with handheld devices often suffer from geometric distortions caused by perspective variations, warping, and lens-induced aberrations. These distortions degrade text readability, OCR accuracy, and automated document analysis, making effective rectification essential. Traditional approaches, such as 3D reconstruction-based flattening and convolutional neural network (CNN)-based warping prediction, have shown promising results but struggle with complex, non-uniform distortions and long-range dependencies in document structures. In this paper, we propose a novel Conditional Stable Diffusion Transformer-based framework designed specifically for document image rectification. Unlike conventional UNet-based diffusion models, which rely on hierarchical convolutional operations, our transformer-based architecture provides a global receptive field through self-attention, enabling precise structural preservation and text alignment. Furthermore, we incorporate cross-attention conditioning, allowing the model to integrate auxiliary information for improved rectification accuracy. To enhance efficiency and robustness, we introduce a coarse rectification stage, based on control points and a thin plate spline, that estimates an initial globally aligned structure before the diffusion-based refinement process. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art rectification accuracy while keeping inference time comparable to that of existing deep learning-based solutions. Our proposed framework establishes a new paradigm for document image rectification by combining transformer-based modeling, generative diffusion processes, and conditional guidance, making it effective across a wide range of document distortions.
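
The abstract does not describe how the coarse rectification is implemented. As a rough illustration of the general technique only, the sketch below fits a standard 2D thin plate spline to matched control points with NumPy and evaluates it at new coordinates. The function names (fit_tps, apply_tps), the kernel convention U(r) = r^2 log(r^2), and the example control points are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def _tps_kernel(r2):
    """Radial basis U(r) = r^2 * log(r^2), with U(0) defined as 0."""
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out


def fit_tps(src, dst):
    """Fit a thin plate spline mapping src control points onto dst.

    src, dst: (N, 2) arrays of matched points (hypothetical inputs; in a
    rectification pipeline they would come from a control-point predictor).
    Returns (weights, affine) with shapes (N, 2) and (3, 2).
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = src.shape[0]

    # Pairwise squared distances between control points.
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = _tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])  # affine basis [1, x, y]

    # Standard TPS linear system: [[K, P], [P^T, 0]] [w; a] = [dst; 0].
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst

    params = np.linalg.solve(A, b)
    return params[:n], params[n:]


def apply_tps(points, src, weights, affine):
    """Evaluate the fitted spline at arbitrary (M, 2) points."""
    points = np.asarray(points, dtype=np.float64)
    src = np.asarray(src, dtype=np.float64)
    d2 = np.sum((points[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = _tps_kernel(d2)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return U @ weights + P @ affine


if __name__ == "__main__":
    # Illustrative control points: unit-square corners plus the center.
    src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
    dst = src + np.array(
        [[0.0, 0.0], [0.1, 0.0], [0.0, -0.1], [0.05, 0.05], [0.0, 0.02]]
    )
    w, a = fit_tps(src, dst)
    # The spline interpolates the control points exactly (up to round-off).
    print(np.abs(apply_tps(src, src, w, a) - dst).max())
```

In a rectification setting, one would typically fit the spline from a regular grid in the flattened image to the predicted control points in the distorted image, evaluate it at every output pixel, and resample the input at those coordinates. That backward-mapping choice is a common convention, not something the abstract states.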

Related Material

@InProceedings{Kumari_2025_CVPR,
    author    = {Kumari, Pooja and Das, Sukhendu},
    title     = {Document Image Rectification using Stable Diffusion Transformer},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3396-3405}
}