Masking Cascaded Self-Attentions for Few-Shot Font-Generation Transformer

Jing Ma, Xiang Xiang, Yan He; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2734-2750

Abstract


Few-shot Font Generation (FFG) aims to generate realistic font images from only a few reference samples and is widely used in artistic character design, handwriting imitation, and identification. Typically, convolutional neural networks (CNNs) are employed to learn style and content representations from font images. However, owing to the locality of convolutional operations, CNNs struggle to capture the global structure of fonts, so the generated images suffer from blurry components and distorted spatial layouts. To address this problem, we cascade self-attention modules to exploit long-range dependencies for font generation and propose a transformer-based approach called FGTr. Following the style-content disentanglement paradigm, FGTr uses two separate transformer encoders to extract style and content sequences, and a multi-layer transformer decoder merges the two sequences to generate the target images. To smooth the transitions at patch edges, we introduce a Local Self-Attention Mask (LSAM) that restricts the attention scope of each patch to a fixed-size sliding window and plugs into the Transformer with no extra parameters. We also propose an Auxiliary Generation Module (AGM) that helps generate glyphs closer to real ones. Extensive experiments demonstrate the effectiveness and superiority of our method over state-of-the-art CNN-based models.
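The abstract's most concrete mechanism is the LSAM: a parameter-free mask that limits each patch's attention to a fixed-size sliding window over the patch grid. Below is a minimal PyTorch sketch of one plausible reading of that idea; the 2D grid layout, the 14x14 patch grid, the 5x5 window, and the use of a boolean attention mask are illustrative assumptions, not the authors' implementation.

import torch

def local_self_attention_mask(grid: int, window: int) -> torch.Tensor:
    """Boolean mask of shape (grid*grid, grid*grid); True = attention blocked.

    Patch i at (ri, ci) may attend to patch j at (rj, cj) only when both
    |ri - rj| and |ci - cj| are at most window // 2.
    """
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid), torch.arange(grid), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2)                    # (N, 2) patch coordinates
    delta = (coords[:, None, :] - coords[None, :, :]).abs()
    half = window // 2
    return (delta > half).any(dim=-1)                 # block pairs outside the window

# The mask adds no parameters: it is simply passed to standard attention.
mask = local_self_attention_mask(grid=14, window=5)   # 14x14 patches, 5x5 window
attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(2, 14 * 14, 256)                      # (batch, patches, dim)
out, _ = attn(x, x, x, attn_mask=mask)                # mask restricts attention scope

Because the mask is a fixed boolean tensor, the same construction can be reused in every cascaded self-attention layer, which matches the abstract's claim that LSAM plugs into the Transformer with no extra parameters.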

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Ma_2024_ACCV,
    author    = {Ma, Jing and Xiang, Xiang and He, Yan},
    title     = {Masking Cascaded Self-Attentions for Few-Shot Font-Generation Transformer},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2734-2750}
}