Generating images with accurately rendered text, particularly in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as injecting hint condition maps through auxiliary networks (e.g., ControlNet or LoRA), have made strides toward addressing this issue. Yet diffusion models still fall short on tasks requiring controlled text generation, such as specifying a particular font or producing text in small font sizes. In this work, we introduce JoyType, a novel approach for multilingual visual text synthesis designed to preserve the font style of text throughout the image generation process. We began by assembling a training dataset of 1M samples, each consisting of an image, its description, and a Canny map of the text within the image. We then developed a text glyph control network, Font ControlNet, to extract glyph information that steers the image generation process. To further improve the model's ability to preserve text glyphs, notably when generating small-font text, we incorporate a multi-layer OCR-aware loss into the diffusion process. This enhancement allows the model to guide generation with low-level descriptors, yielding text that is clearer and more legible. Evaluations on both visual and accuracy metrics demonstrate that JoyType significantly outperforms existing state-of-the-art (SOTA) methods. In addition, JoyType can serve as a plugin, enabling the creation of varied image styles in combination with other Stable Diffusion base models on Civitai. This versatility has been demonstrated across three application areas: creative cards, health cards, and marketing images, all with satisfactory results.
Figure 1 presents the overall framework of our method, covering data collection, the training pipeline, and the inference pipeline. In the data collection phase, we leveraged the open-source CapOnImage2M dataset, selecting a subset of 1M images. For each selected image, we employed a visual language model (e.g., CogVLM) to generate a textual description, thereby obtaining a prompt associated with the image. We then applied the Canny algorithm to extract edges from the text regions within the image, producing a Canny map. The training pipeline comprises three primary components: the latent diffusion module, the Font ControlNet module, and the loss design module. During training, the raw image, Canny map, and prompt are fed into the Variational Autoencoder (VAE), Font ControlNet, and text encoder, respectively. The loss function is split into two parts: one in latent space and one in pixel space. In latent space, we use the loss $L_{LDM}$ of Latent Diffusion Models as defined in the original paper. The latent features are then decoded back into images by the VAE decoder. In pixel space, the text regions of the predicted image and the ground-truth image are cropped and passed through an OCR model independently. We extract the convolutional-layer features from the OCR model and compute the Mean Squared Error (MSE) between the features at each layer, yielding the loss $L_{ocr}$. During inference, the image prompt is fed into the text encoder, while the textual content and the specified text regions are provided to Font ControlNet. The final image is then produced by the VAE decoder.
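The pixel-space term can be illustrated with a short sketch. The following is a minimal PyTorch sketch of a multi-layer OCR-aware loss, assuming a generic convolutional OCR backbone kept frozen as a feature extractor; `ocr_backbone`, `conv_layers`, and the cropped text tensors are illustrative placeholders rather than the exact DuGuangOCR interface.

```python
import torch
import torch.nn.functional as F

def multilayer_ocr_loss(ocr_backbone, conv_layers, pred_crops, gt_crops):
    """Multi-layer OCR-aware MSE loss between predicted and ground-truth text
    crops, using features tapped from the OCR backbone's convolutional layers
    via forward hooks."""
    features = {"pred": [], "gt": []}

    def make_hook(bucket):
        def hook(_module, _inputs, output):
            features[bucket].append(output)
        return hook

    # Run both sets of crops through the frozen OCR backbone, collecting the
    # intermediate feature maps of the selected convolutional layers.
    for bucket, crops in (("pred", pred_crops), ("gt", gt_crops)):
        handles = [layer.register_forward_hook(make_hook(bucket))
                   for layer in conv_layers]
        with torch.set_grad_enabled(bucket == "pred"):
            ocr_backbone(crops)  # forward pass only to populate the hooks
        for h in handles:
            h.remove()

    # Sum the per-layer MSE between predicted and ground-truth features.
    loss = pred_crops.new_zeros(())
    for f_pred, f_gt in zip(features["pred"], features["gt"]):
        loss = loss + F.mse_loss(f_pred, f_gt.detach())
    return loss
```

In training, this term would be weighted and added to $L_{LDM}$ to form the total objective, with the OCR backbone used purely as a fixed feature extractor.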
Table 1 and Table 2 demonstrate the ability of our JoyType model to maintain character shapes. Specifically, we employed 40 different fonts as hint conditions for our model and generated creative images. Sentence Accuracy (Acc) and Normalized Edit Distance (NED) are used as evaluation metrics for how reliably the characters in an image can be recognized. For recognition, we employ an open-source OCR model, DuGuangOCR (ModelScope, 2023).
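As a reference point, Sentence Accuracy and a normalized edit distance can be computed as in the sketch below; note that some works report NED as a similarity (one minus the normalized distance), so this shows one common convention rather than necessarily the exact definition used in the tables.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sentence_accuracy(preds, gts):
    """Fraction of OCR predictions that exactly match the rendered text."""
    return sum(p == g for p, g in zip(preds, gts)) / max(len(gts), 1)

def normalized_edit_distance(preds, gts):
    """Mean edit distance normalized by the longer string's length."""
    total = sum(edit_distance(p, g) / max(len(p), len(g), 1)
                for p, g in zip(preds, gts))
    return total / max(len(gts), 1)
```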
Typographic White Background Image refers to an image in which the text is printed with the selected font on a white background, while Generated Image refers to an image generated by our JoyType. To ensure a fair evaluation of each font, both types of images use the same font size. The closer the metrics on the Generated Image are to those on the Typographic White Background Image, the stronger the model's ability to preserve the font's character shapes.
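A typographic white-background image of this kind can be produced with a simple rendering routine; the sketch below uses PIL, and the font path, canvas size, and text position are placeholders rather than the exact evaluation settings.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_on_white(text, font_path, font_size,
                         image_size=(512, 512), xy=(32, 32)):
    """Render `text` with the selected font on a plain white canvas,
    mirroring the Typographic White Background Image used for evaluation."""
    canvas = Image.new("RGB", image_size, color=(255, 255, 255))
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype(font_path, font_size)  # a .ttf/.otf font file
    draw.text(xy, text, font=font, fill=(0, 0, 0))
    return canvas

# The same font size would be used for the typographic reference image and
# for the corresponding text region in the generated image, e.g.:
# ref = render_text_on_white("生日快乐", "fonts/SomeArtisticFont.ttf", 48)
```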
In Table 1, we used fonts that are generally easy to recognize, while in Table 2 we used less recognizable artistic fonts. The experimental results show that our model preserves character shapes well for both types of fonts.
Table 3 presents a comparison between current state-of-the-art (SOTA) methods and our JoyType. In this comparison, our method uses the ArialUnicode font to create the hint condition maps. JoyType outperforms the other competitors on the Acc, NED, and FID metrics, achieving the best overall performance. The Acc and NED results indicate that the text generated by JoyType is more recognizable, which we attribute mainly to JoyType's ability to preserve the shapes of font characters. We further found that more detailed prompt descriptions help to generate higher-quality images. We used CogVLM to generate captions for the images in the benchmark dataset and then used these new descriptions as prompts for JoyType, denoted JoyType_w_cogvlm, which noticeably improves the FID metric from 34 to 26.75.
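As an illustration of how a font-specific hint condition map can be built at inference time, the sketch below renders text inside a given region and extracts its Canny edges for Font ControlNet; the font path (e.g., an ArialUnicode .ttf file), the font-size heuristic, and the Canny thresholds are placeholders, not the exact pipeline settings.

```python
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def glyph_canny_hint(text, region, font_path="fonts/ARIALUNI.TTF",
                     image_size=(512, 512)):
    """Render `text` inside `region` (x, y, w, h) with the chosen font and
    return a single-channel Canny edge map to use as the hint condition."""
    x, y, w, h = region
    canvas = Image.new("RGB", image_size, color=(255, 255, 255))
    draw = ImageDraw.Draw(canvas)
    # Font size chosen so the glyphs roughly fill the region height.
    font = ImageFont.truetype(font_path, int(h * 0.8))
    draw.text((x, y), text, font=font, fill=(0, 0, 0))
    gray = cv2.cvtColor(np.array(canvas), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # thresholds are illustrative
    return edges
```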