GitHub - ldj7672/SDXL-LoRA-Fine-tuning-for-Ghibli-Style: This project fine-tunes the Stable Diffusion XL (SDXL) model with LoRA and runs experiments on generating Ghibli-style images.

ldj7672
2025.06.08
GitHub · by Anonymous
#SDXL #LoRA #Fine-tuning #Ghibli Style #Image Generation

Key Points

  • This project fine-tunes Stable Diffusion XL (SDXL) with LoRA to generate Ghibli-style images, sharing experimental results obtained with limited resources and a small dataset.
  • Despite using only 100 images, the fine-tuned model effectively captured Ghibli aesthetics in color palette and atmosphere, although character forms were occasionally distorted, a known limitation of LoRA-based style transfer.
  • The study notes limitations such as insufficient training due to limited computing resources and restricted generalization due to the small dataset, and positions the results as educational and research reference.

This project documents an experiment in fine-tuning the Stable Diffusion XL (SDXL) model to generate images in the distinct Studio Ghibli artistic style using Low-Rank Adaptation (LoRA). The primary objective is to demonstrate the efficacy of LoRA for style transfer on a large pre-trained generative model, even with limited computational resources and a small dataset.

The core methodology revolves around LoRA, an efficient fine-tuning technique that adapts pre-trained models by injecting trainable low-rank matrices into the attention layers of a large language model or diffusion model. Instead of fine-tuning all model parameters, LoRA freezes the original pre-trained weights and optimizes only these new, much smaller, low-rank matrices. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA introduces two low-rank matrices, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where $r \ll \min(d, k)$ is the LoRA rank. The update to the weight matrix is then $\Delta W = BA$, which reduces the number of trainable parameters from $d \times k$ to $r(d + k)$. During inference, the original and updated weights are combined as $W_0 + BA$. A scaling factor, LoRA alpha ($\alpha$), is typically applied to the update, often as $\frac{\alpha}{r} BA$, so that the influence of the low-rank term is normalized across different ranks. This approach makes fine-tuning computationally less expensive and reduces the storage requirements for trained models.
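The update above can be sketched in a few lines of NumPy. This is a minimal illustration of the parameter savings; the dimensions and hyperparameter values below are assumed for illustration and are not taken from the repository:

```python
import numpy as np

# Illustrative LoRA update for one frozen weight matrix.
# d, k, r, and alpha are assumed values, not the project's settings.
d, k = 1280, 1280        # shape of the frozen weight W0
r, alpha = 8, 16         # LoRA rank and scaling alpha

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))        # frozen pre-trained weight (never updated)
B = np.zeros((d, r))                    # low-rank factor, initialized to zero
A = rng.standard_normal((r, k)) * 0.01  # low-rank factor, small random init

delta_W = (alpha / r) * (B @ A)         # scaled low-rank update: (alpha/r) * BA
W_eff = W0 + delta_W                    # effective weight used at inference

# Trainable-parameter comparison: full fine-tuning vs. LoRA
full_params = d * k                     # every entry of W0
lora_params = d * r + r * k             # only the entries of B and A
print(f"full: {full_params}, lora: {lora_params}")
```

Because $B$ starts at zero, $\Delta W$ is zero at initialization, so training begins exactly at the pre-trained model; with these assumed dimensions, the adapter trains roughly 1% of the parameters of the full matrix.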

The experimental setup utilized Stable Diffusion XL Base 1.0 as the foundational model. The dataset comprised 100 Ghibli-style images gathered from the web. For each image, corresponding text captions were automatically generated using Google's Gemini model via the utils/creat_text_caption.py script, a crucial step for training text-to-image diffusion models. The fine-tuning process used the LoRA method with a fixed learning rate of $1 \times 10^{-4}$, a resolution of 512×512 pixels, a batch size of 1, and 100 training epochs. The key experimental variables were the LoRA rank ($r$) and LoRA alpha ($\alpha$), which were tested across multiple combinations: ranks of 4, 8, and 16, and alpha values of 4, 8, 16, and 32.
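The rank/alpha combinations described above can be enumerated with a short sweep. This sketch assumes a full Cartesian grid over the reported values, which the summary does not confirm (the project may have run only a subset):

```python
from itertools import product

# Assumed full grid over the reported values; the project may have run fewer combinations.
ranks = [4, 8, 16]
alphas = [4, 8, 16, 32]

runs = [
    {"rank": r, "alpha": a, "scale": a / r}  # scale = alpha / rank, applied to BA
    for r, a in product(ranks, alphas)
]

for run in runs:
    print(run)
```

A full grid yields 12 runs. The effective scale $\alpha / r$ is what the model actually sees: values above 1 amplify the adapter's contribution, values below 1 dampen it, which is one reason rank and alpha are usually tuned together.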

The project structure includes dedicated Python scripts for training (train_sdxl_lora.py), inference and comparison (inference_sdxl_ghibli.py), and various utility scripts, including those for integration with Hugging Face (hf/hf_inference.py, hf/upload_to_hf.py).

Experimental results indicated that LoRA fine-tuning, despite the constraints of a small dataset, successfully imparted characteristic Ghibli stylistic elements. A visual comparison between images generated by the base model and the LoRA fine-tuned model demonstrated that the latter effectively captured the distinctive color palettes, line flow, and overall atmospheric qualities reminiscent of Ghibli animation. However, inherent limitations of the LoRA method were observed, occasionally resulting in distorted forms of characters or objects. The project also explored the impact of different LoRA rank and alpha combinations on the generated output, although specific quantitative or qualitative analyses for each combination are not detailed in the summary.

The project acknowledges several limitations: training may be insufficient due to restricted computing resources, and the small dataset size (100 images) inherently limits the model's generalization capabilities. The results are presented solely for educational and research reference. The fine-tuned LoRA model is made available on Hugging Face.