Chain-of-Zoom
Paper


2025.06.08
· Web · by Anonymous
#Super-Resolution #AI #VLM #RLHF #Autoregression

Key Points

  1. Chain-of-Zoom (CoZ) is a novel, model-agnostic framework that enables extreme super-resolution (e.g., 256x) by autoregressively chaining a standard SR backbone, effectively decomposing the problem into tractable sub-problems.
  2. To overcome diminishing visual cues at high magnifications, CoZ augments each zoom step with multi-scale-aware text prompts generated by a Vision-Language Model (VLM).
  3. The prompt-extraction VLM is further fine-tuned using Group Relative Policy Optimization (GRPO) with a critic VLM and specific penalties to align the generated text guidance with human preferences.

Chain-of-Zoom (CoZ) is a novel, model-agnostic framework designed to overcome the scalability limitations of single-image super-resolution (SISR) models, enabling them to achieve extreme magnifications (e.g., beyond 256x) far exceeding their original training scale factors. Conventional SISR models, when pushed beyond their trained magnification (e.g., 4x), tend to produce blurry images and artifacts due to the diminishing visual cues at higher resolutions.

CoZ addresses this by factorizing the extreme super-resolution task into an autoregressive chain of intermediate scale-states. Instead of performing a single large upscale, CoZ repeatedly re-uses a pre-trained backbone SR model in an iterative prompt-and-upscale cycle. This decomposes the intractable conditional probability P(HR | LR_extreme) into a series of tractable sub-problems:
P(HR_N | LR_0) = P(HR_N | HR_{N-1}) × P(HR_{N-1} | HR_{N-2}) × ⋯ × P(HR_1 | LR_0)
Each step in this chain involves taking the current high-resolution output as the low-resolution input for the next step, gradually building up to the desired extreme magnification.
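The prompt-and-upscale cycle can be sketched as a simple loop. This is an illustrative skeleton only: `vlm_prompt` and `sr_step` are hypothetical placeholders standing in for the VLM prompt extractor and the pre-trained 4x SR backbone, neither of which is specified in code here; images are modeled as plain dicts tracking the cumulative scale.

```python
# Hypothetical sketch of the Chain-of-Zoom loop. `vlm_prompt` and
# `sr_step` are stand-ins for the VLM captioner and the pre-trained
# 4x SR backbone; an "image" here is just a dict tracking scale.

def vlm_prompt(image):
    """Placeholder for the multi-scale-aware VLM prompt extractor."""
    return f"description of the scene at {image['scale']}x magnification"

def sr_step(image, prompt, factor=4):
    """Placeholder for one pass of the pre-trained SR backbone,
    conditioned on the current image and its text prompt."""
    return {"scale": image["scale"] * factor, "prompt": prompt}

def chain_of_zoom(lr_image, steps=4):
    """Factorize one extreme upscale into `steps` tractable sub-problems:
    each iteration realizes one factor P(HR_n | HR_{n-1}, prompt)."""
    x = lr_image
    for _ in range(steps):
        prompt = vlm_prompt(x)   # describe the current scale-state
        x = sr_step(x, prompt)   # each output becomes the next input
    return x

out = chain_of_zoom({"scale": 1, "prompt": None}, steps=4)
print(out["scale"])  # 4 chained 4x steps -> 256x total magnification
```

Chaining four 4x steps this way reaches 256x without ever asking the backbone to exceed the scale factor it was trained for.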

A critical component of CoZ is the integration of multi-scale-aware text prompts. As the magnification increases and visual information becomes sparser, these semantic prompts, generated by a Vision-Language Model (VLM), provide crucial contextual guidance to the SR backbone. For each iterative upscale step, a VLM analyzes the current image (which acts as the low-resolution input for that step) and generates a descriptive text prompt. This prompt, along with the image, is then fed to the SR backbone, guiding it to produce a more semantically consistent and visually sharp higher-resolution output.

To ensure the generated text prompts are concise, relevant, and aligned with human preferences, the prompt-extraction VLM itself is fine-tuned through a novel Reinforcement Learning from Human Feedback (RLHF) pipeline built on Group Relative Policy Optimization (GRPO). In this setup, a critic VLM provides the reward signal, scoring the semantic quality and accuracy of the generated prompts. Additionally, specific penalty terms are introduced during GRPO training:

  1. Phrase-exclusion reward: This encourages the VLM to avoid generating predefined undesirable or inaccurate phrases.
  2. Repetition penalty: This penalizes redundant or repetitive words and phrases, promoting conciseness.
Through this GRPO-based fine-tuning, the prompt-extraction VLM learns to produce high-quality, actionable text guidance that prevents hallucinations and preserves semantic fidelity even at very high magnifications.
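The reward shaping described above can be illustrated with a minimal sketch. The banned-phrase list, weights, and critic score below are assumptions for illustration, not values from the paper; the group-relative normalization reflects how GRPO scores each sampled prompt against its sampling group rather than against a learned value baseline.

```python
# Illustrative GRPO-style reward shaping for the prompt-extraction VLM.
# The critic score, banned phrases, and weights are hypothetical.

def phrase_exclusion_reward(prompt, banned):
    """1.0 if no predefined undesirable phrase appears, else 0.0."""
    text = prompt.lower()
    return 0.0 if any(p in text for p in banned) else 1.0

def repetition_penalty(prompt):
    """Fraction of tokens that are duplicates, penalizing redundancy."""
    words = prompt.lower().split()
    return (len(words) - len(set(words))) / max(len(words), 1)

def total_reward(prompt, critic_score, banned, w_excl=0.5, w_rep=0.5):
    """Combine the critic VLM's score with both shaping terms
    (the weights here are assumed, not taken from the paper)."""
    return (critic_score
            + w_excl * phrase_exclusion_reward(prompt, banned)
            - w_rep * repetition_penalty(prompt))

def group_relative_advantages(rewards):
    """GRPO normalizes each sampled prompt's reward within its group."""
    mu = sum(rewards) / len(rewards)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if sd == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sd for r in rewards]

banned = ["blurry image", "low quality"]   # assumed exclusion list
print(total_reward("a sharp close-up of feathers", 0.8, banned))
```

In training, a group of candidate prompts would be sampled per image, scored with `total_reward`, and updated in proportion to their group-relative advantages, pushing the VLM toward concise, non-repetitive, critic-preferred descriptions.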

Experimental results demonstrate that a standard 4x diffusion-based SR model, when integrated into the CoZ framework, can effectively achieve magnifications beyond 256x, yielding images with superior perceptual quality and fidelity compared to direct one-step SR or nearest-neighbor interpolation. The effectiveness of the GRPO fine-tuning is validated by the convergence of the phrase-exclusion reward and repetition penalty to their desired values, along with a gradual increase in the critic reward, indicating improved prompt quality. Human preference studies (Mean-Opinion-Score tests) further confirm that GRPO-aligned VLM prompts lead to more human-preferred image and text generations.