How to Use Qwen-Image-2512

This guide explains what makes Qwen-Image-2512 stand out, what it’s best used for, and the exact settings/prompts worth copying for consistent results.

Qwen-Image-2512: quick facts (open-source + practical)

Type: text-to-image (Qwen-Image family)
Architecture / size: Qwen-Image is described in ComfyUI docs as a 20B-parameter MMDiT (Multimodal Diffusion Transformer) foundation model; Qwen-Image-2512 is the newer December checkpoint in the same model family.
License: Apache 2.0 (Qwen-Image repository)
What “2512” means: December checkpoint/update focused on realism + detail + text rendering
Best-known strengths:
- Human realism (less “AI face”)
- Natural texture detail (fur, foliage, water/mist)
- Text rendering + layout (posters, slides, infographics)

Official links:

Model card: https://huggingface.co/Qwen/Qwen-Image-2512
Online demo (Space): https://huggingface.co/spaces/Qwen/Qwen-Image-2512
GitHub (Qwen-Image family): https://github.com/QwenLM/Qwen-Image

Artifox does not support Qwen-Image-2512 yet.

For now, this guide points you to the official online demo. If you want a clean, template-first workflow for deliverable assets (covers/posters/product visuals) without setting up environments, you can still use Image Creation. We’ll update this guide when we add Qwen-Image-2512.

Want a clean workflow while you test Qwen-Image-2512?

Use Hugging Face for quick testing, and use templates + disciplined iteration in Image Creation for repeatable assets.


What’s new in Qwen-Image-2512 (vs the August base release)

Based on the official model card and Qwen-Image repo notes, the update is mainly about:

Enhanced human realism: richer facial detail, more natural hair strands, and better “everyday photo” context.
Finer natural detail: better texture fidelity for landscapes and animals.
Improved text rendering: stronger layout and more faithful text+image composition.

If you’re a designer or marketer, this is the key point: Qwen-Image-2512 is one of the few open models that can make text inside an image feel like part of the composition rather than a messy overlay.

Best ways to use it (choose your path)

Option A: Try online first (fastest)

Use the official Space:

https://huggingface.co/spaces/Qwen/Qwen-Image-2512

When you test, don’t start with a long prompt. Start small, lock the direction, then iterate.

Option B: Run locally with Diffusers (repeatable + controllable)

The Qwen-Image repo recommends:

transformers >= 4.51.3 (for Qwen2.5-VL support)
Latest Diffusers (installed from GitHub)

Install:

pip install git+https://github.com/huggingface/diffusers
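
After installing, it can help to confirm that the environment meets the transformers minimum quoted above. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `installed_version`) are illustrative and not part of any Qwen or Hugging Face tooling, and the comparison looks only at the leading numeric part of the version string.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.51.3"  # minimum quoted from the Qwen-Image repo notes


def version_tuple(version: str) -> tuple:
    """Turn '4.51.3' or '4.52.0.dev0' into a tuple of its leading integers.

    Suffixes such as '.dev0' are ignored, so this is a rough check only.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("transformers")
    if found is None:
        print("transformers is not installed")
    elif meets_minimum(found):
        print(f"transformers {found} is new enough")
    else:
        print(f"transformers {found} is older than {MIN_TRANSFORMERS}")
```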

About weights and formats:

The official Qwen-Image family is published by Qwen (Hugging Face / GitHub).
In the ComfyUI ecosystem you may also see community distributions (e.g. bf16 / fp8 weights, and distilled variants with fewer steps). Treat those as community artifacts unless clearly marked official.

Minimal generation code (aligned with the official repo example):

from diffusers import QwenImagePipeline
import torch

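# Use bfloat16 on a CUDA GPU; fall back to float32 on CPU (much slower).
if torch.cuda.is_available():
    torch_dtype = torch.bfloat16
    device = "cuda"
else:
    torch_dtype = torch.float32
    device = "cpu"

# Download (or load from cache) the 2512 checkpoint and move it to the chosen device.
pipe = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image-2512", torch_dtype=torch_dtype).to(device)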

aspect_ratios = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1104),
    "3:4": (1104, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}

prompt = "Portrait photo, natural skin texture, soft indoor lighting, shallow depth of field"
negative_prompt = "low resolution, low quality, deformed limbs, deformed fingers, oversaturated, waxy skin, blurry face, messy composition, blurry or distorted text"

width, height = aspect_ratios["4:3"]

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.Generator(device=device).manual_seed(42),
).images[0]

image.save("qwen-image-2512.png")
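
If your target size is not one of the presets in the example above (say, a 1920×1080 banner), one option is to generate at the preset with the nearest ratio and resize afterwards. The sketch below assumes that approach; `closest_preset` is a hypothetical helper written for this guide, not something the Qwen-Image repo provides.

```python
import math

# Same presets as the generation example: name -> (width, height).
ASPECT_RATIOS = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1104),
    "3:4": (1104, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}


def closest_preset(target_width: int, target_height: int) -> str:
    """Return the preset name whose width/height ratio is nearest the target.

    Ratios are compared on a log scale so that landscape and portrait
    mismatches are treated symmetrically.
    """
    if target_width <= 0 or target_height <= 0:
        raise ValueError("target dimensions must be positive")
    target = math.log(target_width / target_height)

    def distance(name: str) -> float:
        width, height = ASPECT_RATIOS[name]
        return abs(math.log(width / height) - target)

    return min(ASPECT_RATIOS, key=distance)


name = closest_preset(1920, 1080)
width, height = ASPECT_RATIOS[name]
print(name, width, height)
```

The resulting `width` and `height` can then be passed to the pipeline call in place of the hard-coded "4:3" lookup.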

The parameters that actually matter (start here)

num_inference_steps
- More steps usually give cleaner output but take longer.
- Start with 40–60.
true_cfg_scale
- Higher values follow the prompt more aggressively.
- Values that are too high can make images look unnatural.
- Start around 3.5–5.0 (the official example uses 4.0).
width/height
- Use the provided aspect ratio presets first.
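
To compare these settings methodically, you can hold the prompt and seed fixed and vary only steps and guidance. The sketch below just builds the list of runs; the `sweep_settings` helper and the filename scheme are assumptions made for illustration, and the chosen values simply span the starting ranges suggested above.

```python
from itertools import product


def sweep_settings(steps_options, cfg_options, seed=42):
    """Build one settings dict per (steps, cfg) pair with a descriptive filename."""
    runs = []
    for steps, cfg in product(steps_options, cfg_options):
        runs.append({
            "num_inference_steps": steps,
            "true_cfg_scale": cfg,
            "seed": seed,
            # Hypothetical naming scheme so outputs sort and compare easily.
            "filename": f"qwen2512_steps{steps}_cfg{cfg:.1f}_seed{seed}.png",
        })
    return runs


runs = sweep_settings(steps_options=[40, 50, 60], cfg_options=[3.5, 4.0, 5.0])
for run in runs:
    print(run["filename"])
```

Each dict's `num_inference_steps` and `true_cfg_scale` can be passed to the pipeline call shown earlier, with the generator seeded from `seed`, so that any difference between images comes from the two parameters under test.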

Prompt enhancing tools (recommended by the repo)

The Qwen-Image repo explicitly recommends using its prompt enhancing utilities for 2512:

src/examples/tools/prompt_utils_2512.py

If you plan to generate a lot of production assets, it’s worth checking, because consistent prompt scaffolding is part of how you keep outputs stable.

Where Qwen-Image-2512 shines (use cases)

1) Photorealistic people (less “AI face”)

Use camera language and “everyday photo” constraints.

Prompt (copy/paste)

A candid iPhone photo portrait, natural skin texture with subtle imperfections, soft indoor ambient lighting, shallow depth of field, unposed, realistic colors, clean background

Negative prompt

waxy skin, plastic face, overly smooth, airbrushed, extra fingers, deformed hands

2) Nature texture (water/mist/foliage)

Prompt

A turquoise river winding through a lush canyon, detailed moss and dense ferns, multiple waterfalls with fine mist, midday sunlight filtering through canopy, no humans, no text, photorealistic

3) Fur / material details (pets + product textures)

Prompt

Ultra-realistic close-up photo of a golden retriever outdoors, fur strands clearly separated, soft daylight, sharp moist eyes, gentle bokeh background

4) Text rendering: posters, slides, infographics

This is where Qwen-Image-2512 is unusually strong for an open model.

Poster (short headline)

A minimal poster design about (topic). Clean composition. Leave a large empty area for the headline. High contrast. Add a short headline in clear sans-serif font: "(3–6 words)"

Slide / roadmap timeline

A modern tech slide, dark blue gradient background. Title at top center: "Qwen-Image-2512". A glowing horizontal timeline with 3 nodes. Each node connects to a rounded rectangle label with clear white text. Clean spacing, consistent alignment, high readability.
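
If you produce many posters, the poster template above can be filled in programmatically so the headline stays within the 3–6 word range. This is a minimal sketch; `build_poster_prompt` and its validation rules are assumptions for illustration, not part of the Qwen-Image tooling.

```python
POSTER_TEMPLATE = (
    "A minimal poster design about {topic}. Clean composition. "
    "Leave a large empty area for the headline. High contrast. "
    'Add a short headline in clear sans-serif font: "{headline}"'
)


def build_poster_prompt(topic: str, headline: str,
                        min_words: int = 3, max_words: int = 6) -> str:
    """Fill the poster template, rejecting headlines outside the word range."""
    headline = headline.strip()
    word_count = len(headline.split())
    if not min_words <= word_count <= max_words:
        raise ValueError(
            f"headline has {word_count} words; expected {min_words}-{max_words}"
        )
    if '"' in headline:
        # The template wraps the headline in double quotes already.
        raise ValueError("headline must not contain double quotes")
    return POSTER_TEMPLATE.format(topic=topic.strip(), headline=headline)


prompt = build_poster_prompt("urban cycling", "Ride the City Today")
print(prompt)
```

Rejecting long headlines up front is a design choice that follows the guide's advice to keep in-image copy short; loosen `max_words` if your tests show longer text rendering reliably.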

Official English example prompts (copy/paste)

The following prompts are adapted from the official Qwen-Image-2512 model card showcase. They’re long on purpose: great for stress-testing realism, instruction following, and environment detail.

Human realism: close-up selfie in a dorm

A Chinese female college student, around 20 years old, with a very short haircut that conveys a gentle, artistic vibe. Her hair naturally falls to partially cover her cheeks, projecting a tomboyish yet charming demeanor. She has cool-toned fair skin and delicate features, with a slightly shy yet subtly confident expression—her mouth crooked in a playful, youthful smirk. She wears an off-shoulder top, revealing one shoulder, with a well-proportioned figure. The image is framed as a close-up selfie: she dominates the foreground, while the background clearly shows her dormitory—a neatly made bed with white linens on the top bunk, a tidy study desk with organized stationery, and wooden cabinets and drawers. The photo is captured on a smartphone under soft, even ambient lighting, with natural tones, high clarity, and a bright, lively atmosphere full of youthful, everyday energy.

Human realism: “iPhone snapshot” at an anime convention

A 20-year-old East Asian girl with delicate, charming features and large, bright brown eyes—expressive and lively, with a cheerful or subtly smiling expression. Her naturally wavy long hair is either loose or tied in twin ponytails. She has fair skin and light makeup accentuating her youthful freshness. She wears a modern, cute dress or relaxed outfit in bright, soft colors—lightweight fabric, minimalist cut. She stands indoors at an anime convention, surrounded by banners, posters, or stalls. Lighting is typical indoor illumination—no staged lighting—and the image resembles a casual iPhone snapshot: unpretentious composition, yet brimming with vivid, fresh, youthful charm.

Instruction following: teenager leaning slightly forward

An East Asian teenage boy, aged 15–18, with soft, fluffy black short hair and refined facial contours. His large, warm brown eyes sparkle with energy. His fair skin and sunny, open smile convey an approachable, friendly demeanor—no makeup or blemishes. He wears a blue-and-white summer uniform shirt, slightly unbuttoned, made of thin breathable fabric, with black headphones hanging around his neck. His hands are in his pockets, body leaning slightly forward in a relaxed pose, as if engaged in conversation. Behind him lies a summer school playground: lush green grass and a red rubber track in the foreground, blurred school buildings in the distance, a clear blue sky with fluffy white clouds. The bright, airy lighting evokes a joyful, carefree adolescent atmosphere.

Age realism: elderly couple in a tidy home kitchen

An elderly Chinese couple in their 70s in a clean, organized home kitchen. The woman has a kind face and a warm smile, wearing a patterned apron; the man stands behind her, also smiling, as they both gaze at a steaming pot of buns on the stove. The kitchen is bright and tidy, exuding warmth and harmony. The scene is captured with a wide-angle lens to fully show the subjects and their surroundings.

Nature texture: canyon river + moss + waterfalls

A turquoise river winds through a lush canyon. Thick moss and dense ferns blanket the rocky walls; multiple waterfalls cascade from above, enveloped in mist. At noon, sunlight filters through the dense canopy, dappling the river surface with shimmering light. The atmosphere is humid and fresh, pulsing with primal jungle vitality. No humans, text, or artificial traces present.

Fur detail: golden retriever close-up

An ultra-realistic close-up of a golden retriever outdoors under soft daylight. Hair is exquisitely detailed: strands distinct, color transitioning naturally from warm gold to light cream, light glinting delicately at the tips; a gentle breeze adds subtle volume. Undercoat is soft and dense; guard hairs are long and well-defined, with visible layering. Eyes are moist, expressive; nose is slightly damp with fine specular highlights. Background is softly blurred to emphasize the dog’s tangible texture and vivid expression.

Rugged wildlife texture: male argali sheep

A male argali stands atop a barren, rocky mountainside. Its coarse, dense grey-brown coat covers a powerful, muscular body. Most striking are its massive, thick, outward-spiraling horns—a symbol of wild strength. Its gaze is alert and sharp. The background reveals steep alpine terrain: jagged peaks, sparse low vegetation, and abundant sunlight—conveying the harsh yet majestic wilderness and the animal’s resilient vitality.

For readable text:

Layout first, copy second
Short copy (headlines) works much better than long paragraphs
Keep the background clean and high-contrast

Limitations (so you don’t waste runs)

Long paragraphs inside an image are still hard for most models.
Tiny fonts + dense tables will distort.
Hands and complex poses can still fail. Start with a close-up or half-body framing.

FAQ

Is Qwen-Image-2512 open-source? What’s the license?

It’s released publicly on Hugging Face, and the Qwen-Image family repository states the license is Apache 2.0.

Where can I try it online?

Official Space:

https://huggingface.co/spaces/Qwen/Qwen-Image-2512

What’s the “best first run” setup?

Start with 4:3 or 1:1
num_inference_steps = 50
true_cfg_scale = 4.0
Use a short prompt + a simple negative prompt

Can I use it inside Artifox?

Not yet. Use the official online demo for now. We’ll update the guide when support lands.

Turn tests into a repeatable workflow

Templates plus disciplined iteration are the fastest way to get consistent, deliverable results.


Key Takeaways

Qwen-Image-2512 is an open text-to-image model update focused on realism, texture detail, and text rendering
Online demo: https://huggingface.co/spaces/Qwen/Qwen-Image-2512
Local Diffusers: QwenImagePipeline + bfloat16 (CUDA) + num_inference_steps ~ 50 + true_cfg_scale ~ 4.0
Best results: camera language, layout-first prompts for text, and short copy