Qwen-Image is a powerful, open source new AI image generator

After seizing the summer with a blitz of powerful, freely available new open source language and coding-focused AI models that matched or in some cases bested closed-source/proprietary U.S. rivals, Alibaba’s crack “Qwen Team” of AI researchers is back again today with the release of a highly ranked new AI image generator model that is also open source.

Qwen-Image stands out in a crowded field of generative image models due to its emphasis on rendering text accurately within visuals, an area where many rivals still struggle.

Supporting both alphabetic and logographic scripts, the model is particularly adept at managing complex typography, multi-line layouts, paragraph-level semantics, and bilingual content (e.g., English-Chinese).

In practice, this allows users to generate content like movie posters, presentation slides, storefront scenes, handwritten poetry, and stylized infographics, with crisp text that aligns with their prompts.

Qwen-Image’s output examples include a wide variety of real-world use cases:

Marketing & Branding: Bilingual posters with brand logos, stylistic calligraphy, and consistent design motifs
Presentation Design: Layout-aware slide decks with title hierarchies and theme-appropriate visuals
Education: Generation of classroom materials featuring diagrams and precisely rendered instructional text
Retail & E-commerce: Storefront scenes where product labels, signage, and environmental context must all be readable
Creative Content: Handwritten poetry, scene narratives, anime-style illustration with embedded story text

Users can interact with the model on the Qwen Chat website by selecting “Image Generation” mode from the buttons below the prompt entry field.

However, my brief initial tests revealed that the text and prompt adherence were not noticeably better than Midjourney, the popular proprietary AI image generator from the U.S. company of the same name. My session through Qwen Chat produced multiple errors in prompt comprehension and text fidelity, much to my disappointment, even after repeated attempts and prompt rewording.

Yet Midjourney only offers a limited number of free generations and requires subscriptions for any more, compared to Qwen-Image, which, thanks to its open source licensing and weights posted on Hugging Face, can be adopted by any enterprise or third-party provider free of charge.

Licensing and availability

Qwen-Image is distributed under the Apache 2.0 license, allowing commercial and non-commercial use, redistribution, and modification, though attribution and inclusion of the license text are required for derivative works.

This may make it attractive to enterprises looking for an open source image generation tool to use for making internal or external-facing collateral like flyers, ads, notices, newsletters, and other digital communications.

But the fact that the model’s training data remains a tightly guarded secret, as with most other leading AI image generators, may sour some enterprises on the idea of using it.

Qwen, unlike Adobe Firefly or OpenAI’s GPT-4o native image generation, for example, doesn’t offer indemnification for commercial uses of its product (i.e., if a user gets sued for copyright infringement, Adobe and OpenAI will help support them in court).

The model and associated assets, including demo notebooks, evaluation tools, and fine-tuning scripts, are available through multiple repositories.

In addition, a live evaluation portal called AI Arena allows users to compare image generations in pairwise rounds, contributing to a public Elo-style leaderboard.

Training and development

Behind Qwen-Image’s performance is an extensive training process grounded in progressive learning, multi-modal task alignment, and aggressive data curation, according to the technical paper the research team released today.

The training corpus includes billions of image-text pairs sourced from four domains: natural imagery, human portraits, artistic and design content (such as posters and UI layouts), and synthetic text-focused data. The Qwen Team did not specify the size of the training data corpus, aside from “billions of image-text pairs.” They did provide a breakdown of the rough percentage of each category of content it included:

Nature: ~55%
Design (UI, posters, art): ~27%
People (portraits, human activity): ~13%
Synthetic text rendering data: ~5%
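
To make the reported mix concrete, here is a minimal Python sketch that treats the four approximate shares as sampling weights. The names `DATA_MIX`, `sample_categories`, and `pairs_per_category` are hypothetical, and the corpus size passed in is a placeholder, since the team gave no exact figure.

```python
import random

# Approximate category shares reported in the technical paper (percent).
DATA_MIX = {
    "nature": 55,
    "design": 27,
    "people": 13,
    "synthetic_text": 5,
}


def sample_categories(n, seed=0):
    """Draw n category labels with probability proportional to DATA_MIX."""
    rng = random.Random(seed)
    names = list(DATA_MIX)
    weights = [DATA_MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def pairs_per_category(total_pairs):
    """Split a hypothetical corpus size according to the reported shares."""
    return {name: total_pairs * share // 100 for name, share in DATA_MIX.items()}
```

For example, `pairs_per_category(1000)` assigns 550 pairs to nature and 50 to synthetic text rendering, which shows how small the synthetic slice is relative to the rest.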

Notably, Qwen emphasizes that all synthetic data was generated in-house, and no images created by other AI models were used. Despite the detailed curation and filtering stages described, the documentation does not clarify whether any of the data was licensed or drawn from public or proprietary datasets.

Unlike many generative models that exclude synthetic text due to noise risks, Qwen-Image uses tightly controlled synthetic rendering pipelines to improve character coverage, especially for low-frequency characters in Chinese.

A curriculum-style strategy is employed: the model starts with simple captioned images and non-text content, then advances to layout-sensitive text scenarios, mixed-language rendering, and dense paragraphs. This gradual exposure is shown to help the model generalize across scripts and formatting types.
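
The staged progression can be pictured as a simple schedule keyed to training progress. The Python sketch below is illustrative only: the stage order follows the description above, but the `Stage` class, the `stage_for_step` helper, and every boundary fraction are assumptions, not values from the technical paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    start_fraction: float  # hypothetical point in training where the stage begins


# Stage order follows the article; the boundaries are illustrative assumptions.
CURRICULUM = [
    Stage("simple captions and non-text images", 0.0),
    Stage("layout-sensitive text", 0.4),
    Stage("mixed-language rendering", 0.7),
    Stage("dense paragraphs", 0.9),
]


def stage_for_step(step, total_steps):
    """Return the latest stage whose start point has been reached."""
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    progress = step / total_steps
    current = CURRICULUM[0]
    for stage in CURRICULUM:
        if progress >= stage.start_fraction:
            current = stage
    return current
```

A data loader could call `stage_for_step` each step to decide which pool of examples to draw from, so harder text layouts only appear once the easier ones have been seen.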

Qwen-Image integrates three key modules:

Qwen2.5-VL, the multimodal language model, extracts contextual meaning and guides generation through system prompts.
VAE Encoder/Decoder, trained on high-resolution documents and real-world layouts, handles detailed visual representations, especially small or dense text.
MMDiT, the diffusion model backbone, coordinates joint learning across image and text modalities. A novel MSRoPE (Multimodal Scalable Rotary Positional Encoding) system improves spatial alignment between tokens.
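
The details of MSRoPE are not described here, so the Python sketch below shows only the standard one-dimensional rotary positional encoding it presumably builds on, not the Qwen Team’s multimodal variant. The function names `rope` and `dot` are hypothetical. The sketch shows that after rotation, the dot product between two vectors depends only on how far apart their positions are.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply standard 1D rotary encoding to an even-length vector.

    Each consecutive pair of features is rotated by an angle that grows
    with the position and shrinks with the pair index.
    """
    if len(vec) % 2:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Encoding a query at position 3 and a key at position 7 gives the same dot product as encoding them at positions 10 and 14, because both pairs are four positions apart. That relative-position property is what makes rotary schemes attractive for aligning tokens.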

Together, these components allow Qwen-Image to operate effectively in tasks that involve image understanding, generation, and precise editing.

Performance benchmarks

Qwen-Image was evaluated against several public benchmarks:

GenEval and DPG for prompt-following and object attribute consistency
OneIG-Bench and TIIF for compositional reasoning and layout fidelity
CVTG-2K, ChineseWord, and LongText-Bench for text rendering, especially in multilingual contexts

In nearly every case, Qwen-Image either matches or surpasses existing closed-source models like GPT Image 1 [High], Seedream 3.0, and FLUX.1 Kontext [Pro]. Notably, its performance on Chinese text rendering was significantly better than all compared systems.

On the public AI Arena leaderboard, which is based on 10,000+ human pairwise comparisons, Qwen-Image ranks third overall and is the top open-source model.
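
For readers unfamiliar with how pairwise votes become a ranking, here is a minimal Python sketch of the classic Elo update. It is not the leaderboard’s actual scoring code; the starting rating of 1000, the K-factor of 32, and the function names are all assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_won, k=32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


def run_leaderboard(votes, initial=1000.0, k=32.0):
    """Process (winner, loser) votes in order and return final ratings."""
    ratings = {}
    for winner, loser in votes:
        ra = ratings.get(winner, initial)
        rb = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = update_elo(ra, rb, True, k)
    return ratings
```

Each vote moves points from the loser to the winner, and an upset against a higher-rated model moves more points than an expected win, which is why a large number of comparisons is needed before a ranking settles.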

Implications for enterprise technical decision-makers

For enterprise AI teams managing complex multimodal workflows, Qwen-Image introduces several functional advantages that align with the operational needs of different roles.

Those managing the lifecycle of vision-language models, from training to deployment, will find value in Qwen-Image’s consistent output quality and its integration-ready components. The open-source nature reduces licensing costs, while the modular architecture (Qwen2.5-VL + VAE + MMDiT) facilitates adaptation to custom datasets or fine-tuning for domain-specific outputs.

The curriculum-style training data and clear benchmark results help teams evaluate fitness for purpose. Whether deploying marketing visuals, document renderings, or e-commerce product graphics, Qwen-Image allows rapid experimentation without proprietary constraints.

Engineers tasked with building AI pipelines or deploying models across distributed systems will appreciate the detailed infrastructure documentation. The model has been trained using a Producer-Consumer architecture, supports scalable multi-resolution processing (256p to 1328p), and is built to run with Megatron-LM and tensor parallelism. This makes Qwen-Image a candidate for deployment in hybrid cloud environments where reliability and throughput matter.
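
As a rough illustration of the producer-consumer pattern in general, the Python sketch below decouples data preparation from a downstream worker with a bounded queue. It is a toy, single-machine example and not the Qwen Team’s training infrastructure; the caption filter and the uppercase step are stand-ins for real filtering and training work, and all names are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def producer(samples, out_queue):
    """Filter raw samples and hand the valid ones to the consumer."""
    for sample in samples:
        if sample.get("caption"):  # stand-in for real data filtering
            out_queue.put(sample)
    out_queue.put(SENTINEL)


def consumer(in_queue, results):
    """Pull samples until the sentinel arrives and process each one."""
    while True:
        sample = in_queue.get()
        if sample is SENTINEL:
            break
        results.append(sample["caption"].upper())  # stand-in for a training step


def run_pipeline(samples, max_buffered=4):
    """Run one producer and one consumer connected by a bounded queue."""
    buffer = queue.Queue(maxsize=max_buffered)
    results = []
    worker = threading.Thread(target=consumer, args=(buffer, results))
    worker.start()
    producer(samples, buffer)
    worker.join()
    return results
```

The bounded queue is the key design choice: if the consumer falls behind, the producer blocks instead of piling up unprocessed data in memory, and either side can be scaled independently.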

Moreover, support for image-to-image editing workflows (TI2I) and task-specific prompts enables its use in real-time or interactive applications.

Professionals focused on data ingestion, validation, and transformation can use Qwen-Image as a tool to generate synthetic datasets for training or augmenting computer vision models. Its ability to generate high-resolution images with embedded, multilingual annotations can improve performance in downstream OCR, object detection, or layout parsing tasks.

Since Qwen-Image was also trained to avoid artifacts like QR codes, distorted text, and watermarks, it offers higher-quality synthetic input than many public models, helping enterprise teams preserve training set integrity.

Seeking feedback and opportunities to collaborate

The Qwen Team emphasizes openness and community collaboration in the model’s release.

Developers are encouraged to test and fine-tune Qwen-Image, offer pull requests, and participate in the evaluation leaderboard. Feedback on text rendering, editing fidelity, and multilingual use cases will shape future iterations.

With a stated goal to “lower the technical barriers to visual content creation,” the team hopes Qwen-Image will serve not just as a model, but as a foundation for further research and practical deployment across industries.
