MarkTechPost@AI, August 19
Qwen Team Introduces Qwen-Image-Edit: The Image Editing Version of Qwen-Image with Advanced Capabilities for Semantic and Appearance Editing

Alibaba's Qwen team has released Qwen-Image-Edit, an upgraded version of Qwen-Image focused on instruction-driven image editing. Built on the 20B-parameter Qwen-Image, the model excels at semantic editing (e.g., style transfer and novel view synthesis) and appearance editing (e.g., precise object modification), while inheriting Qwen-Image's strong English and Chinese text rendering. Combined with Qwen Chat, it greatly lowers the barrier to professional content creation, serving scenarios from IP design to error correction. Qwen-Image-Edit adopts an enhanced MMDiT architecture with a dual-encoding scheme that balances semantic consistency and visual fidelity, and achieves leading results on multiple benchmarks.

🌟 **Powerful semantic and appearance editing**: Qwen-Image-Edit supports not only low-level appearance edits, such as adding, removing, or modifying local image details without disturbing surrounding regions, but also high-level semantic edits, including IP creation, object rotation, and style transfer, preserving semantic consistency even as pixels change.

💬 **Precise bilingual text editing**: The model excels at editing text within images, accurately adding, removing, or modifying English and Chinese text while preserving the original font, size, and style, greatly improving the editability and flexibility of image content.

🚀 **Leading benchmark performance**: Qwen-Image-Edit achieves state-of-the-art results on multiple public image-editing benchmarks, including GEdit-Bench-EN/CN and ImgEdit, performing strongly on tasks such as object replacement and style change and demonstrating its strength as a foundation model.

💡 **Innovative architecture and training strategy**: The model uses an enhanced MMDiT architecture that balances semantic features and visual detail through dual encoding, and introduces MSRoPE to distinguish different image frames. Its multi-task training paradigm and seven-stage data-filtering pipeline, combined with advanced training techniques, ensure efficiency and robustness.

🌐 **Easy deployment and broad applicability**: Qwen-Image-Edit can be deployed via Hugging Face Diffusers, with API access through Alibaba Cloud Model Studio. Its Apache 2.0 license and open-source code make it highly accessible to developers, pointing to a broad future for AI-driven design.

In the domain of multimodal AI, instruction-based image editing models are transforming how users interact with visual content. Just released in August 2025 by Alibaba’s Qwen Team, Qwen-Image-Edit builds on the 20B-parameter Qwen-Image foundation to deliver advanced editing capabilities. This model excels in semantic editing (e.g., style transfer and novel view synthesis) and appearance editing (e.g., precise object modifications), while preserving Qwen-Image’s strength in complex text rendering for both English and Chinese. Integrated with Qwen Chat and available via Hugging Face, it lowers barriers for professional content creation, from IP design to error correction in generated artwork.

Architecture and Key Innovations

Qwen-Image-Edit extends the Multimodal Diffusion Transformer (MMDiT) architecture of Qwen-Image, which comprises a Qwen2.5-VL multimodal large language model (MLLM) for text conditioning, a Variational AutoEncoder (VAE) for image tokenization, and the MMDiT backbone for joint modeling. For editing, it introduces dual encoding: the input image is processed by Qwen2.5-VL for high-level semantic features and the VAE for low-level reconstructive details, concatenated in the MMDiT’s image stream. This enables balanced semantic coherence (e.g., maintaining object identity during pose changes) and visual fidelity (e.g., preserving unmodified regions).
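The dual-encoding idea can be sketched as follows. The shapes and feature dimensions below are illustrative assumptions, not the model's actual sizes; only the structure (two encodings of the same image, concatenated into one image stream) comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative token shapes (assumed, not the model's real dimensions):
# the same input image is encoded twice, and both token streams are
# concatenated along the sequence axis before entering the MMDiT image stream.
semantic_tokens = rng.standard_normal((256, 1024))   # from Qwen2.5-VL: high-level semantics
vae_tokens = rng.standard_normal((1024, 1024))       # from the VAE: low-level appearance

# Concatenate along the sequence dimension so MMDiT attends over both views,
# balancing semantic coherence against pixel-level fidelity.
image_stream = np.concatenate([semantic_tokens, vae_tokens], axis=0)
print(image_stream.shape)  # (1280, 1024)
```

The key design point is that neither encoding alone suffices: the MLLM features preserve identity under large changes, while the VAE features keep untouched regions pixel-accurate.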

The Multimodal Scalable RoPE (MSRoPE) positional encoding is augmented with a frame dimension to differentiate pre- and post-edit images, supporting tasks like text-image-to-image (TI2I) editing. The VAE, fine-tuned on text-rich data, achieves superior reconstruction with 33.42 PSNR on general images and 36.63 on text-heavy ones, outperforming FLUX-VAE and SD-3.5-VAE. These enhancements allow Qwen-Image-Edit to handle bilingual text edits while retaining original font, size, and style.
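The frame-axis idea can be sketched as follows. The exact (frame, y, x) layout and grid size here are illustrative assumptions; only the added frame dimension that separates pre- and post-edit images comes from the description above.

```python
import numpy as np

def msrope_position_ids(h, w, frame):
    """Illustrative (frame, y, x) position ids for one image's tokens.
    The real MSRoPE layout is an assumption here; only the extra frame
    axis distinguishing pre- and post-edit images matches the text."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    frames = np.full((h, w), frame)
    return np.stack([frames, ys, xs], axis=-1).reshape(-1, 3)

# Frame 0 = pre-edit (input) image, frame 1 = post-edit (target) image.
pre = msrope_position_ids(32, 32, frame=0)
post = msrope_position_ids(32, 32, frame=1)
ids = np.concatenate([pre, post], axis=0)
print(ids.shape)        # (2048, 3)
print(pre[0], post[0])  # same spatial position, different frame index
```

Because the two images share spatial coordinates and differ only in the frame index, the model can align corresponding regions across the edit while still telling input and output apart.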


Training and Data Pipeline

Leveraging Qwen-Image’s curated dataset of billions of image-text pairs across Nature (55%), Design (27%), People (13%), and Synthetic (5%) domains, Qwen-Image-Edit employs a multi-task training paradigm unifying T2I, I2I, and TI2I objectives. A seven-stage filtering pipeline refines data for quality and balance, incorporating synthetic text rendering strategies (Pure, Compositional, Complex) to address long-tail issues in Chinese characters.
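A staged filtering pipeline of this kind can be sketched as a chain of predicates applied in order. The stage names and thresholds below are invented placeholders; the article does not enumerate the actual seven stages.

```python
# Hypothetical stage predicates; the real pipeline's seven stages are not
# named in the article, so these two are illustrative stand-ins.
def resolution_ok(sample):
    return sample["width"] >= 512 and sample["height"] >= 512

def caption_ok(sample):
    return len(sample["caption"].split()) >= 3

STAGES = [resolution_ok, caption_ok]  # ...further stages in the real pipeline

def passes_pipeline(sample):
    # A sample survives only if every stage accepts it, applied in sequence,
    # so each stage sees progressively cleaner data.
    return all(stage(sample) for stage in STAGES)

good = {"width": 1024, "height": 768, "caption": "a red sign on a wall"}
bad = {"width": 256, "height": 256, "caption": "img"}
print(passes_pipeline(good), passes_pipeline(bad))  # True False
```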

Training uses flow matching with a Producer-Consumer framework for scalability, followed by supervised fine-tuning and reinforcement learning (DPO and GRPO) for preference alignment. For editing-specific tasks, it integrates novel view synthesis and depth estimation, using DepthPro as a teacher model. This results in robust performance, such as correcting calligraphy errors through chained edits.
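The flow-matching objective can be sketched as below. This is a simplified rectified-flow-style formulation under a linear interpolation path; the team's exact parameterization and noise schedule are not given in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x1, model, rng):
    """Simplified flow matching: regress the model onto the constant
    velocity of a straight path from noise x0 to data x1 (a sketch;
    the actual formulation used in training is an assumption here)."""
    x0 = rng.standard_normal(x1.shape)       # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1             # point on the linear path
    target_velocity = x1 - x0                # the path's constant velocity
    pred = model(xt, t)                      # network predicts the velocity
    return float(np.mean((pred - target_velocity) ** 2))

# A trivial stand-in "model" that predicts zero velocity everywhere.
zero_model = lambda xt, t: np.zeros_like(xt)
x1 = rng.standard_normal((8, 16))            # a batch of clean latents
loss = flow_matching_loss(x1, zero_model, rng)
print(loss > 0.0)  # True
```

SFT and preference optimization (DPO/GRPO) then refine a model pretrained with this objective toward human-preferred edits.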

Advanced Editing Capabilities

Qwen-Image-Edit shines in semantic editing, enabling IP creation like generating MBTI-themed emojis from a mascot (e.g., Capybara) while preserving character consistency. It supports 180-degree novel view synthesis, rotating objects or scenes with high fidelity, achieving 15.11 PSNR on GSO—surpassing specialized models like CRM. Style transfer transforms portraits into artistic forms, such as Studio Ghibli, maintaining semantic integrity.

For appearance editing, it adds elements like signboards with realistic reflections or removes fine details like hair strands without altering surroundings. Bilingual text editing is precise: changing “Hope” to “Qwen” on posters or correcting Chinese characters in calligraphy via bounding boxes. Chained editing allows iterative corrections, e.g., fixing “稽” step-by-step until accurate.
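The chained-editing workflow is just an iterative loop that feeds each output back in as the next input. The `edit` stub below is a placeholder; in practice each step would invoke the real pipeline (e.g. `QwenImageEditPipeline`) with a correction instruction.

```python
# Illustrative chained-editing loop with a stub editor; `edit` here merely
# records the instructions applied, standing in for a real model call.
def edit(image, instruction):
    return image + [instruction]

def chained_edit(image, instructions):
    # Apply corrections iteratively: each edit's output becomes the
    # next edit's input, so errors can be fixed region by region.
    for instruction in instructions:
        image = edit(image, instruction)
    return image

steps = ["fix the left radical", "fix the right component", "sharpen strokes"]
print(chained_edit([], steps))
```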

Benchmark Results and Evaluations

Qwen-Image-Edit leads editing benchmarks, scoring 7.56 overall on GEdit-Bench-EN and 7.52 on CN, outperforming GPT Image 1 (7.53 EN, 7.30 CN) and FLUX.1 Kontext [Pro] (6.56 EN, 1.23 CN). On ImgEdit, it achieves 4.27 overall, excelling in tasks like object replacement (4.66) and style changes (4.81). Depth estimation yields 0.078 AbsRel on KITTI, competitive with DepthAnything v2.

Human evaluations on AI Arena position its base model third among APIs, with strong text rendering advantages. These metrics highlight its superiority in instruction-following and multilingual fidelity.

Deployment and Practical Usage

Qwen-Image-Edit is deployable via Hugging Face Diffusers:

from diffusers import QwenImageEditPipeline
import torch
from PIL import Image

# Load the pipeline and move it to GPU in bfloat16.
pipeline = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit")
pipeline.to(torch.bfloat16).to("cuda")

image = Image.open("input.png").convert("RGB")
prompt = "Change the rabbit's color to purple, with a flash light background."

# The pipeline returns a list of images; take the first result.
output = pipeline(image=image, prompt=prompt, num_inference_steps=50, true_cfg_scale=4.0).images[0]
output.save("output.png")

Alibaba Cloud’s Model Studio offers API access for scalable inference. Licensed under Apache 2.0, the GitHub repository provides training code.

Future Implications

Qwen-Image-Edit advances vision-language interfaces, enabling seamless content manipulation for creators. Its unified approach to understanding and generation suggests potential extensions to video and 3D, fostering innovative applications in AI-driven design.



