Wed, June 17, 00:51 · Model/API · Model releases · Chinese models · Robotics & embodied

Qwen-RobotSuite: Three Embodied AI Models for Vision-Language Manipulation, Video Modeling, and Navigation

Decision Brief

What changed: Qwen-RobotSuite includes three models, RobotManip, RobotWorld, and RobotNav, which target manipulation, video world modeling, and navigation respectively.

Why it matters: Knowing these models helps teams assess their potential for vision-language-action tasks and the technical approaches they introduce.

Who should care: Teams building on model APIs

Affected stack: Qwen

Builder action: Evaluate

Source confidence: Medium · Reliable media or first-hand reporting

Qwen-RobotSuite is a collection of three embodied AI models from the Qwen team. RobotManip is a vision-language-action (VLA) model built on Qwen3.5-4B that targets physical manipulation. RobotWorld is a language-conditioned video world model built around a 60-layer MMDiT that captures properties of video scenes. RobotNav is a navigation model based on Qwen3-VL, available in 2B, 4B, and 8B variants. The source article covers each model's architecture, data pipeline, and benchmark results, and presents them as evidence of strong performance and application potential.
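For teams acting on the "Evaluate" recommendation, the lineup described above can be written down as a small lookup table. The following is only a sketch: the `EmbodiedModel` class, the `SUITE` tuple, and the `models_for` helper are hypothetical names made up for illustration, not part of any Qwen API, and the fields contain only what this brief states.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EmbodiedModel:
    """Hypothetical record for one model in the suite (illustrative only)."""
    name: str
    task: str
    base: str                      # backbone as named in the brief
    variants: Tuple[str, ...] = () # sizes, only where the brief lists them
    notes: str = ""


# Facts below are copied from the brief; nothing else is assumed.
SUITE = (
    EmbodiedModel("RobotManip", "manipulation", "Qwen3.5-4B",
                  notes="vision-language-action (VLA) model"),
    EmbodiedModel("RobotWorld", "video world modeling", "not stated in the brief",
                  notes="language-conditioned, 60-layer MMDiT"),
    EmbodiedModel("RobotNav", "navigation", "Qwen3-VL",
                  variants=("2B", "4B", "8B")),
)


def models_for(task: str):
    """Return the suite entries whose task matches exactly."""
    return [m for m in SUITE if m.task == task]


if __name__ == "__main__":
    for model in models_for("navigation"):
        print(model.name, model.base, model.variants)
```

A table like this makes it easy to see which entries still need details from the full article (for example, the backbone of RobotWorld) before an evaluation is planned.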

Summary basis: official / RSS source. Unless it says 'full article read', this summary is based only on publicly available content; it never pretends to have read restricted originals.

Sources

MarkTechPost
Fast research-paper and ML tooling summaries, useful for infra and agent updates.
