arXiv 2026
Mohamad H. Danesh  ·  Chenhao Li  ·  Amin Abyaneh  ·  Anas Houssaini
Kirsty Ellis  ·  Glen Berseth  ·  Marco Hutter  ·  Hsiu-Chin Lin
McGill University  ·  Mila – Quebec AI Institute  ·  ETH Zürich  ·  Université de Montréal

Abstract

World models promise a paradigm shift in robotics, where an agent learns the underlying physics of its environment once to enable efficient planning and behavior learning. However, current world models are often hardware-locked specialists: a model trained on a Boston Dynamics Spot robot fails catastrophically on a Unitree Go1 due to the mismatch in kinematic and dynamic properties, as the model overfits to specific embodiment constraints rather than capturing the universal locomotion dynamics. Consequently, a slight change in actuator dynamics or limb length necessitates training a new model from scratch. In this work, we take a step towards a framework for training a generalizable Quadrupedal World Model (QWM) that disentangles environmental dynamics from robot morphology. We address the limitations of implicit system identification, where treating static physical properties (like mass or limb length) as latent variables to be inferred from motion history creates an adaptation lag that can compromise zero-shot safety and efficiency. Instead, we explicitly condition the generative dynamics on the robot's engineering specifications. By integrating a physical morphology encoder and a reward normalizer, we enable the model to serve as a neural simulator capable of generalizing across morphologies. This capability unlocks zero-shot control across a range of embodiments. Since the policy is conditioned on generalizable latent dynamics provided by the world model, we can deploy the agent on entirely unseen quadrupeds without fine-tuning, adaptation, or warm-up periods. We introduce, for the first time, a world model that enables zero-shot generalization to new morphologies for locomotion. 
We carefully study the limitations of our method: QWM operates as a distribution-bounded interpolator within the quadrupedal morphology family rather than as a universal physics engine. Even so, this work represents a significant step toward morphology-conditioned world models for legged locomotion.

Overview

QWM framework overview
Overview of the QWM Framework. Left (WM Learning): We train a single generalizable WM across diverse morphologies. The Physical Morphology Encoder (PME) derives a static embedding $\mu$ from each robot's USD description, which explicitly conditions both the encoder and the recurrent state $h_t$ (dashed lines). The model uses previous actions $a_t$ and discrete stochastic states $z_t$ to predict future states, rewards $\hat{r}_t$, and continuation probabilities $\hat{c}_t$. To handle heterogeneous reward scales, we employ an Adaptive Reward Normalizer (ARN) alongside the standard DreamerV3 backbone components. Middle (Behavior Learning): Policies are learned entirely in imagination. By freezing the generalized WM's components and injecting the $\mu$ of any robot, we can train an actor and critic for a new morphology without any physical interaction. Right (Unified Deployment): By freezing the generalized WM and policy and injecting the $\mu$ of a target robot (e.g., ANYmal-B), the WM creates a morphology-aligned latent space that lets the policy adapt its control strategy immediately, without further training.
Heterogeneous robot cohort
The heterogeneous morphology cohort used in experiments, illustrating the variance in physical scale and configuration. QWM is trained on seven robots while holding out one for zero-shot evaluation.

Method

QWM extends DreamerV3 with three targeted architectural changes to handle cross-morphology generalization:

Physical Morphology Encoder (PME): Extracts normalized features across four categories: kinematics & topology (hip offset, limb lengths, knee configuration), geometry (stance dimensions), dynamics (log-scaled mass), and actuation (torque density). Processed by a dedicated 2-layer MLP that runs parallel to the proprioceptive encoder, preventing static context from being overwhelmed by dynamic signals.
Morphology-Conditioned Recurrent Dynamics: The morphology embedding $\mu$ is injected at every recurrent step: $h_t = f(h_{t-1}, z_{t-1}, a_{t-1}, \mu)$. This allows the recurrent state to focus on dynamic execution while explicit conditioning handles static embodiment properties.
Adaptive Reward Normalizer (ARN): Quantile-based scaling using exponential moving averages tracks per-robot reward distributions, dynamically normalizing heterogeneous reward signals so no single morphology dominates training.
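The per-step conditioning above can be sketched in a few lines. This is a minimal illustration, not the actual QWM implementation: the simple tanh cell stands in for DreamerV3's GRU-based sequence model, all dimensions are made up, and `encode_morphology` is a hypothetical 2-layer MLP playing the role of the PME.

```python
# Minimal sketch of the morphology-conditioned recurrent step
# h_t = f(h_{t-1}, z_{t-1}, a_{t-1}, mu). A plain tanh cell is an
# illustrative stand-in for the GRU in the DreamerV3 backbone.
import numpy as np

rng = np.random.default_rng(0)

# Hidden, stochastic, action, morphology-feature, and embedding sizes
# (all illustrative).
H, Z, A, M, F = 8, 4, 3, 5, 6

# Hypothetical 2-layer PME: static morphology features -> embedding mu.
W1, b1 = rng.normal(size=(F, M)) * 0.1, np.zeros(F)
W2, b2 = rng.normal(size=(F, F)) * 0.1, np.zeros(F)

def encode_morphology(features):
    """Two-layer MLP over normalized static features (log mass, limb lengths, ...)."""
    hidden = np.tanh(W1 @ features + b1)
    return np.tanh(W2 @ hidden + b2)

# Recurrent weights over the concatenated input [h, z, a, mu].
Wr = rng.normal(size=(H, H + Z + A + F)) * 0.1

def recurrent_step(h_prev, z_prev, a_prev, mu):
    """mu is re-injected at every step, so static embodiment context
    never has to be inferred from the motion history."""
    x = np.concatenate([h_prev, z_prev, a_prev, mu])
    return np.tanh(Wr @ x)

mu = encode_morphology(rng.normal(size=M))  # fixed per robot
h = np.zeros(H)
for _ in range(3):  # unroll a few imagined steps
    h = recurrent_step(h, rng.normal(size=Z), rng.normal(size=A), mu)
```

Because $\mu$ enters the transition at every step rather than only at initialization, swapping in a new robot's embedding immediately changes the predicted dynamics, which is what removes the adaptation lag of implicit system identification.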
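The ARN idea can likewise be sketched concretely. The percentile choice (5th/95th, mirroring DreamerV3's return normalization), the decay constant, and the class name below are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative Adaptive Reward Normalizer: per-robot exponential moving
# averages of the 5th/95th reward percentiles rescale heterogeneous
# rewards so no single morphology dominates training. Percentiles and
# decay are assumed values, not the paper's.
import numpy as np

class AdaptiveRewardNormalizer:
    def __init__(self, num_robots, decay=0.99, eps=1e-8):
        self.lo = np.zeros(num_robots)   # EMA of 5th percentile, per robot
        self.hi = np.zeros(num_robots)   # EMA of 95th percentile, per robot
        self.decay = decay
        self.eps = eps

    def update(self, robot_id, rewards):
        q05, q95 = np.percentile(rewards, [5, 95])
        self.lo[robot_id] = self.decay * self.lo[robot_id] + (1 - self.decay) * q05
        self.hi[robot_id] = self.decay * self.hi[robot_id] + (1 - self.decay) * q95

    def normalize(self, robot_id, rewards):
        # Divide by the tracked range, but never amplify already-small rewards.
        scale = max(self.hi[robot_id] - self.lo[robot_id], 1.0)
        return np.asarray(rewards) / (scale + self.eps)
```

A heavy robot whose rewards sit near 100 and a light one near 1 then contribute gradients of comparable magnitude, since each is divided by its own tracked reward range.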

Training QWM required running eight different robot morphologies in parallel within a single simulator, something Isaac Lab does not support out of the box. To enable this, we built Hetero-Isaac, an extension to NVIDIA Isaac Lab that assigns distinct robot morphologies, collision geometries, and kinematic trees to different environment subsets while keeping all physics fidelity intact. The full technical details of this infrastructure, including joint-order unification, index mapping, and padded reward functions, are described in the accompanying blog post: Heterogeneous Environments in Isaac Lab.
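The joint-order unification mentioned above can be illustrated with a small sketch. Everything here is hypothetical: the canonical joint naming, helper names, and 12-slot layout are stand-ins, not Hetero-Isaac's actual API; robots with fewer joints would simply leave their unused canonical slots at the pad value.

```python
# Hypothetical joint-order unification for batching heterogeneous
# morphologies: each robot's native joint vector is scattered into one
# canonical, padded layout so tensors share a shape across env subsets.
import numpy as np

CANONICAL_JOINTS = [f"{leg}_{j}" for leg in ("FL", "FR", "RL", "RR")
                    for j in ("hip", "thigh", "calf")]  # 12 padded slots

def build_index_map(robot_joint_order):
    """Indices that scatter a robot's native joint vector into canonical order."""
    return np.array([CANONICAL_JOINTS.index(name) for name in robot_joint_order])

def to_canonical(values, index_map, pad_value=0.0):
    out = np.full(len(CANONICAL_JOINTS), pad_value)
    out[index_map] = values
    return out

# Example: a robot that enumerates its joints in a different native order.
native_order = ["FR_hip", "FL_hip", "FR_thigh", "FL_thigh",
                "FR_calf", "FL_calf", "RR_hip", "RL_hip",
                "RR_thigh", "RL_thigh", "RR_calf", "RL_calf"]
idx = build_index_map(native_order)
q = to_canonical(np.arange(12.0), idx)  # native joint positions -> canonical slots
```

The same index map, applied in reverse, routes canonical-order actions back to each robot's native actuator ordering at simulation time.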

Real-World Experiments

Both Unitree Go1 and ANYmal-D were held out during training. By injecting the correct morphology embedding, the frozen policy achieves stable locomotion on both platforms with zero falls across 20 trials (10 per platform, 60 seconds each).

ANYmal-D zero-shot deployment
ANYmal-D: zero-shot, held out during training
Unitree Go1 zero-shot deployment
Unitree Go1: zero-shot, held out during training
Multi-Robot Training
Hetero-Isaac: 8 robots training in parallel
Open-loop imagination rollouts
Open-loop imagination rollouts vs. ground truth physics

Multi-Morphology Mastery

A single QWM is trained simultaneously on the full heterogeneous cohort of eight quadrupeds and compared against world model baselines (DreamerV3, PWM, TWISTER) as well as a model-free oracle (PME-PPO).

Learning curves comparing QWM against baselines on heterogeneous robot cohort
Learning curves comparing QWM against baselines trained simultaneously on the full heterogeneous cohort. Left: mean reward. Right: mean episode length. Shaded regions are standard deviation across 5 seeds.

BibTeX

@misc{danesh2026qwm,
  title         = {Toward Hardware-Agnostic Quadrupedal World Models via Morphology Conditioning},
  author        = {Danesh, Mohamad H. and Li, Chenhao and Abyaneh, Amin and Houssaini, Anas and Ellis, Kirsty and Berseth, Glen and Hutter, Marco and Lin, Hsiu-Chin},
  year          = {2026},
  eprint        = {2604.08780},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2604.08780}
}