Embodied AI: The Brain Behind the Robot

By Tech Buzz China March 14, 2026

Hardware alone doesn't make a humanoid robot useful — the AI brain does. This report examines the Vision-Language-Action (VLA) model architectures that are becoming the dominant paradigm for robot intelligence, explains why 7 billion parameters has emerged as a key threshold, and profiles the Chinese AI companies and robot makers building the embodied AI stack that will determine who leads the next decade of robotics.

The "Brain / Cerebellum / Body" Framework

Chinese industry and government research (notably CAICT) organizes humanoid robot technology into three layers — a framework worth understanding because it's how policy, funding, and R&D priorities are structured in China:

  • "大脑" (Brain): Environment perception, task planning, human interaction, decision-making — powered by large AI models. This is the ICT layer.
  • "小脑" (Cerebellum): Motion planning and control — translating high-level decisions into coordinated physical movement. Uses reinforcement learning, imitation learning, or traditional model-based control (MPC, ZMP). Currently shifting from model-based to learning-based approaches.
  • "肢体" (Body): The physical hardware — actuators, sensors, materials, power systems. This is the industrial equipment layer.

CAICT argues these two domains (ICT for brain/cerebellum, industrial for body) must develop in tandem, but with different priorities at different stages: in early commercialization, the "brain" matters most (can the robot actually do useful things?); at mass scale, the "body" matters most (can it be built cheaply and reliably?).

CAICT's 5-Level Capability Scale

CAICT has defined a maturity framework for humanoid robots that maps well to where the industry stands today:

  • Lv1 — Basic capability: Stable walking, running, jumping, basic interaction. Where most full-body humanoids are today.
  • Lv2 — Task-specific intelligence: Can perform defined tasks in specific scenarios. Wheeled humanoids and some bipedals are approaching this.
  • Lv3 — Scene intelligence: Can handle most unstructured tasks within a given scenario. Some generalization ability.
  • Lv4 — Multi-scene adaptation: Works across 3+ different environments handling unstructured tasks.
  • Lv5 — Full embodied intelligence: True general-purpose capability — learns new tasks from minimal instruction.

The jump from Lv1 to Lv2 is largely a "brain" problem — better AI and perception. The jump from Lv2 to Lv3+ requires both better AI and much better hardware at lower cost.

Four Parallel "Brain" AI Technology Paths

CAICT identifies four concurrent technology paths for the robot "brain," which are gradually converging toward end-to-end models:

  1. LLM + VFM (Vision Foundation Model): Language interaction + visual understanding for task planning. The most mature path today. (Example: Google SayCan)
  2. VLM (Vision-Language Model): Bridges language and visual understanding for more accurate planning. (Example: Tsinghua's CoPa)
  3. VLA (Vision-Language-Action): Adds motor control output to VLM, solving the motion trajectory problem. This is where the industry is converging. (Example: Google RT-H)
  4. Multimodal Large Models: Full sensory integration — vision, hearing, touch — for complete physical world perception. The future direction. (Example: MIT/IBM MultiPLY)

Notably, Unitree's March 2026 IPO prospectus identifies two additional architecture paradigms gaining traction. WMA (World Model-Action) models build an internal "world model" that predicts future states of the physical environment, then uses those predictions to generate safer, more efficient action strategies — essentially giving the robot the ability to simulate outcomes before acting. Dual-system architectures mimic human "fast and slow thinking" — a large multimodal model serves as the "slow system" for cross-scenario generalization and planning, while a lightweight VLA or action-expert policy serves as the "fast system" for real-time motor control. Unitree has invested in both WMA and VLA paths, open-sourcing UnifoLM-WMA-0 (Sep 2025) and UnifoLM-VLA-0 (Jan 2026), and deploying an industrial-grade model (UnifoLM-X1-0) in its own factory for motor assembly tasks — one of the first confirmed real-world industrial deployments of an embodied AI model globally.
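The dual-system idea above can be sketched in a few lines. This is an illustrative toy, not Unitree's implementation: the `SlowPlanner` and `FastController` classes, the tick rates, and the 7-joint action format are all hypothetical stand-ins for a large multimodal planner and a lightweight action policy.

```python
class SlowPlanner:
    """Stand-in for the 'slow system': a large multimodal model that
    re-plans infrequently (in reality a 7B+ VLM call with 100ms+ latency)."""
    def plan(self, goal, observation):
        return [f"subgoal-{i}" for i in range(3)]  # toy subgoal list

class FastController:
    """Stand-in for the 'fast system': a lightweight VLA / action-expert
    policy that runs every control tick (in reality <10ms latency)."""
    def act(self, subgoal, observation):
        return {"joint_targets": [0.0] * 7, "subgoal": subgoal}

def control_loop(goal, ticks=10, replan_every=5):
    planner, controller = SlowPlanner(), FastController()
    subgoals, actions = [], []
    for t in range(ticks):
        obs = {"tick": t}                 # placeholder observation
        if t % replan_every == 0:         # slow system: occasional re-planning
            subgoals = planner.plan(goal, obs)
        actions.append(controller.act(subgoals[0], obs))  # fast system: every tick
    return actions

actions = control_loop("pick up the red cup")
```

The design point is the frequency split: the expensive model runs once per `replan_every` ticks, while the cheap policy runs on every tick, so real-time control never blocks on the large model.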

VLA Models Overview

Vision-Language-Action (VLA) models are the dominant AI architecture for humanoid robots in 2025–2026. They take visual inputs (camera feeds), language instructions ("pick up the red cup"), and sensor data, and output motor commands as actions, bridging natural language understanding, visual perception, and physical control in a single end-to-end model.

Where earlier modular pipelines treated perception, planning, and control as separate systems joined by hand-designed interfaces, VLA models learn a direct mapping from visual input and language instruction to motor commands. The key model families include Google's RT-2 (which demonstrated that scaling language-model parameters improves robot generalization), Physical Intelligence's π0 (Pi Zero, which showed strong manipulation performance), and several Chinese models now competing head to head. Benchmarks typically evaluate generalization across axes such as lighting, distractor objects, object position, table height, background, and object category. Generalization, the ability to handle situations the model was never explicitly trained on, remains the hardest unsolved problem in embodied AI.
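The end-to-end mapping can be made concrete with a toy forward pass. Everything here is a placeholder under stated assumptions: the encoders are trivial stand-ins for real vision and language backbones, the random weight matrix stands in for a trained action head, and the 7-dimensional output assumes a hypothetical 7-DoF arm.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(pixels):
    """Stand-in vision encoder: flatten and truncate to a 64-dim feature."""
    return pixels.reshape(-1)[:64]

def encode_text(instruction):
    """Stand-in language encoder: toy byte-histogram embedding, 64-dim."""
    vec = np.zeros(64)
    for i, ch in enumerate(instruction.encode()):
        vec[i % 64] += ch / 255.0
    return vec

# Fused 128-dim features -> 7 joint targets (random weights stand in for training).
W = rng.standard_normal((128, 7)) * 0.01

def vla_step(pixels, instruction):
    fused = np.concatenate([encode_image(pixels), encode_text(instruction)])
    return fused @ W  # one motor command per joint

action = vla_step(rng.random((8, 8, 3)), "pick up the red cup")
```

The point of the sketch is the interface, not the internals: one function maps (image, instruction) directly to motor commands, with no separate planner module in between.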

Benchmark Highlight

Galbot's GraspVLA outperforms OpenVLA, π0, RT-2, and RDT on generalization across six axes: lighting variation, distractor objects, object position, table height, background, and object category — while being the first model fully pre-trained on synthetic data.

Key VLA Architectures in China

China's major humanoid robot companies have all developed or adopted proprietary VLA-style architectures. AgiBot's GO-1 uses a ViLLA architecture (VLM + Mixture of Experts), combining a vision-language model backbone with specialized expert modules for different action types. Galbot's GraspVLA achieves strong generalization by training entirely on synthetic data generated via NVIDIA Isaac simulation. Zhipingfang's GOVLA is described as "globally leading" on several benchmarks, though independent head-to-head comparisons remain limited.

The 7B Parameter Threshold

Within the embodied AI community, 7 billion parameters has emerged as an important practical threshold for VLA models. Models below 7B tend to struggle with generalization — they can be trained to perform specific tasks reliably, but fail when conditions change even slightly. Models at or above 7B show markedly better zero-shot and few-shot generalization, meaning they can handle novel situations without additional training. However, larger models also require more compute and have higher inference latency — a critical constraint for real-time robot control.

The research evidence is consistent across multiple studies: at 7B+ parameters, models develop emergent generalization capabilities, handling novel objects, lighting conditions, and task phrasings with markedly better success rates, while smaller models overfit to their training scenarios. The real-world tradeoff is latency: a 7B model running on current edge hardware (NVIDIA Jetson Orin, ~275 TOPS) takes 100–200ms per inference step, acceptable for manipulation but borderline for high-speed locomotion. Three techniques are helping bridge this gap. Distillation trains a smaller "student" model to mimic a larger "teacher," preserving 80–90% of capability at 3–5x faster inference. Quantization reduces weight precision from 16- or 32-bit to 8-bit or 4-bit, cutting memory footprint and accelerating inference. MoE (Mixture of Experts) activates only a subset of model parameters for each input, reducing effective compute cost.
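The memory and latency arithmetic behind these tradeoffs is easy to work out. The figures below follow directly from the parameter counts and the 100–200ms latency cited above; activations and KV-cache overhead are ignored for simplicity.

```python
def model_memory_gb(params_billions, bits_per_weight):
    """Approximate weight-only memory footprint in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def control_rate_hz(latency_ms):
    """Maximum control frequency if every tick waits on one inference."""
    return 1000.0 / latency_ms

# A 7B model at different precisions (weights only):
fp16_gb = model_memory_gb(7, 16)  # 14.0 GB
int8_gb = model_memory_gb(7, 8)   #  7.0 GB
int4_gb = model_memory_gb(7, 4)   #  3.5 GB

# 100-200 ms per inference step implies a 5-10 Hz control loop,
# which is why high-speed locomotion needs a faster (distilled) policy.
rate_slow = control_rate_hz(200)  # 5.0 Hz
rate_fast = control_rate_hz(100)  # 10.0 Hz
```

Note that quantizing to 4-bit brings a 7B model within the memory budget of a 64GB Jetson-class module with ample headroom, which is why quantization is usually the first lever pulled for on-robot deployment.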

Why It Matters

A robot that can only do what it was explicitly trained to do is useful in exactly one factory. A robot that generalizes to new tasks from language instructions can be deployed anywhere — that's the commercial prize behind the 7B parameter race.

On-Robot Compute Constraints

Running a 7B+ parameter model in real time on a robot requires significant onboard compute — and the power to run it (see our Batteries & Power report for the energy tradeoffs). Current humanoid robots typically carry NVIDIA Jetson Orin modules (275 TOPS, ~60W) for edge inference, supplemented by cloud offloading for non-latency-critical planning tasks. Some Chinese platforms are evaluating domestic alternatives: Horizon Robotics' Journey 5 chip and Huawei's Ascend 310 offer competitive inference performance at lower cost, though the NVIDIA CUDA ecosystem remains dominant for model development. The emerging trend is a "big model in the cloud, small model on the robot" architecture — where a large (70B+) model handles task planning and reasoning via low-latency 5G connection, while a distilled on-robot model (3–7B) handles real-time motor control.
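The cloud/edge split described above amounts to routing each request by its latency budget. This is a minimal sketch, not any vendor's API: the latency numbers and the `route` function are illustrative assumptions (edge approximates the on-robot distilled model, cloud approximates a 70B+ planner reached over 5G).

```python
def route(task_name, latency_budget_ms,
          edge_latency_ms=150, cloud_latency_ms=400):
    """Send a task to on-robot ('edge') or cloud inference.

    Assumed latencies: ~150ms for the on-robot 3-7B model,
    ~400ms round trip for a large cloud planner over 5G.
    """
    if latency_budget_ms < cloud_latency_ms:
        return "edge"   # real-time control cannot wait for the network
    return "cloud"      # planning/reasoning tolerates the round trip

# Real-time motor control stays local; deliberate task planning goes up.
grasp_target = route("grasp adjustment", latency_budget_ms=50)
plan_target = route("rearrange the shelf", latency_budget_ms=2000)
```

In practice the router would also consider connectivity and safety fallbacks (the robot must remain controllable when the 5G link drops), but the latency budget is the primary axis.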

Training Data & Infrastructure

Training data is the fuel of embodied AI, and getting enough of it is one of the field's central challenges. Unlike text or images, robot training data requires physical demonstrations in the real world: a human operator teleoperating a robot arm through a task, frame by frame. This is expensive and slow. The industry is therefore pursuing two parallel strategies: scaling up real-world data collection and generating synthetic data via simulation.

China's scale advantage in data collection is becoming a key differentiator. AgiBot deployed 100 robots to generate 1 million+ action trajectories — the AgiBot World dataset — covering hundreds of manipulation tasks across diverse environments. Galbot took a different approach, generating a billion-scale manipulation dataset in a single week via NVIDIA Isaac simulation — demonstrating that synthetic data can bootstrap VLA training at dramatically lower cost. Government-funded "embodied intelligence data collection centers" represent one of the largest sources of industrial humanoid robot orders in 2025 — these facilities purchase robots specifically to generate training data at scale, creating a virtuous cycle where robot sales fund the AI development that makes robots more capable.
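What makes simulation-generated data useful for generalization is domain randomization: varying exactly the nuisance factors that benchmarks later test. The sketch below is a toy illustration of that idea, not Galbot's pipeline; the factor names mirror the generalization axes discussed earlier, and all ranges are made up.

```python
import random

def randomized_scene(rng):
    """One synthetic training sample with randomized nuisance factors,
    mirroring the axes VLA benchmarks test (illustrative ranges)."""
    return {
        "lighting":     rng.uniform(0.2, 1.0),    # relative intensity
        "object":       rng.choice(["cup", "bottle", "box"]),
        "position_xy":  (rng.uniform(-0.3, 0.3),  # meters on the table
                         rng.uniform(-0.2, 0.2)),
        "table_height": rng.uniform(0.70, 0.90),  # meters
        "background":   rng.choice(["wood", "white", "cluttered"]),
        "distractors":  rng.randint(0, 5),        # extra objects in view
    }

def generate_dataset(n_samples, seed=0):
    rng = random.Random(seed)  # seeded for reproducible generation
    return [randomized_scene(rng) for _ in range(n_samples)]

data = generate_dataset(1000)
```

A real pipeline renders each sampled scene in a physics simulator (e.g. NVIDIA Isaac) and records the resulting observation/action pairs; the cheapness of sampling is what makes billion-scale datasets feasible in days rather than years.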

Company | Dataset | Scale | Approach
AgiBot | AgiBot World | 1M+ trajectories | 100 robots, real-world collection
Galbot | Synthetic manipulation | 1B+ interactions | NVIDIA Isaac simulation
Unitree | Full-body motion + manipulation | Open-sourced (43 GitHub repos) | Real robot motion capture; sim-to-real via deep RL; IPO prospectus discloses UnifoLM-X1-0 deployed in own factory
UBTECH | Factory real-environment dataset | "World's largest" (claimed) | Walker series factory deployments

Chinese AI Companies

Beyond the robot makers themselves, a distinct cohort of pure-play embodied AI companies is emerging in China — focused entirely on the AI brain rather than the hardware body. Daxiao Robot (大晓机器人), spun out of SenseTime by co-founder Wang Xiaogang, develops the Enlightenment World Model 3.0 and has no hardware products at all. Zibianliang Robot (自变量机器人) — backed uniquely by ByteDance, Alibaba, AND Meituan simultaneously — develops the WALL-A end-to-end model for robot operation.

The strategic question facing the industry is whether AI-brain and hardware-body should be developed together or separately. Integrated players like AgiBot and Unitree argue that tight hardware-software coupling produces better performance — the AI should be designed for the specific sensor suite and actuator characteristics of its body. Pure-play AI companies counter that a general-purpose robot brain, like a general-purpose operating system, creates more value by working across many hardware platforms. The major Chinese tech giants are hedging both ways: Baidu has an embodied intelligence unit developing robot AI, Alibaba has invested in multiple robotics startups, and Huawei's Pangu embodied AI model powers Leju Robot's Kuafu series — suggesting the tech giants see embodied AI as a platform opportunity rather than a hardware play.

  • Galbot (银河通用): GraspVLA — first model fully pre-trained on synthetic data; ¥20B+ valuation as of March 2026
  • Zhipingfang (智平方): GOVLA model; 12 funding rounds in one year; Shenzhen's first ¥10B+ embodied intelligence unicorn
  • Daxiao Robot: Enlightenment World Model 3.0; SenseTime spin-off; AI-only, no hardware
  • Zibianliang (自变量): WALL-A model; unique ByteDance + Alibaba + Meituan backing

Looking Ahead

Embodied AI is where the 2020s robotics boom will ultimately be won or lost. The companies that build the most generalizable robot brains — ones that can perform new tasks from language instructions alone — will define the commercial landscape for the next decade. China's combination of massive data collection scale, strong AI research talent, and government support positions it as a genuine global contender.