robotics · pi · VLA · VLM

Robotics - A deep dive on the history of robotics and the future of humanoids

May 3, 2026

original: https://github.com/adam-maj/robotics?tab=readme-ov-file&__readwiseLocation=

What are state-of-the-art robots currently capable of?

Fundamentals of Robotics

→ Building systems that can alter the physical world to accomplish arbitrary goals

ideas → action

Robotic systems need to:

Observe & understand the state of their environment
Plan the actions they need to accomplish their goals
Know how to execute actions with their hardware

3 essential functions of robotic systems:

Perception
Planning ← actually the easiest / most solved
Control
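The three functions above can be sketched as a loop. This is a toy illustration on a 1-D line, not anything from the writeup: the `World` class, the function names, and the step size are all hypothetical, and the actuator is assumed to be perfect.

```python
from dataclasses import dataclass


@dataclass
class World:
    robot_pos: float  # true robot position on a 1-D line
    goal_pos: float   # where the robot should end up


def perceive(world: World) -> dict:
    """Perception: turn the world into an internal state estimate."""
    return {"pos": world.robot_pos, "goal": world.goal_pos}


def plan(state: dict, max_step: float = 0.5) -> float:
    """Planning: pick the next waypoint, at most max_step away."""
    error = state["goal"] - state["pos"]
    step = max(-max_step, min(max_step, error))
    return state["pos"] + step


def control(world: World, waypoint: float) -> None:
    """Control: drive the (idealized, assumed perfect) actuator."""
    world.robot_pos = waypoint


def run(world: World, tol: float = 1e-6, max_iters: int = 100) -> int:
    """Repeat perceive -> plan -> control until the goal is reached."""
    for i in range(max_iters):
        state = perceive(world)
        if abs(state["goal"] - state["pos"]) <= tol:
            return i  # number of control steps that were needed
        control(world, plan(state))
    return max_iters


world = World(robot_pos=0.0, goal_pos=2.0)
steps = run(world)
```

In a real robot each of these three functions is a hard problem in its own right; the loop structure is the only part this sketch is meant to convey.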

General purpose robotics: fully autonomous, broadly capable & generally intelligent robotic systems

Hardware

→ Not the current bottleneck to progress

3 critical functions:

Cameras, LIDAR, IMU & other sensors for perception
Actuators that let the robot move at its joints (control)
Compute: planning and execution

Hardware Constraints

Degrees of freedom: freedom of movement
Computational complexity: the system should not be overly complex
Weight ratio: able to lift heavy things without weighing too much itself
Safety
Cost (cost & energy to mass produce)

Companies that maintain the same hardware over time will be able to take advantage of compounding effects:

Deploying robots in the world
Collecting diverse real world datasets
Iterative cycles on training robots with the collected data

Hardware platforms must be sufficiently general to avoid needing to alter the hardware too much.

Hardware Capabilities of Modern Humanoid Systems

High degree of freedom: hands
Camera-only vision (cost optimization)
AI compute
Battery life

Software is the real bottleneck

Moravec's Paradox:
→ Planning is the easiest
→ Control effective
→ Eric Jang? Motor control is the hardest

Perception

Structure of the environment
Presence & location of objects in the environment
Its own position & orientation
→ Internal representation of the environment

SLAM

Simultaneous localization and mapping
Monocular SLAM → single camera, no LIDAR
SLAM w/ Deep Learning → mostly algorithms, small contributions by DL

Breakthroughs

Early SLAM
Monocular SLAM
SLAM with Deep Learning

Planning

Path planning
Task planning → fine-tuned VLMs → relatively solved

Convert the high-level goal of the robot into subtasks and eventually individual motor routines
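For the path planning half of this section, a minimal sketch of finding a route around obstacles on a grid. Breadth-first search is used here only because it is short and easy to check; the notes do not name a specific planner, and the grid, `shortest_path`, and its arguments are hypothetical.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid (1 = obstacle, 0 = free).

    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],  # a wall with a gap on the right
    [0, 0, 0],
]
path = shortest_path(grid, (0, 0), (2, 0))
```

The planner has to go around the wall through the gap, so the path visits the right-hand column before coming back to the goal.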

Control

Approaches:

Classical Control: manual modeling of dynamics in the environment
Deep RL
Robotic Transformers

Breakthrough #1: Classical Control

Physics-based models, usually built by directly modeling the forces on objects
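A minimal sketch of what this looks like in practice: write down the dynamics by hand (here F = ma for a point mass on a line) and design a feedback law on top of it. The PD controller, the gains, and the time step are assumptions chosen for illustration, not values from the writeup.

```python
def simulate_pd(target=1.0, kp=20.0, kd=8.0, mass=1.0, dt=0.01, steps=1000):
    """Drive a 1-D point mass to `target` with a PD controller.

    The dynamics are modeled by hand: acceleration = force / mass,
    integrated with a simple fixed-step (semi-implicit Euler) scheme.
    """
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        # Feedback law: pull toward the target, damp the velocity.
        force = kp * (target - pos) - kd * vel
        acc = force / mass   # Newton's second law
        vel += acc * dt
        pos += vel * dt
    return pos, vel


final_pos, final_vel = simulate_pd()
```

Everything here depends on the hand-written model being right; that dependence is what the learning-based approaches below try to remove.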

Breakthrough #2: Deep RL

Breakthrough #3: Simulation

Training robotic control policies in simulation offers the advantage of parallelization and scale far exceeding what's possible in reality
MuJoCo: an open source simulator built specifically with attention to the concerns of robotics
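To make the scale argument concrete, here is a toy sketch of stepping many simulated environments in lockstep and pooling their transitions. `ToyPointEnv` is a hypothetical stand-in for a real simulator (it does not use the MuJoCo API), and the policy is hand-written; real systems run the environments on many cores or on a GPU rather than in a Python loop.

```python
import random


class ToyPointEnv:
    """Hypothetical 1-D environment standing in for a real simulator."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.pos = 0.0

    def reset(self):
        self.pos = self.rng.uniform(-1.0, 1.0)
        return self.pos

    def step(self, action):
        self.pos += 0.1 * action
        reward = -abs(self.pos)  # closer to the origin is better
        return self.pos, reward


def collect(num_envs, horizon):
    """Step num_envs environments in lockstep and pool the transitions."""
    envs = [ToyPointEnv(seed) for seed in range(num_envs)]
    obs = [env.reset() for env in envs]
    transitions = []
    for _ in range(horizon):
        # Fixed hand-written policy: always push toward the origin.
        actions = [-1.0 if o > 0 else 1.0 for o in obs]
        results = [env.step(a) for env, a in zip(envs, actions)]
        for o, a, (next_o, r) in zip(obs, actions, results):
            transitions.append((o, a, r, next_o))
        obs = [next_o for next_o, _ in results]
    return transitions


data = collect(num_envs=64, horizon=50)
```

The amount of data grows as `num_envs * horizon`, and adding environments costs compute rather than robots or operator time.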

Breakthrough #4: End-to-end Learning

Initially, deep learning based robotics systems trained vision and motor control separately, which restricted the flow of information between perception and control systems
End-to-end visuomotor policies were introduced to jointly train vision and motor control systems with a single objective
- the models learn the best flow of information between the two systems

Breakthrough #5: Tele-operation & Imitation Learning

Teleoperation: Humans operating real world robots
Imitation Learning: Train the robots on demonstrations collected through tele-operation
Data is a huge problem in robotics - it is difficult to get the "internet scale" data that has made modern LLMs so successful
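The simplest form of imitation learning is behavior cloning: treat the demonstrations as (state, action) pairs and fit a policy with supervised learning. The sketch below assumes a toy 1-D task, a scripted `expert_action` standing in for a human teleoperator, and a linear policy fit by least squares; all of these are illustrative choices, not the writeup's method.

```python
import random


def expert_action(state):
    """Stand-in for a human teleoperator: push toward the origin."""
    return -2.0 * state


def collect_demos(n, seed=0):
    """Record (state, action) pairs from the expert."""
    rng = random.Random(seed)
    states = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(s, expert_action(s)) for s in states]


def fit_linear_policy(demos):
    """Behavior cloning: least-squares fit of action = w * state + b."""
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    var = sum((s - mean_s) ** 2 for s, _ in demos)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


w, b = fit_linear_policy(collect_demos(200))
```

The learned policy can only be as good as the coverage of the demonstrations, which is why the data problem noted above matters so much.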

Breakthrough #6: Robotic Transformers

GDM's Robotics Transformer 1 (RT-1) showed successful use of the Transformer architecture for robotics
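RT-2 and SayCan demonstrated that large pretrained models can be used for robotic planning and control tasks: SayCan grounds a language model's plans in what the robot can actually do, and RT-2 fine-tunes a multimodal VLM
- RT-2 introduces the VLA (Vision-Language-Action) model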
ACT (Action Chunking Transformer): Allowed control policies to predict the next series of actions over multiple time-steps, allowing for smoother and more coordinated actuator control
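A sketch of the action chunking idea, under stated assumptions: `predict_chunk` is a hypothetical deterministic stand-in for a learned policy, and the weighted averaging of overlapping predictions (with exponentially decaying weights) is one way to smooth the output, assumed here for illustration rather than taken from the notes.

```python
import math


def predict_chunk(t, chunk_size):
    """Hypothetical policy: at time t, predict actions for t .. t+chunk_size-1."""
    return [math.sin(0.1 * (t + i)) for i in range(chunk_size)]


def run_with_chunking(horizon, chunk_size):
    """Query the policy once per chunk and execute the chunk open-loop."""
    actions, calls = [], 0
    t = 0
    while t < horizon:
        chunk = predict_chunk(t, chunk_size)
        calls += 1
        actions.extend(chunk[: horizon - t])  # do not run past the horizon
        t += chunk_size
    return actions, calls


def run_with_ensembling(horizon, chunk_size, decay=0.1):
    """Query every step; average all predictions made for the current step."""
    pending = {}  # timestep -> predictions for it, oldest first
    actions = []
    for t in range(horizon):
        for i, a in enumerate(predict_chunk(t, chunk_size)):
            pending.setdefault(t + i, []).append(a)
        preds = pending.pop(t)
        # Assumed weighting: the oldest prediction gets the largest weight.
        weights = [math.exp(-decay * i) for i in range(len(preds))]
        actions.append(sum(w * p for w, p in zip(weights, preds)) / sum(weights))
    return actions


chunked, calls = run_with_chunking(horizon=20, chunk_size=5)
ensembled = run_with_ensembling(horizon=20, chunk_size=5)
```

With a chunk size of 5, the first variant needs a fifth as many policy calls as a step-by-step policy; the second trades those savings for smoother actions when successive predictions disagree.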

VLMs

It's hard to overestimate how much value VLMs have brought to robotic planning and reasoning capabilities; this has been a major unlock on the path toward general-purpose robotics.

Breakthrough #7: Cross-embodiment

Physical Intelligence's pi0 introduces architectural and training improvements and trains the model on data from many different robot hardware systems
- this is known as a cross-embodiment dataset
The model can then generalize to new hardware via fine-tuning
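One practical question a cross-embodiment dataset raises is how to batch robots whose action vectors have different sizes. A common approach is to pad every action to a shared maximum dimension and keep a mask of the real entries. The sketch below assumes that approach; the robot names, dimensions, and `pad_action` helper are hypothetical and not taken from the writeup.

```python
MAX_ACTION_DIM = 8  # assumed size of the shared action space


def pad_action(action, max_dim=MAX_ACTION_DIM):
    """Zero-pad an action to max_dim and return (padded, mask).

    mask[i] is 1 where the robot really has that action dimension.
    """
    if len(action) > max_dim:
        raise ValueError("action has more dimensions than the shared space")
    pad = max_dim - len(action)
    padded = list(action) + [0.0] * pad
    mask = [1] * len(action) + [0] * pad
    return padded, mask


# Hypothetical demonstrations from two different robots.
dataset = {
    "six_dof_arm": [[0.1, -0.2, 0.3, 0.0, 0.5, 0.1]],
    "mobile_gripper": [[0.4, 0.0, 1.0]],
}

# One uniform batch: (robot name, padded action, mask).
batch = [
    (name, *pad_action(action))
    for name, actions in dataset.items()
    for action in actions
]
```

The mask lets a training loss ignore the padded dimensions, so each robot only contributes error on the joints it actually has.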

Generalization

The frontier of robotics is now converging on end-to-end transformer models trained with internet-scale pretraining and manually collected tele-operation datasets
Generalization Capabilities
1. Object Recognition - VLAs can recognize a wide array of objects
2. Environments - VLAs can operate in a diverse set of environments
3. Reasoning - LLMs provide sufficient problem solving abilities for most real-world tasks
4. Hardware - Cross-embodiment results from pi are a promising indicator that we may be able to create foundation models that can operate across hardware
5. Manipulation - Still a long way off from generalized manipulation capabilities

Future

Fully autonomous and general purpose robots are now a deep learning problem
Use the author's 7 constraints of deep learning progress as a lens to assess what is holding back progress in robotics

Simulation

Training in simulation is the most viable path to generating internet-scale data to train foundation models on
Achieving sufficient complexity in simulation would require constructing simulated worlds that approach the complexity of the real world
- we need better world models to make the simulated training data useful

Wrap up

Author predicts that Tesla will win the humanoid arms race because of its track record on highly technical projects and its deep pockets to fund the initial research.

My thoughts

Overall, good writeup. I have no prior experience in robotics (coming at this from an interest in world models and experience with LLMs), so I learned quite a bit about the history and terminology of robotics.
I don't feel the paper presents enough evidence to jump to the conclusion that "Tesla will win", but to be fair, this is just the author's personal opinion.
Claiming that reasoning is essentially solved for most real-world problems is a pretty strong claim to make. There are a number of tasks on which LLMs simply don't perform well, such as anything requiring coherence over a long time horizon or the ability to learn at inference time.