Robotics - A deep dive on the history of robotics and the future of humanoids
original: https://github.com/adam-maj/robotics?tab=readme-ov-file&__readwiseLocation=
What are state-of-the-art robots currently capable of?
Fundamentals of Robotics
→ Building systems that can alter the physical world to accomplish arbitrary goals
ideas → action
Robotic systems need to:
- Observe & understand state of their environment
- Plan the actions they need to accomplish their goals
- Know how to execute actions with their hardware
3 essential functions of robotic systems:
- Perception
- Planning ← actually the easiest / most solved
- Control
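The three functions above can be sketched as a single loop. This is a toy illustration only, assuming a robot on a 1-D line with noise-free sensing and perfect actuation; `State`, `perceive`, `plan`, `control` and `run` are hypothetical names, not taken from the original article.

```python
from dataclasses import dataclass


@dataclass
class State:
    position: float  # the robot's estimated position on a 1-D line


def perceive(true_position: float) -> State:
    """Perception: turn raw sensing into an internal state estimate (noise-free here)."""
    return State(position=true_position)


def plan(state: State, goal: float) -> float:
    """Planning: choose the next waypoint, at most 1.0 unit closer to the goal."""
    delta = goal - state.position
    step = max(-1.0, min(1.0, delta))
    return state.position + step


def control(state: State, waypoint: float) -> float:
    """Control: convert the waypoint into a motor command (a displacement here)."""
    return waypoint - state.position


def run(start: float, goal: float, max_steps: int = 20) -> float:
    """Repeat perceive -> plan -> control until the goal is reached."""
    position = start
    for _ in range(max_steps):
        state = perceive(position)
        if abs(goal - state.position) < 1e-9:
            break
        waypoint = plan(state, goal)
        # Assumption: the toy "hardware" executes the command exactly.
        position += control(state, waypoint)
    return position


final_position = run(start=0.0, goal=3.5)
```

In a real system each stage is far harder than shown: perception is noisy, and the control step has to cope with actuator dynamics rather than teleporting the robot.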
General purpose robotics: fully autonomous, broadly capable & generally intelligent robotics systems
Hardware
→ Not the current bottleneck to progress
3 critical functions:
- Cameras, LIDAR, IMU & other sensors for perception
- Actuators that let the robot move at its joints (control)
- Compute: planning and execution
Hardware Constraints
- Degrees of freedom: freedom of movement
- Computational complexity: the hardware must not be overly complex to compute control for
- Strength-to-weight ratio: ability to lift heavy things without the robot itself weighing too much
- Safety
- Cost (cost & energy to mass produce)
Companies that maintain the same hardware over time will be able to take advantage of compounding effects:
- Deploying robots in the world
- Collecting diverse real world datasets
- Iterative cycles on training robots with the collected data
Hardware platforms must be sufficiently general to avoid needing to alter the hardware too much.
Hardware Capabilities of Modern Humanoid Systems
- High degree of freedom: hands
- Camera-only vision (cost optimization)
- AI compute
- Battery life
Software is the real bottleneck
Moravec's Paradox: → Planning is the easiest → Control effective → Eric Jang? Motor control is the hardest
Perception
- Structure of the environment
- Presence & location of objects in the environment
- Its own position & orientation → Internal representation of the environment
SLAM
Simultaneous localization and mapping
- Monocular SLAM → single camera, no LIDAR
- SLAM w/ Deep Learning → mostly algorithms, small contributions by DL
Breakthroughs
- Early SLAM
- Monocular SLAM
- SLAM with Deep Learning
Planning
- Path planning
- Task planning → fine-tuned VLMs → relatively solved
- Convert the robot's high-level goal into subtasks and eventually individual motor routines
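The goal → subtasks → motor routines decomposition can be illustrated with a lookup table. This is only a sketch: the hand-written dictionaries below stand in for a learned planner such as a fine-tuned VLM, and every task and routine name is made up for illustration.

```python
# Hypothetical hand-written tables standing in for a learned task planner.
SUBTASKS = {
    "make coffee": ["fetch mug", "fill mug"],
}

MOTOR_ROUTINES = {
    "fetch mug": ["move_to(cupboard)", "grasp(mug)", "move_to(counter)"],
    "fill mug": ["place(mug, machine)", "press(brew_button)"],
}


def decompose(goal: str) -> list[str]:
    """Expand a high-level goal into a flat, ordered list of motor routines."""
    routines = []
    for subtask in SUBTASKS[goal]:
        routines.extend(MOTOR_ROUTINES[subtask])
    return routines


routine_plan = decompose("make coffee")
```

A learned planner would generate the two levels instead of looking them up, but the output has the same shape: an ordered list of routines the control layer can execute.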
Control
Approaches:
- Classical Control: manual modeling of dynamics in the env
- Deep RL
- Robotic Transformers
Breakthrough #1: Classical Control
- physics-based models, usually involving directly modelling the forces on objects
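As one concrete instance of classical control, here is a minimal sketch of a PD (proportional-derivative) controller driving a point mass to a target position. The dynamics (force = mass × acceleration) are modelled by hand, which is the defining trait of this approach. The gains, mass and time step are arbitrary assumptions chosen for illustration, not values from the article.

```python
def simulate_pd(target: float, kp: float = 20.0, kd: float = 9.0,
                mass: float = 1.0, dt: float = 0.01, steps: int = 1000) -> float:
    """Drive a 1-D point mass from rest at x=0 toward `target` with a PD controller."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        # Control law: push toward the target, damp the current velocity.
        force = kp * (target - x) - kd * v
        # Hand-written physics model: Newton's second law.
        a = force / mass
        # Semi-implicit Euler integration.
        v += a * dt
        x += v * dt
    return x


final_x = simulate_pd(target=1.0)
```

The controller works only because the model of the system is known and simple. The manual modelling effort grows quickly with contacts, friction and many joints.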
Breakthrough #2: Deep RL
Breakthrough #3: Simulation
- training robotic control policies in simulation offers the advantage of parallelization and scale far exceeding what's possible in reality
- MuJoCo: an open-source simulator built specifically with attention to the concerns of robotics
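The parallelization point can be sketched with many independent rollouts of a toy simulator running side by side. This does not use MuJoCo; `rollout` is a hypothetical stand-in for "reset the environment, run the policy, step the physics". Threads are used only to keep the sketch in the standard library; in CPython they do not give real CPU parallelism for this kind of work, and real pipelines typically rely on separate processes or batched simulators.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def rollout(seed: int, steps: int = 100) -> float:
    """One episode of a toy 1-D simulator; returns the final position."""
    rng = random.Random(seed)  # per-environment RNG keeps each rollout reproducible
    x = 0.0
    for _ in range(steps):
        x += rng.uniform(-1.0, 1.0)  # placeholder for policy action + physics step
    return x


def parallel_rollouts(num_envs: int) -> list[float]:
    """Run `num_envs` independent episodes concurrently, one seed per environment."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(rollout, range(num_envs)))


results = parallel_rollouts(16)
```

The key property is that episodes are independent, so the number of simulated environments can be scaled with available compute, which is not possible with a fleet of physical robots.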
Breakthrough #4: End-to-end Learning
- Initially, deep-learning-based robotics systems trained vision and motor control separately, which restricted the flow of information between perception and control systems
- End-to-end visuomotor policies were introduced to train vision and motor control jointly with a single objective
- the models learn the optimal flow of information between systems
Breakthrough #5: Tele-operation & Imitation Learning
- Teleoperation : Humans operating real world robots
- Imitation Learning: Train the robots based on the demonstrations by tele-operation
- Data is a huge problem in robotics - it is difficult to get the "internet scale" data that has made modern LLMs so successful
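Imitation learning in its simplest form (behavior cloning) is supervised learning on (state, action) pairs recorded from demonstrations. The sketch below assumes a hypothetical tele-operator whose behavior happens to be linear, and fits a linear policy by ordinary least squares. Real systems use neural networks and image observations; the names here are illustrative only.

```python
def expert_action(state: float) -> float:
    """Hypothetical tele-operator: always pushes the state toward 2.0."""
    return 0.5 * (2.0 - state)


def fit_linear_policy(states: list[float], actions: list[float]) -> tuple[float, float]:
    """Fit action = w * state + b to the demonstrations by least squares."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


# "Tele-operation" data: states visited and the actions the human took in them.
demo_states = [i * 0.5 for i in range(-10, 11)]
demo_actions = [expert_action(s) for s in demo_states]

w, b = fit_linear_policy(demo_states, demo_actions)


def learned_policy(state: float) -> float:
    """The cloned policy: imitates the expert without knowing its rule."""
    return w * state + b
```

The quality of the cloned policy is bounded by the coverage of the demonstrations, which is why the data problem noted above matters so much.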
Breakthrough #6: Robotic Transformers
- GDM Robotics Transformer 1 (RT1) showed successful use of the Transformer architecture for robotics
- RT2 and SayCan demonstrated that multi-modal VLMs can be fine-tuned for robotic planning and control tasks
- RT2 introduces the VLA (Vision-Language-Action) model
- ACT (Action Chunking with Transformers): allows control policies to predict the next series of actions over multiple time steps, giving smoother and more coordinated actuator control
It's hard to overestimate how much value VLMs have brought to robotic planning and reasoning capabilities; this has been a major unlock on the path toward general-purpose robotics.
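The action-chunking idea mentioned for ACT can be sketched without any learning: instead of querying the policy at every time step, the policy returns a chunk of several future actions, which are executed before the policy is queried again. `predict_chunk` below is a hypothetical hand-written stand-in for a trained transformer. The sketch shows chunking only and is not the ACT architecture or training procedure.

```python
def predict_chunk(position: float, goal: float, chunk_size: int,
                  max_step: float = 0.1) -> list[float]:
    """Stand-in policy: predict the next `chunk_size` actions in one call."""
    actions = []
    predicted = position
    for _ in range(chunk_size):
        step = max(-max_step, min(max_step, goal - predicted))
        actions.append(step)
        predicted += step  # roll the prediction forward without re-observing
    return actions


def run_episode(goal: float, chunk_size: int, total_steps: int = 20) -> tuple[float, int]:
    """Execute chunks open-loop; return final position and number of policy calls."""
    position, policy_calls, t = 0.0, 0, 0
    while t < total_steps:
        chunk = predict_chunk(position, goal, chunk_size)
        policy_calls += 1
        for action in chunk[: total_steps - t]:
            position += action
            t += 1
    return position, policy_calls


chunked_position, chunked_calls = run_episode(goal=1.0, chunk_size=5)
single_position, single_calls = run_episode(goal=1.0, chunk_size=1)
```

Both runs reach the goal, but the chunked run queries the policy a quarter as often over the same 20 steps, and each chunk is planned as one coherent sequence rather than one isolated action at a time.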
Breakthrough #7: Cross-embodiment
- Physical Intelligence's pi0 introduces architectural and training improvements, training the model on data from many different robot hardware systems
- this is known as a cross-embodiment dataset
- then generalize to new hardware via fine-tuning
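One practical obstacle to a cross-embodiment dataset is that different robots have action vectors of different sizes. A simple way to pool them, shown here as an assumption rather than a description of how pi0 is actually implemented, is to zero-pad every action to the dimension of the largest robot. The robot names and values below are invented for illustration.

```python
def pad_action(action: list[float], target_dim: int) -> list[float]:
    """Zero-pad an action vector so robots with fewer joints share one action space."""
    if len(action) > target_dim:
        raise ValueError("action has more dimensions than the shared action space")
    return list(action) + [0.0] * (target_dim - len(action))


# Hypothetical demonstrations from two different embodiments.
dataset = {
    "single_arm": [[0.1, 0.2, 0.3]],
    "bimanual": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]],
}

# The shared action space is as wide as the largest robot's action vector.
max_dim = max(len(a) for actions in dataset.values() for a in actions)

# One unified list of (robot, padded_action) pairs that a single model can train on.
unified = [
    (robot, pad_action(action, max_dim))
    for robot, actions in dataset.items()
    for action in actions
]
```

With a shared action space, one model can be trained on all robots at once and later fine-tuned on a new robot's data.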

Generalization
- The frontier of robotics is now converging on end-to-end transformer models trained with internet-scale pretraining and manually collected tele-operation datasets
Generalization capabilities:
- Object recognition - VLAs can recognize a wide array of objects
- Environments - VLAs can operate in a diverse set of environments
- Reasoning - LLMs provide sufficient problem-solving abilities for most real-world tasks
- Hardware - Cross-embodiment results from Physical Intelligence are a promising indicator that we may be able to create foundation models that can operate across hardware
- Manipulation - Still a long way off from generalized manipulation capabilities
Future
- fully autonomous and general purpose robots are now a deep learning problem
- Use the author's 7 constraints on deep learning progress as a lens to assess what is holding back progress in robotics
Simulation
- Training in simulation is the most viable path to generating internet scale data to train foundation models on
- Achieving sufficient complexity in simulation would require constructing simulated worlds that approach the complexity of the real world
- we need better world models to make the simulated training data useful
Wrap up
- Author predicts that Tesla will win the humanoid arms race because of its track record on highly technical projects and its deep pockets to fund the initial research
My thoughts
- Overall, good writeup. I have no prior experience in robotics (coming at this from an interest in world models and experience with LLMs) so I learned quite a bit on the history and terminology used in robotics
- I don't feel the writeup presents enough evidence to jump to the conclusion that "Tesla will win", but to be fair, this is just the author's personal opinion
- Claiming that reasoning is essentially solved for most real-world problems is a pretty strong claim. There are a number of tasks on which LLMs simply don't perform well, such as anything requiring coherence over a long time horizon or the ability to learn at inference time.