notes
robotics · pi · VLA · VLM

Robotics - A deep dive on the history of robotics and the future of humanoids

original: https://github.com/adam-maj/robotics?tab=readme-ov-file&__readwiseLocation=

What is state-of-the-art robotics currently capable of?

Fundamentals of Robotics

→ Building systems that can alter the physical world to accomplish arbitrary goals

ideas → action

Robotic systems need to:

  1. Observe & understand state of their environment
  2. Plan the actions they need to accomplish their goals
  3. Know how to execute actions with their hardware

3 essential functions of robotic systems:

  1. Perception
  2. Planning ← actually the easiest / most solved
  3. Control
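
The three functions above can be sketched as a single loop. This is a minimal illustration on a 1D track, not something from the original writeup: the `State` class, the function names and the 1.0-unit step limit are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class State:
    position: float  # robot position along a 1D track (illustrative)


def perceive(true_position: float) -> State:
    """Perception: turn a raw sensor reading into a state estimate."""
    return State(position=true_position)


def plan(state: State, goal: float) -> float:
    """Planning: pick the next waypoint, at most 1.0 unit closer to the goal."""
    delta = goal - state.position
    step = max(-1.0, min(1.0, delta))
    return state.position + step


def control(state: State, waypoint: float) -> float:
    """Control: convert the waypoint into an actuator command (a displacement)."""
    return waypoint - state.position


def run(start: float, goal: float, max_steps: int = 100) -> float:
    """Repeat perceive -> plan -> control until the goal is reached."""
    position = start
    for _ in range(max_steps):
        state = perceive(position)
        if abs(goal - state.position) < 1e-9:
            break
        waypoint = plan(state, goal)
        position += control(state, waypoint)  # the "world" applies the command
    return position
```

In a real system each stage is far harder than shown here (perception is noisy, control has to deal with dynamics), but the data flow between the stages has this shape.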

General purpose robotics: fully autonomous, broadly capable & generally intelligent robotics systems

Hardware

→ Not the current bottleneck to progress

3 critical functions:

  1. Cameras, LIDAR, IMU & other sensors for perception
  2. Actuators that let the robot move at its joints (control)
  3. Compute: planning and execution

Hardware Constraints

  1. Degrees of freedom: freedom of movement
  2. Computational complexity: the system must not be overly complex
  3. Weight ratio: able to lift heavy things without the robot itself weighing too much
  4. Safety
  5. Cost (cost & energy to mass produce)

Companies that maintain the same hardware over time will be able to take advantage of compounding effects:

  1. Deploying robots in the world
  2. Collecting diverse real world datasets
  3. Iterative cycles on training robots with the collected data

Hardware platforms must be sufficiently general to avoid needing to alter the hardware too much.

Hardware Capabilities of Modern Humanoid Systems

  • High degree of freedom: hands
  • Camera-only vision (cost optimization)
  • AI compute
  • Battery life

Software is the real bottleneck

Moravec's Paradox: → Planning is the easiest → Control effective → Eric Jang? Motor control is the hardest

Perception

  1. Structure of the environment
  2. Presence & location of objects in the environment
  3. Its own position & orientation → Internal representation of the environment

SLAM

Simultaneous localization and mapping
Monocular SLAM → single camera, no LIDAR
SLAM w/ Deep Learning → mostly algorithms, small contributions by DL.

Breakthroughs

  1. Early SLAM
  2. Monocular SLAM
  3. SLAM with Deep Learning

Planning

Path planning
Task planning → fine-tuned VLMs → relatively solved.

  • Convert the high level goal of the robot into sub tasks and eventually individual motor routines
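
A rough sketch of that decomposition, assuming a lookup table in place of the fine-tuned VLM. The `SUBTASKS` table, the goal names and the routine names such as `grasp(mug)` are all hypothetical and only illustrate the goal → sub-task → motor routine hierarchy.

```python
# Hypothetical lookup table standing in for a fine-tuned VLM planner.
SUBTASKS = {
    "make coffee": ["fetch mug", "brew"],
    "fetch mug": ["move_to(cupboard)", "grasp(mug)", "move_to(counter)"],
    "brew": ["place(mug, machine)", "press(button)"],
}


def decompose(goal: str) -> list[str]:
    """Recursively expand a goal until only primitive routines remain."""
    if goal not in SUBTASKS:
        return [goal]  # not in the table, so treat it as a motor routine
    routines = []
    for sub in SUBTASKS[goal]:
        routines.extend(decompose(sub))
    return routines
```

Calling `decompose("make coffee")` flattens the hierarchy into an ordered list of five motor routines. A VLM-based planner would generate the sub-tasks from the scene and instruction instead of reading them from a table.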

Control

Approaches:

  1. Classical Control: manual modeling of dynamics in the env
  2. Deep RL
  3. Robotic Transformers

Breakthrough #1: Classical Control

  • physics-based models, usually involving directly modelling the forces on objects
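
As a small example of what "manual modeling" means, here is a sketch of a PD controller driving a 1D point mass, with the dynamics written out by hand. The PD controller is my choice of example, not something the writeup names, and the gains, mass and time step are arbitrary assumptions.

```python
def simulate_pd(target: float, kp: float = 20.0, kd: float = 8.0,
                mass: float = 1.0, dt: float = 0.01, steps: int = 1000) -> float:
    """Drive a 1D point mass from rest at 0.0 toward `target` with a PD controller."""
    position, velocity = 0.0, 0.0
    for _ in range(steps):
        error = target - position
        force = kp * error - kd * velocity  # PD control law
        acceleration = force / mass         # Newton's second law: F = m * a
        velocity += acceleration * dt       # semi-implicit Euler integration
        position += velocity * dt
    return position
```

Everything here (the force law, the integrator, the gains) is specified by a person rather than learned, which is why this approach gets hard once contact, friction and many joints are involved.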

Breakthrough #2: Deep RL

Breakthrough #3: Simulation

  • training robotic control policies in simulation offers the advantage of parallelization and scale far exceeding what's possible in reality
  • MuJoCo -- an open source simulator built specifically with attention to the concerns of robotics

Breakthrough #4: End-to-end Learning

  • Initially, deep learning based robotics systems trained vision and motor control separately, which restricted the flow of information between perception and control systems
  • End-to-end visuomotor policies have been introduced to enable jointly trained vision and motor control systems with a single objective
    • the models learn the optimal flow of information between systems

Breakthrough #5: Tele-operation & Imitation Learning

  • Teleoperation : Humans operating real world robots
  • Imitation Learning: Train the robots based on the demonstrations by tele-operation
  • Data is a huge problem in robotics - it is difficult to get the "internet scale" data that has made modern LLMs so successful
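
The simplest form of imitation learning is behavior cloning: fit a policy to (observation, action) pairs recorded during tele-operation. A minimal sketch, assuming a 1D linear policy and a made-up demonstration log; real systems use neural networks and camera images, but the supervised-learning idea is the same.

```python
def fit_linear_policy(observations, actions):
    """Least-squares fit of action = w * observation + b to demonstrations."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b


# Hypothetical tele-operation log: the operator steers toward position 2.0
# using action = 0.5 * (2.0 - observation).
demo_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
demo_act = [0.5 * (2.0 - o) for o in demo_obs]

w, b = fit_linear_policy(demo_obs, demo_act)


def policy(observation):
    """Cloned policy: imitates the demonstrator on new observations."""
    return w * observation + b
```

The cloned policy can only be as good as its demonstrations, which is why the amount and diversity of tele-operation data matters so much.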

Breakthrough #6: Robotic Transformers

  • GDM Robotics Transformer 1 (RT1) showed successful use of the Transformer architecture for robotics
  • RT2 and SayCan demonstrated that multi-modal VLMs can be fine-tuned for robotic planning and control tasks
    • RT2 introduces the VLA (Vision-Language-Action)
  • ACT (Action Chunking with Transformers): allows control policies to predict the next series of actions over multiple time steps, giving smoother and more coordinated actuator control
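
The chunking idea on its own can be sketched without a transformer: the policy is queried once and returns several actions, which are all executed before the policy is queried again. The `chunked_policy` stand-in below (move half the remaining distance in equal steps) is a made-up placeholder for a learned model.

```python
def chunked_policy(position: float, goal: float, chunk_size: int) -> list[float]:
    """Stand-in for a learned model: predict `chunk_size` actions at once.

    The chunk covers half the remaining distance to the goal in equal steps.
    """
    step = 0.5 * (goal - position) / chunk_size
    return [step] * chunk_size


def rollout(start: float, goal: float, chunk_size: int, num_queries: int) -> list[float]:
    """Query the policy `num_queries` times, executing each whole chunk open-loop."""
    position = start
    trajectory = [position]
    for _ in range(num_queries):
        chunk = chunked_policy(position, goal, chunk_size)
        for action in chunk:  # no re-planning inside a chunk
            position += action
            trajectory.append(position)
    return trajectory
```

With `rollout(0.0, 8.0, chunk_size=4, num_queries=2)` the robot takes 8 actions from only 2 policy queries, and the steps within each chunk are consistent with one another.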

VLMs

It's hard to overestimate how much value VLMs have brought to robotic planning and reasoning capabilities; this has been a major unlock on the path toward general-purpose robotics.

Breakthrough #7: Cross-embodiment

  • Physical Intelligence pi0 introduces architectural and training improvements by training their model on data from many different robotics hardware systems
    • this is known as a cross-embodiment dataset
  • then generalize to new hardware via fine-tuning
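
One practical issue with a cross-embodiment dataset is that different robots have action vectors of different sizes. The writeup does not say how this is handled, so the following is only an assumed sketch of one common approach: zero-pad every action to a shared width and keep a mask of which entries are real. `MAX_ACTION_DIM` and the robot names are hypothetical.

```python
MAX_ACTION_DIM = 8  # assumed width of the largest robot's action vector


def pad_action(action: list[float]) -> tuple[list[float], list[int]]:
    """Zero-pad an action to MAX_ACTION_DIM and return (padded, mask)."""
    pad = MAX_ACTION_DIM - len(action)
    if pad < 0:
        raise ValueError("action is wider than MAX_ACTION_DIM")
    padded = action + [0.0] * pad
    mask = [1] * len(action) + [0] * pad  # 1 = real dimension, 0 = padding
    return padded, mask


# Hypothetical samples from two different robots.
dataset = [
    {"robot": "single_arm", "action": [0.1] * 7},
    {"robot": "mobile_base", "action": [0.5, -0.2]},
]

batch = [pad_action(sample["action"]) for sample in dataset]
```

After padding, samples from both robots have the same shape and can go into one training batch, with the mask available to ignore padded dimensions in the loss.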

Generalization

  • The frontier of robotics is now converging on end-to-end transformer models trained with internet scale pretraining and manually collected tele-operation datasets
  • Generalization Capabilities
    1. Object Recognition - VLAs can recognize a wide array of objects
    2. Environments - VLAs can operate in a diverse set of environments
    3. Reasoning - LLMs provide sufficient problem solving abilities for most real-world tasks
    4. Hardware - Cross-embodiment results from pi are a promising indicator that we may be able to create foundation models that can operate across hardware
    5. Manipulation - Still a long way off from generalized manipulation capabilities

Future

  • fully autonomous and general purpose robots are now a deep learning problem
  • Utilize the author's 7 constraints of deep learning progress as a lens to assess what is holding us back from progress in robotics

Simulation

  • Training in simulation is the most viable path to generating internet scale data to train foundation models on
  • Achieving sufficient complexity in simulation would require constructing simulated worlds that approach the complexity of the real world
    • we need better world models to make the simulated training data useful

Wrap up

  • Author predicts that Tesla will win the humanoid arms race because of its track record at highly technical projects and its deep pockets to fund the initial research.

My thoughts

  • Overall, good writeup. I have no prior experience in robotics (coming at this from an interest in world models and experience with LLMs), so I learned quite a bit about the history and terminology used in robotics.
  • Don't feel like the paper presents enough evidence to jump to the conclusion that "Tesla will win", but to be fair, this is just the author's personal opinion.
  • Claiming that Reasoning is essentially solved for most real world problems is a pretty strong claim to make. There are a number of tasks for which LLMs simply don't perform well, such as anything requiring coherence over a long time horizon and the ability to learn at inference time.