Physical AI System Architecture

Purpose

This chapter presents the layered architecture of Physical AI systems, explaining how perception, planning, control, and hardware components integrate into cohesive autonomous platforms.

Architectural Overview

Physical AI systems follow a hierarchical architecture with distinct functional layers:

┌─────────────────────────────────────────────────┐
│          Application Layer (Task Logic)         │
│        (Pick object, Navigate to goal)          │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│      Deliberative Layer (High-Level Planning)   │
│    (Path planning, Task planning, Learning)     │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│    Executive Layer (Coordination & Monitoring)  │
│     (Behavior arbitration, Resource mgmt)       │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│     Reactive Layer (Low-Level Control)          │
│   (Motor control, Reflexes, Safety checks)      │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│        Hardware Layer (Sensors & Actuators)     │
│      (Motors, Cameras, LiDAR, Encoders)         │
└─────────────────────────────────────────────────┘

This architecture separates concerns by time scale and abstraction level.

Functional Layers

1. Hardware Layer

Function: Interface between digital computation and the physical world.

Components:

Sensors: Convert physical phenomena → digital signals
Actuators: Convert digital commands → physical motion
Power: Batteries, voltage regulators, power distribution
Compute: Embedded controllers, GPUs, CPUs

Characteristics:

Time Scale: Microseconds to milliseconds
Update Rate: 1-100 kHz (motor control)
Determinism: Hard real-time requirements

Example: Motor driver receives torque command (digital), drives current through motor coils (analog), produces joint motion (mechanical).

2. Reactive Layer (Control)

Function: Real-time closed-loop control for stability and safety.

Components:

Motor Controllers: PID, torque control, velocity control
Balance Controllers: Stabilization for bipedal/aerial robots
Reflex Behaviors: Emergency stops, collision avoidance
Sensor Processing: Filtering, calibration, low-level fusion

Characteristics:

Time Scale: 1-10 milliseconds
Update Rate: 100-1000 Hz
Paradigm: Reactive (stimulus → response)

Example: Humanoid detects loss of balance (IMU) → immediately adjusts ankle torque to prevent fall (no planning involved).

Algorithms:

PID control for joint position/velocity
Impedance control for compliant interaction
Admittance control for force tracking
Virtual model control for balance
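The first algorithm in the list, PID control, can be sketched in a few lines. This is a minimal illustration, not a production controller: the gains, the saturation limit, and the toy plant (a joint whose velocity equals the command minus a constant load) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Discrete PID controller for a single joint (illustrative gains)."""
    kp: float
    ki: float
    kd: float
    out_limit: float          # symmetric saturation on the command
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        derivative = (error - self.prev_error) / dt
        candidate = self.integral + error * dt
        raw = self.kp * error + self.ki * candidate + self.kd * derivative
        out = max(-self.out_limit, min(self.out_limit, raw))
        # Simple anti-windup: keep the new integral only when not saturated.
        if out == raw:
            self.integral = candidate
        self.prev_error = error
        return out


def simulate(steps: int = 3000, dt: float = 0.001) -> float:
    """Run the loop at 1 kHz against a toy plant and return the final position."""
    pid = PID(kp=20.0, ki=100.0, kd=0.0, out_limit=10.0)
    position = 0.0
    load = 2.0  # constant disturbance (hypothetical, e.g. a gravity load)
    for _ in range(steps):
        command = pid.update(1.0, position, dt)
        position += (command - load) * dt
    return position
```

With a proportional term alone, the constant load would leave a steady-state error; the integral term removes it, which is why the simulated joint settles at the 1.0 rad setpoint.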

3. Executive Layer (Coordination)

Function: Coordinate multiple behaviors, manage resources, monitor execution.

Components:

Behavior Arbitration: Select which behavior to execute (priority-based, auction-based)
Resource Allocation: Assign sensors/actuators to tasks
State Machine: Manage transitions between modes (idle → walking → grasping)
Health Monitoring: Detect faults, trigger recovery

Characteristics:

Time Scale: 10-100 milliseconds
Update Rate: 10-100 Hz
Paradigm: Hybrid (reactive + deliberative)

Example: Robot executing "pick and place" switches between behaviors: approach object → align gripper → close gripper → lift → transport → place → open gripper.

Design Patterns:

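Subsumption Architecture: Layered behaviors with fixed priorities. In the safety-first variant used in this chapter, low-level reflexes override higher layers; in Brooks's original formulation, higher layers subsume (suppress or inhibit) lower ones.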
Hierarchical State Machines: Nested states for complex task decomposition
Behavior Trees: Modular, composable task representation
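The pick-and-place sequence from the example above maps naturally onto a flat state machine. The sketch below is one possible encoding; the mode names, the event strings ("start", "done", "fault", "reset"), and the single FAULT state are assumptions for illustration, and a real executive would usually nest these states hierarchically.

```python
from enum import Enum, auto


class Mode(Enum):
    IDLE = auto()
    APPROACH = auto()
    ALIGN = auto()
    GRASP = auto()
    LIFT = auto()
    TRANSPORT = auto()
    PLACE = auto()
    RELEASE = auto()
    FAULT = auto()


# Nominal successor of each active mode when its behavior reports "done".
NEXT = {
    Mode.APPROACH: Mode.ALIGN,
    Mode.ALIGN: Mode.GRASP,
    Mode.GRASP: Mode.LIFT,
    Mode.LIFT: Mode.TRANSPORT,
    Mode.TRANSPORT: Mode.PLACE,
    Mode.PLACE: Mode.RELEASE,
    Mode.RELEASE: Mode.IDLE,
}


def step(mode: Mode, event: str) -> Mode:
    """Advance the executive by one event."""
    if event == "fault":
        return Mode.FAULT                      # health monitoring preempts everything
    if mode is Mode.FAULT:
        return Mode.IDLE if event == "reset" else Mode.FAULT
    if mode is Mode.IDLE:
        return Mode.APPROACH if event == "start" else Mode.IDLE
    if event == "done":
        return NEXT[mode]
    return mode                                # ignore unrelated events
```

Feeding "start" followed by seven "done" events walks the machine through the full cycle and back to IDLE; a "fault" at any point parks it in FAULT until "reset".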

4. Deliberative Layer (Planning)

Function: High-level reasoning, planning, learning.

Components:

Path Planning: Compute collision-free paths (RRT, A*, Dijkstra)
Motion Planning: Trajectory optimization for smooth motion
Task Planning: Sequence of actions to achieve goal (PDDL, HTN)
Perception: Object detection, segmentation, tracking
Localization & Mapping: SLAM, visual odometry

Characteristics:

Time Scale: 100 milliseconds to seconds
Update Rate: 1-10 Hz
Paradigm: Deliberative (search, optimization, prediction)

Example: Robot plans path from current position to goal, avoiding obstacles, optimizing for time or energy.

Algorithms:

Sampling-based motion planning (RRT, PRM)
Optimization-based planning (trajectory optimization, MPC)
Search-based planning (A*, D*)
Learning-based planning (neural planners, value iteration networks)
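As a concrete instance of search-based planning, here is a minimal A* sketch on a 4-connected occupancy grid. The grid encoding (0 = free, 1 = occupied), unit step cost, and Manhattan heuristic are assumptions for the example; real planners typically add robot footprint inflation and non-uniform costs.

```python
import heapq


def astar(grid, start, goal):
    """Return a shortest 4-connected path as a list of (row, col), or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]     # entries are (f = g + h, g, cell)
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue                        # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(
                        open_heap, (new_cost + h((nr, nc)), new_cost, (nr, nc))
                    )
    return None


GRID = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = astar(GRID, (0, 0), (2, 0))
```

In this grid the only gap in the wall is at cell (1, 2), so the returned path detours through it.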

5. Application Layer (Task Logic)

Function: High-level task specification and execution.

Components:

Task Interface: User commands, scripts, demonstrations
World Model: Semantic understanding of environment
Decision Making: Task sequencing, goal reasoning
Human-Robot Interaction: Speech, gestures, GUI

Characteristics:

Time Scale: Seconds to minutes
Update Rate: 0.1-1 Hz
Paradigm: Symbolic reasoning, language-based

Example: User says "Clean the table." System interprets command, identifies table location, plans sequence: navigate to table → detect objects → grasp object → move to bin → release → repeat.

Technologies:

Large language models (task understanding)
Vision-language models (grounding language in vision)
Symbolic planners (STRIPS, PDDL)
Dialogue systems (clarification, confirmation)

Data Flow Through Layers

Bottom-Up (Perception)

Sensors → Signal Processing → Feature Extraction → State Estimation → World Model

Example: Object Detection Pipeline:

Camera captures 1920×1080 RGB image (30 Hz)
Preprocessing resizes to 640×640, normalizes pixel values
Neural Network (YOLO, Faster R-CNN) detects objects → bounding boxes + class labels
3D Estimation fuses with depth sensor → object poses in world coordinates
Tracking associates detections across frames → object trajectories
World Model updates object database (positions, velocities, identities)

Latency: 30-100ms (camera lag + inference + processing)
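The 3D estimation step can be illustrated with the standard pinhole back-projection: a pixel plus its depth yields a point in the camera frame. The intrinsics used below (fx, fy, cx, cy) are made-up values for a 1920×1080 image, and the sketch stops at the camera frame; reaching world coordinates would additionally require the camera's extrinsic transform, which is not shown.

```python
def bbox_center(box):
    """Center pixel (u, v) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def pixel_to_camera(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth (meters) into the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics and detection, for illustration only.
FX, FY, CX, CY = 1000.0, 1000.0, 960.0, 540.0
u, v = bbox_center((1410, 490, 1510, 590))        # box centered at (1460, 540)
point = pixel_to_camera(u, v, 2.0, FX, FY, CX, CY)
```

A detection 500 pixels to the right of the principal point at 2 m depth lands 1 m to the right of the optical axis under these intrinsics.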

Top-Down (Action)

Task Goal → Plan → Behavior Selection → Control Commands → Actuator Signals

Example: Pick-and-Place Execution:

Task: "Pick up red cup"
Planning: Compute arm trajectory from current pose to pre-grasp pose
Behavior: Select "approach" behavior
Control: Generate joint velocity commands (inverse kinematics + motion profile)
Actuation: Motor drivers execute velocity commands → arm moves

Update Rate: 1 Hz (planning) → 10 Hz (behavior) → 100 Hz (control) → 1 kHz (actuation)
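One simple way to picture these nested rates is a single base tick with per-layer dividers. The sketch below only counts how often each layer would run; it assumes a single-threaded 1 kHz loop, whereas real systems usually place the layers on separate processors or threads.

```python
BASE_HZ = 1000
RATES_HZ = {"actuation": 1000, "control": 100, "behavior": 10, "planning": 1}


def run(duration_s: float) -> dict:
    """Count how many times each layer fires over the given duration."""
    counts = {name: 0 for name in RATES_HZ}
    for tick in range(round(duration_s * BASE_HZ)):
        for name, rate in RATES_HZ.items():
            # A layer at `rate` Hz fires every BASE_HZ // rate base ticks.
            if tick % (BASE_HZ // rate) == 0:
                counts[name] += 1
    return counts


counts = run(2.0)
```

Over two seconds the planner runs twice while the actuation loop runs two thousand times, which is why a stale plan must remain safe to execute for many control cycles.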

Communication Architecture

Centralized vs. Distributed

Centralized (Single Computer):

All computation on one powerful machine
Simplifies software architecture
Single point of failure
Limited I/O bandwidth

Distributed (Multiple Processors):

Sensors/actuators connected to dedicated microcontrollers
High-level planning on powerful CPU/GPU
Low-level control on real-time processors (FPGA, microcontrollers)
Robust to partial failures

Hybrid (Common in Modern Robots):

Central computer for perception, planning, learning
Distributed microcontrollers for motor control, sensor acquisition
Communication via field buses (CAN, EtherCAT) for low-level devices and middleware (ROS 2) between software components

Middleware Frameworks

ROS 2 (Robot Operating System 2)

Function: Standardized communication framework for robot components.

Key Features:

Publish-Subscribe: Nodes publish data to topics, others subscribe
Services: Request-response communication
Actions: Long-running tasks with feedback
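Real-Time Support: DDS (Data Distribution Service) backend with configurable quality-of-service (QoS) policies; timing is not deterministic by default and depends on the OS, executor, and DDS configuration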

Example Architecture:

Camera Node (30 Hz) → /camera/image_raw → Object Detector Node
                                          ↓
                                       /detected_objects → Task Planner
                                                           ↓
                                                        /arm/trajectory → Arm Controller
                                                                          ↓
                                                                       Motors
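The topic graph above can be mimicked with a tiny in-process publish-subscribe bus. To be clear, this is not the ROS 2 API: the Bus class, the message dictionaries, and the stub detector and planner are all hypothetical, and the sketch only shows how topics decouple the nodes from one another.

```python
from collections import defaultdict


class Bus:
    """Minimal synchronous publish-subscribe bus (illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in list(self._subscribers.get(topic, [])):
            callback(message)


def build_pipeline(bus, arm_commands):
    """Wire a stub detector, planner, and arm controller through topics."""

    def on_image(image):
        # Stub detector: pretends the image always contains one cup.
        bus.publish("/detected_objects", [{"label": "cup", "pixel": image["cup_pixel"]}])

    def on_objects(objects):
        # Stub planner: targets the first detected object.
        if objects:
            bus.publish("/arm/trajectory", {"target": objects[0]["label"]})

    bus.subscribe("/camera/image_raw", on_image)
    bus.subscribe("/detected_objects", on_objects)
    bus.subscribe("/arm/trajectory", arm_commands.append)   # stand-in arm controller


bus = Bus()
arm_commands = []
build_pipeline(bus, arm_commands)
bus.publish("/camera/image_raw", {"cup_pixel": (320, 240)})
```

Because each node knows only topic names, the stub detector could be replaced by a real one without touching the planner or the arm controller.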

Advantages:

Modularity: Swap components easily
Language agnostic: official C++ and Python client libraries, community bindings for Rust and others
Rich ecosystem: Drivers, algorithms, visualization (RViz)

Disadvantages:

Overhead (not suitable for hard real-time control loops)
Complexity (learning curve)

Compute Architecture

Processing Units

CPU (Central Processing Unit):

Use: High-level planning, logic, coordination
Example: Intel i7, AMD Ryzen, ARM Cortex-A

GPU (Graphics Processing Unit):

Use: Deep learning inference, parallel sensor processing
Example: NVIDIA Jetson (embedded), RTX 4090 (desktop)

FPGA (Field-Programmable Gate Array):

Use: Ultra-low latency sensor processing, custom control loops
Example: Xilinx Zynq, Intel Altera

Microcontroller (MCU):

Use: Motor control, sensor acquisition, real-time tasks
Example: STM32, Arduino, Teensy

TPU (Tensor Processing Unit):

Use: Accelerated neural network inference
Example: Google Edge TPU (used in Coral devices)

Typical Humanoid Compute Stack

Example: Tesla Optimus (approximate, based on public descriptions):

Main Computer: Custom SoC (System-on-Chip) with CPU + GPU
- Planning, perception (vision models)
Motor Controllers: Distributed microcontrollers (one per joint)
- Real-time torque control
Sensor Hubs: Dedicated processors for camera, IMU fusion
Communication: Custom field bus (low latency)

Power Budget: Roughly 100-500W total, depending on activity. The split between compute and actuation varies by platform; actuators usually dominate during locomotion.

Perception Architecture

Sensor Fusion Pipeline

Goal: Combine multiple sensors for robust state estimation.

Example: Mobile Robot Localization:

Sensors:

Wheel Encoders: Odometry (dead reckoning)
IMU: Orientation, acceleration
LiDAR: 2D/3D point cloud
Camera: Visual features

Fusion Algorithm (Extended Kalman Filter):

Prediction: Use odometry to predict position
Update: Correct prediction using LiDAR scan matching
Refinement: Incorporate IMU for orientation stability
Validation: Visual odometry detects wheel slip

Output: Robot pose (x, y, θ), for example at 50 Hz with roughly 5cm accuracy (actual figures depend on the sensors and environment).

Alternative: Factor graph optimization (GTSAM library), particle filter (Monte Carlo localization).
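The predict/update structure of the fusion algorithm is easiest to see in one dimension. The sketch below is a plain linear Kalman filter for a robot moving along a corridor, not the full Extended Kalman Filter described above: odometry drives the prediction and a LiDAR-derived position drives the correction. The noise variances are assumed values.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    x: float   # position estimate (m)
    p: float   # variance of the estimate (m^2)
    q: float   # process noise added per odometry step (m^2), assumed
    r: float   # LiDAR measurement noise variance (m^2), assumed

    def predict(self, odom_delta: float) -> None:
        """Dead-reckoning step: shift by odometry, grow the uncertainty."""
        self.x += odom_delta
        self.p += self.q

    def update(self, lidar_pos: float) -> None:
        """Correction step: blend in the measurement using the Kalman gain."""
        gain = self.p / (self.p + self.r)
        self.x += gain * (lidar_pos - self.x)
        self.p *= 1.0 - gain


kf = Kalman1D(x=0.0, p=1.0, q=0.01, r=0.04)
kf.predict(1.0)    # wheel encoders report 1.0 m of travel
kf.update(1.2)     # scan matching reports a position of 1.2 m
```

After the update the estimate lies between the odometry prediction and the LiDAR reading, and its variance is smaller than either source's alone. An EKF follows the same two steps but linearizes nonlinear motion and measurement models at each iteration.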

Perception Modules

Object Detection:

Input: RGB image
Output: Bounding boxes, class labels, confidence scores
Model: YOLO, Faster R-CNN, DETR
Latency: 10-50ms

Depth Estimation:

Input: Stereo images or monocular image
Output: Depth map
Model: Stereo matching, monocular depth network (MiDaS)
Latency: 20-100ms

Semantic Segmentation:

Input: RGB image
Output: Pixel-wise class labels
Model: DeepLabV3 (semantic segmentation), Mask R-CNN (instance segmentation)
Latency: 50-200ms

Pose Estimation:

Input: RGB image + object model
Output: 6D pose (position + orientation)
Model: PoseCNN, DenseFusion
Latency: 30-100ms

Control Architecture

Layered Control Hierarchy

High-Level Control (1-10 Hz):

Input: Desired end-effector pose or velocity
Output: Joint position/velocity commands
Algorithm: Inverse kinematics, trajectory generation

Mid-Level Control (10-100 Hz):

Input: Joint position/velocity commands
Output: Joint torque commands
Algorithm: Impedance control, admittance control

Low-Level Control (100-1000 Hz):

Input: Joint torque commands
Output: Motor current commands
Algorithm: Torque control (current control loop)

Hardware Control (1-10 kHz):

Input: Motor current commands
Output: PWM signals to motor driver
Implementation: Microcontroller firmware
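The hand-off between the mid and low levels can be sketched as two small functions: a joint-space impedance law that turns position and velocity commands into a torque, and a conversion from torque to motor current. The stiffness, damping, torque constant, and gear ratio are illustrative values, and the current conversion assumes an ideal motor and lossless gearbox.

```python
def impedance_torque(q_des, q, qd_des, qd, stiffness, damping, tau_max):
    """Joint-space impedance law: a virtual spring-damper around the command."""
    tau = stiffness * (q_des - q) + damping * (qd_des - qd)
    return max(-tau_max, min(tau_max, tau))       # respect the joint torque limit


def torque_to_current(tau, torque_constant, gear_ratio):
    """Ideal-motor approximation: joint torque = k_t * gear_ratio * current."""
    return tau / (torque_constant * gear_ratio)


# Mid level (10-100 Hz): position/velocity command -> joint torque.
tau = impedance_torque(q_des=1.0, q=0.9, qd_des=0.0, qd=0.5,
                       stiffness=100.0, damping=2.0, tau_max=50.0)
# Low level (100-1000 Hz): joint torque -> motor current command.
current = torque_to_current(tau, torque_constant=0.1, gear_ratio=50.0)
```

Lower stiffness makes the joint more compliant on contact at the cost of tracking accuracy, which is the basic trade-off behind choosing impedance gains.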

Whole-Body Control (Humanoid Specific)

Challenge: Coordinate 30+ degrees of freedom while maintaining balance.

Approach: Quadratic Programming (QP) optimization

Formulation:

minimize: || q̈ - q̈_desired ||²  (track desired accelerations)
subject to:
  - Contact forces within friction cone
  - Joint torque limits
  - Zero Moment Point (ZMP) inside support polygon

Output: Joint accelerations for all degrees of freedom simultaneously, typically converted to joint torques via inverse dynamics.

Update Rate: 100-1000 Hz.

Libraries: Drake (MIT), Pinocchio, RBDL.
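Written out more explicitly, the formulation above might look like the following. This is one common way to pose it, not necessarily the form any particular library uses. The rigid-body dynamics equality constraint and the symbols M (mass matrix), h (Coriolis and gravity terms), S (actuation selection matrix), J_c (contact Jacobian), f (contact forces), and τ (joint torques) are assumptions added here; the text lists only the objective and the three inequality constraints.

```latex
\begin{aligned}
\min_{\ddot{q},\,\tau,\,f} \quad
  & \lVert \ddot{q} - \ddot{q}_{\mathrm{desired}} \rVert^{2} \\
\text{subject to} \quad
  & M(q)\,\ddot{q} + h(q,\dot{q}) = S^{\top}\tau + J_c(q)^{\top} f
    && \text{(dynamics, assumed)} \\
  & f \in \mathcal{K}_{\mu}
    && \text{(friction cone)} \\
  & \tau_{\min} \le \tau \le \tau_{\max}
    && \text{(joint torque limits)} \\
  & p_{\mathrm{ZMP}} \in \mathcal{P}_{\mathrm{support}}
    && \text{(ZMP inside support polygon)}
\end{aligned}
```

To keep the problem a QP, the friction cone is usually approximated by a linear pyramid so that every constraint is linear in the decision variables.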

Key Architecture Patterns

1. Sense-Plan-Act Cycle

Traditional robotics paradigm:

Sense: Gather all sensor data
Plan: Compute optimal action
Act: Execute action
Repeat

Limitation: Planning takes time (100ms-1s), world changes during planning.

Solution: Asynchronous planning (plan while executing previous plan).

2. Subsumption Architecture

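Principle: Layered behaviors with fixed priorities. In the safety-first variant shown here, lower layers can suppress higher layers; Brooks's original formulation has higher layers subsume lower ones.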

Example:

Layer 0 (highest priority): Collision avoidance (reflex)
Layer 1: Follow wall (reactive)
Layer 2: Explore (deliberative)

Behavior: If obstacle detected, Layer 0 suppresses Layers 1-2 and executes avoidance.

Advantage: Robust, real-time, graceful degradation.

Disadvantage: Difficult to design for complex tasks.
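The three-layer example above can be approximated with fixed-priority arbitration: each layer either proposes a command or abstains, and the first layer in priority order wins. This sketch follows the chapter's safety-first ordering rather than Brooks's original suppression and inhibition wiring; the sensor fields, thresholds, and velocity values are assumptions.

```python
def avoid_collision(sensors):
    """Layer 0: reflex. Stop and turn if something is closer than 0.5 m."""
    if sensors["front_range_m"] < 0.5:
        return {"linear": 0.0, "angular": 1.0}
    return None


def follow_wall(sensors):
    """Layer 1: reactive. Hold about 0.5 m from a wall when one is visible."""
    wall = sensors["wall_range_m"]
    if wall is not None:
        return {"linear": 0.3, "angular": 0.2 * (wall - 0.5)}
    return None


def explore(sensors):
    """Layer 2: default behavior. Always proposes driving forward."""
    return {"linear": 0.5, "angular": 0.0}


LAYERS = [avoid_collision, follow_wall, explore]   # highest priority first


def arbitrate(sensors):
    """Return (layer name, command) from the first layer that proposes one."""
    for layer in LAYERS:
        command = layer(sensors)
        if command is not None:
            return layer.__name__, command


winner, command = arbitrate({"front_range_m": 0.3, "wall_range_m": 0.5})
```

With an obstacle at 0.3 m the reflex layer wins even though a wall is visible; if a higher layer fails or abstains, the remaining layers still produce a command, which is the graceful degradation noted above.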

3. Hybrid Deliberative/Reactive

Principle: Combine planning (slow, optimal) with reactive control (fast, safe).

Implementation:

Deliberative layer: Plan path every 1 second
Reactive layer: Adjust trajectory every 10ms for obstacles

Example: Self-driving car plans route (GPS navigation) but reactively avoids pedestrians (no time to replan full route).

4. Model Predictive Control (MPC)

Principle: Optimize actions over future time horizon, execute first action, replan.

Algorithm:

Predict future states (0-2 seconds ahead)
Optimize control to minimize cost (collision, energy, time)
Execute first 100ms of plan
Repeat (receding horizon)

Advantage: Anticipatory, handles constraints.

Disadvantage: Computationally expensive (requires fast optimization).

Use Case: Drone trajectory tracking, humanoid walking.
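The receding-horizon loop can be shown with a deliberately tiny problem: a 1D point whose velocity is the control input, a short horizon, and brute-force search over a discrete action set standing in for a real optimizer. The model, cost weights, horizon length, and action set are all assumptions; practical MPC uses continuous optimization and a richer dynamics model.

```python
from itertools import product

ACTIONS = (-1.0, -0.5, 0.0, 0.5, 1.0)   # allowed velocities (m/s): an input constraint
DT = 0.1                                 # execute 100 ms of each plan
HORIZON = 3                              # look 3 steps (0.3 s) ahead


def rollout_cost(x, actions, target):
    """Predict future states under a candidate action sequence and sum the cost."""
    cost = 0.0
    for u in actions:
        x = x + u * DT                           # toy model: x' = u
        cost += (x - target) ** 2 + 0.01 * u ** 2   # tracking error + effort
    return cost


def mpc_step(x, target):
    """Optimize over the horizon, then return only the first action."""
    best = min(product(ACTIONS, repeat=HORIZON),
               key=lambda seq: rollout_cost(x, seq, target))
    return best[0]


def run(x0, target, steps):
    """Receding horizon: apply the first action, then replan from the new state."""
    x = x0
    for _ in range(steps):
        x += mpc_step(x, target) * DT
    return x


final_x = run(0.0, 1.0, 30)
```

Far from the target the controller picks the maximum allowed velocity, and it slows down as the target enters the prediction horizon, which is the anticipatory behavior listed as an advantage above.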

Practical Example: Warehouse AMR Architecture

Application: Autonomous mobile robot for warehouse logistics.

Hardware:

Compute: NVIDIA Jetson AGX Xavier (GPU for perception)
Sensors: 2D LiDAR, stereo camera, wheel encoders, IMU
Actuators: Differential drive motors
Power: 48V battery (8 hours runtime)

Software Stack:

Layer 1 (Hardware):

Motor drivers (CAN bus)
LiDAR driver (Ethernet)
Camera driver (USB)

Layer 2 (Reactive):

Velocity controller (100 Hz)
Emergency stop (detects obstacles within 0.5m)

Layer 3 (Executive):

State machine (idle → navigating → docking → charging)
Behavior arbitration (prioritize safety > efficiency)

Layer 4 (Deliberative):

SLAM (map building + localization, 10 Hz)
Path planning (A* on occupancy grid, replan every 1s)
Object detection (pallet recognition, 5 Hz)

Layer 5 (Application):

Task manager (receive pick/place orders from warehouse system)
Fleet coordination (communicate with other robots)

Communication: ROS 2 (internal), REST API (external fleet management).

Key Takeaways

Physical AI systems use hierarchical layered architecture separating hardware, reactive control, coordination, planning, and application logic.
Time scales vary by layer: hardware (µs-ms), reactive (1-10 ms), executive (10-100 ms), planning (100ms-s), application (s-min).
Data flows bottom-up (perception) and top-down (action) with sensor fusion at lower layers and symbolic reasoning at higher layers.
Communication architectures range from centralized (single computer) to distributed (multiple processors) with middleware like ROS 2 enabling modular design.
Compute architectures combine CPUs (logic), GPUs (learning/perception), microcontrollers (real-time control), and specialized accelerators (FPGAs, TPUs).
Perception pipelines integrate object detection, depth estimation, segmentation, and pose estimation, typically at roughly 5-30 Hz depending on model latency, for real-time scene understanding.
Control hierarchies span high-level trajectory generation (1-10 Hz), mid-level impedance control (10-100 Hz), low-level torque control (100-1000 Hz), and hardware-level current and PWM loops (1-10 kHz).
Key patterns include sense-plan-act, subsumption, hybrid deliberative/reactive, and model predictive control trading off optimality, reactivity, and robustness.

Next Chapter: Data flow in Physical AI systems—how information moves from sensors through perception, planning, control, and back to actuators.
