Physical AI System Architecture
Purpose
This chapter presents the layered architecture of Physical AI systems, explaining how perception, planning, control, and hardware components integrate into cohesive autonomous platforms.
Architectural Overview
Physical AI systems follow a hierarchical architecture with distinct functional layers:
┌─────────────────────────────────────────────────┐
│         Application Layer (Task Logic)          │
│         (Pick object, Navigate to goal)         │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│    Deliberative Layer (High-Level Planning)     │
│    (Path planning, Task planning, Learning)     │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│   Executive Layer (Coordination & Monitoring)   │
│      (Behavior arbitration, Resource mgmt)      │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│       Reactive Layer (Low-Level Control)        │
│    (Motor control, Reflexes, Safety checks)     │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│      Hardware Layer (Sensors & Actuators)       │
│       (Motors, Cameras, LiDAR, Encoders)        │
└─────────────────────────────────────────────────┘
This architecture separates concerns by time scale and abstraction level.
Functional Layers
1. Hardware Layer
Function: Interface between digital computation and the physical world.
Components:
- Sensors: Convert physical phenomena → digital signals
- Actuators: Convert digital commands → physical motion
- Power: Batteries, voltage regulators, power distribution
- Compute: Embedded controllers, GPUs, CPUs
Characteristics:
- Time Scale: Microseconds to milliseconds
- Update Rate: 1-100 kHz (PWM switching and motor current loops)
- Determinism: Hard real-time requirements
Example: Motor driver receives torque command (digital), drives current through motor coils (analog), produces joint motion (mechanical).
2. Reactive Layer (Control)
Function: Real-time closed-loop control for stability and safety.
Components:
- Motor Controllers: PID, torque control, velocity control
- Balance Controllers: Stabilization for bipedal/aerial robots
- Reflex Behaviors: Emergency stops, collision avoidance
- Sensor Processing: Filtering, calibration, low-level fusion
Characteristics:
- Time Scale: 1-10 milliseconds
- Update Rate: 100-1000 Hz
- Paradigm: Reactive (stimulus → response)
Example: Humanoid detects loss of balance (IMU) → immediately adjusts ankle torque to prevent fall (no planning involved).
Algorithms:
- PID control for joint position/velocity
- Impedance control for compliant interaction
- Admittance control for force-responsive motion on position-controlled robots (measured force → motion command)
- Virtual model control for balance
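As an illustration of the first algorithm in the list, here is a minimal sketch of a discrete PID loop. The gains, the output limit, and the toy plant (a joint whose velocity equals the command) are assumptions chosen for the example, not values from any particular robot.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Discrete PID controller with output clamping and simple anti-windup."""
    kp: float
    ki: float
    kd: float
    out_limit: float          # symmetric saturation, e.g. a maximum command
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        candidate = self.integral + error * dt
        u = self.kp * error + self.ki * candidate + self.kd * derivative
        clamped = max(-self.out_limit, min(self.out_limit, u))
        # Anti-windup: only accumulate the integral while the output is unsaturated.
        if u == clamped:
            self.integral = candidate
        return clamped


def simulate_joint(steps: int = 3000, dt: float = 0.001) -> float:
    """Drive a toy joint (d(angle)/dt = u) toward 1.0 rad with a 1 kHz loop."""
    pid = PID(kp=20.0, ki=5.0, kd=0.0, out_limit=5.0)
    angle = 0.0
    for _ in range(steps):
        u = pid.update(1.0, angle, dt)
        angle += u * dt
    return angle
```

Calling `simulate_joint()` runs three simulated seconds at the 1 kHz rate quoted for this layer. On real hardware the loop would be driven by a timer interrupt or a real-time thread rather than a Python `for` loop.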
3. Executive Layer (Coordination)
Function: Coordinate multiple behaviors, manage resources, monitor execution.
Components:
- Behavior Arbitration: Select which behavior to execute (priority-based, auction-based)
- Resource Allocation: Assign sensors/actuators to tasks
- State Machine: Manage transitions between modes (idle → walking → grasping)
- Health Monitoring: Detect faults, trigger recovery
Characteristics:
- Time Scale: 10-100 milliseconds
- Update Rate: 10-100 Hz
- Paradigm: Hybrid (reactive + deliberative)
Example: Robot executing "pick and place" switches between behaviors: approach object → align gripper → close gripper → lift → transport → place → open gripper.
Design Patterns:
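- Subsumption Architecture: Layered behaviors with fixed-priority suppression. In Brooks's original design, higher layers suppress lower-layer outputs; safety-oriented variants let low-level reflexes override higher layers (safety first)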
- Hierarchical State Machines: Nested states for complex task decomposition
- Behavior Trees: Modular, composable task representation
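To make the behavior-tree pattern concrete, the sketch below builds a tiny tree for part of the pick-and-place example. It is a simplified, memoryless variant (every tick restarts from the first child), and the node classes and the `build_pick_tree` helper are hypothetical names, not the API of any behavior-tree library.

```python
from typing import Callable, List

SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"


class Action:
    """Leaf node wrapping a callable that returns a status string."""
    def __init__(self, name: str, fn: Callable[[], str]):
        self.name, self.fn = name, fn

    def tick(self) -> str:
        return self.fn()


class Sequence:
    """Ticks children in order; stops at the first child that is not SUCCESS."""
    def __init__(self, children: List):
        self.children = children

    def tick(self) -> str:
        for child in self.children:
            status = child.tick()
            if status != SUCCESS:
                return status
        return SUCCESS


class Fallback:
    """Ticks children in order; stops at the first child that is not FAILURE."""
    def __init__(self, children: List):
        self.children = children

    def tick(self) -> str:
        for child in self.children:
            status = child.tick()
            if status != FAILURE:
                return status
        return FAILURE


def build_pick_tree(log: List[str], grasp_ok: bool) -> Fallback:
    """Pick sequence with a recovery branch; actions are stubs that only log."""
    def step(name: str, ok: bool = True) -> Action:
        def run() -> str:
            log.append(name)
            return SUCCESS if ok else FAILURE
        return Action(name, run)

    pick = Sequence([step("approach"), step("align_gripper"),
                     step("close_gripper", grasp_ok), step("lift")])
    return Fallback([pick, step("report_failure")])
```

The composability claim shows up in `build_pick_tree`: the recovery branch is added by wrapping the existing sequence in a `Fallback`, without touching the sequence itself.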
4. Deliberative Layer (Planning)
Function: High-level reasoning, planning, learning.
Components:
- Path Planning: Compute collision-free paths (RRT, A*, Dijkstra)
- Motion Planning: Trajectory optimization for smooth motion
- Task Planning: Sequence of actions to achieve goal (PDDL, HTN)
- Perception: Object detection, segmentation, tracking
- Localization & Mapping: SLAM, visual odometry
Characteristics:
- Time Scale: 100 milliseconds to seconds
- Update Rate: 1-10 Hz
- Paradigm: Deliberative (search, optimization, prediction)
Example: Robot plans path from current position to goal, avoiding obstacles, optimizing for time or energy.
Algorithms:
- Sampling-based motion planning (RRT, PRM)
- Optimization-based planning (trajectory optimization, MPC)
- Search-based planning (A*, D*)
- Learning-based planning (neural planners, value iteration networks)
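The search-based family can be sketched with A* on a small occupancy grid. This is a minimal version assuming a 4-connected grid, unit step costs, and a free start cell; production planners usually add robot footprint inflation and path smoothing.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """4-connected A* on an occupancy grid (0 = free, 1 = occupied)."""
    rows, cols = len(grid), len(grid[0])

    def h(c: Cell) -> int:
        # Manhattan distance: admissible for 4-connected unit-cost moves.
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from: Dict[Cell, Cell] = {}
    g_cost: Dict[Cell, int] = {start: 0}

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start.
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > g_cost[current]:
            continue  # stale heap entry, a cheaper route was found already
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = new_g
                    came_from[(nr, nc)] = current
                    heapq.heappush(open_heap, (new_g + h((nr, nc)), new_g, (nr, nc)))
    return None  # goal unreachable
```

For example, `astar([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 0))` has to route around the wall in the middle row through the single gap on the right.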
5. Application Layer (Task Logic)
Function: High-level task specification and execution.
Components:
- Task Interface: User commands, scripts, demonstrations
- World Model: Semantic understanding of environment
- Decision Making: Task sequencing, goal reasoning
- Human-Robot Interaction: Speech, gestures, GUI
Characteristics:
- Time Scale: Seconds to minutes
- Update Rate: 0.1-1 Hz
- Paradigm: Symbolic reasoning, language-based
Example: User says "Clean the table." System interprets command, identifies table location, plans sequence: navigate to table → detect objects → grasp object → move to bin → release → repeat.
Technologies:
- Large language models (task understanding)
- Vision-language models (grounding language in vision)
- Symbolic planners (STRIPS, PDDL)
- Dialogue systems (clarification, confirmation)
Data Flow Through Layers
Bottom-Up (Perception)
Sensors → Signal Processing → Feature Extraction → State Estimation → World Model
Example: Object Detection Pipeline:
- Camera captures 1920×1080 RGB image (30 Hz)
- Preprocessing resizes to 640×640, normalizes pixel values
- Neural Network (YOLO, Faster R-CNN) detects objects → bounding boxes + class labels
- 3D Estimation fuses with depth sensor → object poses in world coordinates
- Tracking associates detections across frames → object trajectories
- World Model updates object database (positions, velocities, identities)
Latency: 30-100ms (camera lag + inference + processing)
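The 3D estimation step in this pipeline can be sketched with the pinhole camera model: a detection's pixel center plus its depth gives a point in the camera frame. The function name and intrinsic values here are assumptions for illustration, and a further extrinsic transform (camera pose in the world) would be needed to reach world coordinates.

```python
from typing import Tuple


def bbox_center_to_camera_point(
    bbox: Tuple[float, float, float, float],  # (x_min, y_min, x_max, y_max) in pixels
    depth_m: float,                           # depth at the box center, in meters
    fx: float, fy: float,                     # focal lengths in pixels (assumed calibrated)
    cx: float, cy: float,                     # principal point in pixels
) -> Tuple[float, float, float]:
    """Back-project the bounding-box center into the camera frame (meters)."""
    u = 0.5 * (bbox[0] + bbox[2])
    v = 0.5 * (bbox[1] + bbox[3])
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

Using the box center with a single depth reading is the simplest choice; taking the median depth inside the box is a common way to make the estimate less sensitive to background pixels.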
Top-Down (Action)
Task Goal → Plan → Behavior Selection → Control Commands → Actuator Signals
Example: Pick-and-Place Execution:
- Task: "Pick up red cup"
- Planning: Compute arm trajectory from current pose to pre-grasp pose
- Behavior: Select "approach" behavior
- Control: Generate joint velocity commands (inverse kinematics + motion profile)
- Actuation: Motor drivers execute velocity commands → arm moves
Update Rate: 1 Hz (planning) → 10 Hz (behavior) → 100 Hz (control) → 1 kHz (actuation)
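One way to picture these nested rates is a single 1 kHz base tick from which the slower layers are derived by integer division. The sketch below only counts invocations; the layer names come from the text, while the single-threaded scheduling scheme is an assumption (real systems usually run the layers on separate threads or processors).

```python
from collections import Counter

BASE_HZ = 1000  # actuation tick rate from the text
RATES_HZ = {"planning": 1, "behavior": 10, "control": 100, "actuation": 1000}


def run_ticks(duration_s: float) -> Counter:
    """Count how often each layer runs when all are driven off one base tick."""
    counts: Counter = Counter()
    total_ticks = int(duration_s * BASE_HZ)
    for tick in range(total_ticks):
        for layer, hz in RATES_HZ.items():
            period_ticks = BASE_HZ // hz
            if tick % period_ticks == 0:
                counts[layer] += 1  # a real system would call the layer's step here
    return counts
```

Over two simulated seconds, each layer runs ten times as often as the one above it, which is the ratio the update-rate chain describes.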
Communication Architecture
Centralized vs. Distributed
Centralized (Single Computer):
- All computation on one powerful machine
- Simplifies software architecture
- Single point of failure
- Limited I/O bandwidth
Distributed (Multiple Processors):
- Sensors/actuators connected to dedicated microcontrollers
- High-level planning on powerful CPU/GPU
- Low-level control on real-time processors (FPGA, microcontrollers)
- Robust to partial failures
Hybrid (Common in Modern Robots):
- Central computer for perception, planning, learning
- Distributed microcontrollers for motor control, sensor acquisition
- Communication via field bus (CAN, EtherCAT) between controllers, with middleware such as ROS 2 on the central computer
Middleware Frameworks
ROS 2 (Robot Operating System 2)
Function: Standardized communication framework for robot components.
Key Features:
- Publish-Subscribe: Nodes publish data to topics, others subscribe
- Services: Request-response communication
- Actions: Long-running tasks with feedback
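- Real-Time: DDS (Data Distribution Service) backend with configurable QoS (reliability, deadlines, history depth); timing determinism still depends on the OS, DDS implementation, and executor configuration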
Example Architecture:
Camera Node (30 Hz) → /camera/image_raw → Object Detector Node
Object Detector Node → /detected_objects → Task Planner
Task Planner → /arm/trajectory → Arm Controller
Arm Controller → Motors
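The publish-subscribe wiring of this example can be mimicked with a toy in-process bus. This sketch deliberately does not use the ROS 2 client libraries: `TopicBus` and `build_pipeline` are hypothetical names, delivery is synchronous, and there is no QoS, discovery, or serialization. Only the topic names are taken from the example.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class TopicBus:
    """Toy in-process publish-subscribe bus (synchronous delivery, no QoS)."""
    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subs[topic].append(callback)

    def publish(self, topic: str, msg: Any) -> None:
        for callback in self._subs[topic]:
            callback(msg)


def build_pipeline(bus: TopicBus, motor_log: List[str]) -> None:
    # Object detector: image in, detections out (the detection itself is stubbed).
    bus.subscribe("/camera/image_raw",
                  lambda img: bus.publish("/detected_objects", ["cup"]))
    # Task planner: detections in, trajectory out (placeholder trajectory string).
    bus.subscribe("/detected_objects",
                  lambda objs: bus.publish("/arm/trajectory", f"reach:{objs[0]}"))
    # Arm controller: trajectory in, motor command recorded.
    bus.subscribe("/arm/trajectory", motor_log.append)
```

Publishing one frame on `/camera/image_raw` propagates through all three stages. Because each stage only knows topic names, any node can be swapped without changing the others, which is the modularity argument made for ROS 2.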
Advantages:
- Modularity: Swap components easily
- Multi-language: Official C++ (rclcpp) and Python (rclpy) client libraries, plus community bindings such as Rust
- Rich ecosystem: Drivers, algorithms, visualization (RViz)
Disadvantages:
- Overhead (not suitable for hard real-time control loops)
- Complexity (learning curve)
Compute Architecture
Processing Units
CPU (Central Processing Unit):
- Use: High-level planning, logic, coordination
- Example: Intel i7, AMD Ryzen, ARM Cortex-A
GPU (Graphics Processing Unit):
- Use: Deep learning inference, parallel sensor processing
- Example: NVIDIA Jetson (embedded), RTX 4090 (desktop)
FPGA (Field-Programmable Gate Array):
- Use: Ultra-low latency sensor processing, custom control loops
- Example: Xilinx Zynq (now AMD), Altera devices
Microcontroller (MCU):
- Use: Motor control, sensor acquisition, real-time tasks
- Example: STM32, Arduino, Teensy
TPU (Tensor Processing Unit):
- Use: Accelerated neural network inference
- Example: Google Edge TPU (sold in Coral boards and modules)
Typical Humanoid Compute Stack
Example: Tesla Optimus (as publicly described; details are approximate):
- Main Computer: Custom SoC (System-on-Chip) with CPU + GPU for planning and perception (vision models)
- Motor Controllers: Distributed microcontrollers (one per joint) for real-time torque control
- Sensor Hubs: Dedicated processors for camera, IMU fusion
- Communication: Custom field bus (low latency)
Power Budget: 100-500W total, depending on activity. Actuators typically dominate during locomotion, while compute takes a larger share when the robot is stationary.
Perception Architecture
Sensor Fusion Pipeline
Goal: Combine multiple sensors for robust state estimation.
Example: Mobile Robot Localization:
Sensors:
- Wheel Encoders: Odometry (dead reckoning)
- IMU: Orientation, acceleration
- LiDAR: 2D/3D point cloud
- Camera: Visual features
Fusion Algorithm (Extended Kalman Filter):
- Prediction: Use odometry to predict position
- Update: Correct prediction using LiDAR scan matching
- Refinement: Incorporate IMU for orientation stability
- Validation: Visual odometry detects wheel slip
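Output: Robot pose (x, y, θ) at 50 Hz. Accuracy on the order of 5cm is achievable in a well-mapped indoor environment, but it depends on sensor quality and map fidelity.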
Alternative: Factor graph optimization (GTSAM library), particle filter (Monte Carlo localization).
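The predict/update structure of the filter can be shown with a one-dimensional linear Kalman filter, which is a simplification of the EKF described above: the full EKF tracks (x, y, θ) and linearizes nonlinear motion and measurement models with Jacobians. The function names and noise values are assumptions for illustration.

```python
from typing import Tuple


def kf_predict(x: float, p: float, odom_delta: float, q: float) -> Tuple[float, float]:
    """Prediction: advance the estimate by the odometry increment, grow variance by q."""
    return x + odom_delta, p + q


def kf_update(x: float, p: float, z: float, r: float) -> Tuple[float, float]:
    """Update: blend in a position measurement z (e.g. from scan matching) with variance r."""
    k = p / (p + r)  # Kalman gain: how much to trust the measurement
    return x + k * (z - x), (1.0 - k) * p
```

Prediction always increases the variance `p` and the update always decreases it. That is why dead reckoning alone drifts while periodic LiDAR corrections keep the estimate bounded.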
Perception Modules
Object Detection:
- Input: RGB image
- Output: Bounding boxes, class labels, confidence scores
- Model: YOLO, Faster R-CNN, DETR
- Latency: 10-50ms
Depth Estimation:
- Input: Stereo images or monocular image
- Output: Depth map
- Model: Stereo matching, monocular depth network (MiDaS)
- Latency: 20-100ms
Semantic Segmentation:
- Input: RGB image
- Output: Pixel-wise class labels
- Model: DeepLabV3 (semantic), Mask R-CNN (instance segmentation)
- Latency: 50-200ms
Pose Estimation:
- Input: RGB image + object model
- Output: 6D pose (position + orientation)
- Model: PoseCNN, DenseFusion
- Latency: 30-100ms
Control Architecture
Layered Control Hierarchy
High-Level Control (1-10 Hz):
- Input: Desired end-effector pose or velocity
- Output: Joint position/velocity commands
- Algorithm: Inverse kinematics, trajectory generation
Mid-Level Control (10-100 Hz):
- Input: Joint position/velocity commands
- Output: Joint torque commands
- Algorithm: Impedance control, admittance control
Low-Level Control (100-1000 Hz):
- Input: Joint torque commands
- Output: Motor current commands
- Algorithm: Torque control (current control loop)
Hardware Control (1-10 kHz):
- Input: Motor current commands
- Output: PWM signals to motor driver
- Implementation: Microcontroller firmware
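The inverse-kinematics step at the top of this hierarchy has a closed-form solution for a planar two-link arm, sketched below. The link lengths are illustrative assumptions, and the function returns only one of the two possible solutions (the one with q2 ≥ 0); arms with more joints generally need numerical IK.

```python
import math
from typing import Tuple


def ik_2link(x: float, y: float, l1: float, l2: float) -> Tuple[float, float]:
    """Joint angles (rad) placing a planar 2-link arm's tip at (x, y), q2 >= 0 branch."""
    d2 = x * x + y * y
    cos_q2 = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)  # law of cosines
    if not -1.0 <= cos_q2 <= 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def fk_2link(q1: float, q2: float, l1: float, l2: float) -> Tuple[float, float]:
    """Forward kinematics, used here to check the IK result."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))
```

Feeding the result of `ik_2link` back through `fk_2link` should reproduce the target point, which is a cheap sanity check to keep in any IK code path.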
Whole-Body Control (Humanoid Specific)
Challenge: Coordinate 30+ degrees of freedom while maintaining balance.
Approach: Quadratic Programming (QP) optimization
Formulation:
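minimize: || q̈ - q̈_desired ||² (track desired accelerations)
subject to:
- Rigid-body dynamics (equations of motion) relating accelerations, joint torques, and contact forces
- Contact forces within friction cone
- Joint torque limits
- Zero Moment Point (ZMP) inside support polygon
Output: Joint accelerations, and the joint torques that realize them, for all 30+ DOF simultaneously.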
Update Rate: 100-1000 Hz.
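Libraries: Drake (originated at MIT, now developed mainly at Toyota Research Institute), Pinocchio, RBDL. The latter two supply the rigid-body dynamics terms and are typically paired with a separate QP solver.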
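For reference, one common way to write the formulation above in full is sketched below. The symbols are assumptions introduced here: M is the joint-space mass matrix, h collects Coriolis and gravity terms, S selects the actuated joints, J_c is the contact Jacobian, f stacks the contact forces, and K is the (usually linearized) friction cone. Actual whole-body controllers differ in their cost terms and task weighting.

```latex
\begin{aligned}
\min_{\ddot{q},\,\tau,\,f}\quad & \lVert \ddot{q} - \ddot{q}_{\mathrm{desired}} \rVert^{2} \\
\text{subject to}\quad
 & M(q)\,\ddot{q} + h(q,\dot{q}) = S^{\top}\tau + J_c(q)^{\top} f
   && \text{(equations of motion)} \\
 & f \in \mathcal{K}
   && \text{(friction cone)} \\
 & \tau_{\min} \le \tau \le \tau_{\max}
   && \text{(joint torque limits)} \\
 & p_{\mathrm{ZMP}}(f) \in \text{support polygon}
   && \text{(balance)}
\end{aligned}
```

With a linearized friction cone every constraint is linear in the decision variables and the cost is quadratic, which is what makes the problem a QP that can be solved within the control period.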
Key Architecture Patterns
1. Sense-Plan-Act Cycle
Traditional robotics paradigm:
- Sense: Gather all sensor data
- Plan: Compute optimal action
- Act: Execute action
- Repeat
Limitation: Planning takes time (100ms-1s), and the world can change while the planner is running.
Solution: Asynchronous planning (plan while executing previous plan).
2. Subsumption Architecture
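Principle: Layered behaviors run concurrently, and a fixed priority decides whose output reaches the actuators. Brooks's original formulation has higher layers suppress or inhibit lower ones; the safety-first variant used in this example gives the lowest layer the final say.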
Example:
- Layer 0 (highest priority): Collision avoidance (reflex)
- Layer 1: Follow wall (reactive)
- Layer 2: Explore (deliberative)
Behavior: If obstacle detected, Layer 0 suppresses Layers 1-2 and executes avoidance.
Advantage: Robust, real-time, graceful degradation.
Disadvantage: Difficult to design for complex tasks.
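The three-layer example can be sketched as fixed-priority arbitration: each behavior either proposes a command or stays silent, and the first active layer wins. This captures the priority ordering of the example rather than the full suppression and inhibition wiring of Brooks's architecture. The sensor keys, thresholds, and gains are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Command = Tuple[float, float]  # (linear m/s, angular rad/s)
Behavior = Callable[[Dict[str, float]], Optional[Command]]


def avoid_collision(s: Dict[str, float]) -> Optional[Command]:
    # Layer 0: fires only when an obstacle is closer than an assumed 0.5 m threshold.
    return (0.0, 1.0) if s["front_range_m"] < 0.5 else None


def follow_wall(s: Dict[str, float]) -> Optional[Command]:
    # Layer 1: fires when a wall is within 1.0 m on the right; steers to hold about 0.5 m.
    if s["right_range_m"] < 1.0:
        return (0.3, 0.8 * (0.5 - s["right_range_m"]))
    return None


def explore(s: Dict[str, float]) -> Optional[Command]:
    # Layer 2: always has an opinion (drive forward).
    return (0.5, 0.0)


LAYERS: List[Behavior] = [avoid_collision, follow_wall, explore]  # index 0 wins


def arbitrate(sensors: Dict[str, float]) -> Command:
    """Return the command of the highest-priority layer that is active."""
    for behavior in LAYERS:
        cmd = behavior(sensors)
        if cmd is not None:
            return cmd
    return (0.0, 0.0)  # no layer active: stop
```

Graceful degradation follows from the structure: if the exploration layer crashes or is removed from `LAYERS`, wall following and collision avoidance keep working unchanged.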
3. Hybrid Deliberative/Reactive
Principle: Combine planning (slow, optimal) with reactive control (fast, safe).
Implementation:
- Deliberative layer: Plan path every 1 second
- Reactive layer: Adjust trajectory every 10ms for obstacles
Example: Self-driving car plans route (GPS navigation) but reactively avoids pedestrians (no time to replan full route).
4. Model Predictive Control (MPC)
Principle: Optimize actions over future time horizon, execute first action, replan.
Algorithm:
- Predict future states (0-2 seconds ahead)
- Optimize control to minimize cost (collision, energy, time)
- Execute first 100ms of plan
- Repeat (receding horizon)
Advantage: Anticipatory, handles constraints.
Disadvantage: Computationally expensive (requires fast optimization).
Use Case: Drone trajectory tracking, humanoid walking.
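The receding-horizon loop can be illustrated with a deliberately tiny example: a one-dimensional robot whose velocity is the control, with the optimizer replaced by brute-force enumeration over a few candidate commands. The model, cost weights, candidate set, and shortened horizon (0.5 s rather than the 0-2 s mentioned above) are all assumptions that keep the sketch small. Practical MPC uses a numerical optimizer and a richer dynamics model.

```python
from itertools import product
from typing import Sequence

DT = 0.1                                   # seconds per step (the "first 100ms" executed)
HORIZON = 5                                # look-ahead steps, kept short for brute force
CANDIDATE_U = (-1.0, -0.5, 0.0, 0.5, 1.0)  # allowed velocity commands in m/s


def rollout_cost(x: float, controls: Sequence[float], goal: float) -> float:
    """Predict with the model x_next = x + u*dt and accumulate tracking + effort cost."""
    cost = 0.0
    for u in controls:
        x = x + u * DT
        cost += (x - goal) ** 2 + 0.01 * u ** 2
    return cost


def mpc_step(x: float, goal: float) -> float:
    """Return the first control of the cheapest sequence over the horizon."""
    best = min(product(CANDIDATE_U, repeat=HORIZON),
               key=lambda seq: rollout_cost(x, seq, goal))
    return best[0]


def run_mpc(x0: float, goal: float, steps: int) -> float:
    """Closed loop: re-optimize at every step and apply only the first action."""
    x = x0
    for _ in range(steps):
        u = mpc_step(x, goal)  # re-optimize from the latest state
        x = x + u * DT         # execute the first action, then repeat
    return x
```

Enumeration costs 5^5 rollouts per step here and grows exponentially with the horizon, which is a small-scale view of why the text lists computational expense as the main disadvantage.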
Practical Example: Warehouse AMR Architecture
Application: Autonomous mobile robot for warehouse logistics.
Hardware:
- Compute: NVIDIA Jetson AGX Xavier (GPU for perception)
- Sensors: 2D LiDAR, stereo camera, wheel encoders, IMU
- Actuators: Differential drive motors
- Power: 48V battery (8 hours runtime)
Software Stack:
Layer 1 (Hardware):
- Motor drivers (CAN bus)
- LiDAR driver (Ethernet)
- Camera driver (USB)
Layer 2 (Reactive):
- Velocity controller (100 Hz)
- Emergency stop (detects obstacles within 0.5m)
Layer 3 (Executive):
- State machine (idle → navigating → docking → charging)
- Behavior arbitration (prioritize safety > efficiency)
Layer 4 (Deliberative):
- SLAM (map building + localization, 10 Hz)
- Path planning (A* on occupancy grid, replan every 1s)
- Object detection (pallet recognition, 5 Hz)
Layer 5 (Application):
- Task manager (receive pick/place orders from warehouse system)
- Fleet coordination (communicate with other robots)
Communication: ROS 2 (internal), REST API (external fleet management).
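The executive layer of this stack can be sketched as a table-driven state machine with a safety check that outranks every transition, reflecting the "safety > efficiency" rule. The event names and the extra `ESTOP` mode are assumptions added for illustration; the 0.5 m threshold comes from the reactive layer above.

```python
from enum import Enum, auto


class Mode(Enum):
    IDLE = auto()
    NAVIGATING = auto()
    DOCKING = auto()
    CHARGING = auto()
    ESTOP = auto()


# (current mode, event) -> next mode; event names are hypothetical
TRANSITIONS = {
    (Mode.IDLE, "order_received"): Mode.NAVIGATING,
    (Mode.NAVIGATING, "goal_reached"): Mode.IDLE,
    (Mode.NAVIGATING, "battery_low"): Mode.DOCKING,
    (Mode.IDLE, "battery_low"): Mode.DOCKING,
    (Mode.DOCKING, "dock_contact"): Mode.CHARGING,
    (Mode.CHARGING, "battery_full"): Mode.IDLE,
    (Mode.ESTOP, "obstacle_cleared"): Mode.IDLE,
}

ESTOP_RANGE_M = 0.5  # emergency-stop distance from the reactive layer


def step(mode: Mode, event: str, min_obstacle_range_m: float) -> Mode:
    """Advance the state machine; the safety check outranks every other transition."""
    if min_obstacle_range_m < ESTOP_RANGE_M:
        return Mode.ESTOP
    return TRANSITIONS.get((mode, event), mode)  # unknown events leave the mode unchanged
```

A real controller would need to exempt expected close contacts, such as the charging dock during docking, from the emergency-stop check. This sketch omits that for brevity.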
Key Takeaways
- Physical AI systems use hierarchical layered architecture separating hardware, reactive control, coordination, planning, and application logic.
- Time scales vary by layer: hardware (µs-ms), reactive (ms), planning (100ms-s), application (s-min).
- Data flows bottom-up (perception) and top-down (action) with sensor fusion at lower layers and symbolic reasoning at higher layers.
- Communication architectures range from centralized (single computer) to distributed (multiple processors) with middleware like ROS 2 enabling modular design.
- Compute architectures combine CPUs (logic), GPUs (learning/perception), microcontrollers (real-time control), and specialized accelerators (FPGAs, TPUs).
- Perception pipelines integrate object detection, depth estimation, segmentation, and pose estimation at 10-30 Hz for real-time scene understanding.
- Control hierarchies span high-level task planning (Hz), mid-level impedance control (10s Hz), and low-level torque control (100s Hz-kHz).
- Key patterns include sense-plan-act, subsumption, hybrid deliberative/reactive, and model predictive control trading off optimality, reactivity, and robustness.
Next Chapter: Data flow in Physical AI systems, covering how information moves from sensors through perception, planning, and control to the actuators.