Perception Pipelines for Physical AI

Problem Framing

Perception transforms raw sensor data into actionable representations. Unlike virtual AI, where input is clean and structured, Physical AI perception must handle:

Noisy sensors: Gaussian noise, outliers, drift
Partial observability: Occlusions, limited field of view
Real-time constraints: 30-100ms latency budgets
Multimodal fusion: Combining vision, depth, tactile, proprioceptive data

Core Challenge: Build robust perception under uncertainty while meeting hard real-time deadlines.

Vision Pipeline Architecture

RGB Image Processing

Pipeline Stages:

Camera → Debayering → Undistortion → Preprocessing → Neural Network → Post-processing → Output

1. Debayering: Convert Bayer pattern to RGB (10ms)

Raw sensor data is mosaic of R/G/B pixels
Interpolation reconstructs full RGB image
Quality vs speed: bilinear (fast) vs edge-aware (slow)

2. Undistortion: Correct lens distortion (5ms)

Apply camera calibration parameters
Remap pixels using lookup table (fast) or per-pixel computation (accurate)
Critical for metric measurements

3. Preprocessing: Normalize for neural network (2ms)

Resize to network input (e.g., 640×640)
Normalize pixel values [0,255] → [0,1] or standardize (mean=0, std=1)
Color space conversion if needed (RGB → BGR for some models)
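The preprocessing steps above can be sketched with NumPy alone. This is a minimal illustration, not a production path: it assumes an H×W×3 uint8 input, uses nearest-neighbour resizing for brevity (real pipelines usually use bilinear or area interpolation), and the function names `preprocess` and `standardize` are hypothetical.

```python
import numpy as np

def preprocess(image_rgb, size=640, to_bgr=False):
    """image_rgb: HxWx3 uint8 array. Returns a size x size x 3 float32 array in [0, 1]."""
    h, w = image_rgb.shape[:2]
    # Nearest-neighbour resize: pick one source row/column per output pixel
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image_rgb[rows[:, None], cols[None, :]]
    if to_bgr:
        resized = resized[..., ::-1]  # reverse the channel axis: RGB -> BGR
    # Scale [0, 255] -> [0, 1]
    return resized.astype(np.float32) / 255.0

def standardize(image, mean, std):
    """Per-channel standardization; mean and std are length-3 sequences."""
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (image - mean) / std
```

Whether a model expects [0,1] scaling, standardization, or BGR ordering depends on how it was trained, so the flags here should be set to match the training pipeline.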

4. Neural Network Inference: Object detection/segmentation (20-100ms)

YOLO: Single-shot detector, 10-50ms, real-time capable
Mask R-CNN: Instance segmentation, 50-200ms, high accuracy
Vision Transformers (ViT): 30-100ms, SOTA accuracy but compute-heavy

5. Post-processing: Filter and refine outputs (5ms)

Non-maximum suppression (remove duplicate detections)
Confidence thresholding (filter low-confidence detections)
Tracking (associate detections across frames)
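Confidence thresholding and non-maximum suppression can be sketched in a few lines. The version below is a simplified, class-agnostic greedy NMS; the box format (x1, y1, x2, y2), the thresholds, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(detections, conf_threshold=0.5, iou_threshold=0.5):
    """detections: list of (box, score). Returns kept detections, best first."""
    # Confidence thresholding, then sort by descending score
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept
```

Detectors with many classes usually run this per class so that overlapping objects of different classes are not suppressed.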

Total Latency: 40-120ms (too slow for low-level control, sufficient for planning)

Depth Estimation

Stereo Vision:

Left Image + Right Image → Disparity Map → Depth Map

Algorithm: Block matching or semi-global matching (SGM)

Compare image patches between left/right images
Disparity d = baseline × focal_length / depth
Accuracy: 1-5% of distance at typical working ranges; depth error grows with the square of distance
Latency: 20-50ms (GPU-accelerated)
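Inverting the disparity relation gives depth, and differentiating it shows why stereo accuracy degrades with range. The sketch below assumes a rectified stereo pair with focal length in pixels and baseline in metres; the default sub-pixel matching error of 0.25 px is an illustrative assumption, not a measured value.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=0.25):
    """First-order depth uncertainty: |dZ| = Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Example with assumed parameters: f = 700 px, B = 0.12 m
z = depth_from_disparity(42.0, focal_px=700.0, baseline_m=0.12)
err_near = depth_error(2.0, 700.0, 0.12)
err_far = depth_error(4.0, 700.0, 0.12)
```

Doubling the distance quadruples the expected depth error, which is why a wider baseline or longer focal length is needed for long-range stereo.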

Monocular Depth:

Single RGB Image → Neural Network → Depth Map

Models: MiDaS, DPT (Dense Prediction Transformer)

Learned depth from cues (perspective, occlusion, texture)
Advantage: Single camera, no stereo calibration
Limitation: Relative depth only (not metric), scale ambiguity

LiDAR Processing

Point Cloud Pipeline:

LiDAR → Raw Points → Ground Removal → Clustering → Object Detection → Tracking

1. Ground Removal: Segment ground plane (10ms)

RANSAC plane fitting
Elevation-based filtering
Output: Obstacle points only
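A minimal RANSAC plane fit for ground removal might look like the following. It assumes the largest plane in the scan is the ground, which holds on flat terrain but not next to large walls; the threshold, iteration count, and function name are illustrative choices.

```python
import numpy as np

def remove_ground(points, distance_threshold=0.1, iterations=100, seed=0):
    """points: Nx3 array. Returns (obstacle_points, plane), where
    plane = (normal, d) satisfies normal . p + d = 0 on the fitted plane."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(iterations):
        # Fit a candidate plane through three random points
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal = normal / norm
        d = -normal.dot(sample[0])
        # Count points within the distance threshold of the plane
        inliers = np.abs(points @ normal + d) < distance_threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    # Everything that is not on the best plane is treated as an obstacle
    return points[~best_inliers], best_plane
```

In practice the candidate plane is often also required to be roughly horizontal, which rejects walls and ramps that happen to contain many points.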

2. Clustering: Group points into objects (15ms)

Euclidean clustering or DBSCAN (density-based)
Distance threshold (e.g., 0.5m)
Output: Point clusters per object
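Euclidean clustering with a distance threshold amounts to finding connected components, as in the sketch below. It uses a brute-force neighbour search that is quadratic in the number of points; real implementations use a k-d tree or voxel grid. The function name and parameters are hypothetical.

```python
import math
from collections import deque

def euclidean_clusters(points, distance_threshold=0.5, min_points=1):
    """points: list of (x, y, z). Returns clusters as sorted lists of point indices."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, cluster = deque([seed]), [seed]
        while queue:
            current = queue.popleft()
            # Collect unvisited points within the threshold of the current point
            near = [
                i for i in unvisited
                if math.dist(points[current], points[i]) <= distance_threshold
            ]
            for i in near:
                unvisited.remove(i)
            queue.extend(near)
            cluster.extend(near)
        if len(cluster) >= min_points:
            clusters.append(sorted(cluster))
    return clusters
```

Raising `min_points` discards isolated returns, which is a cheap way to suppress sensor noise before classification.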

3. Object Detection: Classify clusters (20ms)

Bounding box fitting (oriented or axis-aligned)
Shape-based classification (car, pedestrian, cyclist)
Output: Object class + pose + dimensions

4. Tracking: Associate objects across scans (5ms)

Kalman filter prediction + data association
Handle occlusions, object ID management
Output: Tracked objects with velocities
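The data association step can be illustrated with greedy nearest-neighbour matching and a distance gate. This is a simplified sketch in 2D; the gate value and function name are assumptions, and production trackers often use the Hungarian algorithm or a Mahalanobis-distance gate instead.

```python
import math

def associate(predicted, detections, gate=1.0):
    """predicted: {track_id: (x, y)}, detections: list of (x, y).
    Returns ({track_id: detection_index}, unmatched_detection_indices)."""
    # All track/detection pairs, closest first
    pairs = sorted(
        (math.dist(p, d), tid, j)
        for tid, p in predicted.items()
        for j, d in enumerate(detections)
    )
    matches, used = {}, set()
    for dist, tid, j in pairs:
        if dist > gate:
            break  # remaining pairs are even farther apart
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched
```

Unmatched detections typically start new tracks, while tracks that go unmatched for several scans are kept alive on prediction alone (to ride out occlusions) and then deleted.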

Total Latency: about 50ms per scan, enough to keep up with a 20 Hz LiDAR (typical for automotive LiDAR)

Multimodal Sensor Fusion

Architecture Patterns

Early Fusion: Combine raw sensor data before processing

RGB + Depth → Concatenate → Neural Network → Output

Advantage: Network learns joint features
Disadvantage: Requires synchronized sensors, hard to handle missing modalities
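In its simplest form, early fusion is channel concatenation before the network sees the data. The sketch below assumes a pixel-aligned RGB-D pair and normalizes depth by an assumed maximum range of 10 m; both the function name and that range are illustrative.

```python
import numpy as np

def early_fuse(rgb, depth, max_depth_m=10.0):
    """rgb: HxWx3 uint8, depth: HxW in metres. Returns an HxWx4 float32 tensor."""
    if rgb.shape[:2] != depth.shape:
        raise ValueError("RGB and depth must be pixel-aligned with the same size")
    rgb_n = rgb.astype(np.float32) / 255.0
    # Clip and scale depth so all four channels share a [0, 1] range
    depth_n = np.clip(depth, 0.0, max_depth_m).astype(np.float32) / max_depth_m
    return np.concatenate([rgb_n, depth_n[..., None]], axis=-1)
```

The shape check makes the synchronization requirement concrete: if the depth frame is missing or misaligned, there is nothing valid to feed the network.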

Late Fusion: Process each modality separately, combine outputs

RGB → Detector A → Detections
Depth → Detector B → Detections  → Fusion → Final Detections

Advantage: Modular, handles missing sensors gracefully
Disadvantage: Misses cross-modal correlations

Hierarchical Fusion: Multi-stage combination

RGB + Depth → Feature Fusion → Mid-level Features → Detection Head → Detections

Advantage: Balance between early/late fusion
Example: BEVFusion (autonomous driving), PointPainting (LiDAR + camera)

Fusion Algorithms

Kalman Filter Fusion:

Use Case: Fuse IMU + visual odometry for localization
Principle: Weighted average based on sensor noise covariance
Update: prediction (IMU) + correction (vision)
Advantage: Optimal for Gaussian noise, real-time
Limitation: Assumes linear dynamics, Gaussian distributions
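The "weighted average based on noise covariance" is easiest to see in one dimension. The sketch below is a scalar Kalman filter with a hypothetical motion model x_{k+1} = x_k + u, where u stands for displacement integrated from the IMU and z for a position measurement from vision; all names and noise values are assumptions.

```python
def kalman_predict(x, P, u, Q):
    """Prediction step: move the estimate by u and grow its variance by Q."""
    return x + u, P + Q

def kalman_update(x, P, z, R):
    """Correction step: blend estimate (variance P) with measurement z (variance R)."""
    K = P / (P + R)          # Kalman gain: near 1 if the measurement is more trustworthy
    x_new = x + K * (z - x)  # weighted average of prediction and measurement
    P_new = (1.0 - K) * P    # variance always shrinks after a correction
    return x_new, P_new

# Equal variances: the corrected estimate lands halfway between the two
x, P = kalman_update(x=0.0, P=1.0, z=2.0, R=1.0)
```

With equal variances the gain is 0.5; a noisier measurement (larger R) pulls the estimate less.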

Particle Filter Fusion:

Use Case: Non-linear systems, multi-modal distributions
Principle: Represent belief as weighted particle cloud
Update: Weight particles by sensor likelihood, then resample in proportion to those weights
Advantage: Handles non-Gaussian noise, multi-hypothesis tracking
Limitation: Computationally expensive (1000s of particles)
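One predict-weight-resample cycle of a particle filter for a 1D state can be sketched as follows. The Gaussian motion and measurement models, the multinomial resampling, and the function name are assumptions chosen for brevity; the random generator is passed in so results are reproducible.

```python
import math
import random

def particle_filter_step(particles, motion, measurement, motion_std, meas_std, rng):
    """One predict-weight-resample cycle for a 1D state (list of floats)."""
    # Predict: apply the motion to every particle, with process noise
    moved = [p + motion + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each particle
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in moved]
    if sum(weights) == 0.0:
        return moved  # measurement implausible for all particles; keep the prediction
    # Resample: draw particles in proportion to their weights
    return rng.choices(moved, weights=weights, k=len(moved))

rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(1000)]
for _ in range(5):
    particles = particle_filter_step(particles, 0.0, 7.0, 0.1, 0.5, rng)
estimate = sum(particles) / len(particles)
```

Starting from a uniform belief over [0, 10], a few repeated measurements near 7 concentrate the particle cloud around that value.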

Neural Fusion:

Use Case: Complex cross-modal correlations
Principle: Learned attention weights across modalities
Architecture: Transformer-based cross-attention
Advantage: End-to-end learned, handles novel correlations
Limitation: Requires large training data, opaque reasoning
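The core of cross-attention is scaled dot-product attention in which queries come from one modality and keys and values from another. The NumPy sketch below omits the learned projection matrices and multi-head structure of a real Transformer layer, so it shows only the weighting mechanism; the function name is hypothetical.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """queries: (Nq, d) from one modality; keys: (Nk, d) and values: (Nk, dv) from another.
    Returns (attended features of shape (Nq, dv), attention weights of shape (Nq, Nk))."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Softmax over the key axis, shifted by the row maximum for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights
```

Each row of the weight matrix shows how much one query token (for example a LiDAR feature) draws from each token of the other modality (for example image patches), which is one of the few interpretable handles on an otherwise opaque model.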

Practical Example: Grasping Pipeline

Sensors:

RGB-D camera (RealSense D435)
Wrist force/torque sensor
Joint encoders

Pipeline:

Phase 1: Pre-Grasp (Vision-Dominant)

RGB-D → Object detection + 6D pose estimation (50ms)
Point cloud → Grasp candidate generation (30ms)
Grasp ranking → Select top grasp (10ms)

Phase 2: Approach (Vision + Proprioception)

Visual servoing: Track object in image → velocity commands (30 Hz)
Joint encoders → Forward kinematics → End-effector pose (1 kHz)
Fuse visual + kinematic estimates → Kalman filter (100 Hz)

Phase 3: Contact (Tactile-Dominant)

Force sensor → Detect contact (1 kHz)
Adjust gripper force based on contact (impedance control)
Vision confirms grasp success (post-grasp verification)

Key Insight: Modality priorities shift across task phases.

Latency and Accuracy Tradeoffs

System-Level Latency Budget

Example: Humanoid Walking

Control Loop: 1 kHz (1ms cycle time)
State Estimation: 100 Hz (10ms latency acceptable)
Vision Processing: 10 Hz (100ms latency acceptable)

Allocation:

IMU → State estimator: 1ms (critical path)
Vision → Object detector: 100ms (non-critical)
Planner → Footstep planner: 1000ms (runs asynchronously)

Principle: Slower modalities inform higher-level decisions, fast modalities enable low-level control.
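The rate hierarchy above can be made concrete with a small helper that reports which tasks are due on each tick of a 1 kHz base loop. This only illustrates the relative rates; in a real system the slow tasks run in separate threads or processes so they never block the control loop. The task names and function are hypothetical.

```python
def due_tasks(tick, base_rate_hz=1000, rates_hz=None):
    """Return the names of tasks due on this tick of the base loop."""
    if rates_hz is None:
        rates_hz = {"control": 1000, "state_estimation": 100, "vision": 10}
    # A task at rate r runs every (base_rate / r) ticks
    return [
        name for name, rate in rates_hz.items()
        if tick % (base_rate_hz // rate) == 0
    ]

# Over one second of ticks, count how often each task fires
counts = {name: 0 for name in ("control", "state_estimation", "vision")}
for tick in range(1000):
    for name in due_tasks(tick):
        counts[name] += 1
```

Between two vision updates the controller runs 100 times, so it must act on state estimates rather than wait for fresh images.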

Accuracy vs Compute Tradeoff

Object Detection Example:

Model	Input Size	Latency (Jetson Xavier)	mAP	Use Case
YOLOv5-nano	320×320	8ms	28.4	Lightweight, embedded
YOLOv5-s	640×640	25ms	37.4	Balanced
YOLOv5-x	640×640	150ms	50.7	High accuracy, GPU
EfficientDet-D7	1536×1536	300ms	52.2	Offline processing

Engineering Decision:

Mobile robot navigation: YOLOv5-nano (real-time, object presence matters more than precision)
Manipulation: YOLOv5-x (accuracy critical, 150ms acceptable for planning)
Warehouse picking: EfficientDet-D7 (offline scene understanding)
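One way to encode this kind of decision is to pick the most accurate model that fits a latency budget. The sketch below uses the latency and mAP figures from the table above; the selection rule itself is an illustrative simplification, since real choices also weigh memory, power, and the cost of a missed detection.

```python
# (name, latency in ms on Jetson Xavier, mAP), taken from the table above
MODELS = [
    ("YOLOv5-nano", 8, 28.4),
    ("YOLOv5-s", 25, 37.4),
    ("YOLOv5-x", 150, 50.7),
    ("EfficientDet-D7", 300, 52.2),
]

def pick_model(latency_budget_ms, models=MODELS):
    """Return the highest-mAP model whose latency fits the budget, or None."""
    feasible = [m for m in models if m[1] <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m[2])[0]
```

For example, a 30ms budget selects YOLOv5-s, and a 150ms planning budget selects YOLOv5-x.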

Resolution vs Speed

Camera Resolution:

640×480 (VGA): 30 FPS, low compute, small objects missed
1920×1080 (FHD): 30 FPS, medium compute, good for manipulation
3840×2160 (4K): 15-30 FPS, high compute, inspection tasks

Depth Sensor Resolution:

320×240: Real-time, coarse geometry
640×480: Standard, balanced
1280×720: High-res, slow, precise measurements

Tradeoff: Higher resolution enables finer details but reduces frame rate and increases latency.

Real-World Failure Modes

1. Lighting Variation

Problem: Vision models trained on well-lit indoor scenes fail in:

Direct sunlight (overexposure, shadows)
Low light (underexposure, noise)
Backlighting (silhouettes, lost detail)

Mitigation:

Hardware: HDR cameras (120dB dynamic range)
Software: Adaptive exposure control, tone mapping
Training: Data augmentation with lighting variation
Redundancy: LiDAR (lighting-invariant) + camera

Example: Outdoor delivery robot

Camera fails to detect obstacles in direct sunlight
LiDAR provides backup obstacle detection
Fusion: Use LiDAR when camera confidence low

2. Occlusion and Partial Views

Problem: Object detectors assume full object visibility.

Partial occlusion → low confidence or misclassification
Total occlusion → missed detection

Mitigation:

Multi-view fusion: Cameras at different angles
Temporal integration: Track objects across time (object persistence)
Predictive models: Estimate occluded object location

Example: Warehouse picking

Object partially hidden behind another
Single viewpoint: 40% detection rate
Two viewpoints (left + right): 85% detection rate

3. Dynamic Range Limitations

Problem: Sensors have limited dynamic range.

Bright + dark regions in same scene → saturated or underexposed
Example: Looking from indoor (dark) to outdoor (bright) through window

Mitigation:

HDR imaging: Capture multiple exposures, merge
Active lighting: Structured light, flash
Sensor selection: Event cameras (120dB vs 60dB for standard)
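The decibel figures are easier to compare once converted to contrast ratios. The sketch below uses the 20·log10 convention that image-sensor dynamic range is normally quoted in; the function names are hypothetical.

```python
import math

def dynamic_range_db(max_signal, min_signal):
    """Dynamic range in dB for a brightest-to-darkest signal ratio."""
    return 20.0 * math.log10(max_signal / min_signal)

def contrast_ratio(db):
    """Brightest-to-darkest ratio that a given dynamic range in dB can represent."""
    return 10.0 ** (db / 20.0)

standard = contrast_ratio(60.0)   # about 1,000 : 1
hdr = contrast_ratio(120.0)       # about 1,000,000 : 1
```

Going from 60dB to 120dB therefore means a thousand-fold wider range of brightness captured in a single scene, not merely twice as much.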

4. Motion Blur

Problem: Moving camera or fast-moving objects → blurred images.

Rolling-shutter cameras: Rows are exposed at different times, adding skew and wobble on top of blur
Impact: Blur reduces feature quality and degrades tracking

Mitigation:

Faster shutter speed: Reduce exposure time (requires more light or sensor gain)
Global shutter cameras: Entire frame exposed simultaneously (expensive)
Event cameras: No motion blur (pixel-level temporal resolution)
Multi-frame fusion: Combine multiple blurred images
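The exposure-time tradeoff can be estimated with a back-of-the-envelope model. For a rotating camera, a point near the image centre moves at roughly focal length (in pixels) times angular rate, so blur length is that speed times the exposure. The sketch assumes pure rotation and small angles; the function names and example numbers are illustrative.

```python
def rotational_blur_px(focal_px, angular_rate_rad_s, exposure_s):
    """Approximate blur length in pixels near the image centre for a rotating camera."""
    return focal_px * angular_rate_rad_s * exposure_s

def max_exposure_s(focal_px, angular_rate_rad_s, max_blur_px=1.0):
    """Longest exposure that keeps blur under max_blur_px."""
    return max_blur_px / (focal_px * angular_rate_rad_s)

# Assumed example: 600 px focal length, camera rotating at 2 rad/s
blur_10ms = rotational_blur_px(600.0, 2.0, 0.010)
limit = max_exposure_s(600.0, 2.0, max_blur_px=1.0)
```

With these assumed numbers, a 10ms exposure smears features over about 12 pixels, and keeping blur under one pixel requires an exposure below roughly 0.83ms, which is why fast motion demands strong lighting or higher sensor gain.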

Example: Quadcopter

High-speed motion → motion blur
IMU provides motion compensation for visual odometry
Event camera backup for aggressive maneuvers

5. Sensor Failures and Degradation

Problem: Sensors fail or degrade over time.

Dirty lens (dust, water, oil)
Sensor drift (IMU bias, encoder wear)
Hardware failure (cable disconnection, power loss)

Mitigation:

Health Monitoring: Track sensor statistics (noise level, update rate)
Anomaly Detection: Detect out-of-distribution sensor readings
Graceful Degradation: Reduce functionality vs complete failure
Redundancy: Multiple sensors per modality
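Health monitoring can start with something as simple as checking each sensor's update rate and staleness. The class below is a minimal sketch; the 20% tolerance, the window size, and the rule that a sensor is stale after two missed periods are assumptions, not established thresholds.

```python
from collections import deque

class RateMonitor:
    """Tracks the update rate of one sensor from its message timestamps."""

    def __init__(self, expected_hz, tolerance=0.2, window=20):
        self.expected_hz = expected_hz
        self.tolerance = tolerance
        self.stamps = deque(maxlen=window)  # most recent timestamps, in seconds

    def update(self, timestamp_s):
        self.stamps.append(timestamp_s)

    def rate_hz(self):
        if len(self.stamps) < 2 or self.stamps[-1] <= self.stamps[0]:
            return 0.0
        return (len(self.stamps) - 1) / (self.stamps[-1] - self.stamps[0])

    def healthy(self, now_s):
        if not self.stamps:
            return False
        # Stale: no message for more than two expected periods
        stale = now_s - self.stamps[-1] > 2.0 / self.expected_hz
        rate_ok = abs(self.rate_hz() - self.expected_hz) <= self.tolerance * self.expected_hz
        return rate_ok and not stale
```

A supervisor can poll `healthy()` for every sensor and trigger graceful degradation, such as switching to redundant sensors, when one reports unhealthy.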

Example: Autonomous vehicle

Front camera dirty → reduced confidence
System switches to side cameras + LiDAR
Alerts operator to clean camera
Does not attempt highway driving (reduced capability mode)

Engineering Best Practices

1. Calibration Pipeline

Camera Calibration:

Intrinsic: Focal length, principal point, distortion coefficients
Extrinsic: Position and orientation relative to robot base
Frequency: Initial + after mechanical impacts + quarterly
Tool: Checkerboard pattern, OpenCV calibration

Multi-Camera Calibration:

Spatial: Relative poses between cameras
Temporal: Time synchronization (critical for fusion)
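Challenge: Cameras without overlapping fields of view cannot observe one target at the same time, so the calibration target must be moved through each camera's view in sequence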

Sensor-to-Robot Calibration:

Hand-eye calibration: Camera on robot arm → solve AX=XB problem
Validation: Repeatability test (move to same pose, measure consistency)

2. Temporal Synchronization

Problem: Sensors operate at different rates with different delays.

Camera: 30 Hz, 33ms latency
LiDAR: 10 Hz, 50ms latency
IMU: 200 Hz, 5ms latency

Solution: Timestamp-based fusion

Assign timestamp to each measurement
Interpolate/extrapolate to common time
Account for latency in fusion algorithm

Implementation:

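import numpy as np

def fuse_sensors(camera_data, lidar_data, target_time, fusion_algorithm):
    # camera_data and lidar_data are (timestamps, values) pairs with ascending
    # timestamps and scalar values; fusion_algorithm is supplied by the caller
    cam_t, cam_v = camera_data
    lidar_t, lidar_v = lidar_data
    # Linearly interpolate the camera signal to the target time
    cam_interpolated = np.interp(target_time, cam_t, cam_v)
    # Use the LiDAR measurement closest to the target time
    closest = int(np.argmin(np.abs(np.asarray(lidar_t) - target_time)))
    # Fuse the time-aligned measurements
    return fusion_algorithm(cam_interpolated, lidar_v[closest])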

3. Uncertainty Quantification

Perception outputs should include confidence:

Object detection: Bounding box + class + confidence score
Pose estimation: 6D pose + covariance matrix
Depth estimation: Depth map + uncertainty map

Use in downstream modules:

Path planning: Avoid uncertain obstacles
Grasp planning: Reject low-confidence grasp candidates
Human oversight: Request human confirmation for uncertain decisions
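Downstream modules can consume these confidence values with very little code. The helpers below are a hedged sketch: the 2-sigma inflation, the confidence thresholds, and all function names are assumptions chosen to illustrate the three uses listed above.

```python
def inflated_radius(radius_m, position_std_m, k_sigma=2.0):
    """Path planning: grow an obstacle's keep-out radius by k standard deviations."""
    return radius_m + k_sigma * position_std_m

def usable_grasps(candidates, min_confidence=0.8):
    """Grasp planning: drop low-confidence candidates, best first.
    candidates: list of (grasp_id, confidence)."""
    kept = [c for c in candidates if c[1] >= min_confidence]
    return sorted(kept, key=lambda c: c[1], reverse=True)

def needs_human_confirmation(confidence, low=0.5, high=0.8):
    """Human oversight: escalate decisions that are neither clearly good nor clearly bad."""
    return low <= confidence < high
```

The common thread is that uncertainty changes behaviour: an uncertain obstacle gets a wider berth, an uncertain grasp is skipped, and a borderline decision is escalated.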

4. Computational Resource Management

Heterogeneous Compute:

CPU: Sequential logic, scheduling, I/O
GPU: Parallel inference (deep learning), image processing
FPGA: Ultra-low latency sensor processing, custom algorithms

Pipeline Optimization:

Batching: Process multiple frames together (GPU efficiency)
Quantization: INT8 inference (up to roughly 4× speedup vs FP32 on hardware with INT8 support, usually with small accuracy loss after calibration)
Model pruning: Remove redundant weights (smaller model, faster inference)
Early exit: Skip expensive processing if early stages fail (e.g., no objects detected)
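To show what INT8 quantization does to the weights themselves, here is a minimal symmetric per-tensor scheme in NumPy. It is a sketch of the idea only: deployment toolchains typically add calibration data, per-channel scales, and activation quantization, and the function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float weights -> (int8 values, scale)."""
    scale = float(np.max(np.abs(weights))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate float weights."""
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 101).astype(np.float32)
q, scale = quantize_int8(w)
max_error = float(np.max(np.abs(w - dequantize(q, scale))))
```

The rounding error per weight is bounded by half the scale, which is why tensors with a few large outliers quantize poorly: the outliers stretch the scale and coarsen every other weight.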

Key Takeaways

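Perception pipelines transform raw sensor data into structured representations through stages: preprocessing → inference → post-processing, with a total latency of roughly 40-120ms for the camera pipeline described above, and more for heavier models such as Mask R-CNN.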
Multimodal fusion combines vision, depth, LiDAR, proprioceptive, and tactile sensing using early fusion (joint features), late fusion (modular), or hierarchical fusion (balanced).
Latency-accuracy tradeoffs require careful engineering: allocate fast processing (1-10ms) to control-critical loops, slower processing (100ms+) to high-level planning.
Real-world failure modes include lighting variation, occlusion, dynamic range limits, motion blur, and sensor degradation—requiring HDR cameras, multi-view fusion, event cameras, and redundancy.
Calibration, temporal synchronization, uncertainty quantification, and compute optimization are essential engineering practices for robust production systems.
Modality priorities shift across task phases: vision-dominant pre-grasp, vision+proprioception during approach, tactile-dominant at contact.
System-level latency budgets allocate 1ms for critical control loops, 10-100ms for perception, 1000ms for planning—enabling real-time operation under tight constraints.

Next Chapter: Control systems—PID, MPC, and learning-based control for Physical AI.
