Perception Pipelines for Physical AI
Problem Framing
Perception transforms raw sensor data into actionable representations. Unlike purely digital AI, where inputs are usually clean and structured, Physical AI perception must handle:
- Noisy sensors: Gaussian noise, outliers, drift
- Partial observability: Occlusions, limited field of view
- Real-time constraints: 30-100ms latency budgets
- Multimodal fusion: Combining vision, depth, tactile, proprioceptive data
Core Challenge: Build robust perception under uncertainty while meeting hard real-time deadlines.
Vision Pipeline Architecture
RGB Image Processing
Pipeline Stages:
Camera → Debayering → Undistortion → Preprocessing → Neural Network → Post-processing → Output
1. Debayering: Convert Bayer pattern to RGB (10ms)
- Raw sensor data is mosaic of R/G/B pixels
- Interpolation reconstructs full RGB image
- Quality vs speed: bilinear (fast) vs edge-aware (slow)
2. Undistortion: Correct lens distortion (5ms)
- Apply camera calibration parameters
- Remap pixels using lookup table (fast) or per-pixel computation (accurate)
- Critical for metric measurements
3. Preprocessing: Normalize for neural network (2ms)
- Resize to network input (e.g., 640×640)
- Normalize pixel values [0,255] → [0,1] or standardize (mean=0, std=1)
- Channel reordering if needed (e.g., RGB → BGR for models trained on BGR input)
4. Neural Network Inference: Object detection/segmentation (20-100ms)
- YOLO: Single-shot detector, 10-50ms, real-time capable
- Mask R-CNN: Instance segmentation, 50-200ms, high accuracy
- Vision Transformers (ViT): 30-100ms, SOTA accuracy but compute-heavy
5. Post-processing: Filter and refine outputs (5ms)
- Non-maximum suppression (remove duplicate detections)
- Confidence thresholding (filter low-confidence detections)
- Tracking (associate detections across frames)
Total Latency: 40-120ms (too slow for low-level control, sufficient for planning)
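The post-processing stage can be illustrated with a minimal sketch of confidence thresholding followed by greedy non-maximum suppression. The function names, the `(x1, y1, x2, y2)` box layout, and the default thresholds are assumptions for illustration, and the sketch is class-agnostic (a real detector would usually run NMS per class).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, conf_thresh=0.5, iou_thresh=0.5):
    """Confidence thresholding followed by greedy NMS.

    detections: list of (box, score) pairs. Thresholds are illustrative.
    """
    # Drop low-confidence detections, then visit the rest best-first
    candidates = sorted(
        (d for d in detections if d[1] >= conf_thresh),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box
        if all(iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append((box, score))
    return kept
```

Two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives untouched.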
Depth Estimation
Stereo Vision:
Left Image + Right Image → Disparity Map → Depth Map
Algorithm: Block matching or semi-global matching (SGM)
- Compare image patches between left/right images
- Disparity d = baseline × focal_length / depth, so depth = baseline × focal_length / d (d and focal_length in pixels)
- Accuracy: 1-5% of distance
- Latency: 20-50ms (GPU-accelerated)
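The disparity relation above can be turned into a short worked example. This is a sketch under stated assumptions: the function names and the 0.25-pixel matching error are illustrative, and the error formula is the first-order derivative of depth with respect to disparity.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth Z = f * B / d; zero or negative disparity maps to infinity."""
    if disparity_px <= 0:
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


def depth_error(depth_m, focal_length_px, baseline_m, disparity_error_px=0.25):
    """First-order depth uncertainty: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_length_px * baseline_m) * disparity_error_px
```

With a 640-pixel focal length and a 5 cm baseline, a 16-pixel disparity corresponds to 2.0 m. At that range a 0.25-pixel matching error gives about 3 cm of depth error (roughly 1.6%), and the error grows with the square of the distance.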
Monocular Depth:
Single RGB Image → Neural Network → Depth Map
Models: MiDaS, DPT (Dense Prediction Transformer)
- Learned depth from cues (perspective, occlusion, texture)
- Advantage: Single camera, no stereo calibration
- Limitation: Relative depth only (not metric), scale ambiguity
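One common way to work around the scale ambiguity is to align the relative prediction to a handful of metric measurements (for example, sparse LiDAR or stereo points). The sketch below fits a scale and shift by ordinary least squares. It assumes the network output is affinely related to metric depth, which is an approximation, and the function name is hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares scale s and shift t such that s * relative + t ~= metric."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift
```

The fitted pair can then be applied to every pixel of the relative depth map to obtain an approximately metric map.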
LiDAR Processing
Point Cloud Pipeline:
LiDAR → Raw Points → Ground Removal → Clustering → Object Detection → Tracking
1. Ground Removal: Segment ground plane (10ms)
- RANSAC plane fitting
- Elevation-based filtering
- Output: Obstacle points only
2. Clustering: Group points into objects (15ms)
- Euclidean clustering or density-based clustering (e.g., DBSCAN)
- Distance threshold (e.g., 0.5m)
- Output: Point clusters per object
3. Object Detection: Classify clusters (20ms)
- Bounding box fitting (oriented or axis-aligned)
- Shape-based classification (car, pedestrian, cyclist)
- Output: Object class + pose + dimensions
4. Tracking: Associate objects across scans (5ms)
- Kalman filter prediction + data association
- Handle occlusions, object ID management
- Output: Tracked objects with velocities
Total Latency: 50ms @ 20 Hz (typical for automotive LiDAR)
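The ground-removal step can be sketched with a small RANSAC plane fit in pure Python. This is an illustration, not a production implementation: it assumes the ground is the dominant plane in the scan, and the function names, distance threshold, and iteration count are placeholders.

```python
import random


def fit_plane(p1, p2, p3):
    """Plane through three points as (normal, d) with normal . p + d = 0.

    Returns None when the points are (nearly) collinear.
    """
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    n = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-9:
        return None
    n = tuple(c / norm for c in n)
    d = -sum(n[i] * p1[i] for i in range(3))
    return n, d


def remove_ground(points, dist_thresh=0.1, iterations=100, seed=0):
    """Return the points that are NOT inliers of the best plane found."""
    if len(points) < 3:
        return list(points)
    rng = random.Random(seed)
    best_inliers = set()
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        # Inliers lie within dist_thresh of the candidate plane
        inliers = {
            i for i, p in enumerate(points)
            if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) < dist_thresh
        }
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return [p for i, p in enumerate(points) if i not in best_inliers]
```

The remaining obstacle points would then be passed to the clustering stage.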
Multimodal Sensor Fusion
Architecture Patterns
Early Fusion: Combine raw sensor data before processing
RGB + Depth → Concatenate → Neural Network → Output
- Advantage: Network learns joint features
- Disadvantage: Requires synchronized sensors, hard to handle missing modalities
Late Fusion: Process each modality separately, combine outputs
RGB → Detector A → Detections A
Depth → Detector B → Detections B
Detections A + Detections B → Fusion → Final Detections
- Advantage: Modular, handles missing sensors gracefully
- Disadvantage: Misses cross-modal correlations
Hierarchical Fusion: Multi-stage combination
RGB + Depth → Feature Fusion → Mid-level Features → Detection Head → Detections
- Advantage: Balance between early/late fusion
- Example: BEVFusion (autonomous driving), PointPainting (LiDAR + camera)
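The difference between early and late fusion can be made concrete with a toy sketch. Both function names are hypothetical. The late-fusion example assumes detections from the two detectors have already been associated by object ID, and averaging the confidences is just one simple combination rule.

```python
def early_fusion(rgb, depth):
    """Concatenate per-pixel channels: H x W x 3 RGB + H x W depth -> H x W x 4."""
    return [
        [list(px) + [d] for px, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]


def late_fusion(rgb_scores, depth_scores):
    """Average per-object confidence from two detectors.

    Each argument is a dict {object_id: confidence}, or None when that
    sensor is unavailable, in which case the other modality is used alone.
    """
    available = [s for s in (rgb_scores, depth_scores) if s is not None]
    fused = {}
    for obj_id in set().union(*available):
        scores = [s[obj_id] for s in available if obj_id in s]
        fused[obj_id] = sum(scores) / len(scores)
    return fused
```

The early-fusion output would feed a single network that expects four input channels, while the late-fusion function keeps working when one modality drops out.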
Fusion Algorithms
Kalman Filter Fusion:
- Use Case: Fuse IMU + visual odometry for localization
- Principle: Weighted average based on sensor noise covariance
- Update: prediction (IMU) + correction (vision)
- Advantage: Optimal for linear systems with Gaussian noise, real-time
- Limitation: Assumes linear dynamics and Gaussian distributions (EKF and UKF variants relax the linearity assumption)
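The predict/correct cycle can be shown with a one-dimensional Kalman filter. This is a minimal scalar sketch: the function names are illustrative, `u` stands in for an IMU-derived motion increment, and `z` for a vision-based position measurement.

```python
def kf_predict(x, p, u, q):
    """Prediction: shift the state by motion increment u and add process noise q to the variance."""
    return x + u, p + q


def kf_update(x, p, z, r):
    """Correction: blend the prediction with measurement z (variance r) using the Kalman gain."""
    k = p / (p + r)  # gain near 1 trusts the measurement, near 0 trusts the prediction
    return x + k * (z - x), (1 - k) * p
```

When the predicted variance equals the measurement variance the gain is 0.5, so the corrected state lands halfway between prediction and measurement. This is the "weighted average based on sensor noise covariance" in its simplest form.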
Particle Filter Fusion:
- Use Case: Non-linear systems, multi-modal distributions
- Principle: Represent belief as weighted particle cloud
- Update: Reweight particles by sensor likelihood, then resample
- Advantage: Handles non-Gaussian noise, multi-hypothesis tracking
- Limitation: Computationally expensive (thousands of particles)
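A single particle filter step for a one-dimensional state might look like the sketch below. It assumes a Gaussian measurement model and additive Gaussian motion noise. The function name and parameters are illustrative, and it resamples on every step, which real implementations often avoid.

```python
import math
import random


def particle_filter_step(particles, control, measurement, motion_std, meas_std, rng):
    """One predict / weight / resample cycle for a 1D state."""
    # Predict: apply the control input with sampled motion noise
    predicted = [p + control + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each particle
    weights = [
        math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in predicted
    ]
    if sum(weights) == 0.0:
        # All likelihoods underflowed: fall back to uniform weights
        weights = [1.0] * len(predicted)
    # Resample: draw particles in proportion to their weights
    return rng.choices(predicted, weights=weights, k=len(predicted))
```

Starting from particles spread uniformly over the workspace, a few steps with a consistent measurement concentrate the cloud around the measured value.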
Neural Fusion:
- Use Case: Complex cross-modal correlations
- Principle: Learned attention weights across modalities
- Architecture: Transformer-based cross-attention
- Advantage: End-to-end learned, handles novel correlations
- Limitation: Requires large training data, opaque reasoning
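The attention mechanism behind neural fusion can be illustrated without a deep learning framework. The sketch below computes scaled dot-product attention of one query vector over one key/value vector per modality. In a trained model the queries, keys, and values come from learned projections, while here they are passed in directly as an assumption.

```python
import math


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query over per-modality keys/values.

    Returns (fused_vector, attention_weights).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over modalities
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights
```

The modality whose key aligns best with the query receives the largest weight, so its value vector dominates the fused feature.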
Practical Example: Grasping Pipeline
Sensors:
- RGB-D camera (RealSense D435)
- Wrist force/torque sensor
- Joint encoders
Pipeline:
Phase 1: Pre-Grasp (Vision-Dominant)
- RGB-D → Object detection + 6D pose estimation (50ms)
- Point cloud → Grasp candidate generation (30ms)
- Grasp ranking → Select top grasp (10ms)
Phase 2: Approach (Vision + Proprioception)
- Visual servoing: Track object in image → velocity commands (30 Hz)
- Joint encoders → Forward kinematics → End-effector pose (1 kHz)
- Fuse visual + kinematic estimates → Kalman filter (100 Hz)
Phase 3: Contact (Tactile-Dominant)
- Force sensor → Detect contact (1 kHz)
- Adjust gripper force based on contact (impedance control)
- Vision confirms grasp success (post-grasp verification)
Key Insight: Modality priorities shift across task phases.
Latency and Accuracy Tradeoffs
System-Level Latency Budget
Example: Humanoid Walking
- Control Loop: 1 kHz (1ms cycle time)
- State Estimation: 100 Hz (10ms latency acceptable)
- Vision Processing: 10 Hz (100ms latency acceptable)
Allocation:
- IMU → State estimator: 1ms (critical path)
- Vision → Object detector: 100ms (non-critical)
- Planner → Footstep planner: 1000ms (runs asynchronously)
Principle: Slower modalities inform higher-level decisions, while fast modalities enable low-level control.
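A latency budget like this can be checked with simple arithmetic. The helpers below are a hypothetical sketch: one computes how many control ticks pass between updates of a slower module, the other checks whether a chain of sequential stages fits within one period of its loop.

```python
def ticks_between_updates(control_hz, module_hz):
    """Number of control-loop ticks between consecutive updates of a slower module."""
    return round(control_hz / module_hz)


def fits_budget(stage_latencies_ms, rate_hz):
    """True if the sequential stages complete within one period of the loop."""
    period_ms = 1000.0 / rate_hz
    return sum(stage_latencies_ms) <= period_ms
```

With a 1 kHz control loop, the 100 Hz state estimator refreshes every 10 ticks and the 10 Hz vision module every 100 ticks. Using the vision stage estimates from earlier in the chapter, the lower bound (10 + 5 + 2 + 20 + 5 = 42ms) fits a 10 Hz period, while the upper bound (122ms) does not unless the stages are pipelined.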
Accuracy vs Compute Tradeoff
Object Detection Example:
| Model | Input Size | Latency (Jetson Xavier) | mAP | Use Case |
|---|---|---|---|---|
| YOLOv5-nano | 320×320 | 8ms | 28.4 | Lightweight, embedded |
| YOLOv5-s | 640×640 | 25ms | 37.4 | Balanced |
| YOLOv5-x | 640×640 | 150ms | 50.7 | High accuracy, GPU |
| EfficientDet-D7 | 1536×1536 | 300ms | 52.2 | Offline processing |
Engineering Decision:
- Mobile robot navigation: YOLOv5-nano (real-time, object presence matters more than precision)
- Manipulation: YOLOv5-x (accuracy critical, 150ms acceptable for planning)
- Warehouse picking: EfficientDet-D7 (offline scene understanding)
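The engineering decision above amounts to picking the most accurate model that fits the latency budget. A minimal sketch, using the latency and mAP values from the table (the `MODELS` list and `select_model` name are illustrative):

```python
# (name, latency in ms on Jetson Xavier, mAP) copied from the table above
MODELS = [
    ("YOLOv5-nano", 8, 28.4),
    ("YOLOv5-s", 25, 37.4),
    ("YOLOv5-x", 150, 50.7),
    ("EfficientDet-D7", 300, 52.2),
]


def select_model(latency_budget_ms, models=MODELS):
    """Return the highest-mAP model within the budget, or None if none fits."""
    feasible = [m for m in models if m[1] <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m[2])[0]
```

A 30ms budget selects YOLOv5-s and a 150ms budget selects YOLOv5-x. In practice the choice would also weigh memory, power, and the object sizes that matter for the task.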
Resolution vs Speed
Camera Resolution:
- 640×480 (VGA): 30 FPS, low compute, small objects missed
- 1920×1080 (FHD): 30 FPS, medium compute, good for manipulation
- 3840×2160 (4K): 15-30 FPS, high compute, inspection tasks
Depth Sensor Resolution:
- 320×240: Real-time, coarse geometry
- 640×480: Standard, balanced
- 1280×720: High-res, slow, precise measurements
Tradeoff: Higher resolution enables finer details but reduces frame rate and increases latency.
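The compute cost of resolution is easiest to see as pixel throughput. The helper below is an illustrative sketch that treats per-frame processing cost as roughly proportional to pixel count, which is an approximation.

```python
def pixel_rate(width, height, fps):
    """Pixels per second that the pipeline must process."""
    return width * height * fps


def relative_cost(width, height, fps, ref=(640, 480, 30)):
    """Pixel throughput relative to a reference stream (VGA at 30 FPS by default)."""
    return pixel_rate(width, height, fps) / pixel_rate(*ref)
```

At the same 30 FPS, FHD carries 6.75 times the pixels of VGA and 4K carries 27 times, which is why high-resolution streams are often downscaled or cropped to a region of interest before inference.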
Real-World Failure Modes
1. Lighting Variation
Problem: Vision models trained on well-lit indoor scenes fail in:
- Direct sunlight (overexposure, shadows)
- Low light (underexposure, noise)
- Backlighting (silhouettes, lost detail)
Mitigation:
- Hardware: HDR cameras (120dB dynamic range)
- Software: Adaptive exposure control, tone mapping
- Training: Data augmentation with lighting variation
- Redundancy: LiDAR (lighting-invariant) + camera
Example: Outdoor delivery robot
- Camera fails to detect obstacles in direct sunlight
- LiDAR provides backup obstacle detection
- Fusion: Use LiDAR when camera confidence is low
2. Occlusion and Partial Views
Problem: Object detectors assume full object visibility.
- Partial occlusion → low confidence or misclassification
- Total occlusion → missed detection
Mitigation:
- Multi-view fusion: Cameras at different angles
- Temporal integration: Track objects across time (object persistence)
- Predictive models: Estimate occluded object location
Example: Warehouse picking
- Object partially hidden behind another
- Single viewpoint: 40% detection rate
- Two viewpoints (left + right): 85% detection rate
3. Dynamic Range Limitations
Problem: Sensors have limited dynamic range.
- Bright + dark regions in same scene → saturated or underexposed
- Example: Looking from indoor (dark) to outdoor (bright) through window
Mitigation:
- HDR imaging: Capture multiple exposures, merge
- Active lighting: Structured light, flash
- Sensor selection: Event cameras (120dB vs 60dB for standard)
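The decibel figures can be converted to contrast ratios to make the gap concrete. Image sensor dynamic range is conventionally expressed as 20·log10 of the ratio between the largest and smallest resolvable signal. The function names below are illustrative.

```python
import math


def dynamic_range_db(max_signal, min_signal):
    """Dynamic range in dB from the largest and smallest resolvable signal."""
    return 20.0 * math.log10(max_signal / min_signal)


def contrast_ratio(db):
    """Contrast ratio (max:min) corresponding to a dynamic range in dB."""
    return 10.0 ** (db / 20.0)
```

A 60dB sensor spans a 1,000:1 ratio, while 120dB spans 1,000,000:1, so the same scene can contain regions a thousand times brighter before saturating.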
4. Motion Blur
Problem: Moving camera or fast-moving objects → blurred images.
- Rolling-shutter cameras: Rows are exposed at different times, adding skew artifacts on top of blur
- Impact: Blur reduces feature quality and degrades tracking
Mitigation:
- Faster shutter speed: Reduce exposure time (requires more light or sensor gain)
- Global shutter cameras: Entire frame exposed simultaneously (expensive)
- Event cameras: No motion blur (pixel-level temporal resolution)
- Multi-frame fusion: Combine multiple blurred images
Example: Quadcopter
- High-speed motion → motion blur
- IMU provides motion compensation for visual odometry
- Event camera backup for aggressive maneuvers
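The exposure-time mitigation can be sized with a back-of-the-envelope estimate. For a rotating camera, image motion near the optical center is approximately the focal length (in pixels) times the angular rate, so blur is that rate times the exposure time. The function names are illustrative and the small-angle approximation is an assumption.

```python
def blur_pixels(focal_length_px, angular_rate_rad_s, exposure_s):
    """Approximate blur length in pixels for a rotating camera (small-angle model)."""
    return focal_length_px * angular_rate_rad_s * exposure_s


def max_exposure_s(focal_length_px, angular_rate_rad_s, max_blur_px=1.0):
    """Longest exposure that keeps blur under max_blur_px."""
    return max_blur_px / (focal_length_px * angular_rate_rad_s)
```

With a 600-pixel focal length and a 2 rad/s rotation, a 10ms exposure smears features over about 12 pixels. Keeping blur under one pixel requires an exposure below roughly 0.83ms, which is why fast motion demands more light or higher gain.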
5. Sensor Failures and Degradation
Problem: Sensors fail or degrade over time.
- Dirty lens (dust, water, oil)
- Sensor drift (IMU bias, encoder wear)
- Hardware failure (cable disconnection, power loss)
Mitigation:
- Health Monitoring: Track sensor statistics (noise level, update rate)
- Anomaly Detection: Detect out-of-distribution sensor readings
- Graceful Degradation: Reduce functionality rather than fail completely
- Redundancy: Multiple sensors per modality
Example: Autonomous vehicle
- Front camera dirty → reduced confidence
- System switches to side cameras + LiDAR
- Alerts operator to clean camera
- Does not attempt highway driving (reduced capability mode)
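Health monitoring can start from two cheap statistics: update rate and noise level. The sketch below is a simplified illustration. The function name, status strings, and tolerance are assumptions, and using the standard deviation of raw values as a noise proxy only makes sense when the underlying signal is roughly constant over the window.

```python
from statistics import pstdev


def sensor_health(timestamps, values, expected_hz, max_noise_std, rate_tolerance=0.2):
    """Classify a sensor as 'ok', 'degraded', or 'failed' from a recent window."""
    if len(timestamps) < 2:
        return "failed"  # no data, or too little to estimate a rate
    duration = timestamps[-1] - timestamps[0]
    if duration <= 0:
        return "failed"  # timestamps are not advancing
    rate = (len(timestamps) - 1) / duration
    if rate < expected_hz * (1 - rate_tolerance):
        return "degraded"  # dropped frames or a slow link
    if pstdev(values) > max_noise_std:
        return "degraded"  # noise above the expected level
    return "ok"
```

A supervisor could map "degraded" to a reduced-capability mode and "failed" to a switch onto redundant sensors.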
Engineering Best Practices
1. Calibration Pipeline
Camera Calibration:
- Intrinsic: Focal length, principal point, distortion coefficients
- Extrinsic: Position and orientation relative to robot base
- Frequency: Initial + after mechanical impacts + quarterly
- Tool: Checkerboard pattern, OpenCV calibration
Multi-Camera Calibration:
- Spatial: Relative poses between cameras
- Temporal: Time synchronization (critical for fusion)
- Challenge: Cameras without overlapping fields of view cannot observe one target simultaneously, so the calibration target must be moved through each view in sequence
Sensor-to-Robot Calibration:
- Hand-eye calibration: Camera on robot arm → solve AX=XB problem
- Validation: Repeatability test (move to same pose, measure consistency)
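The repeatability test can be scored with a simple metric: the largest deviation of repeated measurements from their mean. This is a hypothetical sketch that assumes each measurement is a 3D position of the same physical point observed after returning to the same commanded pose.

```python
import math


def repeatability(measurements):
    """Maximum distance of any measurement from the mean position.

    measurements: list of (x, y, z) tuples in meters.
    """
    n = len(measurements)
    mean = [sum(m[i] for m in measurements) / n for i in range(3)]
    return max(math.dist(m, mean) for m in measurements)
```

A result that is large relative to the task tolerance suggests the calibration, or the mechanical mounting, should be revisited.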
2. Temporal Synchronization
Problem: Sensors operate at different rates with different delays.
- Camera: 30 Hz, 33ms latency
- LiDAR: 10 Hz, 50ms latency
- IMU: 200 Hz, 5ms latency
Solution: Timestamp-based fusion
- Assign timestamp to each measurement
- Interpolate/extrapolate to common time
- Account for latency in fusion algorithm
Implementation:
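def fuse_sensors(camera_data, lidar_data, target_time):
    """Fuse camera and LiDAR measurements at a common timestamp.

    interpolate, find_closest, and fusion_algorithm are application-specific
    helpers that must be defined elsewhere.
    """
    # Interpolate camera data to the target time
    cam_interpolated = interpolate(camera_data, target_time)
    # Use the LiDAR measurement closest to the target time
    lidar_closest = find_closest(lidar_data, target_time)
    # Fuse the time-aligned measurements
    return fusion_algorithm(cam_interpolated, lidar_closest)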
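A self-contained variant of the function above, for scalar measurements, could look like the following. The helper implementations are assumptions made for illustration: samples are time-sorted `(timestamp, value)` pairs, interpolation is linear with clamping at the ends, and the fusion step is a fixed weighted average rather than a covariance-based filter.

```python
from bisect import bisect_left


def interpolate(samples, t):
    """Linear interpolation in a time-sorted list of (timestamp, value) pairs."""
    times = [s[0] for s in samples]
    i = bisect_left(times, t)
    if i == 0:
        return samples[0][1]  # before the first sample: clamp
    if i == len(samples):
        return samples[-1][1]  # after the last sample: clamp
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def find_closest(samples, t):
    """Value of the sample whose timestamp is nearest to t."""
    return min(samples, key=lambda s: abs(s[0] - t))[1]


def fuse_sensors(camera_data, lidar_data, target_time, camera_weight=0.5):
    """Time-align both sensors, then combine with a fixed weighted average."""
    cam = interpolate(camera_data, target_time)
    lidar = find_closest(lidar_data, target_time)
    return camera_weight * cam + (1 - camera_weight) * lidar
```

In a real system the fixed weight would typically be replaced by weights derived from each sensor's noise covariance, and the known sensor latency would be subtracted from each timestamp before alignment.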
3. Uncertainty Quantification
Perception outputs should include confidence:
- Object detection: Bounding box + class + confidence score
- Pose estimation: 6D pose + covariance matrix
- Depth estimation: Depth map + uncertainty map
Use in downstream modules:
- Path planning: Avoid uncertain obstacles
- Grasp planning: Reject low-confidence grasp candidates
- Human oversight: Request human confirmation for uncertain decisions
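Downstream use of uncertainty can be as simple as thresholds and safety margins. The helpers below are hypothetical examples of the three uses listed above. The threshold values and the two-sigma margin are placeholders, not recommendations.

```python
def inflated_radius(radius_m, position_std_m, n_sigma=2.0):
    """Path planning: grow an obstacle's radius by n_sigma standard deviations."""
    return radius_m + n_sigma * position_std_m


def filter_grasps(candidates, min_confidence=0.8):
    """Grasp planning: keep only (grasp, confidence) pairs above the threshold."""
    return [(g, c) for g, c in candidates if c >= min_confidence]


def needs_human_confirmation(confidence, threshold=0.6):
    """Human oversight: flag decisions whose confidence is below the threshold."""
    return confidence < threshold
```

An obstacle with a 0.3m radius and 0.1m position uncertainty would be treated as 0.5m wide in radius by the planner under these settings.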
4. Computational Resource Management
Heterogeneous Compute:
- CPU: Sequential logic, scheduling, I/O
- GPU: Parallel inference (deep learning), image processing
- FPGA: Ultra-low latency sensor processing, custom algorithms
Pipeline Optimization:
- Batching: Process multiple frames together (GPU efficiency)
- Quantization: INT8 inference (up to about 4× speedup vs FP32 on hardware with INT8 support, usually with small accuracy loss)
- Model pruning: Remove redundant weights (smaller model, faster inference)
- Early exit: Skip expensive processing if early stages fail (e.g., no objects detected)
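The idea behind INT8 quantization can be shown with symmetric per-tensor quantization on a plain list of weights. This is a conceptual sketch with illustrative function names. Real deployments rely on a toolkit's calibration and quantized kernels rather than code like this.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns (quantized_values, scale) where value ~= quantized * scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate floating-point values."""
    return [x * scale for x in q]
```

The round-trip error per value is bounded by half the scale, which is why tensors with a few large outliers (and therefore a large scale) tend to lose the most accuracy.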
Key Takeaways
- Perception pipelines transform raw sensor data into structured representations through preprocessing → inference → post-processing stages, with a total latency of roughly 40-120ms for the vision pipeline described above.
- Multimodal fusion combines vision, depth, LiDAR, proprioceptive, and tactile sensing using early fusion (joint features), late fusion (modular), or hierarchical fusion (balanced).
- Latency-accuracy tradeoffs require careful engineering: allocate fast processing (1-10ms) to control-critical loops and slower processing (100ms+) to high-level planning.
- Real-world failure modes include lighting variation, occlusion, dynamic range limits, motion blur, and sensor degradation, which call for HDR cameras, multi-view fusion, event cameras, and redundancy.
- Calibration, temporal synchronization, uncertainty quantification, and compute optimization are essential engineering practices for robust production systems.
- Modality priorities shift across task phases: vision-dominant pre-grasp, vision + proprioception during approach, tactile-dominant at contact.
- System-level latency budgets allocate 1ms for critical control loops, 10-100ms for perception, and 1000ms for planning, enabling real-time operation under tight constraints.
Next Chapter: Control systems—PID, MPC, and learning-based control for Physical AI.