Perception Pipelines for Physical AI

Problem Framing

Perception transforms raw sensor data into actionable representations. Unlike purely virtual AI, where inputs are usually clean and structured, Physical AI perception must handle:

  • Noisy sensors: Gaussian noise, outliers, drift
  • Partial observability: Occlusions, limited field of view
  • Real-time constraints: 30-100ms latency budgets
  • Multimodal fusion: Combining vision, depth, tactile, proprioceptive data

Core Challenge: Build robust perception under uncertainty while meeting hard real-time deadlines.

Vision Pipeline Architecture

RGB Image Processing

Pipeline Stages:

Camera → Debayering → Undistortion → Preprocessing → Neural Network → Post-processing → Output

1. Debayering: Convert Bayer pattern to RGB (10ms)

  • Raw sensor data is mosaic of R/G/B pixels
  • Interpolation reconstructs full RGB image
  • Quality vs speed: bilinear (fast) vs edge-aware (slow)

2. Undistortion: Correct lens distortion (5ms)

  • Apply camera calibration parameters
  • Remap pixels using lookup table (fast) or per-pixel computation (accurate)
  • Critical for metric measurements

3. Preprocessing: Normalize for neural network (2ms)

  • Resize to network input (e.g., 640×640)
  • Normalize pixel values [0,255] → [0,1] or standardize (mean=0, std=1)
  • Color space conversion if needed (RGB → BGR for some models)
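The normalization options above can be sketched in a few lines. This is only an illustration: it assumes pixels arrive as a list of (R, G, B) integer tuples, and the function names and the mean/std arguments are hypothetical rather than taken from any particular model. Production pipelines would normally do this with array operations on the GPU or in NumPy.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]


def scale_to_unit(pixels: List[Tuple[int, int, int]]) -> List[Pixel]:
    """Map 8-bit channel values in [0, 255] to floats in [0, 1]."""
    return [(r / 255.0, g / 255.0, b / 255.0) for r, g, b in pixels]


def standardize(pixels: List[Pixel], mean: Pixel, std: Pixel) -> List[Pixel]:
    """Per-channel standardization: (value - mean) / std."""
    return [tuple((c - m) / s for c, m, s in zip(p, mean, std)) for p in pixels]


def rgb_to_bgr(pixels: List[Pixel]) -> List[Pixel]:
    """Reverse channel order for models that expect BGR input."""
    return [(b, g, r) for r, g, b in pixels]
```

Whichever variant is used, it has to match what the network saw during training; a mismatch here tends to degrade accuracy silently rather than raise an error.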

4. Neural Network Inference: Object detection/segmentation (20-100ms)

  • YOLO: Single-shot detector, 10-50ms, real-time capable
  • Mask R-CNN: Instance segmentation, 50-200ms, high accuracy
  • Vision Transformers (ViT): 30-100ms, SOTA accuracy but compute-heavy

5. Post-processing: Filter and refine outputs (5ms)

  • Non-maximum suppression (remove duplicate detections)
  • Confidence thresholding (filter low-confidence detections)
  • Tracking (associate detections across frames)
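Confidence thresholding and non-maximum suppression can be sketched as follows. This is a minimal greedy version, assuming boxes are (x1, y1, x2, y2) tuples; the threshold defaults are illustrative assumptions, not values from the text.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        score_thresh: float = 0.25, iou_thresh: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Confidence thresholding: drop low-score detections up front.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in candidates:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

This version ignores class labels; detectors usually run NMS per class so that overlapping objects of different classes are not suppressed.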

Total Latency: roughly 40-120ms with the stage figures above; heavier models such as Mask R-CNN can push this toward 200ms (too slow for low-level control, sufficient for planning)

Depth Estimation

Stereo Vision:

Left Image + Right Image → Disparity Map → Depth Map

Algorithm: Block matching or semi-global matching (SGM)

  • Compare image patches between left/right images
  • Disparity d = baseline × focal_length / depth
  • Accuracy: 1-5% of distance
  • Latency: 20-50ms (GPU-accelerated)
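Inverting the disparity relation above gives depth = baseline × focal_length / disparity. The sketch below applies it and adds a first-order error estimate, which shows why depth error grows with the square of distance. The function names and the default 0.25 px matching error are assumptions made for illustration.

```python
def disparity_to_depth(disparity_px: float, focal_length_px: float,
                       baseline_m: float) -> float:
    """Depth in meters from disparity in pixels: Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_length_px * baseline_m / disparity_px


def depth_error(depth_m: float, focal_length_px: float, baseline_m: float,
                disparity_error_px: float = 0.25) -> float:
    """First-order depth uncertainty: |dZ| = Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_length_px * baseline_m) * disparity_error_px
```

For example, with a 640 px focal length and a 5 cm baseline, a 32 px disparity corresponds to 1 m, and doubling the distance quadruples the expected depth error.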

Monocular Depth:

Single RGB Image → Neural Network → Depth Map

Models: MiDaS, DPT (Dense Prediction Transformer)

  • Learned depth from cues (perspective, occlusion, texture)
  • Advantage: Single camera, no stereo calibration
  • Limitation: Relative depth only (not metric), scale ambiguity

LiDAR Processing

Point Cloud Pipeline:

LiDAR → Raw Points → Ground Removal → Clustering → Object Detection → Tracking

1. Ground Removal: Segment ground plane (10ms)

  • RANSAC plane fitting
  • Elevation-based filtering
  • Output: Obstacle points only
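RANSAC plane fitting for ground removal can be sketched as below. This is a simplified illustration, assuming points are (x, y, z) tuples in meters and that the ground is the largest plane in the scan; the distance threshold, iteration count, and fixed seed are assumptions chosen for the example.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def plane_from_points(p1: Point, p2: Point, p3: Point) -> Optional[Tuple[float, float, float, float]]:
    """Plane a*x + b*y + c*z + d = 0 with unit normal, or None if collinear."""
    ux, uy, uz = p2[0] - p1[0], p2[1] - p1[1], p2[2] - p1[2]
    vx, vy, vz = p3[0] - p1[0], p3[1] - p1[1], p3[2] - p1[2]
    # The normal is the cross product of the two edge vectors.
    a, b, c = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (a * a + b * b + c * c) ** 0.5
    if norm < 1e-9:
        return None
    a, b, c = a / norm, b / norm, c / norm
    return a, b, c, -(a * p1[0] + b * p1[1] + c * p1[2])


def remove_ground(points: List[Point], dist_thresh: float = 0.05,
                  iterations: int = 100, seed: int = 0) -> List[Point]:
    """Return the points that are NOT inliers of the best plane found."""
    if len(points) < 3:
        return list(points)
    rng = random.Random(seed)
    best_inliers: set = set()
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = {i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= dist_thresh}
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return [p for i, p in enumerate(points) if i not in best_inliers]
```

In a scene dominated by a wall rather than the floor, this sketch would remove the wall instead, so practical systems often also constrain the plane normal to be close to vertical.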

2. Clustering: Group points into objects (15ms)

  • Euclidean clustering or DBSCAN (density-based variant)
  • Distance threshold (e.g., 0.5m)
  • Output: Point clusters per object
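Euclidean clustering amounts to finding connected components under a distance threshold. The sketch below is a brute-force O(n²) illustration, assuming points are coordinate tuples; real implementations use a k-d tree or voxel grid for the neighbor search, and the min_points parameter is an assumption added to show noise rejection.

```python
import math
from collections import deque
from typing import List, Sequence, Tuple


def euclidean_cluster(points: Sequence[Tuple[float, ...]],
                      dist_thresh: float = 0.5,
                      min_points: int = 1) -> List[List[int]]:
    """Group point indices so that chained neighbors share a cluster."""
    unvisited = set(range(len(points)))
    clusters: List[List[int]] = []
    while unvisited:
        seed = unvisited.pop()
        queue = deque([seed])
        cluster = [seed]
        while queue:
            i = queue.popleft()
            # Brute-force neighbor search over the remaining points.
            neighbors = [j for j in unvisited
                         if math.dist(points[i], points[j]) <= dist_thresh]
            for j in neighbors:
                unvisited.remove(j)
                queue.append(j)
                cluster.append(j)
        if len(cluster) >= min_points:
            clusters.append(sorted(cluster))
    return clusters
```

Two points farther apart than the threshold can still land in the same cluster if intermediate points link them, which is what lets elongated objects stay in one piece.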

3. Object Detection: Classify clusters (20ms)

  • Bounding box fitting (oriented or axis-aligned)
  • Shape-based classification (car, pedestrian, cyclist)
  • Output: Object class + pose + dimensions

4. Tracking: Associate objects across scans (5ms)

  • Kalman filter prediction + data association
  • Handle occlusions, object ID management
  • Output: Tracked objects with velocities

Total Latency: 50ms @ 20 Hz (typical for automotive LiDAR)

Multimodal Sensor Fusion

Architecture Patterns

Early Fusion: Combine raw sensor data before processing

RGB + Depth → Concatenate → Neural Network → Output
  • Advantage: Network learns joint features
  • Disadvantage: Requires synchronized sensors, hard to handle missing modalities

Late Fusion: Process each modality separately, combine outputs

RGB → Detector A → Detections A
Depth → Detector B → Detections B
Detections A + Detections B → Fusion → Final Detections
  • Advantage: Modular, handles missing sensors gracefully
  • Disadvantage: Misses cross-modal correlations

Hierarchical Fusion: Multi-stage combination

RGB + Depth → Feature Fusion → Mid-level Features → Detection Head → Detections
  • Advantage: Balance between early/late fusion
  • Example: BEVFusion (autonomous driving), PointPainting (LiDAR + camera)

Fusion Algorithms

Kalman Filter Fusion:

  • Use Case: Fuse IMU + visual odometry for localization
  • Principle: Weighted average based on sensor noise covariance
  • Update: prediction (IMU) + correction (vision)
  • Advantage: Optimal for Gaussian noise, real-time
  • Limitation: Assumes linear dynamics, Gaussian distributions
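The predict/correct cycle can be illustrated with a one-dimensional filter. This is a minimal sketch, assuming a scalar position state where the IMU supplies a displacement for prediction and vision supplies a direct position measurement; the class name and noise variances are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    x: float  # state estimate (position, m)
    p: float  # estimate variance (m^2)

    def predict(self, delta: float, process_var: float) -> None:
        """Prediction step: apply IMU-derived displacement, grow uncertainty."""
        self.x += delta
        self.p += process_var

    def correct(self, z: float, meas_var: float) -> float:
        """Correction step: blend in a vision measurement; returns the gain."""
        k = self.p / (self.p + meas_var)  # Kalman gain in [0, 1]
        self.x += k * (z - self.x)        # weighted average of prediction and z
        self.p *= (1.0 - k)               # uncertainty shrinks after the update
        return k
```

The gain makes the "weighted average" principle explicit: when prediction and measurement variances are equal the gain is 0.5, and a noisier camera (larger meas_var) pulls the estimate less.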

Particle Filter Fusion:

  • Use Case: Non-linear systems, multi-modal distributions
  • Principle: Represent belief as weighted particle cloud
  • Update: Weight particles by sensor likelihood, then resample in proportion to weight
  • Advantage: Handles non-Gaussian noise, multi-hypothesis tracking
  • Limitation: Computationally expensive (1000s of particles)
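One particle filter iteration can be sketched for a scalar state. This is an illustrative version only, assuming a Gaussian measurement model and plain multinomial resampling; the function names and noise parameters are assumptions.

```python
import math
import random
from typing import List


def particle_filter_step(particles: List[float], motion: float, motion_std: float,
                         measurement: float, meas_std: float,
                         rng: random.Random) -> List[float]:
    """Predict, weight by measurement likelihood, then resample."""
    # 1. Predict: move every particle and add motion noise.
    moved = [p + motion + rng.gauss(0.0, motion_std) for p in particles]
    # 2. Weight: Gaussian likelihood of the measurement given each particle.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in moved]
    if sum(weights) == 0.0:
        # Every particle is implausible; fall back to uniform weights.
        weights = [1.0] * len(moved)
    # 3. Resample: draw a new set in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def estimate(particles: List[float]) -> float:
    """Point estimate: mean of the particle cloud."""
    return sum(particles) / len(particles)
```

Starting from particles spread uniformly over the workspace, a few iterations with consistent measurements concentrate the cloud around the measured value. Each iteration costs time linear in the particle count, which is where the computational expense comes from.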

Neural Fusion:

  • Use Case: Complex cross-modal correlations
  • Principle: Learned attention weights across modalities
  • Architecture: Transformer-based cross-attention
  • Advantage: End-to-end learned, handles novel correlations
  • Limitation: Requires large training data, opaque reasoning

Practical Example: Grasping Pipeline

Sensors:

  • RGB-D camera (RealSense D435)
  • Wrist force/torque sensor
  • Joint encoders

Pipeline:

Phase 1: Pre-Grasp (Vision-Dominant)

  1. RGB-D → Object detection + 6D pose estimation (50ms)
  2. Point cloud → Grasp candidate generation (30ms)
  3. Grasp ranking → Select top grasp (10ms)

Phase 2: Approach (Vision + Proprioception)

  1. Visual servoing: Track object in image → velocity commands (30 Hz)
  2. Joint encoders → Forward kinematics → End-effector pose (1 kHz)
  3. Fuse visual + kinematic estimates → Kalman filter (100 Hz)

Phase 3: Contact (Tactile-Dominant)

  1. Force sensor → Detect contact (1 kHz)
  2. Adjust gripper force based on contact (impedance control)
  3. Vision confirms grasp success (post-grasp verification)

Key Insight: Modality priorities shift across task phases.

Latency and Accuracy Tradeoffs

System-Level Latency Budget

Example: Humanoid Walking

  • Control Loop: 1 kHz (1ms cycle time)
  • State Estimation: 100 Hz (10ms latency acceptable)
  • Vision Processing: 10 Hz (100ms latency acceptable)

Allocation:

  • IMU → State estimator: 1ms (critical path)
  • Vision → Object detector: 100ms (non-critical)
  • Planner → Footstep planner: 1000ms (runs asynchronously)

Principle: Slower modalities inform higher-level decisions; faster modalities enable low-level control.

Accuracy vs Compute Tradeoff

Object Detection Example:

Model             Input Size   Latency (Jetson Xavier)   mAP    Use Case
YOLOv5-nano       320×320      8ms                       28.4   Lightweight, embedded
YOLOv5-s          640×640      25ms                      37.4   Balanced
YOLOv5-x          640×640      150ms                     50.7   High accuracy, GPU
EfficientDet-D7   1536×1536    300ms                     52.2   Offline processing

Engineering Decision:

  • Mobile robot navigation: YOLOv5-nano (real-time, object presence matters more than precision)
  • Manipulation: YOLOv5-x (accuracy critical, 150ms acceptable for planning)
  • Warehouse picking: EfficientDet-D7 (offline scene understanding)
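The engineering decision above can be framed as picking the most accurate model that fits a latency budget. The sketch below hard-codes the latency and mAP figures from the table; the pick_model helper is hypothetical, and a real decision would also weigh memory, power, and input resolution.

```python
from typing import List, Optional, Tuple

# (name, latency in ms on Jetson Xavier, mAP), copied from the table above
MODELS: List[Tuple[str, float, float]] = [
    ("YOLOv5-nano", 8, 28.4),
    ("YOLOv5-s", 25, 37.4),
    ("YOLOv5-x", 150, 50.7),
    ("EfficientDet-D7", 300, 52.2),
]


def pick_model(latency_budget_ms: float) -> Optional[str]:
    """Return the highest-mAP model whose latency fits the budget."""
    feasible = [m for m in MODELS if m[1] <= latency_budget_ms]
    if not feasible:
        return None  # nothing fits; the budget or the hardware must change
    return max(feasible, key=lambda m: m[2])[0]
```

With a 30ms budget (roughly one frame at 30 FPS) this selects YOLOv5-s, while a 150ms planning budget admits YOLOv5-x.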

Resolution vs Speed

Camera Resolution:

  • 640×480 (VGA): 30 FPS, low compute, small objects missed
  • 1920×1080 (FHD): 30 FPS, medium compute, good for manipulation
  • 3840×2160 (4K): 15-30 FPS, high compute, inspection tasks

Depth Sensor Resolution:

  • 320×240: Real-time, coarse geometry
  • 640×480: Standard, balanced
  • 1280×720: High-res, slow, precise measurements

Tradeoff: Higher resolution enables finer details but reduces frame rate and increases latency.

Real-World Failure Modes

1. Lighting Variation

Problem: Vision models trained on well-lit indoor scenes fail in:

  • Direct sunlight (overexposure, shadows)
  • Low light (underexposure, noise)
  • Backlighting (silhouettes, lost detail)

Mitigation:

  • Hardware: HDR cameras (120dB dynamic range)
  • Software: Adaptive exposure control, tone mapping
  • Training: Data augmentation with lighting variation
  • Redundancy: LiDAR (lighting-invariant) + camera

Example: Outdoor delivery robot

  • Camera fails to detect obstacles in direct sunlight
  • LiDAR provides backup obstacle detection
  • Fusion: Use LiDAR when camera confidence low

2. Occlusion and Partial Views

Problem: Object detectors assume full object visibility.

  • Partial occlusion → low confidence or misclassification
  • Total occlusion → missed detection

Mitigation:

  • Multi-view fusion: Cameras at different angles
  • Temporal integration: Track objects across time (object persistence)
  • Predictive models: Estimate occluded object location

Example: Warehouse picking

  • Object partially hidden behind another
  • Single viewpoint: 40% detection rate
  • Two viewpoints (left + right): 85% detection rate

3. Dynamic Range Limitations

Problem: Sensors have limited dynamic range.

  • Bright + dark regions in same scene → saturated or underexposed
  • Example: Looking from indoor (dark) to outdoor (bright) through window

Mitigation:

  • HDR imaging: Capture multiple exposures, merge
  • Active lighting: Structured light, flash
  • Sensor selection: Event cameras (120dB vs 60dB for standard)

4. Motion Blur

Problem: Moving camera or fast-moving objects → blurred images.

  • Rolling shutter cameras: Rows are exposed at different times, adding skew and wobble artifacts on top of blur
  • Effect: Blur reduces feature quality, degrades tracking

Mitigation:

  • Faster shutter speed: Reduce exposure time (requires more light or sensor gain)
  • Global shutter cameras: Entire frame exposed simultaneously (expensive)
  • Event cameras: No motion blur (pixel-level temporal resolution)
  • Multi-frame fusion: Combine multiple blurred images

Example: Quadcopter

  • High-speed motion → motion blur
  • IMU provides motion compensation for visual odometry
  • Event camera backup for aggressive maneuvers

5. Sensor Failures and Degradation

Problem: Sensors fail or degrade over time.

  • Dirty lens (dust, water, oil)
  • Sensor drift (IMU bias, encoder wear)
  • Hardware failure (cable disconnection, power loss)

Mitigation:

  • Health Monitoring: Track sensor statistics (noise level, update rate)
  • Anomaly Detection: Detect out-of-distribution sensor readings
  • Graceful Degradation: Reduce functionality rather than fail completely
  • Redundancy: Multiple sensors per modality
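Health monitoring of a sensor's update rate can be sketched with a sliding window of timestamps. The RateMonitor class, its 20% tolerance, and the two-period staleness rule are assumptions chosen for illustration.

```python
from collections import deque
from typing import Optional


class RateMonitor:
    """Flags a sensor as unhealthy if its update rate drifts or it goes silent."""

    def __init__(self, expected_hz: float, tolerance: float = 0.2, window: int = 10):
        self.expected_hz = expected_hz
        self.tolerance = tolerance          # allowed fractional rate deviation
        self.stamps = deque(maxlen=window)  # most recent message timestamps (s)

    def add(self, timestamp: float) -> None:
        self.stamps.append(timestamp)

    def measured_hz(self) -> Optional[float]:
        if len(self.stamps) < 2:
            return None
        span = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / span if span > 0 else None

    def healthy(self, now: float) -> bool:
        hz = self.measured_hz()
        if hz is None:
            return False
        # Stale if nothing has arrived for more than two expected periods.
        stale = (now - self.stamps[-1]) > 2.0 / self.expected_hz
        rate_ok = abs(hz - self.expected_hz) <= self.tolerance * self.expected_hz
        return rate_ok and not stale
```

A fusion node could call healthy() before each update and drop or down-weight a modality that fails the check, which is one simple way to implement graceful degradation.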

Example: Autonomous vehicle

  • Front camera dirty → reduced confidence
  • System switches to side cameras + LiDAR
  • Alerts operator to clean camera
  • Does not attempt highway driving (reduced capability mode)

Engineering Best Practices

1. Calibration Pipeline

Camera Calibration:

  • Intrinsic: Focal length, principal point, distortion coefficients
  • Extrinsic: Position and orientation relative to robot base
  • Frequency: Initial + after mechanical impacts + quarterly
  • Tool: Checkerboard pattern, OpenCV calibration

Multi-Camera Calibration:

  • Spatial: Relative poses between cameras
  • Temporal: Time synchronization (critical for fusion)
  • Challenge: Cameras without overlapping fields of view cannot observe the same target at once, so the calibration target must be presented to each camera in sequence

Sensor-to-Robot Calibration:

  • Hand-eye calibration: Camera on robot arm → solve AX=XB problem
  • Validation: Repeatability test (move to same pose, measure consistency)

2. Temporal Synchronization

Problem: Sensors operate at different rates with different delays.

  • Camera: 30 Hz, 33ms latency
  • LiDAR: 10 Hz, 50ms latency
  • IMU: 200 Hz, 5ms latency

Solution: Timestamp-based fusion

  • Assign timestamp to each measurement
  • Interpolate/extrapolate to common time
  • Account for latency in fusion algorithm

Implementation:

def fuse_sensors(camera_data, lidar_data, target_time):
    # interpolate, find_closest, and fusion_algorithm are placeholders
    # for application-specific implementations.
    # Interpolate camera data to target time
    cam_interpolated = interpolate(camera_data, target_time)
    # Use LiDAR measurement closest to target time
    lidar_closest = find_closest(lidar_data, target_time)
    # Fuse
    return fusion_algorithm(cam_interpolated, lidar_closest)
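The snippet above leaves interpolate, find_closest, and fusion_algorithm undefined. A runnable sketch follows, assuming each stream is a time-sorted list of (timestamp, value) pairs with scalar values such as a range to an obstacle. The fixed weighted average stands in for the fusion step and is an assumption, not a recommended algorithm.

```python
from bisect import bisect_left
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, scalar measurement)


def interpolate(samples: List[Sample], t: float) -> float:
    """Linear interpolation at time t; clamps to the first/last sample."""
    times = [s[0] for s in samples]
    i = bisect_left(times, t)
    if i == 0:
        return samples[0][1]
    if i == len(samples):
        return samples[-1][1]
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    alpha = (t - t0) / (t1 - t0)
    return v0 + alpha * (v1 - v0)


def find_closest(samples: List[Sample], t: float) -> float:
    """Value of the sample whose timestamp is nearest to t."""
    return min(samples, key=lambda s: abs(s[0] - t))[1]


def fuse_sensors(camera_data: List[Sample], lidar_data: List[Sample],
                 target_time: float, cam_weight: float = 0.5) -> float:
    """Align both streams to target_time, then combine them."""
    cam = interpolate(camera_data, target_time)
    lidar = find_closest(lidar_data, target_time)
    # Placeholder fusion: fixed weighted average of the two estimates.
    return cam_weight * cam + (1.0 - cam_weight) * lidar
```

Clamping at the ends avoids extrapolating beyond the newest sample; if extrapolation is needed to compensate for latency, a motion model is usually a safer basis than a straight line through the last two points.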

3. Uncertainty Quantification

Perception outputs should include confidence:

  • Object detection: Bounding box + class + confidence score
  • Pose estimation: 6D pose + covariance matrix
  • Depth estimation: Depth map + uncertainty map

Use in downstream modules:

  • Path planning: Avoid uncertain obstacles
  • Grasp planning: Reject low-confidence grasp candidates
  • Human oversight: Request human confirmation for uncertain decisions

4. Computational Resource Management

Heterogeneous Compute:

  • CPU: Sequential logic, scheduling, I/O
  • GPU: Parallel inference (deep learning), image processing
  • FPGA: Ultra-low latency sensor processing, custom algorithms

Pipeline Optimization:

  • Batching: Process multiple frames together (GPU efficiency)
  • Quantization: INT8 inference (up to about 4× speedup vs FP32 on hardware with INT8 support, usually with small accuracy loss after calibration)
  • Model pruning: Remove redundant weights (smaller model, faster inference)
  • Early exit: Skip expensive processing if early stages fail (e.g., no objects detected)
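To make the quantization item concrete, the sketch below applies symmetric per-tensor INT8 quantization to a list of floats. It is a simplified illustration with hypothetical function names; deployment toolchains typically add calibration data and per-channel scales, which are not shown here.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [qi * scale for qi in q]
```

The round-trip error per value is bounded by half the scale, so tensors with a few large outliers lose more precision on their small values. That is one reason accuracy should be re-measured after quantizing.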

Key Takeaways

  1. Perception pipelines transform raw sensor data into structured representations through stages: preprocessing → inference → post-processing, with total latency 40-200ms.

  2. Multimodal fusion combines vision, depth, LiDAR, proprioceptive, and tactile sensing using early fusion (joint features), late fusion (modular), or hierarchical fusion (balanced).

  3. Latency-accuracy tradeoffs require careful engineering: allocate fast processing (1-10ms) to control-critical loops, slower processing (100ms+) to high-level planning.

  4. Real-world failure modes include lighting variation, occlusion, dynamic range limits, motion blur, and sensor degradation—requiring HDR cameras, multi-view fusion, event cameras, and redundancy.

  5. Calibration, temporal synchronization, uncertainty quantification, and compute optimization are essential engineering practices for robust production systems.

  6. Modality priorities shift across task phases: vision-dominant pre-grasp, vision+proprioception during approach, tactile-dominant at contact.

  7. System-level latency budgets allocate 1ms for critical control loops, 10-100ms for perception, 1000ms for planning—enabling real-time operation under tight constraints.


Next Chapter: Control systems—PID, MPC, and learning-based control for Physical AI.