// Foundation Model Research — Robotics

The Future of Robotic Manipulation.

EvManip builds self-supervised foundation models that fuse event camera streams with RGB-D sensor data — enabling robots to manipulate objects with human-level dexterity in any environment.

Event Camera Fusion · Self-Supervised Learning · RGB-D Integration · Foundation Model Architecture · Real-Time Manipulation · Sensor Fusion Pipeline · Zero-Shot Generalization
1μs event camera latency, faster than frame-based capture
Zero manual labels required
Zero-shot task generalization

Perception that moves at the speed of physics.

Traditional robotic manipulation relies on frame-based cameras that bottleneck perception at 30–60fps. EvManip replaces this paradigm with event-driven, asynchronous sensing that captures motion at microsecond resolution — unlocking manipulation capabilities impossible with conventional approaches.

Self-Supervised Foundation Model

The first foundation model for robotic manipulation trained entirely without human annotations, leveraging contrastive event-RGB alignment as a supervisory signal.

Multimodal Sensor Fusion

A unified latent space that fuses event streams, RGB-D frames, and proprioceptive signals into a single coherent world representation for manipulation planning.

Zero-Shot Task Transfer

Policies trained on EvManip generalize across novel objects, lighting conditions, and manipulation contexts without fine-tuning or domain adaptation.

A new sensing paradigm
for dexterous robots.

Event Camera Input

Dynamic Vision Sensors (DVS) output asynchronous events at microsecond resolution, capturing motion with ultra-low latency and high dynamic range — immune to motion blur that cripples frame-based systems.

DVS / DAVIS Sensors
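
For readers unfamiliar with event data, the sketch below shows one common way to densify a raw stream of (timestamp, x, y, polarity) events into a voxel grid that a learning pipeline can consume. The field layout, bin count, and sensor resolution are illustrative assumptions, not the EvManip input format.

```python
# Minimal sketch: accumulate raw DVS events into a temporal voxel grid.
# Layout and bin count are illustrative assumptions only.
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate polarity events (t, x, y, p) into a (num_bins, H, W) grid."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = np.where(events[:, 3] > 0, 1.0, -1.0)  # ON events +1, OFF events -1

    # Normalize timestamps to [0, num_bins) so each event lands in a temporal bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1e-6)
    bins = t_norm.astype(int)

    np.add.at(grid, (bins, y, x), p)
    return grid

# Example: 10k synthetic events over a 10 ms window on a 260x346 sensor.
rng = np.random.default_rng(0)
events = np.stack([
    rng.uniform(0, 0.010, 10_000),   # timestamps in seconds
    rng.integers(0, 346, 10_000),    # x
    rng.integers(0, 260, 10_000),    # y
    rng.integers(0, 2, 10_000),      # polarity
], axis=1)
voxels = events_to_voxel_grid(events, num_bins=5, height=260, width=346)
print(voxels.shape)  # (5, 260, 346)
```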

RGB-D Fusion Layer

Depth-aware spatial grounding fused with event streams via cross-modal attention. The model learns to align temporal events with structured geometry for precise 6-DOF grasp estimation.

Cross-Modal Attention
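
A minimal sketch of what such a cross-modal attention block could look like, with event tokens querying RGB-D patch tokens. The dimensions, normalization choices, and module names are assumptions for illustration, not the published architecture.

```python
# Illustrative cross-modal fusion: event tokens attend to RGB-D tokens.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, event_tokens, rgbd_tokens):
        # Event tokens query the RGB-D tokens, grounding fast temporal cues
        # in depth-aware spatial structure.
        q = self.norm_q(event_tokens)
        kv = self.norm_kv(rgbd_tokens)
        fused, _ = self.attn(q, kv, kv)
        x = event_tokens + fused            # residual connection
        return x + self.ffn(x)

# Example: 128 event tokens attending to 196 RGB-D patch tokens.
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 128, 256), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 128, 256])
```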

Foundation Model Core

A transformer-based architecture pretrained on diverse manipulation trajectories. Fine-tunable to any downstream task with minimal demonstrations, achieving state-of-the-art on standard benchmarks.

Transformer Architecture
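
As a rough illustration of adaptation from a few demonstrations, the sketch below freezes a placeholder pretrained encoder and trains only a small policy head by behavior cloning. All module names and sizes are assumptions, not the EvManip-1 architecture.

```python
# Hypothetical few-demonstration fine-tuning: frozen backbone, trainable head.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 256), nn.GELU(), nn.Linear(256, 256))  # placeholder encoder
for p in backbone.parameters():
    p.requires_grad = False  # keep pretrained representations intact

policy_head = nn.Sequential(nn.Linear(256, 128), nn.GELU(), nn.Linear(128, 7))  # 6-DOF pose + gripper
optim = torch.optim.AdamW(policy_head.parameters(), lr=3e-4)

# A handful of demonstration (observation, action) pairs.
obs = torch.randn(64, 512)
actions = torch.randn(64, 7)

for _ in range(100):
    with torch.no_grad():
        z = backbone(obs)                # reuse pretrained features as-is
    loss = nn.functional.mse_loss(policy_head(z), actions)
    optim.zero_grad()
    loss.backward()
    optim.step()
```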

Self-Supervised Pretraining

Contrastive learning between synchronized event and RGB streams generates rich, label-free representations. No human annotation pipelines. No costly teleoperation data collection.

Contrastive SSL
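
The sketch below illustrates the general shape of such an objective: a symmetric InfoNCE loss that pulls together embeddings of temporally aligned event and RGB windows and pushes apart mismatched pairs. The encoders are omitted and the dimensions are illustrative assumptions.

```python
# Symmetric InfoNCE over synchronized event / RGB window embeddings.
import torch
import torch.nn.functional as F

def event_rgb_contrastive_loss(event_emb, rgb_emb, temperature=0.07):
    """event_emb, rgb_emb: (batch, dim) embeddings of synchronized windows."""
    e = F.normalize(event_emb, dim=-1)
    r = F.normalize(rgb_emb, dim=-1)
    logits = e @ r.t() / temperature                  # (batch, batch) similarity matrix
    targets = torch.arange(len(e), device=e.device)   # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = event_rgb_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```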

Real-Time Inference

The optimized inference pipeline runs at >100Hz on embedded compute. It deploys on Jetson-class hardware, enabling operation in unstructured environments without cloud dependency.

Edge Deployment
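
A rough way to sanity-check a >100Hz control budget is a simple timing loop over the exported policy. The model below is a stand-in; a real deployment would profile the full sensor-to-action path on the target Jetson device.

```python
# Quick inference-rate check with a placeholder policy network.
import time
import torch

policy = torch.nn.Sequential(torch.nn.Linear(256, 128), torch.nn.ReLU(), torch.nn.Linear(128, 7)).eval()
obs = torch.randn(1, 256)

with torch.no_grad():
    for _ in range(10):                  # warm-up iterations
        policy(obs)
    n = 1000
    start = time.perf_counter()
    for _ in range(n):
        policy(obs)
    elapsed = time.perf_counter() - start

print(f"{n / elapsed:.0f} Hz average inference rate")
```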

Sim-to-Real Transfer

Event camera simulation in Isaac Sim enables large-scale synthetic pretraining. Domain randomization over lighting and texture conditions yields robust real-world transfer out of the box.

Isaac Sim / OceanSim
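
The sketch below conveys the flavor of that randomization: each simulated episode samples fresh lighting, texture, and sensor-noise parameters. The parameter names and ranges are invented for illustration and are not Isaac Sim APIs.

```python
# Hypothetical domain randomization draw for one simulated episode.
import random

def sample_randomization():
    return {
        "light_intensity": random.uniform(200.0, 3000.0),     # lux-like scale
        "light_temperature": random.uniform(2700.0, 6500.0),  # Kelvin
        "texture_id": random.randrange(500),                  # random texture per object
        "contrast_threshold": random.gauss(0.25, 0.05),       # event sensor noise model
        "background_id": random.randrange(50),
    }

# Each episode gets a fresh draw, so the pretrained model never
# overfits to one visual configuration.
episode_params = [sample_randomization() for _ in range(3)]
```
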
01 Event Stream: DVS sensor, μs resolution
02 RGB-D Frame: depth camera, spatial grounding
03 Fusion Encoder: cross-modal attention
04 Latent Space: unified world representation
05 Policy Head: action prediction, 6-DOF control
06 Execution: real-time robot control output
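
Read end to end, the six stages above amount to a single forward pass from raw sensor data to an action. The sketch below wires placeholder modules together in that order; the shapes, encoders, and action parameterization are illustrative assumptions rather than the EvManip implementation.

```python
# End-to-end sketch of the six-stage pipeline with placeholder modules.
import torch
import torch.nn as nn

class ManipulationPipeline(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.event_encoder = nn.Conv2d(5, dim, kernel_size=16, stride=16)  # 01: event voxel grid -> tokens
        self.rgbd_encoder = nn.Conv2d(4, dim, kernel_size=16, stride=16)   # 02: RGB-D frame -> tokens
        self.fusion = nn.TransformerEncoder(                               # 03: cross-modal fusion
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.policy_head = nn.Linear(dim, 7)                               # 05: 6-DOF pose + gripper

    def forward(self, event_voxels, rgbd):
        e = self.event_encoder(event_voxels).flatten(2).transpose(1, 2)
        r = self.rgbd_encoder(rgbd).flatten(2).transpose(1, 2)
        tokens = torch.cat([e, r], dim=1)
        latent = self.fusion(tokens).mean(dim=1)   # 04: unified latent representation
        return self.policy_head(latent)            # 06: action sent to the robot controller

pipe = ManipulationPipeline()
action = pipe(torch.randn(1, 5, 224, 224), torch.randn(1, 4, 224, 224))
print(action.shape)  # torch.Size([1, 7])
```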

Advancing the frontier
of embodied intelligence.

Our work sits at the intersection of neuromorphic sensing, self-supervised learning, and robot manipulation — pushing the boundaries of what autonomous systems can perceive and do.

Core Architecture

EvManip-1: Self-Supervised Foundation Model for Dexterous Manipulation

We introduce the first foundation model for robotic manipulation trained without manual labels, using event-RGB contrastive alignment as the pretraining objective. Evaluated across 7 manipulation benchmarks, EvManip-1 outperforms supervised baselines in 5 out of 7 tasks.

Sensor Fusion

Asynchronous Event-RGB-D Fusion via Cross-Modal Transformers

A novel cross-modal attention mechanism that temporally aligns asynchronous event streams with synchronous RGB-D frames without explicit time discretization. Achieves 40% reduction in grasp estimation error versus frame-only baselines.

Sim-to-Real

Scalable Event Camera Simulation for Robot Manipulation Pretraining

We propose a high-fidelity event simulator built on Isaac Sim with configurable sensor noise models. Demonstrates zero-shot sim-to-real transfer on pick-and-place and articulated object manipulation tasks.
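
The core idea behind frame-to-event simulation can be stated in a few lines: an event fires whenever the per-pixel log-intensity change since the last emitted event crosses a contrast threshold. The sketch below is a deliberately minimal version of that idea; a production simulator also interpolates timestamps between frames and models sensor noise.

```python
# Simplified frame-to-event conversion: threshold on log-intensity change.
import numpy as np

def frames_to_events(frames, threshold=0.25):
    """frames: (T, H, W) grayscale renders in [0, 1]; returns (t, y, x, polarity) rows."""
    log_frames = np.log(frames + 1e-4)
    reference = log_frames[0].copy()      # last log intensity at which each pixel fired
    events = []
    for t in range(1, len(frames)):
        diff = log_frames[t] - reference
        fired = np.abs(diff) >= threshold
        ys, xs = np.nonzero(fired)
        for y, x in zip(ys, xs):
            events.append((t, y, x, 1 if diff[y, x] > 0 else -1))
        reference[fired] = log_frames[t][fired]
    return np.array(events)

# Example: a bright square moving across an otherwise static scene.
frames = np.full((8, 64, 64), 0.2)
for t in range(8):
    frames[t, 20:30, 5 * t:5 * t + 10] = 0.9
events = frames_to_events(frames)
print(len(events), "events")
```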

Generalization

Zero-Shot Task Generalization via Multimodal Scene Understanding

By conditioning manipulation policies on language-grounded scene descriptions fused with event-RGB representations, EvManip generalizes to instructions unseen during training without any fine-tuning.
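
One simple way to realize such conditioning is to let an instruction embedding modulate the fused scene latent before action prediction, for example with FiLM-style scaling and shifting as sketched below. The text encoder and dimensions are placeholders, not the grounding model used by EvManip.

```python
# Schematic language-conditioned policy head with FiLM-style modulation.
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, scene_dim=256, text_dim=384, hidden=256):
        super().__init__()
        self.film = nn.Linear(text_dim, 2 * scene_dim)   # predicts per-feature scale and shift
        self.head = nn.Sequential(nn.Linear(scene_dim, hidden), nn.GELU(), nn.Linear(hidden, 7))

    def forward(self, scene_latent, text_emb):
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        conditioned = scene_latent * (1 + scale) + shift  # instruction modulates the scene features
        return self.head(conditioned)

policy = LanguageConditionedPolicy()
action = policy(torch.randn(1, 256), torch.randn(1, 384))  # unseen instruction embedding at test time
```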

Built for researchers.
Designed for the future.

We're looking for research collaborators, hardware partners, and early adopters pushing the edge of robotic autonomy. If you're working on next-generation manipulation systems, let's talk.