// Foundation Model Research — Robotics
EvManip builds self-supervised foundation models that fuse event camera streams with RGB-D sensor data — enabling robots to manipulate objects with human-level dexterity in any environment.
// 01 — Mission
Traditional robotic manipulation relies on frame-based cameras that bottleneck perception at 30–60 fps. EvManip replaces this paradigm with event-driven, asynchronous sensing that captures motion at microsecond resolution — unlocking manipulation capabilities impossible with conventional approaches.
The first foundation model for robotic manipulation trained entirely without human annotations, leveraging contrastive event-RGB alignment as a supervisory signal.
A unified latent space that fuses event streams, RGB-D frames, and proprioceptive signals into a single coherent world representation for manipulation planning.
Policies trained on EvManip generalize across novel objects, lighting conditions, and manipulation contexts without fine-tuning or domain adaptation.
// 02 — Technology
DVS / DAVIS Sensors
Dynamic Vision Sensors (DVS) output asynchronous events at microsecond resolution, capturing motion with ultra-low latency and high dynamic range — immune to motion blur that cripples frame-based systems.
Cross-Modal Attention
Depth-aware spatial grounding fused with event streams via cross-modal attention. The model learns to align temporal events with structured geometry for precise 6-DOF grasp estimation.
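The fusion step can be pictured as event tokens querying RGB-D tokens with standard cross-attention, so sparse temporal events inherit spatial and depth context. The module below is a minimal sketch with arbitrary sizes, not the published architecture; a grasp head on the fused tokens would then regress the 6-DOF pose.

```python
import torch.nn as nn

class EventRGBDCrossAttention(nn.Module):
    """Event tokens attend to RGB-D tokens (hypothetical sketch, arbitrary sizes)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, event_tokens, rgbd_tokens):
        # Queries come from events, keys/values from RGB-D geometry.
        q = self.norm_q(event_tokens)
        kv = self.norm_kv(rgbd_tokens)
        fused, _ = self.attn(q, kv, kv)
        x = event_tokens + fused       # residual: keep the raw event signal
        return x + self.mlp(x)         # feed-forward refinement
```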
Transformer Architecture
A transformer-based architecture pretrained on diverse manipulation trajectories. Fine-tunable to any downstream task with minimal demonstrations, achieving state-of-the-art on standard benchmarks.
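Adapting a pretrained backbone to a new task typically means freezing the encoder and behaviour-cloning a small action head from a handful of demonstrations. The snippet is a generic sketch of that recipe; the dimensions, action space, and helper names are assumptions, not the actual fine-tuning setup.

```python
import torch
import torch.nn as nn

def attach_action_head(backbone: nn.Module, latent_dim=256, action_dim=7, freeze_backbone=True):
    """Freeze a pretrained backbone and attach a small action head.

    `backbone` is assumed to map fused observations to [B, latent_dim] features;
    names and sizes are illustrative only.
    """
    if freeze_backbone:
        for param in backbone.parameters():
            param.requires_grad = False

    head = nn.Sequential(
        nn.Linear(latent_dim, 256),
        nn.GELU(),
        nn.Linear(256, action_dim),   # e.g. 6-DOF end-effector delta + gripper command
    )
    # Behaviour cloning on a few demonstrations:
    #   optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
    #   loss = nn.functional.mse_loss(head(backbone(obs_batch)), action_batch)
    return head
```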
Contrastive SSL
Contrastive learning between synchronized event and RGB streams generates rich, label-free representations. No human annotation pipelines. No costly teleoperation data collection.
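The pretraining objective can be illustrated with a symmetric, CLIP-style InfoNCE loss between per-window event and RGB embeddings. This is the generic formulation, not necessarily the exact EvManip objective.

```python
import torch
import torch.nn.functional as F

def event_rgb_infonce(event_emb, rgb_emb, temperature=0.07):
    """Symmetric InfoNCE between synchronized event and RGB embeddings.

    event_emb, rgb_emb: [B, D] projections of the same B time windows.
    Positive pairs sit on the diagonal of the similarity matrix.
    """
    e = F.normalize(event_emb, dim=-1)
    r = F.normalize(rgb_emb, dim=-1)
    logits = e @ r.t() / temperature                    # [B, B] cosine similarities
    targets = torch.arange(e.size(0), device=e.device)  # matching indices are positives
    loss_e2r = F.cross_entropy(logits, targets)
    loss_r2e = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_e2r + loss_r2e)
```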
Edge Deployment
Optimized inference pipeline runs at >100 Hz on embedded compute. Deployable on Jetson-class hardware for operation in unstructured environments without cloud dependency.
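One common route to Jetson-class deployment is exporting the trained policy to ONNX, building a TensorRT engine on the device, and checking the latency budget. The sketch below covers the export plus a rough eager-mode timing pass; the model, inputs, and file name are placeholders.

```python
import time
import torch

def export_and_benchmark(model, example_inputs, onnx_path="evmanip_policy.onnx", iters=200):
    """Export to ONNX (for a TensorRT build on-device) and take a rough latency reading.

    `model`, `example_inputs` (a tuple), and the file name are placeholders; real
    deployment would also apply FP16/INT8 and time the TensorRT engine, not eager PyTorch.
    """
    model.eval()
    torch.onnx.export(model, example_inputs, onnx_path, opset_version=17)

    with torch.no_grad():
        for _ in range(10):                          # warm-up
            model(*example_inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(*example_inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    print(f"mean latency: {1e3 * elapsed / iters:.2f} ms (~{iters / elapsed:.0f} Hz)")
```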
Isaac Sim / OceanSim
Event camera simulation in Isaac Sim enables large-scale synthetic pretraining. Domain randomization over lighting and texture conditions yields robust real-world transfer out of the box.
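Event simulation from rendered frames is usually done ESIM-style, by thresholding log-intensity changes between consecutive images. The sketch below shows that generic idea on two grayscale frames; it is independent of Isaac Sim's actual interfaces and ignores per-pixel timing.

```python
import numpy as np

def frames_to_events(prev_frame, next_frame, timestamp, threshold=0.2, eps=1e-3):
    """Generate ON/OFF events from two consecutive grayscale frames (values in [0, 1]).

    ESIM-style log-intensity thresholding: a coarse approximation that assigns all
    events the same timestamp and ignores sensor noise, which a full simulator models.
    """
    diff = np.log(next_frame + eps) - np.log(prev_frame + eps)

    ys, xs = np.nonzero(np.abs(diff) >= threshold)
    polarities = np.sign(diff[ys, xs])
    times = np.full(len(xs), timestamp, dtype=np.float64)

    # (N, 4) pseudo-events in (x, y, t, polarity) order, matching the voxel-grid input above.
    return np.stack([xs, ys, times, polarities], axis=-1)
```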
// Inference Pipeline
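At runtime the pipeline can be pictured as a fixed-rate control loop: events accumulate continuously, the latest RGB-D frame is fused in whenever one arrives, and the policy emits an action every tick. Every interface below (event_buffer, rgbd_camera, encoder, policy, robot) is a hypothetical stand-in for the real drivers and model wrappers.

```python
import time

def control_loop(event_buffer, rgbd_camera, encoder, policy, robot, rate_hz=100):
    """Run the policy on a fixed clock, fusing fresh events with the latest RGB-D frame.

    All objects here are hypothetical interfaces used to illustrate the data flow.
    """
    period = 1.0 / rate_hz
    latest_rgbd = rgbd_camera.read()             # frames arrive at camera rate (~30 Hz)
    while True:
        t0 = time.perf_counter()

        events = event_buffer.drain()            # microsecond events since the last tick
        if rgbd_camera.has_new_frame():
            latest_rgbd = rgbd_camera.read()

        obs = encoder(events, latest_rgbd)       # fused latent observation
        action = policy(obs)                     # e.g. 6-DOF end-effector command
        robot.send(action)

        # Sleep off the remainder of the 10 ms budget to hold the control rate.
        time.sleep(max(0.0, period - (time.perf_counter() - t0)))
```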
// 03 — Research
Our work sits at the intersection of neuromorphic sensing, self-supervised learning, and robot manipulation — pushing the boundaries of what autonomous systems can perceive and do.
We introduce the first foundation model for robotic manipulation trained without manual labels, using event-RGB contrastive alignment as the pretraining objective. Evaluated across 7 manipulation benchmarks, EvManip-1 outperforms supervised baselines in 5 out of 7 tasks.
A novel cross-modal attention mechanism that temporally aligns asynchronous event streams with synchronous RGB-D frames without explicit time discretization. It achieves a 40% reduction in grasp estimation error versus frame-only baselines.
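One way to align asynchronous events with frame timestamps without binning is to embed raw continuous timestamps with sinusoidal features, giving event and frame tokens a shared time axis for attention. The encoding below is a generic sketch of that idea, not the published mechanism.

```python
import math
import torch

def continuous_time_encoding(timestamps, dim=256):
    """Sinusoidal embedding of raw float timestamps (e.g. seconds within the window).

    timestamps: [N] tensor of event or frame times. Event tokens and RGB-D frame
    tokens receive the same encoding, so cross-attention can reason about relative
    timing without discretizing the event stream into bins.
    """
    half = dim // 2
    # Geometric frequency ladder, as in standard transformer positional encodings.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = timestamps.float()[:, None] * freqs[None, :]             # [N, dim/2]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [N, dim]
```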
We propose a high-fidelity event simulator built on Isaac Sim with configurable sensor noise models. It demonstrates zero-shot sim-to-real transfer on pick-and-place and articulated-object manipulation tasks.
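The configurable-noise idea can be illustrated by perturbing simulated events with a few standard DVS artifacts: dropped events, spurious background activity, and polarity flips. The parameters below are arbitrary; a real noise model is calibrated against the target sensor.

```python
import numpy as np

def add_sensor_noise(events, height, width, drop_prob=0.05, ba_count=200, flip_prob=0.01, rng=None):
    """Inject simple DVS-style noise into simulated (x, y, t, p) events (non-empty array assumed).

    drop_prob: fraction of true events randomly removed (missed detections).
    ba_count:  spurious background-activity events added across the window.
    flip_prob: fraction of surviving events whose polarity is flipped.
    Values are illustrative only.
    """
    rng = rng or np.random.default_rng()

    keep = rng.random(len(events)) >= drop_prob
    noisy = events[keep].copy()

    flip = rng.random(len(noisy)) < flip_prob
    noisy[flip, 3] *= -1

    t_lo, t_hi = events[:, 2].min(), events[:, 2].max()
    background = np.column_stack([
        rng.integers(0, width, ba_count),
        rng.integers(0, height, ba_count),
        rng.uniform(t_lo, t_hi, ba_count),
        rng.choice([-1.0, 1.0], ba_count),
    ])

    merged = np.concatenate([noisy, background], axis=0)
    return merged[np.argsort(merged[:, 2])]      # keep the stream time-ordered
```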
By conditioning manipulation policies on language-grounded scene descriptions fused with event-RGB representations, EvManip generalizes to instructions unseen during training without any fine-tuning.
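Language conditioning can be sketched as embedding the instruction with a frozen text encoder and modulating the fused event-RGB latent, for example with FiLM. The module below is an assumed illustration with placeholder sizes, not the described system.

```python
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Fuse a text instruction embedding with the event-RGB latent via FiLM, then predict actions.

    Hypothetical sketch: `text_dim` would come from a frozen text encoder
    (e.g. a CLIP text tower); sizes and the action space are placeholders.
    """
    def __init__(self, latent_dim=256, text_dim=512, action_dim=7):
        super().__init__()
        self.film = nn.Linear(text_dim, 2 * latent_dim)   # per-channel scale and shift
        self.head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(), nn.Linear(256, action_dim)
        )

    def forward(self, scene_latent, text_emb):
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        conditioned = scene_latent * (1 + scale) + shift   # FiLM conditioning
        return self.head(conditioned)
```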
// 04 — Contact
We're looking for research collaborators, hardware partners, and early adopters pushing the edge of robotic autonomy. If you're working on next-generation manipulation systems, let's talk.