multimodal temporal/spatial memory and verification for vlm outputs. gemini + sam + depth + colmap.
spatial memory for vlms watching construction sites
vima takes ambiguous egocentric construction footage (hardhat-cam, basically) and turns it into auditable spatial memory. label objects, generate semantic boxes, build masks, estimate depth, group everything into episodic events, then answer questions with cited frames.
the pipeline
7 stages, each runnable independently or as part of the full chain:
- yolodex labels — yolo object detection from frames (or vendored tools). reuses the yolodex skill from another project for the labeling stage.
- robotics-er boxes — gemini robotics-er for specialized construction-domain semantic boxes (rebar, formwork, pours).
- box merge — class + iou merge into unified labels (handles double-detection from yolo + robotics-er).
- masks — sam-style box-prompt masks per object track.
- depth — depth anything v2 or proxy depth estimation.
- episodic memory — object-event episodes (spatiotemporal groupings) for retrieval. "the rebar that was placed at 0:34 and re-checked at 1:12" becomes one queryable episode.
- answer from memory — gemini retrieves relevant episodes and generates a cited answer with frame timestamps.
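a minimal sketch of the box merge stage, assuming it behaves like greedy class-aware non-maximum suppression: when two boxes share a class and overlap above an iou threshold, the higher-scoring one is kept. the `Box` fields, the `source` tags, and the 0.5 threshold are hypothetical, picked for illustration rather than taken from vima's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical schema: corner coordinates in pixels plus a detector tag.
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    source: str  # e.g. "yolo" or "robotics-er" (illustrative values)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Keep the best-scoring box among same-class boxes that overlap heavily."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        duplicate = any(
            k.label == box.label and iou(k, box) >= iou_threshold for k in kept
        )
        if not duplicate:
            kept.append(box)
    return kept
```

calling `merge_boxes` on the combined yolo and robotics-er detections for one frame returns a deduplicated list. boxes with different classes are never merged, even when they overlap completely.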
why a vlm alone doesn't work
vlms are great at single frames, terrible at temporal grounding. ask "where did the foreman put the safety cones" and a raw vlm hallucinates a location. vima's episodic memory layer fixes this — every claim the vlm makes is backed by a specific frame range stored in the spatial index.
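to make the "backed by a specific frame range" idea concrete, here is a rough sketch of one way episodes could be stored and cited. the observation tuple format, the `Episode` class, and the 2-second gap rule are assumptions for illustration, not vima's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    # Hypothetical record: one tracked object and every time span it was seen.
    object_id: str
    label: str
    sightings: list[tuple[float, float]] = field(default_factory=list)


def build_episodes(observations, max_gap_s: float = 2.0) -> dict[str, Episode]:
    """Group (timestamp_s, object_id, label) observations into per-object episodes.

    Observations closer together than max_gap_s extend the current sighting;
    a longer gap starts a new sighting inside the same episode.
    """
    episodes: dict[str, Episode] = {}
    for t, object_id, label in sorted(observations):
        ep = episodes.setdefault(object_id, Episode(object_id, label))
        if ep.sightings and t - ep.sightings[-1][1] <= max_gap_s:
            start, _ = ep.sightings[-1]
            ep.sightings[-1] = (start, t)
        else:
            ep.sightings.append((t, t))
    return episodes


def fmt(t: float) -> str:
    """Format seconds as m:ss."""
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes}:{seconds:02d}"


def cite(episode: Episode) -> str:
    """Render an episode as a claim with its supporting time spans."""
    spans = ", ".join(f"{fmt(s)}-{fmt(e)}" for s, e in episode.sightings)
    return f"{episode.label} [{episode.object_id}] seen at {spans}"
```

with this shape, an answer can only cite spans that exist in `sightings`, so a claim with no matching episode has nothing to point at and can be flagged for review.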
multiple interfaces wrap the same backend: a fastapi server (/analyze/frame, /cii/summary, /spatial/zones, /eval, /temporal/run), a cli, an mcp surface so agents can query it natively, and a dashboard for human review.
what shipped
ironsite prize + best use of solana at hacktech 2026. live at vimaspatial.tech. python backend, optional torch + transformers for local inference, hugging face model weights, mintlify docs. agent cli + mcp packages distributed under packages/vima-agent/ and packages/vima-mcp/.





