How to reconstruct 3D scenes without losing track over time?

Existing depth-and-pose foundation models assume a fixed global coordinate frame, which breaks down for long videos or streaming input: positions drift without bound over time. R³ instead predicts relative constraints between frames using a lightweight MLP, with confidence scores that weight both the training losses and the pose aggregation. This lets the model handle arbitrarily long sequences without memory growth, in both full-context and causal streaming modes.
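
To make the idea of confidence-weighted pose aggregation concrete, here is a minimal sketch. It is not R³'s implementation. It assumes planar poses (x, y, heading) instead of full 3D poses, a plain weighted mean for position, and a weighted circular mean for heading. The names `compose`, `aggregate`, and `estimate_pose` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Tuple

# Assumption: planar pose (x, y, heading in radians) as a stand-in for a 3D pose.
Pose = Tuple[float, float, float]


def compose(ref: Pose, rel: Pose) -> Pose:
    """Apply a relative pose `rel` (expressed in the frame of `ref`) to `ref`."""
    rx, ry, rt = ref
    dx, dy, dt = rel
    c, s = math.cos(rt), math.sin(rt)
    return (rx + c * dx - s * dy, ry + s * dx + c * dy, rt + dt)


def aggregate(candidates: List[Pose], confidences: List[float]) -> Pose:
    """Fuse candidate poses, weighting each one by its confidence score."""
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    x = sum(w * p[0] for p, w in zip(candidates, confidences)) / total
    y = sum(w * p[1] for p, w in zip(candidates, confidences)) / total
    # Weighted circular mean, so headings near +pi and -pi average correctly.
    sin_sum = sum(w * math.sin(p[2]) for p, w in zip(candidates, confidences))
    cos_sum = sum(w * math.cos(p[2]) for p, w in zip(candidates, confidences))
    return (x, y, math.atan2(sin_sum, cos_sum))


def estimate_pose(
    references: List[Pose],
    constraints: List[Tuple[Pose, float]],
) -> Pose:
    """Estimate a new frame's pose from relative constraints to reference frames.

    constraints[i] is (relative pose from references[i] to the new frame,
    confidence of that prediction).
    """
    candidates = [compose(ref, rel) for ref, (rel, _) in zip(references, constraints)]
    confidences = [conf for _, conf in constraints]
    return aggregate(candidates, confidences)


# Two reference frames and one new frame whose true pose is (2, 0, 0).
references = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
constraints = [
    ((2.0, 0.0, 0.0), 0.9),  # accurate constraint, high confidence
    ((1.2, 0.0, 0.0), 0.1),  # noisy constraint, low confidence
]
pose = estimate_pose(references, constraints)
```

Because the noisy constraint carries little weight, the fused estimate stays close to the accurate candidate. Each new pose depends only on the reference frames it is constrained against, so no single global frame has to be maintained for the whole sequence.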