How to reconstruct 3D scenes without losing track over time?
Congrong Xu, Huachen Gao, Xingyu Chen, Yuliang Xiu, Jun Gao, Anpei Chen
May 26, 2026
Existing depth-and-pose foundation models assume a fixed global coordinate frame, an assumption that breaks down for long videos or streaming input, where positions drift without bound over time. R³ instead predicts relative constraints between frames using a lightweight MLP, with confidence scores that weight both the training losses and pose aggregation. This lets the model handle arbitrarily long sequences without memory growth, in both full-context and causal streaming modes.
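To make the idea of confidence-weighted pose aggregation concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary does not say how R³ combines its relative constraints, so the function names (`aggregate_pose`, `project_to_rotation`, `translation`), the 4x4 pose convention, and the averaging scheme (weighted mean for translation, chordal mean for rotation) are all assumptions chosen for illustration.

```python
import numpy as np


def translation(x, y, z):
    """Hypothetical helper: a 4x4 rigid transform with identity rotation."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


def project_to_rotation(m):
    """Return the rotation matrix closest to m in Frobenius norm, via SVD."""
    u, _, vt = np.linalg.svd(m)
    # Flip the last singular direction if needed so that det(R) = +1.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    return u @ np.diag([1.0, 1.0, d]) @ vt


def aggregate_pose(ref_poses, rel_poses, confidences):
    """Fuse several relative constraints into one pose estimate.

    ref_poses   : already-estimated 4x4 poses of reference frames (assumed input)
    rel_poses   : predicted 4x4 transforms from each reference to the new frame
    confidences : one positive score per constraint (assumed to act as weights)
    """
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1

    # Each constraint proposes a candidate pose for the new frame.
    candidates = [T_ref @ T_rel for T_ref, T_rel in zip(ref_poses, rel_poses)]

    # Translation: confidence-weighted arithmetic mean.
    trans = sum(wi * T[:3, 3] for wi, T in zip(w, candidates))
    # Rotation: chordal mean, i.e. weighted sum projected back onto SO(3).
    rot_sum = sum(wi * T[:3, :3] for wi, T in zip(w, candidates))

    pose = np.eye(4)
    pose[:3, :3] = project_to_rotation(rot_sum)
    pose[:3, 3] = trans
    return pose


# Two reference frames propose slightly different poses for a new frame;
# the more confident constraint dominates the fused estimate.
refs = [translation(0, 0, 0), translation(1, 0, 0)]
rels = [translation(2, 0, 0), translation(3, 0, 0)]
fused = aggregate_pose(refs, rels, confidences=[0.75, 0.25])
```

Because each new pose depends only on a bounded set of reference frames and their relative constraints, a scheme like this needs no global state that grows with sequence length, which is the property the summary attributes to the relative formulation.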