Tutorials & Tips | 12 min read

How Monocular Depth Estimation Actually Works

The mechanics behind per-frame depth maps — from learned priors to failure modes — and why temporal consistency is the hard part.

Depth estimation from a single image sounds impossible. A flat photograph contains no parallax, no stereo disparity, no direct geometric signal about how far away things are. And yet modern AI models produce depth maps that are, more often than not, remarkably accurate. How?

The short answer is learned priors. Models like ZoeDepth, Depth Anything, and MiDaS have been trained on millions of image-depth pairs. They've learned that floors recede, that objects occlude backgrounds, that texture gradients correlate with distance, and that atmospheric haze increases with depth. These aren't rules anyone programmed — they're statistical patterns extracted from data.

The practical implication: depth estimation works best on scenes that resemble the training data. Well-lit outdoor landscapes, indoor rooms with clear furniture, and portraits with bokeh backgrounds give excellent results. The models struggle with edge cases such as reflective surfaces, transparent objects, and unusual camera angles. Knowing these boundaries is the difference between getting great results and fighting the tool.

What the model actually outputs

A depth map is a single-channel grayscale image where each pixel's brightness represents its estimated distance from the camera. In most conventions, brighter pixels are closer and darker pixels are farther away, though some models invert this. For most models the values are relative, not metric: the model tells you that the foreground is closer than the background, but it does not tell you the foreground is 1.5 meters away.
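To make "relative, not metric" concrete, here is a minimal sketch of turning raw depth values into the grayscale image described above. The function name `depth_to_grayscale` and the assumption that larger raw values mean farther away are illustrative only; check the convention of whichever model you use, since some emit inverse depth.

```python
import numpy as np

def depth_to_grayscale(depth: np.ndarray, near_is_bright: bool = True) -> np.ndarray:
    """Map unitless depth values to an 8-bit grayscale depth map.

    Assumes larger input values mean farther from the camera
    (an assumption for this sketch, not a universal convention).
    """
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    if span == 0:
        # Flat input carries no ordering information: return mid-gray.
        return np.full(d.shape, 128, dtype=np.uint8)
    norm = (d - d.min()) / span      # 0.0 = nearest, 1.0 = farthest
    if near_is_bright:
        norm = 1.0 - norm            # flip so the nearest pixels are brightest
    return np.round(norm * 255).astype(np.uint8)

raw = np.array([[2.0, 4.0], [6.0, 10.0]])
gray = depth_to_grayscale(raw)

# Rescaling and offsetting the raw values leaves the depth map unchanged,
# which is why the map alone cannot tell you distances in meters.
gray_rescaled = depth_to_grayscale(raw * 3.0 + 5.0)
```

Because the normalization discards scale and offset, only the ordering and the proportions between depth levels survive in the output.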

For stereo conversion, relative depth is sufficient. You need to know the ordering (what is in front of what) and the proportional distances (how much closer the foreground is than the background). You do not need absolute measurements in meters. This is why models like Depth Anything V2, which excel at relative depth ordering, are better suited for conversion than models like ZoeDepth, which prioritize metric accuracy at the cost of edge quality.
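As a rough illustration of why ordering and proportions are enough, the sketch below turns normalized nearness into a horizontal pixel shift and synthesizes a second view. It is a deliberately crude approximation under stated assumptions, not how any particular converter works: `disparity_from_nearness`, `synthesize_view`, and `max_shift_px` are hypothetical names, and the warp ignores occlusion and hole filling.

```python
import numpy as np

def disparity_from_nearness(nearness: np.ndarray, max_shift_px: int) -> np.ndarray:
    """Convert normalized nearness (0 = farthest, 1 = nearest) to integer pixel shifts."""
    return np.round(np.clip(nearness, 0.0, 1.0) * max_shift_px).astype(np.int64)

def synthesize_view(image: np.ndarray, disparity: np.ndarray) -> np.ndarray:
    """Backward-warp a grayscale image horizontally by a per-pixel disparity.

    Each output pixel samples the source at column x + disparity,
    clamped to the image bounds. No occlusion handling, no inpainting.
    """
    h, w = image.shape
    cols = np.arange(w)[None, :] + disparity   # source column for every output pixel
    cols = np.clip(cols, 0, w - 1)
    return np.take_along_axis(image, cols, axis=1)

image = np.arange(12).reshape(2, 6)
nearness = np.array([[0.0] * 6,    # top row: far background, should not move
                     [1.0] * 6])   # bottom row: near foreground, shifts the most
disparity = disparity_from_nearness(nearness, max_shift_px=2)
second_view = synthesize_view(image, disparity)
```

Nothing in this calculation needs meters: the shift depends only on where each pixel sits between the nearest and farthest values in the frame.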

The output resolution is typically lower than the input. Most models downsample internally to 384×384 or 518×518, process at that resolution, then upsample the result to match the input. This means the depth map lacks the fine-grained spatial resolution of the source image — edges are softer, thin structures may be missed, and small objects may merge with their backgrounds. This is a fundamental limitation, not a bug in any specific model.
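A toy simulation shows what the resolution round trip does to thin structures. This sketch uses block averaging and nearest-neighbor upsampling as stand-ins; real models run a learned network at the low resolution and typically interpolate more smoothly, so treat `simulate_low_res_depth` as a hypothetical illustration of the effect, not of any model's pipeline.

```python
import numpy as np

def simulate_low_res_depth(depth: np.ndarray, factor: int) -> np.ndarray:
    """Downsample by block averaging, then upsample back by pixel repetition."""
    h, w = depth.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    # Average each factor x factor block into one low-resolution pixel.
    small = depth.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    # Stretch every low-resolution pixel back to a factor x factor block.
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

scene = np.zeros((8, 8))   # far background everywhere
scene[:, 2] = 1.0          # a one-pixel-wide near structure, such as a pole
approx = simulate_low_res_depth(scene, factor=4)
```

In this example the pole's nearness drops from 1.0 to 0.25 and smears across four columns: the thin structure has partly merged with its background, which is the softening described above.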

Where depth estimation fails

Transparent and reflective surfaces are the most common failure mode. Glass, water, and mirrors confuse the model because the visual appearance of the surface does not correspond to its physical distance. A window shows a scene behind it, but the glass surface is close to the camera. The model must decide which to prioritize, and it often guesses wrong.

Repetitive textures at varying depths produce ambiguous signals. A tiled floor receding into the distance provides strong monocular depth cues (texture gradient). But a tiled wall at a constant distance provides the same texture pattern with no depth variation. The model sometimes interprets the wall tiles as receding, producing incorrect depth.

Unusual viewpoints — extreme close-ups, overhead shots, underwater scenes — fall outside the typical training distribution. The model has seen fewer examples of these scenes and produces less reliable estimates. If your source material includes unusual camera angles, preview the depth map before committing to a full render.

Fast motion and motion blur interact poorly with per-frame depth estimation. The model processes each frame independently, and a motion-blurred frame contains less spatial information than a sharp one. This is another reason to run temporal smoothing after depth estimation — it stabilizes estimates across frames where individual frame quality varies.

The temporal problem

A single depth map can look perfect. Play the video and the illusion collapses. Objects shimmer, surfaces breathe, edges swim. This is the temporal consistency problem. Each frame is estimated independently, with no memory of the previous frame's prediction. Minor variations in lighting, compression artifacts, or sub-pixel motion cause the model to produce slightly different depth values frame to frame. These differences are small but visible in stereo, where the human visual system is extraordinarily sensitive to depth instability.
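One simple way to put a number on this instability is the mean absolute change between consecutive depth maps. The sketch below is an assumed metric for illustration (`temporal_flicker` is a hypothetical name); it is only meaningful on static or slow shots, because on moving content it also counts genuine depth change.

```python
import numpy as np

def temporal_flicker(depth_frames: np.ndarray) -> np.ndarray:
    """Mean absolute per-pixel change between consecutive depth maps.

    depth_frames has shape (T, H, W); the result has length T - 1.
    """
    diffs = np.abs(np.diff(depth_frames.astype(np.float64), axis=0))
    return diffs.mean(axis=(1, 2))

# A static scene whose estimated depth drifts slightly from frame to frame.
base = np.full((2, 2), 0.5)
frames = np.stack([base, base + 0.02, base - 0.01])
flicker = temporal_flicker(frames)   # roughly [0.02, 0.03]
```

For a locked-off shot of a static scene, any nonzero value here is estimation noise rather than real geometry.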

anelo addresses this with a dedicated temporal smoothing stage that runs after depth estimation. It uses bilateral filtering to enforce consistency between adjacent frames while preserving genuine depth changes during camera motion and scene cuts. The smoothing strength is configurable: documentaries typically need less smoothing than action films because their camera motion tends to be slower.
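The general idea of a bilateral filter applied along the time axis can be sketched as follows. This is not anelo's implementation: the function name and the parameters `radius`, `sigma_t`, and `sigma_d` are assumptions chosen for illustration. Each pixel is averaged with the same pixel in neighboring frames, weighted by how close those frames are in time and by how similar their depth values are.

```python
import numpy as np

def temporal_bilateral(frames: np.ndarray, radius: int = 1,
                       sigma_t: float = 1.0, sigma_d: float = 0.05) -> np.ndarray:
    """Smooth depth maps of shape (T, H, W) along the time axis.

    sigma_t controls how quickly weight falls off with frame distance;
    sigma_d controls how different two depth values may be before the
    neighbor is effectively ignored (this is what preserves cuts).
    """
    frames = frames.astype(np.float64)
    t_count = frames.shape[0]
    out = np.empty_like(frames)
    for t in range(t_count):
        lo, hi = max(0, t - radius), min(t_count, t + radius + 1)
        window = frames[lo:hi]                                   # (K, H, W)
        dt = np.arange(lo, hi) - t                               # frame offsets
        w_time = np.exp(-(dt ** 2) / (2 * sigma_t ** 2))[:, None, None]
        w_depth = np.exp(-((window - frames[t]) ** 2) / (2 * sigma_d ** 2))
        weights = w_time * w_depth                               # center frame always has weight 1
        out[t] = (weights * window).sum(axis=0) / weights.sum(axis=0)
    return out

# One pixel over eight frames: small jitter, then a scene cut to a new depth.
values = [0.50, 0.52, 0.49, 0.51, 0.90, 0.92, 0.89, 0.91]
frames = np.array(values).reshape(-1, 1, 1)
smoothed = temporal_bilateral(frames)
```

In this sketch, raising `sigma_d` or `radius` smooths more aggressively, at the risk of blending depth across real changes. The depth-similarity weight is what keeps the jump at the cut intact while the jitter on either side is averaged away.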

The key insight: depth accuracy and temporal consistency are separate problems. A model can be excellent at one and terrible at the other. anelo treats them as separate stages because they require different solutions — and because you need to tune them independently.