
Depth estimation basics in Computer Vision - Model Metrics & Evaluation

Metrics & Evaluation - Depth estimation basics
Which metric matters for depth estimation and WHY

Depth estimation predicts how far each point in an image is from the camera. To evaluate a model, we measure how close the predicted depths are to the true depths.

Common metrics include:

  • Mean Absolute Error (MAE): Average of absolute differences between predicted and true depths. Lower is better.
  • Root Mean Squared Error (RMSE): Square root of average squared differences. Penalizes big mistakes more.
  • Threshold Accuracy: Percentage of pixels where max(pred/true, true/pred) is below a threshold (commonly 1.25), meaning the prediction is within that ratio of the true depth. Higher is better.

These metrics tell us how accurate and reliable the depth predictions are.

Confusion matrix or equivalent visualization

Depth estimation is a regression task, so a confusion matrix does not apply. Instead, we compare predicted and true values directly and summarize the errors:

True Depth:      [2.0, 3.5, 1.0, 4.0]
Predicted Depth: [2.1, 3.0, 1.2, 5.0]

Errors (abs):    [0.1, 0.5, 0.2, 1.0]
MAE = (0.1+0.5+0.2+1.0)/4 = 0.45
RMSE = sqrt((0.1**2 + 0.5**2 + 0.2**2 + 1.0**2)/4) ≈ 0.57

Threshold accuracy @1.25:
Check if max(pred/true, true/pred) < 1.25
Values: [1.05, 1.17, 1.20, 1.25]
3 out of 4 pass (the last ratio is exactly 1.25, which is not below 1.25) -> 75% accuracy
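
The worked example above can be reproduced with NumPy. This is a minimal sketch: the function name `depth_metrics` and its `ratio` parameter are illustrative choices, not part of any library, and it assumes all depth values are positive.

```python
import numpy as np

def depth_metrics(pred, true, ratio=1.25):
    """Return (MAE, RMSE, threshold accuracy) for positive depth values."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    abs_err = np.abs(pred - true)
    mae = abs_err.mean()
    rmse = np.sqrt((abs_err ** 2).mean())
    # Symmetric ratio: treats over- and under-estimation the same way.
    max_ratio = np.maximum(pred / true, true / pred)
    threshold_acc = (max_ratio < ratio).mean()
    return mae, rmse, threshold_acc

true_depth = [2.0, 3.5, 1.0, 4.0]
pred_depth = [2.1, 3.0, 1.2, 5.0]
mae, rmse, acc = depth_metrics(pred_depth, true_depth)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  acc@1.25={acc:.0%}")
```

Note the strict `<` comparison: the last pixel has a ratio of exactly 1.25, so it counts as a miss.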
    
Precision vs Recall tradeoff (or equivalent) with concrete examples

Depth estimation does not use precision and recall because it is not classification. Instead, we balance:

  • Small average error (MAE/RMSE): Means predictions are close on average.
  • High threshold accuracy: Means most predictions are close enough within a tolerance.

For example, a robot navigating a room needs depth predictions that are mostly accurate (high threshold accuracy) to avoid obstacles safely.

If the model has a low average error but a few very large mistakes, it might be risky, because a single badly wrong pixel can hide an obstacle. If it has high threshold accuracy but a slightly higher average error, it might be safer.
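
To make this trade-off concrete, here is a small sketch with made-up data. Model A is perfect on nine pixels and badly wrong on one; model B is slightly off everywhere. The helper `summarize` and both prediction arrays are hypothetical.

```python
import numpy as np

true_depth = np.full(10, 2.0)  # ten pixels, all 2.0 m away (made-up data)

# Model A: nine perfect pixels and one large miss.
pred_a = true_depth.copy()
pred_a[0] = 5.0

# Model B: every pixel is off by 0.4 m, none by more.
pred_b = true_depth + 0.4

def summarize(pred, true, ratio=1.25):
    """Collect average-error, threshold, and worst-case statistics."""
    abs_err = np.abs(pred - true)
    max_ratio = np.maximum(pred / true, true / pred)
    return {
        "mae": abs_err.mean(),
        "rmse": np.sqrt((abs_err ** 2).mean()),
        "acc": (max_ratio < ratio).mean(),
        "worst": abs_err.max(),
    }

stats_a = summarize(pred_a, true_depth)
stats_b = summarize(pred_b, true_depth)
for name, stats in (("A", stats_a), ("B", stats_b)):
    print(name, {key: round(float(value), 3) for key, value in stats.items()})
```

Model A wins on MAE (0.3 m against 0.4 m), yet its worst error is 3 m. Model B keeps every pixel within the 1.25 ratio. Reporting the worst-case error alongside the averages makes this difference visible.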

What "good" vs "bad" metric values look like for depth estimation

Good values (rules of thumb; what counts as good depends on the depth range of your scenes):

  • MAE less than 0.1 meters (small average error)
  • RMSE less than 0.15 meters (few large errors)
  • Threshold accuracy @1.25 above 90% (most predictions close)

Bad values:

  • MAE above 0.5 meters (large average error)
  • RMSE above 0.7 meters (many big mistakes)
  • Threshold accuracy @1.25 below 60% (many predictions far off)

Good metrics mean the model can reliably tell how far things are. Bad metrics mean the model is often wrong and not useful.

Common pitfalls in depth estimation metrics
  • Ignoring scale differences: Some models predict depth that is correct only up to an unknown scale factor, while MAE and RMSE compare absolute values. Unaligned predictions then score badly even when the relative structure is right.
  • Using only average error: An average can hide big mistakes when most predictions are close but a few are very wrong.
  • Data leakage: Testing on images very similar to the training images can give overly optimistic metrics.
  • Overfitting: The model performs well on training data but poorly on new scenes. The training metrics look good, but the model does not generalize.
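
One common way to handle the scale pitfall is median scaling: multiply the prediction by the ratio of the ground-truth median to the predicted median before computing metrics. This technique is not described in the text above; it is shown here as one possible approach. The helper `median_scale` and the data are hypothetical.

```python
import numpy as np

def median_scale(pred, true):
    """Rescale pred so its median matches the median of true."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return pred * (np.median(true) / np.median(pred))

true_depth = np.array([2.0, 3.5, 1.0, 4.0])
# Hypothetical model output: correct structure, but half the true scale.
pred_relative = true_depth / 2.0

mae_raw = np.abs(pred_relative - true_depth).mean()
mae_aligned = np.abs(median_scale(pred_relative, true_depth) - true_depth).mean()
print(f"MAE before alignment: {mae_raw:.4f}")
print(f"MAE after alignment:  {mae_aligned:.4f}")
```

Because the alignment uses the ground truth, results computed this way should be reported as scale-aligned and not compared directly with absolute-scale results.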
Self-check question

Your depth estimation model has 98% threshold accuracy @1.25 but an RMSE of 1.5 meters. Is it good for real-world use?

Answer: No. The high threshold accuracy means 98% of predictions are close, but an RMSE of 1.5 meters shows that the remaining predictions have very large errors. These big mistakes can cause problems in applications like robot navigation. The model needs improvement to reduce large errors.
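
The numbers in this question can occur together. The sketch below uses made-up data: 100 pixels, of which 98 are predicted exactly and 2 are badly wrong. Comparing the median and maximum absolute error exposes the heavy tail that the threshold accuracy hides.

```python
import numpy as np

# Made-up data: 98 pixels at 2.0 m predicted exactly,
# 2 pixels at 12.0 m predicted as 1.4 m.
true_depth = np.full(100, 2.0)
pred_depth = np.full(100, 2.0)
true_depth[:2] = 12.0
pred_depth[:2] = 1.4

abs_err = np.abs(pred_depth - true_depth)
max_ratio = np.maximum(pred_depth / true_depth, true_depth / pred_depth)

acc = (max_ratio < 1.25).mean()
rmse = np.sqrt((abs_err ** 2).mean())
print(f"acc@1.25 = {acc:.0%}")
print(f"RMSE     = {rmse:.2f} m")
print(f"median abs error = {np.median(abs_err):.2f} m")
print(f"max abs error    = {abs_err.max():.2f} m")
```

A median error of zero next to a maximum error above 10 m is the signature of a few catastrophic pixels, which is the situation the answer warns about.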

Key Result
Depth estimation quality is best judged by low average errors (MAE, RMSE) and high threshold accuracy, balancing overall closeness and avoiding large mistakes.

Practice

1. What is the main goal of depth estimation in computer vision?
easy
A. To find how far objects are from the camera in an image
B. To detect colors in an image
C. To recognize faces in a photo
D. To increase image resolution

Solution

  1. Step 1: Understand depth estimation purpose

    Depth estimation aims to measure distance from the camera to objects in an image.
  2. Step 2: Compare options to definition

    Only option A, "To find how far objects are from the camera in an image", matches this goal; the others describe different tasks.
  3. Final Answer:

    To find how far objects are from the camera in an image -> Option A
  4. Quick Check:

    Depth estimation = distance measurement [OK]
Hint: Depth estimation = measuring distance in images [OK]
Common Mistakes:
  • Confusing depth estimation with object detection
  • Thinking it finds colors or faces
  • Mixing it with image enhancement
2. Which of the following is the correct way to represent a depth map in Python using NumPy?
easy
A. depth_map = np.array([[0.5, 1.2], [2.3, 0.7]])
B. depth_map = np.array(["near", "far"])
C. depth_map = np.array([["red", "blue"], ["green", "yellow"]])
D. depth_map = np.array([True, False])

Solution

  1. Step 1: Identify valid depth map data type

    Depth maps store distances as numbers (floats), so arrays with floats are correct.
  2. Step 2: Check options for numeric arrays

    depth_map = np.array([[0.5, 1.2], [2.3, 0.7]]) uses floats in a 2D array, suitable for depth maps. Others use strings or booleans, which are incorrect.
  3. Final Answer:

    depth_map = np.array([[0.5, 1.2], [2.3, 0.7]]) -> Option A
  4. Quick Check:

    Depth map = numeric 2D array [OK]
Hint: Depth maps store numbers, not words or booleans [OK]
Common Mistakes:
  • Using strings instead of numbers for depth values
  • Confusing color or label arrays with depth maps
  • Using 1D arrays instead of 2D for images
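
As a quick illustration of the answer, the sketch below builds the 2D array from option A and inspects it. Treating the values as meters is an assumption; the question does not state a unit.

```python
import numpy as np

# 2x2 depth map from option A; the unit (meters) is assumed.
depth_map = np.array([[0.5, 1.2], [2.3, 0.7]], dtype=np.float32)

print(depth_map.shape)  # (height, width)
print(depth_map.dtype)

nearest = depth_map.min()
# Row and column of the farthest pixel.
farthest_rc = np.unravel_index(depth_map.argmax(), depth_map.shape)
print(f"nearest depth: {nearest:.1f}, farthest pixel at {farthest_rc}")
```

Each entry holds the distance for one pixel, so the array has two axes (rows and columns) and a floating-point dtype.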
3. Given this Python code snippet using a depth estimation model, what will be the shape of the output depth map?
import numpy as np
input_image = np.zeros((480, 640, 3))  # RGB image
output_depth = model.predict(input_image)  # model: an already-loaded depth model (not defined here)
print(output_depth.shape)
Assume the model outputs a single-channel depth map with the same height and width as the input image.
medium
A. (480, 640, 3)
B. (3, 480, 640)
C. (640, 480)
D. (480, 640)

Solution

  1. Step 1: Understand input and output shapes

    The input is a color image with shape (480, 640, 3). The model outputs one depth value per pixel, so the shape should be (480, 640).
  2. Step 2: Match output shape to depth map format

    Depth maps usually have height and width only, no color channels, so (480, 640) is correct.
  3. Final Answer:

    (480, 640) -> Option D
  4. Quick Check:

    Depth map shape = height x width [OK]
Hint: Depth maps have one channel, so shape drops color dimension [OK]
Common Mistakes:
  • Assuming output keeps 3 color channels
  • Swapping height and width dimensions
  • Confusing channel order in output
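
The snippet in this question cannot run on its own because `model` is undefined. The sketch below substitutes a stand-in class, `DummyDepthModel`, which is purely hypothetical and only mimics the shape behavior the question assumes. Real models may behave differently, for example by keeping a trailing channel axis.

```python
import numpy as np

class DummyDepthModel:
    """Hypothetical stand-in for a depth model; not a real library API."""

    def predict(self, image):
        # Collapse the color axis to get one value per pixel.
        return image.mean(axis=-1)

model = DummyDepthModel()
input_image = np.zeros((480, 640, 3))  # RGB image
output_depth = model.predict(input_image)
print(output_depth.shape)

# Some models return (height, width, 1) instead; squeeze drops that axis.
with_channel = output_depth[..., np.newaxis]
squeezed = np.squeeze(with_channel, axis=-1)
print(with_channel.shape, "->", squeezed.shape)
```

Checking `output_depth.shape` right after prediction is a cheap way to catch a swapped or leftover axis before it affects the metrics.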
4. You run a depth estimation model but get an error: ValueError: input must be 4D tensor. What is the most likely cause?
medium
A. Model weights are not loaded
B. Output depth map has wrong shape
C. Input image is missing batch dimension
D. Input image has wrong color format

Solution

  1. Step 1: Understand model input requirements

    Many deep learning models expect input as 4D tensors, for example (batch_size, height, width, channels). Some frameworks use a channels-first layout instead.
  2. Step 2: Identify cause of ValueError

    If input is a single image (3D), missing batch dimension causes this error.
  3. Final Answer:

    Input image is missing batch dimension -> Option C
  4. Quick Check:

    4D input = batch + image dims [OK]
Hint: Add batch dimension to input shape before model call [OK]
Common Mistakes:
  • Ignoring batch dimension requirement
  • Blaming model weights or output shape
  • Confusing color format with tensor shape
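
The fix can be sketched with NumPy. The function `check_input` below is a hypothetical stand-in for the shape check a model might perform; it is not a real library call. The channels-last layout is also an assumption.

```python
import numpy as np

def check_input(batch):
    """Hypothetical stand-in for a model's input check."""
    if batch.ndim != 4:
        raise ValueError("input must be 4D tensor")
    return batch.shape

image = np.zeros((480, 640, 3), dtype=np.float32)  # single image, 3D

try:
    check_input(image)
except ValueError as err:
    print("error:", err)

# Add a batch axis of size 1 at the front.
batch = np.expand_dims(image, axis=0)
print(check_input(batch))
```

After prediction, indexing the result with `[0]` removes the batch axis again when only one image was passed in.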
5. You want to improve depth estimation accuracy for a robot navigating indoors. Which approach is best?
hard
A. Use a single camera and increase image resolution only
B. Use stereo cameras and combine their images for depth
C. Use random noise as input to the model
D. Ignore depth and rely on color detection

Solution

  1. Step 1: Consider methods to improve depth accuracy

    Stereo cameras capture two views, allowing better depth calculation by comparing images.
  2. Step 2: Evaluate options for robot navigation

    Option B uses stereo vision, which can recover depth geometrically by comparing two views. Increasing resolution alone (A) helps little. Noise input (C) and ignoring depth (D) are ineffective.
  3. Final Answer:

    Use stereo cameras and combine their images for depth -> Option B
  4. Quick Check:

    Stereo vision = better depth accuracy [OK]
Hint: Stereo cameras give real depth by comparing two views [OK]
Common Mistakes:
  • Thinking higher resolution alone improves depth
  • Using noise as input to improve model
  • Ignoring depth for color detection
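
For a rectified stereo pair, depth follows from disparity as depth = focal_length * baseline / disparity. The sketch below applies that relation. The function name `disparity_to_depth`, the focal length, the baseline, and the disparity values are all made up for illustration.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (meters) for a rectified pair."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    # Zero or negative disparity has no valid depth; mark it as infinite.
    depth = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Made-up disparity map and camera parameters.
disparity = np.array([[70.0, 35.0], [14.0, 0.0]])
depth_m = disparity_to_depth(disparity, focal_px=700.0, baseline_m=0.1)
print(depth_m)
```

Larger disparity means a closer object. Because depth is inversely proportional to disparity, a fixed disparity error causes a larger depth error for distant points than for near ones.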