
MediaPipe Pose in Computer Vision - Model Pipeline Trace

Model Pipeline - MediaPipe Pose

MediaPipe Pose is a machine learning pipeline that detects human body landmarks in images or videos. It tracks 33 key points on the body to understand pose and movement in real time.

Data Flow - 5 Stages
Stage 1: Input Image
Operation: Capture or load a color image frame.
Input: 1 frame x 1920 x 1080 pixels (width x height) x 3 color channels
Output: 1 frame x 1920 x 1080 pixels x 3 color channels (unchanged)
Example: A photo of a person standing in front of a plain background.
Stage 2: Preprocessing
Operation: Resize the image to the model's input size and normalize the pixel values.
Input: 1 frame x 1920 x 1080 x 3
Output: 1 frame x 256 x 256 x 3
Example: The image is resized to 256 x 256 pixels, with pixel values scaled to the range 0 to 1.
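A minimal pure-Python sketch of this stage, assuming the frame is stored as nested lists indexed `frame[row][col][channel]` with 8-bit values (0 to 255) and resized by plain nearest-neighbor stretching. Row-major image arrays are usually stored height first, so the 1920 x 1080 frame becomes 1080 rows of 1920 pixels. The function name `preprocess` is hypothetical, and MediaPipe's own preprocessing differs in its details; this only illustrates "resize, then scale to [0, 1]".

```python
# Hypothetical preprocessing sketch (not MediaPipe's code).
# Assumptions: frame[row][col][channel] with 0-255 values; nearest-neighbor
# stretch to a square, which ignores the original aspect ratio.

def preprocess(frame, out_size=256):
    """Resize an H x W x 3 frame to out_size x out_size x 3 and scale values to [0, 1]."""
    in_h = len(frame)        # number of rows (height)
    in_w = len(frame[0])     # number of columns (width)
    resized = []
    for r in range(out_size):
        src_r = r * in_h // out_size          # nearest source row
        row = []
        for c in range(out_size):
            src_c = c * in_w // out_size      # nearest source column
            pixel = frame[src_r][src_c]
            row.append([v / 255.0 for v in pixel])  # 0-255 -> 0.0-1.0
        resized.append(row)
    return resized


if __name__ == "__main__":
    # Synthetic 1080-row x 1920-column mid-gray frame.
    frame = [[[128, 128, 128]] * 1920 for _ in range(1080)]
    out = preprocess(frame)
    print(len(out), len(out[0]), len(out[0][0]))  # 256 256 3
```

Stretching a 16:9 frame to a square distorts body proportions; a real pipeline would typically preserve the aspect ratio, for example by padding, before resizing.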
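Stage 3: Pose Landmark Detection Model
Operation: Run the neural network to predict 33 body landmarks.
Input: 1 frame x 256 x 256 x 3
Output: 1 frame x 33 landmarks x 3 values (x, y, visibility)
Example: (0.45, 0.60, 0.98) for the right shoulder landmark. Here x and y are normalized to [0, 1] by the image width and height, and visibility is a 0-to-1 score for how likely the landmark is to be visible. MediaPipe's landmarks also carry a z (depth) value, which this trace leaves out.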
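The 33 x 3 output is easiest to picture as a list of (x, y, visibility) triples. The sketch below assumes a hypothetical flat layout of 99 numbers, three per landmark in landmark order, and groups them; `to_landmarks` is an illustrative name, not a MediaPipe function. Index 12 is the right shoulder in MediaPipe's landmark numbering.

```python
# Hypothetical sketch: grouping a flat model output into landmark triples.
# Assumption: 99 values laid out as x, y, visibility for landmarks 0..32 in order.

NUM_LANDMARKS = 33
VALUES_PER_LANDMARK = 3   # x, y, visibility (z omitted, as in this trace)
RIGHT_SHOULDER = 12       # MediaPipe landmark index for the right shoulder


def to_landmarks(flat):
    """Group a flat output vector into a list of (x, y, visibility) tuples."""
    expected = NUM_LANDMARKS * VALUES_PER_LANDMARK
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return [
        tuple(flat[i:i + VALUES_PER_LANDMARK])
        for i in range(0, expected, VALUES_PER_LANDMARK)
    ]


if __name__ == "__main__":
    flat = [0.0] * 99
    flat[36:39] = [0.45, 0.60, 0.98]   # slots for landmark 12 (12 * 3 = 36)
    landmarks = to_landmarks(flat)
    print(len(landmarks), landmarks[RIGHT_SHOULDER])  # 33 (0.45, 0.6, 0.98)
```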
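Stage 4: Postprocessing
Operation: Map the normalized landmark coordinates back to the original image size by multiplying x by the image width and y by the image height.
Input: 1 frame x 33 landmarks x 3
Output: 1 frame x 33 landmarks x 3 (pixel x, pixel y, visibility)
Example: The right shoulder lands at pixel (864, 648) with visibility 0.98, since 0.45 x 1920 = 864 and 0.60 x 1080 = 648.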
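The mapping is a per-landmark multiplication. A minimal sketch, assuming x is normalized by the image width and y by the image height; the helper name `to_pixels` and the choice to round and clamp into the image bounds are illustrative assumptions, not MediaPipe behavior.

```python
# Hypothetical postprocessing sketch: normalized (x, y) -> pixel (x, y).

def to_pixels(landmark, width, height):
    """Convert (x, y, visibility) with x, y in [0, 1] to (pixel_x, pixel_y, visibility)."""
    x, y, visibility = landmark
    # Round to the nearest pixel, then clamp so points stay inside the image
    # (predicted landmarks can fall slightly outside [0, 1]).
    px = max(0, min(int(round(x * width)), width - 1))
    py = max(0, min(int(round(y * height)), height - 1))
    return px, py, visibility


if __name__ == "__main__":
    print(to_pixels((0.45, 0.60, 0.98), 1920, 1080))  # (864, 648, 0.98)
```

Clamping is one design choice; a drawing routine could instead skip landmarks that fall outside the frame.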
Stage 5: Output Visualization
Operation: Draw the landmarks and their connections on the original image.
Input: 1 frame x 33 landmarks x 3
Output: 1 frame x 1920 x 1080 x 3 with overlay
Example: The image shows dots and lines over the person's body joints.
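Before anything is drawn, the pipeline has to decide which skeleton lines to render. This sketch uses a hand-picked upper-body subset of connections whose indices follow MediaPipe's landmark map (11/12 shoulders, 13/14 elbows, 15/16 wrists, 23/24 hips); the list name `CONNECTIONS`, the function `segments_to_draw`, and the 0.5 visibility threshold are assumptions for illustration.

```python
# Hypothetical visualization-prep sketch: pick line segments to draw.
# Only an upper-body subset of the skeleton is listed here.

CONNECTIONS = [
    (11, 12),                       # left shoulder - right shoulder
    (11, 13), (13, 15),             # left shoulder - left elbow - left wrist
    (12, 14), (14, 16),             # right shoulder - right elbow - right wrist
    (11, 23), (12, 24), (23, 24),   # torso: shoulders to hips, hip to hip
]


def segments_to_draw(pixel_landmarks, min_visibility=0.5):
    """Return ((x1, y1), (x2, y2)) segments whose endpoints are both visible enough.

    pixel_landmarks: sequence of 33 (pixel_x, pixel_y, visibility) tuples.
    """
    segments = []
    for a, b in CONNECTIONS:
        ax, ay, av = pixel_landmarks[a]
        bx, by, bv = pixel_landmarks[b]
        if av >= min_visibility and bv >= min_visibility:
            segments.append(((ax, ay), (bx, by)))
    return segments
```

Each returned segment could then be rendered with a drawing call such as OpenCV's `cv2.line`, and each landmark with `cv2.circle`. Skipping low-visibility endpoints avoids drawing limbs the model is unsure about.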
Training Trace - Epoch by Epoch

Figure: loss curve, with loss (0.0 to 2.5) on the vertical axis and epochs on the horizontal axis, falling steadily as training progresses.
Epoch | Loss ↓ | Accuracy ↑ | Observation
1 | 2.50 | 0.30 | Initial training with high loss and low accuracy
5 | 1.20 | 0.55 | Loss decreased, accuracy improving as model learns body shapes
10 | 0.70 | 0.75 | Model captures pose landmarks more accurately
15 | 0.40 | 0.85 | Good convergence, landmarks detected reliably
20 | 0.25 | 0.92 | Final epoch with low loss and high accuracy
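The trace does not say which loss or accuracy metric produced these numbers. As an assumption for illustration only, the sketch below uses mean squared error over normalized landmark coordinates as the loss, and a PCK-style accuracy (the fraction of landmarks that land within a distance threshold of the ground truth). The function names and the 0.05 threshold are hypothetical.

```python
# Hypothetical metric sketch: MSE loss and a PCK-style accuracy on
# normalized (x, y) landmark coordinates. Not MediaPipe's training code.
import math


def mse_loss(pred, target):
    """Mean squared error over every x and y value of every landmark."""
    total = 0.0
    count = 0
    for (px, py), (tx, ty) in zip(pred, target):
        total += (px - tx) ** 2 + (py - ty) ** 2
        count += 2
    return total / count


def pck_accuracy(pred, target, threshold=0.05):
    """Fraction of landmarks within `threshold` (normalized units) of the target."""
    hits = sum(
        1
        for (px, py), (tx, ty) in zip(pred, target)
        if math.hypot(px - tx, py - ty) <= threshold
    )
    return hits / len(target)


if __name__ == "__main__":
    pred = [(0.5, 0.5), (0.0, 0.0)]
    target = [(0.5, 0.5), (0.0, 1.0)]
    print(mse_loss(pred, target), pck_accuracy(pred, target))  # 0.25 0.5
```

Under these assumptions, falling loss and rising accuracy in the table would mean predicted landmarks moving closer to the labeled ones.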
Prediction Trace - 4 Layers
Layer 1: Input Image
Layer 2: Neural Network Forward Pass
Layer 3: Coordinate Mapping
Layer 4: Visualization Overlay
Model Quiz
Test your understanding
What is the shape of the model's output after landmark detection?
A. 1920 x 1080 x 3 image
B. 256 x 256 x 3 image tensor
C. 33 landmarks x 3 coordinates
D. 1 landmark x 2 coordinates
Key Insight
MediaPipe Pose resizes each input image to the model's input size, runs a neural network that predicts 33 body landmarks as normalized coordinates with visibility scores, and maps those coordinates back to the original image. Over training, loss falls and accuracy rises, which is what lets the model estimate poses reliably in real time.