
Distance metrics (euclidean, cosine, manhattan) in SciPy - Deep Dive

Overview - Distance metrics (euclidean, cosine, manhattan)
What is it?
Distance metrics are ways to measure how far apart two points or objects are. Euclidean distance measures the straight line between points, like using a ruler. Cosine distance measures how different the directions of two points are, ignoring their size. Manhattan distance adds up the absolute differences along each dimension, like walking city blocks.
Why it matters
Distance metrics help computers understand similarity or difference between data points. Without them, tasks like finding similar images, grouping customers, or recommending products would be guesswork. They turn raw numbers into meaningful comparisons that power search, clustering, and machine learning.
Where it fits
Before learning distance metrics, you should know basic math and vectors. After this, you can explore clustering algorithms, nearest neighbor search, and recommendation systems that use these distances to find patterns.
Mental Model
Core Idea
Distance metrics quantify how close or far two points are in different ways depending on the context and data shape.
Think of it like...
Imagine you want to get from your home to a friend's house. Euclidean distance is like flying straight there, Manhattan distance is like walking along streets in a grid city, and cosine distance is like comparing the direction you face rather than how far you walk.
Distance Metrics Visualization

Points: A and B

Euclidean (straight line):
A •─────────────• B

Manhattan (grid path):
A ──┐
    │
    └── B

Cosine (angle between vectors):
        B
       /
      / θ
   O ────────> A
Build-Up - 7 Steps
1
Foundation: Understanding points and vectors
Concept: Learn what points and vectors are in space as the basis for measuring distance.
A point is a position in space, like (x, y) on a map. A vector is like an arrow from the origin (0,0) to that point. We use vectors to represent data points in many dimensions.
Result
You can represent any data point as a list of numbers (coordinates).
Understanding points as vectors lets you apply math operations to measure distance and similarity.
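A minimal sketch of this idea, using NumPy (which SciPy builds on): the point (3, 4) becomes a vector, and vector math immediately gives us its length.

```python
import numpy as np

# The point (3, 4) represented as a vector: an arrow from the origin (0, 0)
a = np.array([3.0, 4.0])

# Its length follows from the Pythagorean theorem: sqrt(3**2 + 4**2)
print(np.linalg.norm(a))  # 5.0
```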
2
Foundation: Basic idea of distance
Concept: Distance is a number that tells how far apart two points are.
The simplest distance is the straight line between two points, called Euclidean distance. For points (x1, y1) and (x2, y2), it is sqrt((x2-x1)**2 + (y2-y1)**2).
Result
You get a single number representing how far apart two points are.
Distance turns spatial relationships into numbers computers can use.
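The 2D formula above translates directly into a few lines of Python; the points here are just illustrative values forming a 3-4-5 right triangle.

```python
import math

def euclidean_2d(p, q):
    # Straight-line distance: sqrt((x2-x1)**2 + (y2-y1)**2)
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

print(euclidean_2d((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```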
3
Intermediate: Euclidean distance in multiple dimensions
🤔 Before reading on: do you think Euclidean distance changes if we add more dimensions? Commit to your answer.
Concept: Euclidean distance generalizes to any number of dimensions by summing squared differences.
For points with coordinates (x1, x2, ..., xn) and (y1, y2, ..., yn), Euclidean distance is sqrt(sum((xi - yi)**2)). This works for 2D, 3D, or more.
Result
You can measure straight-line distance in any space dimension.
Knowing Euclidean distance scales to many dimensions is key for real-world data analysis.
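A quick sketch showing that the same function covers any number of dimensions; the example points are arbitrary.

```python
import math

def euclidean(p, q):
    # sqrt(sum((xi - yi)**2)) works in any number of dimensions
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([1, 2], [4, 6]))        # 5.0 in 2D
print(euclidean([1, 2, 3], [4, 5, 6]))  # about 5.196 in 3D (sqrt(27))
```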
4
Intermediate: Manhattan distance and grid paths
🤔 Before reading on: do you think Manhattan distance can be larger than Euclidean distance? Commit to your answer.
Concept: Manhattan distance sums absolute differences along each dimension, like walking city blocks.
For points (x1, x2, ..., xn) and (y1, y2, ..., yn), Manhattan distance is sum(|xi - yi|). It measures distance if you can only move along axes, not diagonally.
Result
You get a distance that reflects grid-like movement constraints.
Manhattan distance models scenarios where movement is restricted to fixed directions.
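A small sketch of the grid-walk formula; the same pair of points as before gives a Manhattan distance of 7 blocks versus a straight-line distance of 5.

```python
def manhattan(p, q):
    # Sum of absolute differences along each axis
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# 3 blocks east plus 4 blocks north: 7 blocks total,
# compared with a straight-line (Euclidean) distance of 5
print(manhattan([1, 2], [4, 6]))  # 7
```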
5
Intermediate: Cosine distance and angle similarity
🤔 Before reading on: does cosine distance depend on the length of vectors or just their direction? Commit to your answer.
Concept: Cosine distance measures how different the directions of two vectors are, ignoring their length.
Cosine similarity is the cosine of the angle between two vectors: (A·B) / (|A||B|). Cosine distance = 1 - cosine similarity. It focuses on orientation, not magnitude.
Result
You get a measure of how aligned two points are in direction.
Cosine distance is useful when magnitude is less important than pattern or trend.
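A sketch of the formula, assuming neither vector is all zeros (the zero vector has no direction). Stretching a vector does not change the result, which is exactly the point.

```python
import math

def cosine_distance(a, b):
    # 1 - (A·B) / (|A||B|); assumes neither vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# [2, 4] is [1, 2] stretched to twice the length: same direction,
# so the cosine distance is zero (up to floating-point error)
print(cosine_distance([1, 2], [2, 4]))
# Perpendicular vectors give a cosine distance of 1
print(cosine_distance([1, 0], [0, 1]))
```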
6
Advanced: Using SciPy for distance calculations
🤔 Before reading on: do you think SciPy has built-in functions for all these distances? Commit to your answer.
Concept: SciPy provides ready-to-use functions to compute Euclidean, Manhattan, and cosine distances efficiently.
Example code:

from scipy.spatial import distance

point1 = [1, 2, 3]
point2 = [4, 5, 6]

# Euclidean
print(distance.euclidean(point1, point2))
# Manhattan (called "city block" in scipy)
print(distance.cityblock(point1, point2))
# Cosine
print(distance.cosine(point1, point2))
Result
You get numeric distance values printed for each metric.
Using scipy saves time and avoids errors by providing optimized distance functions.
7
Expert: Choosing distance metrics for real data
🤔 Before reading on: do you think one distance metric fits all data types? Commit to your answer.
Concept: Different distance metrics suit different data shapes, scales, and tasks; choosing wisely affects results.
Euclidean works well for continuous, equally scaled data. Manhattan is robust to outliers and grid-like data. Cosine is best for text or high-dimensional sparse data where direction matters more than magnitude. Scaling and normalization also affect metric choice.
Result
You understand how metric choice impacts clustering, classification, and similarity search.
Knowing when and why to pick each metric prevents misleading analysis and improves model performance.
Under the Hood
Distance metrics compute numeric summaries of differences between vectors by applying mathematical formulas. Euclidean sums squared coordinate differences and takes a square root, reflecting geometric distance. Manhattan sums absolute differences, reflecting grid travel. Cosine computes the dot product normalized by vector lengths, capturing angular difference. Internally, these use vectorized operations for speed.
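The three formulas just described can be written directly as vectorized NumPy expressions; each agrees with the corresponding scipy function. The example vectors are arbitrary.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Vectorized versions of the three formulas
euclid = np.sqrt(np.sum((a - b) ** 2))
manhat = np.sum(np.abs(a - b))
cos_d = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Each matches the scipy implementation
print(np.isclose(euclid, distance.euclidean(a, b)))  # True
print(np.isclose(manhat, distance.cityblock(a, b)))  # True
print(np.isclose(cos_d, distance.cosine(a, b)))      # True
```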
Why designed this way?
These metrics were designed to capture different notions of similarity relevant to various fields: Euclidean for geometry, Manhattan for urban planning and grid layouts, and Cosine for text analysis and directional data. Alternatives exist but these balance simplicity, interpretability, and computational efficiency.
Distance Metrics Internal Flow

Input Vectors A, B
   │
   ├─> Euclidean: sum((A_i - B_i)**2) → sqrt → Distance
   │
   ├─> Manhattan: sum(|A_i - B_i|) → Distance
   │
   └─> Cosine: dot(A,B) / (|A||B|) → 1 - similarity → Distance
Myth Busters - 3 Common Misconceptions
Quick: Is cosine distance affected by how long the vectors are? Commit yes or no.
Common Belief: Cosine distance measures how far apart two points are in space, just like Euclidean distance.
Reality: Cosine distance measures the angle between vectors, ignoring their length, so two points far apart but in the same direction have zero cosine distance.
Why it matters: Confusing cosine with Euclidean distance can cause wrong similarity judgments, especially in text or high-dimensional data.
Quick: Can Manhattan distance ever be smaller than Euclidean distance? Commit yes or no.
Common Belief: Manhattan distance is always smaller than or equal to Euclidean distance.
Reality: Manhattan distance is always greater than or equal to Euclidean distance: for per-axis differences a and b, |a| + |b| >= sqrt(a**2 + b**2), with equality only when the points differ along a single axis.
Why it matters: Misunderstanding this can lead to wrong assumptions about data spread and clustering tightness.
Quick: Does scaling data affect all distance metrics equally? Commit yes or no.
Common Belief: Scaling data does not affect distance metrics since they measure relative differences.
Reality: Scaling changes Euclidean and Manhattan distances but not cosine distance, which is invariant to uniformly rescaling each vector (per-feature rescaling, by contrast, changes the angles and hence cosine distance too).
Why it matters: Ignoring scaling can cause misleading distances and poor model performance.
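A quick check of the scaling claim with arbitrary example vectors: multiplying both vectors by 10 scales Euclidean and Manhattan distances by 10 but leaves cosine distance unchanged.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

# Euclidean distance grows with the data...
print(distance.euclidean(a, b), distance.euclidean(10 * a, 10 * b))
# ...and so does Manhattan (cityblock)...
print(distance.cityblock(a, b), distance.cityblock(10 * a, 10 * b))
# ...but cosine distance is unchanged
print(np.isclose(distance.cosine(a, b), distance.cosine(10 * a, 10 * b)))  # True
```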
Expert Zone
1
Cosine distance is sensitive to zero vectors and requires careful handling of empty or zero-length data points.
2
Manhattan distance can be more robust to outliers than Euclidean because it does not square differences, reducing the impact of large deviations.
3
Euclidean distance assumes isotropic space; if features have different importance or scale, weighting or normalization is necessary.
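On the last point: scipy's distance functions accept an optional per-dimension weight vector `w`, which is one way to express that features differ in importance. The points and weights below are illustrative.

```python
from scipy.spatial import distance

p = [1.0, 2.0]
q = [4.0, 6.0]

# Unweighted: sqrt(3**2 + 4**2) = 5.0
print(distance.euclidean(p, q))

# Per-dimension weights via `w`: sqrt(1.0 * 9 + 0.25 * 16) = sqrt(13)
print(distance.euclidean(p, q, w=[1.0, 0.25]))  # about 3.606
```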
When NOT to use
Avoid Euclidean distance on data with different scales or non-continuous features; use normalized or weighted distances instead. Do not use cosine distance on data where magnitude matters, such as physical measurements. Manhattan distance is less suitable for continuous smooth spaces where diagonal movement is possible.
Production Patterns
In production, Euclidean distance is common in image and sensor data analysis. Cosine distance is widely used in text mining and recommendation systems. Manhattan distance appears in grid-based pathfinding and some clustering algorithms. Often, distance computations are optimized with vectorized libraries like scipy and combined with dimensionality reduction for efficiency.
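In practice, pairwise distances over whole datasets are computed in one vectorized call rather than point by point; a minimal sketch with `scipy.spatial.distance.cdist` and two tiny made-up point sets:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small point sets; cdist computes every pairwise distance at once
X = np.array([[1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0, 1.0], [2.0, 2.0]])

print(cdist(X, Y, metric='euclidean'))  # 2x2 matrix: X rows vs Y rows
print(cdist(X, Y, metric='cityblock'))
print(cdist(X, Y, metric='cosine'))
```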
Connections
Clustering algorithms
Distance metrics are the foundation for grouping similar data points in clustering.
Understanding distance metrics helps grasp how clusters form and why different algorithms behave differently.
Vector space model in information retrieval
Cosine distance is directly used to measure document similarity in vector space models.
Knowing cosine distance clarifies how search engines rank documents by relevance.
Urban planning and navigation
Manhattan distance models real-world travel in grid-like city streets.
Recognizing this connection shows how abstract math models practical movement constraints.
Common Pitfalls
#1: Using Euclidean distance on unscaled data with different units.
Wrong approach:

from scipy.spatial import distance

point1 = [1, 1000]
point2 = [2, 2000]
print(distance.euclidean(point1, point2))

Correct approach:

from sklearn.preprocessing import StandardScaler
from scipy.spatial import distance

scaler = StandardScaler()
data = scaler.fit_transform([[1, 1000], [2, 2000]])
print(distance.euclidean(data[0], data[1]))

Root cause: Different feature scales cause one dimension to dominate Euclidean distance, misleading similarity.
#2: Using cosine distance on zero vectors without handling.
Wrong approach:

from scipy.spatial import distance

point1 = [0, 0, 0]
point2 = [1, 2, 3]
print(distance.cosine(point1, point2))

Correct approach:

import numpy as np
from scipy.spatial import distance

point1 = np.array([0, 0, 0])
point2 = np.array([1, 2, 3])

if np.linalg.norm(point1) == 0 or np.linalg.norm(point2) == 0:
    print('Cosine distance undefined for zero vector')
else:
    print(distance.cosine(point1, point2))

Root cause: Cosine distance requires non-zero vectors; a zero vector causes division by zero.
#3: Confusing Manhattan and Euclidean distance formulas.
Wrong approach:

point1 = [1, 2]
point2 = [4, 6]
# Wrong: this is the Euclidean formula, not Manhattan
manhattan = ((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2)**0.5
print(manhattan)

Correct approach:

from scipy.spatial import distance

point1 = [1, 2]
point2 = [4, 6]
manhattan = abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])
print(manhattan)
# Or let scipy do it:
print(distance.cityblock(point1, point2))

Root cause: Mixing formulas leads to incorrect distance values and wrong interpretations.
Key Takeaways
Distance metrics convert data points into numbers that express similarity or difference.
Euclidean distance measures straight-line distance and works best with continuous, scaled data.
Manhattan distance sums absolute differences and models grid-like movement or robust comparisons.
Cosine distance measures angle between vectors, focusing on direction rather than magnitude.
Choosing the right distance metric and preprocessing data properly is crucial for meaningful analysis.