
Distance metrics (euclidean, cosine, manhattan) in SciPy - Deep Dive

Overview - Distance metrics (euclidean, cosine, manhattan)
What is it?
Distance metrics are ways to measure how far apart two points or objects are. Euclidean distance measures the straight line between points, like using a ruler. Cosine distance measures how different the directions of two points are, ignoring their size. Manhattan distance adds up the absolute differences along each dimension, like walking city blocks.
Why it matters
Distance metrics help computers understand similarity or difference between data points. Without them, tasks like finding similar images, grouping customers, or recommending products would be guesswork. They turn raw numbers into meaningful comparisons that power search, clustering, and machine learning.
Where it fits
Before learning distance metrics, you should know basic math and vectors. After this, you can explore clustering algorithms, nearest neighbor search, and recommendation systems that use these distances to find patterns.
Mental Model
Core Idea
Distance metrics quantify how close or far two points are in different ways depending on the context and data shape.
Think of it like...
Imagine you want to get from your home to a friend's house. Euclidean distance is like flying straight there, Manhattan distance is like walking along streets in a grid city, and cosine distance is like comparing the direction you face rather than how far you walk.
Distance Metrics Visualization

Points: A and B

Euclidean (straight line):
A •─────────────• B

Manhattan (grid path):
A ──┐
    │
    └── B

Cosine (angle between vectors):
        B
       /
      / θ
   O ────────> A
Build-Up - 7 Steps
1
Foundation: Understanding points and vectors
Concept: Learn what points and vectors are in space as the basis for measuring distance.
A point is a position in space, like (x, y) on a map. A vector is like an arrow from the origin (0,0) to that point. We use vectors to represent data points in many dimensions.
Result
You can represent any data point as a list of numbers (coordinates).
Understanding points as vectors lets you apply math operations to measure distance and similarity.
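A minimal sketch of this idea, using NumPy (which SciPy builds on): the point (3, 4) becomes a vector, and vector math immediately gives us its length.

```python
import numpy as np

# The point (3, 4) represented as a vector: an arrow from the origin (0, 0)
a = np.array([3.0, 4.0])

# Its length follows from the Pythagorean theorem: sqrt(3**2 + 4**2)
print(np.linalg.norm(a))  # 5.0
```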
2
Foundation: Basic idea of distance
Concept: Distance is a number that tells how far apart two points are.
The simplest distance is the straight line between two points, called Euclidean distance. For points (x1, y1) and (x2, y2), it is sqrt((x2-x1)**2 + (y2-y1)**2).
Result
You get a single number representing how far apart two points are.
Distance turns spatial relationships into numbers computers can use.
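The 2D formula above translates directly into a few lines of Python; the points here are just illustrative values forming a 3-4-5 right triangle.

```python
import math

def euclidean_2d(p, q):
    # Straight-line distance: sqrt((x2-x1)**2 + (y2-y1)**2)
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

print(euclidean_2d((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```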
3
Intermediate: Euclidean distance in multiple dimensions
🤔 Before reading on: do you think Euclidean distance changes if we add more dimensions? Commit to your answer.
Concept: Euclidean distance generalizes to any number of dimensions by summing squared differences.
For points with coordinates (x1, x2, ..., xn) and (y1, y2, ..., yn), Euclidean distance is sqrt(sum((xi - yi)**2)). This works for 2D, 3D, or more.
Result
You can measure straight-line distance in any space dimension.
Knowing Euclidean distance scales to many dimensions is key for real-world data analysis.
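A quick sketch showing that the same function covers any number of dimensions; the example points are arbitrary.

```python
import math

def euclidean(p, q):
    # sqrt(sum((xi - yi)**2)) works in any number of dimensions
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean([1, 2], [4, 6]))        # 5.0 in 2D
print(euclidean([1, 2, 3], [4, 5, 6]))  # about 5.196 in 3D (sqrt(27))
```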
4
Intermediate: Manhattan distance and grid paths
🤔 Before reading on: do you think Manhattan distance can be larger than Euclidean distance? Commit to your answer.
Concept: Manhattan distance sums absolute differences along each dimension, like walking city blocks.
For points (x1, x2, ..., xn) and (y1, y2, ..., yn), Manhattan distance is sum(|xi - yi|). It measures distance if you can only move along axes, not diagonally.
Result
You get a distance that reflects grid-like movement constraints.
Manhattan distance models scenarios where movement is restricted to fixed directions.
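A small sketch of the grid-walk formula; the same pair of points as before gives a Manhattan distance of 7 blocks versus a straight-line distance of 5.

```python
def manhattan(p, q):
    # Sum of absolute differences along each axis
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# 3 blocks east plus 4 blocks north: 7 blocks total,
# compared with a straight-line (Euclidean) distance of 5
print(manhattan([1, 2], [4, 6]))  # 7
```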
5
Intermediate: Cosine distance and angle similarity
🤔 Before reading on: does cosine distance depend on the length of vectors or just their direction? Commit to your answer.
Concept: Cosine distance measures how different the directions of two vectors are, ignoring their length.
Cosine similarity is the cosine of the angle between two vectors: (A·B) / (|A||B|). Cosine distance = 1 - cosine similarity. It focuses on orientation, not magnitude.
Result
You get a measure of how aligned two points are in direction.
Cosine distance is useful when magnitude is less important than pattern or trend.
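A sketch of the formula, assuming neither vector is all zeros (the zero vector has no direction). Stretching a vector does not change the result, which is exactly the point.

```python
import math

def cosine_distance(a, b):
    # 1 - (A·B) / (|A||B|); assumes neither vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# [2, 4] is [1, 2] stretched to twice the length: same direction,
# so the cosine distance is zero (up to floating-point error)
print(cosine_distance([1, 2], [2, 4]))
# Perpendicular vectors give a cosine distance of 1
print(cosine_distance([1, 0], [0, 1]))
```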
6
Advanced: Using SciPy for distance calculations
🤔 Before reading on: do you think SciPy has built-in functions for all these distances? Commit to your answer.
Concept: SciPy provides ready-to-use functions to compute Euclidean, Manhattan, and cosine distances efficiently.
Example code:

from scipy.spatial import distance

point1 = [1, 2, 3]
point2 = [4, 5, 6]

# Euclidean
print(distance.euclidean(point1, point2))
# Manhattan (called "city block" in scipy)
print(distance.cityblock(point1, point2))
# Cosine
print(distance.cosine(point1, point2))
Result
You get numeric distance values printed for each metric.
Using scipy saves time and avoids errors by providing optimized distance functions.
7
Expert: Choosing distance metrics for real data
🤔 Before reading on: do you think one distance metric fits all data types? Commit to your answer.
Concept: Different distance metrics suit different data shapes, scales, and tasks; choosing wisely affects results.
Euclidean works well for continuous, equally scaled data. Manhattan is robust to outliers and grid-like data. Cosine is best for text or high-dimensional sparse data where direction matters more than magnitude. Scaling and normalization also affect metric choice.
Result
You understand how metric choice impacts clustering, classification, and similarity search.
Knowing when and why to pick each metric prevents misleading analysis and improves model performance.
Under the Hood
Distance metrics compute numeric summaries of differences between vectors by applying mathematical formulas. Euclidean sums squared coordinate differences and takes a square root, reflecting geometric distance. Manhattan sums absolute differences, reflecting grid travel. Cosine computes the dot product normalized by vector lengths, capturing angular difference. Internally, these use vectorized operations for speed.
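The three formulas just described can be written directly as vectorized NumPy expressions; each agrees with the corresponding scipy function. The example vectors are arbitrary.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Vectorized versions of the three formulas
euclid = np.sqrt(np.sum((a - b) ** 2))
manhat = np.sum(np.abs(a - b))
cos_d = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Each matches the scipy implementation
print(np.isclose(euclid, distance.euclidean(a, b)))  # True
print(np.isclose(manhat, distance.cityblock(a, b)))  # True
print(np.isclose(cos_d, distance.cosine(a, b)))      # True
```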
Why designed this way?
These metrics were designed to capture different notions of similarity relevant to various fields: Euclidean for geometry, Manhattan for urban planning and grid layouts, and Cosine for text analysis and directional data. Alternatives exist but these balance simplicity, interpretability, and computational efficiency.
Distance Metrics Internal Flow

Input Vectors A, B
   │
   ├─> Euclidean: sum((A_i - B_i)**2) → sqrt → Distance
   │
   ├─> Manhattan: sum(|A_i - B_i|) → Distance
   │
   └─> Cosine: dot(A,B) / (|A||B|) → 1 - similarity → Distance
Myth Busters - 3 Common Misconceptions
Quick: Is cosine distance affected by how long the vectors are? Commit yes or no.
Common Belief: Cosine distance measures how far apart two points are in space, just like Euclidean distance.
Reality: Cosine distance measures the angle between vectors, ignoring their length, so two points far apart but in the same direction have zero cosine distance.
Why it matters: Confusing cosine with Euclidean distance can cause wrong similarity judgments, especially in text or high-dimensional data.
Quick: Can Manhattan distance ever be smaller than Euclidean distance? Commit yes or no.
Common Belief: Manhattan distance is always smaller than or equal to Euclidean distance.
Reality: Manhattan distance is always greater than or equal to Euclidean distance: for per-axis differences a and b, |a| + |b| >= sqrt(a**2 + b**2), with equality only when the points differ along a single axis.
Why it matters: Misunderstanding this can lead to wrong assumptions about data spread and clustering tightness.
Quick: Does scaling data affect all distance metrics equally? Commit yes or no.
Common Belief: Scaling data does not affect distance metrics since they measure relative differences.
Reality: Scaling changes Euclidean and Manhattan distances but not cosine distance, which is invariant to uniformly rescaling each vector (per-feature rescaling, by contrast, changes the angles and hence cosine distance too).
Why it matters: Ignoring scaling can cause misleading distances and poor model performance.
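A quick check of the scaling claim with arbitrary example vectors: multiplying both vectors by 10 scales Euclidean and Manhattan distances by 10 but leaves cosine distance unchanged.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

# Euclidean distance grows with the data...
print(distance.euclidean(a, b), distance.euclidean(10 * a, 10 * b))
# ...and so does Manhattan (cityblock)...
print(distance.cityblock(a, b), distance.cityblock(10 * a, 10 * b))
# ...but cosine distance is unchanged
print(np.isclose(distance.cosine(a, b), distance.cosine(10 * a, 10 * b)))  # True
```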
Expert Zone
1
Cosine distance is sensitive to zero vectors and requires careful handling of empty or zero-length data points.
2
Manhattan distance can be more robust to outliers than Euclidean because it does not square differences, reducing the impact of large deviations.
3
Euclidean distance assumes isotropic space; if features have different importance or scale, weighting or normalization is necessary.
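On the last point: scipy's distance functions accept an optional per-dimension weight vector `w`, which is one way to express that features differ in importance. The points and weights below are illustrative.

```python
from scipy.spatial import distance

p = [1.0, 2.0]
q = [4.0, 6.0]

# Unweighted: sqrt(3**2 + 4**2) = 5.0
print(distance.euclidean(p, q))

# Per-dimension weights via `w`: sqrt(1.0 * 9 + 0.25 * 16) = sqrt(13)
print(distance.euclidean(p, q, w=[1.0, 0.25]))  # about 3.606
```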
When NOT to use
Avoid Euclidean distance on data with different scales or non-continuous features; use normalized or weighted distances instead. Do not use cosine distance on data where magnitude matters, such as physical measurements. Manhattan distance is less suitable for continuous smooth spaces where diagonal movement is possible.
Production Patterns
In production, Euclidean distance is common in image and sensor data analysis. Cosine distance is widely used in text mining and recommendation systems. Manhattan distance appears in grid-based pathfinding and some clustering algorithms. Often, distance computations are optimized with vectorized libraries like scipy and combined with dimensionality reduction for efficiency.
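In practice, pairwise distances over whole datasets are computed in one vectorized call rather than point by point; a minimal sketch with `scipy.spatial.distance.cdist` and two tiny made-up point sets:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small point sets; cdist computes every pairwise distance at once
X = np.array([[1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0, 1.0], [2.0, 2.0]])

print(cdist(X, Y, metric='euclidean'))  # 2x2 matrix: X rows vs Y rows
print(cdist(X, Y, metric='cityblock'))
print(cdist(X, Y, metric='cosine'))
```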
Connections
Clustering algorithms
Distance metrics are the foundation for grouping similar data points in clustering.
Understanding distance metrics helps grasp how clusters form and why different algorithms behave differently.
Vector space model in information retrieval
Cosine distance is directly used to measure document similarity in vector space models.
Knowing cosine distance clarifies how search engines rank documents by relevance.
Urban planning and navigation
Manhattan distance models real-world travel in grid-like city streets.
Recognizing this connection shows how abstract math models practical movement constraints.
Common Pitfalls
#1: Using Euclidean distance on unscaled data with different units.
Wrong approach:

from scipy.spatial import distance

point1 = [1, 1000]
point2 = [2, 2000]
print(distance.euclidean(point1, point2))

Correct approach:

from sklearn.preprocessing import StandardScaler
from scipy.spatial import distance

scaler = StandardScaler()
data = scaler.fit_transform([[1, 1000], [2, 2000]])
print(distance.euclidean(data[0], data[1]))

Root cause: Different feature scales cause one dimension to dominate Euclidean distance, misleading similarity.
#2: Using cosine distance on zero vectors without handling.
Wrong approach:

from scipy.spatial import distance

point1 = [0, 0, 0]
point2 = [1, 2, 3]
print(distance.cosine(point1, point2))

Correct approach:

import numpy as np
from scipy.spatial import distance

point1 = np.array([0, 0, 0])
point2 = np.array([1, 2, 3])

if np.linalg.norm(point1) == 0 or np.linalg.norm(point2) == 0:
    print('Cosine distance undefined for zero vector')
else:
    print(distance.cosine(point1, point2))

Root cause: Cosine distance requires non-zero vectors; a zero vector causes division by zero.
#3: Confusing Manhattan and Euclidean distance formulas.
Wrong approach:

point1 = [1, 2]
point2 = [4, 6]
# Wrong: this is the Euclidean formula, not Manhattan
manhattan = ((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2)**0.5
print(manhattan)

Correct approach:

from scipy.spatial import distance

point1 = [1, 2]
point2 = [4, 6]
manhattan = abs(point1[0] - point2[0]) + abs(point1[1] - point2[1])
print(manhattan)
# Or let scipy do it:
print(distance.cityblock(point1, point2))

Root cause: Mixing formulas leads to incorrect distance values and wrong interpretations.
Key Takeaways
Distance metrics convert data points into numbers that express similarity or difference.
Euclidean distance measures straight-line distance and works best with continuous, scaled data.
Manhattan distance sums absolute differences and models grid-like movement or robust comparisons.
Cosine distance measures angle between vectors, focusing on direction rather than magnitude.
Choosing the right distance metric and preprocessing data properly is crucial for meaningful analysis.