Overview - Scatter plots

What is it?

A scatter plot is a simple graph that shows how two sets of numbers relate to each other. Each point on the plot represents one pair of values, one from each set. It helps us see patterns, trends, or clusters in data by placing points on a two-dimensional grid. Scatter plots are often used to explore relationships between variables.

Why it matters

Without scatter plots, it would be hard to quickly see whether two variables move together. They help people spot trends or unusual points that might need more attention. For example, a business can see whether sales tend to grow with advertising spend or whether there is no clear link. This visual insight saves time and guides better decisions.

Where it fits

Before learning scatter plots, you should understand basic data structures like tables and columns, and how to use pandas to handle data. After mastering scatter plots, you can move on to related visualizations such as line charts, histograms, and regression plots to analyze data further.

Mental Model

Core Idea

A scatter plot places pairs of numbers as dots on a grid to reveal how they move together or apart.

Think of it like...

Imagine throwing small balls onto a flat table where the position of each ball shows two measurements, like height and weight of people. How the balls spread out tells you if taller people tend to weigh more or less.

  Y-axis (Variable 2)
    ↑
    │       ●     ●
    │    ●     ●
    │  ●  ●
    │
    └────────────────→ X-axis (Variable 1)
       (Variable 1 values increase left to right)

Build-Up - 6 Steps

1

Foundation: Understanding data points in pairs

Concept: Scatter plots use pairs of numbers to create points on a graph.

Each point on a scatter plot comes from two numbers: one for the horizontal position (x) and one for the vertical position (y). For example, if you have a list of ages and heights, each person's age and height form one point.

Result

You get a set of points on a grid, each showing one pair of values.

Understanding that each point represents two linked values is the base for seeing relationships visually.
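As a minimal sketch of this pairing, assuming a small made-up table (the column names `age` and `height` and their values are illustrative, not taken from a real dataset), each row of a pandas DataFrame supplies one (x, y) pair:

```python
import pandas as pd

# Hypothetical sample data: one row per person.
df = pd.DataFrame({
    "age": [5, 8, 12, 15],
    "height": [110, 128, 150, 168],
})

# Each row contributes one (x, y) pair, which becomes one point on the plot.
points = list(zip(df["age"].tolist(), df["height"].tolist()))
print(points)
```

The number of pairs always equals the number of rows, which is why the two columns must line up row by row.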

2

Foundation: Creating a basic scatter plot with pandas

3

Intermediate: Adding color and size to points

4

Intermediate: Interpreting scatter plot patterns

5

Advanced: Handling overlapping points and transparency

6

Expert: Using scatter plots for outlier detection

Under the Hood

Scatter plots map each pair of values to coordinates on a two-dimensional plane. The plotting library translates data values into pixel positions based on axis scales. When color or size is added, these attributes are mapped to color gradients or size scales. Transparency blends overlapping points by adjusting pixel opacity, revealing density.
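The mapping from data values to pixel positions, and the way transparency accumulates, can be illustrated with a simplified model. The helper names `to_pixel` and `stacked_opacity` and the axis dimensions below are hypothetical; real plotting libraries handle this internally with more general transforms, so treat this as a sketch of the idea rather than any library's implementation.

```python
def to_pixel(value, data_min, data_max, pixel_min, pixel_max):
    """Linearly map a data value onto a pixel range (simplified model)."""
    fraction = (value - data_min) / (data_max - data_min)
    return pixel_min + fraction * (pixel_max - pixel_min)


def stacked_opacity(alpha, n_points):
    """Combined opacity of n identical overlapping points, each drawn with alpha."""
    return 1 - (1 - alpha) ** n_points


# Hypothetical axis: ages 0 to 100 drawn across pixels 50 to 550.
x_pixel = to_pixel(25, 0, 100, 50, 550)

# Overlapping semi-transparent points build up opacity, which is what reveals density.
single = stacked_opacity(0.5, 1)
triple = stacked_opacity(0.5, 3)
print(x_pixel, single, triple)
```

Under this model, three stacked points at alpha 0.5 are noticeably more opaque than one, so dense regions appear darker.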

Why designed this way?

Scatter plots were designed to visually represent relationships between two variables simply and intuitively. Adding color and size extends this to multidimensional data without losing clarity. Transparency solves the problem of overlapping points hiding data. These design choices balance simplicity with expressive power.

DataFrame columns
    │
    ├─> x values ──────────────┐
    ├─> y values ──────────────┤
    ├─> color values (optional) │─> Plotting engine ──> Scatter plot image
    └─> size values (optional) ─┘
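The flow in the diagram above can be sketched with pandas as follows. The data and column names are made up for illustration, the division by 500 is an arbitrary scaling choice, and the `Agg` backend line is an assumption that lets the sketch run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61],
    "height": [170, 165, 180, 175, 168],
    "weight": [65, 72, 88, 80, 70],
    "income": [30000, 52000, 75000, 64000, 41000],
})

# Marker size is an area in points squared, so large raw values are scaled down first.
marker_sizes = df["income"] / 500

ax = df.plot.scatter(
    x="age",             # x values
    y="height",          # y values
    c="weight",          # color values, mapped through the colormap
    s=marker_sizes,      # size values
    colormap="viridis",
)
```

Each of the four columns feeds one visual channel, so the resulting plot has one point per row.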

Myth Busters - 4 Common Misconceptions

Quick: does a scatter plot always prove one variable causes the other? Commit yes or no.

Common Belief: Scatter plots show cause and effect between variables.

Quick: do you think overlapping points mean fewer data points? Commit yes or no.

Common Belief: If points overlap, there must be fewer data points in that area.

Quick: does adding color and size always make scatter plots easier to read? Commit yes or no.

Common Belief: More colors and sizes always improve scatter plot clarity.

Quick: do you think outliers always stand out clearly on scatter plots? Commit yes or no.

Common Belief: Outliers always appear as isolated points far from others.

Expert Zone

1

Scatter plots can be combined with regression lines or smoothing curves to better understand relationships.

2

Choosing the right scale (linear vs logarithmic) for axes can reveal patterns hidden in raw data.

3

Using interactive scatter plots allows zooming and tooltips, improving exploration of large datasets.
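The first two points can be combined in one small sketch: a logarithmic x-axis plus a least-squares line fitted with `numpy.polyfit`. The dataset and column names (`population`, `stores`) are hypothetical, and fitting against log10 of x is one possible choice for illustration, not a prescribed method.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import numpy as np
import pandas as pd

# Hypothetical data where x spans several orders of magnitude.
df = pd.DataFrame({
    "population": [1e3, 1e4, 1e5, 1e6, 1e7],
    "stores": [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Fit a straight line to y against log10(x).
log_x = np.log10(df["population"])
slope, intercept = np.polyfit(log_x, df["stores"], deg=1)

ax = df.plot.scatter(x="population", y="stores")
ax.set_xscale("log")  # a log axis spreads out points that a linear axis would crowd together

# Overlay the fitted line on the same axes.
xs = np.logspace(3, 7, 50)
ax.plot(xs, slope * np.log10(xs) + intercept, color="red", label="least-squares fit")
ax.legend()
```

On a linear x-axis most of these points would be squeezed against the left edge; the log scale makes the roughly straight trend visible.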

When NOT to use

Scatter plots are not suitable for categorical data or when you have more than four variables to visualize simultaneously. Alternatives include bar charts for categories, heatmaps for large matrices, or dimensionality reduction techniques like PCA for many variables.

Production Patterns

In real-world data science, scatter plots are used for exploratory data analysis, quality checks, and communicating findings. They often appear in dashboards with filters and interactivity. Analysts combine scatter plots with statistical tests to confirm observed patterns.

Connections

Correlation coefficient

Scatter plots visually show relationships that correlation coefficients measure numerically.

Understanding scatter plots helps grasp what correlation numbers mean and when they might be misleading.
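A short sketch of how the number and the picture can disagree, using made-up data: `Series.corr` computes the Pearson correlation by default, which only measures linear association, so a clear curved pattern can still score near zero.

```python
import pandas as pd

# Hypothetical datasets: one linear relationship, one symmetric curve (y = x squared).
linear = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
curved = pd.DataFrame({"x": [-2, -1, 0, 1, 2], "y": [4, 1, 0, 1, 4]})

# Pearson correlation measures only the linear part of a relationship.
r_linear = linear["x"].corr(linear["y"])
r_curved = curved["x"].corr(curved["y"])
print(round(r_linear, 3), round(r_curved, 3))
```

A scatter plot of the second dataset would show an obvious U shape even though its correlation coefficient is essentially zero.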

Dimensionality reduction

Scatter plots display two or three dimensions, while dimensionality reduction techniques reduce many variables to two or three for plotting.

Knowing scatter plots prepares you to interpret complex data visualizations from dimensionality reduction.
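As a rough sketch of that workflow, the example below projects five made-up variables onto two principal components with a plain NumPy SVD and then draws the result as an ordinary scatter plot. The random data, the column names `pc1` and `pc2`, and the hand-rolled PCA are all assumptions for illustration; a dedicated library would normally be used in practice.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import numpy as np
import pandas as pd

# Hypothetical dataset: 100 rows, 5 numeric variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))

# PCA via SVD: center the data, then project onto the top two principal directions.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T

reduced = pd.DataFrame(projected, columns=["pc1", "pc2"])
ax = reduced.plot.scatter(x="pc1", y="pc2", alpha=0.6)
```

The axes of such a plot are combinations of the original variables, so distances and clusters are meaningful but the axis values themselves have no direct unit.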

Astronomy star maps

Both scatter plots and star maps plot points in space to reveal patterns and clusters.

Recognizing this connection shows how visualizing points helps understand complex systems across fields.

Common Pitfalls

#1 Plotting data without checking for missing or invalid values.

Wrong approach: df.plot.scatter(x='age', y='height') # without cleaning data

Correct approach: df_clean = df.dropna(subset=['age', 'height']); df_clean.plot.scatter(x='age', y='height')

Root cause: Rows with missing values are typically left off the plot without warning, and invalid values can distort it; cleaning first makes it explicit which rows are shown.
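A slightly fuller sketch of the cleaning step, assuming a made-up table in which `age` contains both a missing value and a non-numeric entry. Coercing with `pd.to_numeric` and counting dropped rows are suggestions added here, not steps from the original example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import numpy as np
import pandas as pd

# Hypothetical data with a missing value and an invalid entry.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 44, "unknown"],
    "height": [172, np.nan, 160, 181, 175],
})

# Turn non-numeric entries such as "unknown" into NaN so they can be dropped too.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Keep only rows where both plotted columns are present.
df_clean = df.dropna(subset=["age", "height"])
dropped = len(df) - len(df_clean)
print(f"Dropped {dropped} of {len(df)} rows before plotting")

ax = df_clean.plot.scatter(x="age", y="height")
```

Reporting how many rows were removed keeps the plot honest about how much of the data it actually shows.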

#2 Using default point size and color for all points regardless of data meaning.

Wrong approach: df.plot.scatter(x='age', y='height') # all points look the same

Correct approach: df.plot.scatter(x='age', y='height', c='weight', s='income', alpha=0.6) # s is marker area in points squared, so scale large columns such as income first

Root cause: Ignoring extra variables misses opportunities to reveal richer data insights.

#3 Setting alpha=1 (fully opaque) when many points overlap.

Wrong approach: df.plot.scatter(x='age', y='height', alpha=1)

Correct approach: df.plot.scatter(x='age', y='height', alpha=0.5)

Root cause: Opaque points hide data density, making plots less informative.
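To see the effect on a crowded plot, the sketch below generates 2,000 made-up points and draws them with a low alpha. The distribution parameters, the seed, and the specific values `alpha=0.2` and `s=10` are arbitrary choices for illustration; a suitable alpha depends on how many points overlap.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import numpy as np
import pandas as pd

# Hypothetical dense dataset: many points clustered around the same center.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(40, 5, 2000),
    "height": rng.normal(170, 8, 2000),
})

# Low alpha lets stacked points accumulate, so the dense core appears darker than the edges.
ax = df.plot.scatter(x="age", y="height", alpha=0.2, s=10)
```

With `alpha=1` the same data would render as a solid blob, hiding the fact that most points sit near the center.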

Key Takeaways

Scatter plots show pairs of values as points on a grid to reveal relationships visually.

Using color, size, and transparency adds extra layers of information to scatter plots.

Patterns in scatter plots help identify correlations, clusters, and outliers in data.

Scatter plots do not prove cause and effect; they only show associations.

Proper data cleaning and thoughtful design choices make scatter plots powerful tools for data exploration.