
How to Sample Data from DataFrame in pandas: Simple Guide

Use the sample() method on a pandas DataFrame to select rows at random. Specify the number of rows with n or the fraction of rows with frac, but not both. It is a quick way to draw a random subset of your data.
📐

Syntax

The basic syntax of the sample() method is:

  • df.sample(n=None, frac=None, replace=False, random_state=None)

Where:

  • n: Number of rows to return (integer).
  • frac: Fraction of rows to return (float, usually between 0 and 1; values above 1 require replace=True).
  • replace: Whether to sample with replacement (True or False).
  • random_state: Seed for reproducibility (an integer, a NumPy random state or generator object, or None).
python
df.sample(n=5, frac=None, replace=False, random_state=None)
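To make the role of random_state concrete, here is a minimal sketch. The DataFrame and its column name value are made up for illustration. It shows that two calls with the same seed return identical rows, and that frac=1 returns every row in shuffled order.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"value": range(10)})

# The same seed gives the same rows in the same order
first = df.sample(n=4, random_state=7)
second = df.sample(n=4, random_state=7)
print(first.equals(second))

# frac=1 keeps every row but shuffles the order
shuffled = df.sample(frac=1, random_state=7)
print(len(shuffled))
```

Without random_state, each call draws a fresh sample, so results usually differ from run to run.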
💻

Example

This example shows how to sample 3 random rows from a DataFrame and how to sample 50% of the rows. Because the DataFrame has 5 rows, frac=0.5 returns 2 rows here.

python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['NY', 'LA', 'Chicago', 'Houston', 'Phoenix']}

df = pd.DataFrame(data)

# Sample 3 random rows
sample_n = df.sample(n=3, random_state=1)

# Sample 50% of rows
sample_frac = df.sample(frac=0.5, random_state=1)

print('Sample 3 rows:')
print(sample_n)
print('\nSample 50% rows:')
print(sample_frac)
Output
Sample 3 rows:
      Name  Age     City
2  Charlie   35  Chicago
0    Alice   25       NY
3    David   40  Houston

Sample 50% rows:
      Name  Age     City
2  Charlie   35  Chicago
0    Alice   25       NY
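Sampled rows keep the index labels of the original DataFrame, which is why the index in a sample is not sequential. If you prefer a fresh 0-based index, one option is to call reset_index(drop=True) on the result. The sketch below assumes the same kind of data as the example; the variable names sampled and renumbered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
})

# The sample keeps the original index labels
sampled = df.sample(n=3, random_state=1)

# Replace them with a fresh 0..n-1 index, discarding the old labels
renumbered = sampled.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2]
```

Recent pandas versions (1.3 and later) also accept ignore_index=True in sample() to the same effect.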
⚠️

Common Pitfalls

Common mistakes when sampling data include:

  • Using both n and frac at the same time, which raises a ValueError.
  • Not setting random_state when you want reproducible results.
  • Sampling more rows than exist without replace=True, which raises a ValueError.
python
import pandas as pd

df = pd.DataFrame({'A': range(5)})

# Wrong: using both n and frac
# df.sample(n=2, frac=0.5)  # This will raise ValueError

# Correct: use only one
sample_correct = df.sample(n=2, random_state=42)

# Wrong: sampling more rows than exist without replacement
# df.sample(n=10)  # Raises ValueError

# Correct: use replace=True to allow duplicates
sample_replace = df.sample(n=10, replace=True, random_state=42)

print('Sample with n=2:')
print(sample_correct)
print('\nSample with replacement (n=10):')
print(sample_replace)
Output
Sample with n=2:
   A
1  1
4  4

Sample with replacement (n=10):
   A
1  1
4  4
1  1
1  1
2  2
4  4
1  1
2  2
4  4
2  2
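To see the two error cases without stopping your script, you can wrap the calls in try/except. This is a small sketch; the labels and the errors list are illustrative names, and the exact error messages depend on your pandas version, so they are printed rather than assumed.

```python
import pandas as pd

df = pd.DataFrame({"A": range(5)})

# Each entry pairs a description with arguments that sample() rejects
bad_calls = [
    ("n and frac together", {"n": 2, "frac": 0.5}),
    ("n larger than the DataFrame", {"n": 10}),
]

errors = []
for label, kwargs in bad_calls:
    try:
        df.sample(**kwargs)
    except ValueError as exc:
        errors.append(label)
        print(f"{label}: {exc}")
```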
📊

Quick Reference

Parameter       Description                   Example
n               Number of rows to sample      df.sample(n=5)
frac            Fraction of rows to sample    df.sample(frac=0.3)
replace         Sample with replacement       df.sample(n=10, replace=True)
random_state    Seed for reproducibility      df.sample(n=3, random_state=42)

Key Takeaways

  • Use df.sample() to randomly select rows from a DataFrame.
  • Specify either n (number) or frac (fraction) but not both.
  • Set random_state for reproducible sampling results.
  • Use replace=True to sample with replacement when needed.
  • Sampling helps create smaller, random subsets for analysis or testing.
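As one illustration of using a random subset for testing, the sketch below holds out 20% of the rows by sampling 80% and dropping those index labels from the original. The column names feature and label are hypothetical, and the approach assumes the DataFrame index has no duplicate labels.

```python
import pandas as pd

# Hypothetical data with a unique default index
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Draw 80% of the rows at random
train = df.sample(frac=0.8, random_state=0)

# The remaining rows are those whose index labels were not sampled
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```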