How to Sample Data from a DataFrame in pandas: Simple Guide
Use the sample() method on a pandas DataFrame to randomly select rows. You can specify the number of rows with n or the fraction of rows with frac, which makes it easy to draw a random subset of your data.
Syntax
The basic syntax of the sample() method is:
df.sample(n=None, frac=None, replace=False, random_state=None)
Where:
- n: Number of rows to return (integer).
- frac: Fraction of rows to return (float, normally between 0 and 1; values above 1 require replace=True).
- replace: Whether to sample with replacement (True or False).
- random_state: Seed for reproducibility (integer or None).
python
df.sample(n=5, frac=None, replace=False, random_state=None)
Example
This example shows how to sample 3 random rows from a DataFrame and how to sample 50% of the rows.
python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['NY', 'LA', 'Chicago', 'Houston', 'Phoenix'],
}
df = pd.DataFrame(data)

# Sample 3 random rows
sample_n = df.sample(n=3, random_state=1)

# Sample 50% of rows
sample_frac = df.sample(frac=0.5, random_state=1)

print('Sample 3 rows:')
print(sample_n)
print('\nSample 50% rows:')
print(sample_frac)
Output
Sample 3 rows:
      Name  Age     City
2  Charlie   35  Chicago
0    Alice   25       NY
3    David   40  Houston

Sample 50% rows:
      Name  Age     City
2  Charlie   35  Chicago
0    Alice   25       NY
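The sampled rows keep the index labels they had in the original DataFrame, which is why the index in the output is not sequential. The following is a minimal sketch of renumbering a sample with reset_index(drop=True); the DataFrame and the variable names sampled and renumbered are made up for illustration.

```python
import pandas as pd

# Illustrative data, similar in shape to the example above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
})

# The sample carries over index labels from df
sampled = df.sample(n=3, random_state=1)

# Discard the old labels and renumber the rows from 0
renumbered = sampled.reset_index(drop=True)

print(list(renumbered.index))  # [0, 1, 2]
```

Keeping the original labels is useful when you need to trace sampled rows back to the source data; renumbering is mostly a convenience for display or positional access.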
Common Pitfalls
Common mistakes when sampling data include:
- Using both n and frac at the same time, which causes an error.
- Not setting random_state when you want reproducible results.
- Sampling more rows than exist without replace=True, which causes an error.
python
import pandas as pd

df = pd.DataFrame({'A': range(5)})

# Wrong: using both n and frac
# df.sample(n=2, frac=0.5)  # This will raise ValueError

# Correct: use only one
sample_correct = df.sample(n=2, random_state=42)

# Wrong: sampling more rows than exist without replacement
# df.sample(n=10)  # Raises ValueError

# Correct: use replace=True to allow duplicates
sample_replace = df.sample(n=10, replace=True, random_state=42)

print('Sample with n=2:')
print(sample_correct)
print('\nSample with replacement (n=10):')
print(sample_replace)
Output
Sample with n=2:
   A
1  1
4  4

Sample with replacement (n=10):
   A
1  1
4  4
1  1
1  1
2  2
4  4
1  1
2  2
4  4
2  2
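The pitfalls can also be checked directly rather than left commented out. The sketch below confirms that two calls with the same random_state return identical samples and catches the ValueError raised by the two invalid calls. The helper try_sample is a hypothetical name used only for this illustration, and the exact error messages may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'A': range(5)})

# Same seed, same rows: the two samples are identical
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
print(first.equals(second))  # True

def try_sample(**kwargs):
    """Hypothetical helper: return the sample, or None if pandas rejects the arguments."""
    try:
        return df.sample(**kwargs)
    except ValueError as exc:
        print(f'ValueError: {exc}')
        return None

# Both n and frac given: rejected
both = try_sample(n=2, frac=0.5)

# More rows requested than exist, without replacement: rejected
too_many = try_sample(n=10)
```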
Quick Reference
| Parameter | Description | Example |
|---|---|---|
| n | Number of rows to sample | df.sample(n=5) |
| frac | Fraction of rows to sample | df.sample(frac=0.3) |
| replace | Sample with replacement | df.sample(n=10, replace=True) |
| random_state | Seed for reproducibility | df.sample(n=3, random_state=42) |
Key Takeaways
- Use df.sample() to randomly select rows from a DataFrame.
- Specify either n (number) or frac (fraction) but not both.
- Set random_state for reproducible sampling results.
- Use replace=True to sample with replacement when needed.
- Sampling helps create smaller, random subsets for analysis or testing.
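As one way to use a random subset for testing, the sketch below draws a sample and then collects the rows that were not sampled by dropping the sampled index labels. It assumes the index labels are unique; the column names and the variables subset and remainder are made up for illustration.

```python
import pandas as pd

# Illustrative 10-row DataFrame with a default, unique index
df = pd.DataFrame({'id': range(10), 'score': [x * 1.5 for x in range(10)]})

# Draw 20% of the rows as a random subset
subset = df.sample(frac=0.2, random_state=7)

# Everything that was not sampled (relies on unique index labels)
remainder = df.drop(subset.index)

print(len(subset), len(remainder))  # 2 8
```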