
Data validation checks in Pandas

Introduction

Data validation checks help us find mistakes or problems in data before using it. This keeps our results correct and trustworthy.

Typical situations where validation checks are useful:

When you get new data and want to make sure it looks right.
Before analyzing data, to catch missing or wrong values.
When combining data from different sources, to check consistency.
To confirm data types are correct for calculations.
To find duplicates or unexpected values in a dataset.
Syntax
Pandas
import pandas as pd

# Check for missing values
missing = df.isnull()

# Check data types
types = df.dtypes

# Check for duplicates
duplicates = df.duplicated()

# Check if values meet a condition
condition_check = df['column_name'] > 0

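df.isnull() returns a DataFrame of booleans with the same shape as df, with True wherever a value is missing.

df.duplicated() returns one boolean per row. By default, the first occurrence of a repeated row is False and each later repeat is True.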
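As a minimal sketch of what these two calls return, assuming a small made-up DataFrame (the column names `city` and `temp` are illustrative only), something like the following could be used:

```python
import pandas as pd

# Hypothetical example data: one missing value and one repeated row
df = pd.DataFrame({
    'city': ['Oslo', 'Lima', 'Oslo'],
    'temp': [3.5, None, 3.5],
})

# Boolean DataFrame with the same shape as df; True marks a missing cell
missing_mask = df.isnull()

# Boolean Series with one entry per row; by default the first occurrence
# of a repeated row is False and later repeats are True
dup_default = df.duplicated()

# keep=False flags every copy of a repeated row, including the first
dup_all = df.duplicated(keep=False)

print(missing_mask)
print(dup_default.tolist())  # [False, False, True]
print(dup_all.tolist())      # [True, False, True]
```

Passing keep=False is handy when you want to look at all copies of a repeated row side by side before deciding which one to keep.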

Examples
Counts how many missing values are in each column.
Pandas
df.isnull().sum()
Shows the data type of the 'age' column.
Pandas
df['age'].dtype
Counts how many rows repeat an earlier row in the DataFrame.
Pandas
df.duplicated().sum()
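Checks if all 'salary' values are between 30,000 and 150,000 (both ends included). A missing value counts as out of range, so the result is False if any salary is missing.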
Pandas
df['salary'].between(30000, 150000).all()
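The range check can be surprising when the column has missing values. The sketch below, which assumes a made-up `salary` Series, shows one possible way to separate "missing" from "present but out of range":

```python
import pandas as pd

# Hypothetical salaries: one below the range, one missing
salary = pd.Series([30000, 45000, 20000, None, 150000])

# between() is inclusive on both ends by default; NaN evaluates to False
in_range = salary.between(30000, 150000)

# Values that are present but outside the range
out_of_range = salary[salary.notna() & ~in_range]

# Range check applied only to the values that are present
all_present_in_range = salary.dropna().between(30000, 150000).all()

print(in_range.tolist())      # [True, True, False, False, True]
print(out_of_range.tolist())  # [20000.0]
print(all_present_in_range)   # False
```

Whether a missing value should fail a range check depends on your data; counting missing values separately with isnull().sum() keeps the two questions apart.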
Sample Program

This program creates a small table with some missing and duplicate data. It then checks for missing values, data types, duplicates, and if all ages are positive numbers.

Pandas
import pandas as pd

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob'],
    'age': [25, 30, None, 22, 30],
    'salary': [50000, 60000, 70000, None, 60000]
}

df = pd.DataFrame(data)

# Check for missing values
missing_counts = df.isnull().sum()

# Check data types
data_types = df.dtypes

# Check for duplicates
duplicate_rows = df.duplicated().sum()

# Check if all non-missing ages are positive
# (NaN > 0 evaluates to False, so drop missing values first)
ages_positive = (df['age'].dropna() > 0).all()

print('Missing values per column:')
print(missing_counts)
print('\nData types:')
print(data_types)
print(f'\nNumber of duplicate rows: {duplicate_rows}')
print(f'\nAre all ages positive? {ages_positive}')
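If you run the same checks on several datasets, it may help to collect them in one place. The following is only a sketch under stated assumptions: `validation_report` and its `positive_columns` parameter are hypothetical names made up for illustration, not part of pandas.

```python
import pandas as pd

def validation_report(df, positive_columns=()):
    """Collect basic validation results for df in a plain dictionary.

    Hypothetical helper for illustration; not part of pandas.
    """
    report = {
        'missing_per_column': df.isnull().sum().to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
        'dtypes': df.dtypes.astype(str).to_dict(),
    }
    for col in positive_columns:
        # Drop missing values first: NaN > 0 evaluates to False
        present = df[col].dropna()
        report[f'{col}_all_positive'] = bool((present > 0).all())
    return report

# Same sample data as the program above
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob'],
    'age': [25, 30, None, 22, 30],
    'salary': [50000, 60000, 70000, None, 60000],
})

report = validation_report(df, positive_columns=['age'])
for key, value in report.items():
    print(f'{key}: {value}')
```

Returning a dictionary rather than printing inside the function makes the results easy to log or compare between datasets.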
Important Notes

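Missing values can distort calculations if not handled: pandas aggregations such as sum() and mean() skip them by default, and comparisons such as > 0 evaluate to False for them.

Duplicates might skew your analysis and should be reviewed before you decide to keep or drop them.

Always check data types: a numeric column stored as text will not support arithmetic the way you expect.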
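One possible way to act on the data type note is sketched below. It assumes a made-up `age` column that arrived as text with one unparseable entry, and uses pd.to_numeric with errors='coerce' to locate the values that block a numeric conversion:

```python
import pandas as pd

# Hypothetical column that should be numeric but arrived as text
df = pd.DataFrame({'age': ['25', '30', 'unknown', '22']})

# True only if pandas stores the column with a numeric dtype
is_numeric = pd.api.types.is_numeric_dtype(df['age'])

# errors='coerce' turns values that cannot be parsed into NaN
converted = pd.to_numeric(df['age'], errors='coerce')

# Entries that were present in the original but failed to convert
bad_values = df.loc[converted.isna() & df['age'].notna(), 'age']

print(is_numeric)           # False
print(bad_values.tolist())  # ['unknown']
print(converted.mean())
```

Inspecting `bad_values` before replacing the column lets you decide whether the offending entries should become missing values or be corrected at the source.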

Summary

Data validation checks help find missing, wrong, or duplicate data.

Use pandas methods like isnull() and duplicated(), and the dtypes attribute, to check data.

Validating data early keeps your analysis accurate and reliable.