Data validation checks help us find mistakes or problems in data before using it. This keeps our results correct and trustworthy.
Data validation checks in Pandas
Introduction
Use data validation checks in situations like these:
When you get new data and want to make sure it looks right.
Before analyzing data to catch missing or wrong values.
When combining data from different sources to check consistency.
To confirm data types are correct for calculations.
To find duplicates or unexpected values in a dataset.
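For instance, a quick pre-analysis check can confirm a column's dtype before doing arithmetic on it. This is a minimal sketch with a hypothetical `price` column that arrived as strings:

```python
import pandas as pd

# Hypothetical data where numeric values were loaded as strings
df = pd.DataFrame({'price': ['10.5', '20.0', '15.25']})

# An object dtype means strings slipped in; convert before calculating
if df['price'].dtype == object:
    df['price'] = df['price'].astype(float)

print(df['price'].sum())  # prints 45.75
```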
Syntax
```python
import pandas as pd

# Check for missing values
missing = df.isnull()

# Check data types
types = df.dtypes

# Check for duplicates
duplicates = df.duplicated()

# Check if values meet a condition
condition_check = df['column_name'] > 0
```
df.isnull() returns True where data is missing.
df.dtypes returns the data type of each column.
df.duplicated() marks duplicate rows as True.
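A minimal sketch of the boolean outputs these checks return, using a small hypothetical DataFrame with one missing value and one duplicate row:

```python
import pandas as pd

# Hypothetical DataFrame: row 1 repeats row 0, and 'a' has one missing value
df = pd.DataFrame({'a': [1, 1, None], 'b': ['x', 'x', 'y']})

# isnull() returns a boolean DataFrame: True wherever a value is missing
print(df.isnull()['a'].tolist())  # [False, False, True]

# duplicated() returns a boolean Series: True for rows that repeat an earlier row
print(df.duplicated().tolist())   # [False, True, False]
```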
Examples
Counts how many missing values are in each column.

```python
df.isnull().sum()
```

Shows the data type of the 'age' column.

```python
df['age'].dtype
```

Counts how many duplicate rows are in the DataFrame.

```python
df.duplicated().sum()
```

Checks if all 'salary' values are between 30,000 and 150,000.

```python
df['salary'].between(30000, 150000).all()
```
Sample Program
This program creates a small table with some missing and duplicate data. It then checks for missing values, data types, and duplicates, and tests whether all ages are positive (the missing age makes that check return False, because comparisons with NaN are False).
```python
import pandas as pd

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob'],
    'age': [25, 30, None, 22, 30],
    'salary': [50000, 60000, 70000, None, 60000]
}
df = pd.DataFrame(data)

# Check for missing values
missing_counts = df.isnull().sum()

# Check data types
data_types = df.dtypes

# Check for duplicates
duplicate_rows = df.duplicated().sum()

# Check if all ages are positive
# (comparisons with NaN are False, so the missing age makes this False)
ages_positive = (df['age'] > 0).all()

print('Missing values per column:')
print(missing_counts)
print('\nData types:')
print(data_types)
print(f'\nNumber of duplicate rows: {duplicate_rows}')
print(f'\nAre all ages positive? {ages_positive}')
```
Important Notes
Missing values can cause errors in calculations if not handled.
Duplicates might skew your analysis and should be reviewed.
Always check data types to ensure correct operations.
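A short sketch of how these issues might be handled once found (the fill strategy and sample data here are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical data with a missing age and one exact duplicate row
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Bob', 'Cara'],
    'age': [25, 30, 30, None],
})

# Fill the missing age with the column median so calculations don't fail
df['age'] = df['age'].fillna(df['age'].median())

# Drop exact duplicate rows after reviewing them
df = df.drop_duplicates()

print(df['age'].tolist())  # [25.0, 30.0, 30.0]
```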
Summary
Data validation checks help find missing, wrong, or duplicate data.
Use pandas functions like isnull(), duplicated(), and dtypes to check data.
Validating data early keeps your analysis accurate and reliable.