How to Load a CSV Dataset in PyTorch: A Simple Guide
To load a CSV dataset in PyTorch, create a custom Dataset class that reads the CSV file and returns data samples. Then use DataLoader to batch and shuffle the data for training.

Syntax
Loading a CSV dataset in PyTorch involves these steps:
- Custom Dataset class: Inherit from `torch.utils.data.Dataset` and implement `__init__`, `__len__`, and `__getitem__`.
- Reading CSV: Use `pandas` or the `csv` module to load the data in `__init__`.
- DataLoader: Wrap the Dataset with `torch.utils.data.DataLoader` to enable batching and shuffling.
```python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        # Read the whole CSV into memory once
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Example layout: last column is the label, the rest are features
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Usage:
dataset = CSVDataset('data.csv')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```
Example
This example shows how to load a CSV file with numeric features and labels, then iterate batches for training.
```python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Create a sample CSV file
import csv
with open('sample.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['feat1', 'feat2', 'label'])
    writer.writerow([1.0, 2.0, 0])
    writer.writerow([3.0, 4.0, 1])
    writer.writerow([5.0, 6.0, 0])

# Load dataset and dataloader
dataset = CSVDataset('sample.csv')
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch_idx, (features, labels) in enumerate(dataloader):
    print(f"Batch {batch_idx}")
    print("Features:", features)
    print("Labels:", labels)
```
Output
Batch 0
Features: tensor([[1., 2.],
[3., 4.]])
Labels: tensor([0, 1])
Batch 1
Features: tensor([[5., 6.]])
Labels: tensor([0])
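Once batches load correctly, the same DataLoader plugs straight into an ordinary training loop. Here is a minimal sketch: the two-feature, two-class shapes match the sample CSV above, while the linear model and hyperparameters are illustrative choices, not part of the original guide.

```python
import csv

import torch
import pandas as pd
from torch import nn
from torch.utils.data import Dataset, DataLoader

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Recreate the sample CSV from the example above
with open('sample.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['feat1', 'feat2', 'label'],
                             [1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 0]])

dataloader = DataLoader(CSVDataset('sample.csv'), batch_size=2, shuffle=True)

model = nn.Linear(2, 2)  # 2 features in, 2 classes out (illustrative)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):
    for features, labels in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Because `__getitem__` already returns correctly typed tensors, the loop needs no per-batch conversion code.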
Common Pitfalls
- Not converting data to `torch.tensor` causes errors during training.
- Forgetting to set correct data types (e.g., `float32` for features, `long` for labels) can cause model errors.
- Not handling header rows in the CSV can lead to wrong data parsing.
- Using `iloc` with wrong indices or slicing can cause shape mismatches.
- Not shuffling data in `DataLoader` may reduce training effectiveness.
```python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

# Wrong: __getitem__ returns raw pandas/numpy objects, not tensors
class WrongDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = row.iloc[:-1].values  # Not a tensor
        label = row.iloc[-1]             # Not a tensor
        return features, label

# Right: convert to tensors with the correct dtypes
class RightDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label
```
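Two of the pitfalls above, header rows and missing values, are best handled at read time rather than inside `__getitem__`. A sketch, assuming a headerless file and filling NaNs with the column mean (whether mean-filling is appropriate depends on your data; the file name and column names here are illustrative):

```python
import csv

import pandas as pd

# A headerless CSV with one missing feature value
with open('raw.csv', 'w', newline='') as f:
    csv.writer(f).writerows([[1.0, 2.0, 0],
                             [3.0, '', 1],
                             [5.0, 6.0, 0]])

# header=None tells pandas the first row is data, not column names;
# names= assigns our own column labels
df = pd.read_csv('raw.csv', header=None, names=['feat1', 'feat2', 'label'])

# Fill missing feature values with the column mean (one common choice)
df['feat2'] = df['feat2'].fillna(df['feat2'].mean())
print(df)
```

If you skip `header=None` on a headerless file, pandas silently treats the first data row as column names, which is exactly the "wrong data parsing" pitfall listed above.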
Quick Reference
Tips for loading CSV datasets in PyTorch:
- Use `pandas.read_csv()` to load CSV files easily.
- Always convert data to `torch.tensor` with the correct dtypes.
- Implement `__len__` and `__getitem__` in your Dataset.
- Use `DataLoader` for batching and shuffling.
- Handle headers and missing data carefully.
Key Takeaways
- Create a custom Dataset class to load CSV data and convert rows to tensors.
- Use DataLoader to batch and shuffle data for efficient training.
- Ensure features and labels have the correct tensor types: float32 for features, long for labels.
- Handle CSV headers and missing values properly to avoid errors.
- Test your Dataset by iterating batches before training your model.
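The last takeaway, testing before training, can be as small as pulling one batch and checking shapes and dtypes. A sketch reusing the `CSVDataset` class and sample file from the example above:

```python
import csv

import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Recreate the sample CSV from the example above
with open('sample.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['feat1', 'feat2', 'label'],
                             [1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 0]])

loader = DataLoader(CSVDataset('sample.csv'), batch_size=2)
features, labels = next(iter(loader))

# Check one batch before committing to a full training run
assert features.shape == (2, 2), features.shape
assert features.dtype == torch.float32
assert labels.dtype == torch.int64  # torch.long
print("batch OK:", features.shape, labels.shape)
```

Catching a shape or dtype mismatch here takes seconds; catching it mid-training, after the model's first forward pass fails, takes much longer.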