
How to Load a CSV Dataset in PyTorch: Simple Guide

To load a CSV dataset in PyTorch, create a custom Dataset class that reads the CSV file and returns data samples. Then use DataLoader to batch and shuffle the data for training.

Syntax

Loading a CSV dataset in PyTorch involves these steps:

  • Custom Dataset class: Inherit from torch.utils.data.Dataset and implement __init__, __len__, and __getitem__.
  • Reading CSV: Use pandas or the csv module to load the data in __init__.
  • DataLoader: Wrap the Dataset with torch.utils.data.DataLoader to enable batching and shuffling.
python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Example: last column is label, rest are features
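        # .iloc makes the positional access explicit; row[-1] on a labeled
        # Series relies on deprecated positional fallback in recent pandas
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(int(row.iloc[-1]), dtype=torch.long)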
        return features, label

# Usage:
dataset = CSVDataset('data.csv')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
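
The steps above mention the csv module as an alternative to pandas. A minimal sketch of that variant follows, assuming every column is numeric, the first row is a header, and the last column is an integer class label. The class name CSVModuleDataset and the file tiny.csv are illustrative and not part of the original guide.

```python
import csv

import torch
from torch.utils.data import Dataset


class CSVModuleDataset(Dataset):
    """Hypothetical variant of CSVDataset that uses the stdlib csv module."""

    def __init__(self, csv_file):
        with open(csv_file, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row (assumed to be present)
            rows = [[float(value) for value in row] for row in reader if row]
        # Convert once up front so __getitem__ is a cheap index operation.
        data = torch.tensor(rows, dtype=torch.float32)
        self.features = data[:, :-1]
        self.labels = data[:, -1].long()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Write a tiny illustrative file so the sketch runs on its own.
with open("tiny.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["feat1", "feat2", "label"], [1.0, 2.0, 0], [3.0, 4.0, 1]])

csv_module_dataset = CSVModuleDataset("tiny.csv")
csv_features, csv_label = csv_module_dataset[1]
```

Converting the whole file in __init__ keeps indexing fast but holds all rows in memory, so this approach suits files that fit comfortably in RAM.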

Example

This example writes a small CSV file with numeric features and integer labels, loads it through the custom Dataset, and iterates over the batches as a training loop would.

python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Last column is the label, the rest are features (positional access via .iloc)
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(int(row.iloc[-1]), dtype=torch.long)
        return features, label

# Create a sample CSV file
import csv
with open('sample.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['feat1', 'feat2', 'label'])
    writer.writerow([1.0, 2.0, 0])
    writer.writerow([3.0, 4.0, 1])
    writer.writerow([5.0, 6.0, 0])

# Load dataset and dataloader
dataset = CSVDataset('sample.csv')
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch_idx, (features, labels) in enumerate(dataloader):
    print(f"Batch {batch_idx}")
    print("Features:", features)
    print("Labels:", labels)
Output
Batch 0
Features: tensor([[1., 2.],
        [3., 4.]])
Labels: tensor([0, 1])
Batch 1
Features: tensor([[5., 6.]])
Labels: tensor([0])

Common Pitfalls

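  • Returning raw NumPy values instead of tensors leaves the conversion to DataLoader's default collate function, which typically produces float64 tensors and can trigger dtype mismatch errors during training.
  • Forgetting to set correct data types (e.g., float32 for features, long for classification labels) can cause errors in the model or the loss function.
  • pandas.read_csv treats the first row as a header by default. If the file has no header row, pass header=None; otherwise the first data row is consumed as column names.
  • Slicing the wrong columns with iloc (for example, including the label among the features) causes shape mismatches between the data and the model.
  • Leaving shuffle=False for training data feeds samples in file order on every epoch, which may reduce training effectiveness, especially when the CSV is sorted by label.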
python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

# Wrong: Not converting to tensor
class WrongDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = row.iloc[:-1].values  # NumPy array, not a tensor
        label = row.iloc[-1]  # NumPy scalar, not a tensor
        return features, label

# Right: Convert to tensor with correct types
class RightDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(int(row.iloc[-1]), dtype=torch.long)
        return features, label
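
The header and missing-value pitfalls are not covered by the classes above. One possible way to handle both is sketched here, assuming a numeric CSV whose last column is an integer label. The class name CleanCSVDataset, the has_header parameter, and the file no_header.csv are hypothetical, and dropping incomplete rows is only one of several reasonable policies.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class CleanCSVDataset(Dataset):
    """Hypothetical dataset that handles headerless files and empty cells."""

    def __init__(self, csv_file, has_header=True):
        # header=0 uses the first row as column names; header=None treats it as data.
        frame = pd.read_csv(csv_file, header=0 if has_header else None)
        # Drop incomplete rows; filling them (e.g. frame.fillna(0)) is another option.
        frame = frame.dropna().reset_index(drop=True)
        self.features = torch.tensor(frame.iloc[:, :-1].to_numpy(), dtype=torch.float32)
        self.labels = torch.tensor(frame.iloc[:, -1].to_numpy(), dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Illustrative file: no header row, and the second row is missing a feature.
with open("no_header.csv", "w", newline="") as f:
    f.write("1.0,2.0,0\n3.0,,1\n5.0,6.0,0\n")

clean_dataset = CleanCSVDataset("no_header.csv", has_header=False)
print(len(clean_dataset))  # the incomplete row has been dropped
```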

Quick Reference

Tips for loading CSV datasets in PyTorch:

  • Use pandas.read_csv() to load CSV files easily.
  • Always convert data to torch.tensor with correct types.
  • Implement __len__ and __getitem__ in your Dataset.
  • Use DataLoader for batching and shuffling.
  • Handle headers and missing data carefully.

Key Takeaways

Create a custom Dataset class to load CSV data and convert rows to tensors.
Use DataLoader to batch and shuffle data for efficient training.
Ensure features and labels have correct tensor types: float32 for features, long for labels.
Handle CSV headers and missing values properly to avoid errors.
Test your Dataset by iterating over a few batches before training your model.
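
The last takeaway can be turned into a quick sanity check. The sketch below pulls one batch and verifies shapes and dtypes before any training starts. The helper name check_first_batch is hypothetical, and a TensorDataset stands in for a CSV-backed dataset so the snippet runs without a file.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def check_first_batch(dataloader, num_features):
    """Hypothetical helper: verify shapes and dtypes of the first batch."""
    features, labels = next(iter(dataloader))
    assert features.dtype == torch.float32, f"features are {features.dtype}"
    assert labels.dtype == torch.long, f"labels are {labels.dtype}"
    assert features.shape[1] == num_features, f"got {features.shape[1]} features"
    assert features.shape[0] == labels.shape[0], "batch sizes differ"
    assert not torch.isnan(features).any(), "features contain NaN"
    return features.shape, labels.shape


# Stand-in for a CSV-backed dataset, mirroring the three-row sample file above.
stand_in = TensorDataset(
    torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    torch.tensor([0, 1, 0]),
)
shapes = check_first_batch(DataLoader(stand_in, batch_size=2), num_features=2)
print(shapes)
```

The same helper should work with any DataLoader that yields (features, labels) pairs, including one built from the CSVDataset class in this guide.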