How to Load a CSV Dataset in PyTorch: A Simple Guide
To load a CSV dataset in PyTorch, create a custom Dataset class that reads the CSV file and returns data samples. Then use DataLoader to batch and shuffle the data for training.

Syntax
Loading a CSV dataset in PyTorch involves these steps:
- Custom Dataset class: Inherit from `torch.utils.data.Dataset` and implement `__init__`, `__len__`, and `__getitem__`.
- Reading CSV: Use `pandas` or the `csv` module to load the data in `__init__`.
- DataLoader: Wrap the Dataset with `torch.utils.data.DataLoader` to enable batching and shuffling.
```python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        # Read the whole CSV into memory once
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Example layout: last column is the label, the rest are features
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Usage:
dataset = CSVDataset('data.csv')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```
Example
This example shows how to load a CSV file with numeric features and labels, then iterate batches for training.
```python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Create a sample CSV file
import csv
with open('sample.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['feat1', 'feat2', 'label'])
    writer.writerow([1.0, 2.0, 0])
    writer.writerow([3.0, 4.0, 1])
    writer.writerow([5.0, 6.0, 0])

# Load dataset and dataloader
dataset = CSVDataset('sample.csv')
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch_idx, (features, labels) in enumerate(dataloader):
    print(f"Batch {batch_idx}")
    print("Features:", features)
    print("Labels:", labels)
```
Output
Batch 0
Features: tensor([[1., 2.],
[3., 4.]])
Labels: tensor([0, 1])
Batch 1
Features: tensor([[5., 6.]])
Labels: tensor([0])
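Once batches load correctly, the same DataLoader plugs straight into an ordinary training loop. Here is a minimal sketch: the two-feature, two-class shapes match the sample CSV above, while the linear model and hyperparameters are illustrative choices, not part of the original guide.

```python
import csv

import torch
import pandas as pd
from torch import nn
from torch.utils.data import Dataset, DataLoader

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Recreate the sample CSV from the example above
with open('sample.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['feat1', 'feat2', 'label'],
                             [1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 0]])

dataloader = DataLoader(CSVDataset('sample.csv'), batch_size=2, shuffle=True)

model = nn.Linear(2, 2)  # 2 features in, 2 classes out (illustrative)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):
    for features, labels in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Because `__getitem__` already returns correctly typed tensors, the loop needs no per-batch conversion code.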
Common Pitfalls
- Not converting data to `torch.tensor` causes errors during training.
- Forgetting to set correct data types (e.g., `float32` for features, `long` for labels) can cause model errors.
- Not handling header rows in the CSV can lead to wrong data parsing.
- Using `iloc` with wrong indices or slicing can cause shape mismatches.
- Not shuffling data in `DataLoader` may reduce training effectiveness.
```python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

# Wrong: __getitem__ returns raw pandas/numpy objects, not tensors
class WrongDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = row.iloc[:-1].values  # Not a tensor
        label = row.iloc[-1]             # Not a tensor
        return features, label

# Right: convert to tensors with the correct dtypes
class RightDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label
```
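Two of the pitfalls above, header rows and missing values, are best handled at read time rather than inside `__getitem__`. A sketch, assuming a headerless file and filling NaNs with the column mean (whether mean-filling is appropriate depends on your data; the file name and column names here are illustrative):

```python
import csv

import pandas as pd

# A headerless CSV with one missing feature value
with open('raw.csv', 'w', newline='') as f:
    csv.writer(f).writerows([[1.0, 2.0, 0],
                             [3.0, '', 1],
                             [5.0, 6.0, 0]])

# header=None tells pandas the first row is data, not column names;
# names= assigns our own column labels
df = pd.read_csv('raw.csv', header=None, names=['feat1', 'feat2', 'label'])

# Fill missing feature values with the column mean (one common choice)
df['feat2'] = df['feat2'].fillna(df['feat2'].mean())
print(df)
```

If you skip `header=None` on a headerless file, pandas silently treats the first data row as column names, which is exactly the "wrong data parsing" pitfall listed above.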
Quick Reference
Tips for loading CSV datasets in PyTorch:
- Use `pandas.read_csv()` to load CSV files easily.
- Always convert data to `torch.tensor` with the correct dtypes.
- Implement `__len__` and `__getitem__` in your Dataset.
- Use `DataLoader` for batching and shuffling.
- Handle headers and missing data carefully.
Key Takeaways
- Create a custom Dataset class to load CSV data and convert rows to tensors.
- Use DataLoader to batch and shuffle data for efficient training.
- Ensure features and labels have the correct tensor types: float32 for features, long for labels.
- Handle CSV headers and missing values properly to avoid errors.
- Test your Dataset by iterating batches before training your model.
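The last takeaway, testing before training, can be as small as pulling one batch and checking shapes and dtypes. A sketch reusing the `CSVDataset` class and sample file from the example above:

```python
import csv

import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        features = torch.tensor(row.iloc[:-1].values, dtype=torch.float32)
        label = torch.tensor(row.iloc[-1], dtype=torch.long)
        return features, label

# Recreate the sample CSV from the example above
with open('sample.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['feat1', 'feat2', 'label'],
                             [1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 0]])

loader = DataLoader(CSVDataset('sample.csv'), batch_size=2)
features, labels = next(iter(loader))

# Check one batch before committing to a full training run
assert features.shape == (2, 2), features.shape
assert features.dtype == torch.float32
assert labels.dtype == torch.int64  # torch.long
print("batch OK:", features.shape, labels.shape)
```

Catching a shape or dtype mismatch here takes seconds; catching it mid-training, after the model's first forward pass fails, takes much longer.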