Overview - How Git stores objects

What is it?

Git stores data as objects in a database called the object store. Each object represents a piece of your project: a file's content, a directory listing, or a snapshot of the whole project at a moment in time. Each object is identified by a hash computed from its content, which lets Git look it up quickly and detect corruption. This design lets Git track changes efficiently and safely.

Why it matters

Without Git's object storage system, tracking changes in files and folders would be slow and unreliable. Developers would struggle to manage versions, collaborate, or recover lost work. Git's object storage ensures data integrity, fast access, and efficient storage, making modern software development smooth and dependable.

Where it fits

Before learning how Git stores objects, you should understand basic Git concepts like commits, branches, and repositories. After this, you can explore how Git uses these objects to build history, manage branches, and perform operations like merging and rebasing.

Mental Model

Core Idea

Git stores every piece of your project as a uniquely identified object in a content-addressable database to track changes efficiently and securely.

Think of it like...

Imagine a library where every book is given a unique barcode based on its content. Instead of searching by title or author, you scan the barcode to find the exact book. Git’s objects are like these barcoded books, ensuring you always get the exact content you need.

┌───────────────┐
│ Git Object DB │
├───────────────┤
│ Blob (file)   │
│ Tree (folder) │
│ Commit        │
│ Tag           │
└─────┬─────────┘
      │
      ▼
  Unique SHA-1 Hash
      │
      ▼
  Content stored compressed and checksummed
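
The barcode idea can be sketched as a tiny in-memory store whose keys are derived from the content itself. This is a toy illustration, not Git's implementation: the class name ToyObjectStore and its methods are hypothetical, and real Git hashes a small header together with the content rather than the raw bytes alone.

```python
import hashlib


class ToyObjectStore:
    """Hypothetical in-memory store keyed by the SHA-1 of the content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content  # identical content maps to the same key
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ToyObjectStore()
first = store.put(b"print('hello')\n")
second = store.put(b"print('hello')\n")   # same bytes, so the same key
third = store.put(b"print('goodbye')\n")  # different bytes, different key
```

Storing the same bytes twice leaves one entry in the store, which is the deduplication property the analogy describes.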

Build-Up - 7 Steps

1

Foundation: Git objects basics

Concept: Git stores data as objects of four types: blobs, trees, commits, and tags.

Git breaks down your project into four object types:
- Blob: stores file content (not the file name).
- Tree: stores a directory listing, mapping names to blobs or other trees.
- Commit: records a snapshot by pointing to a root tree and to its parent commits, along with author, date, and message.
- Tag: an annotated tag that gives a name to a specific object, usually a commit.
Each object is stored in compressed form and identified by a SHA-1 hash.

Result

You understand the four core object types Git uses to represent your project data.

Knowing these object types helps you see how Git models your project as a set of connected data pieces, not just files.
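
All four types get their ID the same way: Git hashes the type name, the body length, a null byte, and then the body. The sketch below shows that for a blob and a tree, assuming a SHA-1 repository; the function name object_id is a hypothetical helper, not a Git API.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 ID for an object with the given type and body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# The type is part of the hashed data, so an empty blob and an empty tree
# get different IDs even though both bodies are empty.
empty_blob = object_id("blob", b"")
empty_tree = object_id("tree", b"")
content_blob = object_id("blob", b"test content\n")
```

The result for a blob can be compared with what git hash-object prints for a file holding the same bytes.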

2

Foundation: Content-addressable storage explained

3

Intermediate: Object storage on disk

4

Intermediate: Packfiles for efficiency

5

Intermediate: Object referencing and linking

6

Advanced: Object hashing and collision resistance

7

Expert: Delta compression in packfiles

Under the Hood

Git stores an object by first building a header that contains the object type and the content size in bytes, followed by a null byte, and then appending the raw content. It computes a SHA-1 hash of this combined data, which acts as the object's unique ID. The data is compressed with zlib and saved as a loose object under .git/objects, in a subdirectory named after the first two hex characters of the hash, with the remaining 38 characters as the file name. Over time, Git packs many objects into packfiles using delta compression to save space. Commits reference trees, which reference blobs and other trees, forming a directed acyclic graph representing project history.
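
A minimal sketch of the loose-object steps described above, assuming a SHA-1 repository. The helpers write_loose_object and read_loose_object are hypothetical names, and the temporary directory stands in for .git/objects; real Git adds safeguards this sketch omits, such as writing through a temporary file.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, body: bytes) -> str:
    """Hash, compress, and store one object; return its ID."""
    store = f"{obj_type} {len(body)}".encode("ascii") + b"\x00" + body
    oid = hashlib.sha1(store).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]   # directory = first two hex digits
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return oid


def read_loose_object(objects_dir: Path, oid: str):
    """Return (type, body) for an object written by write_loose_object."""
    store = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = store.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"      # stand-in for .git/objects
    oid = write_loose_object(objects_dir, "blob", b"test content\n")
    relative_path = f"{oid[:2]}/{oid[2:]}"
    obj_type, body = read_loose_object(objects_dir, oid)
```

Reading the object back reverses the steps: decompress, split the header from the body at the first null byte, and check the recorded size.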

Why designed this way?

Git was designed to be fast, reliable, and space-efficient for version control. Using content-addressable storage with hashes ensures data integrity and makes changes easy to detect, because any change in content produces a different ID. Splitting loose objects across subdirectories keeps any single directory from growing too large, which some filesystems handle poorly. Packfiles and delta compression address scaling issues as projects grow. Alternatives such as storing a full copy of every version or detecting changes by timestamp would be slower or less reliable, so Git’s design balances speed, safety, and storage.

┌───────────────┐
│ User files    │
└──────┬────────┘
       │ git add (creates blobs)
       ▼
┌───────────────┐
│ Blob objects  │
└──────┬────────┘
       │ referenced by
       ▼
┌───────────────┐
│ Tree objects  │
└──────┬────────┘
       │ referenced by
       ▼
┌───────────────┐
│ Commit object │
└──────┬────────┘
       │ stored as
       ▼
┌───────────────┐
│ Object store  │
│ (.git/objects)│
└───────────────┘
       │
       ▼
┌───────────────┐
│ Packfiles     │
│ (compressed)  │
└───────────────┘
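
The chain in the diagram can be sketched by building one blob, one tree that names it, and one commit that points at the tree. This assumes SHA-1 object IDs; object_id is a hypothetical helper, and the file name, author identity, and timestamp are placeholders chosen for illustration.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 ID for an object with the given type and body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# 1. A blob holds file content only; the file name lives in the tree.
blob_id = object_id("blob", b"test content\n")

# 2. A tree entry is "<mode> <name>\0" followed by the 20 raw bytes of the ID.
tree_body = b"100644 notes.txt\x00" + bytes.fromhex(blob_id)
tree_id = object_id("tree", tree_body)

# 3. A commit is text that names the tree, then author data and a message.
#    The identity and timestamp here are placeholders.
commit_body = (
    f"tree {tree_id}\n"
    "author Example Dev <dev@example.com> 0 +0000\n"
    "committer Example Dev <dev@example.com> 0 +0000\n"
    "\n"
    "Add notes\n"
).encode("utf-8")
commit_id = object_id("commit", commit_body)
```

Because each ID is derived from the IDs beneath it, changing the file content changes the blob ID, then the tree ID, and finally the commit ID.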

Myth Busters - 4 Common Misconceptions

Quick: Do you think Git stores full copies of every file version separately? Commit yes or no.

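Common Belief: Git stores a full copy of every file each time you commit.

Reality: Git stores a new blob only when a file's content changes; unchanged files are referenced again by the same hash. Packfiles reduce the cost further by storing similar objects as deltas.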

Quick: Do you think the SHA-1 hash in Git is just a random ID? Commit yes or no.

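Common Belief: The SHA-1 hash is just a random identifier assigned to objects.

Reality: The hash is computed from the object's type, size, and content, so identical content always produces the same ID and any change produces a different one.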

Quick: Do you think packfiles store objects uncompressed? Commit yes or no.

Common Belief: Packfiles store objects as-is without compression.

Reality: Objects in packfiles are zlib-compressed, and many are stored as deltas against similar objects.
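
Delta storage in packfiles rests on describing one object as a set of "copy this range from a base object" and "insert these new bytes" instructions. The sketch below shows that idea only. It uses Python's difflib to find matching ranges, which is a stand-in for Git's own matching algorithm, and it keeps instructions as tuples rather than using Git's binary delta encoding. The names make_delta and apply_delta are hypothetical.

```python
import zlib
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # bytes that are new in target
        # "delete" needs no instruction: those base bytes are just not copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n" * 20
target = base.replace(b"line two", b"line 2", 1)   # a one-line edit
delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
compressed_full = zlib.compress(target)            # whole-object compression
```

For a small edit, the delta carries only a few literal bytes, while the rest of the target is expressed as copies from the base.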

Quick: Do you think commits store file data directly? Commit yes or no.

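Common Belief: Commits contain the actual file contents of the project snapshot.

Reality: A commit stores metadata plus a reference to a root tree; the tree references the blobs that hold file content.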

Expert Zone

1

Git’s object storage design allows for efficient garbage collection: any object that cannot be reached from a reference (such as a branch, tag, or reflog entry) by walking commits, trees, and blobs is a candidate for removal.

2

Delta chains in packfiles trade compression ratio against access speed, because reading an object means applying every delta in its chain. The balance can be tuned with settings such as pack.window and pack.depth.

3

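Git’s transition from SHA-1 to SHA-256 is planned with interoperability in mind, but today a repository uses a single hash format chosen at creation (for example, git init --object-format=sha256), and interoperation between SHA-1 and SHA-256 repositories is still under development.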
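
The reachability idea behind garbage collection can be sketched as a graph walk from the references. This is a simplified model with a hypothetical object graph and made-up IDs; real git gc also accounts for things such as reflog entries and a grace period before pruning.

```python
def reachable_objects(refs, edges):
    """Return every object ID reachable from the given starting IDs."""
    seen = set()
    stack = list(refs)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(edges.get(oid, ()))   # follow links to parents, trees, blobs
    return seen


# Hypothetical object graph: commit -> parent and tree, tree -> blobs.
edges = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "orphan_commit": ["orphan_tree"],      # e.g. left behind by an amended commit
    "orphan_tree": ["blob_c"],
}
all_objects = set(edges) | {child for kids in edges.values() for child in kids}
refs = ["commit2"]                         # e.g. the tip of a branch
live = reachable_objects(refs, edges)
garbage = all_objects - live
```

Everything not visited by the walk is unreachable and therefore a candidate for pruning.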

When NOT to use

Git’s object storage is optimized for source code and text files. For large binary files or datasets, specialized tools like Git LFS or external storage systems are better suited to avoid performance degradation.

Production Patterns

In production, Git repositories are typically repacked regularly to reduce size and improve clone and fetch speed. Continuous integration systems rely on Git’s object model to quickly check out specific commits. Advanced workflows use shallow clones (git clone --depth) and partial clones (git clone --filter) to limit object transfer.

Connections

Content-addressable storage (CAS)

Git’s object storage is a practical implementation of CAS principles.

Understanding CAS in distributed storage systems helps grasp how Git ensures data integrity and deduplication.

Hash functions in cryptography

Git uses cryptographic hash functions to identify objects uniquely and securely.

Knowing how hash functions work explains Git’s resistance to data corruption and tampering.
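
Since an object's name is the hash of its data, corruption or tampering can be detected by recomputing the hash and comparing it with the name. The sketch below shows that check for a zlib-compressed object, assuming SHA-1 IDs. The helpers encode_object and verify_object are hypothetical names; the check is similar in spirit to part of what git fsck does.

```python
import hashlib
import zlib


def encode_object(obj_type: str, body: bytes):
    """Return (object ID, compressed bytes) for an object."""
    store = f"{obj_type} {len(body)}".encode("ascii") + b"\x00" + body
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def verify_object(expected_id: str, compressed: bytes) -> bool:
    """Check that the compressed data still hashes to the expected ID."""
    try:
        store = zlib.decompress(compressed)
    except zlib.error:
        return False                        # damaged beyond decompression
    return hashlib.sha1(store).hexdigest() == expected_id


oid, stored = encode_object("blob", b"important data\n")

# Simulate tampering: swap in different content of the same length.
tampered = zlib.compress(
    zlib.decompress(stored).replace(b"important", b"malicious")
)
truncated = stored[:-1]                     # simulate a partially written file
```

Neither the altered nor the truncated data verifies against the original ID, so a silent change cannot go unnoticed as long as the ID itself is trusted.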

Database indexing

Git’s object store behaves like an indexed database: each packfile has a companion index file that maps object hashes to positions in the pack for fast retrieval.

Recognizing Git’s storage as an indexing system clarifies how it achieves quick access to project history.
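
One indexing technique used by pack index files is a fan-out table: 256 running totals keyed by the first byte of the hash, which narrow a binary search to a small slice of the sorted ID list. The sketch below illustrates that lookup under simplifying assumptions. The names build_index and find are hypothetical, the IDs are generated for the example, and a real index file also stores pack offsets and checksums that are left out here.

```python
import hashlib
from bisect import bisect_left


def build_index(object_ids):
    """Return (sorted_ids, fanout); fanout[b] counts IDs with first byte <= b."""
    sorted_ids = sorted(object_ids)
    fanout = [0] * 256
    for oid in sorted_ids:
        fanout[int(oid[:2], 16)] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]         # per-byte counts become running totals
    return sorted_ids, fanout


def find(sorted_ids, fanout, oid):
    """Return the position of oid in sorted_ids, or -1 if it is absent."""
    first = int(oid[:2], 16)
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]                     # IDs sharing this first byte: [lo, hi)
    pos = bisect_left(sorted_ids, oid, lo, hi)
    return pos if pos < hi and sorted_ids[pos] == oid else -1


# Generated IDs for illustration only.
ids = [hashlib.sha1(f"object {n}".encode()).hexdigest() for n in range(1000)]
sorted_ids, fanout = build_index(ids)
```

The last fan-out entry equals the total number of objects, and each lookup searches only the IDs that share the first byte of the requested hash.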

Common Pitfalls

#1 Trying to manually edit files inside the .git/objects directory.

Wrong approach: echo 'change' > .git/objects/e6/8f3a... (editing object files directly)

Correct approach: Use Git commands such as git add and git commit to modify repository content safely.

Root cause: Not realizing that Git objects are compressed, hashed data rather than plain files; an edited file no longer matches the hash in its name.

#2 Deleting loose object files to save space without packing.

Wrong approach: rm .git/objects/e6/8f3a... (removing object files manually)

Correct approach: Run git gc to safely pack objects and prune unreachable ones.

Root cause: Not knowing Git’s internal storage and garbage collection mechanisms.

#3 Assuming SHA-1 hashes are collision-proof and ignoring security updates.

Wrong approach: Ignoring Git warnings about SHA-1 vulnerabilities and not upgrading.

Correct approach: Keep Git up to date, since recent versions use a collision-detecting SHA-1 implementation, and consider SHA-256 repositories where your hosting and tooling support them.

Root cause: Lack of awareness about cryptographic hash weaknesses and Git’s evolving security.

Key Takeaways

Git stores project data as objects identified by hashes, ensuring integrity and uniqueness.

Objects include blobs for file content, trees for directories, commits for snapshots, and tags for named pointers to specific objects.

Git uses a content-addressable storage system that compresses and organizes objects efficiently on disk.

Packfiles and delta compression let Git scale to large histories while keeping storage and transfer costs low.

Understanding Git’s object storage reveals the foundation of its speed, reliability, and powerful version control features.