
You Use Git Every Day, But Do You Really Understand It? — Completely Understand Git from Its Underlying Data Structures
You've been using Git for years. You can handle commit, push, merge, and rebase easily. But if someone asks you:
- "Why are Git branches so fast?"
- "What exactly is stored in the .git folder?"
- "Where do the files come from when you do git checkout?"
You might hesitate.
This article breaks down how Git works under the hood. You don't need to have read Git's source code, but after reading, every Git action should make perfect, "obvious" sense.
Most people understand Git as a "version control tool." That's correct, but that's the high-level view.
From the low level, Git is a content-addressable key-value storage system.
What does that mean?
You give Git some data. Git computes a SHA-1 hash for it, producing a 40-character hexadecimal string as the key. Then it stores the data. Later, you can use that key to retrieve the original data.
It's like a massive HashMap<String, byte[]>. The key is the hash of the content, and the value is the compressed content itself.
You can verify this yourself:
# Manually write some content into the Git database and keep the returned key
key=$(echo "Hello Git Internals" | git hash-object -w --stdin)
echo "$key"
# Output: a 40-character hexadecimal hash
# Retrieve the content using the key
git cat-file -p "$key"
# Output: Hello Git Internals
This is the entire foundation of Git. All the concepts that follow (files, directories, commits, branches) are built on top of this key-value store.
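To make the key-value idea concrete, here is a minimal Python sketch of such a store. It assumes the loose-object format Git uses in SHA-1 repositories (a header of the form `<type> <size>` plus a NUL byte, followed by the content, all zlib-compressed). The `ObjectStore` class and `object_id` helper are hypothetical names for illustration, not part of Git.

```python
import hashlib
import zlib


def object_id(obj_type, payload):
    """Hash '<type> <size>' + NUL + payload; the hex digest is the key."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


class ObjectStore:
    """Hypothetical in-memory stand-in for .git/objects."""

    def __init__(self):
        self._objects = {}

    def write(self, obj_type, payload):
        header = f"{obj_type} {len(payload)}".encode() + b"\x00"
        key = object_id(obj_type, payload)
        # The stored value is the compressed header + content
        self._objects[key] = zlib.compress(header + payload)
        return key

    def read(self, key):
        raw = zlib.decompress(self._objects[key])
        _header, _, payload = raw.partition(b"\x00")
        return payload


store = ObjectStore()
key = store.write("blob", b"Hello Git Internals\n")
print(key)               # 40 hexadecimal characters
print(store.read(key))   # b'Hello Git Internals\n'
```

Because the key is derived only from the type and the content, writing the same bytes twice yields the same key, which is the property the rest of the article builds on.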
Inside the .git/objects/ directory, there are only 4 types of objects. Understand them, and you understand 90% of Git.
A Blob (Binary Large Object) stores pure file content. It does not contain any metadata like the filename, path, or permissions.
# Compute the blob hash for a file
git hash-object README.md
# Output: a1b2c3d4...
# Print that blob (it exists in the database once this version of README.md has been added or committed)
git cat-file -p a1b2c3d4
# Output: The content of README.md
A key design point: if two files have identical content, no matter their names or directories, Git stores only one blob.
This means if you copy and paste the same config file 10 times in your project, Git's storage increases by only one copy.
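The deduplication follows directly from hashing content only. A small sketch, assuming a hypothetical project with ten copies of one config file (the paths and the `blob_id` helper are made up for illustration):

```python
import hashlib


def blob_id(content):
    """Blob ID: SHA-1 over 'blob <size>' + NUL + content (no filename involved)."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


config = b"debug = false\n"
# Hypothetical paths: the same file copied into ten directories
paths = [f"service{i}/app.conf" for i in range(10)]

objects = {}   # stand-in for .git/objects: blob ID -> content
entries = {}   # path -> blob ID (the kind of mapping trees record)
for path in paths:
    oid = blob_id(config)
    objects[oid] = config   # writing the same key again adds nothing
    entries[path] = oid

print(len(entries), "paths,", len(objects), "stored blob")
```

Ten paths end up pointing at one stored object, because the filename never enters the hash.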
Since blobs don't have filenames, where are filenames stored? The answer is the tree object.
A Tree is a "directory listing" that records:
- The type of each entry (blob or sub-tree)
- File permissions (100644 for normal files, 100755 for executables, 040000 for directories)
- The hash of the corresponding object
- The file/directory name
# View a tree object
git cat-file -p main^{tree}
Output looks like:
100644 blob 8ab686e... .eslintrc.json
100644 blob f1d2d2f... .gitignore
100644 blob 7c211433... package.json
040000 tree 4b825dc... src
src is a sub-tree. It points to its own blobs and trees. This is a recursive structure, exactly like a filesystem directory tree.
Visualized:
tree (root directory)
├── blob README.md
├── blob package.json
└── tree src/
    ├── blob index.ts
    └── tree components/
        └── blob Button.tsx
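The recursive structure can be sketched by encoding trees by hand. This example assumes the raw tree format of SHA-1 repositories: each entry is `<mode> <name>`, a NUL byte, and the 20 raw bytes of the child's ID, with entries sorted by name. The file contents and helper names are hypothetical.

```python
import hashlib


def object_id(obj_type, payload):
    """SHA-1 over '<type> <size>' + NUL + payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def encode_tree(entries):
    """entries: iterable of (mode, name, hex_id). Returns the raw tree payload."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, treating directories as if they ended in '/'
        return (name + "/" if mode == "40000" else name).encode()

    payload = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        payload += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return payload


# Hypothetical file contents for a two-level project
index_ts = object_id("blob", b"export {};\n")
readme = object_id("blob", b"# Demo\n")

src_payload = encode_tree([("100644", "index.ts", index_ts)])
src_tree = object_id("tree", src_payload)

root_payload = encode_tree([
    ("100644", "README.md", readme),
    ("40000", "src", src_tree),   # stored as 40000; cat-file -p displays 040000
])
print(object_id("tree", root_payload))
print(object_id("tree", encode_tree([])))   # ID of a tree with no entries
```

The root tree's ID depends on the ID of `src`, which depends on the blob inside it, so any change deep in the directory structure changes every tree above it.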
The Commit object is the "glue" that ties everything together. It records:
git cat-file -p HEAD
tree d8329fc... ← Points to the root tree (full snapshot of this commit)
parent 1a410ef... ← Parent commit (previous commit)
author xxx <xxx@example.com> 1714200000 +0800
committer xxx <xxx@example.com> 1714200000 +0800

feat: implement user login
Each commit points to a complete tree. This means each commit is a complete snapshot of the project at that moment.
This differs from many people's intuition—you might think commits store "what changed" (diff), but actually, commits store "what the whole project looked like" (snapshot).
Then how does git diff work between two commits? It compares the two tree objects on demand, rather than storing differences in advance.
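A rough sketch of that comparison, assuming each snapshot has already been flattened into a mapping from path to blob ID. The IDs below are short placeholders, and `diff_snapshots` is a hypothetical helper; real Git walks the trees recursively and can skip a whole subdirectory when the two sub-tree IDs match.

```python
def diff_snapshots(old, new):
    """old/new map path -> blob ID. Returns a sorted list of (change, path)."""
    changes = []
    for path in sorted(old.keys() | new.keys()):
        before, after = old.get(path), new.get(path)
        if before == after:
            continue                      # same blob ID means identical content
        if before is None:
            changes.append(("added", path))
        elif after is None:
            changes.append(("deleted", path))
        else:
            changes.append(("modified", path))
    return changes


# Placeholder IDs for two hypothetical snapshots
old = {"package.json": "789abc", "src/index.ts": "a1b2c3", "old.txt": "ffff00"}
new = {"package.json": "789abc", "src/index.ts": "d4e5f6", "docs/guide.md": "123456"}

for change, path in diff_snapshots(old, new):
    print(change, path)
```

Only after this ID-level comparison does a line-by-line diff need to run, and only on the paths reported as modified.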
The fourth object type is the tag. An annotated tag is a separate object pointing to a commit with extra info (tagger name, message, signature, etc.). A lightweight tag is just a reference, not an object. Tag objects come up less often in daily development, so we won't elaborate here.
Now we have all the building blocks. Let's see how they assemble:
[Diagram: the reference layer (HEAD, refs/heads/main, refs/heads/dev) pointing into the commit chain]
The entire Git repository is a Directed Acyclic Graph (DAG):
- HEAD → current branch → latest commit
- Commit → parent commit (forms history chain) + root tree (complete snapshot)
- Tree → sub-trees + blobs (directory structure)
- Blob → file content
How Git implements branches might be one of its most elegant designs.
cat .git/refs/heads/main
# Output: a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2 (40 chars + newline)
A branch is simply a text file containing a 40-character commit hash.
Therefore:
| Action | What Actually Happens |
|---|---|
| Create branch: git branch dev | Create a 41-byte file .git/refs/heads/dev |
| Switch branch: git checkout dev | Change the content of .git/HEAD to ref: refs/heads/dev, then update the index and working directory to match that commit |
| Commit: git commit | Create a new commit object, then update the hash in the current branch's file to the new commit's hash |
| Delete branch: git branch -d dev | Delete the file .git/refs/heads/dev |
This is why Git branch operations are instantaneous—you're just reading/writing a tiny file of a few dozen bytes, not copying an entire directory like in SVN.
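The ref operations described above can be imitated with ordinary file writes. This is a toy sketch in a temporary directory, not a real repository; it assumes unpacked refs, reuses the placeholder hash from the example above, and `resolve_head` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

git_dir = Path(tempfile.mkdtemp()) / ".git"   # toy layout, not a real repo
heads = git_dir / "refs" / "heads"
heads.mkdir(parents=True)

commit = "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2"   # placeholder hash
(heads / "main").write_bytes((commit + "\n").encode())
(git_dir / "HEAD").write_bytes(b"ref: refs/heads/main\n")

# "git branch dev": write the current commit hash into a new 41-byte file
(heads / "dev").write_bytes((commit + "\n").encode())

# "git checkout dev" (ref part only): repoint HEAD at the other branch
(git_dir / "HEAD").write_bytes(b"ref: refs/heads/dev\n")


def resolve_head(git_dir):
    """Follow HEAD to a commit hash."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        return (git_dir / head[len("ref: "):]).read_text().strip()
    return head   # a detached HEAD stores the commit hash directly


print((heads / "dev").stat().st_size)   # 41
print(resolve_head(git_dir))
```

Nothing here touches file contents or history, which is why the cost of creating a branch does not depend on the size of the project.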
Many know Git has three areas: "Working Directory → Staging Area → Repository." But they're unclear about what the Staging Area actually is.
The Staging Area is the binary file .git/index. It's essentially a flattened tree, recording:
| Path | Permissions | Blob Hash | Mtime | Size | ... |
|---|---|---|---|---|---|
| src/index.ts | 100644 | a1b2c3d4... | ... | ... | ... |
| src/components/App.tsx | 100644 | d4e5f6... | ... | ... | ... |
| package.json | 100644 | 789abc... | ... | ... | ... |

When you execute git add file.ts:
- Git hashes the content of file.ts and stores it as a blob object.
- Git updates .git/index, changing the blob hash for file.ts to the new one.
When you execute git commit:
- Git converts the contents of .git/index into tree objects (creating sub-trees recursively).
- Git creates a commit object pointing to the root tree.
- Git updates the reference of the current branch.
Therefore, the Staging Area is the "draft for the next commit."
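The first commit step, turning the flat index into nested trees, can be sketched as a short recursive function. This assumes the index is reduced to a mapping from path to (mode, blob ID) and ignores the other index fields; `write_tree` here is a hypothetical Python helper that only mirrors the idea, and the file contents are made up.

```python
import hashlib


def object_id(obj_type, payload):
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def write_tree(index, objects):
    """index: {path: (mode, blob_id)}. Stores tree payloads, returns the root tree ID."""
    files, subdirs = {}, {}
    for path, (mode, blob) in index.items():
        head, sep, rest = path.partition("/")
        if sep:
            subdirs.setdefault(head, {})[rest] = (mode, blob)   # belongs to a sub-tree
        else:
            files[head] = (mode, blob)

    entries = [(mode, name, blob) for name, (mode, blob) in files.items()]
    for name, sub_index in subdirs.items():
        entries.append(("40000", name, write_tree(sub_index, objects)))

    def sort_key(entry):
        mode, name, _ = entry
        return (name + "/" if mode == "40000" else name).encode()

    payload = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
        for mode, name, oid in sorted(entries, key=sort_key)
    )
    tree_id = object_id("tree", payload)
    objects[tree_id] = payload
    return tree_id


# Hypothetical staged files, using the paths from the table above
index = {
    "src/index.ts": ("100644", object_id("blob", b"export {};\n")),
    "src/components/App.tsx": ("100644", object_id("blob", b"// app\n")),
    "package.json": ("100644", object_id("blob", b"{}\n")),
}
objects = {}
root = write_tree(index, objects)
print(root, len(objects))   # one tree each for the root, src, and src/components
```

Three flat index entries produce three tree objects, one per directory level, and the commit then only needs to record the root ID.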
If every file in every version were stored as a full blob, wouldn't the repository explode?
Git has a background optimization called packfile. When the number of objects reaches a threshold (or you run git gc), Git will:
- Find objects with similar content (e.g., different versions of the same file).
- Typically keep the most recent version in full and store older versions as deltas (differences) against it.
- Pack all objects into a single .pack file, accompanied by an .idx index file.
.git/objects/pack/
├── pack-abc123.idx ← Index: hash → offset in the .pack file
└── pack-abc123.pack ← All objects packed together, similar objects delta-compressed
This is why Git repositories are often much smaller than you'd expect: logically, each commit is a full snapshot, but physically, similar content is stored mostly as differences.
Note: This is a pure storage optimization; it doesn't affect the high-level semantics. From the API perspective, you always get the complete blob content; delta expansion is transparent.
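To give a feel for delta storage, the sketch below rebuilds an older version from a newer one using copy and insert instructions. Git's pack deltas are also built from copy and insert instructions, but the matching algorithm (Python's `difflib`) and the in-memory instruction format used here are stand-ins for illustration, not Git's binary encoding.

```python
from difflib import SequenceMatcher


def make_delta(base, target):
    """Describe `target` as ('copy', offset, length) / ('insert', data) steps over `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # 'delete' needs no instruction: those base bytes are simply not copied
    return ops


def apply_delta(base, ops):
    """Rebuild the target by replaying the instructions against the base."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical file versions: the newer one is kept in full, the older one as a delta
newer = b"line 1\nline 2 (edited)\nline 3\nline 4\n"
older = b"line 1\nline 2\nline 3\n"

delta = make_delta(newer, older)
print(delta)
print(apply_delta(newer, delta) == older)   # True
```

Reading an old version therefore costs a little extra work (replaying the delta), while the version you are most likely to need is available directly.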
Now that you have the low-level knowledge, let's reinterpret some daily operations:
git clone does three things:
- Download all objects (blobs, trees, commits) from the remote repository to the local .git/objects/.
- Download all references (branches, tags) to .git/refs/.
- Check out the default branch: from commit → tree → blob, "unpack" files into the working directory.
main: A → B → C (main points to C)
dev: A → B → C → D → E (dev points to E)
A fast-forward merge only needs to move the main reference from C to E. No new objects are created; only a pointer is changed.
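A sketch of that pointer move, using single letters as stand-ins for commit hashes. The `parents` and `refs` dictionaries and both helpers are hypothetical; the point is that a fast-forward is allowed only when the current tip is an ancestor of the other branch's tip.

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` is reachable from `commit` by following parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def fast_forward(refs, parents, branch, other):
    """Move `branch` to `other`'s commit if no real merge is needed."""
    if not is_ancestor(parents, refs[branch], refs[other]):
        return False               # histories diverged: a merge commit is required
    refs[branch] = refs[other]     # the only change: one ref now holds a new hash
    return True


# The history from the example above
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"]}
refs = {"main": "C", "dev": "E"}

print(fast_forward(refs, parents, "main", "dev"), refs)
```

The commit graph in `parents` is untouched; only the value stored under `main` changes.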
main: A → B → C → F
dev: A → B → D → E
When the histories have diverged, Git performs a three-way merge: it finds the common ancestor B, compares the tree differences between B→F and B→E, merges them, and creates a new tree and a new commit (with two parents).
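The two ingredients, finding the common ancestor and combining both sides relative to it, can be sketched at the level of whole paths. This is a simplification under stated assumptions: commits are single letters, snapshots are path → blob ID mappings with placeholder IDs, and a path changed differently on both sides is reported as a conflict, whereas real Git would go on to attempt a line-level merge of that file. Both helpers are hypothetical.

```python
def ancestors(parents, commit):
    """All commits reachable from `commit`, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def merge_base(parents, a, b):
    """A nearest common ancestor, found by walking back from b breadth-first."""
    reachable_from_a = ancestors(parents, a)
    queue, seen = [b], set()
    while queue:
        current = queue.pop(0)
        if current in reachable_from_a:
            return current
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, []))
    return None


def three_way_merge(base, ours, theirs):
    """Merge two snapshots (path -> blob ID) relative to their common base."""
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o                 # both sides agree
        elif o == b:
            result = t                 # only their side changed this path
        elif t == b:
            result = o                 # only our side changed this path
        else:
            conflicts.append(path)     # both sides changed it differently
            continue
        if result is not None:         # None means the path was deleted
            merged[path] = result
    return merged, conflicts


# The history from the example above
parents = {"A": [], "B": ["A"], "C": ["B"], "F": ["C"], "D": ["B"], "E": ["D"]}
print(merge_base(parents, "F", "E"))

snapshot_b = {"a.txt": "1", "b.txt": "1"}
snapshot_f = {"a.txt": "2", "b.txt": "1"}                 # main edited a.txt
snapshot_e = {"a.txt": "1", "b.txt": "3", "c.txt": "9"}   # dev edited b.txt, added c.txt
print(three_way_merge(snapshot_b, snapshot_f, snapshot_e))
```

The merged mapping becomes the new tree, and the merge commit records both F and E as parents.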
Rebase doesn't "move" commits; it recreates them. A commit's hash depends on its parent, tree, timestamp, and other metadata, so if the parent changes, the hash changes. The rebased commits are therefore entirely new objects, and the old commits remain in .git/objects/ until they are garbage collected.
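The dependence on the parent can be checked by hashing two commit bodies that differ only in their parent line. This sketch assumes the plain-text commit layout shown earlier (tree, parent, author, committer, blank line, message) for an unsigned commit in a SHA-1 repository; the author, timestamps, and parent hashes are placeholders, and `commit_id` is a hypothetical helper.

```python
import hashlib


def commit_id(tree, parent, author, timestamp, message):
    """Hash a commit body laid out like the output of git cat-file -p."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # ID of a tree with no entries
author = "xxx <xxx@example.com>"
old_parent = "1" * 40   # placeholder for the original base commit
new_parent = "2" * 40   # placeholder for the commit being rebased onto

before = commit_id(tree, old_parent, author, 1714200000, "feat: implement user login")
after = commit_id(tree, new_parent, author, 1714200000, "feat: implement user login")
print(before)
print(after)
print(before != after)   # True: same tree and message, different parent, different ID
```

Since every descendant records its parent's ID, rewriting one commit gives all commits after it new IDs as well.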
Stash essentially creates a special commit that is not attached to any branch; the most recent one is referenced by .git/refs/stash. git stash pop applies that commit's changes to the working directory and then drops the stash entry.
Worktree doesn't copy the .git directory. Instead, it creates a .git file (not a directory) in the new directory, pointing to the main repo's .git:
cat /path/to/worktree/.git
# Output: gitdir: /path/to/main-repo/.git/worktrees/my-worktree
Then, from the shared .git/objects/, following the target branch's commit → tree → blob, it "unpacks" files into the new directory. Objects are shared; files are regenerated.
Don't just read; try it. Open a terminal and run these in any Git repository:
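# 1. See which branch HEAD points to
cat .git/HEAD
# 2. See which commit that branch points to (replace main with your branch name;
#    if the file is missing, the ref may live in .git/packed-refs instead)
cat .git/refs/heads/main
# 3. View the content of that commit
git cat-file -p $(cat .git/refs/heads/main)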
After working through these steps, you'll have a muscle-memory level understanding of Git's internal structure.
| Concept | What it really is |
|---|---|
| Git repository | A content-addressable key-value database |
| Blob | File content (without filename) |
| Tree | Directory listing (filename + permissions + pointer to blob/tree) |
| Commit | Snapshot manifest (points to tree + parent commit + metadata) |
| Branch | A 41-byte file containing a commit hash |
| HEAD | A pointer recording the current branch |
| Staging Area | .git/index, the draft for the next commit |
| Packfile | Storage optimization, similar objects delta-compressed |
Git's design philosophy: The bottom layer is minimalist (4 object types + references), the upper layers are flexible (all advanced features are operations on these basic structures).
Once you understand this layer, commands like merge, rebase, cherry-pick, reflog, and worktree are no longer incantations to memorize, but natural operations on a data structure. You can even predict how the .git directory will change after a command executes—that's what it truly means to "understand" Git.