trees

Oct 23, 2023

2.3 trees

To get Git to track filenames and directories we have it create a different type of object called a ‘tree’, and to create tree objects we use the ‘index’. The index is a sort of holding area within our repository² (you will also see the index called the ‘cache’ or the ‘staging area’). In the index we collect information about all of the objects we want to store in our repository, then we use a single command to create a tree object from the entries in the index.

git update-index --add --cacheinfo 100644 83baae61804e65cc73a7201a7252750c76066a30 file1.txt
tree .git

.git
├── branches
├── config
├── description
├── HEAD
├── hooks
├── index
├── info
│   └── exclude
├── objects
│   ├── 1f
│   │   └── 7a7a472abf3dd9643fd615f6da379c4acb3e3a
│   ├── 7a
│   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
│   ├── 83
│   │   └── baae61804e65cc73a7201a7252750c76066a30
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

12 directories, 19 files

update-index is used to manipulate our repository index. Initially a new repository has no index, but after adding an object’s information to the index we see a new file named index in the listing above. The --cacheinfo option specifies the object data to be added: first the file’s mode (100644), then the object hash (83baae61804e65cc73a7201a7252750c76066a30), and finally the filename we want to associate with the object (file1.txt). Note that these are entirely under our control in the update-index command and do not have to correspond to any real file. Even the existence of the object is not checked by the update-index command (you should always provide a real hash though, otherwise you will get an “invalid object” error when you attempt to write the tree, which is up next).
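The point about unchecked hashes can be tried out safely in a throwaway repository. The following is a minimal sketch, assuming a scratch directory from mktemp; the hash and the name ghost.txt are made up for illustration and refer to nothing real.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical hash: 40 hex digits that name no object in this repository.
fake=1234567890123456789012345678901234567890

# update-index records the entry without looking the object up.
git update-index --add --cacheinfo 100644 "$fake" ghost.txt
staged=$(git ls-files --stage | grep -c ghost.txt)

# write-tree does look the object up, so it refuses to build the tree.
if git write-tree >/dev/null 2>&1; then
    result=written
else
    result=refused
fi
echo "entries staged: $staged, write-tree: $result"
```

The index happily holds the bogus entry; the problem only surfaces when a tree is written.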

Having created our index we can examine its content using git ls-files --stage. The --stage option causes ls-files to display the mode and object hash of each entry.

git ls-files --stage
git write-tree
git ls-files --stage

git ls-files --stage

100644 83baae61804e65cc73a7201a7252750c76066a30 0       file1.txt

git write-tree

b7e8fac7e3e35d93d39d2fa2260868f025a9efb4

git ls-files --stage

100644 83baae61804e65cc73a7201a7252750c76066a30 0       file1.txt

The git write-tree operation does not change the entries in the index. The ls-files output shows us that the index lists the same entry before and after the write-tree.

tree .git

.git
├── branches
├── config
├── description
├── HEAD
├── hooks
├── index
├── info
│   └── exclude
├── objects
│   ├── 1f
│   │   └── 7a7a472abf3dd9643fd615f6da379c4acb3e3a
│   ├── 7a
│   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
│   ├── 83
│   │   └── baae61804e65cc73a7201a7252750c76066a30
│   ├── b7
│   │   └── e8fac7e3e35d93d39d2fa2260868f025a9efb4
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

13 directories, 20 files

After the write-tree a new object has appeared in our repository. The hash for this object (b7e8fac7e3e35d93d39d2fa2260868f025a9efb4) is what was returned from the write-tree command. You can check the type of this object, confirming it is a tree, and then look at its content to see that the --cacheinfo we used above has been captured.

git cat-file -t b7e8
git cat-file -p b7e8

git cat-file -t b7e8

tree

git cat-file -p b7e8

100644 blob 83baae61804e65cc73a7201a7252750c76066a30    file1.txt

The second field of this tree record, blob, tells us that the record refers to an object of type ‘blob’. Why blob and not object? The object directory contains both file content (blobs) and tree objects (which, as we will shortly see, are analogous to directories in the workspace). In other words, blobs and trees are both objects. It is therefore fine to use the term ‘object’ when the context makes clear the type of object we are talking about (or when we are talking collectively about any type of object). I will continue to use ‘object’ unless it is important to use a more specific type.
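The steps so far can be replayed end to end in a scratch repository. This is a minimal sketch rather than the exact session above: it assumes a fresh directory from mktemp and feeds the content ‘version 1’ to hash-object on standard input instead of reading it from a file.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Store the content as a blob; -w writes it into .git/objects.
blob=$(echo 'version 1' | git hash-object -w --stdin)

# Map the blob to a name in the index, then snapshot the index as a tree.
git update-index --add --cacheinfo 100644 "$blob" file1.txt
tree=$(git write-tree)

# Both hashes name objects; only their types differ.
echo "blob: $blob ($(git cat-file -t "$blob"))"
echo "tree: $tree ($(git cat-file -t "$tree"))"
git cat-file -p "$tree"
```

Because object hashes depend only on content, the blob and tree hashes printed here should match the ones shown in the listings above.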

We can add multiple objects to our index and these can be a mix of existing repository objects and new files added from our working area.

echo 'Another file' > another_file.txt
git update-index --add another_file.txt
git ls-files --stage

git ls-files --stage

100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0       another_file.txt
100644 83baae61804e65cc73a7201a7252750c76066a30 0       file1.txt

Here we are using update-index directly on the file another_file.txt. This will create a new object in the repository holding the content of another_file.txt at the time this update-index is run and then create the entry in the index to relate the filename and the file mode to this object. We cannot use --cacheinfo here because the object does not exist within the repository until we run the update-index. We need the --add option so that update-index will accept new files (files that have no existing index entry) into the index.
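One way to see what update-index does with a real file is to compare it with the two manual steps it replaces. The sketch below assumes a scratch repository; the --force-remove option is not used in the text and appears here only to empty the index between the two attempts.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

echo 'Another file' > another_file.txt

# One step: update-index hashes the file, stores the blob, and adds the entry.
git update-index --add another_file.txt
one_step=$(git ls-files --stage another_file.txt)

# Two steps: drop the entry, then store the blob and register it by hand.
git update-index --force-remove another_file.txt
blob=$(git hash-object -w another_file.txt)
git update-index --add --cacheinfo 100644 "$blob" another_file.txt
two_step=$(git ls-files --stage another_file.txt)

# The two index entries should be identical.
echo "$one_step"
echo "$two_step"
```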

Some time back we created a new object containing the text ‘version 2’. This object was assigned the hash 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a when we created it with hash-object -w. We want to add this object to our index.

git update-index --cacheinfo 100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a file1.txt
git ls-files --stage

git ls-files --stage

100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0       another_file.txt
100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0       file1.txt

Notice that the index is modified so that the file1.txt entry now refers to object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a.

Why was a new line not created in the index? Note the absence of the --add option. We are modifying the index entry associated with the name file1.txt, not adding a new entry. The index is a mapping between objects in the Git repository and files in the workspace, and each workspace file must be uniquely identified by its filename. A filename in the index can therefore refer to only one object.
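The role of --add can be checked directly. In this sketch (scratch repository assumed; other.txt is a hypothetical name with no index entry) the same update-index call succeeds for an existing name and is refused for a new one.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

v1=$(echo 'version 1' | git hash-object -w --stdin)
v2=$(echo 'version 2' | git hash-object -w --stdin)

git update-index --add --cacheinfo 100644 "$v1" file1.txt

# Same name, no --add: the existing entry is repointed, not duplicated.
git update-index --cacheinfo 100644 "$v2" file1.txt
entries=$(git ls-files --stage | grep -c file1.txt)
current=$(git ls-files --stage file1.txt | awk '{print $2}')

# New name, no --add: Git refuses because there is no entry to modify.
if git update-index --cacheinfo 100644 "$v2" other.txt 2>/dev/null; then
    new_name=accepted
else
    new_name=refused
fi
echo "file1.txt entries: $entries, object: $current, other.txt: $new_name"
```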

It is fine for the index to have a one to many mapping from object to filename (one object can be referred to by many filenames). This can be illustrated by adding a second index entry referring to the object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a but using a different filename.

git update-index --add --cacheinfo 100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a fileX.txt
git ls-files --stage

git ls-files --stage

100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0       another_file.txt
100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0       file1.txt
100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0       fileX.txt

What does this represent?

Work through what we have learned so far. The object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a contains the data ‘version 2’. The index shows the mapping between the data and the files in the workspace. So both file1.txt and fileX.txt in the workspace are to have the same content (that from object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a).
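To make the double mapping concrete, the sketch below writes the index out to the working area. It assumes a scratch repository and uses git checkout-index, a command the text has not introduced, purely to show that both names end up with the same content.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

blob=$(echo 'version 2' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" file1.txt
git update-index --add --cacheinfo 100644 "$blob" fileX.txt

# Two names in the index, but only one distinct object behind them.
names=$(git ls-files | wc -l | tr -d ' ')
objects=$(git ls-files --stage | awk '{print $2}' | sort -u | wc -l | tr -d ' ')
echo "$names names share $objects object"

# Materialise the index in the working area: both files get the same content.
git checkout-index --all
cmp file1.txt fileX.txt && cat fileX.txt
```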

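We don’t really want this double mapping (interesting as it is), so we remove it from the index using the --remove option to the update-index command. With --remove, an entry is dropped from the index when no file of that name exists in the working area, which is the case for fileX.txt.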

git update-index --remove fileX.txt
git ls-files --stage

git ls-files --stage

100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0       another_file.txt
100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0       file1.txt

We now create another tree object.

git write-tree

So far we have created some basic blob and tree objects, but we have not yet dealt with directories. Or have we?

A directory is essentially a container holding files and other directories. Sounds familiar? The tree object we just created is a list of blobs related to file names. Can we similarly relate a directory name with a tree object and include it in another tree object?

Create a directory and a new file in that directory.

mkdir dir1
echo 'version 1' > dir1/file11.txt

We now add this new file to the index.

git update-index --add dir1/file11.txt

If we now look at our index we find that this has simply added an entry to the index with the path dir1/file11.txt rather than a simple filename. We have discovered that the index maps files by pathname rather than simply their file name. These pathnames are relative to the root of our working area.

git ls-files -s

100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0       another_file.txt
100644 83baae61804e65cc73a7201a7252750c76066a30 0       dir1/file11.txt
100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0       file1.txt

2.3.1 Progress review: blobs and trees

Let’s review the situation we now have.

We have some blobs in the .git/objects store holding various data. We have two tree objects in the .git/objects store: b7e8fac7e3e35d93d39d2fa2260868f025a9efb4, which relates 83baae to the name file1.txt, and 349fa0b7f3252dbe6989c2e8156803b3265a78e0, which relates 1f7a7a to file1.txt and b0b9fc to another_file.txt. We have a .git/index file containing various mappings between blobs and filenames (which we just listed out above).

We can list all the objects in .git/objects using cat-file with the --batch-all-objects and --batch-check options.

git cat-file --batch-all-objects --batch-check

git cat-file --batch-all-objects --batch-check

1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob 10
349fa0b7f3252dbe6989c2e8156803b3265a78e0 tree 81
7ab4ff63b2ea4c2c3ff89ee972bc42988a4b8472 blob 11
83baae61804e65cc73a7201a7252750c76066a30 blob 10
b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f blob 13
b7e8fac7e3e35d93d39d2fa2260868f025a9efb4 tree 37

We can now see what happens when we add sub-directories to our object store. Remember that our index has a new dir1/file11.txt path mapping so we are expecting write-tree to account for this in our repository.

git write-tree
git cat-file --batch-all-objects --batch-check

git cat-file --batch-all-objects --batch-check

0139f016af84acd889e2f707ef9eca2140e0222e tree 112
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob 10
337f3832b1bce2d8f364e99965c8519a3eb9dc6c tree 38
349fa0b7f3252dbe6989c2e8156803b3265a78e0 tree 81
7ab4ff63b2ea4c2c3ff89ee972bc42988a4b8472 blob 11
83baae61804e65cc73a7201a7252750c76066a30 blob 10
b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f blob 13
b7e8fac7e3e35d93d39d2fa2260868f025a9efb4 tree 37

We have added two new tree objects, 337f38 and 0139f0. Inspecting these we can see what has happened.

git cat-file -p 337f38
git cat-file -p 0139f0

git cat-file -p 337f38

100644 blob 83baae61804e65cc73a7201a7252750c76066a30    file11.txt

git cat-file -p 0139f0

100644 blob b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f  another_file.txt
040000 tree 337f3832b1bce2d8f364e99965c8519a3eb9dc6c  dir1
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a  file1.txt

The first (337f38) represents the content of the dir1 directory, in this instance just the mapping of 83baae to the file name file11.txt.

The second (0139f0) represents the content of our root directory. The interesting entry is the tree object (mode 040000) that is mapped to the name dir1.

From this short exercise we can make a few observations.

The index maps blobs to file paths (not simply file names).
The index does not map tree objects.
Tree objects are created as required whenever a write-tree is executed.
Tree objects are mapped to names by other tree objects.
Tree objects form a directed graph representing a directory structure.
The root tree object has no name (since names are mapped by tree objects and, by definition, the root tree object is not itself a part of a parent tree object).
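The observations above can be checked with a short script. This is a sketch under the assumption of a fresh scratch repository; it registers every path with --cacheinfo rather than creating files, and uses git rev-parse with the tree:path form (not covered in the text) to look up the entry named dir1.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

v1=$(echo 'version 1' | git hash-object -w --stdin)
v2=$(echo 'version 2' | git hash-object -w --stdin)
other=$(echo 'Another file' | git hash-object -w --stdin)

# The index holds only blob entries, keyed by path.
git update-index --add --cacheinfo 100644 "$other" another_file.txt
git update-index --add --cacheinfo 100644 "$v1" dir1/file11.txt
git update-index --add --cacheinfo 100644 "$v2" file1.txt

# write-tree creates one tree per directory and prints the root tree's hash.
root=$(git write-tree)
# <tree>:<path> resolves the entry that the root tree maps to the name dir1.
sub=$(git rev-parse "$root:dir1")

git cat-file -p "$root"
git cat-file -p "$sub"
```

The index never mentions dir1 on its own; the subtree only comes into existence when write-tree groups the paths by directory.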

We have now shown how Git stores data in blobs. Names are mapped to those blobs by tree objects. Tree objects can contain other tree objects and map them to names, allowing us to store directories³.

Now that we can store a basic file structure it is time to consider how Git stores the history of changes to files.

²This is a lie! In Chapter 3 we will take a closer look at the index and learn why this lie is so often repeated.

³Note that the index records only file paths, so write-tree never sees an empty directory and has no tree object to create for it. This is the reason Git cannot store empty directories.