2.3 Trees
To get Git to track filenames and directories we have it create a different type of object called a ‘tree’, and to create tree objects we use the ‘index’. The index is a sort of holding area within our repository[2] (you will also see the index called the ‘cache’ or the ‘staging area’). In the index we collect information about all of the objects we want to store in our repository, then we use a single command to create a tree object from the entries in the index.
    git update-index --add --cacheinfo 100644 83baae61804e65cc73a7201a7252750c76066a30 file1.txt
    tree .git

     1  .git
     2  ├── branches
     3  ├── config
     4  ├── description
     5  ├── HEAD
     6  ├── hooks
     7  ├── index
     8  ├── info
     9  │   └── exclude
    10  ├── objects
    11  │   ├── 1f
    12  │   │   └── 7a7a472abf3dd9643fd615f6da379c4acb3e3a
    13  │   ├── 7a
    14  │   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
    15  │   ├── 83
    16  │   │   └── baae61804e65cc73a7201a7252750c76066a30
    17  │   ├── info
    18  │   └── pack
    19  └── refs
    20      ├── heads
    21      └── tags
    22
    23  12 directories, 19 files
update-index is used to manipulate our repository index. Initially a new repository has no index, but after adding an object’s information to the index we see a new file, index (line 7 above). The --cacheinfo option specifies the object data to be added: first the file’s mode (100644), then the object hash (83baae61804e65cc73a7201a7252750c76066a30), and finally the filename we want to associate with the object (file1.txt). Note that these are entirely under our control in the update-index command and do not have to correspond to any real file. Even the object identity is not checked by the update-index command (you should always provide a real hash though, otherwise you will get an “invalid object” error when you attempt to write the tree, which is up next).
Having created our index we can examine its content using git ls-files --stage. The --stage option causes ls-files to display the mode and object hash of each entry.
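The claim that update-index does not check the hash it is given can be tried out safely in a throwaway repository. The following is only a sketch: the scratch directory from mktemp, the made-up hash in `bogus` and the name ghost.txt are all illustrative assumptions, not part of the walkthrough.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

# A made-up hash: no object with this name exists in the new repository.
bogus=1234567890123456789012345678901234567890

# update-index records the entry without looking the object up...
git update-index --add --cacheinfo 100644 "$bogus" ghost.txt
staged=$(git ls-files --stage)
echo "$staged"

# ...but write-tree refuses to build a tree that refers to a missing object.
if git write-tree 2>/dev/null; then
    result="tree written"
else
    result="write-tree failed"
fi
echo "$result"
```

The index happily lists ghost.txt; it is only write-tree that notices the object is missing.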
    git ls-files --stage
    git write-tree
    git ls-files --stage

    100644 83baae61804e65cc73a7201a7252750c76066a30 0 file1.txt

    b7e8fac7e3e35d93d39d2fa2260868f025a9efb4

    100644 83baae61804e65cc73a7201a7252750c76066a30 0 file1.txt
The git write-tree operation does not change the index file: ls-files shows the same index content before and after the write-tree.
    tree .git

    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    ├── index
    ├── info
    │   └── exclude
    ├── objects
    │   ├── 1f
    │   │   └── 7a7a472abf3dd9643fd615f6da379c4acb3e3a
    │   ├── 7a
    │   │   └── b4ff63b2ea4c2c3ff89ee972bc42988a4b8472
    │   ├── 83
    │   │   └── baae61804e65cc73a7201a7252750c76066a30
    │   ├── b7
    │   │   └── e8fac7e3e35d93d39d2fa2260868f025a9efb4
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    13 directories, 20 files
After the write-tree a new object has appeared in our repository. The hash for this object (b7e8fac7e3e35d93d39d2fa2260868f025a9efb4) is what was returned from the write-tree command. You can check the type of this object, confirming it is a tree, and then look at its content to see that the --cacheinfo we used above has been captured.
    git cat-file -t b7e8
    git cat-file -p b7e8

    tree

    100644 blob 83baae61804e65cc73a7201a7252750c76066a30 file1.txt
The second field of this tree record, blob, tells us that the record refers to an object of type ‘blob’. Why blob and not object? The object directory contains both file content (blobs) and tree objects (which we will shortly see are analogous to directories in the workspace). In other words, blobs and trees are both objects. It is therefore fine to use the term ‘object’ when the context makes clear the type of object we are talking about (or when we are talking collectively about any type of object). I will continue to use ‘object’ unless it is important to use a more specific type.
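To see that blobs and trees really are just two types of object, the steps so far can be replayed from scratch in a scratch repository. This is a minimal sketch; the temporary directory and the variable names `blob` and `tree` are assumptions made for illustration.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

# Store some content as a blob: -w writes it to .git/objects, --stdin reads the pipe.
blob=$(echo 'version 1' | git hash-object -w --stdin)

# Name the blob in the index, then snapshot the index as a tree.
git update-index --add --cacheinfo 100644 "$blob" file1.txt
tree=$(git write-tree)

# Both hashes name objects; only their types differ.
blob_type=$(git cat-file -t "$blob")
tree_type=$(git cat-file -t "$tree")
echo "$blob is a $blob_type"
echo "$tree is a $tree_type"

# The tree's single record points back at the blob.
git cat-file -p "$tree"
```

Because an object's hash depends only on its content, repeating these steps with the same text and the same filename should give the same hashes on any machine.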
We can add multiple objects to our index and these can be a mix of existing repository objects and new files added from our working area.
    echo 'Another file' > another_file.txt
    git update-index --add another_file.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 83baae61804e65cc73a7201a7252750c76066a30 0 file1.txt
Here we are using update-index directly on the file another_file.txt. This will create a new object in the repository holding the content of another_file.txt at the time this update-index is run and then create the entry in the index to relate the filename and the file mode to this object. We cannot use --cacheinfo here because the object does not exist within the repository until we run the update-index. We need the --add option so that update-index will accept new files (files that have no existing index entry) into the index.
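That update-index itself writes the blob can be checked by asking whether the object exists before and after the command. A sketch under the usual assumptions (a scratch repository; `expected`, `before`, `after` and `staged` are illustrative names):

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

echo 'Another file' > another_file.txt

# Hash the file without storing it (no -w): this is the name the blob would get.
expected=$(git hash-object another_file.txt)

# cat-file -e only reports, through its exit status, whether an object exists.
if git cat-file -e "$expected" 2>/dev/null; then before=present; else before=absent; fi

# Staging the file writes the blob and records mode, hash and path in the index.
git update-index --add another_file.txt

if git cat-file -e "$expected" 2>/dev/null; then after=present; else after=absent; fi
staged=$(git ls-files --stage another_file.txt | cut -d' ' -f2)

echo "blob before: $before, after: $after"
echo "index entry hash: $staged"
```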
Some time back we created a new object containing the text ‘version 2’. This object was assigned the hash 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a when we created it with hash-object -w. We want to add this object to our index.
    git update-index --cacheinfo 100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a file1.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
Notice that the index is modified so that the file1.txt entry now refers to object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a.
Why was a new line not created in the index? Note the absence of the --add option: we are modifying the index entry associated with the name file1.txt, not adding a new entry. The index is a mapping between objects in the Git repository and files in the workspace, and workspace files must be uniquely identified by filename. Each filename in the index therefore maps to exactly one object (a filename can only refer to one object).
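The role of --add can be illustrated by trying both cases: replacing an entry for a known name, and introducing a new name without the option. This is a sketch in a scratch repository; other.txt and the variable names are hypothetical.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

v1=$(echo 'version 1' | git hash-object -w --stdin)
v2=$(echo 'version 2' | git hash-object -w --stdin)

# First entry for file1.txt: the name is new, so --add is required.
git update-index --add --cacheinfo 100644 "$v1" file1.txt

# Same name again, no --add: the existing entry is replaced in place.
git update-index --cacheinfo 100644 "$v2" file1.txt
entries=$(git ls-files --stage | grep -c .)
current=$(git ls-files --stage file1.txt | cut -d' ' -f2)

# A new name without --add is rejected and the index is left alone.
if git update-index --cacheinfo 100644 "$v2" other.txt 2>/dev/null; then
    new_name="accepted"
else
    new_name="rejected"
fi
echo "entries: $entries, file1.txt -> $current, other.txt $new_name"
```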
It is fine for the index to have a one to many mapping from object to filename (one object can be referred to by many filenames). This can be illustrated by adding a second index entry referring to the object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a but using a different filename.
    git update-index --add --cacheinfo 100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a fileX.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 fileX.txt
What does this represent?
Work through what we have learned so far. The object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a contains the data ‘version 2’. The index shows the mapping between the data and the files in the workspace. So both file1.txt and fileX.txt in the workspace are to have the same content (that from object 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a).
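One way to make this concrete is to ask Git to copy the index entries out to the working area. The sketch below uses git checkout-index, a plumbing command this walkthrough has not introduced, so treat it as an aside; the scratch repository and variable names are assumptions.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

v2=$(echo 'version 2' | git hash-object -w --stdin)

# Two index entries, one blob.
git update-index --add --cacheinfo 100644 "$v2" file1.txt
git update-index --add --cacheinfo 100644 "$v2" fileX.txt

# checkout-index copies the content behind every index entry into the working area.
git checkout-index --all

# Both files now exist and hold the blob's content.
first=$(cat file1.txt)
second=$(cat fileX.txt)
echo "file1.txt: $first"
echo "fileX.txt: $second"
```

Only one copy of the data is held in .git/objects, however many names refer to it.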
We don’t really want this double mapping (interesting as it is), so we remove it from the index using the --remove option to the update-index command. The entry is dropped because no file called fileX.txt exists in the working area.
    git update-index --remove fileX.txt
    git ls-files --stage

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
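The --remove option only drops an entry when the named file is missing from the working area. For an entry whose file still exists, update-index also offers --force-remove. The sketch below contrasts the two in a scratch repository; kept.txt and the variable names are made up for the illustration.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

# kept.txt exists on disk and in the index.
echo 'on disk' > kept.txt
git update-index --add kept.txt

# fileX.txt exists in the index only; there is no such file on disk.
blob=$(echo 'version 2' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$blob" fileX.txt

# --remove drops entries whose file is missing; entries with a file are kept.
git update-index --remove fileX.txt kept.txt
after_remove=$(git ls-files)
echo "after --remove: $after_remove"

# --force-remove drops the entry even though kept.txt still exists on disk.
git update-index --force-remove kept.txt
after_force=$(git ls-files)
echo "after --force-remove: ${after_force:-<empty index>}"
```

Neither option deletes anything from the working area or from .git/objects; only the index changes.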
We now create another tree object.
    git write-tree
So far we have created some basic blob and tree objects, but we have not yet dealt with directories. Or have we?
A directory is essentially a container holding files and other directories. Sounds familiar? The tree object we just created is a list of blobs related to file names. Can we similarly relate a directory name with a tree object and include it in another tree object?
Create a directory and a new file in that directory.
    mkdir dir1
    echo 'version 1' > dir1/file11.txt
We now add this new file to the index.
    git update-index --add dir1/file11.txt
If we now look at our index we find that this has simply added an entry to the index with the path dir1/file11.txt rather than a simple filename. We have discovered that the index maps files by pathname rather than simply their file name. These pathnames are relative to the root of our working area.
    git ls-files -s

    100644 b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f 0 another_file.txt
    100644 83baae61804e65cc73a7201a7252750c76066a30 0 dir1/file11.txt
    100644 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a 0 file1.txt
2.3.1 Progress review: blobs and trees
Let’s review the situation we now have.
We have some blobs in the .git/objects store holding various data. We have two tree objects in the .git/objects store: b7e8fac7e3e35d93d39d2fa2260868f025a9efb4, which relates 83baae to the name file1.txt, and 349fa0b7f3252dbe6989c2e8156803b3265a78e0, which relates 1f7a7a to file1.txt and b0b9fc to another_file.txt. We have a .git/index file containing various mappings between blobs and path names (which we just listed above).
We can list all the objects in .git/objects using cat-file with the --batch-all-objects and --batch-check options.
    git cat-file --batch-all-objects --batch-check

    1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob 10
    349fa0b7f3252dbe6989c2e8156803b3265a78e0 tree 81
    7ab4ff63b2ea4c2c3ff89ee972bc42988a4b8472 blob 11
    83baae61804e65cc73a7201a7252750c76066a30 blob 10
    b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f blob 13
    b7e8fac7e3e35d93d39d2fa2260868f025a9efb4 tree 37
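Since each line of this listing has the form hash, type, size, it is easy to post-process. As a small sketch, the following counts the objects of each type in a scratch repository; the awk one-liner and the `summary` variable are illustrative additions.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

# Two blobs and one tree, built the same way as in the text.
v1=$(echo 'version 1' | git hash-object -w --stdin)
v2=$(echo 'version 2' | git hash-object -w --stdin)
git update-index --add --cacheinfo 100644 "$v1" file1.txt
tree=$(git write-tree)

# --batch-check prints "<hash> <type> <size>"; tally the second field.
summary=$(git cat-file --batch-all-objects --batch-check |
          awk '{ count[$2]++ } END { printf "blobs=%d trees=%d", count["blob"], count["tree"] }')
echo "$summary"
```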
We can now see what happens when we add sub-directories to our object store. Remember that our index has a new dir1/file11.txt path mapping so we are expecting write-tree to account for this in our repository.
    git write-tree
    git cat-file --batch-all-objects --batch-check

    0139f016af84acd889e2f707ef9eca2140e0222e tree 112
    1f7a7a472abf3dd9643fd615f6da379c4acb3e3a blob 10
    337f3832b1bce2d8f364e99965c8519a3eb9dc6c tree 38
    349fa0b7f3252dbe6989c2e8156803b3265a78e0 tree 81
    7ab4ff63b2ea4c2c3ff89ee972bc42988a4b8472 blob 11
    83baae61804e65cc73a7201a7252750c76066a30 blob 10
    b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f blob 13
    b7e8fac7e3e35d93d39d2fa2260868f025a9efb4 tree 37
We have added two new tree objects, 337f38 and 0139f0. Inspecting these we can see what has happened.
    git cat-file -p 337f38
    git cat-file -p 0139f0

    100644 blob 83baae61804e65cc73a7201a7252750c76066a30 file11.txt

    1  100644 blob b0b9fc8f6cc2f8f110306ed7f6d1ce079541b41f another_file.txt
    2  040000 tree 337f3832b1bce2d8f364e99965c8519a3eb9dc6c dir1
    3  100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a file1.txt
The first (337f38) represents the content of the dir1 directory, in this instance just the mapping of 83baae to the file name file11.txt.
The second (0139f0) represents the content of our root directory. The interesting entry is the tree object referenced on line 2 and mapped to the name dir1.
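The relationship between the root tree and the dir1 tree can be checked another way. write-tree accepts a --prefix option that writes the tree for just one sub-directory of the index; the hash it reports should be the one the root tree records against the name dir1. A sketch in a scratch repository, with illustrative variable names:

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

mkdir dir1
echo 'version 1' > dir1/file11.txt
echo 'version 2' > file1.txt
git update-index --add dir1/file11.txt file1.txt

# Write the whole index as a tree, then only the part under dir1/.
root=$(git write-tree)
sub=$(git write-tree --prefix=dir1/)

# Pull the hash recorded for the name dir1 out of the root tree.
# The fields of each record are: mode, type, hash, name.
recorded=$(git cat-file -p "$root" | awk '$2 == "tree" && $4 == "dir1" { print $3 }')

echo "dir1 entry in root tree: $recorded"
echo "tree written for dir1/:  $sub"
git cat-file -p "$sub"
```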
From this short exercise we can make a few observations.
- The index maps blobs to file paths (not simply file names).
- The index does not map tree objects.
- Tree objects are created as required whenever a write-tree is executed.
- Tree objects are mapped to names by other tree objects.
- Tree objects form a directed graph representing a directory structure.
- The root tree object has no name (since names are mapped by tree objects and, by definition, the root tree object is not itself a part of a parent tree object).
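These observations can be put together in a few lines of shell that walk the graph of tree objects and rebuild the path names the index started with. The `walk` function below is a hypothetical helper written for this illustration, not a Git command, and it assumes file names without spaces or tabs.

```shell
# Scratch repository so the experiment cannot disturb real work.
repo=$(mktemp -d)
cd "$repo" && git init -q .

mkdir -p dir1/dir2
echo 'version 1' > dir1/file11.txt
echo 'deeper'    > dir1/dir2/file21.txt
echo 'version 2' > file1.txt
git update-index --add dir1/file11.txt dir1/dir2/file21.txt file1.txt
root=$(git write-tree)

# walk <tree-hash> <prefix>: print the path of every blob reachable from a tree.
walk() {
    git cat-file -p "$1" | while read -r mode type hash name; do
        if [ "$type" = tree ]; then
            walk "$hash" "$2$name/"      # a tree entry is a sub-directory
        else
            echo "$2$name"               # a blob entry is a file
        fi
    done
}

# The root tree has no name of its own, so the walk starts with an empty prefix.
paths=$(walk "$root" "")
echo "$paths"
```

The paths printed by the walk are the same ones git ls-files reports for the index, even though no single tree object stores a full path.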
We have now shown how Git stores data in blobs. Names are mapped to those blobs by tree objects. Tree objects can contain other tree objects and map them to names, allowing us to store directories[3].
Now that we can store a basic file structure it is time to consider how Git stores the history of changes to files.