Getting Git: Walking File History

Abstract: Join me in a trip down the rabbit hole of making a “File History Diff Walker.”

“I want to walk through a file’s (git) history.” It sounds like a simple request. Getting a file diff from git is straightforward. Using “git log” to get the commits that touched a particular file is also straightforward. So, combine the two into a diff view with forward and back buttons and you have a way to walk file history. 

In fact, that straightforward solution was the first version of the file history feature in Understand.

It didn’t take long to find a problem. Consider this simplified example.

The chronological order of the commits is A, B, C, D. So, the diff history would look like this:

Left Side                      Right Side
Commit D (“Hi Universe”)       Working Copy (“Hi Universe”)
Commit C (“Hi World”)          Commit D (“Hi Universe”)
Commit B (“Hello Universe”)    Commit C (“Hi World”)
Commit A (“Hello World”)       Commit B (“Hello Universe”)
                               Commit A (“Hello World”)
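To make the flaw concrete, here is a minimal sketch of the naive pairing in Python. The function name `naive_diff_pairs` is made up for illustration and is not Understand's actual code. It simply pairs each revision with whatever came before it chronologically, which is how Commit B ends up next to Commit C.

```python
from typing import List, Optional, Tuple


def naive_diff_pairs(commits_oldest_first: List[str]) -> List[Tuple[Optional[str], str]]:
    """Pair every revision with its chronological predecessor, newest first."""
    newest_first = list(reversed(commits_oldest_first))
    right_sides = ["Working Copy"] + newest_first
    # None marks the last step, where nothing older exists to compare against.
    left_sides: List[Optional[str]] = newest_first + [None]
    return list(zip(left_sides, right_sides))


pairs = naive_diff_pairs(["A", "B", "C", "D"])
for left, right in pairs:
    print(f"{left or '(nothing)'} -> {right}")
```

Nothing in this pairing looks at the commit graph, so the (B, C) step appears even though neither commit is an ancestor of the other.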

The problem is the comparison between Commit B and Commit C. They’re chronologically adjacent but aren’t actually related to each other. Commit C came from Commit A, not Commit B. Another problem with the initial window is that merge commits weren’t shown at all, so Commit D wasn’t included even though its contents were unique.

How should this be fixed? Well, it kind of depends on what you want to see. The person who found the bug said “I only care about my branch, and I care about merges where something changed. So, since I’m on the feature branch, I want to see commits A, B, and D.”

The problem with this is the nature of Git branches. The labels like “Master Branch” and “Feature Branch” aren’t really labels on graph edges like they’re often displayed. They’re pointers to commits, like this:

So, from git, Commit D has two parents and the order of the parents is known. Commit C is the first parent and Commit B is the second parent. But there’s no record that at the time Commit B was made, it was pointed to by “Feature Branch” or that at the time Commit C was made, it was pointed to by “Master Branch.”
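A rough Python model of that idea follows. The `Commit` class, the dictionaries, and the helper are all hypothetical. I am also assuming that Commit B's parent is Commit A and that both branch names point at Commit D after the merge, since the original diagram isn't reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Commit:
    content: str              # contents of the file at this commit
    parents: Tuple[str, ...]  # ordered: the first entry is the first parent


commits: Dict[str, Commit] = {
    "A": Commit("Hello World", ()),
    "B": Commit("Hello Universe", ("A",)),  # assumed parent
    "C": Commit("Hi World", ("A",)),
    "D": Commit("Hi Universe", ("C", "B")),  # merge: C first, B second
}

# A branch is only a name that points at one commit (assumed positions).
branches: Dict[str, str] = {"Master Branch": "D", "Feature Branch": "D"}


def branches_pointing_at(commit_id: str) -> List[str]:
    """Return the branch names whose pointer currently sits on this commit."""
    return [name for name, target in branches.items() if target == commit_id]


print(commits["D"].parents)       # the parent order is stored
print(branches_pointing_at("B"))  # nothing records which branch B was made on
```

The commit objects carry parent order but no branch name, so "the feature branch's commits" can't be read back out of the graph.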

So, given a merge commit like Commit D, the problem is which path to follow. Does Understand pick the path or does the user pick the path? Suppose it falls to an algorithm to decide which path to follow. What are the available options?

Two sources of documentation to consider are the documentation for “git log” and the documentation for libgit2 since Understand accesses git through the libgit2 library.  From the git log documentation, it looks like there are two major options impacting which paths are followed:

  1. --first-parent follows only the first parent, so from Commit D, only Commit C would be considered.
  2. --full-history walks down all paths. From Commit D, this would consider both Commit B and Commit C.

Libgit2 also seems to support the same two options. The default is the full history, and git_revwalk_simplify_first_parent switches to --first-parent mode. Which is the default for “git log”? Neither, it turns out. The default is a third option. To explain it, consider this history:

In --first-parent mode, the history is E, F, H, I. From Commit I, only Commit H is considered. In --full-history mode, the history is E, F, G, H, I. From Commit I, both Commits H and G are followed. By default, the history would be E, G, I. Why? The file in Commit I is the same as in Commit G, and by default “git log” follows the first TREESAME parent (a parent whose version of the file is identical). There isn’t an equivalent for libgit2’s revision walker.

So, in summary, there are three possible ways to pick paths to follow when traversing git history:

  1. Follow only the first parent
  2. Follow all parents
  3. Follow the first TREESAME parent or all parents if none were TREESAME
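The three strategies can be sketched in a few lines of Python. This is a toy model, not libgit2 or Understand code: the graph is my reconstruction of the E through I example from the results described above, the file contents ("v1" and so on) are placeholders, and every function name is invented.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical reconstruction: commit -> (ordered parents, file contents).
HISTORY: Dict[str, Tuple[List[str], str]] = {
    "E": ([], "v1"),
    "F": (["E"], "v2"),
    "G": (["E"], "v3"),
    "H": (["F"], "v4"),
    "I": (["H", "G"], "v3"),  # merge; the file matches G, its second parent
}


def first_parent(commit: str) -> List[str]:
    parents, _ = HISTORY[commit]
    return parents[:1]


def all_parents(commit: str) -> List[str]:
    parents, _ = HISTORY[commit]
    return list(parents)


def first_treesame_parent(commit: str) -> List[str]:
    """Follow the first parent with identical contents, else every parent."""
    parents, content = HISTORY[commit]
    for parent in parents:
        if HISTORY[parent][1] == content:
            return [parent]
    return list(parents)


def walk(start: str, choose: Callable[[str], List[str]]) -> List[str]:
    """Collect every commit reachable from start under the given strategy."""
    seen = set()
    pending = [start]
    while pending:
        commit = pending.pop()
        if commit not in seen:
            seen.add(commit)
            pending.extend(choose(commit))
    return sorted(seen)  # in this toy graph the letters sort oldest first


for strategy in (first_parent, all_parents, first_treesame_parent):
    print(strategy.__name__, walk("I", strategy))
```

Only the function passed to `walk` changes between the three modes, which is why it is natural to treat path selection as one option and merge reporting as a separate one.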

Up to this point, we’ve been focusing on which paths are considered. But, traversing a path isn’t the same as reporting a particular commit in the history. Recall that in the initial history walker in Understand, all merges were ignored. That was the default behavior of the libgit2 wrapper library from the open-source project GitAhead that Understand uses. And, it makes sense to a point. Most merges aren’t relevant to the file history. For instance, in the last example, does Commit I really matter to the history of the file? It’s a reasonable argument that only Commits E and G are needed since Commit G matches the current state of the file.

So, which merges are worth reporting? Showing all merges is overkill, but not showing any merges omits Commit D from the first example. What makes Commit D more worth showing than Commit I?

Returning to the git log documentation, there are several different options. The simplest options are:

  • --sparse keeps all merges
  • --max-parents=1 or --no-merges keeps none

Well, the current behavior of no merges wasn’t working and all merges sounds like way too many. The other git log options all have to do with differences:

  • --show-pulls keeps merges that are different from the first parent
  • --dense keeps merges that are different from all parents
  • --full-history without parent rewriting seems to imply a behavior of keeping merges that are different from any parent.

So, the merges shown can range from everything to nothing:

  1. All
  2. Different from any parent
  3. Different from first parent
  4. Different from all parents
  5. No merges
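Here is a small Python sketch of those five rules applied to the two merges from the examples. The rule names and the `shown_in` helper are made up, Commit D's contents come from the first example, and Commit I's contents ("v3" and "v4") are placeholders I'm assuming for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Each rule receives the merge's file contents and its parents' contents
# (first parent first) and decides whether the merge is reported.
MergeRule = Callable[[str, Sequence[str]], bool]

MERGE_RULES: Dict[str, MergeRule] = {
    "all": lambda content, parents: True,
    "different from any parent": lambda content, parents: any(content != p for p in parents),
    "different from first parent": lambda content, parents: content != parents[0],
    "different from all parents": lambda content, parents: all(content != p for p in parents),
    "no merges": lambda content, parents: False,
}


def shown_in(content: str, parents: Sequence[str]) -> List[str]:
    """List the rules under which a merge with these contents is reported."""
    return [name for name, rule in MERGE_RULES.items() if rule(content, parents)]


# Commit D: first parent C ("Hi World"), second parent B ("Hello Universe").
commit_d = shown_in("Hi Universe", ["Hi World", "Hello Universe"])
# Commit I: differs from first parent H, identical to second parent G.
commit_i = shown_in("v3", ["v4", "v3"])

print("D:", commit_d)
print("I:", commit_i)
```

Under these assumptions, Commit D survives every rule except "no merges", while Commit I drops out as soon as the rule requires a difference from all parents. That is one way to state what makes D more worth showing than I.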

Now, we have two option sets (which paths to visit and which merges to show) to address the initial problem of how to display non-linear histories. It’s time to decide how to update our diff walker. Despite being a Mac user, I love Microsoft Office, so I compiled all this information into a beautiful PowerPoint for the project managers (Kevin and Heidi) so they could make a decision. They accuse me of having a dark side because I bring these crazy problems to them. However, in my defense, not only did I prepare a PowerPoint presentation, but I also built a demo to show the issues. You can access it too if you’re ever stuck on this problem. If you set the super-secret environment variable STI_HISTORY_GRAPH to a non-empty value, then you can access this hidden graphical view for file entities:

The secret History Graph

This screenshot is based on the example in the git log documentation:

The commit pattern for the git commits

It’s extended with commits R and F for an example of following the first identical parent when it is not the first parent. The italicized words show the contents of “Foo.txt” at each commit and the bolded words are different from all parents:

The graph can be used to check which commits would be included in the history and their order, given the different options. For example, following the first identical parent and keeping only commits different from all parents would give this graph with the numbers showing the sorted order:

Heidi and Kevin promptly told me that all these options and the graph were too confusing for the file history diff view and we’d have to narrow it down to the following two modes.

  1. The default mode follows only the first parent and shows merges if different from the first parent (git log --first-parent --show-pulls). This allows for a completely linear history.
  2. The alternative mode is “Show Decisions.” Show Decisions follows the first identical parent and shows merges that were different from all parents.
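If you want to reproduce the default mode outside of Understand, the command-line equivalent can be wrapped in a few lines of Python. This is only a sketch: Understand itself goes through libgit2 rather than the git executable, the function names are hypothetical, and it assumes a Git new enough to recognize --show-pulls.

```python
import subprocess
from typing import List


def default_mode_command(path: str) -> List[str]:
    """Build the git log call for a linear, first-parent file history."""
    return ["git", "log", "--first-parent", "--show-pulls", "--format=%H", "--", path]


def default_mode_history(repo_dir: str, path: str) -> List[str]:
    """Run the command in repo_dir and return commit hashes, newest first."""
    result = subprocess.run(
        default_mode_command(path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git reports an error
    )
    return result.stdout.split()


print(" ".join(default_mode_command("Foo.txt")))
```

Calling `default_mode_history` with a repository directory and a file path returns the hashes that a forward and back button would step through in the default mode.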

When “Show Decisions” is checked, the file history diff view has a drop-down to allow the user to pick a path:
