How git rebase can break your history

Until a few years ago, Mercurial was the primary tool used for version control at the company where I work. Just like any other tool, there are plenty of things 'wrong' with it, depending on who you ask, but for me and my teams it usually worked just fine. Combined with TortoiseHg it also offered a good view of the state of a repository, its branches and how everything merged together.

That changed when the company moved to Git for version control. The primary reason for switching was better integration with CI tooling, and tooling support for Git is quite good in general. But that meant development teams also had to switch. Coming from TortoiseHg, I was quite let down by the whole command line experience. Team members who had been less proficient with Mercurial came to me repeatedly because they simply didn't know what to do with the Git command line.

Having heard a lot about Git and its reputation as one of the most popular version control systems, I optimistically set out to find a tool to replace TortoiseHg. It would probably even blow TortoiseHg out of the water, because it was Git! Right?

Wrong.

I was seriously surprised by the state of GUI tools for Git. If this version control software was all the rage, then where were the shiny tools?

Git's natural environment

During my search I started to see why Git is so popular. It was developed by Linus Torvalds for working on projects with many contributors, where a select group of maintainers is in charge of the main repository. When I first heard about the readily available features of Git that allow one to rewrite history, I didn't understand why one would ever need such a thing. But in a world of distributed software development with a handful of gatekeepers responsible for pulling in changes, features like rebase, squash and cherry-picking make sense.

In a world dominated by GitHub for distributed software development, where anyone can propose changes via pull request, Git really shines. That's what it was made for.

On the other end of the spectrum are software companies with small teams working on projects of similar size. In such teams there is no gatekeeper; everyone is responsible for the code being committed to the repository. Where the vocal Git community's mantra is about ‘keeping a clean history’, these taciturn teams generally benefit more from a truthful history rather than a clean one.

The benefits of a dirty and honest history

For me, the primary purpose of version control history is understanding why code is the way it is now. Similar to an archaeologist dusting off ancient artifacts in an attempt to understand a civilisation's evolution, I sometimes go through old commits trying to make sense of how the code came to be.

Software development isn't always a perfect world. Requirements change, team compositions change and yearly project budgets run out. What once seemed like a good decision may now raise eyebrows. That's where an accurate history comes in. Whenever I see a line of code or a method that seems out of place, I consult the history. A detailed history can shed light on the reasons:

  • It can be a remnant of a design direction that was later revised.
  • A feature may have gone through many iterations where an old implementation echoes in the current code.
  • Two feature branches may have had conflicting changes that were resolved, but which left a mark in the code.

Rewriting history can make such details more difficult to track down. For example, if Audrey tried several approaches to solve a problem, traces of earlier directions may still be visible in her final solution. If she then squashes all intermediate commits into a single commit, the details of the tried approaches are lost. This would leave a future developer in the dark as to why the code may look a bit awkward.

Or instead of merging two branches, Bill rebases his commits onto the other branch. This results in conflicts, which Bill then quickly fixes without spending too much effort. A few months later another developer wonders why the code is written like that. It's the result of a merge conflict. But because Bill rebased his commits, the history won't show any signs of that merge taking place.

Merges are a common source of issues — albeit small — most of which tend to go unnoticed. By the time they do get noticed, it can be valuable to see when and where they originated. Being able to track those issues back to a merge commit can explain a lot. Rebasing removes these important intersections from the history.

The dangers of rebasing

So far I've mainly discussed differences between large distributed software projects and small centralized projects. There is no ‘right’ way of working with Git. Each of these project types can benefit from different workflows, either with a clean history, or a dirty but truthful history. I've also highlighted how a full history without any rewriting can be a valuable tool in understanding code in the future.

Now it's time for an example that shows how rebasing can downright break your version history. Imagine Audrey working on the following code, in which she just added a log statement.

public void SendDiscountVoucherEmail(string discountVoucher)
{
    var template = this.templateProvider.GetCouponTemplate();

    var parameters = new Dictionary<string, string>
    {
        ["code"] = discountVoucher,
        ["month"] = DateTime.Today.ToString("MMMM")
    };

    var body = template.Format(parameters);

    Log.Information("Emailing {Code}", discountVoucher);

    this.emailService.Send("Your discount", body);
}

She commits this change, which we will refer to as commit A. She then continues working on this method and creates commit B, after which the method looks like this:

public void SendDiscountVoucherEmail(string discountVoucher)
{
    var template = this.templateProvider.GetCouponTemplate();

    var parameters = new Dictionary<string, string>
    {
        ["code"] = discountVoucher,
        ["month"] = DateTime.Today.ToString("MMMM")
    };

    var content = template.Format(parameters);

    Log.Information("Emailing {Code}", discountVoucher);

    this.emailService.Send("Personal discount", content);
}

Earlier that day, Bill renamed the concept of 'discount voucher' to 'coupon' as per the ubiquitous language of the project. Audrey rebases her commits onto Bill's commit and fortunately she realises that Bill's change breaks the logging statement that she added earlier. So she amends her last commit to make sure everything compiles and the tests succeed:

public void SendCouponEmail(string couponCode)
{
    var template = this.templateProvider.GetCouponTemplate();

    var parameters = new Dictionary<string, string>
    {
        ["code"] = couponCode,
        ["month"] = DateTime.Today.ToString("MMMM")
    };

    var content = template.Format(parameters);

    Log.Information("Emailing {Code}", couponCode);

    this.emailService.Send("Personal discount", content);
}

She pushes the final result and all is well… Until someone checks out commit A.

Rebasing can invalidate previous commits

Although Audrey amended commit B with a fix to incorporate Bill's changes, commit A was left in a broken state due to rebasing. The discountVoucher parameter that Audrey logged earlier was renamed by Bill:

public void SendCouponEmail(string couponCode)
{
    var template = this.templateProvider.GetCouponTemplate();

    var parameters = new Dictionary<string, string>
    {
        ["code"] = couponCode,
        ["month"] = DateTime.Today.ToString("MMMM")
    };

    var body = template.Format(parameters);

    // Does not compile: Bill renamed this parameter to couponCode.
    Log.Information("Emailing {Code}", discountVoucher);

    this.emailService.Send("Your discount", body);
}

One could argue that Audrey is at fault here. She pulled in Bill's changes retroactively without ensuring that all of her previous commits were still in working order. But can you really expect a team to revalidate all of their commits after every rebase?

Yes. Well, at least they should if they rebase in order to keep a 'clean history' without too many branches and merge commits. If you want it clean, you probably also want it to work, right? What's the value of a clean history when it's broken?
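If a team does want to rebase and still validate every rewritten commit, Git can help with `git rebase --exec`, which runs a command after each replayed commit and stops at the first failure. The following is only a sketch that replays the Audrey and Bill scenario in a throwaway repository. The file names (`declared.txt`, `usage.txt`, `send.txt`) and the `check.sh` script are hypothetical stand-ins I made up for the C# method and for a real build or test command; a real project would run its own build and tests instead.

```shell
#!/bin/sh
# Sketch: let `git rebase --exec` check every rewritten commit.
# All file names and check.sh are illustrative assumptions.
set -eu

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.email "dev@example.com"
git config user.name "Dev"
main=$(git symbolic-ref --short HEAD)   # default branch name varies per setup

# Stand-in for "does it compile?": the name used must be the declared one.
cat > "$work/check.sh" <<'EOF'
#!/bin/sh
# Pass when there is no usage yet, or when it mentions the declared name.
[ ! -f usage.txt ] || grep -q "$(cat declared.txt)" usage.txt
EOF
chmod +x "$work/check.sh"

# Shared starting point: the parameter is still called discountVoucher.
echo discountVoucher > declared.txt
git add declared.txt
git commit -q -m "Base"

# Audrey's branch: commit A logs the parameter, commit B changes the email.
git checkout -q -b feature
echo 'log discountVoucher' > usage.txt
git add usage.txt
git commit -q -m "A: add log statement"
echo 'send content' > send.txt
git add send.txt
git commit -q -m "B: change email"

# Bill's rename lands on the main branch in the meantime.
git checkout -q "$main"
echo couponCode > declared.txt
git commit -q -am "Bill: rename to coupon"

# Rebase Audrey's work and run the check after every replayed commit.
git checkout -q feature
result=clean
stopped_at=""
if ! git rebase -q --exec "$work/check.sh" "$main"; then
  result=stopped
  stopped_at=$(git log -1 --format=%s)   # the commit that failed the check
  git rebase --abort                     # restore the branch for this demo
fi
echo "rebase $result at: $stopped_at"
```

In this sketch the rebase should halt right after replaying commit A, because the check fails there even though Git reported no conflict. That is exactly the silent breakage described above, only now it is caught at rebase time, at the cost of building and testing every commit again.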

“But we never look at historical commits anyway.” Then why all the fuss about a clean history? And if someone does want to look into the history of a code base, you're doing them a disservice by removing valuable information such as merge commits, and by potentially breaking commits that were just fine.

How to avoid the danger

The solution to this problem is rather simple: use merge instead of rebase. It's also quite time efficient, because there is no need to revalidate previous commits. All commits are left as they were, completely in working order.

Yes, you will have multiple branches and merge commits in your history. That's simply a fact when you have more than one person working on a project. Audrey did work in parallel with Bill and others; there is no need to hide that fact from history.
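To make the difference concrete, here is a small sketch of the same scenario resolved with a merge, again in a throwaway repository. The file names (`declared.txt`, `usage.txt`) are hypothetical stand-ins for the C# method, not anything from a real project. The fix for Bill's rename goes into the merge commit itself, so commit A keeps its original hash and content.

```shell
#!/bin/sh
# Sketch: merge instead of rebase, with the fix recorded in the merge commit.
# All file names are illustrative assumptions.
set -eu

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.email "dev@example.com"
git config user.name "Dev"
main=$(git symbolic-ref --short HEAD)   # default branch name varies per setup

# Shared starting point: the parameter is still called discountVoucher.
echo discountVoucher > declared.txt
git add declared.txt
git commit -q -m "Base"

# Audrey's commit A on her own branch.
git checkout -q -b feature
echo 'log discountVoucher' > usage.txt
git add usage.txt
git commit -q -m "A: add log statement"
a_before=$(git rev-parse HEAD)

# Bill's rename on the main branch.
git checkout -q "$main"
echo couponCode > declared.txt
git commit -q -am "Bill: rename to coupon"

# Merge Bill's work, but pause before committing so the fix
# becomes part of the merge commit.
git checkout -q feature
git merge -q --no-ff --no-commit "$main"
echo 'log couponCode' > usage.txt
git add usage.txt
git commit -q -m "Merge: adopt coupon rename"

# Commit A is the first parent of the merge and is untouched.
a_after=$(git rev-parse HEAD^1)
parents=$(git rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
merges=$(git log --merges --format=%s)
echo "A before: $a_before"
echo "A after:  $a_after"
echo "merge commits: $merges"
```

Anyone who later wonders why the log statement changed can list the intersections with `git log --merges` and find the adaptation sitting in the merge commit, next to both of its parents.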

Software development isn't always as clean and clinical as we might want it to be. It's best to focus on writing clean code and let version control software do its work: document changes. If some of those changes were done in parallel, then that is valuable information. There is no need to throw away that information.