The problem
Recently I’ve wanted to compare a few versions of my home directory: I want to know which files change over time, and to check that my backups actually contain the files I expect, with the data in them I expect.
The notes below relate to MacOS, but I expect they’d work in approximately the same way on any Unix box.
In rough terms, I want a recursive diff1, i.e.
$ diff --recursive a b
There are a couple of problems with this:
- Sometimes it is not convenient to put both directory trees on the same machine.
- When I tried doing this recently on my home directory (~700GB), I killed the job after a few days because it hadn’t finished and I got bored.
Intrusion Detection Systems
This problem is close to one approach to host-based intrusion detection2: systems which alert a system administrator to nefarious changes in files.
On Linux, I’ve used integrit3 and tripwire4 for this, AIDE5 seems common too.
However, none of these are trivial to install on MacOS, and even if they were, I think they spend a lot of effort defending against deliberate subterfuge, which necessarily makes them harder to use when you just want to compare arbitrary trees.
The checksum trick
One of the key ideas in tools like tripwire is that instead of comparing the files directly, it is a good approximation to compute a hash6 of each file and then compare those instead. The idea is an old one, but remains popular: for example, I think it’s central to the way git7 works.
So, can we easily generate hashes of all the files we care about? This being Unix, we can indeed, in a single find8 command:
$ find a -type f -exec gsha256sum {} + > a.csums
This generates SHA-29 hashes for all the files under a, and saves them to a.csums. I am not sure that SHA-2 is the best choice here, but I’m reasonably confident that it’s not entirely stupid.
The gsha256sum command from GNU's coreutils10 package actually does the hashing. It isn’t installed on stock MacOS, but is in homebrew11. Assuming that you have homebrew, you can install coreutils thus:
$ brew install coreutils
It is worth noting that the find command ignores both non-files, e.g. links, and all file metadata. These struck me as advantages, but YMMV.
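To see this in action, here is a tiny, disposable sketch (the demo directory and file names are mine, not from the article) showing that find -type f skips symlinks:

```shell
# Throwaway demo: -type f matches regular files only, so symlinks are skipped.
rm -rf demo && mkdir demo
echo hello > demo/real-file
ln -s real-file demo/link-to-file
find demo -type f   # prints demo/real-file; demo/link-to-file is ignored
```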
Sample output
The output of gsha256sum looks like this:
6a2b70adfcf22278f71f75fe532a254b981dffc303925d6008ee4240b10f7317 bu
f8401d2de8c7094ca2c170dc93603179b64d5dfdcef8ea23e965a250e813e588 tm
You could easily use a different hashing program, but sadly the format of MacOS’s standard md5 command isn’t suitable:
$ md5 *
MD5 (bu) = 4849350721dec3431f6a506d27655641
MD5 (tm) = 6a21f7b1583cde5daeecaa0a0609fb2e
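If gsha256sum is unavailable, you could coax the stock md5 output into the same hash-then-filename shape with a little sed. This is an untested-in-anger sketch; the regex assumes file names do not themselves contain ") = ":

```shell
# Rewrite BSD-style "MD5 (file) = hash" lines as "hash  file",
# matching the coreutils layout. Fragile for exotic file names.
echo 'MD5 (bu) = 4849350721dec3431f6a506d27655641' |
  sed -E 's/^MD5 \((.*)\) = ([0-9a-f]+)$/\2  \1/'
# prints: 4849350721dec3431f6a506d27655641  bu
```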
Sorting for the win
The file full of checksums above suffers from a problem: its lines appear in the order in which find traversed the directory tree, which isn’t something we care about.
We can remove this excess entropy by simply sorting the file:
$ sort a.csums > a.scs
At first I balked at the idea of sorting this enormous (~500MB) file, but I tried it anyway and it took about a minute. It’s easy to forget just how fast modern machines are, particularly when running code written when resources were more limited and so people took more care to write efficient code.
Diff mangling
So, given a couple of files of sorted checksums, the only problem left is to compare them. The naïve approach gets us a long way there:
$ diff a.scs b.scs
15d14
...
< 000098c3fac6be1dcad03b4f75280db0c14c4d3a3f34ad02350f16f8df646dd0 foo/bar
...
> 000dfe1a15ea0bf343d536611c71cd5c2d676d72ec380f47c015071b54b746b3 foo/baz
...
There are three classes of line:
- lines which begin with a number: these tell us where the differences between the files are located, and we can ignore them;
- lines which begin with <: these occur in the first file but not the second;
- lines which begin with >: these occur in the second file but not the first.
With many files it is a pain to absorb all these changes by eye, so we need a program. This is the one part of the solution which doesn’t exist, so we’ll need to write it: happily, two dozen lines of Perl suffice:
#! /usr/bin/perl
use strict;
use warnings;

# Accumulate the hashes for each file, keyed by file name and then by
# the diff direction marker ('<' or '>').
my %diffs;
while (<>)
{
    next unless /^[<>]/;    # skip diff's position lines, e.g. "15d14"
    chomp;
    # Limit the split to 3 fields so file names containing spaces survive.
    my ($dir, $hash, $file) = split(/\s+/, $_, 3);
    $diffs{$file}->{$dir} = $hash;
}

# Print each file, marked with the kind of change.
foreach my $file (sort keys %diffs)
{
    my $h = $diffs{$file};
    my $k = (!defined $h->{'<'})       ? ' >'
          : (!defined $h->{'>'})       ? '< '
          : ($h->{'<'} ne $h->{'>'})   ? '<>'
          :                              '==';
    printf "%s %s\n", $k, $file;
}
As is probably obvious, the code accumulates all the diffs in a hash, keyed by the file name. It then iterates over the hash, printing a list of the files marked with the kind of change:
- < means that it’s in the first tree but not the second;
- > means that it’s in the second tree but not the first;
- <> means that it’s in both trees but the contents are different.
In the unlikely event that diff flags both files as different, yet they have the same hash, the mark is ==. I’ve never seen this appear, but it seems prudent to include the case.
Assuming that you’ve saved the code on your $PATH as munge-diff-output, you can do the full comparison thus:
$ find a -type f -exec gsha256sum {} + > a.csums
$ sort a.csums > a.scs
$ find b -type f -exec gsha256sum {} + > b.csums
$ sort b.csums > b.scs
$ diff a.scs b.scs | munge-diff-output
< foo/bar
> foo/baz
<> foo/banana
As is probably obvious, all of the information from the directory tree is distilled into the .scs files. So, the three steps above could all be performed on different machines.
My home directory has about 700GB of files in it. The checksum file is a bit less than 500MB, and gzip compresses it to about 150MB.
Handling lots of changes
As noted above, the code builds a hash of changes in memory. We do this because the .scs files are sorted by checksum, so the same file will probably occur at very different places in the two files.
If holding everything in memory is a problem, it might be better to sort the output first e.g.:
$ diff a.scs b.scs | sort -k 3 -k 1
This brings all the mentions of a file together, with < lines before > lines, so the output could be post-processed with minimal memory use. I have not tried this, though.
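A streaming replacement for the Perl script might then look like the awk sketch below. The munge_sorted name is mine, and unlike the Perl (which limits its split), the naive $3 field access mishandles file names containing spaces; this is a sketch of the idea, not something I'd call finished:

```shell
# Sketch of a constant-memory post-processor. Once the </> lines are sorted
# by file name, both mentions of a file arrive adjacently, so we only ever
# hold one file's state. File names with spaces are not handled.
munge_sorted() {
  grep '^[<>]' | sort -k 3 -k 1 | awk '
    function flush(    k) {
      if (file == "") return
      if (a == "")      k = " >"   # only in the second tree
      else if (b == "") k = "< "   # only in the first tree
      else if (a != b)  k = "<>"   # in both trees, contents differ
      else              k = "=="   # in both trees, same hash
      printf "%s %s\n", k, file
    }
    $3 != file { flush(); file = $3; a = b = "" }   # new file: emit previous
    $1 == "<"  { a = $2 }
    $1 == ">"  { b = $2 }
    END        { flush() }'
}

# Feed it some sample diff output:
printf '%s\n' \
  '15d14' \
  '< aaaa foo/bar' \
  '> bbbb foo/baz' \
  '< cccc foo/banana' \
  '> dddd foo/banana' | munge_sorted
```

Because the awk only compares the current file name against the previous one, the order of < and > within a file's group doesn't actually matter; the sort just guarantees the two mentions are adjacent.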
Conclusion
There’s very little new here and I expect most people fluent at the command line could do this for themselves without much thought. Certainly it’s taken me longer to write this up than to concoct it.
On the other hand, it might save some people some time, and it’s nice to be reminded how well the Unix shell still works.
References
- 1. https://en.wikipedia.org/wiki/Diff_utility
- 2. https://en.wikipedia.org/wiki/Host-based_intrusion_detection_system
- 3. https://github.com/integrit/integrit
- 4. https://github.com/Tripwire/tripwire-open-source
- 5. http://aide.sourceforge.net
- 6. https://en.wikipedia.org/wiki/Hash_function
- 7. https://en.wikipedia.org/wiki/Git
- 8. https://www.gnu.org/software/findutils/manual/html_node/find_html/Invoking-find.html#Invoking-find
- 9. https://en.wikipedia.org/wiki/SHA-2
- 10. https://www.gnu.org/s/coreutils/
- 11. https://brew.sh