The problem
Recently I’ve wanted to compare a few versions of my home directory: I want to know which files change over time, and to check that my backups actually contain the files I expect, with the data in them I expect.
The notes below relate to MacOS, but I expect they’d work in approximately the same way on any Unix box.
In rough terms, I want a recursive diff1, i.e.
$ diff --recursive a b
There are a couple of problems with this:
- Sometimes it is not convenient to put both directory trees on the same machine.
- When I tried doing this recently on my home directory (~700GB), I killed the job after a few days because it hadn’t finished and I got bored.
Intrusion Detection Systems
This problem is close to one approach to host-based intrusion detection2: systems which alert a system administrator to nefarious changes in files.
On Linux, I’ve used integrit3 and tripwire4 for this, AIDE5 seems common too.
However, none of these are trivial to install on MacOS, and even if they were, I think they spend a lot of effort defending against deliberate subterfuge, which necessarily makes them harder to use when you just want to compare arbitrary trees.
The checksum trick
One of the key ideas in tools like tripwire is that instead of comparing the files directly, it is a good approximation to compute a hash6 of each file and then compare those instead. The idea is an old one, but remains popular: for example, I think it’s central to the way git7 works.
So, can we easily generate hashes of all the files we care about? This being Unix, we can indeed, in a single find8 command:
$ find a -type f -exec gsha256sum {} + > a.csums
This generates SHA-29 hashes for all the files under a, and saves them to a.csums. I am not sure that SHA-2 is the best choice here, but I’m reasonably confident that it’s not entirely stupid.
The gsha256sum command from GNU's coreutils10 package actually does the hashing. It isn’t installed on stock MacOS, but is in homebrew11. Assuming that you have homebrew, you can install coreutils thus:
$ brew install coreutils
It is worth noting that the find command ignores both non-files, e.g. links, and all file metadata. These struck me as advantages, but YMMV.
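To see this in action, here is a tiny, disposable sketch (the demo directory and file names are mine, not from the article) showing that find -type f skips symlinks:

```shell
# Throwaway demo: -type f matches regular files only, so symlinks are skipped.
rm -rf demo && mkdir demo
echo hello > demo/real-file
ln -s real-file demo/link-to-file
find demo -type f   # prints demo/real-file; demo/link-to-file is ignored
```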
Sample output
The output of gsha256sum looks like this:
6a2b70adfcf22278f71f75fe532a254b981dffc303925d6008ee4240b10f7317 bu
f8401d2de8c7094ca2c170dc93603179b64d5dfdcef8ea23e965a250e813e588 tm
You could easily use a different hashing program, but sadly the format of MacOS’s standard md5 command isn’t suitable:
$ md5 *
MD5 (bu) = 4849350721dec3431f6a506d27655641
MD5 (tm) = 6a21f7b1583cde5daeecaa0a0609fb2e
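If gsha256sum is unavailable, you could coax the stock md5 output into the same hash-then-filename shape with a little sed. This is an untested-in-anger sketch; the regex assumes file names do not themselves contain ") = ":

```shell
# Rewrite BSD-style "MD5 (file) = hash" lines as "hash  file",
# matching the coreutils layout. Fragile for exotic file names.
echo 'MD5 (bu) = 4849350721dec3431f6a506d27655641' |
  sed -E 's/^MD5 \((.*)\) = ([0-9a-f]+)$/\2  \1/'
# prints: 4849350721dec3431f6a506d27655641  bu
```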
Sorting for the win
The file full of checksums above suffers from a problem: its lines appear in the order in which find traversed the directory tree, which isn’t something we care about.
We can remove this excess entropy by simply sorting the file:
$ sort a.csums > a.scs
At first I balked at the idea of sorting this enormous (~500MB) file, but I tried it anyway and it took about a minute. It’s easy to forget just how fast modern machines are, particularly when running code written when resources were more limited and so people took more care to write efficient code.
Diff mangling
So, given a couple of files of sorted checksums, the only problem left is to compare them. The naïve approach gets us a long way there:
$ diff a.scs b.scs
15d14
...
< 000098c3fac6be1dcad03b4f75280db0c14c4d3a3f34ad02350f16f8df646dd0 foo/bar
...
> 000dfe1a15ea0bf343d536611c71cd5c2d676d72ec380f47c015071b54b746b3 foo/baz
...
There are three classes of line:
- lines which begin with a number: these tell us where the differences between the files are located, and we can ignore them;
- lines which begin with <: these occur in the first file but not the second;
- lines which begin with >: these occur in the second file but not the first.
With many files it is a pain to absorb all these changes by eye, so we need a program. This is the one part of the solution which doesn’t exist, so we’ll need to write it: happily, two dozen lines of Perl suffice:
#! /usr/bin/perl
use strict;
use warnings;

# Accumulate the hashes for each file, keyed by file name and then by
# the diff direction marker ('<' or '>').
my %diffs;
while (<>)
{
    next unless /^[<>]/;    # skip diff's position lines, e.g. "15d14"
    chomp;
    # Limit the split to 3 fields so file names containing spaces survive.
    my ($dir, $hash, $file) = split(/\s+/, $_, 3);
    $diffs{$file}->{$dir} = $hash;
}

# Print each file, marked with the kind of change.
foreach my $file (sort keys %diffs)
{
    my $h = $diffs{$file};
    my $k = (!defined $h->{'<'})       ? ' >'
          : (!defined $h->{'>'})       ? '< '
          : ($h->{'<'} ne $h->{'>'})   ? '<>'
          :                              '==';
    printf "%s %s\n", $k, $file;
}
As is probably obvious, the code accumulates all the diffs in a hash, keyed by the file name. It then iterates over the hash, printing a list of the files marked with the kind of change:
- < means that it’s in the first tree but not the second;
- > means that it’s in the second tree but not the first;
- <> means that it’s in both trees but the contents are different.
In the unlikely event that diff flags both files as different, yet they have the same hash, the mark is ==. I’ve never seen this appear, but it seems prudent to include the case.
Assuming that you’ve saved the code on your $PATH as munge-diff-output, you can do the full comparison thus:
$ find a -type f -exec gsha256sum {} + > a.csums
$ sort a.csums > a.scs
$ find b -type f -exec gsha256sum {} + > b.csums
$ sort b.csums > b.scs
$ diff a.scs b.scs | munge-diff-output
< foo/bar
> foo/baz
<> foo/banana
As is probably obvious, all of the information from the directory tree is distilled into the .scs files. So, the three steps above could all be performed on different machines.
My home directory has about 700GB of files in it. The checksum file is a bit less than 500MB, and gzip compresses it to about 150MB.
Handling lots of changes
As noted above, the code builds a hash of changes in memory. We do this because the .scs files are sorted by checksum, so the same file will probably occur at very different places in the two files.
If holding everything in memory is a problem, it might be better to sort the output first e.g.:
$ diff a.scs b.scs | sort -k 3 -k 1
This brings all the mentions of a file together, with < lines before > lines, so the output could be post-processed with minimal memory use. I have not tried this, though.
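A streaming replacement for the Perl script might then look like the awk sketch below. The munge_sorted name is mine, and unlike the Perl (which limits its split), the naive $3 field access mishandles file names containing spaces; this is a sketch of the idea, not something I'd call finished:

```shell
# Sketch of a constant-memory post-processor. Once the </> lines are sorted
# by file name, both mentions of a file arrive adjacently, so we only ever
# hold one file's state. File names with spaces are not handled.
munge_sorted() {
  grep '^[<>]' | sort -k 3 -k 1 | awk '
    function flush(    k) {
      if (file == "") return
      if (a == "")      k = " >"   # only in the second tree
      else if (b == "") k = "< "   # only in the first tree
      else if (a != b)  k = "<>"   # in both trees, contents differ
      else              k = "=="   # in both trees, same hash
      printf "%s %s\n", k, file
    }
    $3 != file { flush(); file = $3; a = b = "" }   # new file: emit previous
    $1 == "<"  { a = $2 }
    $1 == ">"  { b = $2 }
    END        { flush() }'
}

# Feed it some sample diff output:
printf '%s\n' \
  '15d14' \
  '< aaaa foo/bar' \
  '> bbbb foo/baz' \
  '< cccc foo/banana' \
  '> dddd foo/banana' | munge_sorted
```

Because the awk only compares the current file name against the previous one, the order of < and > within a file's group doesn't actually matter; the sort just guarantees the two mentions are adjacent.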
Conclusion
There’s very little new here and I expect most people fluent at the command line could do this for themselves without much thought. Certainly it’s taken me longer to write this up than to concoct it.
On the other hand, it might save some people some time, and it’s nice to be reminded how well the Unix shell still works.
References
- 1. https://en.wikipedia.org/wiki/Diff_utility
- 2. https://en.wikipedia.org/wiki/Host-based_intrusion_detection_system
- 3. https://github.com/integrit/integrit
- 4. https://github.com/Tripwire/tripwire-open-source
- 5. http://aide.sourceforge.net
- 6. https://en.wikipedia.org/wiki/Hash_function
- 7. https://en.wikipedia.org/wiki/Git
- 8. https://www.gnu.org/software/findutils/manual/html_node/find_html/Invoking-find.html#Invoking-find
- 9. https://en.wikipedia.org/wiki/SHA-2
- 10. https://www.gnu.org/s/coreutils/
- 11. https://brew.sh