Introduction

Chapter 2 of David MacKay’s excellent book Information Theory, Inference, and Learning Algorithms1 is a general introduction to probability. One of the examples asks about some dice-rolling:

Fred rolls an unbiased six-sided die once per second, noting the occasions when the outcome is a six.

  1. What is the mean number of rolls from one six to the next six?
  2. Between two rolls, the clock strikes one. What is the mean number of rolls until the next six?
  3. Now think back before the clock struck. What is the mean number of rolls, going back in time, until the most recent six?
  4. What is the mean number of rolls from the six before the clock struck to the next six?

When I first encountered this, I found it quite hard to tackle, because it’s one of those problems which is almost trivial if you look at it in the right way, but hard otherwise. The key is to educate your intuition so that you do indeed see it from the right perspective.

I don’t remember how I tackled it in the past, but when I was discussing it recently, it struck me as a nice thing to simulate, and my favourite language for such things is Haskell2.

Random Numbers in Haskell

In many languages, you can generate a random number by calling a function e.g. in python3 we might simulate a couple of dice-rolls thus:

>>> random.randint(1, 6)
2
>>> random.randint(1, 6)
6

Pedantically this generates a pseudo-random4 number, and behind the function call there’s some hidden state so that repeated calls return different results.

Haskell’s idea of a function is much closer to the mathematical one: in particular functions are pure5 which means that if we call a function again with the same arguments we’ll get the same result. We could proceed by explicitly passing the state of the random number generator around. However, Haskell is lazy6 and so copes well with infinite lists. So, we can define an infinite list of random rolls happy in the knowledge that the samples will only be generated as they’re needed.

Once defined we can pass the infinite list of rolls around just like any other list. The code which analyses the sequence knows nothing about randomness: it just sees the numbers.

Happily the System.Random7 package contains all the code we need to generate the list:

ghci> import System.Random
ghci> let rolls = randomRs (1,6) . mkStdGen

ghci> take 10 $ rolls 42
[6,4,2,5,3,2,1,6,1,4]

ghci> sum . take 100000 $ rolls 42
350050

The key remaining difference is that we have to specify an explicit seed (here 42). In other languages this often defaults to some external source of entropy e.g. the system clock.

If you’re unfamiliar with Haskell and find the . and $ confusing, this Stack Overflow8 article might help.

Some helpful utility functions

Haskell has many list handling functions in the standard Data.List9 package. However, it will be useful to define several new functions, all of which are simple wrappers around Data.List.Split10:

import qualified Data.List.Split as Sp			
							
splitAfter :: (a -> Bool) -> [a] -> [[a]]		
splitAfter p = Sp.split (Sp.keepDelimsR $ Sp.whenElt p)	
							
splitBefore :: (a -> Bool) -> [a] -> [[a]]		
splitBefore p = Sp.split (Sp.keepDelimsL $ Sp.whenElt p)
							
takeUntil :: (a -> Bool) -> [a] -> [a]			
takeUntil p = head . (splitAfter p)			

Hopefully the names and function signatures are enough to explain what these do, but here are some examples:

ghci> splitAfter  (== 'c') "abcdefabcdef"
["abc","defabc","def"]

ghci> splitBefore (== 'c') "abcdefabcdef"
["ab","cdefab","cdef"]

ghci> takeUntil   (== 'c') "abcdefabcdef"
"abc"

As is perhaps clear now, our general plan for simulating the dice rolls will be to take the infinite list of rolls, then cut it into sections whose lengths we’ll average.

With this in mind, we’ll find a few other functions helpful too:

averageOver :: Integral a => Int -> [a] -> Double	
averageOver n xs = (fromIntegral sigma) / (fromIntegral n)
   where sigma = sum $ take n xs			
							
averageLength :: Int -> [[a]] -> Double			
averageLength n = averageOver n . map length		

We can easily calculate the mean dice-roll:

ghci> averageOver 10000 $ rolls 42
3.4912

However, we’re usually interested in the average length of a sequence, so here’s a (very artificial) example:

ghci> averageLength  10000 . map (\n -> replicate n 'a') $ rolls 42
3.4912

To see why this works consider what the map does to the first five rolls:

ghci> take 5 $ rolls 42
[6,4,2,5,3]
ghci> take 5 . map (\n -> replicate n 'a') $ rolls 42
["aaaaaa","aaaa","aa","aaaaa","aaa"]

The simulations

Having prepared our tools, we can now actually tackle the questions.

Part A

The question asks:

What is the mean number of rolls from one six to the next six ?

A reasonable approach is to split the list of rolls whenever we see a six, then measure the lengths of the sublists:

a_seqs :: [Roll] -> [[Roll]]
a_seqs = splitAfter (== 6)
ghci> take 4 . a_seqs $ rolls 42
[[6],[4,2,5,3,2,1,6],[1,4,4,4,1,3,3,2,6],[2,4,1,3,1,1,5,5,5,1,3,6]]

ghci> averageLength   1000 a_seqs $ rolls 42
6.137
ghci> averageLength 100000 a_seqs $ rolls 42
5.98306

It seems likely, and unsurprisingly, that this answer will tend to six as the number of samples tends to infinity. This is easy to show analytically too!

Given that the dice is fair, there is a one-sixth change of rolling a six, and a five-sixths change of not. So, the chance of having to wait n rolls for a six is:

\[ p(n) = \frac{1}{6} \times \left(\frac{5}{6}\right)^{n-1}. \]

and thus the mean number of rolls is given by,

\[ \mu_A = \sum_{i = 1}^{\infty} i \ \times \theta \, \left(1 - \theta\right)^{i-1}, \]

where \(\theta = 1/6\).

Evaluating this11 does indeed give,

\[ \mu_B = \frac{1}{\theta} = 6. \]

Parts B & C

The question asks:

Between two rolls, the clock strikes one. What is the mean number of rolls until the next six?

The first problem here is that we have no notion of the time in our simulation, so let’s add one:

addTimes :: [a] -> [(Time,a)]
addTimes = zip times
   where times = cycle [0..longPeriod - 1]

ghci> take 5            $ rolls 42
[6,4,2,5,3]
ghci> take 5 . addTimes $ rolls 42
[(0,6),(1,4),(2,2),(3,5),(4,3)]

We’ve replaced the list of rolls with a list of (time,roll) pairs. We need some convention about time: let’s say that “one o’clock” corresponds to t = 0, and the roll happens after the tick. Thus (0,6) means that the roll immediately after the clock struck was a 6.

Simulating this part is a little bit more complicated but it’s not too bad. Begin by splitting the list when the clock chimes, then discard the sequence after the first 6. In code:

b_seqs :: [Roll] -> [[(Time,Roll)]]
b_seqs = map (takeUntil (\(t, r) -> r  6))
               . splitAfter (\(t,r) -> t  0)
               . addTimes

ghci> take 3 . b_seqs $ rolls 42
[[(0,6)]
 ,[(1,4),(2,2),(3,5),(4,3),(5,2),(6,1),(7,6)]
 ,[(1,3),(2,1),(3,2),(4,6)]]

ghci> averageLength   1000 b_seqs $ rolls 42
6.21
ghci> averageLength 100000 b_seqs $ rolls 42
6.01542

The calculation is noticably slower, but the answer seems to be the same. That seems reasonable: we are just picking random points in the sequence of rolls and starting our count there.

Nothing in the analytic result above cares where we start, so the analytic result is also unchanged!

Although I’ve not done it explicitly here, I think it’s clear that part C is just the same but with time running backwards.

Part D

The question asks:

What is the mean number of rolls from the six before the clock struck to the next six?

Today, it seems sensible to me to tackle this question by splitting the list of rolls every time we see a six, then throwing out those sequences in which the clock doesn’t chime. The code is straightforward and gives the right answer:

d_seqs :: [Roll] -> [[(Time,Roll)]]
d_seqs = filter (any (\(t,r) -> t  0))
              . splitAfter (\(t,r) -> r  6)
              . addTimes

ghci> averageLength 100000 d_seqs $ rolls 42
11.1093

However, I worry somewhat that this code is only easy to write because I know how to think about the question. So, here’s a messier approach which relates more closely to the words in MacKay’s book.

We begin by adding another component to the list of rolls, which counts how long it’s been since we rolled a six. A couple of caveats: it fudges the start of the sequence, and it resets the count in the tuple after the six is rolled:

addTimeSince6 :: [Roll] -> [(Roll,Int)]
addTimeSince6 = tail . addT
   where addT xs = (0,0):(zipWith f xs (addT xs))
         f r (r',i) = (r, if r' == 6 then 1 else i + 1)

ghci> addTimeSince6 [1,2,6,1,2,6,1,2]
[(1,1),(2,2),(6,3),(1,1),(2,2),(6,3),(1,1),(2,2)]

ghci> addTimes . addTimeSince6 $ [1,2,6,1,2,6,1,2]
[(0,(1,1)),(1,(2,2)),(2,(6,3)),(3,(1,1)),(4,(2,2)),(5,(6,3)),(6,(1,1)),(7,(2,2))]

If we annotate this sequence with the time, and split it where the clock chimes, we can read off the length directly by looking for the first six in the break:

d_lengths' :: [Roll] -> [Int]				
d_lengths' = tail					
                 . map len 				
                 . splitBefore (\(t,(r,q)) -> t  0)	
                 . addTimes				
                 . addTimeSince6			
  where len [] = 0					
        len xs = head [ q | (t,(r,q)) <- xs, r  6]
ghci> averageOver 3000 . d_lengths' $ rolls 42
11.08

The code is slow (which is why we only consider 3,000 samples) but it appears to get the same result! If we wanted to be sure, we can compare the lengths directly:

*Main> take 20  . map length . d_seqs $ rolls 42
[1,14,43,7,13,8,6,9,5,7,27,41,1,3,22,14,8,9,18,9]
*Main> take 20  . d_lengths'  $ rolls 42
[1,14,43,7,13,8,6,9,5,7,27,41,1,3,22,14,8,9,18,9]

Happily these sequences agree, at least up to the first twenty terms. Less happily, I think the code is rather messy, and probably reasonably opaque if you’re not familiar with Haskell.

Finally, let’s derive this analytically. The key insight here is that the chance of the clock chiming in a particular run between sixes depends on the length of the sequence. Although all ticks are equally likely to hear the clock chime, longer sequences have more ticks and so are more likely. Thus the probability of being in a sequence of length \(i\) when the clock chimes is,

\[ q_i \propto i \times p_i, \]

Now,

\[ p_i \propto \left(1 - \theta\right)^{i - 1}, \]

So,

\[ q_i \propto i \times \left(1 - \theta\right)^{i-1}. \]

Thus, multiplying all terms by \(1 - \theta\), the mean number is given by

\[ \mu_D = \frac{\sum_{i = 1}^{\infty}{i^2\, \left(1 - \theta\right)^i}}{\sum_{i = 1}^{\infty}{i\, \left(1 - \theta\right)^i}}. \]

Evaluating this12 does indeed show that

\[ \mu_D = \frac{2}{\theta} - 1 = 11. \]

Other ways to (roll a) die

As we mentioned above, because we hide all the randomness behind the infinite list of rolls, it is easy to consider other situations without changing the analysis code.

A dodgy die

Suppose we have a dodgy die which due to bad planning13 has six dots painted where there should be five. In other words, there the chance of rolling a six is now one-third.

How does this affect our answers ?

addBias :: [Roll] -> [Roll]
addBias = map (\i -> if i == 5 then 6 else i)

ghci> take 10           $ rolls 42
[6,4,2,5,3,2,1,6,1,4]
ghci> take 10 . addBias $ rolls 42
[6,4,2,6,3,2,1,6,1,4]
ghci> averageLength 10000 . a_seqs . addBias $ rolls 42
3.0129
ghci> averageLength 10000 . d_seqs . addBias $ rolls 42
4.958

So, it seems that the means change to 3 and 5.

Sure enough, if we put \(\theta = 1/3\) into the expressions we derived about we find that:

\[ \mu_A = 3, \mu_D = 5. \]

A magic die

Now suppose that instead of an incompetent manufacturer, our die came from a magician. In particular, suppose that the dice always rolls the sequence 1,2,3,4,5,6,1,2,... We will leave the implementation of this for now, but it’s easy to simulate:

magicRolls :: [Roll]
magicRolls = cycle [1..6]

ghci> take 10 $ magicRolls
[1,2,3,4,5,6,1,2,3,4]

*Main> averageLength 10000 . a_seqs $ magicRolls
6.0
*Main> averageLength 10000 . d_seqs $ magicRolls
6.0

\(\mu_A\) stays the same, but \(\mu_D\) is now six! To understand why it is important to notice that the magic die rolls six on every sixth roll, and so the sequences between sixes are all precisely six rolls long.

This means that the chime is equally likely to fall into all the sequences, and whichever one it does pick, is bound to be six rolls long.

Implementation notes

Performance issues

Although Haskell is a very pleasant environment for doing this sort of work, in practice you can hit performance issues. There are a couple of problems:

  1. The ghci REPL doesn’t compile the code efficiently, so in practice you’re better off compiling the code as a library, then loading the object into ghci.
  2. There are space leaks in the code, so it doesn’t scale well. For the simple sequence stuff, this isn’t a problem, but it for the second part D calculation it is a nuisance.

Day length

Although the question talks about a daily chime, in practice all we need are events sufficiently widely spaced that they won’t interact. This translates to saying that we can be confident that a six will occur between two sets of chimes.

Days are 86,400 seconds long, so one o’clock chimes occur at least 43,200 seconds apart. However, given that

\[ \left(\frac{5}{6}\right)^{100} \approx 1.2 \times 10^{-8} \]

it seems enough to work in a world where the chimes happen every 100 seconds. The code above does this, and runs much more quickly as a result.