Merge shuffle

Posted on January 24, 2011 by phimuemue

Following this question on Stack Overflow, I came up with the following "merge shuffle" algorithm to permute a linked list. Originally, I had doubts that the results would be uniformly distributed, but experiments suggested that the algorithm gives good results. So I decided to check whether the distribution is indeed uniform.

Let us first look at the algorithm (in Haskell-like pseudo code):

splitlist [] = ([], [])
splitlist (x:xs) = (x:r, l) where (l, r) = splitlist xs

merge [] r = r
merge l [] = l
merge (lh:l) (rh:r) =
  if randomCondition then lh : merge l (rh:r)
  else rh : merge (lh:l) r

mergeshuffle [] = []
mergeshuffle [a] = [a]
mergeshuffle (x:xs) =
  merge (mergeshuffle l) (mergeshuffle r)
  where (l, r) = splitlist (x:xs)

As you can see, it is just merge sort with a randomized merge subroutine that takes the first element of a randomly chosen list. From now on we assume that randomCondition is fulfilled with probability 0.5 and not fulfilled with probability 0.5.
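For concreteness, the pseudo code can be turned into runnable Haskell. This is a minimal sketch of my own; in particular, using System.Random's randomIO as the fair coin for randomCondition is my choice — the post itself only assumes a coin that comes up either way with probability 0.5:

```haskell
import System.Random (randomIO)

-- Split a list into two halves by handing out elements alternately.
splitlist :: [a] -> ([a], [a])
splitlist []     = ([], [])
splitlist (x:xs) = (x:r, l) where (l, r) = splitlist xs

-- Merge two lists; each step takes the head of a randomly chosen list.
merge :: [a] -> [a] -> IO [a]
merge [] r = return r
merge l [] = return l
merge (lh:l) (rh:r) = do
  takeLeft <- randomIO :: IO Bool  -- plays the role of randomCondition
  if takeLeft
    then (lh :) <$> merge l (rh:r)
    else (rh :) <$> merge (lh:l) r

mergeshuffle :: [a] -> IO [a]
mergeshuffle []  = return []
mergeshuffle [a] = return [a]
mergeshuffle xs  = do
  let (l, r) = splitlist xs
  l' <- mergeshuffle l
  r' <- mergeshuffle r
  merge l' r'
```

Since the coin flips live in IO, merge and mergeshuffle return IO lists; threading an explicit generator through instead would work just as well.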

We are now going to prove that the above algorithm yields uniformly distributed results, i.e. all outcomes of mergeshuffle are equally probable.

To this end, we first examine three base cases: lists with 0, 1 or 2 elements. We choose these base cases since every recursive call eventually bottoms out at lists of these lengths. For those cases it is clear that all outcomes are equally probable (assuming, as above, that randomCondition occurs with probability 0.5). Thus, we can say that for a list with at most 2 elements, all outcomes are equally probable.
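To spell out the only non-trivial base case: for a two-element list, splitlist yields two singletons, and merge evaluates randomCondition exactly once, so

\(P(\texttt{merge}\ [a]\ [b] = [a,b]) = P(\texttt{merge}\ [a]\ [b] = [b,a]) = \tfrac{1}{2}\)

i.e. both of the \(2! = 2\) permutations are equally probable.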

Now, we consider lists with more than 2 elements. First, we look at the merge subroutine. It is clear that for two fixed input lists of lengths \(n_1\) and \(n_2\), every possible (merged) result corresponds to exactly one run of the merge subroutine. Furthermore, a merged result is determined by choosing which \(n_1\) of the \(n_1+n_2\) positions are filled from the first list, so there are \(\binom{n_1+n_2}{n_1}\) possible outcomes in total. So a certain outcome of merge has probability \(\dfrac{1}{\binom{n_1+n_2}{n_1}}=\dfrac{n_1!\cdot n_2!}{(n_1+n_2)!}\).
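As a sanity check on that count, one can enumerate all possible merge results deterministically and compare against the binomial coefficient. The helpers interleavings and choose below are mine, not part of the original algorithm:

```haskell
-- All possible results of merging two lists while preserving each list's
-- internal order, i.e. all interleavings.
interleavings :: [a] -> [a] -> [[a]]
interleavings [] r = [r]
interleavings l [] = [l]
interleavings (lh:l) (rh:r) =
  map (lh :) (interleavings l (rh:r)) ++ map (rh :) (interleavings (lh:l) r)

-- Binomial coefficient via factorials.
choose :: Integer -> Integer -> Integer
choose n k = product [1..n] `div` (product [1..k] * product [1..n-k])
```

For example, merging lists of lengths 3 and 2 admits \(\binom{5}{3}=10\) interleavings, and indeed `length (interleavings [1,2,3] [4,5])` evaluates to 10.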

Now, we prove the claim by induction. We assume that the algorithm yields uniformly distributed results for lists of lengths \(n_1\) and \(n_2\), and conclude that the algorithm then yields uniformly distributed results for a list of length \(n_1+n_2\).

We saw above that for two lists, each possible result of merge is reached via one unique sequence of randomCondition outcomes. Moreover, by the induction hypothesis, the recursive calls of mergeshuffle yield uniformly distributed results; i.e., the probability that mergeshuffle applied to a list of length \(n_1\) yields a specific result is \(\frac{1}{n_1!}\) (and analogously \(\frac{1}{n_2!}\) for length \(n_2\)).

Thus, we can compute the probability that mergeshuffle applied to a list of length \(n_1+n_2\) produces a specific outcome as follows:

\(\frac{1}{n_1!}\cdot\frac{1}{n_2!}\cdot\frac{n_1!\cdot n_2!}{(n_1+n_2)!}=\frac{1}{(n_1+n_2)!}\)

This is because the recursive calls yield two specific intermediate results with probability \(\frac{1}{n_1!}\cdot\frac{1}{n_2!}\), and given those results, the probability that the merge subroutine reaches a specific final outcome is \(\frac{n_1!\cdot n_2!}{(n_1+n_2)!}\). Since the two phases are independent (or at least we assume they are), we can just multiply the two probabilities to see that all results are equally likely.
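To make the computation concrete, take \(n_1=n_2=2\), i.e. a list of four elements:

\(\frac{1}{2!}\cdot\frac{1}{2!}\cdot\frac{2!\cdot 2!}{4!}=\frac{1}{4}\cdot\frac{4}{24}=\frac{1}{24}=\frac{1}{4!}\)

which is exactly the probability a uniform shuffle assigns to one particular permutation of four elements.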