Monday, September 22, 2008

How to Shuffle an Array (Correctly)

Today at a career fair, an Amazon recruiter asked me how to shuffle an array. Sounds like an easy problem, right? It turns out a good majority of people posting code to the internet get it wrong. But before reading on, try it out yourself, see what you come up with, and then Google for "array shuffle" or "randomize array" to see what other people think.

. . .

Two approaches came to mind: iterate through all of the elements, swapping each element with some random location in the array, or iterate through all of the elements, swapping each element with some random current-or-later location in the array. If you're like me, something bugs you about both these approaches. The first one may swap a given element several times, so it seems like it does too much work. The second one swaps over an ever-decreasing range of possible positions, so it seems like it might treat later elements differently than earlier ones. On the spot at the career fair, I waffled and ended up picking the first approach just because I couldn't convince myself that the second one was valid, and I thought, you can never have too much shuffling.

Only later did I prove to myself that the current-or-later algorithm works. Think about the last element in an array of size n. What are the odds that it ends up at index 1? Obviously, 1/n. What about index 2? This is a bit trickier, since it's a compound probability: we have to multiply the odds of not swapping with 1 by the odds of then swapping with 2, to find the probability of the element actually ending up at 2. In other words, (n-1)/n * 1/(n-1) = 1/n. Similar math shows that there is an equal probability of swapping with 3, 4, etc, and you can do the same for other elements in the array besides the last.

What about the alternative, just swapping each element with any old element? It's a popular solution; see, for example: http://www.codecodex.com/wiki/index.php?title=Shuffle_an_array, which recommends implementations of this algorithm in several languages. Unfortunately, the algorithm doesn't work. Here's a mathematical argument for why, although I'd love to hear a more succinct one which avoids the number theory.

Let's start with the fact that the number of possible permutations of an n-element array is n!. So for a good shuffle, any given permutation must be chosen with probability 1/n!. What happens when we apply the swap-with-anything algorithm? On the first iteration, we can end up with n possible permutations (the results of swapping the first element with each of n locations, including itself). For every one of of these possibilities, we generate n possibilities for the location of the second element. And so on. We will end up with n*n*...*n = nn possible results. Obviously this is bigger than n!, but that's ok because we expect that some permutations can be produced in more than one way, as elements are swapped more than once. The important thing is that each of the n! permutations must appear the same number of times in our nn possible results, to ensure that they each are chosen with equal possibility. In other words, for some integer k: k*n! = nn.

But this equation is not satisfiable for all values of n. Consider n=3. Then n! = 6 and nn = 27. Since 6 does not divide 27, there is no way the 27 possible results of the algorithm can be evenly distributed amongst the 6 possible permutations, so some will be chosen more than others. In fact, the only number n such that n! divides nn is 2. The reason is that since n! includes as prime factors every prime number less than n, and nn includes only those prime factors in n, for n! to divide nn, n must include every smaller prime number as a factor. However, a product of prime numbers +1 is itself a prime number [CORRECTION: it's divisible by some prime number not in the product], so (hand waving) n can't have any prime factors, so n must be 2

The moral of this story is, don't blindly cut-and-paste code you find with Google! Unless you are implementing an online poker server, in which case, please send me the URL.

UPDATES:

As was pointed out to me by several people, the correct algorithm described above is known as the "Knuth Shuffle", or "Fisher-Yates Shuffle", and Wikipedia has an explanation very similar to mine about the dangers of mis-implementing this algorithm. I joked about online poker servers, but apparently at least one online poker server really did implement the broken algorithm described above. They were exploited by some white hats, who give an excellent explanation of why the algorithm is broken and also discuss issues with random number generation.

In the comments mook suggests an interesting variation: pick two elements at random and swap them. Repeat some large number of times. Unfortunately this approach is wrong for the same reason as the bad iterative approach. Supposing you are shuffling a deck of 52 cards: each time you choose a random element with 1/52 probability, you multiply the number of possible outcomes of the algorithm by 52. By choosing two random elements on each iteration rather than picking one deterministically, you are just squaring the number of possible outcomes. So you'll still end up with 522k possible outcomes for some k. And it's still the case that you can't divide the 52! possible permutations of the deck evenly amongst these 522k possible outcomes (check the prime factors) -- so the method must be biased.

Finally, I should mention that there is yet another broken alternative in common use: generate a random number for each element and sort the elements by these numbers. Oleg Kislyov explains with typical lucidity why this doesn't work.