Monday, September 22, 2008

How to Shuffle an Array (Correctly)

Today at a career fair, an Amazon recruiter asked me how to shuffle an array. Sounds like an easy problem, right? It turns out a good majority of people posting code to the internet get it wrong. But before reading on, try it out yourself, see what you come up with, and then Google for "array shuffle" or "randomize array" to see what other people think.

. . .

Two approaches came to mind: iterate through all of the elements, swapping each element with some random location in the array, or iterate through all of the elements, swapping each element with some random current-or-later location in the array. If you're like me, something bugs you about both these approaches. The first one may swap a given element several times, so it seems like it does too much work. The second one swaps over an ever-decreasing range of possible positions, so it seems like it might treat later elements differently than earlier ones. On the spot at the career fair, I waffled and ended up picking the first approach just because I couldn't convince myself that the second one was valid, and I thought, you can never have too much shuffling.

Only later did I prove to myself that the current-or-later algorithm works. Think about the last element in an array of size n. What are the odds that it ends up at index 1? Obviously, 1/n. What about index 2? This is a bit trickier, since it's a compound probability: we have to multiply the odds of not swapping with 1 by the odds of then swapping with 2, to find the probability of the element actually ending up at 2. In other words, (n-1)/n * 1/(n-1) = 1/n. Similar math shows that there is an equal probability of swapping with 3, 4, etc, and you can do the same for other elements in the array besides the last.

What about the alternative, just swapping each element with any old element? It's a popular solution; see, for example: http://www.codecodex.com/wiki/index.php?title=Shuffle_an_array, which recommends implementations of this algorithm in several languages. Unfortunately, the algorithm doesn't work. Here's a mathematical argument for why, although I'd love to hear a more succinct one which avoids the number theory.

Let's start with the fact that the number of possible permutations of an n-element array is n!. So for a good shuffle, any given permutation must be chosen with probability 1/n!. What happens when we apply the swap-with-anything algorithm? On the first iteration, we can end up with n possible permutations (the results of swapping the first element with each of n locations, including itself). For every one of of these possibilities, we generate n possibilities for the location of the second element. And so on. We will end up with n*n*...*n = nn possible results. Obviously this is bigger than n!, but that's ok because we expect that some permutations can be produced in more than one way, as elements are swapped more than once. The important thing is that each of the n! permutations must appear the same number of times in our nn possible results, to ensure that they each are chosen with equal possibility. In other words, for some integer k: k*n! = nn.

But this equation is not satisfiable for all values of n. Consider n=3. Then n! = 6 and nn = 27. Since 6 does not divide 27, there is no way the 27 possible results of the algorithm can be evenly distributed amongst the 6 possible permutations, so some will be chosen more than others. In fact, the only number n such that n! divides nn is 2. The reason is that since n! includes as prime factors every prime number less than n, and nn includes only those prime factors in n, for n! to divide nn, n must include every smaller prime number as a factor. However, a product of prime numbers +1 is itself a prime number [CORRECTION: it's divisible by some prime number not in the product], so (hand waving) n can't have any prime factors, so n must be 2

The moral of this story is, don't blindly cut-and-paste code you find with Google! Unless you are implementing an online poker server, in which case, please send me the URL.

UPDATES:

As was pointed out to me by several people, the correct algorithm described above is known as the "Knuth Shuffle", or "Fisher-Yates Shuffle", and Wikipedia has an explanation very similar to mine about the dangers of mis-implementing this algorithm. I joked about online poker servers, but apparently at least one online poker server really did implement the broken algorithm described above. They were exploited by some white hats, who give an excellent explanation of why the algorithm is broken and also discuss issues with random number generation.

In the comments mook suggests an interesting variation: pick two elements at random and swap them. Repeat some large number of times. Unfortunately this approach is wrong for the same reason as the bad iterative approach. Supposing you are shuffling a deck of 52 cards: each time you choose a random element with 1/52 probability, you multiply the number of possible outcomes of the algorithm by 52. By choosing two random elements on each iteration rather than picking one deterministically, you are just squaring the number of possible outcomes. So you'll still end up with 522k possible outcomes for some k. And it's still the case that you can't divide the 52! possible permutations of the deck evenly amongst these 522k possible outcomes (check the prime factors) -- so the method must be biased.

Finally, I should mention that there is yet another broken alternative in common use: generate a random number for each element and sort the elements by these numbers. Oleg Kislyov explains with typical lucidity why this doesn't work.

18 comments:

lucas said...

I think you complicated the shuffle with the backwards swap. If you think about it, you can get the n! permutations just swapping with the forward elements.
The function would increase the position in the array in each iteration and each time you would have n-k posible values for k between 0 and n.

Laurie Cheers said...

Specifically: when you do it with three elements ABC, the flawed algorithm has 18 possible decision paths.

Since the list can be arranged 6 possible ways, there should be 3 paths to get each arrangement: but actually there are four paths to get CBA, and two to get BCA.

Unknown said...

OK – I’m pretty dim with maths, and that may become hillariously obvious, but I remember trying to do this for a card game many moons ago, and I reasoned thus;

What you want is for each item to swap randomly with potentially each other item so: first randomize the posible options a=(Rand(52)), then randomize the choices b=(Rand(52)), next chose items from the list in the order (a), and set them at positions (b).

The real problem is what you use as a randomizer, since they all (AFAIK) have some ultimately crackable seed, but that detail aside, this does seem to me to give a 1-n (52, in the EG) chance of each item arriving in each potential destination, and with only 1 sort.

Now tell me what a fool I am. I can take it.

Rick said...

Look up Fisher-Yates. I'm sure there are other canned methods that are equally good, maybe the NIST algorithms site would have them.

Sean B said...

Think about it this way: the absolutely best way to shuffle is to throw all the cards on the floor and then pick up individual cards at random and add them back to the deck. As you pick up cards, the number of cards to choose from on the floor shrinks, but in this case it is obvious that that would not make them any less random.

Adrian Quark said...

Apparently I need to work on the clarity of my writing.

lagenar: the "backwards swap" is not just overcomplicated, it's incorrect, which was the point of this post.

laurie cheers: the flawed algorithm actually has 27 possible decision paths (3*3*3=27).

rick: the correct algorithm I described is essentially Fisher-Yates, although I didn't know the name of it so thanks for pointing that out!

mook: Unfortunately I think that method can be proven biased in the same way as the iterate-and-swap-with-any-random-element method. But this is an interesting example so I'll add it to the post.

Unknown said...

The product of primes +1 is not necessarily prime. In fact, the product of any set of odd primes +1 is divisible by two. It is a common misconception that Euclid's proof of the infinitude of primes states that (p1*...*pn)+1 is prime when in fact it only uses the property that there exists a prime dividing the product.

Tim Gebhardt said...

The most evenly distributed way to do it is just assign each card a random number and sort the deck by the random number.

Unknown said...

sbaker48's method is pretty much what I chose when I had to write one. Pull a card at random out of a list and put it at the end of some other list. Not terribly efficient memory-wise, incredibly simple to implement without thinking about it.

But isn't that exactly what the Fisher-Yates/Knuth algorithm is doing? They're just being efficient about it, the first elements of the array are the second randomized list. They build the list in-place using the space that's no longer needed.

Dave C said...

But doesn't your criticism rely on the fact that your initial array must always be in the same order?

Sure, if you always generate a shuffled from list from the same initial list (ie: 1,2,3,4,5,6,etc) then your analysis is correct. However, if you start with a previously shuffled list each time you even out 'duplicated permutations' it works.

eg (pardon the mishmash of pseudo code):

shuffledDeck = (1,2,,4,5,6,...)
while (playingGame)
shuffledDeck = randomize(shuffledDeck)
doGameStuffAndCreateWinner()
playingGame = gamerWantsToPlayAgain()
wend

Jieren Chen said...

Hmm... is there anything wrong with the naive implementation of creating an input array (the array you want to shuffle) and an output array, selecting a random number from the input array and setting it to a random position in the output array, and then repeating for every unshuffled number in the input array?

Pythor said...

Jieren,
What happens when your randomly selected position in the output array already contains a previously placed input element?

Scott Swank said...

A simpler proof of the correctness of the Fisher-Yates shuffle is inductive. Clearly it works for n=1. Consider N>1. The swap of the 1st element with a random element produces a random 1st element and leaves an N-1 array to sort.

If we can always shuffle an array of size n=1 and we can always reduce shuffling n>1 to shuffling n-1 then by induction we can shuffle any n>=1.

Kamil Szot said...
This comment has been removed by the author.
Kamil Szot said...

Correct algorithm is easiest for me to believe in when I think of it as kind of selection sort.

In each step I pick one random element from the poll (random, not minimal as in selection sort) and I add it to result sequence.

Since I keep both, the poll and the result sequence in the same array I may have to make room for this new element if it's not already in place. So I swap k-th element with element chosen randomly. I change order of elements in the poll by this, but it doesn't matter, since I'm selecting from poll randomly.

So after each swap I have two parts of my array, k elements already shuffled, and n-k elements in the poll to pick from.

Random picking does all the work, and swapping is just used to make sure result sequence and the poll don't mix together.

Unknown said...

Do you think this was good interview question? I can think of three outcomes:
a) You knew the algorithm from prior experience and it's just testing your memory.
b) You are able to derive an algorithm from scratch.
c) You have no idea what to do.
It seems like they're hoping for someone who can do "b". However, there are lots of algorithms and I'm not convinced that everyone can can do every one during an interview.

Alexander Fairley said...

1) Generate a permutation from S_n
(by selecting without replacement n times from integers 1-n)
2) Provide the cards in that order.
What's the big deal?

Jieren Chen said...

Hey Pythor... sorry for taking so long to get back to you...

Subsequent selections of the output locations do not include already selected output locations as a valid output.

So you could randomly select any of the n elements of the input array and any of the n output locations of the output array. However, the next time you would have n-1 elements to select from the input array and n-1 locations to output to in the output array.

I did some rough calculations and surprisingly... this results in a uniform distribution where every possible output array has an equal probability of occurrence.

Of course... it may be correct but not efficient so I'm curious why this algorithm didn't get any play in the discussion. :-)