Pokémon Probabilities
Porter Johnson

The recent mass distribution of cards containing pictures of Pokémon characters presents some interesting probability questions.

In a recent SMILE class Pat Phillip [Arai Middle School] posed the question of how many Pokémon cards [given randomly and with equal probability] one would have to get to have "reasonable assurance" of having a complete set of 6 cards. We drew a few cards during the class, and we were able to confirm the claim that the "average" number drawn would be about 14, and that you would generally have to get several of the same kind before acquiring a full set.
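In fact, a short calculation anticipates that average: the first card drawn is always new, and once k distinct cards are in hand the next draw yields a new one with probability (6 - k)/6, so that on the average 6/(6 - k) further draws are needed for each new card. The expected total number of draws is therefore

6/6 + 6/5 + 6/4 + 6/3 + 6/2 + 6/1 = 1 + 1.2 + 1.5 + 2 + 3 + 6 = 14.7 ,

consistent with the class estimate of about 14.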

The initial enthusiasm of getting the cards wore off after a few trials. I decided to "automate the process" by writing a computer program to do a significant number of "random trials". Here, for your potential amusement, is the FORTRAN program to make and record these trials:

c  pokemon cards
c  monte carlo simulation
c  number of "draws" to get a complete set.
      dimension icard(6),ntimes(100), ncum(0:100)
      open(unit=8,file='poke.dat',status='unknown')
c  set up random number generator
      nrand = 11
      call seed(nrand)
c  initializing counter of games to zero
      do 10 n = 1,100
10    ntimes(n) = 0
c  input: number of games [complete sets] to get
      write(6,*)' how many games do you wish to play?'
      read(5,*) kgames
      write(8,88) kgames
88    format(' num games = ', i10)
c  playing the games
      do 100 k = 1,kgames
c  k is the game number
      do 40 i = 1,6
40    icard(i) = 0
c  playing a particular game
      do 60 n = 1,100
c  n is the number of cards in a given game
      call random(x)
      xx = 6.*x +1.
      j = int(xx)
c  j is the card number received on nth draw
      icard(j) = icard(j) +1
c  icard(j) is number of times jth card has appeared
      t = icard(1)*icard(2)*icard(3)*icard(4)*icard(5)*icard(6)
c   testing for a complete set:  t is not zero
      if (t.gt.0) then
c   complete set is obtained; go to next game
      ntimes(n) = ntimes(n)+1
      goto 100
      end if
c  end of a particular game
60    continue
c  if more than 100 draws are required to get a complete set, we stop
      stop
100   continue
c  recording of the distribution of draws
      write(8,*) ' number of draws for a match'
      write(8,99) (ntimes(n), n = 1,100)
99    format(10i8)
c computing mean and standard deviation of number of draws
      sum1 = 0.
      sum2 = 0.
      do 200 n = 1,100
      fn = float(n)
      fnum = float(ntimes(n))
      sum1 = sum1 + fnum*fn
      sum2 = sum2 + fnum*fn**2
200   continue
      fgames = float(kgames)
      fmean = sum1/fgames
      fmeansq = sum2/fgames
      stddev = sqrt(fmeansq-fmean**2)
      write(8,999) fmean, fmeansq, stddev
999   format(' mean =', f10.6, /, ' meansq =', f10.6, /,
     > ' stddev =', f10.6)
c  ncum(n) is the number of games completed within n draws
      ncum(0) = 0
      do 300 n = 1,100
300   ncum(n) = ncum(n-1)+ntimes(n)
c  print of ncum(j)
      write(8,*) ' cumulative hit number'
      write(8,99) (ncum(n), n = 1,100)
c  quit and go home
      stop
      end

Here is output for a run with ten million "games". In each game I told the computer to keep drawing until it got a complete set. A record of the number N of draws needed for each set is given, as well as the cumulative hit numbers [number of times that N draws or fewer is required to obtain a complete set]:
 num games =   10000000
  number of draws for a match
       0       0       0       0       0  154776  385006  600803  750090  825776
  843029  816335  762255  690521  613102  538025  466376  400611  342242  290821
  245911  206988  174815  147081  122821  102982   85649   71363   60185   50199
   41792   35076   29128   24394   20305   16716   14039   11830    9921    8163
    6622    5673    4692    3953    3353    2869    2282    1943    1591    1315
    1073     924     750     634     548     488     368     298     230     184
     184     150     132      99      88      66      54      51      51      35
      20      24      26      11      16      12      14      12       9       2
       3       3       2       8       1       4       1       0       2       2
       0       0       0       0       1       0       0       0       0       1
 mean = 14.701120
 meansq =255.126600
 stddev =  6.245292
  cumulative hit number
       0       0       0       0       0  154776  539782 1140585 1890675 2716451
 3559480 4375815 5138070 5828591 6441693 6979718 7446094 7846705 8188947 8479768
 8725679 8932667 9107482 9254563 9377384 9480366 9566015 9637378 9697563 9747762
 9789554 9824630 9853758 9878152 9898457 9915173 9929212 9941042 9950963 9959126
 9965748 9971421 9976113 9980066 9983419 9986288 9988570 9990513 9992104 9993419
 9994492 9995416 9996166 9996800 9997348 9997836 9998204 9998502 9998732 9998916
 9999100 9999250 9999382 9999481 9999569 9999635 9999689 9999740 9999791 9999826
 9999846 9999870 9999896 9999907 9999923 9999935 9999949 9999961 9999970 9999972
 9999975 9999978 9999980 9999988 9999989 9999993 9999994 9999994 9999996 9999998
 9999998 9999998 9999998 9999998 9999999 9999999 9999999 9999999 999999910000000

We see that the mean number of draws is about 14.70, and the standard deviation about that mean is about 6.25, so that between 7 and 21 draws are required to get a full set most of the time. Reading off the cumulative hit numbers [e.g., 5138070 / 10000000 = 51.4 %], it is about 5.4 % probable to have a full set after 7 draws, about 51.4 % probable after 13 draws, and about 87.3 % probable after 21 draws. However, the distribution of draws is quite "skewed": the most likely number of draws required is 11, which occurs about 8.4 % of the time. Of the ten million computer-generated games, one required 100 draws to get a full set of cards.

The Monte Carlo approach to problem solving was first employed in large-scale digital computations shortly after World War II by Stanislaw Ulam, an applied mathematician active in the Manhattan Project, soon after he recovered from a very serious illness. There is no better introduction to the subject than this paragraph, taken from his autobiography, Adventures of a Mathematician [U California Press 1991; ISBN 0-520-07154-9], pp 196-197.

The idea for what was later called the Monte Carlo method occurred to me when I was playing Solitaire during my illness. I noticed that it may be much more practical to get an idea of the probability of the successful outcome of a Solitaire game (like Canfield or some other where the skill of the player is not important) by laying down the cards, or experimenting with the process, and merely noticing what proportion comes out successfully; rather than to try to compute all the combinatorial possibilities which are an exponentially increasing number so great that, except in very elementary cases, there is no way to estimate it. This is intellectually surprising, and if not exactly humiliating, it gives one a feeling of modesty about the limits of rational or traditional thinking. In a sufficiently complicated problem, actual sampling is better than examining all the chains of possibilities.

How do we understand these numbers, and how do we begin to compute the odds when the chances of drawing the various cards are not the same, as is surely the case for Pokémon cards given with Happy Meals at McDonald's?

Let us suppose that the Pokémon cards #1, #2, #3, #4, #5, #6 are generated with probabilities p1, p2, p3, p4, p5, p6, respectively. Each probability pi is positive, and the total probabilities must add to 1:

p1 + p2 + p3 + p4 + p5 + p6 = 1.

It is also true that, for any positive integer N,

[p1 + p2 + p3 + p4 + p5 + p6]^N = 1 .

It is remarkable, but true, that if you write out the algebraic expression for

G_N(pi) = [p1 + p2 + p3 + p4 + p5 + p6]^N

as a sum of monomials in pi, the probability that, in N draws, you get card #1 n1 times; card #2 n2 times; card #3 n3 times; card #4 n4 times; card #5 n5 times; and card #6 n6 times, respectively, is given by the monomial term occurring in the above expression; ie

N! / [n1! n2! n3! n4! n5! n6!] * p1^n1 p2^n2 p3^n3 p4^n4 p5^n5 p6^n6
where
n1 + n2 + n3 + n4 + n5 + n6 = N

in that polynomial, G_N(pi). We wish to determine the probability that, in N draws, we get a complete set of cards. That probability is obtained by picking out the terms in the expression for G_N(pi) that contain all six pi, and dropping the terms that do not. The concept is simple in principle, but tedious in practice!
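For example, with N = 6 the only monomials that contain all six pi are those with every ni equal to 1, so the probability of completing the set in exactly six draws is

6! / [1! 1! 1! 1! 1! 1!] * p1 p2 p3 p4 p5 p6 = 720 p1 p2 p3 p4 p5 p6 ,

which for equal probabilities pi = 1/6 becomes 720/6^6, or about 0.0154, in agreement with the simulation [154776 hits out of ten million] and with the table given below.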

When the dust clears, we obtain the following expression for the probability of having a complete set after N trials:

P(N) = [p1 + p2 + p3 + p4 + p5 + p6]^N
- [p1 + p2 + p3 + p4 + p5]^N - ...
+ [p1 + p2 + p3 + p4]^N + ...
- [p1 + p2 + p3]^N - ...
+ [p1 + p2]^N + ...
- [p1]^N - ...

In this expression the first line is complete; in the second line there are 6 terms that involve any five of the pi; in the third line there are 15 terms that involve any four of the pi; in the fourth line there are 20 terms that involve any three of the pi; in the fifth line there are 15 terms that involve any two of the pi; and in the sixth line there are 6 terms that each involve a single pi. Each line corrects the overcounting of the one before it: a five-card term is the probability that all N draws avoid one particular card, so subtracting these removes the incomplete sets; the four-card terms restore the draws confined to four cards, which were subtracted twice; and so on, in the familiar inclusion-exclusion pattern. That is the answer, no matter what N is and what the individual pi happen to be.
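The bookkeeping in this expression is easy to hand over to the computer. Here is a minimal FORTRAN sketch, in the same spirit as the program above but not part of it, which evaluates P(N) for an arbitrary set of card probabilities by summing over all 64 subsets of the six cards. The probabilities in the DATA statement are invented for illustration only; they are not actual Happy Meal odds, and need only add to 1.

c  direct evaluation of p(n) for unequal card probabilities
c  [a sketch; the probabilities below are made up]
      dimension p(6)
      data p /0.30, 0.20, 0.15, 0.15, 0.10, 0.10/
      write(6,*)' number of draws n?'
      read(5,*) n
      prob = 0.
c  loop over all 2**6 = 64 subsets of the six cards;
c  bit i-1 of mask tells whether card i is in the subset
      do 20 mask = 0,63
      psub = 0.
      m = 0
      do 10 i = 1,6
      if (mod(mask/2**(i-1),2).eq.1) then
      psub = psub + p(i)
      m = m + 1
      end if
10    continue
c  inclusion-exclusion sign: plus for even 6-m, minus for odd
      prob = prob + (-1.)**(6-m)*psub**n
20    continue
      write(6,*)' probability of a complete set =', prob
      stop
      end

With all six probabilities set to 1/6, this fragment reproduces the equal-probability values tabulated below.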

We can simplify the answer considerably for the case in which the 6 cards are equally probable, by setting pi = 1/6. We obtain that, for any N,

P(N) = 1 - 6 * [5/6]^N + 15 * [4/6]^N - 20 * [3/6]^N + 15 * [2/6]^N - 6 * [1/6]^N

I have written an EXCEL program to determine these numbers and to draw a graph, which is shown below. The numbers agree with those obtained in the Monte Carlo simulation to about four significant figures. Here is a listing of the numbers for small values of N:

N P(N) N P(N) N P(N) N P(N)
1 .0000 11 .3562 21 .8726 31 .9790
2 .0000 12 .4378 22 .8933 32 .9825
3 .0000 13 .5139 23 .9108 33 .9854
4 .0000 14 .5828 24 .9254 34 .9878
5 .0000 15 .6442 25 .9388 35 .9899
6 .0154 16 .6980 26 .9480 36 .9915
7 .0540 17 .7446 27 .9566 37 .9930
8 .1140 18 .7847 28 .9638 38 .9941
9 .1890 19 .8189 29 .9698 39 .9950
10 .2718 20 .8480 30 .9748 40 .9959
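
The same numbers can also be obtained without a spreadsheet. The following few lines of FORTRAN, offered as a sketch of a calculation equivalent to the EXCEL one, evaluate P(N) from the simplified formula for N = 1 to 40:

c  p(n) for six equally probable cards, n = 1 to 40
c  [a sketch equivalent to the excel calculation above]
      do 10 n = 1,40
      p = 1. - 6.*(5./6.)**n + 15.*(4./6.)**n - 20.*(3./6.)**n
     >    + 15.*(2./6.)**n - 6.*(1./6.)**n
      write(6,20) n, p
10    continue
20    format(i4,f8.4)
      stop
      end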


[Pokemon Curve]