Description of the Physically Random Numbers on this CDROM

This document describes how the files of random bytes contained on this CDROM were generated, and why we believe that they are, in a mathematically definable sense, "perfectly" random. (Note, however, that we do not warrant this randomness in any legal respect.)

The files are produced in two phases: a "collection" phase, in which raw files of random bits are collected from a device with a great deal of physical randomness; and a "perfection" phase, in which several raw files are overlaid and shuffled in a manner designed to saturate (in a definable sense) the randomness of the resulting file. We discuss each of these phases in turn.

Collection Phase

Professor Paul Horowitz, of Harvard University, kindly constructed for us an electronic source of physical randomness. The analog noise source is a transistor junction that is biased to function as an avalanche diode. Physically, the noise current from this junction is generated by the random creation of electron-hole pairs in the junction, due to thermal noise (and, ultimately, quantum mechanical processes). Experimentally, the output of the device, viewed with a spectrum analyzer, is flat from close to D.C. out to 50 MHz.

The "Hororan" circuit (shown in an attached document) samples the amplified analog voltage from the noise junction at a rate that is 8 times the desired baud rate for output (the latter typically 38.4 Kbps or 115.2 Kbps). The sample duration ("aperture") is very short, about 2 nanoseconds for the LT1016 latched comparator, so there is plenty of aliasing of the higher available frequencies into the sampled value. If the sample voltage is positive, a digital "1" bit is generated; if it is negative, a digital "0" bit is generated. The collected bits are continuously exclusive-or'd (XOR's) into a register. After every 8 bits are collected, the state of the register is output as one bit in the raw file. After every eight output bits, the next two output bits are discarded during the formatting of the required RS232 stop and start bits. The digital XOR and start/stop formatting functions are performed by a 26V12 PAL, whose programming is included here.

For a readable account of the sort of analog and digital electronics used in this circuit, see The Art of Electronics, by Horowitz and Hill, Cambridge University Press, 1989, ISBN 0-521-37095-7.

It is intentional that we do not do any more complicated digital processing (mixing, scrambling, encrypting, etc.) within the Hororan box, because we want to be able to measure, and characterize, the degree of randomness coming out of the box. The box indeed has a measurable nonrandomness in its 1-point statistic. That is, the numbers of output 1's and 0's are not exactly equal, but differ by, typically, a few parts in 10^4. (Indeed, the box has a trim adjustment to minimize this nonrandomness.) Notice that it requires several times 10^8 collected bits even to measure this bias convincingly -- but we have satisfied ourselves that it is present. The bias drifts slowly with time, presumably in response to thermal and other environmental changes.
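
The size of the required sample follows from simple counting statistics: for fair bits, the count of 1's fluctuates by about sqrt(N)/2, so a fractional bias of a few parts in 10^4 exceeds the noise only when N is of order 10^8 or more. The following sketch, with an assumed bias of 2 parts in 10^4, shows the expected significance of the measurement.

    PROGRAM bias_detect
    IMPLICIT NONE
    REAL(KIND(1.d0)) :: n,delta,z
    delta=2.d-4                   ! assumed bias: a few parts in 10^4
    n=5.d8                        ! several times 10^8 collected bits
    ! the expected excess of 1's over N/2 is N*delta; its standard
    ! deviation under the null (fair bits) is sqrt(N)/2, so
    z=2.d0*delta*sqrt(n)          ! about 9 sigma for these values
    write (*,*) 'significance (sigma):',z
    END PROGRAM bias_detect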

We are unable to find, in numerical experiment, any trace of nonrandomness in the higher-point statistics of the raw collected bits. In particular, we have not been able to find (in several times 10^9 collected bits) any 2-point nonrandomness, either in the autocorrelation at small lags, or in the power spectrum at likely frequencies such as 60 Hz or its first few harmonics. On the basis of the physical construction of the Hororan box, we believe that M-point statistics with M > 2 (with the exception of high-point statistics that are the result of slow drifts in the 1-point bias) should be strongly decreasing with M. The reason is essentially that the transistor junction has no "memory" and has an internal timescale on the order of the reciprocal of its 50 MHz bandwidth.
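
The 2-point test at small lags is conceptually just the following (a sketch, assuming the raw bits have been unpacked into an integer array of 0's and 1's):

    SUBROUTINE autocorr(bits,maxlag,c)
    IMPLICIT NONE
    INTEGER, INTENT(IN) :: bits(:)      ! raw bits, unpacked as 0/1
    INTEGER, INTENT(IN) :: maxlag
    REAL, INTENT(OUT) :: c(maxlag)
    INTEGER :: n,lag,i,agree
    REAL :: m
    n=size(bits)
    do lag=1,maxlag
        agree=0
        do i=1,n-lag
            if (bits(i) == bits(i+lag)) agree=agree+1
        enddo
        ! for random bits the agreements are binomial, with mean m/2
        ! and standard deviation sqrt(m)/2, so each normalized c(lag)
        ! should be distributed as N(0,1)
        m=real(n-lag)
        c(lag)=(real(agree)-0.5*m)/(0.5*sqrt(m))
    enddo
    END SUBROUTINE autocorr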

Thus, while the output of the Hororan box is demonstrably non-random, we believe that its true "entropy per bit" is convincingly bounded as being close to 1 (the fully random value). If 1-e denotes the entropy per bit in a large file of N bits (a more detailed definition is given below), then we think that, with a high degree of experimental certainty (there is no way to "prove" this mathematically), we have e < 0.01.
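
Note that the measured 1-point bias by itself accounts for almost none of this allowance: for a bit that is 1 with probability 1/2 + delta, the entropy per bit is 1 - 2 delta^2/ln 2 + O(delta^4), so a bias of a few parts in 10^4 contributes a deficit of only about 10^-7. The following worked check (with an illustrative delta) confirms the arithmetic; the margin in e < 0.01 is thus an allowance for possible higher-point effects.

    PROGRAM one_point_entropy
    IMPLICIT NONE
    REAL(KIND(1.d0)) :: delta,p,q,h
    delta=3.d-4                          ! "a few parts in 10^4"
    p=0.5d0+delta
    q=0.5d0-delta
    h=-(p*log(p)+q*log(q))/log(2.d0)     ! entropy per bit, in bits
    write (*,*) 'deficit from 1-point bias:',1.d0-h   ! about 2.6e-7
    END PROGRAM one_point_entropy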

Incidentally, the raw output of the Hororan box easily passes all of the tests in Marsaglia's "Diehard battery of tests".

Perfection Phase

Each supplied random file of length 16777216 bytes (2^27 bits) is produced by overlaying four of the above-described raw files, each also of length 16777216 bytes, according to the following procedure:

1. A sequence of raw Hororan output is reserved for use as one-time key bits for triple DES (Data Encryption Standard) encryption. These bits are not used for any other purpose. (Below, the term "new key" will mean that the DES keys are refreshed with the next consecutive bits from this reserved sequence.)

2. A pseudo-random file of length 16777216 bytes is generated by applying triple DES encryption (new key) to a deterministically generated sequence that is unique to the particular file being generated.

3. The file is now "shuffled" (more precisely: encrypted by a one-to-one map of the space of 16777216-byte files onto itself) by a butterfly algorithm, much like the data flow in a fast Fourier transform, that uses triple DES to make every bit of the output file depend, through encryption, on every bit of the input file. A complete characterization of the shuffle is the following Fortran 90 routine. (Unit 26 reads the reserved sequence of step 1, above.)

    SUBROUTINE shuffle(iarr)
    ! Butterfly shuffle: pairs words nb2 apart within blocks of length
    ! nb, triple-DES encrypts each pair with fresh keys, then halves nb
    ! and repeats down to blocks of length 2, so that every word of the
    ! output depends, through encryption, on every word of the input.
    ! (I4B is a 4-byte integer kind; the f_des_* routines are the DES
    ! primitives.)
    INTEGER(I4B), DIMENSION(:) :: iarr
    INTEGER(I4B), DIMENSION(32) :: sched1,sched2,sched3
    INTEGER(I4B), DIMENSION(2) :: key1,key2,key3,inp,out
    INTEGER(I4B) :: dum,nb,nb2,jb,j,n
    n=size(iarr)
    nb=n
1   nb2=nb/2
    read (26) key1,key2,key3      ! "new key" from the reserved sequence
    dum=f_des_fixkey(key1)        ! condition each key, then expand it
    dum=f_des_sched(key1,sched1)  ! into a DES key schedule
    dum=f_des_fixkey(key2)
    dum=f_des_sched(key2,sched2)
    dum=f_des_fixkey(key3)
    dum=f_des_sched(key3,sched3)
    do jb=1,n,nb                  ! loop over the blocks of length nb
        do j=0,nb2-1              ! pair each word with the one nb2 away
            inp(1)=iarr(jb+j)
            inp(2)=iarr(jb+j+nb2)
            dum=f_des_encrypt(inp,out,sched1)   ! triple DES: three
            dum=f_des_encrypt(out,inp,sched2)   ! encryptions with three
            dum=f_des_encrypt(inp,out,sched3)   ! independent keys
            iarr(jb+j)=out(1)
            iarr(jb+j+nb2)=out(2)
        enddo
    enddo
    nb=nb/2                       ! halve the butterfly span
    if (nb.gt.1) goto 1
    END SUBROUTINE shuffle

4. A raw file of 16777216 physically random bytes is bitwise XOR'd with the result of the previous step.

5. Steps 3. and 4. are repeated, in turn, for three additional stages. At each stage a new key is used in the shuffles, and a new physically random file is XOR'd to get the output result. The last step of this iteration is thus an XOR with a raw random file; however, the file would be no less random if the final step were an additional shuffle (since the shuffle is a one-to-one map). The whole procedure is sketched in code below.
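
Putting steps 2 through 5 together, the perfection phase as a whole looks like the following. This is expository glue only: read_raw_file is a hypothetical routine standing in for reading the next raw Hororan file, shuffle is the routine listed above (an explicit interface for it, and a default 4-byte integer kind for I4B, are assumed).

    SUBROUTINE perfect(iarr)
    IMPLICIT NONE
    INTEGER, DIMENSION(:), INTENT(INOUT) :: iarr   ! the working file
    INTEGER, DIMENSION(size(iarr)) :: raw
    INTEGER :: stage,i
    ! on entry, iarr holds the triple-DES pseudo-random file of step 2
    do stage=1,4                       ! steps 3+4, then 3 more stages
        call shuffle(iarr)             ! butterfly triple-DES shuffle
        call read_raw_file(stage,raw)  ! hypothetical: next raw file
        do i=1,size(iarr)
            iarr(i)=ieor(iarr(i),raw(i))   ! bitwise XOR overlay
        enddo
    enddo
    END SUBROUTINE perfect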

It should not be surprising that the final files pass all tests for randomness that we have tried, including all of the Marsaglia "Diehard" tests. (In fact, the starting pseudo-random file passes all these tests, and things should only get better from there.)

Discussion

Suppose we have 1-e entropy per raw bit of N bits, in the sense that the filled part of the 2^N state space is only 2^[(1-e)N] states. Then the filled fraction of states is the quotient of these expressions, namely 2^(-eN), which may be small for large enough N. This is the problem that must be overcome.

Suppose that we exclusive-or (XOR) a random file drawn from one of the filled states with another random file, drawn from an entirely independent set of 2^K states out of the full 2^N. Then each of the 2^K "starting points" helps the result to fill in more of the full 2^N total states. If there were just two starting points (K=1), the missed fraction would be the square of the initially missed fraction. By similar reasoning, we can estimate the missed fraction of states for K>1 as

         (1 - 2^(-eN))^(2^K)  =  (approximately)  exp(-2^(K-eN)),

where we have used the fact that (1-x)^m is approximately exp(-mx) for small x.
If we require this to be much less than 2^(-N), say 200 powers of 2 less, then we can be reasonably certain that not a single one of the 2^N states is missing. The condition for this is
           exp(-2^(K-eN))  <  2^(-N-200)
which implies simply K > eN + log[(N+200) ln 2], where log denotes logarithm base 2 and ln denotes natural logarithm.

We see that, up to a modest additive logarithmic term, the number K of new bits of entropy that we need in the new random file is only about eN, about the same number that were "missing" from the original file. However, the procedure described above, where the second file has the same randomness as the first file, has K=(1-e)N, which is vastly more than needed as long as e (which we estimated above as almost certainly < 0.01) is < 0.5.
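
Plugging in the numbers for one supplied file (N = 2^27 bits, with the conservative e = 0.01 from above) makes the margin explicit:

    PROGRAM margin
    IMPLICIT NONE
    REAL(KIND(1.d0)) :: n,e,kneed,khave
    n=2.d0**27                         ! bits per supplied file
    e=0.01d0                           ! conservative entropy deficit
    ! required: K > eN + log2[(N+200) ln 2]
    kneed=e*n+log((n+200.d0)*log(2.d0))/log(2.d0)
    khave=(1.d0-e)*n                   ! entropy K of the XORed raw file
    write (*,*) 'K required:',kneed    ! about 1.3 x 10^6
    write (*,*) 'K supplied:',khave    ! about 1.3 x 10^8, ~100x more
    END PROGRAM margin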

Furthermore, we repeat the whole XOR procedure three extra times, essentially out of superstition, or to allow for some flexibility in the following point:

The remaining unproved mathematical issue, of course, is whether the triple DES shuffling procedure spreads the 2^[(1-e)N] states of a raw random file sufficiently uniformly over the full state space so that the above calculation applies. Because no correlation between input and output of triple DES (or even single DES) has ever been found; and because our shuffle involves many stages of DES with independent keys, so that every output bit depends on every input bit; and because even our slightly non-random raw input files pass standard high-quality tests for randomness -- there is little room for any doubt that the supplied files are, in practice, "good enough". Nevertheless, it is an interesting mathematical question, on which we are doing further work, as to whether they are in fact "perfect" in the sense that the above calculation indicates.

William H. Press
Harvard University
May 2, 1996