[Libre-soc-dev] [OpenPOWER-HDL-Cores] microwatt grows up LCA2021
paulus at ozlabs.org
Tue Feb 9 23:48:57 GMT 2021
On Tue, Feb 09, 2021 at 10:48:38AM +0000, Luke Kenneth Casson Leighton wrote:
> On Tue, Feb 9, 2021 at 3:28 AM Paul Mackerras <paulus at ozlabs.org> wrote:
> > On Mon, Feb 08, 2021 at 10:42:16AM +0000, Luke Kenneth Casson Leighton wrote:
> > > btw one very important thing, it may be worthwhile to coordinate with
> > > Bunnie Huang regarding the FAILURE runs here:
> > >
> > > https://github.com/betrusted-io/betrusted-wiki/wiki/TRNG-characterization
> > >
> > > the marsaglia tests in particular have me concerned, they fail twice.
> > I'm running dieharder again now, and I got:
> > marsaglia_tsang_gcd| 0| 10000000| 100|0.09468860| PASSED
> > marsaglia_tsang_gcd| 0| 10000000| 100|0.72404443| PASSED
> > That's just one data point of course, but there doesn't seem to be an
> > immediate indication of a problem here.
> exactly, and that's the problem: dieharder doesn't do the same type of
> test-of-test-of-tests that STS does.
Actually, it does, and in fact better than STS. STS looks at the
p-value for each individual test and determines pass/fail, then
compares the number of passes with a threshold which is determined
from an alpha value. In contrast, dieharder computes the p-value for
each individual test and then does a Kolmogorov–Smirnov test on the
set of 100 or 1000 (or more) p-values, and computes a p-value for the
result of the KS test. The way STS does it loses information at the
point where the individual p-values are converted to pass/fail, and it
doesn't convert the number of passes into a p-value for the overall result.
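To illustrate the contrast, here is a minimal Python sketch using numpy and
scipy with simulated p-values; this is illustrative only, not dieharder's or
STS's actual code:

```python
import numpy as np
from scipy.stats import kstest

# Under the null hypothesis (a good generator), the per-test p-values
# are uniformly distributed on [0, 1). Simulate 100 of them.
rng = np.random.default_rng(42)
pvalues = rng.uniform(0.0, 1.0, size=100)

# STS-style: threshold each p-value at alpha and count passes.
# The thresholding discards the p-values themselves.
alpha = 0.01
passes = int(np.sum(pvalues >= alpha))

# dieharder-style: a Kolmogorov-Smirnov test of the whole set of
# p-values against the uniform distribution, giving one overall p-value.
ks_stat, overall_p = kstest(pvalues, "uniform")
print(passes, round(overall_p, 4))
```

The KS test uses the full set of p-values, which is why it retains
information that the pass/fail thresholding throws away.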
> remember: statistically speaking, failures (outliers) *are* going to
> occur. it's the *number* of failures/outliers that diehard and
> dieharder *are not telling you about*, but STS does.
So, the number of passes is a statistic, i.e. a random variate with
its own distribution, and so outliers of that value are going to occur
too.
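To make that concrete: with 100 tests at significance level alpha = 0.01, the
pass count from a perfect generator is Binomial(100, 0.99), so runs with
multiple failures are expected from time to time. A quick check with scipy
(illustrative numbers, not from any actual run):

```python
from scipy.stats import binom

# A perfect RNG still fails each test with probability alpha = 0.01,
# so the number of passes out of 100 tests is Binomial(100, 0.99).
n, p = 100, 0.99

# Probability of seeing 2 or more failures (i.e. 98 or fewer passes)
# from a perfectly good generator:
prob_2plus_failures = binom.cdf(98, n, p)
print(round(prob_2plus_failures, 3))  # roughly 0.26, about 1 in 4 runs
```

So a couple of per-test failures in a 100-test batch is unremarkable on its
own; it is the distribution of such counts over many batches that carries the
information.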
> basically that marsaglia_tsang_gcd test needs to be:
> * ported into STS
> * run 1,000 times, independently.
> * have the same test-of-test-of-tests histogram analysis run on it
> [that diehard/er is NOT doing]
That statistic above was from 100 runs, and the p-value is the overall
p-value of the KS test on the 100 individual results. The KS test is
effectively a fine-grained histogram test, so your statement about
dieharder is not correct.
Looking again at the tests Bunnie Huang did, they were done with only
536MiB of data, which is not nearly enough for a full dieharder run
even with the default numbers of tests (i.e. without using the -m flag
to increase the number of individual tests). A dieharder -a run will
consume about 250GB of random data as far as I can see, since I did a
run with 10GB of random data and it rewound the input file 24 times.
In the particular case of Bunnie's marsaglia_tsang_gcd tests, the
input file was rewound 15 times (assuming the "was rewound" messages
give the cumulative number of rewinds, not the number since the last
message). It's actually a good sign that the test failed, given that
its input was effectively the same sequence repeated 16 times.
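For reference, the rough arithmetic behind those volume estimates, assuming
"rewound 24 times" means 25 full reads of the 10GB file, and "rewound 15
times" means 16 reads of Bunnie's 536MiB file:

```python
# Estimated data consumed by a full dieharder -a run: a 10 GB input
# file rewound 24 times means it was read 25 times in total.
full_run_bytes = 10 * 10**9 * (24 + 1)        # ~250 GB

# Data seen by marsaglia_tsang_gcd in Bunnie's run: 536 MiB read
# 16 times, i.e. the same sequence repeated.
marsaglia_bytes = 536 * 2**20 * (15 + 1)      # ~8.4 GiB, mostly repeats

print(full_run_bytes // 10**9, marsaglia_bytes // 2**30)
```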
> *only then* can you be confident that - after running it not once, not
> twice, not three times, but a THOUSAND times, and performing a
> RIGOROUS statistical analysis of the results - that it's okay
> right now, saying "doesn't seem to be an indicator of a problem"
> fundamentally misses this abbbbsolutely critical point that it is the
> *test of the tests* that you need to pay attention to.
A full dieharder run on darn output with default parameters takes days
- I have one running but it hasn't finished yet. (I have a program
running on microwatt which does nothing but write random numbers
generated by darn to a socket, and I have dieharder running on a
desktop x86 box consuming those random numbers. I get about 1.7MB/s
of random numbers from microwatt over ethernet. I think the
bottleneck is probably the liteeth interface, which doesn't have any
DMA capability. Top on the microwatt system shows typically 1% user
time, 83% system time and 14% softirq time.)
Once that run is finished I'll kick one off with -m 10, which will do
1000 iterations of most tests and 10000 iterations of some, and leave
that run for a couple of weeks (unless I can figure out a way to speed
it up).
> to get the same type of analysis that is missing from diehard/er, what
> you will have to do is:
> * run diehard/er MANUALLY 1,000 times
> * note the p-values
> * go look up some mathematical papers on statistical analysis
> * MANUALLY write your own histogram testing program analysing the 1,000 p-values
> *only* then will you have achieved on-par confidence testing that STS provides.
Fortunately dieharder actually already does all that for me. :)
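The manual histogram analysis described above amounts to a second-level
uniformity test on the collected p-values. A hypothetical sketch in Python
(chi-square on a 10-bin histogram, with simulated p-values standing in for
real diehard/er output):

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical: 1000 p-values collected from 1000 independent runs.
# Under the null hypothesis they are uniform on [0, 1).
rng = np.random.default_rng(0)
pvalues = rng.uniform(0.0, 1.0, size=1000)

# Bin into 10 equal-width buckets; a good generator should give a
# roughly flat histogram of about 100 per bucket.
counts, _ = np.histogram(pvalues, bins=10, range=(0.0, 1.0))

# Chi-square test against equal expected frequencies (the default),
# yielding a single p-value for uniformity of the p-values.
stat, p_uniform = chisquare(counts)
print(round(p_uniform, 4))
```

A very small p_uniform here would indicate that the per-run p-values are not
uniformly distributed, which is the "test of the tests" signal; the KS test
dieharder applies serves the same purpose without binning.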