combining different instances of a random generator but still maintaining low discrepancy
I am generating about 100 million random numbers to pick from 300 things. I need to set it up so that I have 10 million independent instances (different seed) that picks 10 times each. The goal is for the aggregate results to have very low discrepancy, as in, each item gets picked about the same number of times.
The problem is with a regular prng, some numbers get chosen more than others. (tried lcg and mersenne twister) The difference between the most picked and least picked can be several thousand, to ten thousands) With linear congruity generators and mersenne twister, I also tried picking 100 million times with 1 instance and that also didn't yield uniform results. I'm guessing this is because the period is very long, and perhaps 100 million isn't big enough. Theoretically, if I pick enough numbers, the results should reach uniformity. (should settle at the expected value)
I switched to Sobol, a quasirandom generator and got much better results with the 100 million from 1 instance test. (difference between most picked and least picked is about 5) But splitting them up to 10 million instances at 10 times each, the uniformity was lost and I got similar results as with the prng. Sobol seem very sensitive to sequence - skipping ahead randomly diminishes uniformity.
Is there a class of random generators that can maintain quasirandom-like low discrepancy even when combining 10 million independent instances? Or is that theoretically impossible? One solution I can think of now is to use 1 Sobol generator that is shared across 10 million instances, so effectively it is the same as the 100 million from 1 instance test.
Both the shuffling and proper use of Sobol should give you uniformity as desired. Shuffling needs to be done at the aggregate level (start with a global 100M sample having the desired aggregate frequencies, then shuffle it to introduce randomness, and finally split into the 10 values instances; shuffling within each instance wouldnt help globally, as you noted).
But that's an additional level of uniformity, you might not really need that: randomness might be enough. First of all I would check the check itself, because it sounds strange that with enough samples you're really getting significant deviations (check "chi square test" to qualify such significance, or equivalently how many are "enough" samples). So for a first safety check: if you're picking independent values, then simplify differently to 10M instances picking 10 out 2 categories: do you get approximately a binomial distribution? For exclusive picking it's a different distribution (hypergeometric iirc, but need to check). Then generalize to more categories (multinomial distribution) and only later it's safe to proceed with your problem.