Junk DNA, and the Mersenne Twist

I woke up this morning with what seems like an insight to me…. tying two completely unrelated topics together in a (meaningful to me) way in my head.

A news article about junk DNA that I read earlier this week, and a programming problem that a friend of mine has been working on.

Since this is a programming blog (yikes – its true) I’ll start with the programming problem.

Without saying too much, he’s working on a project that has a problem solved by something resembling the shuffling a deck of cards.

The other day he sent me an email describing the issue with the shuffling that he was running into:

I thought i was smart, implementing a sophisticated card shuffling algorithm.
Our random number generator sucks, we can’t even alter the seed, so i grabbed a Mersenne Twister implementation and ported it….

But then i was doing some testing with a very limited number of types ( say, 60 items of only 5 types ). About every 10th shuffle i would get a result that looked like 11112222223332233331115555 or something! not so random, really after all…

Then i read this in the wikipedia page:"…the built-in pseudorandom number generator provided by many programming languages and/or libraries may often have only 32 bits of internal state, which means it can only produce 232 different sequences of numbers. If such a generator is used to shuffle a deck of 52 playing cards, it can only ever produce a vanishingly small fraction of the 52! â‰ˆ 2225.6 possible permutations. Thus, it’s impossible for a generator with less than 226 bits of internal state to produce all the possible permutations of a 52-card deck, and for a (reasonably) unbiased shuffle, the generator must have at least about 250 bits of state.

A related problem occurs with … random floating-point number.s.. [ they ] have only finite precision. This means that there are only a finite number of possible floating point values in any given range, and if the range is divided into a number of segments that doesn’t divide this number evenly, some segments will end up with more possible values than others. While the resulting bias will not show the same systematic downward trend as in the previous case, it will still be there. hrm… i look at the code. 32 bits of state? yep. floating point number multiply? yep. and it’s all made worse because my initial array is 11111111111122222222222233333333

There’s countless ways – no doubt – of introducing enough randomness into an algorithmic deck of cards, to give a real world feel.

We each came up with decent enough methods:

Manually reorder your initial state to hide patterns; for instance either store strings of random seeds, or at the very least store types in the initial order 12456….
Simulate the real world’s algorithm more closely: divide the “deck” into two halves, work through each half in order: interleaving the two halves by including a few cards from one half, then a few cards from the other half – zippering them together with a little bit of jitter.

It stands to wonder:

why do those things work in the real world?
Mathematically, why do they hide patterns so well?

I’m not strong in math, so a technical argument in a short blog post is beyond me, but one way to think about it is that when you shuffle a real world deck of cards — the deck itself is providing both the initial state bits, and in a way, is acting as the random number generator itself.

As a generator it comes with a large amount of built in state memory using the individual cards and their position in the deck as that memory.

This already evokes the image of DNA – long strings of types ( base pairs ) chained together.
Reproduction provides shuffling; sexual reproduction even more so by bringing together two decks.

But here’s the thing: you wouldn’t want to just shuffle the whole thing – otherwise you’d just get real world card randomness – not so great for the organ(ized)ism.

On the other end of the spectrum: If DNA faithfully reproduced itself every time – we’d have exactly the same organism everytime.

Now, if you allow for cosmic rays ( ie. any external modifications ) to randomly flip some bits – you might arrive at something more like the Mersenne Twister – but over super long periods of time.

Not random enough in a short enough period of time to produce anything interesting; maybe nothing interesting in the life time of the world.

What you want is something in between: something that allows portions of the DNA to be randomly and truely shuffled over time, but something that doesn’t affect the organism most of the time.

Junk DNA seems to fit that bill.

Allow bits that don’t mean a thing 98% of the time to shift around and randomize themselves.
But then, very occasionally, allow changes in meaningful sequences of DNA to skip the reproduction instruction pointer into the randomized junk DNA sequence.

Poof – normal non-opposable toes suddenly become semi-opposable thumbs.

Natural selection then whittles down the differentiation between commonly shared junk DNA sequences with a little bit of normal Mersenne bit flipping as it goes – into something significantly usable – what we know of as an actually opposable thumb.

Neat!