wait... wait... arrgh no that doesn't quite work, because in some cases you
actually want 4 bits of the predicate mask to go to the SIMD-capable ALU,
sometimes you want 2 bits (for 2xFP32), sometimes 1 bit (for 1xFP64) and so
even an 8-bit subdivision is going to be sub-optimal.
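
to make the arithmetic concrete, here's a rough sketch of how the number of predicate bits per ALU issue changes with element width. it assumes a 64-bit-wide SIMD ALU (an assumption for the example only), and the names `ALU_WIDTH`, `pred_bits_per_issue` and `split_mask` are made up for illustration, not taken from any real code.

```python
# illustrative sketch only: ALU_WIDTH and both function names are hypothetical
ALU_WIDTH = 64  # assumed width of the SIMD-capable ALU, in bits


def pred_bits_per_issue(elwidth):
    """number of predicate bits one ALU issue consumes at this element width"""
    return ALU_WIDTH // elwidth


def split_mask(mask, vl, elwidth):
    """slice the low `vl` bits of `mask` into per-ALU-issue chunks"""
    n = pred_bits_per_issue(elwidth)
    chunks = []
    for start in range(0, vl, n):
        width = min(n, vl - start)  # last chunk may be short
        chunks.append((mask >> start) & ((1 << width) - 1))
    return chunks
```

so the same 8-bit slice of the mask feeds two issues at 16-bit elements, four at FP32 and eight at FP64, which is why no single fixed subdivision fits every case.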


haha.  you're going to find this amusing / ironic: this is precisely where
using CRs as predicate masks would shine.

the load on the DMs would be horrendous unless we worked out a way to "batch"
them.  and funnily enough, i've already implemented 8xCR "whole_reg" reading
(and raised a bugreport to implement that "cascade" system when it comes to
adding the DMs).

