[Libre-soc-dev] fantastically-weird regfile
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Mon Dec 14 01:50:07 GMT 2020
On 12/14/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Sun, Dec 13, 2020, 16:12 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
>> i had an idea about the CR regfile when it's vectorised. this is one
>> of the oddest designs but there is a reason for it. see
>> https://libre-soc.org/openpower/sv/svp_rewrite/svp64/discussion/ for
>> full details
>> the issue we've got is that scalar CRs were not intended to be
>> vectorised. so in scalar OpenPOWER there's only 8 of them. they're
>> "dedicated": CR0 is for INT/Logical, CR1 for FP, CR6 for SIMD VSX.
>> so here's the problem when we apply SV, which normally applies
>> "sequential" increments: any SV-Vectorised Rc=1 INT operation is
>> *automatically* going to wipe out CR1.
>> what if, then, the numbering went:
>> 0 8 16 24 32 40 48 56
>> 1 9 17 ....
>> that way it would only be vector INT operations longer than 8 that
>> would destroy CR1.
>> that in turn would imply that the CRs would be treated as an 8x8
>> matrix, dual-ported, reading horizontally *and vertically*! which is
>> just so spectacularly weird i feel it has merit just for fits and
> That's very similar to what I already did for int and fp registers, see:
jacob, once it's been documented, the regfile stratification,
allocation, port numbering etc. can be evaluated, as well as the Dep
Matrix layout, which can be added here:
that's the task and responsibility associated with coming up with
nonlinear numbering schemes.
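
a minimal sketch of the CR numbering quoted above (0 8 16 ... on the
first row, 1 9 17 ... on the second), to make "a clear description of
what the numbering is" concrete. the function names are hypothetical,
and the rule that vector elements step horizontally along a row and
then wrap to the next row is an assumption read from "only vector INT
operations longer than 8 would destroy CR1", not a confirmed design:

```python
ROWS = COLS = 8  # the 8x8 CR matrix described in the quoted text


def grid_position(cr):
    """Return (row, col) of a CR number in the quoted layout.

    Numbers run down each column: 0..7 fill column 0, 8..15 column 1.
    """
    return cr % ROWS, cr // ROWS


def render_grid():
    """Build the 8x8 grid so that row 0 reads 0 8 16 ... 56."""
    return [[row + ROWS * col for col in range(COLS)] for row in range(ROWS)]


def vector_element_cr(base, i):
    """CR number touched by element i of a vector starting at scalar CR `base`.

    ASSUMPTION: elements step horizontally along the row of `base`
    (base must be a scalar CR, 0..7) and wrap to the next row after
    8 elements.
    """
    row = (base + i // COLS) % ROWS
    col = i % COLS
    return row + ROWS * col


if __name__ == "__main__":
    for row in render_grid():
        print(" ".join(f"{n:2d}" for n in row))
    # elements 0..7 of a vector starting at CR0 stay on row 0;
    # element 8 is the first to land on CR1.
    print([vector_element_cr(0, i) for i in range(9)])
```

under that assumption a vector starting at CR0 only reaches CR1 at its
9th element, which is the property the quoted proposal is after.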
in SV-Orig i came up with an *algorithmic* way to remap reg numbers
(!) which would have been absolutely fantastic to do Matrix
transpositions and multiplies *in-place*.
however it needs fantastically complex routing, so it was left for
another time.
the regfile allocation in the Dep Matrix example above works like
this: Vectors are stratified in multiples of 4.
as in: Vectors may *ONLY* be allocated to Vector FPs if RA%4 == RB%4
== RT%4 and all reg numbers are over 32. otherwise they are allocated
to *scalar* FUs, which have significantly less computational resources
but far greater crossbar routing.
this allows us to only need 4R1W where normally we would need to pay
someone one hell of a lot of money for a custom 12R10W SRAM block.
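
the allocation rule above can be sketched as a simple predicate. this
is only an illustration: the function and constant names are
hypothetical, and reading "over 32" as strictly greater than 32 is an
assumption (the intended boundary may differ):

```python
LANES = 4              # vectors are stratified in multiples of 4
VECTOR_REG_FLOOR = 32  # "all reg numbers are over 32" (ASSUMED strict >)


def select_fu(ra, rb, rt):
    """Return "vector" or "scalar" for an operation on regs RA, RB, RT.

    Vector allocation requires RA%4 == RB%4 == RT%4 and every register
    number to be over 32; anything else falls back to the scalar FUs.
    """
    regs = (ra, rb, rt)
    same_lane = len({r % LANES for r in regs}) == 1
    all_high = all(r > VECTOR_REG_FLOOR for r in regs)
    return "vector" if same_lane and all_high else "scalar"


if __name__ == "__main__":
    print(select_fu(36, 40, 44))  # same lane, all over 32
    print(select_fu(36, 41, 44))  # RB is in a different lane
    print(select_fu(4, 8, 12))    # same lane, but not over 32
```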
if the numbering scheme that you propose does not match with that then
it needs to come attached with a full architectural design evaluation,
starting with a clear description of what the numbering is.
following on from that description, we can make a rough time estimate
of how long it will take to do the architectural design evaluation.
if that time is excessive we stop right there. if however it is
not too great we can proceed with the architectural design assessment,
creating the DM diagrams, regfile porting, stratification, crossbar
routing and so on.