[Libre-soc-dev] high performance

Thu Oct 29 21:03:30 GMT 2020

On 10/29/20, Jacob Lifshay <programmerjake at gmail.com> wrote:
> On Thu, Oct 29, 2020, 12:40 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
> wrote:
>
>> On Thu, Oct 29, 2020 at 7:14 PM Jacob Lifshay <programmerjake at gmail.com>
>> wrote:
>>
>> of course it also means upping the register file port bandwidth and
>> the operand forwarding buses, which is where it gets sliiightly tricky
>> for increasing single-core performance.
>>
>
> who says we need the register file to be homogenous? we can just have the
> first 32 integer registers be in a 8r4w or something,

ah good point. still need a multi-ported regfile design though, just
less regs.  to get the "shock" over with i ran the calculation needed
for the number of SRAM banks if you do 8r4w with the XOR trick: 44
SRAM banks.

each one of those is 8x32 bytes i.e. 256 bytes, so 44 of them come to
11kbytes per regfile, times 2 for INT and FP.  total of 22kbytes for
an 8r4w XOR based regfile, excluding bypass logic to get simultaneous
read-write passthru (equivalent to OpFwd).
> and the upper 96
> registers can be 4r1w 4-way banked.

yeah from the comp.arch discussion on the reg-renaming scheme i was
leaning towards stratifying the rename matrix due to the high gate
count needed to support recognising multiple simultaneous regfile
redirection.

https://groups.google.com/g/comp.arch/c/vdgvrYGoxTM/m/LkYjZAxrBQAJ

if the renaming is also stratified *and* the DMs likewise...

https://libre-soc.org/openpower/sv/example_dep_matrices/

... then overall the gate count gets reduced right across the board.

> We could have the first 32 FP registers be 4r2w and the rest just like the
> upper 96 integer regs.

96 divided by 4 is err... 24 (whew took a while) and a FUREGs DM
and/or renamer at that width is... tolerable (where 128 isn't).
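
as an illustration of what 4-way banking of the upper 96 regs could look
like, here is a minimal sketch. the bank assignment policy is an
assumption: the thread does not say whether banks would be interleaved
(regnum modulo 4) or contiguous (24 regs in a row), so both are shown.
all names are hypothetical.

```python
LOWER_REGS = 32   # wide-ported lower group, from the thread
UPPER_REGS = 96   # banked upper group, from the thread
NUM_BANKS = 4     # "4-way banked"


def upper_reg_bank(regnum, interleaved=True):
    # returns (bank, row) for a register in the upper group (32..127).
    # assumption: both mapping policies are illustrative only, neither is
    # confirmed by the thread.
    assert LOWER_REGS <= regnum < LOWER_REGS + UPPER_REGS
    offset = regnum - LOWER_REGS
    rows = UPPER_REGS // NUM_BANKS  # 24 regs per bank
    if interleaved:
        return offset % NUM_BANKS, offset // NUM_BANKS
    return offset // rows, offset % rows


if __name__ == "__main__":
    for r in (32, 33, 127):
        print(r, upper_reg_bank(r), upper_reg_bank(r, interleaved=False))
```

either way each bank only ever sees 24 regs, which is the width a
per-bank FUREGs DM or renamer would have to cover.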

a dedicated renamer on each stratified layer would mean we definitely
couldn't cross them over back to other banks or to the lower 32 regs
though.  have to take that into consideration, what would the
consequences be.

>> > Maybe we could compete with the Intel Core 2 or Cortex A72 in
>> > single-threaded tasks :)
>>
>> muhahah oosorry.
>>
>
> I was trying to resist the urge XD
>
>>
>> yeah i don't see why not.
>>
>
> Caches. Sad but true.

most caches are designed to be single-ported because multi-porting of
CAMs is hoooorribly expensive.  they're bad enough single-ported, to
the point where LDST tends to have totally separate ones, one per FU
(stratified, yet again).

if however they are small enough (32 entries or so) then you can get
away with unary encoding of the lookup and at that point what was
formerly 10x the number of gates due to massive banks of XORs is
replaced with single AND gates.

and that starts to be ok to do multi-porting.
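
a small software model of the difference, as a sketch only. it assumes
"unary" means one-hot encoding of a 5-bit value onto 32 wires, and it
models the gates with integer bit operations rather than real hardware.
the function names are hypothetical.

```python
ENTRIES = 32  # assumption: the "32 entries or so" from the text


def binary_hit(stored, key, width=5):
    # conventional compare: XOR each bit pair, hit only if every XOR is 0
    return ((stored ^ key) & ((1 << width) - 1)) == 0


def to_unary(value):
    # unary (one-hot) encoding: exactly one of ENTRIES wires is set
    assert 0 <= value < ENTRIES
    return 1 << value


def unary_hit(stored_onehot, key_onehot):
    # with one-hot operands a plain bitwise AND is enough: a bit survives
    # only where both sides have their single 1 in the same position
    return (stored_onehot & key_onehot) != 0


def multi_port_lookup(stored_values, keys):
    # each port ANDs its own key against every stored entry, independently
    # of the other ports; returns one per-entry hit list per port
    stored = [to_unary(v) for v in stored_values]
    return [[unary_hit(s, to_unary(k)) for s in stored] for k in keys]


if __name__ == "__main__":
    # the two compare styles agree for every pair of 5-bit values
    assert all(binary_hit(a, b) == unary_hit(to_unary(a), to_unary(b))
               for a in range(ENTRIES) for b in range(ENTRIES))
    print(multi_port_lookup([3, 17, 3], [3, 9]))
```

the cost moves from comparator logic into wires (32 instead of 5 per
key), which is why this only looks attractive for small entry counts.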

l.