[Libre-soc-dev] high performance

Luke Kenneth Casson Leighton lkcl at lkcl.net
Thu Oct 29 19:40:07 GMT 2020


On Thu, Oct 29, 2020 at 7:14 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> I was thinking, when we build the final quad core processor, I think it'd
> be a really good idea to make 1 of the 4 cores be much better at
> single-threaded tasks:

yeah this kinda came up on the phoronix thread (the one that's still
going) :)  to adjust the cores, some to be Monster SIMD backends.
however what you're proposing is slightly different.

> I'm thinking something like fetching 128-bit chunks and decoding and
> issuing 3-4 instructions per clock cycle. We should mostly have the
> execution hardware needed already since it's needed for vector tasks, so
> all we'd have to do is make the frontend wider.

yeah exactly, that's where the DMs (Dependency Matrices) are going:
because the bitfields are unary it's easy to set multiple bits at
once, and then "multi-issue" is a matter of adding transitive (extra,
cumulative) DM relationships between the set of instructions being
executed in the same batch.

by that i mean (and i got this from Mitch), you make instr2 "depend"
on instr1's registers, and you make instr3 depend on *both* instr2 and
instr1's registers, set multiple bits in the DMs simultaneously
(because they're entirely unary encoded) and that apparently is enough
to do multi-issue execution.

real simple.
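
a minimal sketch of the cumulative-dependency idea, as a plain python model rather than HDL. everything named here (unary, issue_batch, the register numbers) is hypothetical and only illustrates the description above; it tracks read-after-write hazards only and is not the actual DM design.

```python
def unary(regs):
    """Unary (one bit per register) mask: bit n set means register n is used."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


def issue_batch(batch, pending_writes=0):
    """Model issuing a batch of instructions in one cycle.

    batch: list of (dest_regs, src_regs) in program order.
    pending_writes: unary mask of registers still being written by
    instructions issued in earlier cycles (assumed input).

    Returns one unary hazard mask per instruction: the registers it
    must wait for. Each instruction is checked against the cumulative
    writes of *all* earlier instructions in the batch, so instr3
    depends on both instr2 and instr1.
    """
    cumulative_wr = pending_writes
    rows = []
    for dests, srcs in batch:
        rd, wr = unary(srcs), unary(dests)
        rows.append(rd & cumulative_wr)  # read-after-write hazards only
        cumulative_wr |= wr              # later instrs see this write too
    return rows


# three instructions issued together:
#   instr1: r3 <- r1, r2
#   instr2: r4 <- r3, r1   (needs instr1's r3)
#   instr3: r5 <- r3, r4   (needs instr1's r3 and instr2's r4)
rows = issue_batch([([3], [1, 2]), ([4], [3, 1]), ([5], [3, 4])])
```

because the masks are unary, each row is just an AND and an OR of bit vectors, which is why several rows can be set at the same time.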

of course it also means upping the register file port bandwidth and
the operand forwarding buses, which is where it gets sliiightly tricky
for increasing single-core performance.

the assumption has been that we'll be hammering the multi-issue engine
with vector workloads that are independent and do not need cross-bank
routing i.e. the regfiles can be entirely "banked".  the early scheme
we came up with was HI32/LO32 odd-even (4 banks), however the other
possibility is 4x 64-bit banks, which would give 8x FP32 SIMD backend
performance per clock cycle with no specialist multi-write-port
regfiles needed.

this means we can do 4R1W regfiles absolutely no problem, we do not
need 10R4W or something mad.
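
a rough python model of the 4x 64-bit banked arrangement, to show where the 8x FP32 figure comes from and what "no inter-bank crossover" means for a 1W bank. the bank-selection rule (register number modulo 4) and all function names are assumptions made for illustration, not the actual regfile layout.

```python
from collections import Counter

NUM_BANKS = 4          # 4x 64-bit banks, as described above
BANK_WIDTH_BITS = 64


def bank_of(regnum):
    """Assumed mapping: the low bits of the register number pick the bank."""
    return regnum % NUM_BANKS


def fp32_lanes_per_cycle():
    """One 64-bit write per bank per cycle holds two FP32 values."""
    return NUM_BANKS * BANK_WIDTH_BITS // 32


def write_conflicts(dest_regs, write_ports_per_bank=1):
    """Return {bank: number_of_writes} for banks asked to do more writes
    in one cycle than they have write ports."""
    counts = Counter(bank_of(r) for r in dest_regs)
    return {b: n for b, n in counts.items() if n > write_ports_per_bank}


# a vector op writing consecutive registers spreads across all banks
vector_ok = write_conflicts([8, 9, 10, 11])

# unrelated scalar results can land in the same bank in the same cycle
scalar_clash = write_conflicts([0, 4, 1])
```

with consecutive destination registers every bank sees exactly one write, so 1W per bank is enough; the second case is the kind of collision that arbitrary single-threaded code can produce.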

but.

if we do multi-issue single-threaded then this assumption, that there
will be no inter-bank crossover, is out the window.  we _can_ try to
mitigate that need by using the inter-bank cyclic buffer idea i came
up with a few months back (it sits in between FUs and Regfiles)
however at some point, under specific multi-issue workloads, that will
start to keel over, and it introduces instruction latency anyway (it's
intended for the "just in case" rather than to be hit hard with
persistent write requests).

therefore we'd need to track down how to do multi-port write regfiles
and i *think* i saw whitequark post some research on how to do that,
using, believe it or not, XORing of regfile data across multiple
SRAMs. a quick google search shows this:
https://tomverbeure.github.io/2019/08/03/Multiport-Memories.html

which i *think* was the one she referred to.
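
for reference, a small python model of the general XOR trick for building a multi-write-port memory out of single-write-port banks. this is a sketch of the well-known technique under stated assumptions (one bank per write port, writes in the same cycle go to different addresses), not code from the linked post, and the class and method names are hypothetical.

```python
class XorMultiWriteRAM:
    """Behavioural model: one single-write bank per write port.

    The stored value at an address is the XOR of that address across
    all banks. A write through port p stores (value XOR all the *other*
    banks at that address) into bank p, so the XOR of every bank comes
    back out as the value just written.
    """

    def __init__(self, depth, write_ports):
        self.banks = [[0] * depth for _ in range(write_ports)]

    def read(self, addr):
        out = 0
        for bank in self.banks:
            out ^= bank[addr]
        return out

    def write_cycle(self, writes):
        """writes: list of (port, addr, value), at most one per port,
        all to different addresses (assumption of this sketch)."""
        updates = []
        for port, addr, value in writes:
            others = 0
            for p, bank in enumerate(self.banks):
                if p != port:
                    others ^= bank[addr]   # read old state of other banks
            updates.append((port, addr, value ^ others))
        for port, addr, encoded in updates:  # commit all writes together
            self.banks[port][addr] = encoded


ram = XorMultiWriteRAM(depth=8, write_ports=2)
ram.write_cycle([(0, 3, 0xAA), (1, 5, 0x55)])  # two writes in one cycle
ram.write_cycle([(1, 3, 0x0F)])                # other port overwrites addr 3
```

the cost is that every write port also needs read access to all the other banks, so the number of SRAM copies grows with both the read and the write port counts.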

or, we simply, as the end of that post says, rely on a custom RAM design team.


> Maybe we could compete with the Intel Core 2 or Cortex A72 in
> single-threaded tasks :)

muhahah oosorry.

yeah i don't see why not.

l.


