[Libre-soc-bugs] [Bug 713] PartitionedSignal enhancement to add partition-context-aware lengths

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Thu Oct 7 22:55:52 BST 2021


--- Comment #33 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #31)
> (In reply to Luke Kenneth Casson Leighton from comment #27)
> > (In reply to Jacob Lifshay from comment #23)
> > 
> > > why go through all that when you can just derive the signal width from the
> > > lane shapes -- avoiding wasting extra bits or running out of bits, as well
> > > as avoiding the need to manually specify size? Having the size be computed
> > > doesn't really cause problems imho...
> > 
> > because the idea is to do a literal global/search/replace "Signal"
> > with "PartitionedSignal" right throughout the entirety of the ALU
> > codebase.
> > 
> > and the input argument on all current-Signals is not an elwidth, 
> > it's an "overall width".
> I'm 99% sure you're mis-identifying that...

yes i was. i've got it now.

> the input argument for all
> current Signals *is* the bit-width of the current lane (aka. elwidth or
> XLEN) except that our code currently is specialized for the specific case of
> elwidth=64. 

yes.  in the discussions with Paul and Toshaan i seriously considered
an XLEN parameter in the HDL which would propagate from runtime through
a PSpec (see test_issuer.py for an example) and would allow us to test
a scalar 32-bit Power ISA core and see how many fewer gates are needed.
and, just for laughs, to try an XLEN=16 core.
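
the idea can be sketched in plain python (purely illustratively: this
is *not* the actual PSpec code, and every name here is made up) as a
single XLEN parameter threaded through to every signal width:

```python
# hypothetical sketch: one XLEN parameter propagated from one place,
# so the same HDL could be elaborated as a 64-, 32- or 16-bit core.
# names and structure are illustrative, not the real test_issuer PSpec.
def alu_signal_widths(xlen):
    """widths the ALU signals would take for a given XLEN."""
    assert xlen in (16, 32, 64), "illustrative supported widths"
    return {
        "gpr": xlen,    # general-purpose register / datapath width
        "addr": xlen,   # effective-address width follows XLEN
        "so": 1,        # single-bit flags stay single-bit regardless
    }
```

elaborating with xlen=32 would then shrink every datapath signal in
one go, which is where the gate-count saving would come from.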

but... time being what it is...

> This is exactly how all SIMT works (which is exactly what we're
> trying to do with transparent vectorization). The types and sizes are the
> type/size of a *single* lane, not all-lanes-mushed-together.

there is a lot of misinformation about SIMT.  SIMT is standard cores
(which may or may not have Packed SIMD ALUs) where they are "normal"
cores in every respect *except* that they share one single PC,
one single L1 I-cache, one single fetch and decoder, that *broadcasts*
in a synchronous fashion that one instruction to *all* cores.

the upshot of such a design is that whilst it allows for a much higher
computational gate density (only one L1 cache per dozens of cores) it
is an absolute bitch and a half to program.

not only that, but if you want to run standard general-purpose POSIX
applications, you can't.  you have to *disable* all but one of the
cores on the broadcast bus, which of course means absolute rubbish
performance.

i am not seeing how such an architecture would help, here, and
for the above *general purpose* performance and compiler hassle
reasons i would be very much against pursuing this design as a very
first processor.

> which is why we need to go all-in on SIMT, since that is the *one
> vectorization paradigm* that requires minimal modifications to our ALU code.

only if it is acceptable to have the punishment that comes with
SIMT, and to then have the exact same XLEN problem unless going
with Packed SIMD ALUs, at which point we are back to the exact
same issue (with the added complication of degraded general performance
and a hell of a lot of compiler work).


having special separate scalar XLEN=8 and XLEN=16 and separate
XLEN=32 scalar cores which are disabled/idle when not given
suitable work would be exactly the kind of duplication that
PartitionedSignal was created in the first place to avoid.
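
for reference, the two width conventions being argued over in this
thread can be contrasted in plain python (illustrative only, not the
actual PartitionedSignal API):

```python
# illustrative only: the two width conventions discussed in this
# thread.  neither function is the real PartitionedSignal API.

# deriving the overall signal width from the lane shapes
# (avoids wasted bits, but changes what the width argument means):
def width_from_lanes(lane_widths):
    return sum(lane_widths)

# keeping Signal's existing meaning -- one "overall width" which the
# partition points then subdivide -- so a global search/replace of
# Signal with PartitionedSignal needs no argument changes:
def lanes_from_width(overall, nlanes):
    assert overall % nlanes == 0, "overall width must split evenly"
    return [overall // nlanes] * nlanes
```

the second convention is what lets the existing ALU code, written
against 64-bit Signals, be converted without touching every call site.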
