[Libre-soc-bugs] [Bug 713] PartitionedSignal enhancement to add partition-context-aware lengths

Fri Oct 8 20:44:19 BST 2021

https://bugs.libre-soc.org/show_bug.cgi?id=713

--- Comment #38 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Jacob Lifshay from comment #35)

> I don't care how SIMT is traditionally implemented in a GPU, that's totally
> irrelevant and not what I intended.

yeah i understood that. i think you likely meant "Packed SIMD"

> What I meant is that our HDL would be
> written like how SIMT is used from a game programmer's perspective 

ok.  yes. this is more like SIMD programming.

> what would actually run is:
> vec_with_xlen_lanes_t a, b, c, d;
> ...
> for(lane_t lane : currently_active_lanes()) { // C++11 for-each
>     xlen_int_t p = c[lane] * d[lane];
>     a[lane] = b[lane] + p;
> }

yep. definitely SIMD (at the backend) which is what we're discussing:
the low-level hardware which is behind "lanes".

> If we want any chance of matching the spec pseudo-code, we will need
> xlen-ification of our HDL in some form, since currently it's hardwired for
> only xlen=64.

right.  ok.  so here, temporarily (i quite like the self.XLEN idea but it
is a lot of work to deploy), to *not* have to do that deployment immediately,
the trick that can be played is:

    set the SIMD ALU width to *exactly* the same width as the current scalar
    width i.e. 64.

ta-daa, problem goes away.

is it a cheat? yes.
will it work? yes.
will it save time? yes.
will it minimise code-changes? yes.
will it do absolutely everything? mmmm probably not but that
can be evaluated case by case (incremental patches)

in other words: by defining the SIMD ALU width exactly equal to the
current scalar ALU width this makes what you would like to see (and
called erroneously SIMT) exactly what i have been planning for 18 months.

originally if you recall (going back 2+ years) the idea was to *split*
the ALUs into *two* 32 bit ALUs that cooperate to do 64 bit arithmetic.
but that was actually more about the regfiles, a HI32 regfile and a LO32
regfile.

this melted my brain and for practical time reasons it's all 64 bit.

> Why not make it look more like the pseudo-code with arithmetic on an XLEN
> constant? I could write the class needed for XLEN in a day, it's not that
> complicated:

yehyeh.  my feeling is, this class should derive from Shape.  in fact it
may have to (i looked at ast.py and there are assumptions that the
first argument of e.g. Signal is a Shape() or will be converted to one).

by deriving *from* Shape() a self.XLEN can do the division etc.  the
self.XLEN//2 etc. that we saw were needed in the pseudocode will almost
certainly be needed in the HDL as well, but they would create or carry an
appropriate elwidth set too.

the calculation in each of the operator overloads then is based on self.width
which comes from Shape.width.

i am not entirely certain of all the details here so kinda would like to
defer it as long as needed, and go with an "on-demand" approach here.
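
a minimal sketch of the idea, in plain python so that it runs without nmigen.
the Shape class here is only a stand-in for nmigen's Shape, and XLenShape and
its lane_widths dict are hypothetical names, not anything in the codebase.
the point is only that the operator overload works from self.width and
carries the per-elwidth set along with it:

```python
class Shape:
    """Stand-in for nmigen's Shape: just a width and a signedness flag."""
    def __init__(self, width, signed=False):
        self.width = width
        self.signed = signed


class XLenShape(Shape):
    """Hypothetical XLEN: a Shape that also carries per-elwidth lane widths."""
    def __init__(self, width, lane_widths, signed=False):
        super().__init__(width, signed)
        self.lane_widths = dict(lane_widths)

    def __floordiv__(self, divisor):
        # the overall width and every per-elwidth width scale together
        scaled = {k: v // divisor for k, v in self.lane_widths.items()}
        return XLenShape(self.width // divisor, scaled, self.signed)


# assumed elwidth names, mirroring the SimdMap example quoted in this thread
XLEN = XLenShape(64, {"I8": 8, "I16": 16, "I32": 32, "I64": 64})
half = XLEN // 2  # still a Shape, so it could be handed to a Signal
```

the other operators (multiply, add, subtract) would follow the same pattern.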

> XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})

i like it: you are however forgetting that the ALU width also has to be one
of the parameters.

this is a hard requirement.

i explained it in terms of the layout() function returning different
widths based on its elwidth parameters but you haven't yet taken on board
the significance of the (long) chain-of-logic that goes back to
PartitionedSignal internals.

if the width varies it creates a massive cost in gate terms by having to
Mux e.g. 22 bit PartitionedSignals onto 64 bit PartitionedSignals.

if the width does not vary then those are straight wires.

yes there will be blank unused partitions but i worked out how to deal
with that.

by analysing the dict of elwidth->binary mask values and working out those
unused partition pieces, you can actually get all of the submodules behind
the various operators to not even bother allocating gates for that piece.

example:

* PartitionedEq works by creating a set of eqs per piece
  then *combining* those pieces together depending on the
  PPoints
* if we KNOW that some of those pieces will never be used
  then just don't do the sub-eqs for those pieces.

ta-daa, fewer gates all round.
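
a rough model of that analysis in plain python (not nmigen). the mask
encoding is an assumption made for illustration: bit i set means "piece i is
used at this elwidth". unused_pieces and piece_eqs are hypothetical helper
names, and the integers stand in for signals:

```python
def unused_pieces(used_masks, n_pieces):
    """Return the indices of pieces that no elwidth ever uses."""
    ever_used = 0
    for mask in used_masks.values():
        ever_used |= mask
    return [i for i in range(n_pieces) if not (ever_used >> i) & 1]


def piece_eqs(a, b, piece_width, n_pieces, skip):
    """Compare a and b piece by piece, leaving out the skipped pieces."""
    mask = (1 << piece_width) - 1
    return {
        i: ((a >> (i * piece_width)) & mask) == ((b >> (i * piece_width)) & mask)
        for i in range(n_pieces)
        if i not in skip
    }


# made-up example: 4 pieces, and piece 3 is blank at every elwidth
masks = {0: 0b0111, 1: 0b0101}
skip = unused_pieces(masks, 4)
# the two values differ only in piece 3, which is never compared
eqs = piece_eqs(0x1234, 0xF234, piece_width=4, n_pieces=4, skip=skip)
```

in real HDL the skipped entries are comparators that never get instantiated,
which is where the gate saving comes from.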

but for this to work the layout() function *has* to have the width parameter
as input, hence the assertions, and that's perfectly fine.
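
to illustrate what "width as input, plus assertions" means, here is a
simplified sketch. it is not the actual layout() function: the name
fixed_width_layout, its argument order and the (start, width) return format
are all assumptions for illustration. the ALU width is fixed up-front and
every elwidth has to fit inside it:

```python
def fixed_width_layout(width, part_counts, lane_widths):
    """Return {elwid: [(start_bit, lane_width), ...]} for a fixed ALU width."""
    layout = {}
    for elwid, count in part_counts.items():
        # the fixed width must split evenly into `count` slots
        assert width % count == 0, "width must divide evenly into lanes"
        slot = width // count
        # each lane has to fit in its slot; any leftover bits stay blank
        assert lane_widths[elwid] <= slot, "lane does not fit in its slot"
        layout[elwid] = [(i * slot, lane_widths[elwid]) for i in range(count)]
    return layout


lay = fixed_width_layout(
    64,
    {"I8": 8, "I16": 4, "I32": 2, "I64": 1},
    {"I8": 8, "I16": 16, "I32": 32, "I64": 64},
)
```

because the width never changes between elwidths, every lane start is a
fixed bit position, i.e. straight wires rather than Muxes.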

we have to prioritise gate count, here, not programmer convenience.

> # layouts really should be a subclass or wrapper over SimdMap
> # with Shapes as values, but lkcl insisted...
> def layout(elwid, part_counts, lane_shapes):
>     lane_shapes = SimdMap.map(Shape.cast, lane_shapes).values

still doesn't have width, which is a hard requirement.

> # now the following works, because PartitionedSignal uses SimdMap on
> # inputs for shapes, slicing, etc.
> 
> # example definition for addg6s, basically directly
> # translating pseudo-code to nmigen+simd.
> # intentionally not using standard ALU interface, for ease of exposition:
> class AddG6s(Elaboratable):
>     def __init__(self):
>         with simd_scope(self, IntElWid, make_elwid_attr=True):
>             self.RA = PartitionedSignal(XLEN)
>             self.RB = PartitionedSignal(XLEN)
>             self.RT = PartitionedSignal(XLEN)

it will have to be

    with simdscope(self.pspec, m, self, ....) as ss:
        self.RA = ss.Signal(pspec.XLEN)

which allows simdscope to "pick up" the dynamic runtime compile switch
between scalar and simd, but yes.
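
a toy model of that switch, in plain python. none of these names are the
real API: simdscope_sketch, ScalarFactory and SimdFactory are hypothetical,
and the tuples stand in for real Signal / PartitionedSignal objects. the
only point is that the code inside the "with" block is identical in both
modes:

```python
from contextlib import contextmanager


class ScalarFactory:
    """Hands out plain (scalar) signals."""
    def Signal(self, width):
        return ("Signal", width)


class SimdFactory:
    """Hands out partitioned (SIMD) signals."""
    def Signal(self, width):
        return ("PartitionedSignal", width)


@contextmanager
def simdscope_sketch(use_simd):
    # the scalar/simd decision is made once, here, from a runtime switch
    yield SimdFactory() if use_simd else ScalarFactory()


with simdscope_sketch(use_simd=True) as ss:
    ra_simd = ss.Signal(64)

with simdscope_sketch(use_simd=False) as ss:
    ra_scalar = ss.Signal(64)
```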

my only concern is that even flipping all the HDL over to substitute XLEN
for 64 throughout all 12 pipelines is a big frickin job, at least 10-14 days.

it's not something to be taken lightly, and *nothing* else can take place
during that time.

this is really important.

once committed to big regular patterned changes like that, they *have* to be
seen through to the end and VERY strictly under no circumstances mixed with any
other code changes.

the individual pipeline tests as well as test_issuer will all have to be kept
up to date.

hence why it is a lot of work because we are talking *TWENTY FIVE OR MORE*
unit tests, 12 pipelines, and 5 to 8 ancillary files associated with the
core and with regfiles.
