[Libre-soc-bugs] [Bug 713] PartitionedSignal enhancement to add partition-context-aware lengths

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Fri Oct 8 23:30:16 BST 2021


--- Comment #42 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #38)
> (In reply to Jacob Lifshay from comment #35)
> > I don't care how SIMT is traditionally implemented in a GPU, that's totally
> > irrelevant and not what I intended.
> yeah i understood that. i think you likely meant "Packed SIMD"

no, I meant SIMT -- explicitly where the programmer writes scalar code and the
code is auto-vectorized to a GPU-driver-chosen lane count behind-the-scenes,
where there's one instance of the scalar code for each vector lane -- like
posix threads in that you can have one instance of the scalar code per thread,
and they all run in parallel. In our case, that lane count is determined by the
scoped globals. With "Packed SIMD" the programmer still has to choose explicit
SIMD register widths, explicit masking/predication, explicit SIMD types, etc.

> > What I meant is that our HDL would be
> > written like how SIMT is used from a game programmer's perspective 
> ok.  yes. this is more like SIMD programming.

not really, it's just somewhat confusing because game programmers often use
explicit short vectors too -- kinda orthogonally to the SIMT vectors.

> > what would actually run is:
> > vec_with_xlen_lanes_t a, b, c, d;
> > ...
> > for(lane_t lane : currently_active_lanes()) { // C++11 for-each
> >     xlen_int_t p = c[lane] * d[lane];
> >     a[lane] = b[lane] + p;
> > }
> yep. definitely SIMD (at the backend) which is what we're discussing:
> the low-level hardware which is behind "lanes".

yes, but critically, the vectorization-into-SIMD step can be skipped (either by
setting lane counts to always be one (by setting part_counts={ElWid.I64: 1}),
or by bypassing our SIMD operators completely and generating nmigen operators
instead) and you end up with the original scalar code where, hey, it uses XLEN
everywhere...just tell it to use the component of XLEN where elwid=64.

This allows us to convert everything one module at a time, since SIMT-ified
code can easily be used by non-SIMT code by having the globals indicate
elwid=64 and part_counts={ElWid.I64: 1}.
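To illustrate that compatibility mode, here is a minimal sketch (hypothetical names, loosely modeled on the quoted C++ example; the real scoped-globals API may differ): with part_counts={ElWid.I64: 1} there is exactly one 64-bit lane, so the per-lane loop collapses to plain scalar code.

```python
from enum import Enum

class ElWid(Enum):
    I8 = 0; I16 = 1; I32 = 2; I64 = 3

# hypothetical scoped globals for "scalar compatibility mode":
# one part per lane, one supported elwid -> exactly one 64-bit lane
part_counts = {ElWid.I64: 1}
elwid = ElWid.I64

def currently_active_lanes():
    # with part_counts={ElWid.I64: 1}, this is a single lane
    return range(sum(part_counts.values()))

def mul_add(b, c, d):
    # per-lane compute, mirroring the quoted example: a[l] = b[l] + c[l]*d[l]
    return [b[lane] + c[lane] * d[lane] for lane in currently_active_lanes()]

a = mul_add([2], [3], [4])  # single lane: plain scalar arithmetic
```

With these globals the "vector" loop body runs once, which is why SIMT-ified modules can coexist with not-yet-converted scalar modules.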
> > If we want any chance of matching the spec pseudo-code, we will need
> > xlen-ification of our HDL in some form, since currently it's hardwired for
> > only xlen=64.
> right.  ok.  so here, temporarily (i quite like the self.XLEN idea but it
> is a lot of work to deploy), to *not* have to do that deployment immediately,
> the trick that can be played is:
>     set the SIMD ALU width to *exactly* the same width as the current scalar
>     width i.e. 64.

so, basically you want all code to magically replace 64 with XLEN, and you try
to justify that by saying 64 is the SIMD width...

I say that justification is additional unnecessary complication and mental
burden; if we take that approach we should still have 64-replaced-with-XLEN be
the *lane* width, not the SIMD width.

I am taking SIMT seriously, and we should write our code such that the SIMD
width essentially does not affect how any of our ALU HDL is written. If it
turns out that the circuits are smaller if we choose a fixed SIMD width, then
that can be handled by the layout algorithm (where that kind of decision
belongs) and passed in through another scoped global, *not* by conflating
lane-width=64 with SIMD-width=64. That allows us to:
1. set simd-width=128 if we decide 64-bit is too small.
2. set simd-width=32 and limit to max elwid of 32 for a small embedded cpu.
3. set simd-width to some more complex formula based on requested lane sizes so
we don't outright fail where our code needs lanes with more than 64 bits, such
as int/fp multiplication, or addg6s.
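As a sketch of option 3 (the formula and names here are guesses at the intent, not the real layout() algorithm): pick the smallest per-part width that fits every elwid's requested lane shape, then multiply by the total part count.

```python
from enum import Enum

class ElWid(Enum):
    I8 = 0; I16 = 1; I32 = 2; I64 = 3

def choose_simd_width(lane_shapes, part_counts):
    # lane_shapes: bits requested per lane, for each elwid
    # part_counts: parts per lane, for each elwid (all parts equal width)
    total_parts = max(part_counts.values())
    # smallest per-part bit width that fits every elwid's lane shape
    bits_per_part = max(-(-lane_shapes[ew] // parts)   # ceiling division
                        for ew, parts in part_counts.items())
    return total_parts * bits_per_part

PART_COUNTS = {ElWid.I8: 1, ElWid.I16: 2, ElWid.I32: 4, ElWid.I64: 8}

# ordinary XLEN-style lanes fit in a 64-bit SIMD signal...
w_int = choose_simd_width({ElWid.I8: 8, ElWid.I16: 16,
                           ElWid.I32: 32, ElWid.I64: 64}, PART_COUNTS)
# ...but a double-width multiply result needs a wider SIMD signal
w_mul = choose_simd_width({ElWid.I8: 16, ElWid.I16: 32,
                           ElWid.I32: 64, ElWid.I64: 128}, PART_COUNTS)
```

The point is that the SIMD width falls out of the requested lane shapes, rather than being hardwired to 64 up front.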

I think magically replacing 64 with XLEN *could* work, but just because it can
work doesn't mean it's a good idea: it makes our code much harder to
understand, and it causes issues where we *really need* lane size = 64, not
XLEN (such as in the bcd<->dpd instructions). Textually replacing 64 with XLEN
when we replace Signal with PartitionedSignal isn't much harder -- just one
more search-replace-and-check-all-replacements -- that works just as well as
the magic replacement could possibly work, and it leaves the code waay more
readable.
> will it save time? yes.

yes, but not much.

> will it minimise code-changes? yes.

yes, but imho code understandability is far more important than tiny git
diffs. Think of it as just a refactoring: big diffs don't mean it's hard
to understand.

> will it do absolutely everything? mmmm probably not but that
> can be evaluated case by case (incremental patches)
> in other words: by defining the SIMD ALU width exactly equal to the
> current scalar ALU width this makes what you would like to see (and
> called erroneously SIMT) exactly what i have been planning for 18 months.
> originally if you recall (going back 2+ years) the idea was to *split*
> the ALUs into *two* 32 bit ALUs that cooperate to do 64 bit arithmetic.
> but that was actually more about the regfiles, a HI32 regfile and a LO32
> regfile.
> this melted my brain and for practical time reasons it's all 64 bit.
> > Why not make it look more like the pseudo-code with arithmetic on an XLEN
> > constant? I could write the class needed for XLEN in a day, it's not that
> > complicated:
> yehyeh.  my feeling is, this class should derive from Shape.  in fact it
> may have to (i looked at ast.py and there are assumptions that the
> first argument of e.g. Signal is a Shape() or will be converted to one.

I think it should *specifically not* derive from Shape, because it's more
general than just Shapes -- it can have values that are:
* ints -- how it's used in XLEN
* patterns -- for .matches() and Case()
* Values -- for when we're converting from scalars to SIMD and which scalar we
use depends on elwid.
* Shapes -- input to layout()
* slices -- bit slicing
* more
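A minimal sketch of that idea (hypothetical code; the real SimdMap would support more operators and all the value kinds listed above): a per-elwid dict whose operators apply elementwise, so expressions like XLEN // 2 keep a separate value per elwid.

```python
from enum import Enum

class ElWid(Enum):
    I8 = 0; I16 = 1; I32 = 2; I64 = 3

class SimdMap:
    """one Python value per possible elwid; operators apply elementwise"""
    def __init__(self, values):
        self.values = dict(values)

    def _binop(self, other, op):
        if not isinstance(other, SimdMap):
            # broadcast a plain value across all elwidths
            other = SimdMap({k: other for k in self.values})
        return SimdMap({k: op(v, other.values[k])
                        for k, v in self.values.items()})

    def __add__(self, other):
        return self._binop(other, lambda a, b: a + b)

    def __floordiv__(self, other):
        return self._binop(other, lambda a, b: a // b)

XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})
HALF_XLEN = XLEN // 2  # a different value for each possible elwid
```

Note nothing here is a Shape or a Signal: the map is plain Python values, which is what lets it also hold patterns, slices, Shapes, etc.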

> by deriving *from* Shape() a self.XLEN can do the division etc.

that has nothing to do with deriving from Shape.
> self.XLEN//2
> etc that we saw were needed in the pseudocode, they will almost certainly
> also be needed in the HDL as well, but they would create or carry an
> appropriate
> elwidth set as well.

Well, how I envisioned it, SimdMap has a separate value for each possible
elwid, allowing us to calculate a different value per elwid -- implemented
internally using a dict, hence the Map in SimdMap.

The way you're thinking of likely won't actually work since elwid is a Signal,
and you can't use a Signal as a Shape. If you base it on .width, then you've
locked yourself into only being able to calculate the value for one specific
elwid value, the one that matches .width.
> the calculation in each of the operator overloads then is based on self.width
> which comes from Shape.width.
> i am not entirely certain of all the details here so kinda would like to
> defer it as long as needed, and go with an "on-demand" approach here.
> > XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})
> i like it: you are however forgetting that the ALU width also has to be one
> of the parameters.

that's passed in via the scope globals, not via a SimdMap.

> this is a hard requirement.

I can see how varying the SIMD width could need more gates; I think that
should be handled in layout via scope globals if necessary, not by messing up
SimdMap.

SimdMap is purely a map from elwidths to Python values.

The type that *would* have SIMD-width is SimdLayout, which layout() should be
morphed into the __init__ of. SimdLayout could be a Shape, but I think it's
likely better to have a .simd_shape() or just .shape() method... not deriving
from Shape avoids possible confusion between the full SIMD width and the lane
width.

There is a helper dataclass: SimdLane. it contains an ElWid enumerant and also
a lane index or part index or something. The idea is that a SimdLane instance
specifies a particular lane in a SimdSignal/SimdLayout.

SimdLayout would have fields, properties, or whatever, for all of:
* lane_shapes -- either a dict or a SimdMap, tbd.
* padding_kind -- is the padding a sign-extended version of
    the corresponding lane? is it zeros? ones?
    Repl(lane_bits, ...)? other? Defaults to other.
* bits_per_part -- each lane is made of 1 or more parts (determined
    by part_counts), all parts are always the same bit-width.
* signed -- signedness of all lanes
* simd_shape -- could always be unsigned, since
    signedness doesn't really matter here.
* part_counts -- number of parts per lane for each elwid,
    determines which elwid are supported. This, along with all
    the other __init__ parameters (excluding lane_shapes and
    padding_kind) should really be tossed into one globals
    object (it's SimdPartMode in simd_signal) that is the
    scoped global's value.
* other parameters that determine layout.

* get_lane_start(lane) -- returns the starting bit index of `lane`
    in the SIMD Signal. lane is a SimdLane.
* get_lane_width(lane, include_padding=False) -- returns the bit
    width of `lane` in the SIMD Signal. lane is a SimdLane.
* get_lane_slice(lane, include_padding=False) -- returns a slice()
    for the bits of `lane` in the SIMD Signal. lane is a SimdLane.
* lanes() -- gets all `SimdLane`s for this layout.
* lanes_for_elwidth(elwid) -- gets all `SimdLane`s for this layout
    for a particular ElWid enumerant.
* elwidths() -- gets all ElWid values supported by this layout
    ... essentially part_counts.keys()
* more...
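A rough sketch of the lane-geometry queries above (names and formulas are guesses at the intent, assuming a fixed 64-bit SIMD width, equal-width parts, and no padding handling -- not the real layout() code):

```python
from dataclasses import dataclass
from enum import Enum

class ElWid(Enum):
    I8 = 0; I16 = 1; I32 = 2; I64 = 3

@dataclass(frozen=True)
class SimdLane:
    elwid: ElWid     # element width this lane exists at
    lane_index: int  # which lane, counting up from bit 0

class SimdLayout:
    SIMD_WIDTH = 64  # assumed fixed for this sketch

    def __init__(self, part_counts, bits_per_part):
        self.part_counts = part_counts      # parts per lane, per elwid
        self.bits_per_part = bits_per_part  # all parts have equal width

    def get_lane_width(self, lane, include_padding=False):
        # in this sketch lanes always fill their parts exactly
        return self.part_counts[lane.elwid] * self.bits_per_part

    def get_lane_start(self, lane):
        return lane.lane_index * self.get_lane_width(lane)

    def get_lane_slice(self, lane, include_padding=False):
        start = self.get_lane_start(lane)
        return slice(start, start + self.get_lane_width(lane, include_padding))

    def lanes_for_elwidth(self, elwid):
        total_parts = self.SIMD_WIDTH // self.bits_per_part
        nlanes = total_parts // self.part_counts[elwid]
        return [SimdLane(elwid, i) for i in range(nlanes)]

    def lanes(self):
        return [lane for ew in self.part_counts
                for lane in self.lanes_for_elwidth(ew)]

layout = SimdLayout({ElWid.I8: 1, ElWid.I16: 2, ElWid.I32: 4, ElWid.I64: 8},
                    bits_per_part=8)
```

For example, lane 1 at elwid=I16 occupies bits 16..31 of the 64-bit SIMD signal, and there are 2 lanes at elwid=I32.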

> > # example definition for addg6s, basically directly
> > # translating pseudo-code to nmigen+simd.
> > # intentionally not using standard ALU interface, for ease of exposition:
> > class AddG6s(Elaboratable):
> >     def __init__(self):
> >         with simd_scope(self, IntElWid, make_elwid_attr=True):
> >             self.RA = PartitionedSignal(XLEN)
> >             self.RB = PartitionedSignal(XLEN)
> >             self.RT = PartitionedSignal(XLEN)
> it will have to be
>      with simdscope(self.pspec, m, self, ....) as ss:
>         self.RA = ss.Signal(pspec.XLEN)

ooh, lookie, simdscope gets self as an argument ... it can read self.pspec

XLEN should be a global constant imho.

> which allows simdscope to "pick up" the dynamic runtime compile switch
> between scalar and simd, but yes.
> my only concern is that even flipping all the HDL over to substitute XLEN
> for 64 throughout all 12 pipelines is a big frickin job, at least 10-14 days.

it happens at the same time as our replacing Signal with PartitionedSignal, so
it won't be much more work, imho.

> it's not something to be taken lightly,
> and *nothing* else can take place
> during that time.

false...luckily, if the globals set part_counts={ElWid.I64: 1}, then our
modules should be nearly 100% compatible with not-yet-SIMT-ified modules,
meaning that we can do it one module at a time.

> the individual pipeline tests as well as test_issuer will all have to be
> kept uptodate.
> hence why it is a lot of work because we are talking *TWENTY FIVE OR MORE*
> unit tests, 12 pipelines, and 5 to 8 ancillary files associated with the
> core and with regfiles.

yes, I've done similarly-sized refactors on my own code before. they took
several days.
