[Libre-soc-bugs] [Bug 713] PartitionedSignal enhancement to add partition-context-aware lengths

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Fri Oct 8 00:06:40 BST 2021


https://bugs.libre-soc.org/show_bug.cgi?id=713

--- Comment #35 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #33)
> (In reply to Jacob Lifshay from comment #31)
> 
> > This is exactly how all SIMT works (which is exactly what we're
> > trying to do with transparent vectorization). The types and sizes are the
> > type/size of a *single* lane, not all-lanes-mushed-together.
> 
> there is a lot of misinformation about SIMT.  SIMT is standard cores
> (which may or may not have Packed SIMD ALUs) where they are "normal"
> cores in every respect *except* that they share one single PC,
> one single L1 I-cache, one single fetch and decoder, that *broadcasts*
> in a synchronous fashion that one instruction to *all* cores.

I don't care how SIMT is traditionally implemented in a GPU; that's not what
I intended. What I meant is that our HDL would be written the way SIMT is
used from a game programmer's perspective -- where, if a game programmer
writes:
float a, b;
int c, d;
...
a = a > b ? c : d;

the gpu actually runs (vectorized with 64 lanes):
f32x64 a, b;
i32x64 c, d, muxed;
boolx64 cond;
...
cond = a > b; // lane-wise compare
muxed = mux(cond, c, d);
a = convert<f32x64>(muxed);
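The lane-wise compare/mux above can be sketched in plain Python (a hypothetical 2-lane model for illustration only; `lanewise_mux` is an invented name, not GPU or libre-soc code):

```python
def lanewise_mux(a, b, c, d):
    """Per-lane model of `a = a > b ? c : d` with int->float conversion."""
    cond = [av > bv for av, bv in zip(a, b)]                  # lane-wise compare
    muxed = [cv if t else dv for t, cv, dv in zip(cond, c, d)]  # lane-wise mux
    return [float(v) for v in muxed]                          # lane-wise convert
```

Each lane is computed independently, exactly as the `f32x64`/`i32x64` version does across 64 lanes.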

and if a programmer were to write (generalizing a bit to support dynamic XLEN):
xlen_int_t a, b, c, d;
...
a = b + c * d;

what would actually run is:
vec_with_xlen_lanes_t a, b, c, d;
...
for(lane_t lane : currently_active_lanes()) { // C++11 for-each
    xlen_int_t p = c[lane] * d[lane];
    a[lane] = b[lane] + p;
}
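A minimal Python sketch of the per-lane loop above, with an explicit active-lane mask standing in for `currently_active_lanes()` (names and the mask representation are illustrative assumptions):

```python
def fused_mul_add_lanes(a, b, c, d, active):
    """Per-lane model of `a = b + c * d`, updating only active lanes."""
    for lane, enabled in enumerate(active):
        if not enabled:
            continue                  # inactive lanes keep their old value
        p = c[lane] * d[lane]         # xlen_int_t p = c[lane] * d[lane];
        a[lane] = b[lane] + p         # a[lane] = b[lane] + p;
    return a
```

Inactive lanes are simply skipped, which is the predication behaviour the lane mask expresses.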

> > the input argument for all
> > current Signals *is* the bit-width of the current lane (aka. elwidth or
> > XLEN) except that our code currently is specialized for the specific case of
> > elwidth=64. 
> 
> yes.  in the discussions with Paul and Toshaan i seriously considered
> an XLEN parameter in the HDL which would propagate from runtime through
> a PSpec (see test_issuer.py for an example) and would allow us to test
> a scalar 32 bit Power ISA core, see how many fewer gates are needed.
> and, just for laughs, to try an XLEN=16 core.
> 
> but... time being what it is...

If we want any chance of matching the spec pseudo-code, we will need
xlen-ification of our HDL in some form, since currently it's hardwired for only
xlen=64.

Why not make it look more like the pseudo-code, with arithmetic on an XLEN
constant? I could write the class needed for XLEN in a day; it's not that
complicated:

import operator
from collections.abc import Mapping

# ElWid is the element-width enum (I8/I16/I32/I64), defined elsewhere

class SimdMap:
    def __init__(self, values, *, convert=True):
        if convert:
            # convert values to a dict by letting map do all the hard work
            values = SimdMap.map(lambda v: v, values).values
        self.values = values

    @staticmethod
    def map(f, *args):
        """like the builtin map, but over SimdMap instead of iterables:
        apply `f` element-width-wise to arguments, each of which is:
        * a Mapping with ElWid-typed keys,
        * a SimdMap instance,
        * or a scalar (broadcast to every element width).

        return a SimdMap of the results
        """
        retval = {}
        for i in ElWid:
            mapped_args = []
            for arg in args:
                if isinstance(arg, SimdMap):
                    arg = arg.values[i]
                elif isinstance(arg, Mapping):
                    arg = arg[i]
                mapped_args.append(arg)
            retval[i] = f(*mapped_args)
        return SimdMap(retval, convert=False)

    def __add__(self, other):
        return SimdMap.map(operator.add, self, other)

    def __radd__(self, other):
        return SimdMap.map(operator.add, other, self)

    def __sub__(self, other):
        return SimdMap.map(operator.sub, self, other)

    def __rsub__(self, other):
        return SimdMap.map(operator.sub, other, self)

    def __mul__(self, other):
        return SimdMap.map(operator.mul, self, other)

    def __rmul__(self, other):
        return SimdMap.map(operator.mul, other, self)

    def __floordiv__(self, other):
        return SimdMap.map(operator.floordiv, self, other)

    def __rfloordiv__(self, other):
        return SimdMap.map(operator.floordiv, other, self)

    ...

XLEN = SimdMap({ElWid.I8: 8, ElWid.I16: 16, ElWid.I32: 32, ElWid.I64: 64})
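As a standalone illustration of the idea, arithmetic on an XLEN constant can be modelled with a plain dict per element width (this toy `simd_map` is a stand-in for the SimdMap class above, purely to show the semantics, not the proposed implementation):

```python
from enum import Enum

class ElWid(Enum):
    I8 = 8
    I16 = 16
    I32 = 32
    I64 = 64

# XLEN as a per-element-width constant
XLEN = {ew: ew.value for ew in ElWid}

def simd_map(f, *args):
    # apply f once per element width; non-dict args are broadcast scalars
    return {ew: f(*(a[ew] if isinstance(a, dict) else a for a in args))
            for ew in ElWid}

# XLEN // 4 evaluated per element width, as in Repl(Const(1, 4), XLEN // 4)
nibble_counts = simd_map(lambda x, c: x // c, XLEN, 4)
```

So a single expression like `XLEN // 4` transparently yields one value per element width, which is exactly what lets the HDL match the scalar pseudo-code.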

# layouts really should be a subclass or wrapper over SimdMap
# with Shapes as values, but lkcl insisted...
def layout(elwid, part_counts, lane_shapes):
    lane_shapes = SimdMap.map(Shape.cast, lane_shapes).values
    signed = lane_shapes[ElWid.I64].signed
    # rest unmodified...
    assert all(i.signed == signed for i in lane_shapes.values())
    part_wid = -min(-lane_shapes[i].width // c for i, c in part_counts.items())
    ...
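The `-min(-lane_shapes[i].width // c ...)` idiom in `layout` is ceiling division combined with `max()`: for positive `c`, `-(-x // c) == ceil(x / c)`, so negating the minimum of the negated quotients yields the maximum of the ceilings. A quick standalone check (assuming positive widths and counts):

```python
def max_ceil_div(widths_and_counts):
    # equivalent to max(ceil(w / c) for w, c in widths_and_counts)
    return -min(-w // c for w, c in widths_and_counts)
```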


...

# now the following works, because PartitionedSignal uses SimdMap on
# inputs for shapes, slicing, etc.

# example definition for addg6s, basically directly
# translating pseudo-code to nmigen+simd.
# intentionally not using standard ALU interface, for ease of exposition:
class AddG6s(Elaboratable):
    def __init__(self):
        with simd_scope(self, IntElWid, make_elwid_attr=True):
            self.RA = PartitionedSignal(XLEN)
            self.RB = PartitionedSignal(XLEN)
            self.RT = PartitionedSignal(XLEN)

    def elaborate(self, platform):
        m = Module()
        with simd_scope(self, IntElWid, m=m):
            wide_RA = PartitionedSignal(unsigned(4 + XLEN))
            wide_RB = PartitionedSignal(unsigned(4 + XLEN))
            sum = PartitionedSignal(unsigned(4 + XLEN))
            carries = PartitionedSignal(unsigned(4 + XLEN))
            ones = PartitionedSignal(XLEN)
            nibbles_need_sixes = PartitionedSignal(XLEN)
            z4 = Const(0, 4)
            m.d.comb += [
                wide_RA.eq(Cat(self.RA, z4)),
                wide_RB.eq(Cat(self.RB, z4)),
                sum.eq(wide_RA + wide_RB),
                carries.eq(sum ^ wide_RA ^ wide_RB),
                ones.eq(Repl(Const(1, 4), XLEN // 4)),
                # carry out of nibble k is carry *into* bit 4k+4
                nibbles_need_sixes.eq(~carries[4:XLEN+4] & ones),
                self.RT.eq(nibbles_need_sixes * 6),
            ]
        return m
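For cross-checking the HDL, here is a scalar Python model of addg6s (assuming the Power ISA semantics: each result nibble is 6 where the corresponding nibble addition produced no carry out, else 0):

```python
def addg6s(ra, rb, xlen=64):
    """Scalar reference model of the Power ISA addg6s instruction."""
    mask = (1 << xlen) - 1
    s = (ra & mask) + (rb & mask)
    carries = s ^ ra ^ rb                 # bit i = carry *into* bit i
    rt = 0
    for nib in range(xlen // 4):
        carry_out = (carries >> (4 * nib + 4)) & 1
        if not carry_out:                 # no decimal carry: generate a six
            rt |= 6 << (4 * nib)
    return rt
```

This mirrors the HDL above: `carries` is the same `sum ^ RA ^ RB`, and the carry out of nibble k is the carry into bit 4k+4.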

-- 
You are receiving this mail because:
You are on the CC list for the bug.

