[Libre-soc-dev] parallel reduction

Mon Sep 5 15:00:02 BST 2022

parallel reduction has been a priority design feature of SV for
over two years.  it is fundamental to Vector and SIMD ISAs.
it was designed in the specification as a "Mode" (like Saturate)
because it can only be called on as a 64-bit instruction.

i began implementing it yesterday and this morning encountered
a fatal design flaw should it be implemented as a "Mode".
in addition the complexity of adding it to ISACaller made me
realise that in HDL it would correspondingly be too complex
(too much gate latency) to implement as a "Mode".

with only four weeks left before the October Horizon 2020 cutoff
the last thing needed is changes to the spec, but that is what this
will take.

absolute top priority flaw: REMAP combined with Parallel Reduce
needs chaining of two complex Schedule Maps and that is too
much.

second and equally fatal flaw: the Vector Length as
set for the number of elements to be reduced bears *no relation*
to the number of reduction operations, which in "fast" parallel
reductions may even exceed VL.

it should be absolutely clear that any solutions attempting to fix
the second flaw without also fixing the first are unacceptable
and do not need to be discussed at this critical time.

i am therefore proceeding immediately to make parallel reduce
a "REMAP" option.

this solves the first flaw by making it impossible by design to
chain two complex Schedules together.

it also solves the second flaw by requiring that the Vector Length
be passed in as an operand to svshape, letting svshape compute
the number of operations and set VL and MAXVL to that exact
amount.
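
to illustrate why the operation count has to be computed separately
from the element count, here is a minimal sketch of an in-place
pairwise tree reduction schedule.  it is an assumption for
illustration only, not the schedule in the SV spec: the names
reduce_schedule, count_ops and apply_schedule are hypothetical.
in this simple variant N elements take N-1 operations, which is
already not the element count; other schedules may differ further.

```python
def reduce_schedule(n):
    """yield (dest, src) index pairs for an in-place pairwise tree
    reduction of n elements.  hypothetical illustration only: this is
    NOT the schedule defined by the SV specification."""
    step = 1
    while step < n:
        # at each level, fold element i+step into element i
        for i in range(0, n - step, step * 2):
            yield (i, i + step)
        step *= 2


def count_ops(n):
    """number of operations the schedule issues for n elements.
    this is the kind of value svshape would have to compute and
    place in VL/MAXVL (assumption, based on the text above)."""
    return sum(1 for _ in reduce_schedule(n))


def apply_schedule(vec, op):
    """run the schedule over a copy of vec; result ends up in element 0."""
    vec = list(vec)
    for dest, src in reduce_schedule(len(vec)):
        vec[dest] = op(vec[dest], vec[src])
    return vec[0]


if __name__ == "__main__":
    elements = [1, 2, 3, 4, 5, 6, 7]
    print(list(reduce_schedule(len(elements))))
    print(count_ops(len(elements)))
    print(apply_schedule(elements, lambda a, b: a + b))
```

the point of the sketch is only that the loop bound for the hardware
is the length of the schedule, not the length of the vector.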

this has the very interesting side-effect of subtly altering the
meaning of Predicate Masks to be at the level of *individual
operations* in the parallel reduction schedule, as opposed to
being on the Vector of elements to be reduced.  that may not
be useful so will need to be considered.
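
a rough sketch of what per-operation predication would mean, again
assuming a simple pairwise tree schedule rather than the real one
(reduce_schedule and masked_reduce are hypothetical names, and
"bit k of the mask gates the k-th operation" is an assumption drawn
from the paragraph above):

```python
def reduce_schedule(n):
    """(dest, src) pairs of a simple in-place pairwise tree reduction.
    hypothetical illustration, not the SV spec schedule."""
    step = 1
    while step < n:
        for i in range(0, n - step, step * 2):
            yield (i, i + step)
        step *= 2


def masked_reduce(vec, op, mask):
    """bit k of mask enables the k-th *operation* in the schedule,
    not the k-th element of the vector (assumed semantics)."""
    vec = list(vec)
    for k, (dest, src) in enumerate(reduce_schedule(len(vec))):
        if (mask >> k) & 1:
            vec[dest] = op(vec[dest], vec[src])
    return vec[0]


if __name__ == "__main__":
    add = lambda a, b: a + b
    vec = [1, 2, 3, 4]  # schedule is (0,1), (2,3), (0,2)
    print(masked_reduce(vec, add, 0b111))  # all operations enabled
    print(masked_reduce(vec, add, 0b110))  # first operation skipped
    print(masked_reduce(vec, add, 0b011))  # final operation skipped
```

in this sketch, masking out an early operation drops one element
from the result, whereas masking out the final operation drops an
entire half of the tree, which is quite different from masking
individual elements of the source vector.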

there is also the benefit of having far more bits to play with
in SVSHAPEs and in the svshape instruction than there ever
could be when this was a "Mode".

a hell of a lot needs to get done very very quickly, so a first
implementation, after a LOT of extremely rapid updates to
the spec, is going to be a cut-down one.

although i do not wish to do so, it *may* be necessary to add
another variant (joining svindex and svshape2) in order to keep
instruction count down.  that can however be done after a
first implementation at least gives a proof of concept.

l.