[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Mon Mar 22 11:01:16 GMT 2021

---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Mon, Mar 22, 2021 at 8:28 AM Richard Wilbur <richard.wilbur at gmail.com>
wrote:

> > 1) Dynamic Partitioned ALUs will require receiving MULTIPLE predicate mask
> > bits i.e. cover multiple src/dest steps.
>
> Will the Dynamic Partitioned ALUs be receiving more than a source and
> destination mask?

neither.

> Will the mask size be related to the partitioned
> size of the ALU?
>

not at all.
https://libre-soc.org/3d_gpu/architecture/dynamic_simd/?updated

> > 2) Multi-issue will require multiple src/dest steps per clock.
>
> How many predicate masks are we talking about per operation?

https://libre-soc.org/openpower/sv/overview/
for multi-operand-src operations, only one: this is named "single predication".
for single-src single-dest operations, two: this is named "twin predication".
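
as a rough illustration of the difference, here is a minimal python sketch. it is not the Libre-SOC simulator: the function names, the flat `regs` list and the integer bitmasks are all assumptions made purely for illustration, and zeroing is left out entirely.

```python
def single_pred_add(regs, rd, ra, rb, mask, VL):
    """single predication: one mask gates every element of a multi-src op."""
    for i in range(VL):
        if (mask >> i) & 1:
            regs[rd + i] = regs[ra + i] + regs[rb + i]


def twin_pred_mv(regs, rd, rs, srcmask, dstmask, VL):
    """twin predication: srcstep and dststep each skip their own masked-out bits."""
    srcstep = dststep = 0
    while True:
        # advance srcstep to the next set bit of the source mask
        while srcstep < VL and not (srcmask >> srcstep) & 1:
            srcstep += 1
        # advance dststep to the next set bit of the destination mask
        while dststep < VL and not (dstmask >> dststep) & 1:
            dststep += 1
        if srcstep >= VL or dststep >= VL:
            break
        regs[rd + dststep] = regs[rs + srcstep]
        srcstep += 1
        dststep += 1
```

with srcmask=0b1010 and dstmask=0b0011 the twin-predicated move copies source elements 1 and 3 into destination elements 0 and 1, which is why two independent step counters are needed.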

> The code
> you posted had two:  source and destination.  Each ALU sounds like
> they might have a unique pair.
>

that would be incorrect. the front-end (ISA decoder and issuer) is
divorced, separated from, and abstracted away from, the back-end (ALUs)
entirely.  this is the point of using Cray-Style Vector ISAs: the front-end
allows variable-length, the back-end is fixed width hardware (obviously:
you cannot dynamically allocate more silicon).

therefore element operations are "fed" to the back-end in groups.  if there
is room at the back-end ALUs to fit 8 operations, then srcstep and dststep
can theoretically advance by up to 8 at a time per clock cycle.
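
a hedged python sketch of that grouping, purely as a software model: `issue_groups` and its `width` parameter are hypothetical names, and it assumes that "room for 8 operations" means up to 8 *active* (predicate bit set) elements are picked per clock.

```python
def issue_groups(mask, VL, width=8):
    """return, per modelled clock cycle, the element indices sent to the back-end.

    mask  : predicate as an integer bitmask (bit i set = element i active)
    VL    : vector length
    width : how many element operations the back-end can accept per cycle
            (an assumed parameter, 8 in the example above)
    """
    active = [i for i in range(VL) if (mask >> i) & 1]
    # chop the active elements into back-end-sized groups, one group per cycle
    return [active[k:k + width] for k in range(0, len(active), width)]
```

the front-end loop stays variable-length; only the size of each group is tied to the fixed-width hardware.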

> If I'm not mistaken, the logic design I mentioned implements the
> "count leading zeros starting from bit X".  That is basically what I
> outlined.  It also is ready for the next iteration right after you use
> the count of leading zeros because it generates its next state while
> generating the count.
>

excellent.

now bear in mind that for high-performance implementations, *multiple*
srcsteps and dststeps will need to be covered.  at some point the
complexity of detecting multiple bits and the regfile routing involved with
doing so becomes so great that we will have to make compromises.

> Here is an improved, simplified implementation of what I described earlier:
> Load a 64-bit register with the value of the mask (bits 0 to
> VL-1) with all bits at and above VL set to 1.
> Connect the lowest 32 bits of register to zero-detector =
> 32-input NOR whose output runs a MUX { 0 := passes the 64 bits
> through, 1 := shifts them down 32 bits (and passes 1's in the high 32
> bits of output)}.
> Connect the output of the previous stage to a stage of half the test
> and shift size:  16, 8, 4, 2.
>
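
for reference, the quoted staged design can be modelled in a few lines of python. this is only a behavioural sketch, not HDL: `next_active` is a hypothetical name, and the final 1-bit stage is an assumption added here so that the count is exact (the quoted description lists stages down to 2).

```python
MASK64 = (1 << 64) - 1


def next_active(mask, VL):
    """count the zero bits below the lowest set bit of a predicate mask.

    the register is loaded with mask bits 0..VL-1, and every bit at and
    above VL is forced to 1, as in the quoted description.
    """
    low_vl = (1 << VL) - 1
    reg = (mask & low_vl) | (MASK64 & ~low_vl)
    count = 0
    # 32, 16, 8, 4, 2 are the quoted stages; the 1-bit stage is an assumption
    for width in (32, 16, 8, 4, 2, 1):
        if reg & ((1 << width) - 1) == 0:      # zero-detector (wide NOR) fires
            fill = MASK64 & ~(MASK64 >> width)  # 1s shifted into the top bits
            reg = (reg >> width) | fill         # MUX selects the shifted value
            count += width
    return count
```

for VL below 64 an all-zero mask returns VL, because bit VL is forced to 1. with VL=64 and an all-zero mask there is no 1 to find, so that case needs separate detection.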

just so you know, there's a countzero module based on microwatt, here
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/logical/countzero.py;hb=HEAD

you can see it's directly equivalent to PriorityEncoder because the
correctness proof passes:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/logical/formal/proof_main_stage.py;h=e7cf254a8f31be2a2a783b47027261a7a9116ae1;hb=HEAD#l119

a masked equivalent would be handy.

> zero-detector width) and the whole mask is zero. (This will happen
> only on the first iteration, as after the first update the mask
> register will always have removed all counted zeros.)
>

bear in mind that the only state information that can be stored is
SVSTATE.  anything else has to be re-created if returning from an interrupt
into the middle of a loop.
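
to make that concrete, here is a hedged sketch of re-creating such a working mask register from nothing but the predicate, VL and srcstep. `rebuild_mask_reg` is a hypothetical helper, and it assumes the working register is simply the 1-padded mask shifted down by srcstep, with 1s shifted in at the top.

```python
MASK64 = (1 << 64) - 1


def rebuild_mask_reg(mask, VL, srcstep):
    """re-derive a shifted working mask after returning from an interrupt.

    only mask, VL and srcstep are used, i.e. values that are architecturally
    visible; nothing is carried over from before the interrupt.
    """
    low_vl = (1 << VL) - 1
    padded = (mask & low_vl) | (MASK64 & ~low_vl)  # 1s at and above VL
    fill = MASK64 & ~(MASK64 >> srcstep)           # 1s shifted in at the top
    return (padded >> srcstep) | fill
```

with srcstep=0 this is just the initial load; any non-zero srcstep reproduces the register as it would have been mid-loop under the stated assumption.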

> > for multi issue and Dynamic Partitioned SIMD, multiple of those are
> > required.
>
> A pair of source and destination predication masks for each ALU, right?
>

slightly-incorrect.  the fixed-width back-end ALUs know nothing about the
variable-length front-end ISA.  the ALUs currently know nothing about
predication in any way, shape or form.  there is absolutely no intention to
pass the two predicate masks down through the ALUs: what would be the
point, when *bypassing* the ALU entirely and directly writing zeros as
outputs is quicker and uses less power?
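
a small python model of that idea, as a sketch only: the function name, the `zeroing` flag and the `alu_calls` counter are illustrative assumptions, and an add stands in for whatever the ALU would compute.

```python
MASK64 = (1 << 64) - 1


def predicated_add(regs, rd, ra, rb, mask, VL, zeroing):
    """issue-side handling of masked-out elements; the ALU never sees the mask.

    returns how many element operations actually reached the (modelled) ALU.
    """
    alu_calls = 0
    for i in range(VL):
        if (mask >> i) & 1:
            # active element: this is the only path that uses the ALU
            regs[rd + i] = (regs[ra + i] + regs[rb + i]) & MASK64
            alu_calls += 1
        elif zeroing:
            # masked-out element with zeroing: write 0 directly, ALU bypassed
            regs[rd + i] = 0
        # masked-out without zeroing: destination is left untouched
    return alu_calls
```

the returned count shows the saving: masked-out elements cost a regfile write at most, never an ALU slot.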

this can however get complicated when dynamic SIMD back-ends are involved.

> By this do you mean "the normal Vector processor only has" output "zeroing"?
>

the original Cray Vector system for example.

> I guess I can see that if you don't want certain elements of a vector
> to be multiplied or added you can simply exclude them from the source
> predication mask, no need to send it zero operands!
>

exactly.  the reason why this is not done in "normal" Vector engines is
because it introduces a READ-MODIFY-WRITE cycle if the width of the element
operation is not equal to the write-width of the regfile hardware.

for such legacy ("normal") Vector engines, it is "easier" to say "screw it"
and write say a 32-bit zero in masked-out elements along-side a 32-bit
result which fits into a 64-bit regfile entry than it is to do "argh this
element was masked out, err read the regfile, modify the top 32 bits, write
that out".

we solved this with byte-level write-enable lines on the regfile.
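
a behavioural sketch of what byte-level write-enables buy, assuming a 64-bit regfile entry with one enable bit per byte. `write_with_byte_enable` is a hypothetical name; in real hardware the disabled bytes are simply not written, so no read of the old entry is needed (the `entry` argument here only stands in for what the storage already holds).

```python
def write_with_byte_enable(entry, data, byte_en):
    """model a 64-bit regfile write with 8 byte-enable lines.

    entry   : current 64-bit contents of the regfile entry
    data    : 64-bit value presented on the write port
    byte_en : 8-bit mask, bit k set = byte k of the entry is overwritten
    """
    out = entry
    for k in range(8):
        if (byte_en >> k) & 1:
            lane = 0xFF << (8 * k)
            out = (out & ~lane) | (data & lane)  # only enabled bytes change
    return out
```

for example, writing a 32-bit result into the low half with byte_en=0x0F leaves a masked-out 32-bit element in the top half alone, with no READ-MODIFY-WRITE and no forced zero.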

> I realize we don't have a lot of time to do this design.  I also
> realize we will have plenty of time to regret the mistakes we make at
> this stage--especially the ones that can't be corrected easily in a
> compatible fashion.
>

you're coming into this cold where i have had 2+ years to think about
pretty much nothing else, at the hardware, ISA, and software levels
combined.

the key insight raised here is that even having dst-zero and src-zero both
set is actually useful, because it saves one instruction where ANDing of
masks would otherwise be necessary.

l.