[Libre-soc-dev] [RFC] svp64 "source zeroing" makes no sense

Tue Mar 23 18:28:20 GMT 2021

On Mon, Mar 22, 2021 at 5:02 AM Luke Kenneth Casson Leighton
<lkcl at lkcl.net> wrote:
>
> On Mon, Mar 22, 2021 at 8:28 AM Richard Wilbur <richard.wilbur at gmail.com>
> wrote:
>
>
> > > 1) Dynamic Partitioned ALUs will require receiving MULTIPLE predicate
> > > mask bits i.e. cover multiple src/dest steps.
> >
> > Will the Dynamic Partitioned ALUs be receiving more than a source and
> > destination mask?
>
>
> neither.

What "predicate mask bits" will the "Dynamic Partitioned ALUs" be receiving?

>
> > Will the mask size be related to the partitioned
> > size of the ALU?
> >
>
> not at all.
> https://libre-soc.org/3d_gpu/architecture/dynamic_simd/?updated

Ok.  I read that and a bit more.  I think I have a conceptual
understanding of the dynamically partitioned ALUs.  Very cool.

> > > 2) Multi-issue will require multiple src/dest steps per clock.
> >
> > How many predicate masks are we talking about per operation?
>
>
> https://libre-soc.org/openpower/sv/overview/
> for multi-operand-src operations, only one.  named "single predication"
> for single-src single-dest operations, two. named "twin predication"
>
> > The code you posted had two:  source and destination.  Each ALU
> > sounds like they might have a unique pair.
> >
>
> that would be incorrect. the front-end (ISA decoder and issuer) is
> divorced, separated from, and abstracted away from, the back-end (ALUs)
> entirely.  this is the point of using Cray-Style Vector ISAs: the front-end
> allows variable-length, the back-end is fixed width hardware (obviously:
> you cannot dynamically allocate more silicon).
>
> therefore element operations are "fed" to the back-end in groups.  if there
> is room at the back-end ALUs to fit 8 operations, then srcstep and dststep
> can theoretically advance by up to 8 at a time per clock cycle.
>
> > If I'm not mistaken, the logic design I mentioned implements the
> > "count leading zeros starting from bit X".  That is basically what I
> > outlined.  It also is ready for the next iteration right after you use
> > the count of leading zeros because it generates its next state while
> > generating the count.
> >
>
> excellent.
>
> now bear in mind that for high-performance implementations, *multiple*
> srcsteps and dststeps will need to be covered.  at some point the
> complexity of detecting multiple bits and the regfile routing involved with
> doing so becomes so great that we will have to make compromises.

Yes, I would use one of the proposed modules for the source
predication mask and a separate one for the destination predication
mask.
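
To check my understanding of "twin predication", here is a minimal
Python sketch of one stepper per mask, each skipping its own masked-out
elements.  It assumes bit i of a mask enables element i and that neither
zeroing mode is set; the function name and the list-based "registers"
are made up for illustration and are not taken from the simulator.

```python
def twin_predicated_copy(src, dst, srcmask, dstmask, vl):
    """Copy enabled source elements into enabled destination slots.

    srcstep and dststep advance independently: each one skips the
    elements that its own predicate mask disables.
    """
    out = list(dst)
    srcstep = 0
    dststep = 0
    while True:
        # Skip masked-out source elements.
        while srcstep < vl and not (srcmask >> srcstep) & 1:
            srcstep += 1
        # Skip masked-out destination elements.
        while dststep < vl and not (dstmask >> dststep) & 1:
            dststep += 1
        if srcstep >= vl or dststep >= vl:
            break
        out[dststep] = src[srcstep]
        srcstep += 1
        dststep += 1
    return out


# Source elements 1 and 3 are enabled; destination slots 0 and 1 are enabled.
result = twin_predicated_copy([10, 11, 12, 13], [0, 0, 0, 0],
                              srcmask=0b1010, dstmask=0b0011, vl=4)
```

A multi-issue front-end would need to perform several iterations of the
outer loop per clock, which is where the bit-detection and routing cost
comes from.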

> just so you know, there's a countzero module based on microwatt, here
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/logical/countzero.py;hb=HEAD
>
> you can see it's directly equivalent to PriorityEncoder because the
> correctness proof passes:
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/fu/logical/formal/proof_main_stage.py;h=e7cf254a8f31be2a2a783b47027261a7a9116ae1;hb=HEAD#l119
>
> a masked equivalent would be handy.

What do you mean?  A masked equivalent of the PriorityEncoder, or of
the countzero module?

> bear in mind that the only state information that can be stored is
> SVSTATE.  anything else has to be re-created if returning from an interrupt
> into the middle of a loop.

What information is stored in SVSTATE?

Where in the loop is the valid exit point if an interrupt occurs?

While shovelling snow and ice this morning (storm last night) I
realized that the module I proposed needs only two things to recover
its state: the initial predication mask (our starting point) and the
last value of the step.
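
A small Python sketch of why those two values are enough.  It assumes
element 0 is bit 0 of the mask, so "count zeros starting from bit X"
becomes a scan upwards from the step; the helper names are hypothetical.

```python
def next_enabled(mask, step, vl):
    """Index of the first set mask bit at or after `step`, or vl if none."""
    remaining = (mask & ((1 << vl) - 1)) >> step
    if remaining == 0:
        return vl
    # (x & -x) isolates the lowest set bit; bit_length() - 1 is its index.
    return step + (remaining & -remaining).bit_length() - 1


def enabled_steps(mask, vl, start=0):
    """All enabled element indices from `start` onwards."""
    steps = []
    step = next_enabled(mask, start, vl)
    while step < vl:
        steps.append(step)
        step = next_enabled(mask, step + 1, vl)
    return steps


mask, vl = 0b10110, 5
full_run = enabled_steps(mask, vl)          # uninterrupted loop
resumed = enabled_steps(mask, vl, start=3)  # restart after an interrupt
```

Because next_enabled is a pure function of the mask and the step,
restarting from a saved step gives the tail of the uninterrupted
sequence, with no other hidden state to restore.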

> > > for multi issue and Dynamic Partitioned SIMD, multiple of those are
> > > required.
> >
> > A pair of source and destination predication masks for each ALU, right?
> >
>
> slightly-incorrect.  the fixed-width back-end ALUs know nothing about the
> variable-length front-end ISA.  the ALUs currently know nothing about
> predication in any way, shape or form.  there is absolutely no intention to
> pass the two predicate masks down through the ALUs: what would be the
> point, when *bypassing* the ALU entirely and directly writing zeros as
> outputs is quicker and uses less power?
>
> this can however get complicated when dynamic SIMD back-ends are involved.

Yes, having read a bit about the dynamic SIMD it sounds like
predication could be complicated in that instance.  It seems that
dynamic SIMD determines how you process an operand once you get it and
predication has to do with which operand we send and where we store
the result.

> > By this do you mean "the normal Vector processor only has" output "zeroing"?
> >
>
> the original Cray Vector system for example.
>
> > I guess I can see that if you don't want certain elements of a vector
> > to be multiplied or added you can simply exclude them from the source
> > predication mask, no need to send it zero operands!
> >
>
> exactly.  the reason why this is not done in "normal" Vector engines is
> because it introduces a READ-MODIFY-WRITE cycle if the width of the element
> operation is not equal to the write-width of the regfile hardware.
>
> for such legacy ("normal") Vector engines, it is "easier" to say "screw it"
> and write say a 32-bit zero in masked-out elements along-side a 32-bit
> result which fits into a 64-bit regfile entry than it is to do "argh this
> element was masked out, err read the regfile, modify the top 32 bits, write
> that out".
>
> we solved this with byte-level write-enable lines on the regfile.

Good choice!  I am planning to read about the Cray Vector system later today.
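
To make the difference concrete for myself, here is a rough Python model
of a 64-bit regfile entry holding two 32-bit elements.  The function
names and the 8-bit enable encoding are my own assumptions, not the
actual regfile interface.

```python
MASK64 = (1 << 64) - 1


def write_zeroing(new_low32):
    """Legacy approach: write all 64 bits, zero in the masked-out half."""
    return new_low32 & 0xFFFFFFFF


def write_byte_enabled(old, new, byte_en):
    """Write only the bytes whose enable bit is set (bit i enables byte i).

    The model takes `old` as an argument only because software has no
    storage cells; in hardware the disabled bytes are simply not driven,
    so no read of the old value is needed.
    """
    result = old
    for byte in range(8):
        if (byte_en >> byte) & 1:
            lane = 0xFF << (8 * byte)
            result = (result & ~lane) | (new & lane)
    return result & MASK64


old_entry = 0x1111222233334444
zeroed = write_zeroing(0xAABBCCDD)                             # upper element lost
kept = write_byte_enabled(old_entry, 0xAABBCCDD, byte_en=0x0F)  # upper element kept
```

With the byte enables, the masked-out upper element survives without a
read-modify-write cycle.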

> > I realize we don't have a lot of time to do this design.  I also
> > realize we will have plenty of time to regret the mistakes we make at
> > this stage--especially the ones that can't be corrected easily in a
> > compatible fashion.
> >
>
> you're coming into this cold where i have had 2+ years to think about
> pretty much nothing else at both the hardware, ISA, and software level
> combined.

Yes, I am working to come up to speed by reading the wiki (and several
of the external links referenced).  Quite interesting stuff!

> the key insight raised here is that even dst-zero and src-zero both set is
> actually useful because it saves one instruction where ANDing of masks
> would otherwise be necessary.

Saving instructions is great!