[Libre-soc-isa] [Bug 213] SimpleV Standard writeup needed

Tue Oct 20 02:01:51 BST 2020

https://bugs.libre-soc.org/show_bug.cgi?id=213

--- Comment #78 from Jacob Lifshay <programmerjake at gmail.com> ---
(In reply to Luke Kenneth Casson Leighton from comment #75)
> (In reply to Jacob Lifshay from comment #72)
> 
> > All you need is a bitwise right shifter to send the next 8 bits from the
> > vector mask to the ALU,
> 
> jacob you're not quite getting it: this is only possible to do ("a simple
> shift") if there are no Dependency Matrices involved.

Yeah, that's true. I had been thinking that the shift operation would just be
repeated each time new bits were available from their respective sources. Scrap
that idea.

> an OoO system MUST track ALL objects regardless of size.

I was never advocating for not tracking all objects; my point is that some of
the objects are of a different kind, and we should treat them differently.

A demo datapath:
https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.png
https://libre-soc.org/3d_gpu/int_regs_as_masks.dia.svg

The demo datapath leaves the FU registers as an implicit part of the
corresponding ALU/FU combos, because I didn't want to spend 4 hours drawing.

Muxes in the datapath diagram are actually bidirectional: internally, each one
has a separate mux for each direction.

> the significance of this had not really sunk in properly for me because i
> had not realised the latency problem you highlighted.
> 
> we have two choices at each end of the spectrum (and some in between)
> 
> * bitlevel predicate Dependency Matrices: one bit per element
> * "one hit" (one scalar) predicate masks (with associated latency)
> 
> when doing bitlevel DMs one optimisation in the VL instruction issue phase
> is to notice the following:
> 
> 
> * VL=16
> * elwidth=16
> * SIMD width=64
> * therefore 4x ops can be batched to each ALU
> 
> *BUT*
> 
> to do that, you need 4 bits of predicate i.e. 4 predicate regs to be passed
> to those ALUs.
> 
> now, if you start having to get those 4 bits (which can't do the shifting
> you suggest *because they haven't been read yet*) it quickly becomes hell.
> 
> note that DMs track regs *before the contents are available*.  we don't
> *have* the contents of the predicate mask available at the time in order to
> be able to shift it!

yup.
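
To make the batching arithmetic in the quoted example concrete, here is a
minimal Python sketch. It only works out which element indices (and therefore
which predicate bits) end up in each SIMD batch; it does not model when those
bits become available, which is the actual problem. The function name
batch_plan and its parameters are made up for illustration.

```python
def batch_plan(vl, elwidth, simd_width):
    """Group element indices into SIMD-wide batches (illustrative only).

    Returns (lanes, batches). Each batch lists the element indices it
    covers, which are also the predicate bit positions it depends on.
    """
    lanes = simd_width // elwidth  # elements per SIMD ALU operation
    batches = [list(range(start, min(start + lanes, vl)))
               for start in range(0, vl, lanes)]
    return lanes, batches


lanes, batches = batch_plan(vl=16, elwidth=16, simd_width=64)
print(lanes)       # 4
print(batches[0])  # [0, 1, 2, 3]
```

Each batch needs all of its predicate bits before it can go to an ALU, and
none of those bits is known at issue time.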

> consequently you have to do that shadow trick, and only when the reg is read
> *then* you can finish off the bitlevel analysis (shifting if necessary) and
> send it on to each ALU.

Another possible scheme is to have each FU take the mask into its source latch
whenever the mask is ready; then, if the mask is set to 0, the FU can signal
the required circuitry to cancel itself. That way, the mask just becomes a
normal dependency, rather than needing to be so special.
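
A rough behavioural sketch of that scheme in Python (a toy model, not RTL and
not taken from the actual Libre-SOC code; the class FunctionUnit and its
latch names are hypothetical):

```python
class FunctionUnit:
    """Toy model of an FU whose mask is just another source latch."""

    def __init__(self, op):
        self.op = op
        # None means "not yet received from the producing FU/regfile".
        self.latches = {"a": None, "b": None, "mask": None}
        self.cancelled = False

    def latch(self, name, value):
        """Capture one source operand whenever it becomes ready."""
        self.latches[name] = value
        if name == "mask" and value == 0:
            # Masked out: signal cancellation instead of executing.
            self.cancelled = True

    def ready(self):
        """True once every source has arrived and the FU is not cancelled."""
        return (not self.cancelled
                and all(v is not None for v in self.latches.values()))

    def result(self):
        """Return the ALU result, or None if not ready or cancelled."""
        if not self.ready():
            return None
        return self.op(self.latches["a"], self.latches["b"])


fu = FunctionUnit(lambda a, b: a + b)
fu.latch("a", 1)
fu.latch("mask", 0)   # the mask may arrive in any order relative to a and b
print(fu.cancelled)   # True
```

The operands and the mask can arrive in any order, so nothing in the model
treats the mask differently from a data input until its value is inspected.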

> even having an internal PRF ARF special designation: the protection needed,
> i did try once the idea of making VL a pointer to a reg rather than an
> immediate, and hoo-boy was it convoluted.
> 
> 
> you need to think through: what is the logic needed to implement 8-bit
> vector mask *when you do not have access to the mask yet*, how will the mask
> get into the Shadow Matrices, and how does it work for all possible elwidths
> and all possible values of VL.

Obviously it's just another dependency, just like a data input.

The exact dependencies can be calculated at instruction decode time and stored
in latches/flip-flops wherever they are needed.
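
As a sketch of what "calculated at decode time" could look like, assuming the
mask lives in an integer register (as in the diagram above): the decoder just
adds the mask register to the set of read dependencies. The Instr fields and
the function read_dependencies are hypothetical, not an actual instruction
format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction, for illustration only."""
    dest: int
    srcs: Tuple[int, ...]
    mask_reg: Optional[int] = None  # integer register holding the mask


def read_dependencies(instr):
    """Registers that must be read before the instruction can execute."""
    deps = set(instr.srcs)
    if instr.mask_reg is not None:
        deps.add(instr.mask_reg)  # the mask is one more read dependency
    return frozenset(deps)


deps = read_dependencies(Instr(dest=3, srcs=(1, 2), mask_reg=10))
print(sorted(deps))  # [1, 2, 10]
```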
