[Libre-soc-isa] [Bug 1056] questions and feedback (v2) on OPF RFC ls010

Wed May 31 17:26:10 BST 2023

https://bugs.libre-soc.org/show_bug.cgi?id=1056

--- Comment #38 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to Paul Mackerras from comment #33)
> (In reply to Jacob Lifshay from comment #31)
> > 
> > it is required in hardware that supports both endians since the byte
> > reversal hardware is what changes whether little-endian or big-endian element
> > indexing is used, by byte reversing inputs/outputs of operations (or any
> > logically equivalent method that is likely much more efficient).
> > 
> > e.g., assuming your endian proposal with VL=4 r3=0x0123_4567_89ab_cdef
> > sv.ori/sw=8/dw=16 *r3, *r3, 0
> > in LE mode produces:
> > r3=0x0089_00ab_00cd_00ef
> > in BE mode produces:
> > r3=0x0001_0023_0045_0067
> 
> Interesting example. I'll have to think about how I would implement that.

the key point about jacob's example is that the source and destination
widths are different.  obviously, full crossbars to SIMD ALUs in front
of the regfiles may be far too many gates for some implementations
to handle.

therefore it is noted in the spec that "some implementations may be
slower if the source and dest elwidths are not the same" 
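
as a minimal sketch of jacob's example (not Libre-SOC code: the helper name
`sv_ori_widen` is hypothetical), the two results can be reproduced in Python
by picking elements out of the 64-bit register in either byte order.  it
assumes that all source elements are read before any destination element is
written, and that narrow elements are zero-extended; both are simplifications
made for illustration only.

```python
def sv_ori_widen(reg, vl, sw, dw, byteorder, imm=0):
    """Hypothetical model of 'sv.ori/sw=N/dw=M' on one 64-bit register.

    byteorder is "little" or "big" and selects the element indexing.
    Assumption: the source is snapshotted before any write happens.
    """
    assert vl * sw <= 64 and vl * dw <= 64
    sb, db = sw // 8, dw // 8                  # element sizes in bytes
    src = reg.to_bytes(8, byteorder)           # snapshot of the source register
    # element i occupies bytes [i*sb, (i+1)*sb) in the chosen byte order
    elems = [int.from_bytes(src[i * sb:(i + 1) * sb], byteorder)
             for i in range(vl)]
    dst = bytearray(src)                       # untouched bytes keep old value
    for i, e in enumerate(elems):
        result = (e | imm) & ((1 << dw) - 1)   # ori, truncated to dest width
        dst[i * db:(i + 1) * db] = result.to_bytes(db, byteorder)
    return int.from_bytes(dst, byteorder)


r3 = 0x0123_4567_89ab_cdef
print(f"{sv_ori_widen(r3, 4, 8, 16, 'little'):#018x}")  # 0x008900ab00cd00ef
print(f"{sv_ori_widen(r3, 4, 8, 16, 'big'):#018x}")     # 0x0001002300450067
```

the only difference between the two calls is the byte order used to convert
the register to and from a byte string, which is the software equivalent of
the byte reversal on inputs and outputs that jacob describes.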

> Ignoring BE for the moment, what kind of structure do you have in your
> design for handling this kind of source/destination width mismatch? Is it
> something like a bunch of multiplexers ahead of the ALU, or is there a more
> clever way to do it?

i was planning 4 (or 8) lanes of *completely independent* ALUs, with the
lane selected by register number Modulo-4 or Modulo-8.  that selection
may happen either before or after Register-Renaming.

preceded (for read) / followed (for write) by a 64-bit-wide cyclic
shift queue, in lieu of a full crossbar.  "routing" becomes a simple
count-down with the difference between "(RA modulo 4)" (or mod8) and
"target ALU lane-position".  when that count-down reaches zero, the
In-Flight data is delivered.
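
a rough sketch of that count-down routing, under stated assumptions: the
function names are hypothetical, the queue is assumed to shift by exactly one
lane per clock, and the shift direction (towards higher lane numbers) is
chosen arbitrarily.  it is a behavioural model only, not a description of
any actual hardware.

```python
def route_countdown(ra, target_lane, lanes=4):
    """Number of shifts needed to move data from lane (ra mod lanes)
    to target_lane in a cyclic shift queue (assumed one lane per clock)."""
    return (target_lane - ra % lanes) % lanes


def deliver(ra, target_lane, lanes=4):
    """Step the queue until the count-down reaches zero.

    Returns (lane where the data ended up, clock cycles spent shifting).
    """
    pos = ra % lanes
    countdown = route_countdown(ra, target_lane, lanes)
    cycles = 0
    while countdown > 0:
        pos = (pos + 1) % lanes   # cyclic shift by one lane
        countdown -= 1
        cycles += 1
    return pos, cycles            # count-down is zero: data is delivered


print(deliver(5, 0))           # register 5 starts in lane 1, wants lane 0
print(deliver(6, 2))           # already in the right lane: no shifting
print(deliver(3, 1, lanes=8))  # Modulo-8 variant
```

the count is known at issue time, before any shifting takes place, so the
routing latency of each operand can be computed up front.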

this would work perfectly fine for both an Out-of-Order system and In-Order
but an In-Order one would be rather unhappy about the variable-latency
it introduces.  have to be quite careful about that, but even the latency
can be Deterministically calculated (assuming count-down Hazard Protection,
one per register-to-be-written, just like in Microwatt)

REMAP is where that gets *really* expensive (and hairy) but it is still doable.

the principal difference between Simple-V and other Vector ISAs:

the logic that would normally go into e.g. "xxperm" deep down in
one of the pipelines has been "promoted" up to first-order routing,
not only on registers but on *actual bytes*, and now *interacts* with
Register Hazard Management.

it means Hardware Engineers get a bit of a jaw-dropping shock
("you want to do whuuuu?") but if you spend 4+ years thinking it
through it does actually work.

Brad very kindly prompted me here to expand "hphint" to a
first-priority means of making Hardware Engineers' lives tolerable:
you can set hphint *GREATER* than MAXVL and it tells Multi-Issue
Hardware that (hphint/MAXVL) batches may be spammed to backend
hardware in each clock cycle *WITHOUT* having to check Register
or Memory Hazards on *any* of the elements within each multi-issue
batch.
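
as an illustration of the arithmetic only: the helper name below is
hypothetical, and rounding down with integer division is an assumption, not
something taken from the spec.  the sketch deliberately covers only the
case described above, where hphint is greater than MAXVL.

```python
def hazard_free_batches(hphint, maxvl):
    """Batches of MAXVL elements that may be issued per clock without
    per-element hazard checks, for hphint > MAXVL (assumed: round down)."""
    if maxvl <= 0:
        raise ValueError("maxvl must be positive")
    if hphint <= maxvl:
        raise ValueError("this sketch only models hphint > MAXVL")
    return hphint // maxvl


print(hazard_free_batches(16, 4))  # 4
print(hazard_free_batches(10, 4))  # 2, under the round-down assumption
```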
