[Libre-soc-bugs] [Bug 230] Video opcode development and discussion

bugzilla-daemon at libre-soc.org bugzilla-daemon at libre-soc.org
Fri Dec 11 16:24:20 GMT 2020


https://bugs.libre-soc.org/show_bug.cgi?id=230

--- Comment #13 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
(In reply to cand from comment #12)
> Commented on the horizontal mapreduce.
> 
> That has the issue that it's a massive PITA to code, plus it's slow. Plus 
> there's the "access to non-4-offset regs stalls".

i realised i made an error in the assessment.  let us assume that there is a
(very long) vector in memory that needs to be mapreduced/multiplied.  N=1000
for example.  the implementation would be a pair of while-loops, where the MVL
(Max Vector Length) is set to a *fixed* amount, taking up a fixed range of the
register file.

* the inner loop simply performs standard VL-driven multiplies, 8 at a time (or
  whatever MVL is set to) storing results in a secondary array

* at the end of that inner loop, the remaining (non-power-of-two) amount,
  which can be anywhere between 0 and 7, results in a perfectly-acceptable
  SIMD auto-predicated operation.  no complexity at the source code level, here.

* the outer loop now kicks in, making the secondary array the primary array
  and entering the inner loop, once again, but with *half* the number of
  elements.

* finally, N (the length of our ever-halving array) reaches a value between
  0 and 7, at which point we end up with some rather small VLs.  this is
  the *only* point at which a little more thought is needed, to deal with
  things like 6 3 2 1.

* if MVL was in fact limited to e.g. 4 rather than 8, the "last cleanup" is
  extremely small.

* (actually for the very small end-of-loop it would probably be better to
  just do a series of accumulator multiplies and not worry about it, let
  the OoO issue engine deal with it).
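
The bulleted scheme above can be sketched in software as follows. This is a
minimal Python model under stated assumptions, not SV assembly and not a
proposed implementation: `product_reduce`, `level_lengths` and the `mvl`
parameter are hypothetical names, pairing element i with element half+i is an
assumption (the text does not say which pairs are multiplied), and an odd
leftover element is simply carried to the next level, which is one way to end
up with level lengths like 6 3 2 1.

```python
from typing import List


def level_lengths(n: int) -> List[int]:
    """Array length at each level of the halving (odd leftovers carried)."""
    lengths = [n]
    while n > 1:
        n = n // 2 + n % 2
        lengths.append(n)
    return lengths


def product_reduce(data: List[int], mvl: int = 8) -> int:
    """Multiply all elements together with a pair of while-loops.

    mvl stands in for the fixed Max Vector Length (assumed to be 8 here).
    """
    if not data:
        return 1  # empty product
    src = list(data)
    while len(src) > 1:              # outer loop: one pass per level
        half = len(src) // 2
        dst = []                     # the "secondary array"
        done = 0
        while done < half:           # inner loop: at most mvl pairs at a time
            vl = min(mvl, half - done)   # VL = MIN(MVL, remaining)
            # models one vector multiply of vl element pairs
            for i in range(done, done + vl):
                dst.append(src[i] * src[half + i])
            done += vl
        if len(src) % 2:             # odd element has no partner: carry it
            dst.append(src[-1])
        src = dst                    # secondary array becomes the primary
    return src[0]
```

For example, `level_lengths(6)` gives `[6, 3, 2, 1]`, and
`product_reduce([1, 2, 3, 4, 5, 6])` gives `720`.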

coding that up at the *hardware* level fills me with some... trepidation
(see below).

if it is at the *software* level, note that there is *no* explicit decision
to reduce VL by half each time (it was a mistake to suggest that).  instead,
you use the fact that VL is set to MIN(MVL, RA), where RA is the "desired"
(maximum) number of elements on which multiplies are to be executed.
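
To illustrate the VL = MIN(MVL, RA) point, here is a small hedged sketch in
Python. `vl_schedule` is a hypothetical helper name and MVL=8 is assumed; it
only models the arithmetic of choosing VL on each pass, showing that the short
tail falls out automatically without any explicit halving of VL.

```python
from typing import Iterator


def vl_schedule(n: int, mvl: int = 8) -> Iterator[int]:
    """Yield the VL chosen on each pass over n remaining elements."""
    remaining = n
    while remaining > 0:
        vl = min(mvl, remaining)   # VL = MIN(MVL, RA), RA = remaining count
        yield vl
        remaining -= vl


print(list(vl_schedule(20)))  # [8, 8, 4]
```

Only the last pass sees a VL below MVL; every earlier pass runs at the full
fixed width.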


>  Even if there's no ready 
> operation, it should be made easier and faster than a manual mapreduce loop.

sigh :)  the issue is that we've got a hugely complex system already, where
SV implements, in between the issue and execution phase, the following loop:

function op_add(rd, rs1, rs2) # add not VADD!
  int i, id=0, irs1=0, irs2=0; # per-operand element offsets, start at zero
  rd  = int_vec[rd ].isvector ? int_vec[rd ].regidx : rd;
  rs1 = int_vec[rs1].isvector ? int_vec[rs1].regidx : rs1;
  rs2 = int_vec[rs2].isvector ? int_vec[rs2].regidx : rs2;
  for (i = 0; i < VL; i++)
    ireg[rd+id] <= ireg[rs1+irs1] + ireg[rs2+irs2]; # <- actual operation here
    if (int_vec[rd ].isvector)  { id += 1; }   # scalar operands stay put
    if (int_vec[rs1].isvector)  { irs1 += 1; }
    if (int_vec[rs2].isvector)  { irs2 += 1; }
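
The loop above can be modelled in a few lines of Python, purely as an
illustration. The `Tag` class, the flat `ireg` list and the `vl` argument are
assumptions made for this sketch; it also reads the isvector flags before
redirecting the register numbers, which the pseudocode leaves implicit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tag:
    """Hypothetical per-register tag: vector flag plus redirected index."""
    isvector: bool = False
    regidx: int = 0


def op_add(ireg: List[int], int_vec: List[Tag], vl: int,
           rd: int, rs1: int, rs2: int) -> None:
    # Read the vector flags first, then redirect the register numbers.
    vd = int_vec[rd].isvector
    v1 = int_vec[rs1].isvector
    v2 = int_vec[rs2].isvector
    rd = int_vec[rd].regidx if vd else rd
    rs1 = int_vec[rs1].regidx if v1 else rs1
    rs2 = int_vec[rs2].regidx if v2 else rs2
    i_d = i_1 = i_2 = 0
    for _ in range(vl):
        ireg[rd + i_d] = ireg[rs1 + i_1] + ireg[rs2 + i_2]
        if vd:
            i_d += 1    # vector destination: step to next element
        if v1:
            i_1 += 1    # vector source: step to next element
        if v2:
            i_2 += 1    # scalar sources are re-read on every iteration


# Usage: r4 and r5 are tagged as vectors, r6 stays scalar.
ireg = [0] * 32
int_vec = [Tag() for _ in range(32)]
int_vec[4] = Tag(True, 16)    # destination vector starts at r16
int_vec[5] = Tag(True, 8)     # source vector starts at r8
ireg[8:12] = [1, 2, 3, 4]
ireg[6] = 10
op_add(ireg, int_vec, vl=4, rd=4, rs1=5, rs2=6)
print(ireg[16:20])  # [11, 12, 13, 14]
```

A horizontal mapreduce would need a further loop wrapped around this one,
which is where the extra complexity comes from.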

applying a mapreduce algorithm *could* be done but it's yet another outer loop
even on top of that.

and, more than that: where do you store the intermediary results used in
between levels in the mapreduce?  (they certainly can't go in the regfile)

how do you deal with the dependencies on those intermediaries?

what happens if there's an interrupt in the middle of that?  context-switching
needs to swap *everything*... where do those intermediaries go?  (they can't go
into the regfile)

so the answers to these questions - the complexity that's introduced when
trying to implement it - are why i've stayed away from this one.

to put a ballpark figure on timescales for a proper implementation (start to
finish, including Specifications) i'd put about 4-5 months on this one task
alone.

for dotproduct we may *have* to implement something like it, but this is an
outlier and, again, i'm inclined to defer it because of the complexity.
