[Libre-soc-dev] svp64 questions: variable parallelism vs predictability
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Sat Dec 26 12:48:07 GMT 2020
On Saturday, December 26, 2020, Alexandre Oliva <oliva at gnu.org> wrote:
> While some languages specify a strict ordering for array assignments,
> requiring the infrastructure to satisfy certain guarantees, other leave
> it unspecified, reducing the complexity for the implementation or giving
> it more flexibility and room for optimizations, while shifting
> trouble-avoidance onto another layer.
> These are all valid design decisions, they just need to be documented.
done, the reminder is really appreciated
> E.g., the fact that the loop over the vector was intended to be taken as
> sequential, rather than parallel, was not obvious to me, especially
> given that AFAICT there isn't an option to count downwards rather than
> upwards, as an implied ordering would suggest.
indeed. this has been bugging me for some time. in the "1/2/3D matrix"
enhancement this was possible; the issue, of course, is that it takes
precious encoding space. data can always be LDed with inverse counts, or
a mv performed with countdown elements: this can be done by an index
instruction followed by a subtract.
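a toy sketch of the countdown idea, assuming a hypothetical "index" instruction that fills a vector with 0..VL-1 (the names and semantics here are illustrative assumptions, not real SVP64 mnemonics):

```python
# Sketch: synthesising a descending element order from ascending-only
# loop hardware, via an index instruction followed by a subtract.
VL = 8                                 # vector length
src = [10, 11, 12, 13, 14, 15, 16, 17]

# "index" instruction: fill a vector with 0..VL-1 (ascending)
idx = [i for i in range(VL)]

# subtract each element from VL-1 to invert the count: 7,6,5,...,0
ridx = [(VL - 1) - i for i in idx]

# indexed mv: dst[i] = src[ridx[i]] copies the vector in reverse
dst = [src[ridx[i]] for i in range(VL)]
# dst == [17, 16, 15, 14, 13, 12, 11, 10]
```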
> Now, detecting overlaps and enforcing serialization is something that
> requires additional gates, and it's a job that a compiler could do
> without much effort, so we could relieve the hardware and rely on
> compilers to do this job. E.g., overlaps could be ruled out entirely
> (whether or not traps are mandated), or even specified and documented as
> means to detect the amount of parallelism that a certain hardware
> implementation offers, to then select ifuncs that best exploit the
> available amount of parallelism.
> OTOH, multi-issue already requires detection of dependencies across
> insns, so it's not obvious that attempting to relieve the hardware from
> this complexity in the vector case would accomplish much.
hence the reason for using OoO.
> Anyway, it looks like a decision has already been made. I'm a little
> surprised that it enforces the strictest possible ordering, rather than
> enabling flexibility to improve parallelism,
small secret: if we were doing an actual Vector Processor you would be
totally correct, although it would be moot, because traditional Vector
ISAs have vector regs and there *is* no possibility of inter-element
overlap. here we are leveraging scalar systems: lift up a layer between
decode and issue and shove in a loop.
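a minimal sketch of that "shoved-in loop" idea (assumed semantics for illustration, not the actual Libre-SOC decoder): a vector op is expanded, between decode and issue, into a strictly sequential loop of scalar element operations over the scalar register file.

```python
# Sketch: vectorisation as a loop inserted between decode and issue.
def issue_vector_add(regs, rd, ra, rb, VL):
    """Expand 'sv.add rd, ra, rb' into VL scalar adds, in element order."""
    for i in range(VL):               # the loop shoved in after decode
        # each iteration is one scalar issue into the scalar regfile
        regs[rd + i] = regs[ra + i] + regs[rb + i]
    return regs

regs = list(range(32))                # toy scalar register file: regs[i] = i
issue_vector_add(regs, 0, 8, 16, VL=4)
# regs[0..3] == [8+16, 9+17, 10+18, 11+19] == [24, 26, 28, 30]
```

the sequential loop is also why the strict element ordering falls out naturally: element i is issued before element i+1, exactly as a scalar loop would.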
> but I'll just hope these
> possibilities have been considered when the time was right to make these
> decisions, and that the right decisions were made back then ;-)
it was considered: the fallback is of course scalar issue in a loop, and
even an in-order system should cope with that.
the possibility of some very cool but weird inherent reduce operations,
where the source overlaps the dest by one, was too good to pass up.
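a sketch of that overlap trick (register numbering and encoding here are illustrative assumptions, not real SVP64): because element order is strictly sequential, a vector add whose source aliases the destination shifted by one element becomes an inherent running reduction.

```python
# Sketch: "source overlaps dest by one" turns sv.add into a prefix sum.
def sv_add_overlap(regs, rd, ra, VL):
    """Emulate an overlapped vector add: dest element i reads dest
    element i-1, which the previous loop iteration just wrote."""
    for i in range(1, VL):
        regs[rd + i] = regs[rd + i - 1] + regs[ra + i]
    return regs

vals = [1, 2, 3, 4, 5]
regs = vals[:]                        # rd and ra aliased for simplicity
sv_add_overlap(regs, 0, 0, VL=5)
# regs == [1, 3, 6, 10, 15]  -- the running (prefix) sum of vals
```

note that this only works *because* the ordering is strict: a parallel implementation would read stale values of the overlapped elements.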
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68