[Libre-soc-dev] SVP64 Scalar Map-reduce mode added

Luke Kenneth Casson Leighton lkcl at lkcl.net
Thu Jun 10 14:16:47 BST 2021

crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Jun 10, 2021 at 6:12 AM Lauri Kasanen <cand at gmx.com> wrote:

> This would probably go in the "optimized MP3 SV" phase. An easy change
> at the source level anyway. (and the pysim counters would need to show
> somehow the efficiency difference between the two)

remember that Vector ISAs are length-agnostic, hardware-agnostic "Abstractions".

i found an ARM SVE2 tutorial which makes this very clear.

it is linked on the wikipedia page on Vector processing now.

strictly speaking there will be no difference as far as the program is
concerned, regardless of hardware.

bear in mind, repeating again: Vector ISAs are hardware-agnostic, so
trying to design for one hardware implementation only is a
severe violation of the fundamental design principle of Vector ISAs.


* mapreduce-scalar is suited for things like carry-in carry-out chains.

hardware CAN STILL OPTIMISE THIS to be fully parallel (by doing carry
propagation but completely behind the scenes)
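
as a rough illustration (a sketch in plain python, not SVP64 semantics
and not the simulator's code: the function name, the little-endian limb
layout and the 64-bit limb width are all assumptions made for the
example), a carry-in carry-out chain over a vector of limbs looks like
this. the sequential loop is only what the program *expresses*: how the
hardware actually resolves the carries is invisible to it.

```python
MASK64 = (1 << 64) - 1  # assumed limb width: 64 bits


def bigint_add(a_limbs, b_limbs, carry_in=0):
    """Add two equal-length, little-endian vectors of 64-bit limbs.

    Each element consumes the carry produced by the previous one,
    which is the carry-in carry-out chain. Returns (limbs, carry_out).
    """
    if len(a_limbs) != len(b_limbs):
        raise ValueError("limb vectors must have the same length")
    result = []
    carry = carry_in
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry
        result.append(total & MASK64)  # low 64 bits stay in this limb
        carry = total >> 64            # overflow feeds the next limb
    return result, carry


def limbs_to_int(limbs):
    """Helper for checking: reassemble limbs into one python integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))
```

whatever the hardware does internally, the only requirement is that the
result matches what this element-by-element chain would produce.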

* mapreduce-vector is suitable for very large VL with very complex
(long clock latency) operations such as CORDIC, DIV, MOD, RSQRT, LOG

again however it is entirely up to the hardware implementor to choose
whether or not to optimise.
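
to make the "no difference as far as the program is concerned" point
concrete, here is a small python sketch (hypothetical helper names, and
it assumes a non-empty vector and an *associative* operation such as
integer add: it is not a description of how any SVP64 implementation
schedules a reduction). one function reduces strictly in order, the
other pairs elements up as a tree, the way wide hardware might.

```python
from functools import reduce
import operator


def reduce_sequential(op, vec):
    """Reduce strictly left to right, one element per step."""
    return reduce(op, vec)


def reduce_tree(op, vec):
    """Reduce by combining neighbouring pairs, level by level.

    Every pair within a level is independent of the others, so
    parallel hardware could compute a whole level at once.
    """
    level = list(vec)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element out carries to next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


values = list(range(1, 18))
# same answer either way: the program cannot tell which schedule ran
assert reduce_sequential(operator.add, values) == reduce_tree(operator.add, values)
```

the equivalence only holds here because integer add is associative. for
operations where order matters, the result depends on the schedule, so
the sketch says nothing about those.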

the program still *must not* try to "optimise" for that hardware: doing
so will make other hardware less efficient than it should be.

optimisations should be for the following:

* programs as small and as compact as possible.
* maximising the use of the register file: VL should be as large as possible

maximising regfile, sigh, i have to increase the simulator register
file size from 32 to 128 to be able to do that.  should not be a
problem, only that the CR must also increase, which will be more fun.

programs being compact, this is to get L1 cache usage down.  this
reduces power consumption.

using large VL, basically what then happens is that instruction fetch
and decode does NOTHING, it sits idle.

again, this is BY DESIGN, and it reduces power consumption.
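
a toy model of why large VL leaves fetch and decode idle (python, purely
illustrative: the function name and the idea of a fixed number of
instructions per loop iteration are assumptions, and it ignores setup
instructions and everything else a real pipeline does). each trip round
a vector loop fetches the loop body once and processes up to VL
elements, so the fetch count falls as VL rises.

```python
def fetches_needed(n_elements, vl, insns_per_iteration):
    """Instructions fetched to process n_elements in a simple vector loop.

    Toy model: every loop iteration fetches insns_per_iteration
    instructions and handles at most vl elements.
    """
    if vl < 1:
        raise ValueError("vl must be at least 1")
    iterations = -(-n_elements // vl)  # ceiling division
    return iterations * insns_per_iteration


# made-up numbers, only to show the trend
for vl in (1, 8, 64):
    print(vl, fetches_needed(1024, vl, insns_per_iteration=4))
```

the element work is the same in every case. only the amount of fetch
and decode activity changes, which is where the power saving comes from.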

all optimisations of Vector ISA programs are therefore *not*
targeted at *performance*, they are targeted at *efficiency*.

then on hardware that happens to have massive bandwidth it screams
along but does so with less power.

however, again, it is critically important, Lauri, that you not think
in terms of "how to maximise performance", but instead optimise for
hardware-agnostic *efficiency*.

