[Libre-soc-dev] video assembler

Luke Kenneth Casson Leighton lkcl at lkcl.net
Fri May 14 14:14:31 BST 2021

Lauri, i had a bit of a think, and also encountered this page:


it helps clarify some of the automatic benefits of Vector ISAs but does not
illustrate the difference quite as starkly as those two strncpy examples.

apart from FAIL-FIRST-LDST mode, the Vector version may result in internal
micro-ops that look pretty much exactly like the SIMD case, *but without
the programmer having to explicitly write them*

it had not really occurred to me, and i therefore find it fascinating, that
you may not yet fully appreciate the stark simplicity and general-purpose
nature of Vector operations.

a general rule of thumb is that the assembly code for Vector ISAs *has* to
be written with the fact in mind that it is going to be run on *multiple
platforms* with completely and utterly different back-end architectures and
levels of performance.

thus it is *not your responsibility* to try to heavily optimise the
assembly code for multiple different platforms, it is *our* responsibility,
the hardware designers, to come up with optimised *hardware* that makes the
*exact same assembler* run considerably faster.
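
to make that concrete, here is a toy model in python (not SVP64 semantics,
just a sketch): the same "program" runs unchanged on two imaginary back-ends
that differ only in how many elements they process per pass.  the function
name and the max_vl parameter are made up for illustration.

```python
# toy model only: max_vl is a hypothetical stand-in for whatever limit the
# hardware implementation imposes on the number of elements per pass.

def vector_add_immediate(src, imm, max_vl):
    """Add imm to every element of src, at most max_vl elements per pass."""
    dst = []
    passes = 0
    i = 0
    while i < len(src):
        # the "hardware" decides how many elements this pass will cover
        vl = min(max_vl, len(src) - i)
        dst.extend(x + imm for x in src[i:i + vl])
        i += vl
        passes += 1
    return dst, passes

data = list(range(10))
# identical call, two different imaginary back-ends
small, small_passes = vector_add_immediate(data, 2, max_vl=2)
large, large_passes = vector_add_immediate(data, 2, max_vl=8)
```

the results are identical, only the number of passes changes: the program
did not have to be rewritten for the wider back-end.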

it is therefore your role to provide us (the hardware engineers) with that
feedback (just as we did earlier in the preliminary analysis)

thus in many ways it is completely misleading to try to use instruction
counters to gauge the performance (more on this below)

the job of the Vector assembly code is to hit whatever hardware is
underneath with as much data as it possibly can.

what *is* therefore useful is to measure the saturation level of each of
the hardware Function Units, to see if they are 100% occupied, and whether
anything is stalling out.
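
a minimal sketch of that kind of measurement, in python.  the trace format
(one busy/idle flag per cycle per Function Unit) and the unit names are
assumptions made up for illustration, they are not the output of any
existing simulator.

```python
def fu_occupancy(trace):
    """Return the fraction of cycles each Function Unit was busy.

    trace: dict mapping FU name -> non-empty list of booleans,
           one entry per cycle (True = busy).  hypothetical format.
    """
    return {fu: sum(busy) / len(busy) for fu, busy in trace.items()}

# made-up four-cycle trace for two made-up Function Units
trace = {
    "alu":  [True, True, True, True],
    "ldst": [True, False, True, False],
}
occ = fu_occupancy(trace)

# any unit below 100% is a candidate for "why is this stalling?"
not_saturated = [fu for fu, frac in occ.items() if frac < 1.0]
```

here the "alu" entry is fully occupied and "ldst" is only busy half the
time, so "ldst" is where to start looking.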

[hence why i said: if no such assembler exists *that iterative feedback
analysis process cannot even begin*]

what is also useful, once an algorithm has been written, is to see if
there are better instructions or fewer instructions that could be used *or
added to the ISA*, based on an assumption that the "best" hardware
implementation will ensure 100% throughput without stalls.

now, that all said...

there is actually an area of SVP64 where some level of optimisation may
actually be necessary, and it's down to the fact that, unlike traditional
Vector ISAs, we overload the standard regfiles.

traditional Vector ISAs have *actual* vector reg numbers, that sit in a
completely separate regfile from FP and INT.

thus, at the outset, we have no idea how large to make each "Vector".  too
large and you have "spill" of extremely large numbers of registers swapping
through L1/L2 caches.  too small and in-order systems will likely not
perform as well.

thus it occurs to me that perhaps, from the outset, it may be a good idea
to write the assembler using gnu binutils "macro" substitution, where the
number of registers used in a batch is a #defined compile-time number.

rather than:

      sv.addi 8.v, 12.v, 2

we have instead:

    #define BATCH_SIZE 4
    #define firstparam (BATCH_SIZE+8)
    #define secondparam (firstparam+BATCH_SIZE)

    sv.addi firstparam.v, secondparam.v, 2

in this way it becomes possible to change the Vector Batch size without a
total assembler rewrite.
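
the same idea sketched in python, purely to show the arithmetic: register
numbers are derived from one batch-size value, so changing it moves every
operand at once.  the helper names and the base register of 8 (taken from
the literal sv.addi line above) are assumptions for illustration, and the
emitted text just mirrors the operand format used in this email.

```python
def batch_registers(batch_size, base=8, count=2):
    """Starting register numbers for `count` consecutive batches.

    base=8 is an assumption borrowed from the literal example.
    """
    return [base + n * batch_size for n in range(count)]

def emit_addi(batch_size, imm=2):
    """Format one sv.addi line using batch-derived register numbers."""
    first, second = batch_registers(batch_size)
    return f"sv.addi {first}.v, {second}.v, {imm}"

line_batch4 = emit_addi(4)
line_batch8 = emit_addi(8)
```

with a batch size of 4 this reproduces the hand-written line, and with 8
the second operand moves out to register 16 without touching anything else.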

the other thing is: the huge (confusing) simplicity of Vector-based
assembler compared to SIMD means that multiple iterations should in fact be

i.e. if you are writing hundreds of lines of assembler like in that VSX
strncpy example, something is badly wrong and we should stop immediately
and investigate why.

is this making sense?


crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
