[Libre-soc-dev] 3D MESA Driver

Sun Aug 9 12:27:58 BST 2020

On Sunday, August 9, 2020, vivek pandya <vivekvpandya at gmail.com> wrote:

>
>>
>> however this will be done *automatically* by the hardware, precisely so
>> that you, as a compiler writer, do not have to massively complicate the
>> compiler (see "SIMD considered harmful" article).
>>
> I am not able to see much perf benefit except following things with this
> idea
> 1) PC increment will be skipped by doing loops in HW.
> 2) Interesting things will be load/store if we have capabilities to
> read/write 128 bits or 256 bits data in one cycle.
>

yes.  this is done by DataMerger.  while there are up to 8 LDST Units each
issuing one "element", the DataMerger is designed to spot batches of
requests to the same addresses, i.e. requests with the same top 4-5 bits.

those are then "merged" into a single massive 128 bit request (actually, 2
of them, odd and even based on bit 4 of the address).
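
a rough software sketch of that merging idea follows.  it is an illustration only, not the DataMerger hardware: the names merge_requests and bank_of are made up for this example, and it assumes that a merged request covers one aligned 16-byte (128 bit) block and that bit 4 of the byte address picks the odd or even request.

```python
from collections import defaultdict

# assumption: one merged request covers an aligned 16-byte (128 bit) block
LINE_BYTES = 16


def merge_requests(requests):
    """Merge per-element requests into per-block byte-enable masks.

    requests: iterable of (byte_address, size_in_bytes), one per LDST unit.
    Returns {block_index: 16-bit mask of the bytes touched in that block}.
    """
    merged = defaultdict(int)
    for addr, size in requests:
        for byte_addr in range(addr, addr + size):
            block = byte_addr // LINE_BYTES
            merged[block] |= 1 << (byte_addr % LINE_BYTES)
    return dict(merged)


def bank_of(block):
    """Bit 4 of the byte address is bit 0 of the block index."""
    return "odd" if block & 1 else "even"
```

with 8 units each issuing a 4-byte element at consecutive addresses starting from 0x100, this model collapses the 8 requests into 2 blocks, one even and one odd.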

>
> According to me, the compiler's task is not much simplified. As I can see,
> register allocation needs to be aware of VL and for that we need to extend
> the concept of liveness for vectors.
>

simon moll and robin have been working on VL for RISC-V, and ARM and NEC
have also been contributing to LLVM vector intrinsics.

it is ok but annoying to set a fixed VL (8 or 4) which would effectively
turn things into a SIMD-like ISA.

about that:
https://www.sigarch.org/simd-instructions-considered-harmful/

at least with VL it is possible to dynamically set the cleanup size.
goodbye SIMD mess.
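
to make "dynamically set the cleanup size" concrete, here is a minimal strip-mining sketch.  setvl and vector_add_scalar are hypothetical helper names for this example (setvl is modelled on the usual "VL = min(remaining, MAXVL)" behaviour of vector ISAs), and the inner for-loop stands in for the element loop that the hardware would run.

```python
def setvl(remaining, maxvl):
    """Hypothetical setvl: request 'remaining' elements, get at most maxvl."""
    return min(remaining, maxvl)


def vector_add_scalar(data, scalar, maxvl=8):
    """Add scalar to every element, processing up to maxvl elements per pass.

    Returns (result, list of the VL used on each pass).
    """
    out = list(data)
    vls = []
    i = 0
    while i < len(out):
        vl = setvl(len(out) - i, maxvl)
        for e in range(vl):  # stands in for the hardware element loop
            out[i + e] += scalar
        vls.append(vl)
        i += vl
    return out, vls
```

for 19 elements and MAXVL=8 the passes use VL=8, 8 and then 3: the tail is handled by the same loop body rather than by separate scalar cleanup code, which is the point the article makes against fixed-width SIMD.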

> (For LLVM we might not need to extend but TOT PPC backend can't be used,
> so we need to modify it to adapt)
> I see that graphics API (vec2/3/4) can be directly mapped to VL.
> And it will be required to modify vectorization pass to convert loops to
> VL of appropriate size.
>

basically, see that article above for an example.  we are doing exactly the
same thing except replace vfmadd with "fmadd where registers are marked as
vectors".

if it is easier feel free to consider registers 32 to 128 as vectors whilst
leaving 0 to 31 as scalar.

they _can_ be mixed...

>
>>
>> so in this way, we do not have to invent vector instructions add8i,
>> add16i, add32i etc. we can simply use "addi" for all of them by setting
>> elwidth
>>
> This is very interesting. I would like some examples here.
> So if I say addi R4, R4, 2 VL=4 elwidth=8
> will it use 8 bits in R4,R5,R6,R7 ?
>

let me think it through.  R4 is 64 bit.  elwidth=8 means (C union analogy
here) use int_regfile[4].b[loopindex]

so no, definitely not lower 8 bits of R4 R5 R6 R7.  this *is* possible to
do but i will leave the explanation of how for another time.
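
the union analogy can be sketched as below.  this is only a model: the function name element is made up for the example, it assumes element 0 sits in the least significant bits of the register, and it deliberately stays inside a single 64-bit register.

```python
def element(reg_value, index, elwidth):
    """Read element 'index' of a 64-bit register viewed at 'elwidth' bits.

    For elwidth=8 this plays the role of int_regfile[n].b[index] in the
    C union analogy.  Assumption: element 0 is the least significant one.
    """
    assert elwidth in (8, 16, 32, 64)
    assert 0 <= index < 64 // elwidth
    return (reg_value >> (index * elwidth)) & ((1 << elwidth) - 1)
```

so with elwidth=8 and VL=4 the loop touches element(r4, 0, 8) up to element(r4, 3, 8), all inside R4, and never looks at R5, R6 or R7.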

>  or 4 input and output will be packed in lower 32 bits of R4 (that will
> require pack/unpack) ?
>

this one.  very important to note: the top 32 MSBs are NOT altered.

i have very specifically added byte-level write-enable lines to the
register files in order to not need a "read modify write" cycle, here.

so this means that if we want to do vec3 RGB 8 bit operations, there is no
need to do pack/unpack.

* one set of operations does the vec3 RGB as 24 bit writes, using elwidth=8
* another set of operations treats the exact same register as elwidth=64

no need to do a MV/register copy.  no need for special vector pack/unpack
instructions.
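
a small software model of the above, as a sketch under assumptions: the register file is a plain bytearray with 8 little-endian bytes per register, read_reg, write_reg and vector_addi are names invented for the example, and the vector is kept inside one register.  the returned mask models the byte-level write-enable lines: only those bytes are written, so nothing else needs a read-modify-write.

```python
REG_BYTES = 8  # 64-bit registers


def read_reg(regfile, n):
    """Read register n as a 64-bit value (assumed little-endian layout)."""
    return int.from_bytes(regfile[n * REG_BYTES:(n + 1) * REG_BYTES], "little")


def write_reg(regfile, n, value):
    """Write a full 64-bit value to register n."""
    regfile[n * REG_BYTES:(n + 1) * REG_BYTES] = value.to_bytes(REG_BYTES, "little")


def vector_addi(regfile, rt, ra, imm, vl, elwidth):
    """Model of 'addi rt, ra, imm' with VL and elwidth applied.

    Returns the byte-enable mask (relative to the first byte of rt) of the
    bytes actually written.  Simplification: the vector must fit in one
    register.
    """
    nbytes = elwidth // 8
    assert vl * nbytes <= REG_BYTES
    byte_en = 0
    for i in range(vl):
        src = ra * REG_BYTES + i * nbytes
        dst = rt * REG_BYTES + i * nbytes
        val = int.from_bytes(regfile[src:src + nbytes], "little")
        res = (val + imm) & ((1 << elwidth) - 1)  # wrap within the element
        regfile[dst:dst + nbytes] = res.to_bytes(nbytes, "little")
        byte_en |= ((1 << nbytes) - 1) << (i * nbytes)
    return byte_en


regs = bytearray(32 * REG_BYTES)
write_reg(regs, 4, 0xAABBCCDD04030201)
vector_addi(regs, 4, 4, 2, vl=4, elwidth=8)   # addi R4, R4, 2 VL=4 elwidth=8
print(hex(read_reg(regs, 4)))                  # 0xaabbccdd06050403
```

the 4 byte-wide results land in the lower 32 bits of R4 and the top 32 bits keep their old value.  the same function with vl=3 models the vec3 RGB case, and calling it afterwards with vl=1, elwidth=64 operates on the very same register with no pack/unpack or copy in between.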

>>
>> # SUBVL
>>
>> sometimes, especially for vec2/3/4, you want to do loops on vectors of
>> vec2/3/4.  this is what SUBVL is for, and effectively it is a sub-sub-loop
>> on PC, intuitively as might be expected.
>>
>> however one key thing: predicate bits do *not* extend down individually
>> to SUBVL.  they apply to the *whole* vec2/3/4.
>>
>> this saves a lot of bits when setting up predicates.  it would be
>> necessary to do bit level mask manipulation in order to expand 0b0110 into
>> 0000 1111 1111 0000 for an array of vec4 for example and that is costly.
>>
> Here it will be better to have a concrete example.
>

VL=2 SUBVL=2 predicate = 0b01 elwidth=64 addi r4, r4, 3

this will do:

first inner loop on SUBVL

* addi r4, r4, 3 # vloop=0, subvloop=0, pred[vloop]=1
* addi r5, r5, 3 # vloop=0, subvloop=1, pred[vloop]=1

second inner loop on SUBVL

* SKIP addi r6, r6, 3 # vloop=1, subvloop=0, pred[vloop]=0
* SKIP addi r7, r7, 3 # vloop=1, subvloop=1, pred[vloop]=0

skipped because the predicate VL bit is *zero* for those 2 operations.

note that there were *not* 4 bits in the predicate despite 4 scalar
instructions, because the predicate bit granularity only applies to VL
loops *not* SUBVL.

i.e. it is NOT this:

VL=2 SUBVL=2 predicate = >>>0b0011<<<
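
the loop nesting above can be written out as a short sketch.  the function names are invented for this example, and it assumes elwidth=64 so that every element is one whole register.  expand_predicate shows the per-element mask that would be needed if predicate bits *did* extend down to SUBVL, which is the bit manipulation that is being avoided.

```python
def subvl_addi_schedule(rt, ra, imm, vl, subvl, predicate):
    """List the scalar addi operations for a VL/SUBVL loop at elwidth=64.

    Returns a list of (executed, "addi ...") tuples.  One predicate bit
    covers a whole sub-vector: it is indexed by vloop, never by subvloop.
    """
    ops = []
    for vloop in range(vl):
        executed = bool((predicate >> vloop) & 1)
        for subvloop in range(subvl):
            offset = vloop * subvl + subvloop
            ops.append((executed, f"addi r{rt + offset}, r{ra + offset}, {imm}"))
    return ops


def expand_predicate(predicate, vl, subvl):
    """Per-element mask that would be needed WITHOUT the SUBVL rule."""
    expanded = 0
    for vloop in range(vl):
        if (predicate >> vloop) & 1:
            expanded |= ((1 << subvl) - 1) << (vloop * subvl)
    return expanded


for executed, op in subvl_addi_schedule(4, 4, 3, vl=2, subvl=2, predicate=0b01):
    print(op if executed else "SKIP " + op)
```

for VL=2 SUBVL=2 predicate=0b01 this prints the two addi operations on r4 and r5 followed by the two skipped ones on r6 and r7.  expand_predicate(0b0110, 4, 4) gives 0b0000111111110000, the expanded form mentioned earlier for an array of vec4.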

l.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68