[libre-riscv-dev] 1R1W regfiles
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Thu Dec 20 08:00:29 GMT 2018
On Thu, Dec 20, 2018 at 7:29 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> Assuming the VPU load doesn't require 8/16-bit matrix support, the Vulkan
> driver doesn't require anything smaller than 32-bit matrices so I think we
> could just trap and emulate or split into several micro-ops for non-aligned
> 8 and 16-bit remapping and just have the compiler avoid non-aligned
> 8/16-bit operations.
interestingly, with the idea of passing all 8/16-bit operations
(including xBitManip) through the 8/16 bank of Function Units, and
semi-micro-coding the shuffling of src1/src2 bytes through xBitManip,
i believe it would actually require effort to *stop* 16-bit operations
from being successfully processed.
> We could share a single multi-stage byte crossbar for load gather/store
> scatter and for vector swizzle.
load gather/scatter fits into the current idea/concept - at the 64/32
and 16/8-bit level - just like any other operation. 64/32 uses the
4x4 register bank crossbars / multiplexers. xBitManip ALUs perform
the 16/8-bit targeting.
(each outstanding LD/ST operation needs its own Function Unit. 64/32
LD/STs go onto the 32-bit Function Unit Matrix. 16/8 LD/STs go onto
the 8-bit Function Unit Matrix. see section 11.4.11 and 11.4.12 of
mitch's book chapters that i forwarded to you)
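a minimal sketch of the routing rule above, in python. the names
(FU_MATRIX_32, FU_MATRIX_8, route_ldst) are made up for illustration
and are not taken from the design or from mitch's chapters; only the
width-to-matrix rule comes from the text.

```python
# hypothetical names: only the "64/32 -> 32-bit matrix, 16/8 -> 8-bit matrix"
# rule is from the discussion, everything else is illustrative.
FU_MATRIX_32 = "32-bit Function Unit Matrix"
FU_MATRIX_8 = "8-bit Function Unit Matrix"


def route_ldst(width_bits):
    """Pick the Function Unit Matrix for one outstanding LD/ST."""
    if width_bits in (64, 32):
        return FU_MATRIX_32
    if width_bits in (16, 8):
        return FU_MATRIX_8
    raise ValueError(f"unsupported LD/ST width: {width_bits}")
```

each outstanding LD/ST would still occupy its own Function Unit inside
whichever matrix is selected; the sketch does not model that allocation.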
even one multi-stage byte crossbar is an insane number of gates, and
byte crossbars are a massive part of what xBitManip is, anyway. by passing the
data _through_ xBitManip, the pre-existing features of Function Units
(src register "readiness") mean that we can put in e.g. just the one
xBitManip ALU and have it process src1 *and* src2 pretty much
automatically... as part of the *pre-existing* infrastructure.
a single multi-stage byte crossbar on the other hand is a dedicated
specialist resource (an absolutely enormous one, at that), which has
to be special-cased.
> This allows us to treat everything as if it is processed in 32-bit or
> larger chunks with either muxing in the old bytes internally in the alu for
> forwarding or we could specify that SV muxes in zeros/ones on unused
> sub-register elements.
it's part of the specification that zeroing is optional, precisely to
avoid the situation where only one bit is set in the predicate
(explicit or implicit), resulting in atrocious performance as an
entire bank of lanes is dominated by (waiting on) a one-byte
muxing in on "unused" sub-register elements... in my mind the word
"unused" carries with it the implication that the rest of a register is
not important. i feel that it is: running RV32 applications, a
whopping *half* of the entire RV64 register file is completely and
utterly wasted on sign-extension bits.
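to make the zeroing / non-zeroing distinction concrete, here is a small
python model of a predicated write into one 64-bit register. the function
name and its arguments are assumptions for illustration, not spec text,
and it models register contents only, not timing.

```python
def predicated_write(old_reg, results, predicate, elwidth=8, zeroing=False):
    """Return the new 64-bit register value after a predicated element write.

    Hypothetical helper: elements whose predicate bit is set take the new
    result; masked-out elements are either zeroed (zeroing=True) or keep
    their old contents (zeroing=False).
    """
    mask = (1 << elwidth) - 1
    new_reg = 0
    for i in range(64 // elwidth):
        shift = i * elwidth
        if (predicate >> i) & 1:
            elem = results[i] & mask          # active element: new value
        elif zeroing:
            elem = 0                          # masked-out, zeroing mode
        else:
            elem = (old_reg >> shift) & mask  # masked-out, old bytes kept
        new_reg |= elem << shift
    return new_reg


old = 0x1122334455667788
new_vals = [0xAA] * 8
merged = predicated_write(old, new_vals, predicate=0b1)
zeroed = predicated_write(old, new_vals, predicate=0b1, zeroing=True)
```

with only one predicate bit set, the non-zeroing result differs from the
old register in a single byte, whereas the zeroing result rewrites every
element of the register.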
so this is why i explain in the specification that setting different
element widths is as if the regfile is typecast, *and*, in the REMAP
CSR, i added "offsets" that allow src1 hi-word src2 lo-word or any
other arbitrary offsets to be carried out.
this treats the *entire* register file as "peer elements" rather than
"unused" elements. and yes, it's a pain in the neck, however i am
slowly getting there, with something that's not completely insane.
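a rough python sketch of the "typecast regfile plus offsets" idea, under
loud assumptions: the register file is modelled as one flat little-endian
byte array, and "offset" is counted in elements of the chosen width. the
class and method names are hypothetical, and the real REMAP CSR encoding
is not shown here.

```python
class RegFile:
    """Illustrative model: the whole regfile viewed as a flat byte array."""

    def __init__(self, nregs=32, xlen=64):
        self.bytes_per_reg = xlen // 8
        self.mem = bytearray(nregs * self.bytes_per_reg)

    def _addr(self, reg, elwidth, index, offset):
        # byte address of element `index`, starting `offset` elements
        # into register `reg` (assumed meaning of "offset")
        return reg * self.bytes_per_reg + (offset + index) * (elwidth // 8)

    def read(self, reg, elwidth, index, offset=0):
        a = self._addr(reg, elwidth, index, offset)
        return int.from_bytes(self.mem[a:a + elwidth // 8], "little")

    def write(self, reg, elwidth, index, value, offset=0):
        a = self._addr(reg, elwidth, index, offset)
        n = elwidth // 8
        self.mem[a:a + n] = (value & ((1 << elwidth) - 1)).to_bytes(n, "little")


rf = RegFile()
rf.write(5, 64, 0, 0x1122334455667788)
lo_word = rf.read(5, 32, 0)            # low 32-bit word of x5
hi_word = rf.read(5, 32, 0, offset=1)  # high 32-bit word of x5
```

in this model every byte of every register is reachable as an element of
some width, so the upper half of a 64-bit register is just another 32-bit
element rather than an "unused" part.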
the instruction decode phase when REMAP is activated is going to be
an absolute pig.