[Libre-soc-dev] microwatt / libresoc dcache

Fri May 7 00:18:38 BST 2021

On Thu, May 06, 2021 at 08:24:28PM +0100, Luke Kenneth Casson Leighton wrote:
> allo again paul,
> 
> for reference here is dcache.py:
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;hb=HEAD
> 
> pretty much near-identical to dcache.vhdl, one major difference: the bottom
> 3 LSBs of the address are *not* copied onto the WB bus (as previously
> discussed, a 32 bit address @ 64 bits wide data must put *29* MSBs onto the
> WB Bus, *not* the full 32)

Right, that's something we need to fix throughout microwatt.
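
To make the address split concrete, here is a rough sketch in Python.
It is not taken from dcache.py or from microwatt; the function names
split_wb_address and byte_select are made up for illustration, and it
assumes a 32-bit byte address on a 64-bit-wide Wishbone data bus, with
the low 3 bits expressed through the byte-select lines instead of the
address lines.

```python
def split_wb_address(byte_addr, data_width_bits=64, addr_bits=32):
    """Split a byte address into (wishbone_adr, byte_offset).

    Hypothetical helper: for a 64-bit bus the 3 LSBs are dropped,
    leaving 29 address bits out of a 32-bit byte address.
    """
    bytes_per_word = data_width_bits // 8           # 8 bytes on a 64-bit bus
    offset_bits = bytes_per_word.bit_length() - 1   # log2(8) = 3
    byte_addr &= (1 << addr_bits) - 1               # clamp to the address width
    wb_adr = byte_addr >> offset_bits               # the MSBs that go on the bus
    offset = byte_addr & (bytes_per_word - 1)       # position within the doubleword
    return wb_adr, offset


def byte_select(offset, size_bytes, data_width_bits=64):
    """Byte-enable mask for an access that stays inside one bus word."""
    bytes_per_word = data_width_bits // 8
    mask = ((1 << size_bytes) - 1) << offset
    return mask & ((1 << bytes_per_word) - 1)
```

For example, split_wb_address(0x1234567D) gives the bus address
0x2468ACF with a byte offset of 5, and a 2-byte access at that offset
would use byte_select(5, 2).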

> appreciated the input yesterday about dcache.vhdl, the 3 cycles:
> 
> * AGEN (address generation)
> * ST data drop
> * actual fetch.

The 2nd cycle does TLB and cache tag matching.  I'm not sure exactly
what "ST data drop" is; if I recall correctly, writes into the cache
data RAM are done on the clock edge at the end of the third cycle, and
there are forwarding paths so that a store followed in the next cycle
by a load to the same doubleword returns the data from the store.
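
A very simplified Python model of that forwarding idea, as a sketch
only: the class ForwardingDataRam and its methods are invented here,
it works on whole doublewords (no byte enables), and it is not meant
to mirror the actual dcache.vhdl signals.  The store is held as
"pending" until the clock edge, and a load that hits the same row
before that edge gets the pending data instead of the stale RAM
contents.

```python
class ForwardingDataRam:
    """Toy data RAM with a one-entry store-to-load forwarding path."""

    def __init__(self, n_rows):
        self.rows = [0] * n_rows
        self.pending = None  # (row, data) waiting to be written at the clock edge

    def store(self, row, data):
        # The write is only scheduled here; the RAM changes in clock().
        self.pending = (row, data)

    def load(self, row):
        # Forwarding path: compare the load row against the in-flight store.
        if self.pending is not None and self.pending[0] == row:
            return self.pending[1]
        return self.rows[row]

    def clock(self):
        # Clock edge: commit the pending store into the RAM array.
        if self.pending is not None:
            row, data = self.pending
            self.rows[row] = data
            self.pending = None
```

Without the comparison in load(), a load issued right behind the
store would see the old value, because the RAM array itself has not
been updated yet.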

> so this is where it gets interesting: we also have an AGEN Phase in
> Libre-SOC, but because the intent is to be an Out-of-Order design plus also
> to allow single regfile read port thru triple read ports as a config
> option, we have *no idea* if the two RA / RB regs for AGEN will come before
> *or after* the RS from a STORE operation!
> 
> therefore i had to stall the introduction of the AGEN assertion into
> dcache.py until the ST reg has been read (many cycles later, at present).

So stores can't be issued until all the operands are available; makes
sense.

> c'est la vie :)
> 
> my question to you is about the cache sram reading (not writing)
> 
> https://github.com/antonblanchard/microwatt/blob/master/cache_ram.vhdl#L70
> 
> here you can see ADR_BUF=true, and it is set in dcache.py
> 
> a normal SRAM you would expect a 1 clock cycle delay, all good.  except

The VHDL construct ram(to_integer(unsigned(rd_addr))) doesn't by
itself imply a clock edge; it's like a combinatorial RAM, not a
synchronous RAM.  (Imagine a bunch of flip-flops connected to the data
inputs of a multiplexer whose address input is rd_addr.)  Putting that
inside a process(clk) begin if rising_edge(clk) then ... construct
makes that a 1-cycle synchronous RAM.
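
The difference can be shown with a small behavioural model in Python.
This is only a sketch: the class ToyRam, its parameters
registered_read and extra_buffer, and the helper read_latency are all
hypothetical names, with extra_buffer standing in for an optional
output register after the RAM.  It counts how many clock edges pass
between presenting an address and seeing the data on rd_data.

```python
class ToyRam:
    """Behavioural RAM model with selectable read timing."""

    def __init__(self, depth, registered_read=True, extra_buffer=False):
        self.mem = [0] * depth
        self.registered_read = registered_read  # array read inside a clocked process
        self.extra_buffer = extra_buffer        # one more register on the output
        self.rd_addr = 0
        self._rd_data0 = 0   # register loaded from the array
        self._rd_buf = 0     # optional second register

    @property
    def rd_data(self):
        if not self.registered_read:
            # Combinatorial read: just a multiplexer over the storage.
            return self.mem[self.rd_addr]
        return self._rd_buf if self.extra_buffer else self._rd_data0

    def clock(self):
        # Both registers sample their inputs as they were before the edge.
        new_buf = self._rd_data0
        new_data0 = self.mem[self.rd_addr]
        self._rd_buf = new_buf
        self._rd_data0 = new_data0


def read_latency(ram, addr, max_cycles=8):
    """Clock edges needed until rd_data equals mem[addr] (assumed nonzero)."""
    ram.rd_addr = addr
    for edges in range(max_cycles + 1):
        if ram.rd_data == ram.mem[addr]:
            return edges
        ram.clock()
    raise RuntimeError("data never appeared")
```

In this model the bare array indexing costs no clock edges, wrapping
it in a clocked process costs one, and each further output register
costs one more.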

> here, an *extra* cycle of delay is added.  after assertion of the read it
> is *two* cycles before the data appears on the read data output.

I think you're attributing a cycle of delay to the ram() construct,
which it doesn't have.  The dcache definitely does writeback two
cycles after address generation; I have traces showing that.

We do manage to get from the register at the output of the dcache RAMs
all the way to the data input of the register file RAM in one cycle,
which is a bit of a stretch, and at higher frequencies would need more
pipeline stages.

> i have no idea why, and i'm not skilled enough at VHDL to work out how to
> remove it.
> 
> any chance of making that a config-selectable option in dcache.vhdl?  i can
> then see how that was done and make corresponding edits.

The way it is now, the data and the way number arrive at the same
time (at the start of the third cycle) and go into the way select
multiplexer.  Having the data arrive a cycle earlier wouldn't help all
that much since we would have to latch it until the way number
arrives.
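
As a sketch of why both inputs are needed together, here is the way
select written as plain Python.  The names tag_match, way_select and
cache_read are made up for this example and the tag compare is
reduced to a simple equality check; it is not the dcache.vhdl logic.
The point is that the multiplexer cannot produce a result until the
hit way is known, so data that arrived a cycle earlier would just sit
in a register waiting for it.

```python
def tag_match(tags, valids, req_tag):
    """Return the index of the way that hits, or None on a miss."""
    for way, (tag, valid) in enumerate(zip(tags, valids)):
        if valid and tag == req_tag:
            return way
    return None


def way_select(way_data, hit_way):
    """Way select multiplexer: pick one way's data using the hit way number."""
    if hit_way is None:
        return None
    return way_data[hit_way]


def cache_read(tags, valids, way_data, req_tag):
    """Combine both inputs; returns (hit, data)."""
    hit_way = tag_match(tags, valids, req_tag)   # way number from tag matching
    data = way_select(way_data, hit_way)         # needs data and way number together
    return hit_way is not None, data
```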

Paul.