[Libre-soc-dev] [OpenPOWER-HDL-Cores] microwatt / libresoc dcache

Fri May 7 04:47:28 BST 2021

On Fri, May 07, 2021 at 03:54:49AM +0100, Luke Kenneth Casson Leighton wrote:
> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> 
> On Fri, May 7, 2021 at 3:31 AM Paul Mackerras <paulus at ozlabs.org> wrote:
> 
> > Right, that's something we need to fix throughout microwatt.
> 
> has to be done in one hit, for hints search "r1.wb.adr" in here:
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;hb=HEAD
> 
> should be readable and the similarity clear.
> 
> > > * AGEN (address generation)
> > > * ST data drop
> > > * actual fetch.
> >
> > The 2nd cycle does TLB and cache tag matching.  I'm not sure exactly
> > what "ST data drop" is;
> 
> i mean "store data is dropped in".  there are code comments saying
> "place the store date in one cycle after putting the address in",
> something like that.

Yes, I had temporarily forgotten, but yes dcache does expect the data
one cycle after the address, which gives loadstore1 a cycle to do
store data formatting.

> > So stores can't be issued until all the operands are available; makes
> > sense.
> 
> means we have to do some quite extensive modifications to dcache.py's
> FSM, adding a latch for "has AGEN been done", and one for "has store
> register been received", and only if the two are true can the dcache
> store be issued.
> 
> complicated and fun :)
>
> > > a normal SRAM you would expect a 1 clock cycle delay, all good.  except
> >
> > The VHDL construct ram(to_integer(unsigned(rd_addr))) doesn't of
> > itself imply a clock edge; it's like a combinatorial RAM not a
> > synchronous RAM.  (Imagine a bunch of flip-flops connected to the data
> > inputs of a multiplexer whose address input is rd_addr.)  Putting that
> > inside a process(clk) begin if rising_edge(clk) then ... construct
> > makes that a 1-cycle synchronous RAM.
> 
> yes... starting at line 44.
> https://github.com/antonblanchard/microwatt/blob/master/cache_ram.vhdl#L44
> 
>     process(clk)
>     ...
>     begin
>         if rising_edge(clk) then
>             ...
>             if rd_en = '1' then
>                 rd_data0 <= ram(to_integer(unsigned(rd_addr)));
>             end if;
>         end if;
>     end process;
> 
> however look at line 70, there's *another* rising_edge block:
> 
>     buf: if ADD_BUF generate
>     begin
>         process(clk)
>         begin
>             if rising_edge(clk) then
>                 rd_data <= rd_data0;
>             end if;
>         end process;
> 
> and rd_data is declared as a *signal* at line 45, not a variable:
> 
>     signal rd_data0 : std_logic_vector(WIDTH - 1 downto 0);
> 
> as best i can tell, of reading VHDL, that means that when ADD_BUF=true
> there is not a one-clock delay on rd_data output, there is a *two*
> clock delay.

It's an SRAM array with a register on the address input and a register
on the data output.  The total delay is a register setup time plus a
clock cycle plus a register clock-to-output delay, which is hopefully
considerably less than two clock cycles.

When you talk about a two clock delay, I don't know whether you are
rounding up or rounding down.  In other words, do you mean 2 cycles
plus setup plus clock-to-output, or are you considering 1 cycle plus
setup plus clock-to-output to be two cycles?

> or, i just simply don't know what "<=" in VHDL does when it involves
> signals going through other signals.  however given that forward1_data
> and forward2_data exhibit the same pattern, and have documented
> comments explaining their purpose, i *believe* i am making a correct
> inference about VHDL syntax.

My working model is that references to signals in a process block
always refer to the value as of the beginning of the block, but
combinatorial processes are executed repeatedly until signal values
settle, whereas a sequential process (using if rising_edge(clk)) is
only executed once on the rising edge of the clock.

> > > here, an *extra* cycle of delay is added.  after assertion of the read it
> > > is *two* cycles before the data appears on the read data output.
> >
> > I think you're attributing a cycle of delay to the ram() construct,
> > which it doesn't have.
> 
> ah i see where the confusion might be: no, i'm not talking about the
> ram() construct, i'm referring to cache_ram.vhdl.
> 
> > The dcache definitely does writeback two
> > cycles after address generation; I have traces showing that.
> 
> i'm not referring to writeback: i'm referring to the actual output -
> d_out (the output from dcache.vhdl).

If you present an address on cycle 0, d_out will be valid on cycle 2,
assuming a cache hit.

> > We do manage to get from the register at the output of the dcache RAMs
> > all the way to the data input of the register file RAM in one cycle,
> > which is a bit of a stretch, and at higher frequencies would need more
> > pipeline stages.
> 
> interesting.  a valuable insight i will bear in mind, given that we
> intend (down the line) to target ASICs at 2 ghz.
> 
> > The way it is now, the data and the way number arrive at the same
> > time (at the start of the third cycle) and go into the way select
> > multiplexer.  Having the data arrive a cycle earlier wouldn't help all
> > that much since we would have to latch it until the way number
> > arrives.
> 
> these are valuable insights to understanding the code, i think we're
> talking at cross purposes though. by noticing that you thought i was
> talking about the ram() construct when i was referring instead to
> cache_ram.vhdl itself, that has been cleared up, and my follow-up
> message with a strategy to reduce the latency of output signals from
> dcache.vhdl is clear?

No, I think I would need to see a patch to understand what you're
driving at; but from what I gather, you're worried about the path
through the cache data RAM - but that's not the critical path.  The
path through the TLB and cache tag RAMs, the tag matching and the way
determination is the critical path.

Paul.