[Libre-soc-dev] [OpenPOWER-HDL-Cores] microwatt / libresoc dcache

Paul Mackerras paulus at ozlabs.org
Fri May 7 04:47:28 BST 2021

On Fri, May 07, 2021 at 03:54:49AM +0100, Luke Kenneth Casson Leighton wrote:
> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
> On Fri, May 7, 2021 at 3:31 AM Paul Mackerras <paulus at ozlabs.org> wrote:
> > Right, that's something we need to fix throughout microwatt.
> has to be done in one hit, for hints search "r1.wb.adr" in here:
> https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;hb=HEAD
> should be readable and the similarity clear.
> > > * AGEN (address generation)
> > > * ST data drop
> > > * actual fetch.
> >
> > The 2nd cycle does TLB and cache tag matching.  I'm not sure exactly
> > what "ST data drop" is;
> i mean "store data is dropped in".  there are code comments saying
> "place the store date in one cycle after putting the address in",
> something like that.

Yes, I had temporarily forgotten, but yes dcache does expect the data
one cycle after the address, which gives loadstore1 a cycle to do
store data formatting.

> > So stores can't be issued until all the operands are available; makes
> > sense.
> means we have to do some quite extensive modifications to dcache.py's
> FSM, adding a latch for "has AGEN been done", and one for "has store
> register been received", and only if the two are true can the dcache
> store be issued.
> complicated and fun :)
> > > a normal SRAM you would expect a 1 clock cycle delay, all good.  except
> >
> > The VHDL construct ram(to_integer(unsigned(rd_addr))) doesn't of
> > itself imply a clock edge; it's like a combinatorial RAM not a
> > synchronous RAM.  (Imagine a bunch of flip-flops connected to the data
> > inputs of a multiplexer whose address input is rd_addr.)  Putting that
> > inside a process(clk) begin if rising_edge(clk) then ... construct
> > makes that a 1-cycle synchronous RAM.
> yes... starting at line 44.
> https://github.com/antonblanchard/microwatt/blob/master/cache_ram.vhdl#L44
>     process(clk)
>     ...
>     begin
>         if rising_edge(clk) then
>             ...
>             if rd_en = '1' then
>                 rd_data0 <= ram(to_integer(unsigned(rd_addr)));
>             end if;
>         end if;
>     end process;
> however look at line 70, there's *another* rising_edge block:
>     buf: if ADD_BUF generate
>     begin
>         process(clk)
>         begin
>             if rising_edge(clk) then
>                 rd_data <= rd_data0;
>             end if;
>         end process;
> and rd_data is declared as a *signal* at line 45, not a variable:
>     signal rd_data0 : std_logic_vector(WIDTH - 1 downto 0);
> as best i can tell, of reading VHDL, that means that when ADD_BUF=true
> there is not a one-clock delay on rd_data output, there is a *two*
> clock delay.

It's an SRAM array with a register on the address input and a register
on the data output.  The total delay is a register setup time plus a
clock cycle plus a register clock-to-output delay, which is hopefully
considerably less than two clock cycles.

When you talk about a two clock delay, I don't know whether you are
rounding up or rounding down.  In other words, do you mean 2 cycles
plus setup plus clock-to-output, or are you considering 1 cycle plus
setup plus clock-to-output to be two cycles?

> or, i just simply don't know what "<=" in VHDL does when it involves
> signals going through other signals.  however given that forward1_data
> and forward2_data exhibit the same pattern, and have documented
> comments explaining their purpose, i *believe* i am making a correct
> inference about VHDL syntax.

My working model is that references to signals in a process block
always refer to the value as of the beginning of the block, but
combinatorial processes are executed repeatedly until signal values
settle, whereas a sequential process (using if rising_edge(clk)) is
only executed once on the rising edge of the clock.

> > > here, an *extra* cycle of delay is added.  after assertion of the read it
> > > is *two* cycles before the data appears on the read data output.
> >
> > I think you're attributing a cycle of delay to the ram() construct,
> > which it doesn't have.
> ah i see where the confusion might be: no, i'm not talking about the
> ram() construct, i'm referring to cache_ram.vhdl.
> > The dcache definitely does writeback two
> > cycles after address generation; I have traces showing that.
> i'm not referring to writeback: i'm referring to the actual output -
> d_out (the output from dcache.vhdl).

If you present an address on cycle 0, d_out will be valid on cycle 2,
assuming a cache hit.

> > We do manage to get from the register at the output of the dcache RAMs
> > all the way to the data input of the register file RAM in one cycle,
> > which is a bit of a stretch, and at higher frequencies would need more
> > pipeline stages.
> interesting.  a valuable insight i will bear in mind, given that we
> intend (down the line) to target ASICs at 2 ghz.
> > The way it is now, the data and the way number arrive at the same
> > time (at the start of the third cycle) and go into the way select
> > multiplexer.  Having the data arrive a cycle earlier wouldn't help all
> > that much since we would have to latch it until the way number
> > arrives.
> these are valuable insights to understanding the code, i think we're
> talking at cross purposes though. by noticing that you thought i was
> talking about the ram() construct when i was referring instead to
> cache_ram.vhdl itself, that has been cleared up, and my follow-up
> message with a strategy to reduce the latency of output signals from
> dcache.vhdl is clear?

No, I think I would need to see a patch to understand what you're
driving at; but from what I gather, you're worried about the path
through the cache data RAM - but that's not the critical path.  The
path through the TLB and cache tag RAMs, the tag matching and the way
determination is the critical path.


More information about the Libre-soc-dev mailing list