[Libre-soc-dev] microwatt / libresoc dcache

Fri May 7 06:36:25 BST 2021

On Fri, May 07, 2021 at 05:17:44AM +0100, Luke Kenneth Casson Leighton wrote:
> On Friday, May 7, 2021, Paul Mackerras <paulus at ozlabs.org> wrote:
> 
> >
> > When you talk about a two clock delay, I don't know whether you are
> > rounding up or rounding down.  In other words, do you mean 2 cycles
> > plus setup plus clock-to-output, or are you considering 1 cycle plus
> > setup plus clock-to-output to be two cycles?
> 
> 
> clk        rise  fall  rise  fall  rise  fall
> rd_in:     1     0     0     0     0     0

Do you mean rd_en?  There's no signal rd_in in that code.

> rd0_data:  00    00    NN    00    00    00
> rd_data:   00    00    00    00    NN    00
> 
> the 1st process i understand to be a 1 clock delay from rd_in to rd_data0
> being set.
> 
> the 2nd process, by also being a rising edge and by taking its output from
> rd_data0 and placing it into rd_data, introduces a 2nd clock delay.
> 
> thus the total time taken is 2 clock cycles from when rd_in went high to
> when rd_data is valid.

If you mean rd_en, that is always 1 anyway.  It's the address to data
latency that matters.
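
A minimal cycle-level sketch of that address-to-data latency follows. It is
a Python model, not the VHDL itself: the function name, the list of
addresses and the dictionary standing in for the RAM array are all
hypothetical. It assumes the read register captures the addressed row on
every rising edge, and that ADD_BUF=true adds one more register after it.

```python
def simulate_read(ram, addrs, add_buf):
    """Return the value visible on rd_data during each cycle.

    addrs[i] is the address presented during cycle i. This is a
    cycle-level model only; it says nothing about setup or
    clock-to-output times within a cycle.
    """
    rd_data0 = None  # register written by the first clocked process
    rd_buf = None    # extra register, only used when add_buf is True
    seen = []
    for addr in addrs:
        # What downstream logic sees during this cycle.
        seen.append(rd_buf if add_buf else rd_data0)
        # Rising edge at the end of the cycle: both registers update
        # from their pre-edge inputs (rd_buf takes the OLD rd_data0).
        rd_buf = rd_data0
        rd_data0 = ram[addr]
    return seen


ram = {0: "A", 1: "B", 2: "C"}
print(simulate_read(ram, [0, 1, 2, 2], add_buf=False))  # [None, 'A', 'B', 'C']
print(simulate_read(ram, [0, 1, 2, 2], add_buf=True))   # [None, None, 'A', 'B']
```

Counting cycle 0 as the cycle in which the address is presented, the data
for that address is visible in cycle 1 without the buffer and in cycle 2
with it.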

The synthesis tools don't necessarily consider themselves bound to put
the registers where a literal reading of the code would imply.  I
believe that with ADD_BUF=true, the tools effectively put a register
on the address input of the SRAM array and one on the data output of
the array.  That means that the access time of the SRAM array is
between the two clock edges.

If you have ADD_BUF=false, it doesn't make the access time of the SRAM
array disappear; it still exists, and shows up as increased setup time
required on the address and/or increased clock-to-output time on the
data output.  So in terms of the overall timing, the difference
between the latency with ADD_BUF=true and the latency with
ADD_BUF=false is less than a whole clock cycle, and at 100MHz it's
noticeably less.  (I am not certain where the register ends up
relative to the SRAM array in the ADD_BUF=false case.)
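
As a rough worked example of that arithmetic: the sketch below assumes the
whole SRAM access time moves out of the clock-to-clock path and into
setup and/or clock-to-output when ADD_BUF=false. The 3000 ps access time
is a made-up illustrative figure, not a measured one, and the function
names are hypothetical.

```python
def clock_period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1_000_000 // freq_mhz


def latency_gap_ps(freq_mhz, sram_access_ps):
    """Extra address-to-data latency of ADD_BUF=true over ADD_BUF=false.

    ADD_BUF=true costs one more clock period, but ADD_BUF=false still
    pays the SRAM access time as extra setup / clock-to-output, so the
    real gap is one period minus the access time (assumed model).
    """
    return clock_period_ps(freq_mhz) - sram_access_ps


# Hypothetical access time of 3000 ps at 100 MHz.
print(clock_period_ps(100))        # 10000
print(latency_gap_ps(100, 3000))   # 7000
```

Under these assumed numbers the buffered version is 7 ns slower, not a
full 10 ns cycle slower.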

The other point, which you don't seem to have taken in yet, is that
this is NOT the critical path.  There is no point getting the data out
substantially before the hit_way is known, and for the sake of timing,
that has a register (r1.hit_way) in the path.  So r1.hit_way is not
valid until cycle 2 (counting cycle 0 as the one where the address is
presented to the dcache).  Thus getting valid data from the cache data
RAM in cycle 1 won't make anything any faster.

> essentially, i am questioning why ADD_BUF was added.

To match the latency of the path to the way multiplexer data inputs
with the latency to the way multiplexer address inputs (the way
multiplexer is the statement "data_out := cache_out(r1.hit_way);").
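
A small sketch of why the extra register costs nothing here, again as a
hypothetical Python model rather than the dcache code. It assumes, as
described above, that r1.hit_way becomes valid in cycle 2 and that the
cache data RAM output becomes valid in cycle 1 without the buffer and
cycle 2 with it.

```python
def way_mux(cache_out, hit_way):
    """Model of the statement data_out := cache_out(r1.hit_way)."""
    return cache_out[hit_way]


def data_out_valid_cycle(add_buf, hit_way_cycle=2):
    """First cycle in which the way multiplexer output is meaningful.

    Cycle 0 is the cycle in which the address is presented.
    """
    # Assumed cycle in which the cache data RAM output is valid.
    cache_out_cycle = 2 if add_buf else 1
    # The mux output needs both its data inputs and its select input.
    return max(cache_out_cycle, hit_way_cycle)


print(data_out_valid_cycle(add_buf=False))  # 2
print(data_out_valid_cycle(add_buf=True))   # 2
print(way_mux(["way0 data", "way1 data"], 1))  # way1 data
```

In both cases the multiplexer output is first usable in cycle 2, because
the select input is the later of the two arrivals.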

I believe also that using ADD_BUF=true may reduce the setup time
required on the address inputs of the cache data RAM (I would need to
do more research before positively asserting that, though).  It
certainly does seem to give more predictable timing from Vivado.

Paul.