[Libre-soc-dev] microwatt / libresoc dcache

Fri May 7 06:46:48 BST 2021

On Fri, May 07, 2021 at 04:57:09AM +0100, Luke Kenneth Casson Leighton wrote:
> On Friday, May 7, 2021, Paul Mackerras <paulus at ozlabs.org> wrote:
> 
> > On Fri, May 07, 2021 at 12:27:38AM +0100, Luke Kenneth Casson Leighton wrote:
> >
> > > the question i have is: is control_writeback making its decisions from the
> > > *current* r1 or is it making its decisions from the *future* r1?
> >
> > That block is combinatorial, since it's process(all) and has no if
> > rising_edge(clk) then ... statement.  So it's the current r1.
> 
> 
> ok whew.  so i am reading VHDL correctly.  same thing repeated below in
> different ways, reinforcing the understanding.
> 
> 
> > If you mean you're seeing valid data two cycles after presenting the
> > address, that's how it's meant to work.
> 
> 
> i believe it possible to remove one of those on real mode LDs.

You still have to match the cache tag(s), and if the cache is set
associative, you have to decide which way matches.
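
As a rough illustration of that step, here is a small Python model (not
the VHDL; the function name and the use of None for an invalid way are
made up for this sketch) of matching a request tag against one set of a
set-associative cache and deciding which way hit:

```python
from typing import List, Optional

def find_hit_way(set_tags: List[Optional[int]], req_tag: int) -> Optional[int]:
    """Return the way whose stored tag equals req_tag, or None on a miss.

    Hypothetical helper: set_tags holds one tag per way, None = invalid way.
    """
    # In hardware there is one comparator per way and they all run in
    # parallel; the list comprehension stands in for that.
    matches = [tag is not None and tag == req_tag for tag in set_tags]
    # Way selection: pick the (at most one) way that matched.
    for way, hit in enumerate(matches):
        if hit:
            return way
    return None

# One cache set with 4 ways; way 1 is invalid.
example_set = [0x12, None, 0x7F, 0x03]
hit_way = find_hit_way(example_set, 0x7F)   # matches way 2
miss_way = find_hit_way(example_set, 0x55)  # no way matches
```

The point is that this work is needed whether or not the address went
through translation first.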

> now, it may be useful to introduce a pipeline stage elsewhere, it just
> seems anomalous design.
> 
> i think what i am saying is that cache_ram.vhdl having the ADD_BUF delay
> inside *cache_ram.vhdl itself* is completely unclear.
> 
> i would expect that rd_data0 to be in *dcache.vhdl*, placed into a data
> structure that is explicitly marked and documented, "this is part of a read
> pipeline"
> 
> whereas right now there's one pipeline path in dcache.vhdl for control
> signals... oh and a totally separate pipeline which happens to be the exact
> same length except it's in cache_ram.vhdl involving rd_data and rd_data0.
> 
> this does not seem sensible from a code maintenance and clarity perspective.

Doing it like this means that the code matches the patterns that the
tools are looking for, meaning that the tools can use the registers
that are built into the block RAM primitives, which is better for
timing and layout.  If you move the output register out into dcache.vhdl
as you suggest, the tools tend not to be able to work out that they can
use those built-in registers.
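
For anyone following along, the read timing being discussed can be
modelled roughly like this in Python.  This is only a behavioural
sketch, assuming a synchronous RAM read into rd_data0 followed by an
optional output register enabled by ADD_BUF; the class and method names
are invented for the example and it is not a translation of
cache_ram.vhdl.

```python
class CacheRamModel:
    """Behavioural sketch of a RAM with a registered read and an
    optional extra output register (hypothetical class name)."""

    def __init__(self, depth: int, add_buf: bool = True):
        self.mem = [0] * depth
        self.add_buf = add_buf
        self.rd_data0 = 0      # register on the RAM read port
        self.rd_data_buf = 0   # optional output register (ADD_BUF)

    def clock(self, rd_addr: int) -> None:
        """Model one rising clock edge with rd_addr presented."""
        # Both registers sample their pre-edge inputs, so update the
        # downstream register first.
        if self.add_buf:
            self.rd_data_buf = self.rd_data0
        self.rd_data0 = self.mem[rd_addr]

    @property
    def rd_data(self) -> int:
        return self.rd_data_buf if self.add_buf else self.rd_data0


ram = CacheRamModel(depth=8, add_buf=True)
ram.mem[3] = 0xABCD
ram.clock(3)                  # address presented at this edge
after_one = ram.rd_data       # data not visible yet
ram.clock(0)
after_two = ram.rd_data       # data visible two edges after the address

ram_nobuf = CacheRamModel(depth=8, add_buf=False)
ram_nobuf.mem[3] = 0xABCD
ram_nobuf.clock(3)
nobuf_after_one = ram_nobuf.rd_data  # visible after a single edge
```

With the buffer enabled the data appears two edges after the address,
which is the two-cycle latency mentioned earlier in the thread.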

> >  The constraint is really the
> > TLB and cache tag matching and consequent hit/miss detection and cache
> > way determination, which takes up essentially the whole of cycle 1.
> 
> 
> ah additional context (sorry) i am tackling the phys path at the moment.

Like I said, that doesn't get you out of tag matching.

In fact, when I added the TLB, I did it in such a way that the TLB tag
lookup and matching occurs in parallel to the cache tag lookup and
matching, so the TLB adds very little overhead.  If you configure an
N-way TLB and an M-way cache, we actually use N * M comparators on the
cache tags in order not to have the TLB tag match, hit detection and
way selection in series with the cache tag comparisons.  We do the N*M
comparisons in parallel and then use the TLB hit way to determine
which set of cache tag comparison results is valid.
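
A rough Python model of that arrangement, in case it helps.  This is
only a sketch of the idea and not the dcache.vhdl logic: the function
name, the (virtual tag, physical tag) tuples and the example values are
all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

def parallel_lookup(
    tlb_ways: List[Tuple[int, int]],   # N entries: (virtual tag, physical tag)
    cache_tags: List[int],             # M entries: physical tag per cache way
    req_vtag: int,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (tlb_hit_way, cache_hit_way); None where there is no hit."""
    # N*M comparators: every TLB way's physical tag against every cache
    # way's tag.  None of these depend on which TLB way actually hits.
    cmp = [[ptag == ctag for ctag in cache_tags] for (_, ptag) in tlb_ways]

    # Happening at the same time in hardware: the TLB tag match.
    tlb_hits = [vtag == req_vtag for (vtag, _) in tlb_ways]
    tlb_way = tlb_hits.index(True) if any(tlb_hits) else None
    if tlb_way is None:
        return None, None

    # The TLB hit way only selects which row of results is valid.
    row = cmp[tlb_way]
    cache_way = row.index(True) if any(row) else None
    return tlb_way, cache_way


tlb = [(0xA, 0x100), (0xB, 0x200)]       # N = 2
tags = [0x300, 0x200, 0x100, 0x400]      # M = 4
result_b = parallel_lookup(tlb, tags, 0xB)   # TLB way 1, cache way 1
result_a = parallel_lookup(tlb, tags, 0xA)   # TLB way 0, cache way 2
result_c = parallel_lookup(tlb, tags, 0xC)   # TLB miss
```

The cost is more comparators, but the TLB hit only drives a final
select rather than sitting in front of the cache tag comparison.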

Paul.