[Libre-soc-dev] Microwatt L1 cache snoop idea for future SMP

Mon Jan 31 01:23:38 GMT 2022

On January 31, 2022 12:21:23 AM UTC, Paul Mackerras <paulus at ozlabs.org> wrote:

>Hmmm, you also need to snoop DMA stores to memory and also DMA loads
>if you have write-back caches that DMA doesn't go through.

ah of course, so some extra snoopings then.

>You don't necessarily need to snoop all stores by other CPUs either,
>unless you're trying to implement stronger memory consistency
>semantics than the ISA provides. 

Power, i assume, is not Total Store Order like Intel x86.

> For example, if one CPU does
>multiple stores to the same location without any barriers, other CPUs
>won't necessarily see any value stored except the last.

Mitch Alsup explained how he did the AMD Opteron multi issue LD/ST hazard detection, it is pretty cool, must tell you about it some time.

>What we've done so far in microwatt is to use a write-through L1
>cache, and have the cache snoop all writes to memory.  That means that
>the L1 cache needs only two read ports on the cache tag RAM,
>regardless of how many CPUs there are, and it gives us snooping of DMA
>writes as well.

except when going SMP: if there are N processors and each has only 1 snoop port available for writes, that is N-to-1 contention on every processor. writes would slow down massively.

to "fix" that the immediate thought is: well, expand the snoop ports to N+1 per processor (the +1 being for DMA; i assume there is just the one DMA Engine rather than M DMA engines).

but now the address/tag lookup needs (N+1) SRAM read ports, and that quickly gets out of hand.

so the idea here is this:

* assume that the actual number of conflicts is much smaller than the number of memory operations that have to be checked to see *if* there is a conflict
* thus it is not necessary to do all N+1 cache_valids clearings (from N SMP cores plus 1 DMA) in the exact same cycle

so rather than wrapping an N+1 for-loop around this:

                -- Do invalidations from snooped stores to memory
                for i in way_t loop
                    if snoop_valid = '1' and read_tag(i, snoop_tag_set) = snoop_wrtag then
                        cache_valids(snoop_index)(i) <= '0';
                    end if;
                end loop;

instead, do a multi-cycle (round-robin?) invalidation.

otherwise, even though cache_valids is a few bits, the multiplexing on it quickly gets out of hand.
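
a rough behavioural sketch of that in python (not RTL, and not microwatt's code): the tag compare is still done once per snoop source in the same cycle, but only hits are recorded, and cache_valids gets at most one write per cycle, picked round-robin. the class name, the sizes and the one-pending-hit-per-source slot are all assumptions for illustration, and back-pressure is ignored here.

```python
# behavioural model only, not RTL. all names and sizes are assumptions.
NUM_WAYS = 2   # assumed associativity
NUM_SETS = 4   # assumed number of sets


class SnoopedL1:
    def __init__(self, num_sources):
        self.num_sources = num_sources      # N CPUs + 1 DMA
        self.tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.valids = [[False] * NUM_WAYS for _ in range(NUM_SETS)]
        self.pending = {}                   # source -> (index, way) awaiting invalidate
        self.rr_next = 0                    # round-robin pointer

    def fill(self, index, way, tag):
        self.tags[index][way] = tag
        self.valids[index][way] = True

    def detect(self, snoops):
        """stage 1: tag-compare every snooped store in one cycle, record hits only."""
        for src, (index, tag) in snoops.items():
            for way in range(NUM_WAYS):
                if self.valids[index][way] and self.tags[index][way] == tag:
                    self.pending[src] = (index, way)

    def invalidate_one(self):
        """stage 2: at most one cache_valids write per cycle, round-robin by source."""
        for offset in range(self.num_sources):
            src = (self.rr_next + offset) % self.num_sources
            if src in self.pending:
                index, way = self.pending.pop(src)
                self.valids[index][way] = False
                self.rr_next = (src + 1) % self.num_sources
                return src
        return None


cache = SnoopedL1(num_sources=3)            # e.g. 2 CPUs + 1 DMA
cache.fill(index=0, way=0, tag=0x10)
cache.fill(index=1, way=1, tag=0x20)
cache.fill(index=2, way=0, tag=0x30)

# three stores snooped in the same cycle; only sources 0 and 2 actually hit
cache.detect({0: (0, 0x10), 1: (3, 0x99), 2: (1, 0x20)})
order = [cache.invalidate_one() for _ in range(3)]
```

with rare hits the pending set is nearly always empty, so the single write port onto cache_valids is rarely contended.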

>Microwatt's reservation logic currently works on effective addresses,
>which is wrong (the ISA requires the reservation to be linked to the
>underlying real address), and doesn't cancel the reservation on other
>stores (so far just DMA) to the underlying real address, which it
>should.

oh whoops i forgot, of course, LRSC.

AH, you remember the discussion we had a few months back about LR/SC? and how in RISC-V they allow some rules for circumstances under which forward progress is guaranteed?

they are probably based on a massive simplification that *any* snooping attempt will invalidate the Reservation... *except* under certain circumstances where LR and SC are within 16 instructions of each other (i.e. same cache line) etc etc in which case telling the snoopees to stall is no big deal.
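
to make the two options concrete, a small python sketch (a model, not microwatt's logic): the reservation is held on the real address, a snooped store to the same granule cancels it, and a flag switches to the "any snoop cancels" simplification guessed at above. the granule size, class and method names are assumptions, as is clearing the reservation on every store-conditional.

```python
GRANULE_BYTES = 64   # assumed reservation granule size, not taken from the thread


class Reservation:
    def __init__(self, any_snoop_cancels=False):
        self.granule = None                 # real-address granule currently reserved
        self.any_snoop_cancels = any_snoop_cancels

    def load_reserve(self, real_addr):
        self.granule = real_addr // GRANULE_BYTES

    def snoop_store(self, real_addr):
        """a store by another CPU or by DMA, as seen on the snoop port."""
        if self.granule is None:
            return
        if self.any_snoop_cancels or real_addr // GRANULE_BYTES == self.granule:
            self.granule = None

    def store_conditional(self, real_addr):
        """succeeds only if the reservation still covers this real address."""
        ok = self.granule == real_addr // GRANULE_BYTES
        self.granule = None                 # assumed: consumed whether or not it succeeds
        return ok


exact = Reservation()
exact.load_reserve(0x1000)
exact.snoop_store(0x2000)                   # store to an unrelated granule
survives_unrelated = exact.store_conditional(0x1000)

exact.load_reserve(0x1000)
exact.snoop_store(0x1008)                   # store inside the reserved granule
survives_same_granule = exact.store_conditional(0x1000)

coarse = Reservation(any_snoop_cancels=True)
coarse.load_reserve(0x1000)
coarse.snoop_store(0x2000)                  # any snooped store at all
survives_any_snoop = coarse.store_conditional(0x1000)
```

the coarse variant needs no address comparator on the reservation at all, at the cost of spurious store-conditional failures.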

must tell you about Mitch's algorithm but it is 1am and K9 mailer crashed and ate one reply already.

>   I have patches to address these bugs and to implement the
>sync instruction, which would be needed for correct operation with
>more than one CPU.

superb.

>> it occurred to me then that even if just the address comparison part
>> of cache_tags() was done as multiple duplicated lookups, a MUX onto a
>> single writer of cache_tags or just cache_valids would be much less
>> resources.
>
>Sorry, I don't follow what you're proposing.

above.  reduce massive muxes on cache_valids clearing if supporting multiple snoop sources in the same cycle: first stage detects the snoop, second stage invalidates.

problem is, if a hit is detected and clearing cache_valids is going to take several cycles, you now need a stall / busy signal back to each snoopee.
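
a quick python model of what that busy signal costs in the worst case, where every snooped store hits (the opposite of the rare-conflict assumption). each source gets one pending slot, one invalidation drains per cycle round-robin, and a source whose slot is still full is held off. the function name and the one-slot-per-source choice are assumptions for illustration.

```python
from collections import deque


def simulate(stores_per_source, num_sources):
    """worst case: every store hits. returns (total cycles, stall cycles per source)."""
    queues = [deque(range(stores_per_source)) for _ in range(num_sources)]
    pending = [False] * num_sources         # one outstanding invalidate per source
    stalled = [0] * num_sources
    rr_next = 0
    cycles = 0
    while any(queues) or any(pending):
        # stage 2: drain at most one pending invalidate, round-robin
        for offset in range(num_sources):
            src = (rr_next + offset) % num_sources
            if pending[src]:
                pending[src] = False
                rr_next = (src + 1) % num_sources
                break
        # stage 1: each source presents its next store unless held off by busy
        for src in range(num_sources):
            if not queues[src]:
                continue
            if pending[src]:
                stalled[src] += 1           # busy asserted back to this source
            else:
                queues[src].popleft()
                pending[src] = True
        cycles += 1
    return cycles, stalled


three_sources = simulate(stores_per_source=2, num_sources=3)
one_source = simulate(stores_per_source=2, num_sources=1)
```

in this model a single source never stalls; stalls only show up when several sources hit in the same window, which is the case assumed to be rare.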

need to draw it.

l.