[libre-riscv-dev] TLB key for CAM

Wed Mar 27 07:16:22 GMT 2019

On Wed, Mar 27, 2019 at 4:39 AM Daniel Benusovich
<flyingmonkeys1996 at gmail.com> wrote:

> >  would it be possible to remove the 2nd cache, write a unit test to
> > prove one cache, then add it back in and modify the unit test to do 2
> > again?
>
> I can keep it at two. I was not sure how the second level would be
> implemented yet.

 recursively, i should imagine.

 except, where the first level TLB comes from the L1 cache of one SMP
core, the second level TLB would be shared by *multiple* cores, so it
would need a 4-way bus (we're aiming for 4 cores) and there would be
only the one L2 cache (with the same 4-way bus to/from all 4 cores).

> > > After that is done I am not really too sure as the TLB will be
> > > fundamentally done.
> >
> >  great!  would that mean it would be at the point where it could do
> > page-walking in hardware?  or are we sticking to software?
>
> I believe the consensus was software. Though there is an option in the
> RISCV mode selection (if I am remembering correctly) to use hardware
> walking as opposed to software when a miss occurs.
> I would have to look back again, though; writing a page walker would
> not be the end of the world.

 oh cool!  it's funny to see, initially you said, "i have nooo idea at
all how to do this" and now you're saying "yep it's nooo problem" :)
funny how understanding evolves once there is some infrastructure in
place.

 a hardware TLB would make our lives a lot easier when it comes to
testing, and the reason is that the RISC-V ISA unit tests for virtual
memory rely on a hardware TLB.

 if we wanted to test a software TLB, we actually have to *have*
software... and have software that actually works, before VM will ever
function.

 so if you _are_ able to write one in hardware, that would i think be
really good.  we will need to look at adapting it for GPU usage,
however that can be done later.

 btw this algorithm looks really quite straightforward:
https://git.libre-riscv.org/?p=riscv-isa-sim.git;a=blob;f=riscv/mmu.cc;h=021f587eaac5ac23c42ae0fa8b88c93c4ca27ec5;hb=6fecdb16d72b71734b35f494023f5edc8804327c#l161

 *sigh* and this one... is about as readable as egyptian hieroglyphics
or cuneiform...
 https://github.com/freechipsproject/rocket-chip/blob/master/src/main/scala/rocket/PTW.scala

this guy looks like he's got the right idea... except the comments are
in chinese and i think he's been editing auto-generated chisel output
(chisel to verilog conversion):
https://github.com/baochuquan/RISCV-MMU/blob/master/PTW-origin/PTW.v

this one looks *really* nice:
https://github.com/pulp-platform/ariane/blob/master/src/tlb.sv

*that's* more like it.  it's got actual comments, it's only 248 lines
long including the preamble...

here's the ariane PTW code:
 https://github.com/pulp-platform/ariane/blob/master/src/ptw.sv

wow, actual comments in there too, which like actually makes it
readable!  i can actually begin to understand what the heck a PTW is
all about, from what the ariane team have done.

they have a state machine that travels down the levels (1G, 2M, 4k) so
presumably would be using progressively more bits of the address in
the lookup... yes, here it is:
https://github.com/pulp-platform/ariane/blob/master/src/ptw.sv#L281

looks pretty clear to me, what do you think?
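
 to make the "travels down the levels" idea concrete, here is a rough
software model of an Sv39-style walk in plain python. it is only a
sketch: it is not a translation of the ariane RTL, the names (walk,
mem, root_ppn) are made up for illustration, memory is assumed to be a
simple dict of physical address -> 64-bit PTE, and permission,
accessed and dirty checks are left out.

```python
# software model of an Sv39-style page-table walk (illustrative sketch only)
PAGE_SHIFT = 12   # 4k base pages
VPN_BITS = 9      # 9 index bits per level in Sv39
LEVELS = 3        # level 2 -> 1G, level 1 -> 2M, level 0 -> 4k
PTE_SIZE = 8      # bytes per page-table entry
PPN_MASK = (1 << 44) - 1
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


def walk(mem, root_ppn, vaddr):
    """mem: hypothetical dict, physical address -> 64-bit PTE.
    returns the physical address, or None for a page fault."""
    table = root_ppn << PAGE_SHIFT
    for level in range(LEVELS - 1, -1, -1):
        # each level down uses the next 9 bits of the virtual address
        shift = PAGE_SHIFT + VPN_BITS * level
        vpn = (vaddr >> shift) & ((1 << VPN_BITS) - 1)
        pte = mem.get(table + vpn * PTE_SIZE, 0)
        if not (pte & PTE_V):
            return None                     # invalid entry
        if (pte & PTE_W) and not (pte & PTE_R):
            return None                     # reserved encoding
        ppn = (pte >> 10) & PPN_MASK
        if pte & (PTE_R | PTE_X):
            # leaf: a 1G, 2M or 4k page depending on the level reached
            if ppn & ((1 << (VPN_BITS * level)) - 1):
                return None                 # misaligned superpage
            offset = vaddr & ((1 << shift) - 1)
            return (ppn << PAGE_SHIFT) | offset
        table = ppn << PAGE_SHIFT           # pointer to the next level
    return None                             # ran out of levels
```

 in hardware the loop becomes the state machine, with one memory
request per level, and the result is what gets written back into the
TLB.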

> >  next component... and/or discuss/review the levels of TLB idea, i
> > really liked that concept, to have a 2-level TLB with reduced CAM
> > sizes for the 1st level.   being able to check the performance of that,
> > via a unit test that emulates workloads would be really useful.

> I agree it is a cool and efficient idea. I was not confident in
> implementing the second cache as a CAM as it would be a bit much size
> and power wise.

 well, the issue there is, i believe, that if there's only the 1
level, each core will need a much larger CAM, and there would now be
*four* of them.  whereas for a 2nd level TLB, there would only be the
one.  larger, yes, but only one of them.

 so it is a trade-off.

> However, once we go into a more common version of caching it will no
> longer be a one cycle search.
> Pretty much it would delay getting a miss while we search the second
> level cache.

 yeah, exactly... however there would be quite a lot of instances
where it didn't miss.

 ... *click*... yes, ok, i see what you're saying: the logic of a miss
becomes a bit more involved, as the 1st stage TLB has to communicate
with the 2nd stage TLB about "misses".

*thinks*.... actually, all that's needed there is for the 2nd stage
TLB to understand the "page fault" generated by the 1st stage TLB,
that could be the signal to *activate* the 2nd stage TLB.  so a small
amount of glue logic could be used, without needing to have different
code for 1st and 2nd stage TLB.
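
 a rough behavioural sketch of that glue, in plain python rather than
nmigen: the same class is used for both levels, and a miss in one
level is simply what triggers a lookup in the next. everything here
(the TLB class, FIFO replacement, the sizes) is an assumption made for
illustration, not the actual design. the hit/miss counters are the
sort of thing a workload-emulating unit test could check.

```python
class TLB:
    """tiny fully-associative TLB model with FIFO replacement
    (a stand-in for the CAM, for illustration only)."""

    def __init__(self, size, next_level=None):
        self.size = size
        self.entries = {}             # vpn -> ppn, dicts keep insertion order
        self.next_level = next_level  # None for the last level
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            return self.entries[vpn]
        self.misses += 1
        # the "glue": a miss here is the signal that activates the next level
        ppn = None
        if self.next_level is not None:
            ppn = self.next_level.lookup(vpn)
        if ppn is not None:
            self.fill(vpn, ppn)       # refill this level from the next one
        return ppn

    def fill(self, vpn, ppn):
        if vpn not in self.entries and len(self.entries) >= self.size:
            oldest = next(iter(self.entries))
            del self.entries[oldest]  # evict the oldest entry
        self.entries[vpn] = ppn


# one larger shared 2nd level, one small 1st level (hypothetical sizes)
l2 = TLB(size=64)
l1 = TLB(size=4, next_level=l2)
for vpn in range(8):
    l2.fill(vpn, 0x100 + vpn)

for vpn in [0, 1, 2, 3, 0, 1, 2, 3, 4, 0]:
    l1.lookup(vpn)
print(l1.hits, l1.misses, l2.hits, l2.misses)
```

 a miss in the last level returns None, which is where the page walk
(hardware or software) would take over.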

does that make any sense?  sorry, different topic entirely from your
question, i appreciate that.

> If that is alright I could move forward, maybe with a set
> associative cache with a generic size?

 yes, a set-associative cache is fine (we'll need one for SMP anyway),
so go for it.
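
 for reference, a minimal behavioural model of what "set associative
with a generic size" might look like, again in plain python. the class
name, the FIFO replacement within a set and the use of the low bits of
the page number as the set index are all assumptions for the sake of
the sketch.

```python
class SetAssociativeTLB:
    """num_sets sets of `ways` entries each; only one set is searched
    per lookup, so the comparison is `ways` wide rather than the
    full size of the cache."""

    def __init__(self, num_sets, ways):
        assert num_sets > 0 and num_sets & (num_sets - 1) == 0, \
            "num_sets must be a power of two"
        self.num_sets = num_sets
        self.ways = ways
        # each set is a list of (tag, ppn) pairs, oldest first
        self.sets = [[] for _ in range(num_sets)]

    def _split(self, vpn):
        # low bits select the set, the remaining bits are the tag
        return vpn % self.num_sets, vpn // self.num_sets

    def lookup(self, vpn):
        index, tag = self._split(vpn)
        for entry_tag, ppn in self.sets[index]:
            if entry_tag == tag:
                return ppn
        return None                       # miss

    def fill(self, vpn, ppn):
        index, tag = self._split(vpn)
        entries = self.sets[index]
        # drop any stale entry with the same tag before inserting
        entries[:] = [e for e in entries if e[0] != tag]
        if len(entries) >= self.ways:
            entries.pop(0)                # evict the oldest way in this set
        entries.append((tag, ppn))
```

 with num_sets=1 this degenerates to the fully associative (CAM-like)
case, and with ways=1 to a direct-mapped cache, so the one module
covers the whole range.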

> If so more modules to write! Hooray!

 :)

> >  oh!  btw, VectorAssembler.py can be replaced with this:
> >
> >  vector = []
> >  for index in range(len(entry_array)):
> >       ematch = entry_array[index].match
> >       vector.append(ematch)
> >
> > encoder.i.eq(Cat(*vector))

> Absolutely gorgeous. I was thinking that Cat should be able to do this
> but didn't get around to making it work. Thank you!

 it's a python trick.
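
 the trick is argument unpacking: f(*some_list) calls f with each
element of the list as a separate positional argument. a small
stand-alone illustration, where cat() is a made-up plain-python
stand-in that packs bits with the first argument in bit 0 (it is not
the real Cat):

```python
def cat(*bits):
    """hypothetical stand-in for Cat: the first argument ends up in bit 0."""
    value = 0
    for position, bit in enumerate(bits):
        value |= (bit & 1) << position
    return value


matches = [0, 1, 0, 0]

# these two calls are equivalent: the * spreads the list out
# into four separate arguments
packed_from_list = cat(*matches)
packed_by_hand = cat(0, 1, 0, 0)
```

 so a list of any length can be built up in a loop and handed over in
one go, with no need for a separate assembler module.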

> I will put that in post haste (as long as those dreaded loops don't come back).

 they shouldn't.  in the graphs you should get a block with numbers
0:1 in one column and 0:1 1:2 2:3 3:4 in the next, basically exactly
how VectorAssembler works.

l.