[Libre-soc-dev] microwatt-libre-soc interoperable verilator snapshots / debugging

Sun Jan 9 12:08:40 GMT 2022

hi folks, got an... interesting one.

i've been tracking microwatt extremely closely, very deliberately, as we
ramp up capability in libre-soc, so that we don't bite off more than we
can chew.  this has been even to the extent of adding the exact same
DMI interface and using it in verilator simulations, so that single-step
*full* register dumps can be done in *both* systems.  the XICS interface
is the same, the MMU is the same, and the wishbone interfaces (apart from
the bug with the LSBs) are the same.

those register dumps allowed me to literally do a "diff -u" on the full
trace logs in order to find instruction-level discrepancies.
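
as a rough illustration of that "diff -u" idea, here is a minimal c++ sketch that reports the first line at which two trace logs differ.  the log format (one line per instruction) and the function names are assumptions made for this example, not what the actual trace code emits.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Returns the 1-based number of the first line that differs between two
// trace logs, or 0 if they are identical.  A log that ends early counts
// as differing at its first missing line.
long first_divergence(std::istream &a, std::istream &b) {
    std::string la, lb;
    long n = 0;
    for (;;) {
        const bool ga = static_cast<bool>(std::getline(a, la));
        const bool gb = static_cast<bool>(std::getline(b, lb));
        ++n;
        if (!ga && !gb) return 0;             // both logs ended together
        if (ga != gb || la != lb) return n;   // length or content mismatch
    }
}

// Convenience wrapper for logs already held in memory (hypothetical helper).
long first_divergence_in(const std::string &a, const std::string &b) {
    std::istringstream sa(a), sb(b);
    return first_divergence(sa, sb);
}
```

the line number that comes back is effectively the instruction count at which the two cores stopped agreeing, which is the place to start looking in the VCD traces.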

that was 2 years ago.

now that we have an MMU, i am repeating the same exercise, but this
time with an enhanced version of microwatt's built-in verilator "mini-soc".
https://git.libre-soc.org/?p=microwatt.git;a=shortlog;h=refs/heads/verilator_trace
(note, yes, this is off of an older version of microwatt that doesn't have
the 3-stage loadstore pipeline because that's too hard to track)

like qemu, it's capable of reading its "image" from a file, by using a
c++-emulated version of BRAM.  i have also added bare-minimum
"dumping" of the PC (which branch-prediction interferes with,
interestingly), and the MSR, as well as tracking of every BRAM LD/ST.

i've also made it possible to drop in *any* core - Bill could drop in A2O
or A2I for example.  the interface is shown in core_dummy.vhdl and
it is exactly that of core.vhdl.

the problem is: we're at a stage where libre-soc's in-order core is still
being developed, and execution of linux-5.7 is at an astoundingly slow
rate of only 1,000 instructions per second.  microwatt is executing at
around 8,000-10,000 per second.

to give an example: last night, to get to running this line:
[    0.000000] Linux version 5.7.0-00030-gabe0e1dab0a2-dirty

took *three hours* on Libre-SOC and about half an hour for microwatt.

except, libre-soc *didn't* get to that point: it went into a hard-loop
at address 0xa598 when executing 0xf82d0190 (a "std") in MSR.IR/DR
mode.

getting to that point is... excruciatingly painfully slow, and i am thinking
ahead that there is going to be another thing to fix, and another, and
another, and going to an FPGA isn't going to help, because then access
to the full VCD signal traces (even if it's done in c++ in a clumsy way
by poking around the auto-generated verilator data structures) is lost.

the thought that therefore occurred to me was to be able to do this:

* assume that up to a certain instruction, microwatt and libre-soc
  have performed identically
* run microwatt
* in verilator c++ do a **FULL** system-wide state dump.  all
  registers, all TLB entries, everything.
* terminate the microwatt simulation
* run libre-soc
* have verilator RELOAD the entire state and continue executing...
  ...from where MICROWATT left off.
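
the steps above could be sketched roughly as follows.  this is only an illustration under stated assumptions: the SystemState and TlbEntry structs, the text format and the function names are all hypothetical, and the part that actually reads and writes the core (via DMI or by poking the verilator data structures) is left out entirely.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, HDL-independent view of the system state.
struct TlbEntry {
    std::uint64_t tag = 0;
    std::uint64_t pte = 0;
    bool valid = false;
};

struct SystemState {
    std::uint64_t pc = 0;
    std::uint64_t msr = 0;
    std::array<std::uint64_t, 32> gpr{};
    std::map<std::uint32_t, std::uint64_t> spr;  // SPR number -> value
    std::vector<TlbEntry> tlb;
};

// Writes one "key [index] value..." record per line, all numbers in hex.
void save_state(const SystemState &s, std::ostream &out) {
    out << std::hex;
    out << "pc " << s.pc << "\n";
    out << "msr " << s.msr << "\n";
    for (std::size_t i = 0; i < s.gpr.size(); ++i)
        out << "gpr " << i << " " << s.gpr[i] << "\n";
    for (const auto &kv : s.spr)
        out << "spr " << kv.first << " " << kv.second << "\n";
    for (std::size_t i = 0; i < s.tlb.size(); ++i)
        out << "tlb " << i << " " << (s.tlb[i].valid ? 1 : 0) << " "
            << s.tlb[i].tag << " " << s.tlb[i].pte << "\n";
    out << std::dec;
}

// Reads the format written by save_state.  Returns false on any unknown
// key or out-of-range index and leaves `s` untouched in that case.
// Note: the stream is left in hex mode afterwards.
bool load_state(std::istream &in, SystemState &s) {
    const std::size_t kMaxTlbEntries = 4096;  // arbitrary sanity limit
    SystemState tmp;
    std::string key;
    in >> std::hex;
    while (in >> key) {
        if (key == "pc") {
            if (!(in >> tmp.pc)) return false;
        } else if (key == "msr") {
            if (!(in >> tmp.msr)) return false;
        } else if (key == "gpr") {
            std::size_t i = 0;
            std::uint64_t v = 0;
            if (!(in >> i >> v) || i >= tmp.gpr.size()) return false;
            tmp.gpr[i] = v;
        } else if (key == "spr") {
            std::uint32_t n = 0;
            std::uint64_t v = 0;
            if (!(in >> n >> v)) return false;
            tmp.spr[n] = v;
        } else if (key == "tlb") {
            std::size_t i = 0;
            int valid = 0;
            TlbEntry e;
            if (!(in >> i >> valid >> e.tag >> e.pte)) return false;
            if (i >= kMaxTlbEntries) return false;
            if (i >= tmp.tlb.size()) tmp.tlb.resize(i + 1);
            e.valid = (valid != 0);
            tmp.tlb[i] = e;
        } else {
            return false;  // unknown record: refuse rather than guess
        }
    }
    s = tmp;
    return true;
}
```

a keyed text format is used here (rather than a raw binary dump) so that the same file can be read by either harness, and so that a snapshot can itself be inspected with "diff -u".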

now, verilator itself has the ability to do save/restore of full HDL
state, but that's not actually useful, here.  one single simple change
to the HDL and that entire verilator-formatted-state is useless
because the signals contained in the verilator-save-state no longer
match the actual HDL.

by saving the *system* state (oh and making sure it's compatible
between systems) it becomes possible to perform snapshot-and-rerun
*even if the HDL changes* in either microwatt or libre-soc.  obviously,
if the number of registers changes, or if the size of the caches changes
(number of TLB / PTE ways) that's not true, but we can live with that.

i thought about the possibility of using the linux built-in save system,
but that doesn't help either, because you have to get to a certain
point in execution of the linux kernel before you can even _consider_
a restore... and i am debugging exactly that very early execution
point.

[very interestingly however if this works it would be possible to
massively speed up linux built-in save/restore by bypassing the
need to perform much of the linux early boot]

i am not so concerned about the cache contents: if there's some
cache misses that's fine, but TLB misses are a bit of a big deal.
it also means that writing to the full register set on the DMI interface
needs adding.

memory-dump is dead-simple, just save the memory inside
the c++ verilator simulation which is being used to feed the BRAM
bus.  50 lines of c++ should take care of that.
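
something along these lines is what i have in mind for the memory part.  it is a sketch only: BramModel is a hypothetical stand-in for whatever the simulation harness really uses to back the BRAM bus, and the dump is written in host byte order, so it assumes the same host writes and reads it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for the c++ array that feeds the BRAM bus.
struct BramModel {
    std::vector<std::uint64_t> words;
    explicit BramModel(std::size_t n) : words(n, 0) {}
};

// Writes a word count followed by the raw memory contents.
bool dump_memory(const BramModel &m, std::ostream &out) {
    const std::uint64_t n = m.words.size();
    out.write(reinterpret_cast<const char *>(&n), sizeof n);
    out.write(reinterpret_cast<const char *>(m.words.data()),
              static_cast<std::streamsize>(n * sizeof(std::uint64_t)));
    return static_cast<bool>(out);
}

// Reloads a dump.  Refuses (and leaves memory untouched) if the dump was
// taken from a BRAM of a different size or if the file is truncated.
bool restore_memory(BramModel &m, std::istream &in) {
    std::uint64_t n = 0;
    if (!in.read(reinterpret_cast<char *>(&n), sizeof n)) return false;
    if (n != m.words.size()) return false;
    std::vector<std::uint64_t> tmp(m.words.size());
    if (!in.read(reinterpret_cast<char *>(tmp.data()),
                 static_cast<std::streamsize>(n * sizeof(std::uint64_t))))
        return false;
    m.words.swap(tmp);
    return true;
}
```

in real use the streams would be std::ofstream / std::ifstream opened with std::ios::binary.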

questions:

1) does this sound at all reasonable and generally useful for research
    purposes?

2) is there any other way to achieve this?  kgdb remote? other?

l.