[libre-riscv-dev] pipeline sync issues

Tue Apr 9 12:07:38 BST 2019

On Tue, Apr 9, 2019 at 7:57 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> continuing from a private discussion:
>
> Luke:
> > i found
> > out why UnbufferedPipeline (was BreakReadyChainStage) and the
> > (newly-added) PassThroughHandshake (was RegStage) won't work: they all
> > have loops that make their p_o_valid be set *not* on the clock cycle,
> > instead ever so slightly off it.

> >
> > BufferedPipeline (now renamed to SimpleHandshake) and BufferedPipeline
> > (now renamed to BufferedHandshake) both have the condition that the
> > data and its associated ready signal are latched together and use
> > sync.

  ... turns out i was wrong about this: _all_ the code has these
propagation delays.  so it still hasn't been identified why the
BufferedHandshake unit, when linked with PassThroughHandshake or
UnbufferedPipeline, is barfing.

 the important thing to note however is:

 * BufferedHandshake connected to SimpleHandshake WORKS.
 * BufferedHandshake connected to PassThru (RegStage), Unbuffered
   (BreakReadyChainStage) or Unbuffered2 (CombStage) does NOT.
 * SimpleHandshake connected to PassThru (RegStage), Unbuffered
   (BreakReadyChainStage) or Unbuffered2 (CombStage) WORKS.

which makes absolutely no sense whatsoever, and it's really really
important to track down why.

> The way I had designed them is that all signals only need to have the
> correct values at the clock edge. The signals change right after the clock
> edge because they are connected to the output of flip-flops and the clock
> edge is what makes the flip-flop outputs change state. the reason they
> don't change exactly at the clock edge is because the signals have to first
> propagate through combinatorial logic between the flip-flop outputs and the
> actual signals that you were watching. This happens on the order of 100ps
> (in the simulation).

 a really close zoom-in on gtkwave on both the succeeding and failing
cases confirms this (and shows no obvious difference, i.e. all the
signals are indeed changing at the same time, by the same amount after
the clock).
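
 to put rough numbers on why that delay is harmless, here is a small
sketch (plain python, not nmigen).  the 100ps delay and the 1MHz clock
are the figures jacob quotes for the simulation; the function names are
made up for illustration.

```python
# figures quoted in this thread for the simulation
CLOCK_HZ = 1_000_000               # simulation clock: 1MHz
PERIOD_PS = 10**12 // CLOCK_HZ     # one clock period, in picoseconds
COMB_DELAY_PS = 100                # "on the order of 100ps"

def settled_before_next_edge(delay_ps, period_ps=PERIOD_PS):
    """True if a signal launched at one clock edge is stable by the next."""
    return delay_ps < period_ps

def slack_ps(delay_ps, period_ps=PERIOD_PS):
    """Time left over between the signal settling and the next edge."""
    return period_ps - delay_ps

print(settled_before_next_edge(COMB_DELAY_PS))
print(slack_ps(COMB_DELAY_PS))
```

 the point being: a signal that wobbles for 100ps after the edge has
the rest of the period to settle before anything samples it.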

> Also, some of the test code I wrote changes test input signal states a
> small fraction of a clock cycle after a clock edge because I have the test
> process delay a little bit right after the clock edge in order to be able
> to read the circuit's outputs after sufficient time had elapsed for all the
> combinatorial circuits to propagate the correct signal levels from the
> output of the flip-flops. if there are signals that are produced
> combinatorially from the circuit's inputs then they will change shortly
> after the circuit's inputs change. This happens on the order of 10-100ns
> since the clock in the simulation runs at 1MHz.

 that's a good idea.  do you think you could introduce some of that
into the buf_pipe_test.py unit tests?  i appreciate it's a bit...
cut/paste-heavy, i haven't stopped to reorg yet.

> In both of those cases, the circuit is functioning correctly as designed,

 ok that's really good to hear.

 so, back to square one on the ongoing investigation.

 after our [offline] conversation i wrote out the truth table for
BufferedHandshake, and confirmed that it is indeed a 16-way karnaugh
map.  SimpleHandshake (which follows the pattern used by Wishbone and
AXI4), by comparison, is an 8-way map.

 the reason for the 16-way map appears to be down to the move to the
"stalled and still active" state, which *only* happens when:

* the output (to next) is valid
* the input (from next) is NOT ready
* the output (to previous) has been indicated (by the previous clock)
as ACCEPTING (ready)
* the input (from previous) is valid (and therefore MUST be accepted)
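
the four conditions above can be sketched as a boolean function over
the four handshake signals, enumerated over all 16 input combinations.
this is plain python rather than nmigen, and the names n_o_valid and
p_i_valid are assumed by analogy with p_o_ready / n_i_ready; it only
covers the entry condition, not the whole state machine.

```python
from itertools import product

def moves_to_stalled(n_o_valid, n_i_ready, p_o_ready, p_i_valid):
    """Entry condition for the "stalled and still active" state."""
    return (n_o_valid          # output (to next) is valid
            and not n_i_ready  # input (from next) is NOT ready
            and p_o_ready      # we told previous we were ready
            and p_i_valid)     # previous is sending, so we MUST accept

# all 16 combinations of the four one-bit signals
rows = list(product([False, True], repeat=4))
stalled_rows = [row for row in rows if moves_to_stalled(*row)]

for row in rows:
    print(row, "-> stall" if moves_to_stalled(*row) else "")
```

exactly one row of the sixteen triggers the move, which is the case
that needs somewhere extra to put the incoming data.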

in a "Simple" handshake (no buffering), it is REQUIRED that p_o_ready
be equal to n_i_ready, because these are the only circumstances under
which it is safe for data to pass through.

i.e. - bear in mind that we are thinking about the conditions for the
*next* cycle: if the next stage says (on this clock) that it's ready,
then it is declaring that it *will* accept input on the next clock...
therefore this condition can be propagated to the *previous* stage,
because the *previous* stage knows then that, on the next clock, the
current stage has room to accept data.
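
as a rough illustration of that rule, here is a cycle-by-cycle model of
a single unbuffered stage where p_o_ready is just n_i_ready passed
straight through.  it is a behavioural sketch in plain python, not the
SimpleHandshake code: run_simple_stage, reg_valid and reg_data are
made-up names, and the producer is assumed to always offer data while
it has any.

```python
def run_simple_stage(items, ready_pattern):
    """Push items through one register stage, one loop pass per clock."""
    pending = list(items)               # data waiting in the previous stage
    reg_valid, reg_data = False, None   # the stage's single data register
    received = []                       # what the next stage accepted
    for n_i_ready in ready_pattern:
        p_o_ready = n_i_ready           # ready passed straight back
        p_i_valid = bool(pending)
        if n_i_ready:
            if reg_valid:
                received.append(reg_data)   # next stage takes our output
            if p_i_valid and p_o_ready:
                reg_valid, reg_data = True, pending.pop(0)
            else:
                reg_valid = False           # nothing new arrived
        # when n_i_ready is low, everything simply holds its value
    return received

print(run_simple_stage([1, 2, 3], [1, 0, 1, 1, 0, 1, 1]))
```

because the previous stage sees "not ready" in the same cycle that the
next stage stalls, there is never a moment where data arrives with
nowhere to go, so the single register is enough.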

btw also as part of the same investigation, i realised that
BufferedHandshake is basically a "2-entry FIFO with a pre-processing
opportunity on the incoming data".  so i took a look at nmigen FIFO,
and guess what!  it has the *exact* same ready/valid on in/out
signalling!

they're called
        self.writable = Signal() # not full
        self.we       = Signal()
and
        self.readable = Signal() # not empty
        self.re       = Signal()

* writable = p.o_ready
* we = p.i_valid
* readable = n.o_valid
* re = n.i_ready
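
to make the correspondence concrete, here is a tiny behavioural model
of a 2-entry FIFO with the same four signal names.  this is NOT the
nmigen FIFO: TinyFifo and its clock() method are made up for
illustration, and only the writable / we / readable / re names are
borrowed.

```python
from collections import deque

class TinyFifo:
    """Behavioural 2-entry FIFO with ready/valid-style flags."""
    def __init__(self, depth=2):
        self.depth = depth
        self._q = deque()

    @property
    def writable(self):          # not full
        return len(self._q) < self.depth

    @property
    def readable(self):          # not empty
        return len(self._q) > 0

    def clock(self, we=False, din=None, re=False):
        """One clock edge.  Returns the item read out, or None."""
        can_write = self.writable    # flags as seen before the edge
        can_read = self.readable
        dout = None
        if re and can_read:
            dout = self._q.popleft()
        if we and can_write:
            self._q.append(din)
        return dout

fifo = TinyFifo()
fifo.clock(we=True, din="a")
fifo.clock(we=True, din="b")     # second entry: FIFO is now full
print(fifo.writable, fifo.readable)
print(fifo.clock(re=True))
```

a write is only honoured when writable was high before the edge, and a
read only when readable was high, which is the same contract as a
ready/valid handshake on each side.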

so that got me thinking: a stage which can handle the inclusion of a
FIFO as part of its job is probably enough to "drive" the API in
roughly the right direction.

not only that, but if properly separated, a combinatorial join
(StageChain) of a 2-entry FIFO "stage" with a pipeline ALU-style
"stage" dropped on top of the SimpleHandshake "Control" will *MAKE* a
BufferedHandshake pipeline stage.
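
a very rough behavioural sketch of that idea: a 2-entry queue plus a
processing function applied to the incoming data.  plain python, not
nmigen, and not the BufferedHandshake implementation; the function
name, the default "process" and the always-offering producer are all
assumptions made for the example.

```python
from collections import deque

def run_buffered_stage(items, ready_pattern,
                       process=lambda x: x + 1, depth=2):
    """2-entry FIFO + pre-processing, one loop pass per clock."""
    pending = list(items)       # data waiting in the previous stage
    fifo = deque()              # the stage's own 2-entry storage
    received = []               # what the next stage accepted
    for n_i_ready in ready_pattern:
        # flags depend only on stored state, sampled before the edge
        p_o_ready = len(fifo) < depth    # FIFO "writable"
        n_o_valid = len(fifo) > 0        # FIFO "readable"
        p_i_valid = bool(pending)
        if n_o_valid and n_i_ready:
            received.append(fifo.popleft())
        if p_i_valid and p_o_ready:
            # the "pre-processing opportunity" on the incoming data
            fifo.append(process(pending.pop(0)))
    return received

print(run_buffered_stage([1, 2, 3, 4], [0, 0, 1, 1, 0, 1, 1, 1]))
```

the difference from the unbuffered case is that p_o_ready here comes
from stored state only, not from n_i_ready in the same cycle, and the
second entry is what absorbs the one item that arrives while the output
is stalled.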

 i think really, we very much need to stop and think for a bit,
what is it that we *actually* need (what high-level functionality),
then work out some building blocks, documenting all of that properly
as we go. [ the exploration i've been doing has been necessary so that
i actually know the problem space ].

l.