[libre-riscv-dev] buffered pipeline

Thu Mar 14 18:20:29 GMT 2019

On Thu, Mar 14, 2019, 04:57 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> there's... something about this that doesn't feel right, which perhaps a
> more comprehensive test will pick up.
>
> i *think* it's down to the use of combinatorial logic for the BSY/STB
> signals, which in a long pipeline will result in an ever-increasing
> propagation delay that will dramatically reduce the maximum clock rate.
>
> as an example, mitch was involved in the AMD K9 architecture which had a
> requirement of a mere 16 gates chained together on any given stage.
>
since, for the simple stages, they just OR the ~Q output of a flip-flop
with the accepting input, the OR gates can be reassociated into a balanced
tree by the logic optimiser. that means we could have 64k simple stages in
series and, ignoring wire delays and assuming we use 4-input OR gates, we
would end up with a delay of only 8 gates (counting just the OR gates),
since log(64k)/log(4) = 8.
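
as a quick sanity check of that arithmetic, here is a minimal sketch in plain
python (not nmigen). the function name or_tree_depth is made up for
illustration; it just counts how many levels of k-input gates are needed to
reduce n signals to one, ignoring wire delay and fan-out limits.

```python
def or_tree_depth(num_inputs: int, fan_in: int) -> int:
    """Levels of fan_in-input gates needed to reduce num_inputs signals to one."""
    if num_inputs < 1 or fan_in < 2:
        raise ValueError("need at least one input and a fan-in of at least 2")
    depth = 0
    remaining = num_inputs
    while remaining > 1:
        # each level merges groups of up to fan_in signals into one signal
        remaining = -(-remaining // fan_in)  # ceiling division
        depth += 1
    return depth


# 64k stages with 4-input gates, the case discussed above
print(or_tree_depth(64 * 1024, 4))
```

with 2-input gates instead, the same function gives 16 levels for 64k inputs,
so the depth still grows only logarithmically with the number of stages.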

If that is still too much, we can change the simple stage so it deasserts
input.accepting when output.accepting is deasserted even if it is empty at
the moment. that would change the simple stage so the *.accepting signals
are directly connected together.
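
to make the difference between the two variants concrete, here is a rough
behavioural sketch in plain python. the function names and the list-of-bools
representation are assumptions for illustration only, not the actual stage
code: full[i] says whether stage i currently holds data, with index 0 at the
head of the chain.

```python
def head_accepting_or(full, sink_accepting):
    """Current scheme: a stage accepts if it is empty OR its successor accepts."""
    accepting = sink_accepting
    for stage_full in reversed(full):  # walk from the tail back to the head
        accepting = (not stage_full) or accepting
    return accepting


def head_accepting_direct(full, sink_accepting):
    """Alternative scheme: accepting is wired straight through every stage."""
    return sink_accepting  # occupancy (full) is deliberately ignored


# three stages, the middle one empty, and the sink is stalling
full = [True, False, True]
print(head_accepting_or(full, sink_accepting=False))
print(head_accepting_direct(full, sink_accepting=False))
```

the trade-off is that the directly-connected version has no logic at all on
the accepting path, but it can no longer fill an empty stage while the sink is
stalled.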

>
> to understand that more: it looks really simple, at the moment, just chain
> the BSY/STB lines together, because it's a simple example and no actual
> stalling is required (or implemented) in any given stage.
>
> however if say a given stage has particularly complex analysis logic for
> whether the stage should stall or not, that complex logic *accumulates* and
> propagates up and down the entire pipeline chain.
>
we simply add a stage that has enough buffering so that the accepting line
is driven from its internal flip-flop and doesn't have to propagate from
the following stages in a single clock cycle. it would add an extra cycle
of delay every time the successor stage stalls, but would go back to not
having extra delay when not stalling.
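
one common way to get that property is a two-entry ("skid") buffer. the
following is only a cycle-level sketch in plain python under that assumption,
not the stage as actually implemented: the class name BufferedStage, the
main/skid register names and the run helper are all hypothetical. the point it
shows is that in_accepting is computed from registered state alone, so
out_accepting never reaches the predecessor combinatorially.

```python
class BufferedStage:
    """Cycle-level model of a two-entry (skid) buffer stage.

    Hypothetical sketch: None marks an empty register, so payloads must not be None.
    """

    def __init__(self):
        self.main = None  # entry currently presented to the successor
        self.skid = None  # spare entry that absorbs one item during a stall

    @property
    def in_accepting(self):
        # depends only on registered state, never on out_accepting directly
        return self.skid is None

    def step(self, in_valid, in_data, out_accepting):
        """Advance one clock; return (input_taken, output_data_or_None)."""
        input_taken = in_valid and self.in_accepting
        output = self.main if out_accepting else None
        if output is not None:
            # successor took main, so the skid entry (if any) moves up
            self.main, self.skid = self.skid, None
        if input_taken:
            if self.main is None:
                self.main = in_data
            else:
                self.skid = in_data
        return input_taken, output


def run(items, out_accepting_pattern):
    """Push items through one BufferedStage; return what the successor received."""
    stage, received, i = BufferedStage(), [], 0
    for out_accepting in out_accepting_pattern:
        in_valid = i < len(items)
        taken, out = stage.step(in_valid, items[i] if in_valid else None, out_accepting)
        if taken:
            i += 1
        if out is not None:
            received.append(out)
    return received


# the successor stalls on some cycles; nothing is lost or reordered
pattern = [True, False, False, True, True, False, True, True, True, True]
print(run([1, 2, 3, 4, 5], pattern))
```

the cost is one extra register's worth of storage per buffered stage, in
exchange for cutting the combinatorial accepting chain at that point.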

>
> this is what dan was talking about in his post
> https://zipcpu.com/blog/2017/08/14/strategies-for-pipelining.html
>
> i *believe* you may have implemented the "simple handshake" protocol.
> chaining several stages and throwing ten thousand values at the input will
> help determine a bit more.
>
> also, see the attached screenshot, there's a spike which has me very
> concerned.  that really *really* should not be happening, as it will cause
> data instability.
>
that spike is caused by the signals changing because of the clock pulse. if
the signals change too soon at the inputs of flip-flops, the toolchain will
add enough combinatorial delay to meet the hold times of the flip-flops. I
don't know for 28nm cmos, but some flip-flops actually have enough internal
delay that they have zero hold time. So, basically, I don't think that will
actually be a problem, since most signals change because of a clock pulse
and the rest of them don't cause problems.

you also get a spike on o when you have the following (not tested):

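from nmigen import Module, Signal

a = Signal(1)
b = Signal(1)
a_comb = Signal(1)
o = Signal(1)
m = Module()
m.d.sync += a.eq(~a)          # a toggles on every clock edge
m.d.sync += b.eq(a)           # b is a delayed by one clock
m.d.comb += a_comb.eq(a & 1)  # a routed through one extra gate
m.d.comb += o.eq(a_comb ^ b)  # can briefly see new b with old a_comb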

there's a glitch because a_comb changes after b does. the glitch doesn't
affect registers that use o as an input, however: the glitch takes place
between the minimum and maximum propagation delays, and by design the clock
cycle is longer than the ff prop delay + logic prop delay + wire delay +
ff setup time, so o has settled again before it is sampled.
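
to show where the glitch sits in time, here is a tiny plain-python timeline of
the XOR inputs after one clock edge. it is a sketch only: the function name
and the delay values (arbitrary units) are made up, and the XOR gate's own
delay is ignored. it assumes the edge where a goes 0 -> 1 and b goes 1 -> 0.

```python
def xor_of_inputs(t, t_clk_to_q=1.0, t_and=0.5):
    """Value of a_comb ^ b at time t after the clock edge (XOR delay ignored)."""
    a_old, b_old = 0, 1  # values before the edge
    a_new, b_new = 1, 0  # values after the edge
    # b comes straight out of its flip-flop
    b = b_new if t >= t_clk_to_q else b_old
    # a_comb has to go through the extra gate as well, so it updates later
    a_comb = a_new if t >= t_clk_to_q + t_and else a_old
    return a_comb ^ b


for t in (0.5, 1.2, 2.0):
    print(t, xor_of_inputs(t))
```

in steady state the XOR is 1 both before and after the edge; it only dips to 0
in the window between b updating and a_comb catching up, which is long over by
the time the next clock edge samples it.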

you only have to worry about glitches when crossing clock domains and when
using a combinatorial signal as a clock or asynchronous reset. edge
triggering takes care of ignoring glitches on the D inputs in all other
cases.

>
> ASICs push the boundary on what can be fitted into a given clock pulse: the
> inputs absolutely have to be stable at the moment the clock rises!
>
agreed. they don't have to be stable (and usually aren't) after the changed
signals make it out of the flip-flops and into the combinatorial logic
though, which is why a good flip-flop generally has a propagation delay
larger than its hold time, so the following logic doesn't need a non-zero
minimum propagation delay.
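
the two timing conditions discussed in this mail can be written down as simple
checks. this is a sketch in plain python with made-up function names and
made-up numbers (picoseconds), and it ignores clock skew and jitter:

```python
def meets_setup(t_period, t_ff_prop_max, t_logic_max, t_wire, t_setup):
    """Data must arrive and settle one setup time before the next clock edge."""
    return t_period >= t_ff_prop_max + t_logic_max + t_wire + t_setup


def meets_hold(t_ff_prop_min, t_logic_min, t_hold):
    """New data must not reach the next flip-flop before its hold time ends."""
    return t_ff_prop_min + t_logic_min >= t_hold


# hypothetical numbers, in picoseconds
print(meets_setup(t_period=1000, t_ff_prop_max=100, t_logic_max=700,
                  t_wire=100, t_setup=50))
# flip-flop prop delay larger than hold time: zero logic delay is still fine
print(meets_hold(t_ff_prop_min=100, t_logic_min=0, t_hold=40))
```

when the second check fails for some path, that is the case where the
toolchain has to pad the path with extra delay.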

>
> l.