[libre-riscv-dev] buffered pipeline

Wed Mar 13 02:38:02 GMT 2019

On Tue, Mar 12, 2019 at 3:11 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> the strategy I'm planning on using for the simple barrel processor is just
> to have the pipeline never stop, if we encounter a reason an instruction
> can't proceed in the current cycle, it is shunted into a delay pipeline to
> be retried the next time around.

 dan's post contains some other strategies that may help here.  i will
be implementing the IEEE754 FPU pipeline as a non-stoppable design
(potentially adding detection to see if anything is in any stage, and
stopping the whole pipe if it isn't), with a variation of the
single-stage buffered pipe to take *multiple* inputs (multiple strobe
lines) and multiplex a given input group to the output (along with its
multiplexer ID).

 dan, this is probably extremely similar to wishbone or AXI N-to-1 bus
arbitration.

 that's what this is about:
https://git.libre-riscv.org/?p=ieee754fpu.git;a=blob;f=src/add/nmigen_add_experiment.py;h=f53037d1a88c912566cd13fd32db1945346a1751;hb=HEAD#l81

 except... due to using john dawson's STB/ACK strategy, it can only
handle one incoming set of operands every 2 clock cycles.

 my point is, jacob: to handle the delay-shunting you'll almost
certainly need to deploy the exact same strategy (and hence could use
exactly the code that i am writing).

 the requirements of a barrel processor (with a delay phase) are:

 * to have a round-robin test of whether an instruction shall be
passed into the pipeline
 * to have no delays except if an instruction cannot proceed
 * if an instruction cannot proceed, it must not be lost (buffered)
 * all other instructions must continue unaffected
 * on detection of no longer being busy, the buffered instruction must
rejoin the round-robin scheduling
 * it must be possible for MULTIPLE instructions to be busy (and buffered).

so you need an *array* of instruction store/delay buffers, an *array*
of STB and BUSY lines to look after them, where unstalled instructions
are to be multiplexed to a single output of data, STB, and BUSY.

that's *exactly* what i am working on, right now.

the code that i'm writing specifically meets these very precise
requirements, with the exception that i am using a priority encoder
instead of a round-robin selection strategy.
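
 a minimal behavioural sketch in python (not the nmigen code linked above) of the N-to-1 arrangement described here: an array of per-input buffers, a strobe per buffer, and a selector that forwards one unstalled entry plus its multiplexer ID to the single output. the names priority_select, round_robin_select and mux_cycle are hypothetical, chosen only for this illustration.

```python
def priority_select(stb):
    """Priority encoder: index of the lowest-numbered asserted strobe, or None."""
    for i, s in enumerate(stb):
        if s:
            return i
    return None


def round_robin_select(stb, last):
    """Round-robin: start searching one past the previously granted index."""
    n = len(stb)
    for k in range(1, n + 1):
        i = (last + k) % n
        if stb[i]:
            return i
    return None


def mux_cycle(buffers, out_busy, select):
    """One clock of an N-to-1 multiplexer over per-input buffers.

    buffers  -- list; each entry is a buffered operand group, or None if empty
    out_busy -- True if the downstream pipeline cannot accept anything
    select   -- function mapping the strobe list to a granted index (or None)

    Returns (out_stb, out_data, mux_id) and empties the granted buffer.
    Entries that are not granted simply stay buffered for a later cycle.
    """
    stb = [b is not None for b in buffers]
    idx = None if out_busy else select(stb)
    if idx is None:
        return (False, None, None)
    data = buffers[idx]
    buffers[idx] = None
    return (True, data, idx)


bufs = ["a", None, "c"]
print(mux_cycle(bufs, False, priority_select))  # (True, 'a', 0)
print(mux_cycle(bufs, False, priority_select))  # (True, 'c', 2)
```

 swapping the selector for round_robin_select (with the last granted index carried between cycles) gives the fairness a barrel processor would want; a plain priority encoder can starve the higher-numbered inputs.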

> For stallable pipelines, I think we should name the pipeline control
> signals pred_sending, succ_sending, pred_accepting and succ_accepting.

 funnily enough i added prefix letters as the first thing when writing
the first unit test, i named them i_p_stb, o_n_stb, o_p_busy and
i_n_busy, and wrote this ascii art which is now in the docstring:

        stage-1   i_p_stb  >>in   stage   o_n_stb  out>>   stage+1
        stage-1   o_p_busy <<out  stage   i_n_busy <<in    stage+1
        stage-1   i_data   >>in   stage   o_data   out>>   stage+1
                              |             |
                              +------->  process
                              |             |
                              +-- r_data ---+

 the shortened names need a second's thought, however i believe
they're clear, and, crucially, do not result in line-wrap to use them.
also, "STB" for "Strobe" is a standard hardware convention
synchronously indicating "data ready right now".

> A simple example stage:
>
> module stage(clk, rst, pred_sending, pred_accepting, pred_data,
> succ_sending, succ_accepting, succ_data);
>     input clk;
>     input rst;
>     input pred_sending;
>     output pred_accepting;
>     input [63:0] pred_data;
>     output succ_sending;
>     input succ_accepting;
>     output [63:0] succ_data;
>
>     reg data_valid;
>     reg [63:0] data;
>     wire next_data_valid;
>
>     assign succ_sending = data_valid;
>     assign pred_accepting = ~data_valid | succ_accepting;
>     assign next_data_valid = pred_sending | (~succ_accepting & data_valid);
>
>     assign succ_data = data + 1; // stage operation
>
>     initial data_valid = 0;
>     initial data = 0;
>
>     always @(posedge clk or posedge rst) begin
>         if(rst) begin
>             data_valid <= 0;
>             data <= 0;
>         end
>         else begin
>             data_valid <= next_data_valid;
>             data <= pred_data;
>         end
>     end
> endmodule

 from what i understand, data will be lost, here, under certain
conditions. or, it will be sub-optimal (result in unnecessary delays).
i'm not skilled enough in logic analysis to identify which.
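
 one way to check is a cycle-by-cycle behavioural model of the quoted verilog. the sketch below is a python transcription as i read the module (QuotedStage and run_quoted are hypothetical names), and it assumes an upstream that keeps driving pred_sending and pred_data with its next item until pred_accepting goes high. under that assumption the model suggests data is lost on a stall: "data <= pred_data" has no enable, so the held value is overwritten while data_valid stays set.

```python
class QuotedStage:
    """Behavioural transcription of the quoted 'stage' module (an assumption)."""

    def __init__(self):
        self.data_valid = False
        self.data = 0

    @property
    def succ_sending(self):
        return self.data_valid

    @property
    def succ_data(self):
        return self.data + 1  # the "stage operation"

    def pred_accepting(self, succ_accepting):
        return (not self.data_valid) or succ_accepting

    def clock(self, pred_sending, pred_data, succ_accepting):
        # Mirrors the always block: both registers update on every edge.
        self.data_valid = pred_sending or ((not succ_accepting) and self.data_valid)
        self.data = pred_data  # unconditional, as in the quoted code


def run_quoted(items, accepting_pattern):
    """Feed items through the stage; return what the successor accepted."""
    stage = QuotedStage()
    pending = list(items)
    received = []
    for succ_accepting in accepting_pattern:
        if stage.succ_sending and succ_accepting:
            received.append(stage.succ_data)
        pred_sending = bool(pending)
        pred_data = pending[0] if pending else 0
        # The upstream only moves on once the stage says it accepted the item.
        if pred_sending and stage.pred_accepting(succ_accepting):
            pending.pop(0)
        stage.clock(pred_sending, pred_data, succ_accepting)
    return received


# Successor stalls for two cycles right after the first item is latched.
print(run_quoted([10, 20, 30], [True, False, False, True, True, True, True]))
# prints [21, 21, 31]; a lossless stage would give [11, 21, 31]
```

 in this run the first item (10) never reaches the successor and the second (20) is delivered twice. with no stalls at all the model passes every item through once, so the problem only shows up when succ_accepting drops while the stage holds valid data.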

 dan's original post makes it clear that there are 4 cases involved
(it's not quite as straightforward as it first appears).  there's a
situation where the input has valid data (and the next stage is busy
so a stall must happen), yet because this is all based on clocks,
there's not yet been an opportunity to *tell* the input "please stop
sending".

 so due to that one-clock delay where you are *going* to tell the
input "please stop sending", you absolutely must buffer the input
data, otherwise it's irrevocably lost.  at the same time, you tell the
input that on the next clock, "please stop sending".

 now, when the next stage is no longer busy, the processing must
"flip" to process the *stored* data, *not* the incoming data.  the
stage's attention is therefore effectively multiplexed between the
input and the buffer.

 in other words it's quite a complex state machine, for such a
seemingly-innocuously-simple set of requirements.
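
 the cases above can be sketched as a small behavioural model in python. this is an illustration only, not the nmigen implementation: the class name BufferedStage and the helper run are hypothetical, the signal names follow the ascii art earlier in this mail, and the sender is assumed to hold its data whenever it sees o_p_busy asserted.

```python
class BufferedStage:
    """Single-stage buffered pipe: one output register plus one spare buffer."""

    def __init__(self, process):
        self.process = process
        self.o_n_stb = False   # output register holds valid data
        self.o_data = None
        self.o_p_busy = False  # tells the previous stage "please stop sending"
        self.r_data = None     # buffered input, valid only while o_p_busy

    def clock(self, i_p_stb, i_data, i_n_busy):
        """One clock edge; all inputs are sampled just before the edge."""
        if not i_n_busy:
            if self.o_p_busy:
                # next stage free again: flip to the *stored* data first
                self.o_n_stb, self.o_data = True, self.process(self.r_data)
                self.r_data, self.o_p_busy = None, False
            else:
                # normal flow: incoming data goes straight through
                self.o_n_stb = i_p_stb
                self.o_data = self.process(i_data) if i_p_stb else None
        elif not self.o_n_stb:
            # next stage busy, but our output register is empty: fill it
            self.o_n_stb = i_p_stb
            self.o_data = self.process(i_data) if i_p_stb else None
        elif i_p_stb and not self.o_p_busy:
            # stalled, and input arrived before we could say "stop": buffer it
            self.r_data, self.o_p_busy = i_data, True
        # otherwise: stalled with nothing new accepted, hold everything


def run(stage, items, busy_pattern):
    """Drive the stage with a well-behaved sender and a stalling receiver."""
    pending, received = list(items), []
    for i_n_busy in busy_pattern:
        if stage.o_n_stb and not i_n_busy:
            received.append(stage.o_data)
        i_p_stb = bool(pending)
        i_data = pending[0] if pending else None
        if i_p_stb and not stage.o_p_busy:
            pending.pop(0)  # the stage takes this item on the coming edge
        stage.clock(i_p_stb, i_data, i_n_busy)
    return received


busy = [False, False, True, True, False, False, False, False, False, False]
print(run(BufferedStage(lambda x: x + 1), [1, 2, 3, 4, 5], busy))
# prints [2, 3, 4, 5, 6]
```

 the item that arrives during the first stalled cycle lands in r_data, o_p_busy goes high one clock later, and when the receiver frees up the stage drains r_data before accepting anything new, so order is preserved and nothing is dropped.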

l.