[Libre-soc-dev] video assembler

Tue May 11 16:11:27 BST 2021


On Tue, May 11, 2021 at 2:13 PM Lauri Kasanen <cand at gmx.com> wrote:

> If it can't measure even roughly what is better, that is no false
> objection. It's essential to writing optimized codecs. Even at that
> level we could point to the measurements and say "in theory this is x%
> faster for this use, everything else being the same". It's far from
> what could be done with a higher-level simulator, not upstreamable in
> any way, but not useless.

i do get that: we are about... probably... 8 months away from that and
only have 6 months of time available.  we *have* to start doing...
*something*.

> Ie, without it, there is nothing for me to write. The grant was about
> writing optimized codecs, incl new instructions.

it's possible to justify it like this: unless we have unit tests
which, in at least some bare minimum capacity, demonstrate the
functionality of the new instructions, we cannot even *begin the
process* of writing optimised codecs.  so there is, indeed,
"something to write".

it's also reasonable to make some assumptions about what the hardware
might do, as long as they're documented as part of the creation of the
algorithm.

it doesn't have to be perfect first time, Lauri.

> > 2) they help show where things should go for the next phase (full
> > algorithms)
>
> I disagree. A blindly written inner loop is of zero use for the actual
> code. It most likely needs a full rewrite once proper measurements can
> be done.

that's fine - we expected that anyway.

allow me to illustrate with an example.  let us take something simple
like audio clipping.  this would be, at its most basic, pretty much a
single instruction (vectorised).  copy sample @ 32bit into 16bit
Vector with SVP64 "saturation" enabled:

    setvli 16
    sv.mv/sw=32/dw=16/satu  r8.v, r24.v
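as a rough reference model of what that one instruction is expected to do, here is a minimal python sketch.  it is an assumption-laden illustration, not the SVP64 specification: the function name `saturate_narrow` is hypothetical, and whether the 32-bit source is treated as signed or unsigned is an assumption that would need checking against the spec.  the intent is only to show how expected values for a unit test could be generated.

```python
def saturate_narrow(samples, dst_bits=16, signed=False):
    """Clamp each sample to the range of a dst_bits-wide element.

    Hypothetical reference model: signed=False mimics an unsigned
    saturation mode, signed=True a signed one (assumption).
    """
    if signed:
        lo, hi = -(1 << (dst_bits - 1)), (1 << (dst_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << dst_bits) - 1
    return [min(max(s, lo), hi) for s in samples]


# 16 elements, mirroring the vector length requested by "setvli 16"
src = [-5, 0, 1, 1000, 65535, 65536, 1 << 20, (1 << 31) - 1] * 2
dst = saturate_narrow(src)
```

values already inside the 16-bit range pass through unchanged; anything outside is pinned to the nearest limit, which is exactly the "clipping" behaviour wanted for audio.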

now let us go through four different scenarios:

1) the present TestIssuer which, for SVP64 instructions, has an IPC
somewhere barely above 0.05
2) a single-issue in-order system with no SIMD back-end for SVP64
3) a single-issue in-order system with a SIMD back-end for SVP64
4) an eight-way issue out-of-order system with QTY 8 SIMD Function Units.

now let's go through the 4 cases, looking at the number of samples
each would generate:

1) this would generate approximately 0.05 samples per hardware clock
2) this would generate 1.0 samples per hardware clock
3) this would likely generate 2.0 samples per hardware clock because
the SIMD back-end could fit 2x 32-to-16 SIMD operations per cycle
4) this would likely generate 16.0 samples per hardware clock because
the 8 SIMD Function Units could between them fit 8x2=16 32-to-16 SIMD
operations per cycle.

(i am assuming above that the LD/ST engines are capable of keeping up,
in each case)
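the arithmetic behind those four figures can be sketched in a few lines of python.  this is a deliberately crude model under the same assumption as above (LD/ST keeps up); the function `samples_per_clock` and its parameter names are made up for illustration, and the per-unit figures are the rough estimates from this message, not measurements.

```python
def samples_per_clock(rate, simd_units=1, lanes_per_unit=1):
    """Crude throughput estimate.

    rate:           operations accepted per clock by each unit (assumed)
    simd_units:     number of units working in parallel (assumed)
    lanes_per_unit: 32-to-16 element operations per unit per clock (assumed)
    """
    return rate * simd_units * lanes_per_unit


scenarios = {
    "1) TestIssuer": samples_per_clock(0.05),
    "2) in-order, no SIMD": samples_per_clock(1.0),
    "3) in-order, SIMD": samples_per_clock(1.0, lanes_per_unit=2),
    "4) out-of-order, 8 SIMD units": samples_per_clock(
        1.0, simd_units=8, lanes_per_unit=2),
}

for name, rate in scenarios.items():
    print(f"{name}: {rate} samples/clock")
```

the point of the model is that only the hardware parameters change between rows: the instruction being issued is the same in all four.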

question: would the actual *code* change (be optimised) in each case?
*no it would not*.  why? because for that inner loop, the simple
32-bit input, 16-bit output, SVP64 saturation is the one and *only*
fastest (optimised) code in each and every case.

question: if the LD/ST engines were unable to keep up, in one case,
would the actual *code* be changed? no it would not.  why? because no
amount of changes would help.

> I understand the time constraints, and that you want unit tests.
> However I can't justify those under the video grant, as they will be
> practically useless for the grant's purpose.

we decide what the scope is.  they can be justified as being a phase
along the way: a sub-milestone.

> For unit tests, it's not useful to have inner video data either. It's
> superfluous, a waste of time, when good tests test boundary conditions
> and a few samples between.

ok, in a pre-existing environment, where the hardware has already been
100% developed, is stable, and known categorically 100% to work, this
is a reasonable expectation.

in our case i have found repeatedly that it is simply unreasonable and
impractical: we *have* to have intermediary testing, in order to
ascertain if there are any interactions between instructions (for
example) that cause incorrect results to be generated.  i will be
giving a talk about this exact thing, at ICS2021
https://meep-project.eu/events/ics-2021

basically we have done:

* modules plus unit tests
* pipeline using above modules, plus unit tests on the pipeline
* pipeline with a Computational Unit front-end, plus running the exact
same unit tests
* a core that uses all the pipelines and only NOW adds register files,
but has no "Issue" engine, running the exact same unit tests
* ***FINALLY*** an actual "processor" which has an Instruction Fetch /
Issue / Execute engine, connected to the core, that runs *the exact
same unit tests*

the development process went from the lowest level up to the higher
level, and if anything breaks (which it does), the unit tests allow us
to track down precisely where that breakage is.

at the lower levels, the unit tests are divided down so that they can
only be executed on a given pipeline.  obviously, there is no point
trying to pass an ADD operation to the Logical pipeline: it can't do
it.  at the higher levels, we can add in more "general" unit tests
that mix-and-match instructions from different pipelines.
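the idea of running *the exact same unit tests* at every level can be sketched roughly as below.  this is not the actual Libre-SOC test infrastructure: `PipelineHarness`, `CoreHarness`, `run_tests` and the test-case tuples are all hypothetical stand-ins, and the "hardware" is just python arithmetic.  it only shows the shape of the approach: one shared list of cases, with each level skipping what it cannot execute.

```python
import operator

# hypothetical shared test cases: (pipeline, op, a, b, expected)
TEST_CASES = [
    ("alu", "add", 2, 3, 5),
    ("alu", "add", 0xFFFFFFFF, 1, 0x100000000),
    ("logical", "and", 0b1100, 0b1010, 0b1000),
    ("logical", "or", 0b1100, 0b1010, 0b1110),
]

OPS = {"add": operator.add, "and": operator.and_, "or": operator.or_}


class PipelineHarness:
    """Stand-in for one pipeline: accepts only its own operations."""

    def __init__(self, name):
        self.name = name

    def accepts(self, pipeline):
        return pipeline == self.name

    def execute(self, pipeline, op, a, b):
        return OPS[op](a, b)


class CoreHarness:
    """Stand-in for a core that routes each operation to a pipeline."""

    def __init__(self, pipelines):
        self.pipelines = {p.name: p for p in pipelines}

    def accepts(self, pipeline):
        return pipeline in self.pipelines

    def execute(self, pipeline, op, a, b):
        return self.pipelines[pipeline].execute(pipeline, op, a, b)


def run_tests(harness, cases):
    """Run every case the harness accepts; return (passed, skipped)."""
    passed = skipped = 0
    for pipeline, op, a, b, expected in cases:
        if not harness.accepts(pipeline):
            skipped += 1  # e.g. no point passing ADD to the Logical pipeline
            continue
        assert harness.execute(pipeline, op, a, b) == expected
        passed += 1
    return passed, skipped


alu = PipelineHarness("alu")
logical = PipelineHarness("logical")
core = CoreHarness([alu, logical])
results = {
    "alu": run_tests(alu, TEST_CASES),
    "logical": run_tests(logical, TEST_CASES),
    "core": run_tests(core, TEST_CASES),
}
```

if a case passes on the individual pipeline but fails on the core, the breakage has to be in whatever the core added, which is what makes the tracking-down possible.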

> I'm not opposed to writing tests, I know there's not many people
> around, but under which budget could they be justified?

the NLnet Video one.  we have done this already a number of times:
worked towards a goal in incremental steps by writing tests that
confirm the lower levels before moving on to the higher-level goals,
as above.

now, what we can *NOT* do, definitively, is, say, take the (new)
bitmanip NLnet grant and somehow "re-allocate" its budget towards the
creation of low-level video tasks... unless by a coincidence there
happened to be a bitmanip instruction that was needed *by* Video (that
would be fine).

it is however perfectly reasonable to justify an incremental
development - even such a "tiny" one - that proves that the
instructions (which *LATER* will *BECOME* optimal) are even
"functional at a level where they complete tasks of *any* kind" that
happen to be related to Audio / Video.

there must be something that's pretty basic which we can start from,
involving for example, pixel data manipulation.  conversion from 32bpp
to 16bpp or... something.
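to make that concrete, here is a minimal python sketch of one possible 32bpp to 16bpp conversion.  the pixel layouts are assumptions picked purely for illustration (XRGB8888 in, RGB565 out, top byte ignored), and the function name `xrgb8888_to_rgb565` is hypothetical.  something like this could serve as the reference model that a first unit test compares the instruction output against.

```python
def xrgb8888_to_rgb565(pixel):
    """Convert one 32bpp XRGB8888 pixel to 16bpp RGB565 by truncation.

    Assumed layouts: input 0xXXRRGGBB (top byte ignored),
    output rrrrrggg gggbbbbb.
    """
    r = (pixel >> 16) & 0xFF
    g = (pixel >> 8) & 0xFF
    b = pixel & 0xFF
    # keep the top 5 bits of red and blue and the top 6 bits of green
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


# boundary conditions plus one sample in between
pixels = [0x00000000, 0x00FFFFFF, 0x00FF0000, 0x0000FF00,
          0x000000FF, 0x00808080]
converted = [xrgb8888_to_rgb565(p) for p in pixels]
```

it is small enough to be a sub-step on its own, yet it already exercises shifts, masks and element-width narrowing together.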

> Or, to put it another way, is it a lot of work for you to have the
> sub-steps counted separately?

not at all.  we run the entire project based on sub-steps.  the
largest sub-division we've done is 35 sub-steps, some of which have
*yet more* subdivisions:
https://bugs.libre-soc.org/show_bug.cgi?id=383

> That would enable the work to not be a
> waste from the video grant's perspective.

yes.

l.