[Libre-soc-bugs] [Bug 228] VP9 optimizations

Fri Sep 30 09:26:13 BST 2022

https://bugs.libre-soc.org/show_bug.cgi?id=228

--- Comment #6 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
from Konstantinos:

Ok, so a little overview first. The goal was to port VP8 and VP9 code into
SVP64 assembly.
In order to establish that any new code works properly with VP8 and VP9, the
library itself includes a testsuite which provides multiple unit tests that
ensure a function will always produce the same result, and thus the same
bit-exact video on all platforms.
So, the best way to ensure that our SVP64 VP8 and VP9 code works is by running
the testsuite. But we cannot do that, because there is no hardware capable of
SVP64 yet, not even in FPGA form. The only thing available is the Python
Power Simulator (pypowersim), which is actually the reference simulator.

Now, we *could* in theory run the whole VP8 & VP9 test suite inside this
simulator, but since it's at least 2000-5000 times slower, what now takes a
few seconds in the testsuite would take about 10 hours! So we
had to find an alternative -until we can run on actual hardware, FPGA, or a
faster simulator (cavatools?).
What I came up with, and it proved to be working great, is to run the whole
testsuite in native mode, and run *only* the SVP64 functions inside pypowersim.
For this reason, I created a wrapper function that provides the glue code from
the native C code to pypowersim, which runs in Python. I'm using the Python C
API to literally construct the arguments that are needed by the function in
question. For example, this function, which can be seen in variance_ref.c:

uint32_t vpx_get4x4sse_cs_svp64(const uint8_t *src_ptr, int src_stride,
                                const uint8_t *ref_ptr, int ref_stride)

Following the ABI calling convention, this takes 4 arguments in registers
(GPRs) 3, 4, 5 and 6, and returns its result in register (GPR) 3.
So, for this case I wrote a function vpx_get4x4sse_cs_svp64() in
variance_svp64_wrappers.c, which does exactly that, in the following steps:

* Sets up the Python C API for use inside C
* Constructs the pypowersim state object, with Python Objects for memory,
registers, mmu, svstate, etc.
* Creates the python object arguments to be passed to the simulator as
registers
* Calls the function -which btw, has been compiled in SVP64/LibreSOC mode by
the fork of binutils assembler that Dmitry has been working on
* This actually starts the simulator and RUNS in LibreSOC/SVP64 mode, just as
if we had started the process manually!
* After a while, it completes and returns a result object, which we read and
get the result from the expected register (GPR 3).
* We return it to the testsuite, where it is checked against the reference
value. If it is the same, our function produced the right result; if not, we
keep trying until the problem is fixed!
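
As a rough illustration of the steps above, the following C sketch shows only
the register marshalling and the result check. It does not use the Python C
API or pypowersim: sim_state, run_in_sim, fake_sse_kernel and
get4x4sse_wrapped are hypothetical stand-ins, and get4x4sse_ref assumes that
the routine computes a sum of squared differences over a 4x4 block, as its
name suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the simulator state: only the GPR file. */
typedef struct {
    uint64_t gpr[32];
} sim_state;

/* Hypothetical type for "the code that runs inside the simulator". */
typedef void (*sim_kernel)(sim_state *st);

/* Place up to 8 integer arguments in GPRs 3..10, run the kernel,
 * and read the return value back from GPR 3. */
uint64_t run_in_sim(sim_kernel kernel, const uint64_t *args, int nargs)
{
    sim_state st = {{0}};
    assert(nargs >= 0 && nargs <= 8);
    for (int i = 0; i < nargs; i++)
        st.gpr[3 + i] = args[i];
    kernel(&st);
    return st.gpr[3];
}

/* Assumed reference: sum of squared differences over a 4x4 block. */
uint32_t get4x4sse_ref(const uint8_t *src, int src_stride,
                       const uint8_t *ref, int ref_stride)
{
    uint32_t sse = 0;
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            int diff = src[r * src_stride + c] - ref[r * ref_stride + c];
            sse += (uint32_t)(diff * diff);
        }
    }
    return sse;
}

/* Stand-in kernel: reads its arguments from GPRs 3..6 and writes GPR 3.
 * In the real setup this would be the SVP64 assembly run by the simulator,
 * and the pointers would refer to simulator memory, not host memory. */
void fake_sse_kernel(sim_state *st)
{
    const uint8_t *src = (const uint8_t *)(uintptr_t)st->gpr[3];
    int src_stride = (int)st->gpr[4];
    const uint8_t *ref = (const uint8_t *)(uintptr_t)st->gpr[5];
    int ref_stride = (int)st->gpr[6];
    st->gpr[3] = get4x4sse_ref(src, src_stride, ref, ref_stride);
}

/* Wrapper with the same signature as the function under test, so the
 * testsuite can call it without knowing a simulator is involved. */
uint32_t get4x4sse_wrapped(const uint8_t *src, int src_stride,
                           const uint8_t *ref, int ref_stride)
{
    uint64_t args[4] = {
        (uint64_t)(uintptr_t)src, (uint64_t)(int64_t)src_stride,
        (uint64_t)(uintptr_t)ref, (uint64_t)(int64_t)ref_stride
    };
    return (uint32_t)run_in_sim(fake_sse_kernel, args, 4);
}
```

The point of keeping the wrapper's signature identical to the native one is
that the testsuite itself needs no changes: it calls the wrapper and compares
the returned value with its reference as usual.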

Similarly for other functions, we pass a buffer or have a buffer returned,
which means we have to copy data to/from the simulator.
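
The copy step might look like the following sketch. The names sim_mem,
sim_write_block and sim_read_block are hypothetical, and the flat byte array
merely stands in for the simulator's memory object; the real wrapper goes
through the Python C API instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_MEM_SIZE 4096

/* Hypothetical flat memory standing in for the simulator's memory. */
typedef struct {
    uint8_t bytes[SIM_MEM_SIZE];
} sim_mem;

/* Copy `rows` rows of `width` bytes from a strided host buffer into
 * simulated memory at `addr`, keeping the same stride so the routine
 * under test can be given unchanged stride arguments. */
void sim_write_block(sim_mem *m, size_t addr, const uint8_t *host,
                     int stride, int width, int rows)
{
    assert(rows > 0 && width > 0 && stride >= width);
    assert(addr + (size_t)(rows - 1) * stride + width <= SIM_MEM_SIZE);
    for (int r = 0; r < rows; r++)
        memcpy(&m->bytes[addr + (size_t)r * stride],
               host + (size_t)r * stride, (size_t)width);
}

/* Copy a block back out of simulated memory into a strided host buffer. */
void sim_read_block(const sim_mem *m, size_t addr, uint8_t *host,
                    int stride, int width, int rows)
{
    assert(rows > 0 && width > 0 && stride >= width);
    assert(addr + (size_t)(rows - 1) * stride + width <= SIM_MEM_SIZE);
    for (int r = 0; r < rows; r++)
        memcpy(host + (size_t)r * stride,
               &m->bytes[addr + (size_t)r * stride], (size_t)width);
}
```

Only the `width` bytes of each row are copied, so any padding between rows in
the host buffer is left untouched in both directions.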

The end result is that this method has proved invaluable and sped up
development by at least an order of magnitude. I plan to use the same method
for all other audio/video codecs; I'm currently doing AV1, which should be
done in the next few days. I've made it reusable, so it can be used in any
other similar software that needs to be ported to SVP64.

Now, it would be possible to port some functions directly without this
approach, but it would be a much slower process, and we would not know whether
the code actually works until we tried to integrate it with the library itself
and its testsuite. And we would have to wait for actual hardware for that.
