[Libre-soc-dev] clamping/saturation semantics

Fri Dec 11 19:26:56 GMT 2020

On Fri, Dec 11, 2020, 10:59 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
wrote:

> On Fri, Dec 11, 2020 at 7:23 AM Lauri Kasanen <cand at gmx.com> wrote:
> >
> > On Thu, 10 Dec 2020 18:07:23 +0000
> > Luke Kenneth Casson Leighton <lkcl at lkcl.net> wrote:
> >
> > > does this look like a reasonable general-purpose algorithm, applicable
> > > to all operations, whether exts*, mr, or 2/3 arithmetic ops?
> > >
> > > * saturation is done on the result at the **source** elwidth
> >
> > This would be a problem. For many cases, dst width != src width.
> >
> > Say you have gathered stuff to u16 and then want to scale that into
> > u8, clamped. That's a u16 * u16 = u8 op - different src and dst
> > elwidths.
>
> ok, so this example is why i asked.  2 bits, signed-unsigned, is not
> enough.  hence the addition of two *more* bits specifying the
> saturation quantity: 2^8, 2^16, 2^32.  actually then the table may be:
>
> * none / reserved
> * byte s/u
> * half s/u
> * word s/u
>
> which only needs 3 bits, one reserved encoding.
>
>
> the issue is: that's starting to become an awful lot of bits,
> relatively speaking.  yes we happen to have 2 spare, yes these can be
> passed as state/context just like immediates down to the FUs, yes we
> can make those 3 bits mean something different for FP and logical FUs.
>
> however we may need those bits for something else.  it is all a balance.
>
> Jacob pointed out when we had similar pressure on swizzle that one
> possibility was to create a mv.swizzle operation, only taking 1 src,
> and performing macro-op fusion.  it's expensive but doable.

mv.swizzle would be a 1 src 1 dest op.
we could fuse it with a succeeding op:

    mv.swizzle v1, v2, swizzle_immed
    add v1, v1, v3

to make:

    add v1, swizzled-v2, v3

importantly the dest of the mv.swizzle must be the same as the dest of the
following op, otherwise the fused op would have too many destination regs
to be efficient.
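
As a rough illustration of the fusion rule above, here is a minimal Python
sketch of a decoder pass that pairs a mv.swizzle with the op that follows it.
The Instr record, the fuse_swizzles name and the "swizzle(reg, imm)" operand
notation are all hypothetical, invented for this example; they are not
Libre-SOC code. Fusion is only attempted when both ops share a destination and
the second op reads that destination.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record, for illustration only."""
    op: str
    dest: str
    srcs: Tuple[str, ...]
    imm: Optional[str] = None  # swizzle immediate, only used by mv.swizzle


def fuse_swizzles(prog: List[Instr]) -> List[Instr]:
    """Fuse 'mv.swizzle d, s, imm' with a following op that writes d and reads d."""
    out = []
    i = 0
    while i < len(prog):
        cur = prog[i]
        nxt = prog[i + 1] if i + 1 < len(prog) else None
        fusable = (
            cur.op == "mv.swizzle"
            and nxt is not None
            and nxt.op != "mv.swizzle"
            and nxt.dest == cur.dest   # same dest: fused op keeps one dest reg
            and cur.dest in nxt.srcs   # the following op consumes the swizzle
        )
        if fusable:
            swizzled = f"swizzle({cur.srcs[0]}, {cur.imm})"
            srcs = tuple(swizzled if s == cur.dest else s for s in nxt.srcs)
            out.append(Instr(nxt.op, nxt.dest, srcs))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


prog = [
    Instr("mv.swizzle", "v1", ("v2",), "swizzle_immed"),
    Instr("add", "v1", ("v1", "v3")),
]
fused = fuse_swizzles(prog)
```

If the following op wrote a different register, the pair would be left alone,
since the fused op would then need two destination registers.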

>
> a similar case applies here.  in other words we have three options:
>
>   * create a suite of operations that take
>      clamp ranges as part of the op.
>
> or:
>
>    * perform 16 bit arith
>    * copy src u16 clamped into u8 dest
>    * copy u8 src into u16 dest
>
> or:
>
>    * perform 16 bit arith @ 8bit clamp
>
> the last is clearly the most favourable, the first the least.
>

I think you might misunderstand.
From what I recall from playing around with audio code, the operation
required is the following, translated to a sequence of 1 or more
instructions:

* read inputs at src size
* scale to account for src/dest size mismatch
  (if src size == dest size, this step isn't needed)
* perform arith at higher precision
* clamp result to dest size (not src size, not any other size)
* write result to dest

Jacob