[Libre-soc-dev] SVP64 parallel map-reduce idea
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Sat Jun 12 00:31:50 BST 2021
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
On Sat, Jun 12, 2021 at 12:05 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> nope -- remember there's 1 predicate bit for each whole subvector, not one
> per subvector element. so f32x4x64 needs 64 predicate bits, each bit
> predicates a whole f32x4 subvector.
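
To make the "one predicate bit per whole subvector" point concrete, here is a minimal sketch in Python. The function name `predicated_elements` and the flat element numbering are assumptions made for illustration only; they are not taken from the SVP64 spec.

```python
# Sketch only: one predicate bit per whole subvector (assumed flat layout).
SUBVL = 4   # elements per subvector (the "x4" in f32x4x64)
VL = 64     # number of subvectors (the "x64")


def predicated_elements(pred_mask, vl=VL, subvl=SUBVL):
    """Return the flat element indices enabled by pred_mask.

    Bit i of pred_mask enables or disables all subvl elements of
    subvector i together, so vl predicate bits cover vl*subvl elements.
    """
    enabled = []
    for i in range(vl):
        if (pred_mask >> i) & 1:
            # the whole subvector i is enabled as a unit
            enabled.extend(range(i * subvl, (i + 1) * subvl))
    return enabled


# 64 predicate bits cover all 256 f32 elements of f32x4x64
all_on = predicated_elements((1 << VL) - 1)
```

With a mask of `0b101` and three subvectors, subvectors 0 and 2 are enabled in full and subvector 1 is skipped in full.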
the value - the temporary result - not the SPR. i wasn't referring to
the predicate. ah you thought i was. ok. right. yes, that's solved
already: CR bits you can read on-the-fly, and for INT pred, well,
it's just one INT read. which is tolerable.
> A reduce over subvectors should produce
> a subvector as the result,
that's an additional separate option - reminder of the spec:
00 1 SVM CRM subvector reduce mode, SUBVL>1
> we could get away with recalculating just the predicate bits -- that should
> be waay easier to do.
yes, and if progress has got past the first layer there's no need for
the predicate. which, in turn, tells us that it should be possible to
have a re-entrant algorithm which reads only the bits of the predicate
you may be somewhat behind on development, Jacob: even the
TestIssuer reads predicate bits in advance. CRs shouldn't have
to, but Cesar found it easier to "accumulate" all CR bits into a
mask. it has to be re-created every time on interrupt return,
horribly slow, but the code's clear and readable, which is more
> > there is a specific advantage of allowing the target (destination)
> > vector to be used as temporary storage: combined with a fixed
> > (predictable) algorithm there *might* be use-cases for the *whole*
> > temporary results rather than throwing them away.
> yup! I can imagine how the temporary results are used to produce something
> like a prefix-sum.
yehyeh. it gets particularly complex if the sources overlap, and... hmmm...
overlapping with the destination, that's... it gets complicated there.
for "standard" SVP64, Program Order is used to make iterative sums.
x[i+1] = y[i] + x[i] by issuing an instruction where destination overlaps
but allowing the same thing for parallel reduce? can't get my head round it.
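
A rough sketch of the "standard" (non-reduce) case, where the element loop runs strictly in Program Order and the destination overlaps a source offset by one register. The flat list standing in for the register file, and the name `sv_add_in_program_order`, are assumptions for illustration only, not the simulator's API.

```python
# Sketch only: a flat list stands in for the register file.
def sv_add_in_program_order(regs, rt, ra, rb, vl):
    """Element-by-element vector add, executed strictly in order.

    Because each element is written before the next one is read,
    choosing rt == ra + 1 gives x[i+1] = x[i] + y[i], a running sum.
    """
    for i in range(vl):
        regs[rt + i] = regs[ra + i] + regs[rb + i]
    return regs


regs = [0] * 12
regs[8:12] = [1, 2, 3, 4]          # y lives in regs 8..11
# x lives in regs 0..4; destination starts one register after source
sv_add_in_program_order(regs, rt=1, ra=0, rb=8, vl=4)
running_sum = regs[0:5]
```

The same overlap under a parallel (tree-ordered) schedule would read elements before or after they were updated depending on the schedule, which is the part that gets complicated.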
> > if you read the sections on Reduce mode you'll see it's clearly
> > laid out, the ability to use the vector result for temporary
> > calculations.
> yup! that's what the pseudo-code I wrote does.
ah, star. it's quite late, i'm going to have to read it several times,
possibly even make it executable and print out some "results"
(plus indices) to get it fully.
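
As a starting point for "make it executable and print out results plus indices", here is a minimal sketch of a pairwise (tree) reduction that uses the vector itself as temporary storage. This is NOT Jacob's pseudo-code and not the algorithm from the spec; the function `tree_reduce_in_place`, its predicate handling, and the step log are all assumptions made for illustration.

```python
# Sketch only: a generic pairwise reduction, not the SVP64 schedule.
from operator import add


def tree_reduce_in_place(vec, op, pred=None):
    """Reduce vec pairwise, leaving partial results in the vector.

    Returns (vector after reduction, list of (dst, src) index pairs).
    Bit i of pred marks element i as participating; masked-out
    elements are skipped, and a lone right-hand value is moved left.
    """
    n = len(vec)
    if pred is None:
        pred = (1 << n) - 1
    vec = list(vec)
    live = [bool((pred >> i) & 1) for i in range(n)]
    steps = []
    step = 1
    while step < n:
        for i in range(0, n, step * 2):
            j = i + step
            if j >= n:
                continue
            if live[i] and live[j]:
                vec[i] = op(vec[i], vec[j])   # combine the pair
                steps.append((i, j))
            elif live[j]:
                vec[i] = vec[j]               # only the right side is live
                live[i] = True
                steps.append((i, j))
        step *= 2
    return vec, steps


result, steps = tree_reduce_in_place([1, 2, 3, 4, 5, 6, 7, 8], add)
for dst, src in steps:
    print(f"vec[{dst}] <- op(vec[{dst}], vec[{src}])")
print(result)
```

Element 0 ends up holding the full reduction, while the other slots keep whatever partial results were written along the way; those leftovers are the "temporary results" that a fixed, predictable schedule might make useful.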