[Libre-soc-isa] [Bug 794] SVP64 REMAP for utf8

Mon Aug 22 19:48:37 BST 2022

https://bugs.libre-soc.org/show_bug.cgi?id=794

--- Comment #12 from Luke Kenneth Casson Leighton <lkcl at lkcl.net> ---
found one that is obvious and simple to understand.

https://codereview.stackexchange.com/questions/159814/utf-8-validation/159832#159832

class Solution(object):
    def validUtf8(self, data):
        """
        Check that a sequence of byte values follows the UTF-8 encoding
        rules.  Does not check for canonicalization (i.e. overlong encodings
        are acceptable).

        >>> s = Solution()
        >>> s.validUtf8([197, 130, 1])
        True
        >>> s.validUtf8([235, 140, 4])
        False
        """
        data = iter(data)
        for leading_byte in data:
            leading_ones = self._count_leading_ones(leading_byte)
            if leading_ones in [1, 7, 8]:
                return False        # Illegal leading byte
            for _ in range(leading_ones - 1):
                trailing_byte = next(data, None)
                if trailing_byte is None or trailing_byte >> 6 != 0b10:
                    return False    # Missing or illegal trailing byte
        return True

    @staticmethod
    def _count_leading_ones(byte):
        for i in range(8):
            if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
                return i
        return 8

*now* it is obvious that validation starts by counting the number
of leading 1s in the first byte of each character, then checks that the
top 2 bits of each following byte are 0b10.
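the two per-byte tests can be looked at in isolation. a minimal sketch, assuming
8-bit values in the range 0..255; the names `leading_ones`, `is_continuation` and
`trace` are made up here for illustration and are not from the linked answer.
the leading-ones count is computed as the leading-zeros count of the inverted
byte, which gives the same result as the loop in `_count_leading_ones`.

```python
def leading_ones(byte):
    """Leading 1 bits of an 8-bit value = leading 0 bits of its inverse."""
    inverted = ~byte & 0xFF          # flip the bits, keep only the low 8
    return 8 - inverted.bit_length()


def is_continuation(byte):
    """True when the top two bits of the byte are 0b10."""
    return byte >> 6 == 0b10


def trace(data):
    """Return (byte, leading_ones, is_continuation) for every byte."""
    return [(b, leading_ones(b), is_continuation(b)) for b in data]
```

calling `trace([197, 130, 1])` shows that 197 has two leading ones (so one
trailing byte is owed), 130 is a continuation byte, and 1 starts a new
single-byte character.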

this simplicity is utterly destroyed by optimised SIMD implementations.
attempting to even understand the validation algorithm from looking at
optimised SIMD not only wastes time, it risks making mistakes.

SimpleV is such a different paradigm we literally have to go back to
scalar unoptimised implementations.

this algorithm is quite fascinating: the leading byte contains a count
of the number of following bytes that need to be checked for a match
with 0b10------. needs some thought.
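one way to picture that: the leading byte loads a down-counter, and every byte
seen while the counter is non-zero must match 0b10------. a minimal scalar
sketch under that assumption, with the hypothetical names `leading_ones` and
`valid_utf8_flat`; it is only a restatement of the quoted algorithm with an
explicit counter, not a SimpleV or REMAP implementation.

```python
def leading_ones(byte):
    """Leading 1 bits of an 8-bit value (assumes 0 <= byte <= 255)."""
    return 8 - (~byte & 0xFF).bit_length()


def valid_utf8_flat(data):
    """Same acceptance rules as validUtf8, written with an explicit counter."""
    data = list(data)
    # element-wise test, independent for every byte
    is_cont = [b >> 6 == 0b10 for b in data]
    remaining = 0                    # continuation bytes still owed
    for byte, cont in zip(data, is_cont):
        if remaining:
            if not cont:
                return False         # illegal trailing byte
            remaining -= 1
        else:
            ones = leading_ones(byte)
            if ones in (1, 7, 8):
                return False         # illegal leading byte
            remaining = max(ones - 1, 0)
    return remaining == 0            # otherwise a trailing byte is missing
```

the only serial dependency left is `remaining`; the `is_cont` test itself does
not depend on neighbouring bytes.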
