[Libre-soc-dev] Libre/Open blockchain / cryptographic ASICs

Sun Feb 14 00:27:36 GMT 2021

Good morning Luke,

> > Another point to ponder is test modes.
> > In mass production you need test modes.
>
> > (Sure, an attacker can try targeted ESD at the `TESTMODE` flip-flop repeatedly, but this risks also flipping other scan flip-flops that contain the data that is being extracted, so this might be sufficient protection in practice.)
>
> if however the ASIC can be flipped into TESTMODE and yet it carries on
> otherwise working, an algorithm can be re-run and the exposed data
> will be clean.

But in most testmodes I have seen (and designed) all clocks are driven externally from a different pin (usually the serial interface) when in testmode.
If the CPU clock is now controlled by the attacker, how do you run any kind of algorithm?

(This could be an artifact of how my old design company designed testmodes, YMMV.)

Really the concern here is that testmode is entered while the CPU has key material loaded into registers, or caches, then it is possible, if those registers/caches are in the scan chain, to exfiltrate data.
Does not matter if the chip is now in a mode that cannot execute algorithms, if it was doing any kind of computation involving privkeys (including say deriving its public key so that PC-side hardware can get the `xpub`) then key material may be in scan chain registers, clock is now controlled by the attacker, and possibly scan mode as well (which disables combinational circuitry thus none of your algorithms can run).

>
> > If you are really going to open-source the hardware design then the layout
> > is also open and attackers can probably target specific chip area for ESD
> > pulse to try a flip-flop upset, so you need to be extra careful.
>
> this is extremely valuable advice. in the followup [1] you describe a
> gating method: this we have already deployed on a couple of places in
> case the Libre Cell Library (also being developed at the same time by
> Staf Verhaegen of Chips4Makers) causes errors: we do not want, for
> example, an error in a Cell Library to cause a permanent HI which
> locks us from being able to perform testing of other areas of the
> ASIC.
>
> the idea of being able to actually randomly flip bits inside an ASIC
> from outside is both hilarious and entirely news to me, yet it sounds
> to be exactly the kind of thing that would allow an attacker to
> compromise a hardware wallet. potentially destructively, mind, but
> compromise all the same.

Certainly outside of the the old company design philosophy I have seen many experts strongly protest against a design philosophy which assumes that any flip-flop could randomly switch.

Yet the design philosophy within the old company always had this assumption, supposedly (according to in-company lore) because previous engineers had actually found the hard way that random bitflips did occur, and for e.g. automobile chips the risk was too great to not have strong mitigations:

* State machines had to force unused states into known states.
  For example a state machine with 3 states needs 2 bits of state, but 2 bits of state is actually 4 states, so there is a 4th unused state.
  * Not all state machines needed this rule but during planning we had to identify state machines that needed this rule, and often we just targeted having 2^n states just to ensure that there were no unused states.
  * I even suggested the use of ECC encoding for important state machines and it was something being investigated at the time I left.
* State machines that otherwise did not need the above rule were strongly encouraged to clear state at display frame vsync.
  This ensured that any unexpected states they had would only last up to one display frame, which was considered acceptable.
* Flip-flops that held settings were periodically reloaded at each display frame vsync from a flash memory (which apparently as a lot more immune to bitflips).

It could be an artifact as well that the company had its own in-house foundry rather than delegate out to TSMC or whatnot --- maybe the technology we had was just suckier than state-of-the-art so bitflips were more common.

The reason why this stuck to mind is because at one time we had a DS test where shooting the ESD gun could sometimes cause the chip to fail (blank display) until reset, when the expectation was that at most it would flicker for one display frame.
And afterwards we had to go through the entire RTL looking for which state machine or settings register was the culprit.
I even wrote a little Verilog-PLI plugin that would inject deterministically random data into flip-flops in the model to try to catch it.
Eventually we found a bunch of possible root causes, and on the next DS iteration testing we had fun shooting the chip with the ESD gun over and over again and sighing in relief that the display was not failing for more than one frame.

The chip was a display driver for automotive, apparently at the time cars were starting to transition to using LCD for things like speedometer and accelerometer rather than physical dials.
And of course the display suddenly switching off while the car is running at high speed due to some extra-powerful pulse elsewhere was potentially dangerous and could distract the driver, so that is why we were paranoid about such sudden bitflips potentially leading to such massive cascade of upsets as to make the display fail permanently.

I think being excessively cautious for cryptographic chips should be standard as well.
And certainly test mode exfiltration of data is always an issue, JTAG is very standard way of reading memory.

Regards,
ZmnSCPxj