# Cryptographic hardware engineering

### Francesco Regazzoni

### Why do we need to know this?

Focus on Digital Systems

# City...

æ

イロン イロン イヨン イヨン

### Formula 1...

Francesco Regazzoni 14 August 2018, Calgary, Canada

æ

(日) (四) (王) (王) (王)

### Contents



### Side Channel

Francesco Regazzoni 14 August 2018, Calgary, Canada

э

3

- From an high level specification (usually abstract) to detailed design by decomposition and successive refinement
- Answers to the question: "What do we build?"
- Handles the complexity

From detailed primitive blocks to a larger and more complex functional block by combining primitives blocks

Answers to the question: "How do we build it?"

Focuses on the details

 Designs usually proceed from both directions simultaneously

### Let's focus on the Top-Down part

### The dream

### From power Point to GDSII

### Let's focus on the Top-Down part

### The dream

### From power Point to GDSII

### Why not?

### It is too difficult!

### From Idea to ASIC: the design flow....

э

글 > - 4 글 >

# Once upon a time....

Francesco Regazzoni 14 August 2018, Calgary, Canada

- Functionality: what the system will do
- Performance: establish the solution space

### Cost: set the limits to what is possible

Function: what is expected from the chip

Performance: how fast? what area? how much power?

I/O requirements: how does the chip integrates with the rest of the system? Design: propose an architecture

- Design Entry: Description in HDL
- Verification: Ensuring that is doing what is supposed to do
- Synthesis: Mapping to target technology

Scan Insertion: adding structure for testing

### **Design: Main Architectural Transformation**

Francesco Regazzoni 14 August 2018, Calgary, Canada

- Have the block diagram
- Have your state diagram
- Define the interface
- Define the block one by one



# Tips about HDL coding

# HDL is not C++

- You are drawing block diagrams
- Always keep in mind the actual implementation
- FSM "can" be written behavioral

# Separate Datapath and Control

- Datapath performs operations on data
- Multiplexers control alternative pats for operations
- FSM control multiplexers and enable signals

# Keep it simple

- Use simple constructs
- Do not use a single sequential process
- Avoid complex sequential codes (nested ifs...)

# Front-End Design Flow (ASIC)

Francesco Regazzoni 14 August 2018, Calgary, Canada

B) (B)

# Depends on the HDL code....

# And on the tool version! (Also Significantly!)

### **Back-End Design Flow (ASIC)**

Placement: putting gates on a chip

- Clock Tree insertion: Distributing the clock signal
- Routing: Connecting the gates
- Chip finishing: PADs, ...

DRC and LVS: ensure the physical layout is correct

### **Back-End Design Flow (ASIC)**

Francesco Regazzoni 14 August 2018, Calgary, Canada

### Contents





э

< A

Area: total area occupied by the circuit
Throughput: data processed per time unit
Power: Peak power for operation
Energy: Total energy to complete a task
Latency: time to process a given data item
Time to Market: total time to complete the design

### How to Improve Performance

### Use More Advanced Technology

- Easy way
- Not Always Feasible
- Cost Increases
- Other Problems (adapting, missing IPs)

### Make a Better Implementation

- Hard (requires bran)
- Not Always Feasible (limit in what can be done)
- Academia Should Concentrate on this

# Academia vs Industry

# Both

Have Similar Design Flows

- Use Same Tools
- Define Performance Similarly

### ...but in Industry

Chip has to work within the specification

- goal is to sell a product
- Cost and Time to design are important
- Manufacturability is crucial
- Conservative design approach is followed

(日) (同) (三) (三)

# EDA Tools are designed for Industry

### Industry

- Circuit has to work in the worst case
- Constraints are used to define specification
- EDA tools ensure that the circuit performance meets specification

# Academia

- Interested only on limits (smallest, fastest...)
- Performance numbers needed for comparison
- Not easy to get fair numbers from tools
- Constraints are "used" to explore design space

- Overhead is not always included (Large I/O bandwidth, Very fast clocks, Frequency scaling)
- Additional Verification Efforts (Interfaces between different clock domains,...)
- Testing overhead (difficult to test asynchronous designs)

## Quality of Results Depends on Design State

# Synthesis

- First level of Physical Properties
- Routing overhead unknown
- Timing and Power models are mean values

# Post Layout

- Routing is know, more accurate timing
- Not all post-layout include proper provision for testing and power

# Actual Measurements

- Real proof of concept
- Performance affected by practical problem, worst then expected
- Require a costly test infrastructure

### How much silicon area will be used for the circuit?

# Why Care

- Silicon Cost
- Feasibility (will it fit?)

# "Should" be easy to determine

- Does not change during/after fabrication
- Reliable and accurate post-layout numbers

# Units

- ► mm<sup>2</sup>: correct, technology dependent
- GE: commonly used, not accurate

### Total Area divided by the 2-input NAND

### Coarse measure: varies at least 10% between different technologies

- Numbers reported after synthesis (routing missing, power missing, clock missing,...)
- Post layout gate counts (still missing power, routing, ...)
- Smallest rectangle where the block fits?
- Total die area (area can be imposed mini@sic, if the chip has more than one design?)

- Synthesis does not tell the whole story (routing overhead can be between 10% and 500% design dependent)
- Post layout numbers are more reliable
- Total chip is easy to determine (not often the whole chip is of interest)

### Speed

### How fast is my circuit?

- Latency: time required to perform some action, measured in units of time (nanoseconds, clock periods, ...).
- Throughput: the number of such actions executed per unit of time, measured in units of whatever is being produced.
- Critical path: the path in the entire design with the maximum delay

Clock frequency = 
$$\frac{1}{T_{clk}}$$
 ( $t_{clk} \ge$  critical path)

### Timing Depends on Parasitic Load

### How do I know it?

- Ignore (in modern process it is dominant)
- Use models for synthesis (statistical look up tables function of fan out)
- Extracted (need the place and routed circuit, different level of accuracy)

- Specify Voltage and Frequency
- Specify how circuit activity is determined
- Specify how the power was measured
- Specify the corner

### Contents

Hardware Desigr



3 Reconfigurable Devices

Cryptographic Algorithms

Two interesting approaches



э

< A

- Field Programmable Gate Arrays (FPGAs)
- Reconfigurable hardware devices
- Trade offs between ASIC and microprocessors
- Current progresses allow to store a complete SoC on FPGA

- Non recurring engineering costs
- Reduced time to market
- Always the latest technology

# **High Level View**

- Configurable blocks (look-up-tables)
- Configurable routing matrix
- Input/Output blocks
- Memory configuration
- Advanced processing elements (DSP, whole processors)

#### **Generic Architecture**

Francesco Regazzoni 14 August 2018, Calgary, Canada

## **Generic Block**

Francesco Regazzoni 14 August 2018, Calgary, Canada

3

# **Generic** Routing

Francesco Regazzoni 14 August 2018, Calgary, Canada

#### **Specific Block**

Francesco Regazzoni 14 August 2018, Calgary, Canada

< 口 > < 同

#### Contents

Cryptographic Algorithms



ヨト æ

< 口 > < 同

- Well defined (we have golden model)
- Simple (based on few operations)
- Easy to test (few vectors guarantee a good coverage)

### What Cryptographic Algorithm needs?

- Fast Speed (sometimes)
- Low Latency
- Low Area
- Low Power
- Low Energy

# Not many exceptions

- Scalability (number of rounds, state size, word size power of 2)
- Be careful of auxiliary functions (padding with counters of 64 bits)
- Do not base algorithms on the capabilities of current processors

#### State Size Determines the Minimum Area

- Flip-Flops (unlimited access, large)
- Shift Registers (small, access limited)
- RAM arrays (small, access by word)

#### Tricks Example: scan-registers

Francesco Regazzoni 14 August 2018, Calgary, Canada

#### Tricks Example: S-box

Francesco Regazzoni 14 August 2018, Calgary, Canada

#### Trick Example FPGA: Xilinx Virtex-5

Larger and more complex devices

- Embed multipliers, RAM memories, full processors
- Slice:
  - 4 flip-flops
  - 4 6-input LUTs
  - 2 multiplexers (F7MUX and F8MUX)
- Slices can be configured as distributed RAMs

Very suitable for mapping 8-bit input Look-up-tables

### Sbox of Oswald and Schramm

- S-box: inversion over *GF*(2<sup>8</sup>) and affine mapping (easy to mask):
  - Transform the masked input to the composite field  $GF(2^4) \times GF(2^4)$
  - invert it there efficiently
  - transform it back to the  $GF(2^8)$

## Sbox of Oswald and Schramm

- S-box: inversion over *GF*(2<sup>8</sup>) and affine mapping (easy to mask):
  - Transform the masked input to the composite field  $GF(2^4) \times GF(2^4)$
  - invert it there efficiently
  - transform it back to the  $GF(2^8)$

#### Oswald and Schramm approach for software:

- ▶ perform the inversion in  $GF(2^4)$  combining XOR operations with four pre-computed tables:  $T_{d_1}$ ,  $T_{d_2}$ ,  $T_m$  and  $T'_{inv}$ .
- ▶ Transform the result back to  $GF(2^8)$  with two additional tables:  $T'_{map}$  (from  $GF(2^8)$  to  $GF(2^4) \times GF(2^4)$ ) and  $T'_{map^{-1}}$  (from  $GF(2^4) \times GF(2^4)$  to  $GF(2^8)$ )
- The affine transformation is integrated with the isomorphic mapping

- Virtex-5 maps well 8-bit input Look-up-tables
- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$

- Virtex-5 maps well 8-bit input Look-up-tables
- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{inv}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{inv}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{inv}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$   $\checkmark$
- $T'_{map}$ : input an element of  $GF(2^8)$ , output an element of  $GF(2^4)$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{inv}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$   $\checkmark$
- $T'_{map}$ : input an element of  $GF(2^8)$ , output an element of  $GF(2^4) \checkmark$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{inv}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$   $\checkmark$
- $T'_{map}$ : input an element of  $GF(2^8)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{map^{-1}}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{inv}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$   $\checkmark$
- $T'_{map}$ : input an element of  $GF(2^8)$ , output an element of  $GF(2^4) \checkmark$
- $\blacksquare \ T'_{map^{-1}}:$  input two elements of  $GF(2^4),$  output an element of  $GF(2^4) \checkmark$

Virtex-5 maps well 8-bit input Look-up-tables

- $T_{d_1}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_{d_2}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T_m$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4) \checkmark$
- $T'_{inv}$ : input two elements of  $GF(2^4)$ , output an element of  $GF(2^4)$   $\checkmark$
- $T'_{map}$ : input an element of  $GF(2^8)$ , output an element of  $GF(2^4) \checkmark$
- $\blacksquare \ T'_{map^{-1}}:$  input two elements of  $GF(2^4),$  output an element of  $GF(2^4) \checkmark$

#### All these tables have input size of 8 bits: fit perfectly our target FPGA

#### Contents

Hardware Design
 Performance
 Reconfigurable Devices

Cryptographic Algorithms

Two interesting approaches



5

э

3



æ

#### **Processor Customization**



э

イロト イポト イヨト イヨト

#### **Processor Customization**



э

イロト イポト イヨト イヨト

#### **Processor Customization**



э

イロト イポト イヨト イヨト

#### Symmetric-key cryptography

- bit permutation
- rotation
- addition modulo  $2^n$  (in ARX-based ciphers)
- $\blacksquare S = A \oplus B \oplus C_{in}C_{out} = AB + (A+B)C_{in} = AB \oplus AC_{in} \oplus BC_{in}$
- addition modulo 2, i.e. exclusive OR (XOR)
- substitution box (S-box)
- quadratic functions (for threshold implementations)  $f(x, y, z, w) = a_0 \oplus a_1 x \oplus a_2 y \oplus a_3 z \oplus a_4 w \oplus a_{12} x y \oplus a_{13} x z \oplus a_{14} x w \oplus a_{23} y z \oplus a_{24} y w \oplus a_{34} z w$

### cFA

- Fine-grained reconfigurable architecture
- Matrix of configurable Full Adder (cFA) cells
- One cFA (6 inputs and 2 outputs) can be programmed to 8 functions
- 8 functions are in standard cell libraries: re-use ASIC synthesis tools



|                  |        |        | 3         | x - 9x a         | rea decr | ease      |         |                  |        |
|------------------|--------|--------|-----------|------------------|----------|-----------|---------|------------------|--------|
|                  |        |        |           |                  |          |           |         |                  |        |
|                  |        |        |           |                  |          |           |         |                  |        |
| cipher           | Xilinx |        |           |                  |          | cFA array |         |                  |        |
|                  | SLICEL | SLICEM | area      | critical<br>path | conf     | cFA       | area    | critical<br>path | conf   |
| AES-128          | 404    | 0      | 179,053   | 4.95             | 105,040  | 624       | 27,886  | 4.97             | 14,976 |
| PRESENT-80-D3    | 70     | 0      | 31,024    | 1.65             | 18,200   | 190       | 8,491   | 0.95             | 4,560  |
| PRESENT-80-D2    | 74     | 0      | 32,797    | 1.65             | 19,240   | 139       | 6,212   | 1.64             | 3,336  |
| SPECK-128/128    | 130    | 0      | 57,616    | 5.99             | 33,800   | 294       | 13,139  | 9.91             | 7,056  |
| NOEKEON          | 168    | 0      | 74,458    | 3.30             | 43,680   | 288       | 12,871  | 2.41             | 6,912  |
| KTANTAN-64       | 80     | 0      | 35,456    | 2.2              | 20,800   | 119       | 5,318   | 1.95             | 2,856  |
| AES-128-TI       | 2,076  | 120    | 1,092,679 | 2.2              | 582,640  | 3,058     | 136,656 | 1.54             | 73,392 |
| PRESENT-80-TI    | 350    | 0      | 155,120   | 1.65             | 91,000   | 638       | 28,511  | 1.01             | 15,312 |
| SPECK-128/128-TI | 436    | 0      | 193,235   | 1.10             | 113,360  | 1556      | 69,535  | 1.33             | 37,344 |
| NOEKEON-TI       | 846    | 0      | 374,947   | 2.75             | 219,960  | 952       | 42,543  | 2.12             | 22,848 |

Ξ.

### Contents





э

< A

## What are Physical Attacks

#### Physical attacks recover secrets by exploiting the implementation

### Physical Attacks: the Weakest Point

#### Active Passive Fault Injection Power Analysis Timing Analysis

## Why Physical Security is so Important Today?

### Long Time Ago Past Present

## Why Physical Security is so Important Today?

Long Time Ago Past Present

### Mainframes Personal Computer Pervasive

From Axel Poschmann

Paul Kocher, Joshua Jaffe, and Benjamin Jun, "**Differential Power Analysis**", in Proceedings of *Advances in Cryptology-CRYPTO'99*, Santa Barbara, California, USA, August 15-19, 1999.

### Cheap

Powerful

#### Examples

- Simple Power Analysis
- Differential Power Analysis

# Power consumption **independent** from processed key dependent data



# Power consumption **independent** from processed key dependent data



# Power consumption **independent** from processed key dependent data



# Power consumption **independent** from processed key dependent data



### They can be implemented in Software or in Hardware

## Approach One

## **INPUT**:

## HDL Description

- Technological Library (area, timing, power)
- Synthetic Library (multipliers...)
- Constraints

## **OUTPUT**:

- DPA resistant Gate Level Netlist
- Estimation of area, timing, power (!)
- Timing constraints

## Approach Two

## INPUT:

- HDL Description
- Technological Library (area, timing, power)
- Synthetic Library (multipliers...)
- Constraints (limit the gates)

## **OUTPUT**:

Gate Level Netlist

# "Cell Substitution":

- Replace cells
- Reload in the tool for correct area and timing constraints

## Place and Route Input and Output

### INPUT:

- Gate level description of the circuit
- Physical view of the library (pin placement, ...)
- Constraints from synthesis

## **OUTPUT**:

- Gate Level Netlist
- Position and interconnection of the gates
- Estimation of area, timing, power (!)

### What About Resistance Assessment?

## Simulation is an Early Estimation

## Last Words



э

3

< 口 > < 同

### Thank you for your attention!

## mail: regazzoni@alari.ch