

#### The Next Generation of Cryptanalytic Hardware

Microsoft Friday Security Talks, 8/19/05

David Hulton <dhulton@picocomputing.com> Embedded Systems Engineer, Pico Computing, Inc. Chairman, ToorCon Information Security Conference

Greg Edvenson <gedvenson@picocomputing.com> Embedded Systems Engineer, Pico Computing, Inc.

Pico Computing, Inc. Research & Development


### Goals

- Introduction
  - What is an FPGA?
  - Gate Logic / Verilog
- Impulse C
  - Introduction
  - Demo
- Cracking w/ Hardware
  - History
- Chipper
  - Lanman/NTLM
  - Demo
  - Performance



#### Introduction to FPGAs



- Field Programmable Gate Array
  - Lets you prototype ICs
  - Code translates directly into circuit logic





The basic building blocks of any computing system

| Gate | Expression |
|------|------------|
| not  | ~a         |
| or   | a \| b     |
| and  | a & b      |
| xor  | a ^ b      |
| nor  | ~(a \| b)  |
| nand | ~(a & b)   |
| xnor | ~(a ^ b)   |




Build other types of logic, such as adders:
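As a rough illustration of how an adder falls out of the basic gates, here is a minimal C sketch of a one-bit full adder that uses only xor, and, and or. The `adder_out` struct and the `full_adder` name are illustrative assumptions, not taken from the slides.

```c
#include <assert.h>

/* One-bit full adder built only from basic gate operations.
   The struct and function names are illustrative, not from the slides. */
typedef struct {
    unsigned sum;   /* low bit of a + b + carry_in */
    unsigned carry; /* carry out */
} adder_out;

adder_out full_adder(unsigned a, unsigned b, unsigned carry_in)
{
    adder_out out;

    /* Inputs must be single bits. */
    assert(a <= 1 && b <= 1 && carry_in <= 1);

    out.sum = a ^ b ^ carry_in;                  /* two xor gates */
    out.carry = (a & b) | (carry_in & (a ^ b));  /* two and gates, one or gate */
    return out;
}
```

Calling `full_adder(1, 1, 0)` yields a sum bit of 0 and a carry of 1, which is binary 10, i.e. 1 + 1 = 2.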



#### What is Gate Logic?



- Which can be chained together:
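Chaining can be sketched in C as a 4-bit ripple-carry adder, where each stage's carry-out feeds the next stage's carry-in. This is only a software model of the idea; the function name `ripple_add4` and the 4-bit width are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Add two 4-bit values by chaining four one-bit full adders.
   Each stage's carry-out feeds the next stage's carry-in. */
uint8_t ripple_add4(uint8_t a, uint8_t b, unsigned *carry_out)
{
    uint8_t sum = 0;
    unsigned carry = 0;
    int i;

    assert(a <= 0xF && b <= 0xF);   /* only four bits per operand */
    assert(carry_out != 0);

    for (i = 0; i < 4; i++) {
        unsigned abit = (a >> i) & 1u;
        unsigned bbit = (b >> i) & 1u;
        unsigned s = abit ^ bbit ^ carry;                 /* sum bit of this stage */
        carry = (abit & bbit) | (carry & (abit ^ bbit));  /* carry into next stage */
        sum |= (uint8_t)(s << i);
    }
    *carry_out = carry;
    return sum;
}
```

In hardware all four stages exist side by side as gates; the loop here only stands in for that chain.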





### What is Gate Logic?



- And can be used for storing values:
  - Feedback
  - Flip-Flop / Latch







JK Flip-Flop
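The behavior of a JK flip-flop on a clock edge can be summarized by its standard next-state equation, Q' = (J & ~Q) | (~K & Q). A minimal C sketch of that equation follows; the function name `jk_next` is an illustrative assumption.

```c
#include <assert.h>

/* Next state of a JK flip-flop on a clock edge:
   J=0,K=0 hold; J=1,K=0 set; J=0,K=1 reset; J=1,K=1 toggle. */
unsigned jk_next(unsigned j, unsigned k, unsigned q)
{
    assert(j <= 1 && k <= 1 && q <= 1);  /* single-bit inputs only */
    return (j & (q ^ 1u)) | ((k ^ 1u) & q);
}
```

The stored bit `q` is fed back into the logic, which is the feedback that lets gates hold a value.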









This can be implemented with electronics:

NOT



AND



#### What is an FPGA?



- An FPGA is an array of configurable gates
  - Gates can be connected together arbitrarily
  - States can be configured
  - Common components are provided
  - Any type of logic can be created

### What is an FPGA?



- Configurable Logic Blocks (CLBs)
  - Registers (flip flops) for fast data storage
  - Logic Routing
- Input/Output Blocks (IOBs)
  - Basic pin logic (flip flops, muxes, etc)
- Block Ram
  - Internal memory for data storage
- Digital Clock Managers (DCMs)
  - Clock distribution
- Programmable Routing Matrix
  - Intelligently connects all components together



### FPGA Pros / Cons



- Pros
  - Common Hardware Benefits
    - Massively parallel
    - Pipelineable
  - Reprogrammable
    - Self-reconfiguration
- Cons
  - Size constraints / limitations
  - More difficult to code & debug

### Introduction to FPGAs



- Common Applications
  - Encryption / decryption
  - AI / Neural networks
  - Digital signal processing (DSP)
  - Software radio
  - Image processing
  - Communications protocol decoding
  - Matlab / Simulink code acceleration
  - Etc.


## **Types of FPGAs**


- Antifuse
  - Programmable only once
- Flash
  - Programmable many times
- SRAM
  - Programmable dynamically
  - Most common technology
  - Requires a loader (doesn't keep state after power-off)

## Types of FPGAs

- Xilinx
  - Virtex-4
  - Optional PowerPC Processor
- Altera
  - Stratix-II





### Verilog

- Hardware Description Language
- Simple C-like Syntax
- Like Go: easy to learn, difficult to master





#### One bit AND

- C: `u_char and1(u_char a, u_char b) { return((a & 1) & (b & 1)); }`
- Verilog: `module and1(a, b, c); input a, b; output c; assign c = a & b; endmodule`
- Gate







#### 8 bit AND

- C: `u_char and8(u_char a, u_char b) { return(a & b); }`
- Verilog: `module and8(a, b, c); input [7:0] a, b; output [7:0] c; assign c = a & b; endmodule`
- Gate







#### 8 bit Flip-Flop

- C
- Verilog

```
module ff(clk, a, c);
input clk;
input [7:0] a;
output [7:0] c;
reg [7:0] c;

always @(posedge clk) c <= a;
endmodule
```

- Gate
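In C, the same 8-bit register can be modeled by latching the input only when the clock goes from 0 to 1. This is a minimal software sketch of the `ff` module's behavior; the `ff8` struct and `ff8_step` function are illustrative names, not from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the ff module: c holds its value until a rising clock edge. */
typedef struct {
    uint8_t c;         /* registered output */
    unsigned prev_clk; /* clock level seen on the previous call */
} ff8;

void ff8_step(ff8 *ff, unsigned clk, uint8_t a)
{
    assert(ff != 0 && clk <= 1);
    if (clk == 1 && ff->prev_clk == 0)  /* posedge clk */
        ff->c = a;
    ff->prev_clk = clk;
}
```

Unlike the Verilog, the C model has to remember the previous clock level explicitly in order to detect the edge.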



#### ROM

- C: `u_char rom[] = {2, 3, 5, 7};`
- Verilog
- Gate








### **CoDeveloper™ with Impulse C™**



- Providing automatic conversion of C-language routines to FPGA hardware
  - Automated extraction of parallel hardware from untimed C
  - Proprietary optimization features
    - Instruction and resource scheduling, loop unrolling and pipelining
- Providing a programming model and tools for system-level parallel programming
  - Allows multiple software and hardware processes to be created, connected and synchronized, all in C-language
  - Programming model designed for software programmers
    - Based on Streams-C, from Los Alamos National Labs

## **Application Domains**



- Applications requiring repetitive computations at very high speed
  - Application can be modularized (partitioned) between hardware and software
  - Dataflow-oriented, high degrees of parallelism
  - Pipelined algorithms (e.g. filters)
- For processing streams of data in real time
  - Imaging and audio
  - Communications
  - Digital Signal Processing (DSP)
  - Encryption and decryption
  - Bioinformatics and supercomputing



Platform libraries support existing FPGA synthesis and embedded compiler environments.

Automatic generation of hardware/software interfaces is optimized for target platforms.



FPGA hardware is automatically created from C language processes

The result? Accelerated software with minimal need for hardware or FPGA design knowledge

### Example: FIR Filter



- Data and coefficients passed into filter via data stream
  - Could also use shared memory
- Algorithm written using untimed, hardware-independent C code
  - Using coding styles familiar to C programmers
- Software test bench written in C to test functionality
  - In software simulation
  - In actual hardware

```
void fir(co_stream filter_in, co_stream filter_out) {
  int32 coef[TAPS]; int32 firbuffer[TAPS];
  int32 nSample, nFiltered, accum, tap;

  // Open the streams declared as the process interfaces
  co_stream_open(filter_in, O_RDONLY, INT_TYPE(32));
  co_stream_open(filter_out, O_WRONLY, INT_TYPE(32));

  // First fill the coef array with the coefficients...
  for (tap = 0; tap < TAPS; tap++) {
    co_stream_read(filter_in, &nSample, sizeof(int32));
    coef[tap] = nSample;
  }

  // Now fill the firbuffer array with the first n values...
  for (tap = 1; tap < TAPS; tap++) {
    co_stream_read(filter_in, &nSample, sizeof(int32));
    firbuffer[tap-1] = nSample;
  }

  // Now we have an almost full buffer and can start processing waveform samples...
  while (co_stream_read(filter_in, &nSample, sizeof(int32)) == co_err_none) {
    firbuffer[TAPS-1] = nSample;
    // Perform the filter operation on the current window
    for (accum = 0, tap = 0; tap < TAPS; tap++)
      accum += firbuffer[tap] * coef[tap];
    nFiltered = accum >> 2;
    // Generate the output sample
    co_stream_write(filter_out, &nFiltered, sizeof(int32));
    // Shift the window by one sample
    for (tap = 1; tap < TAPS; tap++)
      firbuffer[tap-1] = firbuffer[tap];
  }

  // When done, close the streams
  co_stream_close(filter_in);
  co_stream_close(filter_out);
}
```



This test can be performed in desktop simulation (using Visual Studio or some other C environment) or can be performed using an embedded processor for the producer/consumer modules.
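For desktop testing, a plain C reference that computes the same sum of products over arrays can be compared against the filter's stream output. The sketch below is an assumption about how such a check might look, not the test bench from the talk; `fir_reference` and the tap count of 4 are illustrative, and it uses standard `int32_t` rather than Impulse C types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPS 4  /* illustrative; the slides do not give the tap count */

/* Reference FIR over arrays; mirrors the streaming loop (sum of products, then >> 2).
   Writes n - TAPS + 1 outputs and returns that count. */
size_t fir_reference(const int32_t coef[TAPS], const int32_t *in, size_t n, int32_t *out)
{
    size_t i, tap, produced = 0;

    assert(coef != NULL && in != NULL && out != NULL);
    if (n < TAPS)
        return 0;   /* not enough samples to fill the window */

    for (i = 0; i + TAPS <= n; i++) {
        int32_t accum = 0;
        for (tap = 0; tap < TAPS; tap++)
            accum += in[i + tap] * coef[tap];
        out[produced++] = accum >> 2;
    }
    return produced;
}
```

A consumer process could collect the filtered samples and compare them element by element with the array this function fills.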

### Impulse C Configuration Function



    void config_fir(void *arg)
    {
        // Stream declarations
        co_stream waveform_raw;
        co_stream waveform_filtered;

        // Process declarations
        co_process producer_process;
        co_process fir_process;
        co_process consumer_process;

        // Stream creation
        waveform_raw = co_stream_create("waveform_raw", INT_TYPE(32), BUFSIZE);
        waveform_filtered = co_stream_create("waveform_filtered", INT_TYPE(32), BUFSIZE);

        // Process creation
        producer_process = co_process_create("producer_process", (co_function)test_producer, 1, waveform_raw);
        fir_process = co_process_create("filter_process", (co_function)fir, 2, waveform_raw, waveform_filtered);
        consumer_process = co_process_create("consumer_process", (co_function)test_consumer, 1, waveform_filtered);

        // Process configuration (hardware): assign processes to hardware elements
        co_process_config(fir_process, co_loc, "PE0");
    }

### **Desktop Simulation**



Impulse C is standard C with the addition of the Impulse C libraries, which means that any standard C development environment can be used for functional verification and debugging.



### Parallel Programming Model



- Communicating Process Model
  - Buffered communication channels (FIFOs) to implement streams
  - Supports dataflow and message-based communications between functional units and local or shared memories
  - Supports parallelism at the application level and at the level of individual processes (via automated scheduling/pipelining)
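The buffered channel idea can be sketched as a small bounded FIFO in plain C. This is only a conceptual model of a stream between two processes, not the Impulse C implementation; the `fifo` type, its functions, and the depth of 4 are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 4  /* illustrative buffer depth */

/* Bounded FIFO standing in for a stream between a producer and a consumer process. */
typedef struct {
    int32_t data[FIFO_DEPTH];
    unsigned head;   /* next slot to read */
    unsigned count;  /* number of buffered items */
} fifo;

/* Returns 1 on success, 0 if the FIFO is full (the producer would stall). */
int fifo_write(fifo *f, int32_t value)
{
    assert(f != 0);
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[(f->head + f->count) % FIFO_DEPTH] = value;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty (the consumer would stall). */
int fifo_read(fifo *f, int32_t *value)
{
    assert(f != 0 && value != 0);
    if (f->count == 0)
        return 0;
    *value = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

The full and empty return values model the back-pressure that keeps a fast producer and a slow consumer in step.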



### An Impulse C™ Process









#### Demonstration

### History of FPGAs and Cryptography



- Minimal Key Lengths for Symmetric Ciphers
  - Ronald L. Rivest (R in RSA)
  - Bruce Schneier (Blowfish, Twofish, etc)
  - Tsutomu Shimomura (Mitnick)
  - A bunch of other ad hoc cypherpunks

## History of FPGAs and Cryptography



| Budget               | Tool      | 40-bits    | 56-bits    | Recommended key length (bits) |
|----------------------|-----------|------------|------------|-------------------------------|
| Pedestrian Hacker    |           |            |            |                               |
| Tiny                 | Computers | 1 week     | infeasible | 45                            |
| \$400                | FPGA      | 5 hours    | 38 years   | 50                            |
| Small Company        |           |            |            |                               |
| \$10K                | FPGA      | 12 min     | 556 days   | 55                            |
| Corporate Department |           |            |            |                               |
| \$300K               | FPGA      | 24 sec     | 19 days    | 60                            |
|                      | ASIC      | 0.18 sec   | 3 hrs      |                               |
| Big Company          |           |            |            |                               |
| \$10M                | FPGA      | 0.7 sec    | 13 hrs     | 70                            |
|                      | ASIC      | 0.005 sec  | 6 min      |                               |
| Intelligence Agency  |           |            |            |                               |
| \$300M               | ASIC      | 0.0002 sec | 12 sec     | 75                            |
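The 40-bit and 56-bit columns are related by simple arithmetic: each extra key bit doubles the exhaustive-search work, so 16 more bits multiply the time by 2^16 = 65,536. A minimal C sketch of that scaling follows; the function name `scale_search_time` is an illustrative assumption.

```c
#include <assert.h>

/* Scale a known exhaustive-search time to a longer key.
   Each extra key bit doubles the work, so time grows by 2^(to_bits - from_bits). */
double scale_search_time(double known_time, int from_bits, int to_bits)
{
    int i;

    assert(known_time >= 0.0 && to_bits >= from_bits);
    for (i = from_bits; i < to_bits; i++)
        known_time *= 2.0;
    return known_time;
}
```

For the \$400 FPGA row, `scale_search_time(5.0, 40, 56)` gives 327,680 hours, or roughly 37 years, which is in line with the table's 38 years.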

## History of FPGAs and Cryptography



- 40-bit SSL is crackable by almost anyone
- 56-bit DES is crackable by companies
- Scared yet?

#### This paper was published in 1996



- 1998
  - The Electronic Frontier Foundation (EFF)
  - Cracked DES in < 3 days
  - Searched ~90,000,000,000 keys/second
  - Cost < \$250,000



- **2001** 
  - Richard Clayton & Mike Bond (University of Cambridge)
  - Cracked DES on IBM ATMs
  - Able to export all the DES and 3DES keys in ~ 20 minutes
  - Cost < \$1,000 using an FPGA evaluation board



- **2002** 
  - Gaël Rouvroy, François-Xavier Standaert, and others from the UCL Crypto Group
  - Implemented a linear cryptanalysis attack on DES
  - Used FPGAs to generate dictionary tables
  - Chosen-plaintext attack can recover key in 10 seconds with 72% success rate



- **2004** 
  - Philip Leong, Chinese University of Hong Kong
  - IDEA
    - 50Mb/sec on a P4 vs. 5,247Mb/sec on Pilchard
  - RC4
    - Cracked RC4 keys 58x faster than a P4
    - Parallelized 96 times on a FPGA
    - Cracks 40-bit keys in 50 hours
    - Cost < \$1,000 using a RAM FPGA (Pilchard)





### Chipper

- Currently Supports
  - Unix DES
  - Windows Lanman
  - Windows NTLM (full-support coming soon)
  - Multiple Cards/FPGAs ;-)

### Lanman Hashes



- Lanman
  - 14-Character Passwords
  - Case insensitive (converted to upper case)
  - Split into two 7-byte halves
  - Each half is used as a DES key to encrypt a static value
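The key preparation steps can be sketched in C as follows. The sketch upper-cases the password, zero-pads it to 14 bytes, splits it into two 7-byte halves, and expands each half into an 8-byte DES key with the low (parity) bit of every byte left clear. The DES encryption of the static value (the constant `KGS!@#$%`) is omitted to keep it short, and the function names are illustrative, not Chipper's.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Spread 56 key bits over 8 bytes, leaving the low (parity) bit of each byte clear. */
static void expand_des_key(const uint8_t in[7], uint8_t out[8])
{
    int i;

    out[0] = in[0] >> 1;
    out[1] = ((in[0] & 0x01) << 6) | (in[1] >> 2);
    out[2] = ((in[1] & 0x03) << 5) | (in[2] >> 3);
    out[3] = ((in[2] & 0x07) << 4) | (in[3] >> 4);
    out[4] = ((in[3] & 0x0F) << 3) | (in[4] >> 5);
    out[5] = ((in[4] & 0x1F) << 2) | (in[5] >> 6);
    out[6] = ((in[5] & 0x3F) << 1) | (in[6] >> 7);
    out[7] = in[6] & 0x7F;
    for (i = 0; i < 8; i++)
        out[i] = (uint8_t)(out[i] << 1);
}

/* Upper-case, zero-pad to 14 bytes, split into two 7-byte halves,
   and expand each half to an 8-byte DES key. */
void lm_prepare_keys(const char *password, uint8_t key1[8], uint8_t key2[8])
{
    uint8_t buf[14] = {0};
    size_t i, len;

    assert(password != NULL);
    len = strlen(password);
    if (len > 14)
        len = 14;   /* only the first 14 characters are used */
    for (i = 0; i < len; i++)
        buf[i] = (uint8_t)toupper((unsigned char)password[i]);

    expand_des_key(buf, key1);
    expand_des_key(buf + 7, key2);
}
```

Because the two halves are hashed independently, a cracker only ever has to search a 7-character space, which is what makes Lanman such an attractive hardware target.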

# Chipper



- Hardware Design
  - Pipeline design
  - Internal cracking engine
    - passwords = lmcrack(hashes, options);
  - Interface over PCMCIA/CompactFlash
  - Can specify cracking options
    - Bits to search
      - e.g. Search 55-bits (instead of 56)
    - Offset to start search
      - e.g. First card gets offset 0, second card gets offset 2\*\*55
    - Typeable/printable characters
    - Alpha-numeric
    - Allows for basic distributed cracking & resume functionality
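The bits-to-search and offset options suggest a simple way to divide a key space among cards. The sketch below shows one possible split for a power-of-two number of cards; the `crack_range` struct and `card_range` function are hypothetical and do not reflect Chipper's actual option layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-card search range; the real Chipper option layout is not shown in the slides. */
typedef struct {
    unsigned bits;    /* number of key bits this card searches */
    uint64_t offset;  /* first key in this card's slice */
} crack_range;

/* Split a key space of total_bits across 2^card_bits cards. */
crack_range card_range(unsigned total_bits, unsigned card_bits, unsigned card_index)
{
    crack_range r;

    assert(total_bits <= 63 && card_bits <= total_bits);
    assert(card_index < ((uint64_t)1 << card_bits));

    r.bits = total_bits - card_bits;
    r.offset = (uint64_t)card_index << r.bits;
    return r;
}
```

With two cards on a 56-bit space, `card_range(56, 1, 0)` searches 55 bits from offset 0 and `card_range(56, 1, 1)` searches 55 bits from offset 2\*\*55, matching the example in the list above. Saving the current offset is also enough to resume an interrupted run.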

### Interface Layout







| Offset | HASHES/PASSWORDS (bits 31:0) |
|--------|------------------------------|
| 0x0    | hash/pass[0][31:0]           |
| 0x4    | hash/pass[0][63:32]          |
| 0x8    | hash/pass[0][95:64]          |
| 0xC    | hash/pass[0][127:96]         |
| 0x10   | hash/pass[1][31:0]           |
| 0x14   | hash/pass[1][63:32]          |
| 0x18   | hash/pass[1][95:64]          |
| 0x1C   | hash/pass[1][127:96]         |
| 0x20   | ···                          |
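The layout in the table follows a regular pattern: each hash/pass entry occupies 128 bits (0x10 bytes), stored as four 32-bit words. A small C helper can compute the byte offset of any word; the name `hash_word_offset` is an illustrative assumption, and the base address of the card's window is not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of one 32-bit word of hash/pass[entry], following the table:
   each entry is 128 bits (0x10 bytes), each word is 4 bytes. */
uint32_t hash_word_offset(unsigned entry, unsigned word)
{
    assert(word < 4);  /* four 32-bit words per 128-bit entry */
    return (uint32_t)(entry * 0x10u + word * 4u);
}
```

For example, `hash_word_offset(1, 2)` returns 0x18, the row holding hash/pass[1][95:64].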

### Password File Cracker






# Chipper



- Software Design
  - GUI and Console Interfaces
  - wxWidgets
    - Windows
    - Linux
    - MacOS X (coming soon)
  - Supports cracking 128 keys in parallel on each card
  - Supports 4x fast mode for just one hash pair
  - Can automatically load required FPGA image
  - Supports multiple card clusters

### Lanman Cracking



- PC (3.0 GHz P4 w/ rainbowcrack)
  - ~ 2,000,000 c/s
- Hardware (Low end FPGA w/ Chipper)
  - 125 MHz = 125,000,000 c/s per core
  - 500 MHz = 500,000,000 c/s for fast mode!

| Type          | P4        | E-12       | 8 E-12      |
|---------------|-----------|------------|-------------|
| 64-characters | 25 days   | 2 hours    | 18 minutes  |
| 48-characters | 3.4 days  | 20 minutes | 1.5 minutes |
| 32-characters | 4.7 hours | 1 minute   | 9 seconds   |
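The P4 column appears consistent with searching every 7-character Lanman half at about 2,000,000 c/s: 64^7 / 2,000,000 is roughly 25 days and 32^7 / 2,000,000 is roughly 4.8 hours. That reading is an inference from the numbers, not something the slides state. A minimal C sketch of the estimate, with the illustrative name `search_seconds`:

```c
#include <assert.h>

/* Seconds to try every password of exactly `length` characters drawn from
   `charset_size` symbols at `rate` candidates per second. */
double search_seconds(unsigned charset_size, unsigned length, double rate)
{
    double keyspace = 1.0;
    unsigned i;

    assert(rate > 0.0);
    for (i = 0; i < length; i++)
        keyspace *= (double)charset_size;
    return keyspace / rate;
}
```

Dividing the result by 86,400 converts it to days; plugging in 500,000,000 c/s for the rate gives figures close to the E-12 column.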

### Pico E-12



- Pico E-12
  - Compact Flash Type-II Form Factor
  - Virtex-4 (LX25 or FX12)
    - 1 Million Gates (~25,000 CLBs)
    - Optional 450 MHz PowerPC Processor
  - 128 MB PC-133 RAM
  - 64 MB Flash ROM
  - Gigabit Ethernet
  - JTAG Debugging Port



E-12 Card

**PicoCrack Demonstration** 



### Demonstration


# OpenCiphers.org

- Sourceforge project
  - Chipper
  - Lanman & NTLM cracking cores
  - Modular Exponentiation
  - A5/2 (for some GSM research)



# **Technology Trends**



- Technology Trends
  - Embedded platforms are either cheap and slow or expensive and fast
  - There will always be a cost factor with regards to crypto
  - This has plagued smart cards, speedpasses, mobile devices, etc.
  - The future is definitely in implementing more advanced cryptanalysis attacks
  - Even as cheap chips get faster, the brute-force workload still grows exponentially with key size

### Hardware Trends



- FPGAs are improving in line with Moore's Law
  - Density Increasing
  - Clock Speed Increasing
  - Components Created and expanded to fit markets
  - Cost Dropping
- Slowly starting to compete with ASICs
- Starting to become cheap enough for consumers
  - FPGA software acceleration will start to become mainstream
  - Fast parallel computing will increase in popularity
- Future Applications
  - Neural Networks & Self-reconfiguration
  - Attacks on WEP/WPA/GSM
  - Analysis and Correlation

### **Conclusions / Shameless Plugs**



- ToorCon 7
  - September 16th-18th, 2005
  - Convention Center, San Diego, CA
  - http://www.toorcon.org
- ShmooCon 2
  - January, 2006
  - Washington DC
  - http://www.shmoocon.org

### Questions ? Suggestions ?



- OpenCiphers
  - http://www.openciphers.org
- OpenCores
  - http://www.opencores.org
- Xilinx
  - ISE Foundation (Free 60-day trial)
  - http://www.xilinx.com
- Pico Computing, Inc.
  - http://www.picocomputing.com