The OpenCiphers Project
This SourceForge project is dedicated to exploring the uses of ASICs, FPGAs and other forms of programmable hardware with modern cryptography. The project is currently headed by David Hulton <dhulton@picocomputing.com>, whose work on it is funded by Pico Computing, Inc. All of the cores and software provided are licensed under the BSD License and are primarily optimized for Pico's hardware platforms, but can be easily adapted to any FPGA-based system. All of the performance specs and information provided are based on the Pico E-12 LO card, which uses a Virtex-4 LX25 FPGA. These are the current projects being worked on within the OpenCiphers Project:
Lanman Brute-Force Password Recovery [CVS]
This set of cores and software provides a mechanism for fast brute-forcing of the password keyspace for Lanman passwords, which are commonly found on Windows machines. It includes a modified DES core (original core provided by Rudolf Usselmann of OpenCores.org and ASICS.ws) and a key generator and comparator engine that scales from cracking 128 password hashes in parallel down to 2, trading the number of hashes checked at once for raw key rate:
Hashes | Cooling | Clock Speed | Cores | Keys per Sec | Key Checks per Sec
128 | Yes | 175MHz | 1 | 175,000,000 | 22,400,000,000
128 | No | 125MHz | 1 | 125,000,000 | 16,000,000,000
2 | Yes | 175MHz | 4 | 700,000,000 | 1,400,000,000
2 | No | 125MHz | 4 | 500,000,000 | 1,000,000,000
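The figures in the table above follow a simple pattern, which can be sketched as below. This is only an illustration of the arithmetic: the function name `cracking_rates` is hypothetical, and the model (one candidate key per core per clock cycle, compared against every loaded hash in that cycle) is an assumption inferred from the numbers, not a description of the core's internals.

```python
def cracking_rates(clock_hz, cores, hashes):
    """Return (keys_per_sec, key_checks_per_sec) for one engine configuration.

    Assumption: each core tests one candidate key per clock cycle and the
    comparator checks it against all loaded hashes at once.
    """
    keys_per_sec = clock_hz * cores
    key_checks_per_sec = keys_per_sec * hashes
    return keys_per_sec, key_checks_per_sec


# The four configurations from the table: (hashes, clock in Hz, cores).
CONFIGS = [
    (128, 175_000_000, 1),
    (128, 125_000_000, 1),
    (2, 175_000_000, 4),
    (2, 125_000_000, 4),
]

for hashes, clock_hz, cores in CONFIGS:
    keys, checks = cracking_rates(clock_hz, cores, hashes)
    print(f"{hashes:>3} hashes, {cores} core(s) @ {clock_hz // 1_000_000}MHz: "
          f"{keys:,} keys/s, {checks:,} key checks/s")
```

Under this model the 128-hash engine wins when many hashes are being attacked together, while the 4-core engine wins on raw keys per second.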
NTLM Brute-Force Password Recovery [CVS]
This implementation is very similar to the Lanman one, but it incorporates a custom MD4 core with a custom NTLM password generator and cracks full 128-bit hashes instead of the 64-bit Lanman hash pairs. Because MD4 requires more gates than DES, we were only able to fit 32 compares on our slower cracking engine. For cracking a single hash, there is an optimized version that uses 3 MD4 cores to crack 3x faster.
Hashes | Cooling | Clock Speed | Cores | Keys per Sec | Key Checks per Sec
32 | Yes | 175MHz | 1 | 175,000,000 | 5,600,000,000
32 | No | 125MHz | 1 | 125,000,000 | 4,000,000,000
1 | Yes | 175MHz | 3 | 525,000,000 | 525,000,000
1 | No | 125MHz | 3 | 375,000,000 | 375,000,000
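For readers who want to see in software what the NTLM engine does in hardware, here is a minimal sketch: generate candidate passwords, hash each one with MD4 over its UTF-16LE encoding (the NTLM hash), and compare against a set of target hashes. The MD4 routine follows RFC 1320 and is written for readability, not speed. The names `md4`, `ntlm_hash` and `brute_force` are illustrative and are not taken from the project's sources.

```python
import struct
from itertools import product


def _rotl(x, s):
    """Rotate a 32-bit word left by s bits."""
    return ((x << s) | (x >> (32 - s))) & 0xFFFFFFFF


# One entry per MD4 round: (boolean function, constant, word order, shifts).
_ROUNDS = [
    (lambda x, y, z: (x & y) | (~x & z), 0x00000000,
     list(range(16)), [3, 7, 11, 19]),
    (lambda x, y, z: (x & y) | (x & z) | (y & z), 0x5A827999,
     [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15], [3, 5, 9, 13]),
    (lambda x, y, z: x ^ y ^ z, 0x6ED9EBA1,
     [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15], [3, 9, 11, 15]),
]


def md4(message):
    """Pure-Python MD4 (RFC 1320) over a bytes object."""
    # Pad with 0x80, zeros up to 56 mod 64, then the bit length (64-bit LE).
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack("<Q", (len(message) * 8) & 0xFFFFFFFFFFFFFFFF)

    state = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
    for off in range(0, len(padded), 64):
        x = struct.unpack("<16I", padded[off:off + 64])
        a, b, c, d = state
        for func, const, order, shifts in _ROUNDS:
            for i, k in enumerate(order):
                t = (a + func(b, c, d) + x[k] + const) & 0xFFFFFFFF
                # Rotate the register roles, as in RFC 1320's ABCD/DABC/... steps.
                a, b, c, d = d, _rotl(t, shifts[i % 4]), b, c
        state = [(s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d))]
    return struct.pack("<4I", *state)


def ntlm_hash(password):
    """NTLM hash: MD4 of the UTF-16LE encoded password."""
    return md4(password.encode("utf-16-le"))


def brute_force(targets, alphabet, max_len):
    """Try every password up to max_len; return {hash: password} for hits."""
    found = {}
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            digest = ntlm_hash(candidate)
            if digest in targets:  # one set lookup stands in for N comparators
                found[digest] = candidate
    return found
```

The set lookup plays the role of the parallel comparators: each candidate costs one MD4 computation regardless of how many target hashes are loaded, which is why the "Key Checks per Sec" column scales with the number of hashes.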
Modular Exponentiation Core [CVS]
This core is a basic implementation of high-radix Montgomery multiplication with a parallelized square-and-multiply. It uses the Virtex-4 DSP48 slices and BlockRAM to perform the computations and requires no setup step (the residue and p-setup calculations are done in the core). The core performs everything in 16-bit operations, so a 1024-bit operation requires the following:
n = number of bits (1024)
w = number of bits in a word (16)
e = number of words (n / w) (64)
Operation | Complexity | Clock Cycles | Speed | Total Time
Residue Calculation | rc = n * e | rc = (1024) * (64) = 65536 | 150MHz | 437us
P-Setup Calculation | ps = (log2(w) + 3) * 2 | ps = (log2(16) + 3) * 2 = 14 | 200MHz | 70ns (done in parallel with Residue Calculation)
Montgomery Multiplication | mm = e^2 | mm = (64)^2 = 4096 | 200MHz | 20us
Montgomery Exponentiation | moe = n * mm | moe = (1024) * (4096) = 4194304 | 200MHz | 20.97ms
Modular Exponentiation | me = moe + MAX(rc, ps) | me = (4194304) + MAX(14, 65536 * (200 / 150)) = 4281685 | 200MHz | 21.41ms
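The cycle counts and times in the table can be reproduced with a few lines of arithmetic. The sketch below simply evaluates the table's own formulas. The function name `modexp_budget` is hypothetical, and the rescaling of the residue cycles from the 150MHz clock to 200MHz cycles mirrors the `65536 * (200 / 150)` term in the last row.

```python
import math


def modexp_budget(n=1024, w=16, residue_hz=150_000_000, core_hz=200_000_000):
    """Evaluate the table's formulas; returns {name: (cycles, seconds)}."""
    e = n // w                                  # number of w-bit words
    rc = n * e                                  # residue calculation
    ps = (int(math.log2(w)) + 3) * 2            # p-setup calculation
    mm = e ** 2                                 # one Montgomery multiplication
    moe = n * mm                                # Montgomery exponentiation
    # The residue step runs on the slower clock, so express it in core-clock
    # cycles before taking the maximum with the p-setup step.
    rc_in_core_cycles = rc * core_hz // residue_hz
    me = moe + max(ps, rc_in_core_cycles)       # full modular exponentiation
    return {
        "rc": (rc, rc / residue_hz),
        "ps": (ps, ps / core_hz),
        "mm": (mm, mm / core_hz),
        "moe": (moe, moe / core_hz),
        "me": (me, me / core_hz),
    }


for name, (cycles, seconds) in modexp_budget().items():
    print(f"{name:>3}: {cycles:>8} cycles, {seconds * 1e3:.5f} ms")
```

Calling `modexp_budget(n=2048)` gives a rough idea of how the budget grows with operand size, assuming the same formulas and clocks still apply at that size.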
RC4 Core [CVS]
This RC4 core is specifically made to be small and to compute only the first 2 bytes of the PRGA output. The idea is to have a higher-level core feed it possible WEP keys and then verify whether a key is correct by checking that PRGA ^ packet0 == 0xAAAA for the first 2 bytes of multiple packets. The core also supports optimizations: for example, if you're checking a batch of keys whose first 10 bytes are always the same, you can tell it to start from the precomputed S-Box state at round 10, saving 10 clock cycles on each key try.
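A software sketch of the same idea is shown below: run the RC4 key schedule, produce only the first 2 PRGA bytes, and test them against the 0xAAAA check described above. All function names are illustrative. The per-packet key is assumed to be the packet's 3-byte IV followed by the secret key, as in standard WEP, and the resume mechanism is only a model of the precomputed S-Box optimization, not the core's actual interface.

```python
SNAP = b"\xAA\xAA"  # expected first two plaintext bytes


def ksa_rounds(state, j, key, start, stop):
    """Run RC4 key-schedule rounds start..stop-1 on a copy of state."""
    s = list(state)
    for i in range(start, stop):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    return s, j


def first_two_prga_bytes(state):
    """Generate only the first two keystream bytes from a scheduled S-Box."""
    s = list(state)
    i = j = 0
    out = bytearray()
    for _ in range(2):
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def precompute_prefix(prefix):
    """Schedule the rounds that depend only on a shared key prefix."""
    s, j = ksa_rounds(range(256), 0, prefix, 0, len(prefix))
    return s, j, len(prefix)


def rc4_first_two(key, resume=None):
    """First two keystream bytes for key, optionally resuming a saved state.

    resume must come from precompute_prefix() on a prefix of this same key.
    """
    state, j, done = resume if resume is not None else (range(256), 0, 0)
    s, _ = ksa_rounds(state, j, key, done, 256)
    return first_two_prga_bytes(s)


def wep_key_matches(secret, packets):
    """packets: iterable of (iv, first_two_ciphertext_bytes) pairs."""
    for iv, cipher in packets:
        keystream = rc4_first_two(iv + secret)
        if bytes(k ^ c for k, c in zip(keystream, cipher)) != SNAP:
            return False
    return True
```

Checking several packets matters because a wrong key passes a single 2-byte test by chance about once in 65,536 tries. When many candidate keys share their first bytes, `precompute_prefix` is called once and its result is passed as `resume` for every candidate, skipping those rounds each time.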
SHA-1 Core [CVS]
This core is a tiny implementation of SHA-1 that is optimized for size rather than speed. It requires fewer than 500 Slices and 4 BlockRAMs and can be clocked up to 120MHz on the Virtex-4 (80 clock cycles are required for valid data). It uses a simple bus interface for writing values to it and reading results back. The end goal of this project is a full core that can accelerate WPA-PSK cracking through hooks into coWPAtty and/or aircrack. The small size should allow us to run multiple instances of the SHA-1 core in parallel on one FPGA to multiply the performance.
WPA-PSK PBKDF2 Core [CVS]
This is the SHA-1 core adapted for WPA-PSK cracking. It uses BlockRAMs to buffer the SHA-1 input and output values to streamline throughput. It is set up to accommodate larger FPGA designs that can use more SHA-1 cores to increase performance.
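To show why this workload is dominated by SHA-1, here is a small sketch of the WPA-PSK key derivation written out by hand: PBKDF2 with HMAC-SHA1, the passphrase as the password, the SSID as the salt, 4096 iterations and a 256-bit output. The function name `wpa_pmk` is illustrative, and the code relies on Python's standard `hmac` and `hashlib` modules rather than anything from the core.

```python
import hashlib
import hmac
import struct


def wpa_pmk(passphrase, ssid):
    """Derive the 256-bit WPA-PSK master key with PBKDF2-HMAC-SHA1."""
    password = passphrase.encode()
    salt = ssid.encode()
    pmk = b""
    # SHA-1 produces 20 bytes, so two PBKDF2 blocks are needed for 32 bytes.
    for block_index in (1, 2):
        u = hmac.new(password, salt + struct.pack(">I", block_index),
                     hashlib.sha1).digest()
        t = int.from_bytes(u, "big")
        # Each block chains 4096 HMAC-SHA1 calls and XORs their outputs.
        for _ in range(4095):
            u = hmac.new(password, u, hashlib.sha1).digest()
            t ^= int.from_bytes(u, "big")
        pmk += t.to_bytes(20, "big")
    return pmk[:32]


if __name__ == "__main__":
    print(wpa_pmk("password", "IEEE").hex())
```

Each candidate passphrase therefore costs 8192 HMAC-SHA1 calls, and every call in a chain depends on the one before it. The parallelism has to come from testing many passphrases at once, which fits the approach of placing many small SHA-1 cores on one FPGA.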
A5/1 GSM Core [CVS]
This is a baseline implementation of the A5/1 algorithm used with most GSM cellphones. It currently isn't fully optimized for speed, but is done with a fairly small footprint using a simple linear feedback shift-register and state-machine.
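As a rough software counterpart, the sketch below models A5/1 as it is publicly documented: three shift registers of 19, 22 and 23 bits, clocked irregularly by a majority rule on one bit from each register. The register constants and the key and frame loading order follow the widely circulated public reference description. The function name `a51_keystream` and the bit-ordering conventions are assumptions of this sketch and may differ from the core.

```python
# (mask, feedback taps, clocking bit, output bit) for R1, R2, R3.
REGS = (
    (0x07FFFF, 0x072000, 0x000100, 0x040000),  # R1: 19 bits, taps 18,17,16,13
    (0x3FFFFF, 0x300000, 0x000400, 0x200000),  # R2: 22 bits, taps 21,20
    (0x7FFFFF, 0x700080, 0x000400, 0x400000),  # R3: 23 bits, taps 22,21,20,7
)


def _parity(x):
    """Return the XOR of all bits in x."""
    return bin(x).count("1") & 1


def _step(reg, mask, taps):
    """Shift one register left and feed the tap parity into bit 0."""
    return ((reg << 1) & mask) | _parity(reg & taps)


def a51_keystream(key, frame, nbits=114):
    """Return nbits keystream bits for a 64-bit key (8 bytes) and frame number."""
    regs = [0, 0, 0]

    def clock_all():
        for n, (mask, taps, _, _) in enumerate(REGS):
            regs[n] = _step(regs[n], mask, taps)

    def clock_majority():
        mids = [int((regs[n] & REGS[n][2]) != 0) for n in range(3)]
        majority = int(sum(mids) >= 2)
        # Only registers whose clocking bit agrees with the majority advance.
        for n, (mask, taps, _, _) in enumerate(REGS):
            if mids[n] == majority:
                regs[n] = _step(regs[n], mask, taps)

    # Load the key, then the 22-bit frame number, with regular clocking.
    for i in range(64):
        clock_all()
        bit = (key[i // 8] >> (i & 7)) & 1
        for n in range(3):
            regs[n] ^= bit
    for i in range(22):
        clock_all()
        bit = (frame >> i) & 1
        for n in range(3):
            regs[n] ^= bit

    # Mix for 100 irregular clocks, discarding the output.
    for _ in range(100):
        clock_majority()

    out = []
    for _ in range(nbits):
        clock_majority()
        out.append(_parity(regs[0] & REGS[0][3])
                   ^ _parity(regs[1] & REGS[1][3])
                   ^ _parity(regs[2] & REGS[2][3]))
    return out
```

The two nested helpers correspond loosely to the state machine mentioned above: one mode clocks every register during loading, the other applies the majority rule during mixing and output.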
Supercooling FPGAs with Liquid Nitrogen
The Virtex-4 FPGAs we've been using have been having problems with overheating (basically going into thermal runaway until the power supplies shut down). The problem is that synthesis says that the chips can be clocked fairly fast with our cores (200MHz+), but when it comes to actually running the design, the chip becomes too hot and won't actually support the reported clock speed. This is supposedly due to the fact that smaller transistors (90nm now) end up leaking more energy, which increases the thermal problems with the Virtex-4 over the older chips. So, we decided we would try to supercool them to try and fix the problem and see if we could actually clock them as fast as reported.
The result of the experiment was being able to clock our lanman hash cracking designs up to 175MHz (before, we were able to get 125MHz with just a heatsink attached to the card). Clock speeds above that resulted in inconsistent data, which is probably due to the signals traveling much faster than they're anticipated to during synthesis and place & route. Xilinx has options in their tools for specifying operating temperature which might help us run our design consistently at the faster clock speed, but we haven't tried that yet. More details will be posted here if we experiment with this more in the future.
Presentations
Here is a list of presentations we've given at various conferences and shows. Please note that the older presentations are somewhat outdated and may contain old information. Feel free to email me with any questions or comments at dhulton@picocomputing.com.