Design and implementation of an FPGA-based TRNG for Linux kernel integration with on the fly analysis



Design and implementation of an FPGA-based TRNG for Linux kernel integration with on the fly analysis

Joost Verplancke

Thesis submitted to obtain the degree of Master of Science in Engineering: Electrical Engineering, option Electronics and Integrated Circuits

Promotor: Prof. dr. ir. I. Verbauwhede
Assessors: Prof. dr. ir. B. Preneel, Prof. dr. ir. P. Wambacq
Supervisors: ir. B. Yang, ir. V. Rožić

Academic year

© Copyright KU Leuven

Without written permission of the thesis supervisor and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to ESAT, Kasteelpark Arenberg 10 postbus 2440, B-3001 Heverlee, or via info@esat.kuleuven.be. A written permission of the thesis supervisor is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.

Preface

With this thesis comes an end to my engineering studies. As such, a few words of gratitude are in order. For starters I would like to thank my promotor, professor Ingrid Verbauwhede, and my assessors, professors Bart Preneel and Patrick Wambacq, for reading this text. Special thanks also go to my assistants, Bohan and Vladimir. Without their regular advice, the road to completion would have been a lot rockier. I would also like to thank the friends I have made over the last few years. Thanks for the awesome times, hilarious discussions and memorable moments. Names should not be needed; you know who you are. Eternal gratitude goes to my girlfriend, Lien. While other people could be depended upon when everything went well, you also put up with me when I needed support the most. Thank you! Thanks, of course, to everyone who proofread my thesis and corrected my numerous mistakes: Ankatrien, Bo, Deevid, Hilde, Josephine, Lien, Niels, Raf and Sarina. This text would not be as clear as it is today without you. Finally I would like to thank my family for supporting me through thick and thin, and for showing me the way to become the person I am today, and was always meant to be.

Joost Verplancke

Contents

Preface
Abstract
Samenvatting
List of Figures and Tables
List of Abbreviations and Symbols
1 Intro
  1.1 A brief discussion of Random Number Generators
  1.2 Linux and /dev/random
  1.3 Current state-of-the-art
  1.4 Our goals
2 Entropy sources
  2.1 Ring Oscillators
  2.2 Open Loop Structures
  2.3 Transition Effect Ring Oscillators
3 On the fly Testing
  3.1 Error Types
  3.2 Generic tests
  3.3 Specific Tests
4 Implementation on the LX9 MicroBoard
  4.1 FPGA side of the USB-link
  4.2 Aggregator block
  4.3 User Control Options
  4.4 Solutions for cheaper FPGAs
5 Integration and Results
  5.1 Substituting /dev/random
  5.2 Results
6 Conclusion
A Highlighted VHDL-code
B NIST test-suite report
Bibliography

Abstract

Random numbers are more important now than ever. Cryptography in particular relies heavily on randomness to generate unpredictable keys. The use of longer keys and the increase in available computing power have made some doubt the security and reliability of existing solutions. True random number generators are expected to solve these issues. /dev/random, a software implementation of such a generator, is found natively in most Linux distributions. However, it has a very low generation rate and as such cannot provide acceptable service uptime for many systems. This thesis proposes a TRNG design that can be seamlessly integrated into Linux. By using an FPGA as the hardware platform, cheap and reconfigurable systems can be produced. Entropy sources were put in place based on Transition Effect Ring Oscillators. In order to ensure the healthiness of the entropy pool, statistical tests were applied to the generated data. These tests are simplified algorithms based on the mathematical constructs found in the NIST draft standard. Approved data was transferred to the computer and integrated into the Linux OS using virtual serial ports. The results were evaluated by applying the AIS-31 standard to multiple megabytes of generated data, and satisfactory passing grades were obtained. The performance of the final system was heavily limited by the USB interface between the FPGA and the computer, but its cost-performance factor ended up close to state-of-the-art solutions found on the market.

Samenvatting

Introduction and context

Random numbers are more important than ever in today's world. Both cryptography and scientific simulations rely on being able to request large quantities of them. To meet this growing demand, Random Number Generators are counted upon. These exist in several forms, but the main distinction is whether the numbers are truly random or pseudo-random. The latter are generated by algorithms and have the drawback of depending directly on the previously generated numbers. For cryptographic applications in particular, this is a major disadvantage: access to large amounts of computing power theoretically allows attackers to predict future numbers. In this way, data that was until then considered protected becomes freely accessible to people with bad intentions. Truly random numbers come from True Random Number Generators and are usually based on physical processes involving unpredictable events. These are converted into digital electronic signals and used where needed. An important example of a TRNG is the implementation behind /dev/random in Linux kernels. This software gathers noise from device drivers, register accesses and the like to generate random bits. This way of working has the major drawback that the throughput of generated bits can be very low: on the test system for this thesis, only one byte was generated every three seconds. To give Linux kernels access to more random bits, a TRNG was designed and implemented on an FPGA platform for this master's thesis. This brings several advantages, such as low cost and great flexibility through the possibility of reconfiguring the FPGA whenever that proves necessary.
As an evaluation, the designed system was compared with other existing solutions currently on the market.

Entropy sources

To place an entropy source on an FPGA, several options exist. Ring Oscillators and Open Loop structures were investigated but rejected.

ROs are very sensitive to external attacks and suffer heavily from frequency locking. This means that the randomness and non-correlation between different free-running ROs disappears, which makes the available entropy plummet. To counteract this, very many ROs must be used simultaneously, which would cost a lot of area on the FPGA. Open Loop structures induce metastability in registers by deliberately violating the setup and hold conditions of these elements. Intrinsic noise then has an effect on the final data in the register. These circuits were not used because of the meticulous placement work required to get the delays to the register's input signals right. After all, registers on an FPGA are designed to counteract every form of instability. The circuit that was eventually chosen is the Transition Effect Ring Oscillator (TERO). Like the Open Loop structure, it provokes metastability, but this time in combinatorial loops designed specifically for that purpose. These have a forward and a backward path in which, upon excitation, pulses start to travel, causing oscillations on the output nodes. The number of oscillations that occur before the loop settles exhibits random behaviour. This number can be measured by means of a 1-bit counter in the form of a Toggle Flip-Flop. These circuits were mainly designed and investigated by Varchola et al., who have proposed two variants. The first kind consists of two XOR gates and two AND gates, but shows unstable behaviour and is very sensitive to its location on the chip. As an improvement, a circuit using two NAND gates was proposed. Its two branches can be lengthened or shortened at will by adding even numbers of inverters. These circuits cannot simply be written in VHDL, but must be instantiated by means of Xilinx primitives.
For every logic gate present in the circuit, a LUT has to be described, which is then connected by hand. The placement also has to be forced manually using UCF files.

Live tests

The generated bits must be random in order not to endanger the applications and users that rely on them. To this end, enough entropy must be present in the produced data. Bad, predictable data should never reach the end user and is therefore filtered out. To be able to pass judgement on the numbers, four tests were implemented. Two of these are statistical, generic tests based on the NIST test suite; the other two are specific tests that check the correct operation of the TEROs. The generic tests each work on data packets of 200 bits and decide whether or not these may be forwarded. Every TERO loop is assigned its own generic tester and specific test circuits. In this way, errors can be noticed locally and failing TERO loops can be disabled.
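The bit stream that these testers judge originates as described above: on each excitation the TERO loop oscillates a random number of times before settling, and the toggle flip-flop keeps only the parity of that count. A toy model in Python (the Gaussian oscillation-count distribution and its parameters are assumptions for illustration, not measurements from the thesis):

```python
import random

def tero_bit(rng, mean=60.0, sigma=6.0):
    """Toy TERO model: the loop oscillates a random number of times after
    excitation; a 1-bit counter (toggle flip-flop) retains only the parity
    of that count, which becomes the extracted bit."""
    n_osc = max(1, round(rng.gauss(mean, sigma)))  # oscillations before settling
    return n_osc & 1                               # parity = final TFF state

rng = random.Random(0)
packet = [tero_bit(rng) for _ in range(200)]       # one 200-bit test packet
```

Because the spread of the count covers many integers, the parity is close to uniform; a real design verifies this on line with the tests described here instead of assuming it.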

The first statistical test is the Frequency MonoBit Test, which checks whether the number of ones and zeroes in a packet lies close enough to 50%. The number of ones is counted and compared with a lower and an upper bound. These bounds are derived from the mathematical formulas found in the NIST draft by fixing the significance levels, thereby removing all flexibility. This thesis has two MonoBit Tests in each generic tester block: a strict one and a lax one, at two different significance levels α. Failing the lax test causes the data to be rejected; on failing the strict test, the corresponding entropy source is also disabled until the next global system reset. If the data has passed the MonoBit Test, the Runs Test must also be satisfied before clearance for transmission is given. The Runs Test counts the number of transitions from 0 to 1 and vice versa and checks that the frequency at which these transitions occur is neither too high nor too low. Again, the number of transitions is counted while the bits are read in and compared with an upper and a lower bound in the judgement phase. In the same way as for the Frequency MonoBit Test, these bounds can be derived by simplifying the formulas from the NIST draft. The biggest difference is that for the Runs Test they depend on the number of ones or zeroes counted during the MonoBit Test. The specific tests first check whether the TERO loops start oscillating upon excitation. If this does not happen within an acceptable time, the TERO is declared broken and is disabled. If it does oscillate, the circuit checks whether the loop stops again; otherwise, of course, no bit can be read out. Not ceasing to oscillate is not regarded as an error, but it will cause the subsystem to hang at the stuck TERO and stop emitting bits until settling occurs again.
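Both generic tests can be prototyped directly from the NIST formulas before the bounds are frozen into hardware comparators. In the Python sketch below the significance levels are illustrative placeholders (the exact values fixed in the design are not reproduced here), and `judge_packet` mirrors the described decision: a lax failure drops the packet, a strict failure also disables the source:

```python
from math import sqrt
from statistics import NormalDist

def monobit_pass(bits, alpha=0.01):
    """Frequency MonoBit Test on one packet: the count of ones must lie
    within n/2 +/- z * sqrt(n)/2, the two-sided bound at level alpha."""
    n = len(bits)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(sum(bits) - n / 2) <= z * sqrt(n) / 2

def runs_pass(bits, alpha=0.01):
    """Runs Test: the number of runs must lie close to its expectation
    2*n*pi*(1 - pi), where pi is the fraction of ones (reusing the
    MonoBit count, as the hardware does)."""
    n = len(bits)
    pi = sum(bits) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    runs = 1 + sum(a != b for a, b in zip(bits, bits[1:]))
    return abs(runs - 2 * n * pi * (1 - pi)) <= z * 2 * sqrt(n) * pi * (1 - pi)

def judge_packet(bits, alpha_lax=0.01, alpha_strict=0.0001):
    """Returns (send_packet, disable_source); alpha values are assumptions."""
    if not monobit_pass(bits, alpha_strict):
        return False, True          # catastrophic: shut this TERO down
    ok = monobit_pass(bits, alpha_lax) and runs_pass(bits, alpha_lax)
    return ok, False
```

For a 200-bit packet at α = 1% the MonoBit bound works out to roughly 100 ± 18 ones, which is the kind of fixed integer limit a hardware comparator can check directly.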
Completing the implementation

To complete the system on the FPGA, an interface to the computer is still needed, as well as an intermediate stage between the produced TERO packets and this USB link. The full implementation was done on the LX9 MicroBoard by Avnet. As the interface between the computer and the FPGA, the Universal Asynchronous Receiver Transmitter (UART) present on the LX9 MicroBoard is used. Its transmission speed is limited, and this will turn out to be the bottleneck of the entire system, since the TERO loops produce approved data much faster than the built-in UART can transmit it. A faster interface should therefore bring a large improvement in overall system performance. As the intermediate stage between the tested entropy sources and the UART, an Aggregator block was created. This block contains three 200-bit registers. The first two are filled with buffered data coming from the tested TERO blocks. The last one is a shift register whose 8 least significant bits are offered to the UART. When the full 200 bits from this Send register have been transmitted, it is refilled with data from the other two registers. Depending on whether XOR post-processing is enabled, one or both of the other registers are used for this refill. If XOR-PP is switched on, the two registers are XOR-ed bit by bit and the result is loaded into the Send register. If this whitening procedure is off, the lower register is simply copied into the Send register. The bookkeeping and the loading of the registers are controlled by two coupled FSMs. The UART Controller handles the communication towards the UART and the loading of data into the Send register. The Readout Controller buffers the data coming from the TEROs into the intermediate registers and clears the relevant registers when they are read out. Communication between the FSMs happens by reading each other's state. The switches and push button present on the LX9 MicroBoard were used as inputs to influence the operation of the system at runtime. Important status signals, such as those signalling approved data, were brought out to the LEDs. An estimate of the production price of the complete system can be made on the basis of prices for cheaper Spartan-6 FPGAs and the extra components required. The price would come down to about €25 when the LX9 is used, as is the case here, or €15 for an implementation with the LX4 as the chosen platform.

Integration into Linux

To integrate the read-out data into Linux, calls to /dev/random are redirected by means of a SymLink. For this to work, the new target must of course offer the same interface as /dev/random, so a blocking serial port has to be created. Since the Virtual Com Port drivers for the built-in CP2102/9 offer a non-blocking device port, a pair of virtual serial ports is created instead. Socat is used for this. A python script runs in the background to funnel the data from the VCP to the input of the VSP pair.
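The background pump can be a few lines of Python. The sketch below is a plain file-descriptor copy loop; in the real setup the source would be the CP210x VCP device (a path such as /dev/ttyUSB0 is an assumption, not the thesis's exact name) and the sink one end of the socat-created PTY pair (`socat -d -d pty,raw,echo=0 pty,raw,echo=0`):

```python
import os

PACKET_BYTES = 200 // 8   # one approved 200-bit packet is 25 bytes

def pump(src_fd, dst_fd, chunk=PACKET_BYTES):
    """Forward approved random bytes from the VCP (src) to the input end
    of the virtual serial port pair (dst) until EOF."""
    while True:
        data = os.read(src_fd, chunk)
        if not data:
            break
        os.write(dst_fd, data)
```

In production the descriptors would come from `os.open()` on the two device files; here the function is deliberately agnostic so it can be exercised with ordinary pipes.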
The output is then exactly the target needed for the SymLink.

Results

The throughput of data towards Linux was measured. Under normal circumstances it came to about one fifth of the speed offered by the cheapest solution from ID Quantique, but the price of the system developed here is also only a fifth of the price of that solution. A faster interface replacing the UART would probably give the designed system even more performance per euro invested: the measured throughput is roughly the speed at which the UART operates, and the TEROs produce data much faster than the UART can ever pass on. The numbers offered must of course be random enough. To verify this, the software developed by NIST was used. The parameters from the AIS-31 standard were entered and several megabytes of data were analysed. The results were satisfactory, and an analysis report of one of the larger runs is attached to this thesis. To emulate an attack, the FPGA was cooled down to -50 °C during operation. The throughput was measured again but dropped only slightly. The status signals showed, however, that a strong degradation in quality occurred within the TEROs and that some of them were shut down completely. The minimal effect on the throughput is explained by the fact that the UART limits the system so heavily that even the loss of an extremely large amount of data goes unnoticed.

List of Figures and Tables

List of Figures

1.1 General TRNG schematic and components
Wrong representation of /dev/random and /dev/urandom. Source: [15]
Correct representation of /dev/random and /dev/urandom. Source: [15]
Ring Oscillator formed by three INV gates
Ring Oscillators with XOR comparison and sampling at frequency f_s, as proposed by Sunar [32]
Single Open Loop Element with variable delay d_1, as proposed by Danger [6]
General TERO concept with forward and backward path
Waveforms found at the input and output nodes of TEROs at the moment of excitement. Based on Haddad et al. [14]
Full TRNG Element by adding an asynchronous counter to the TERO-Loop from figure 2.10; here a 1-bit counter is used. Based on Haddad et al. [14]
Schematic representation of Varchola's XOR-TERO-Loop
Full TRNG Element by adding an asynchronous counter to the XOR-AND-TERO-Loop from figure 2.7; here a 1-bit counter is used. Based on Varchola et al. [33]
PlanAhead view of the placed XOR-TERO
Schematic representation of Varchola's NAND-TERO-Loop
PlanAhead view of the placed NAND-TERO
The upper bound for the number of runs in the Runs Test as a function of the number of ones in a 200-bit package at significance level α = 1%
The lower bound for the number of runs in the Runs Test as a function of the number of ones in a 200-bit package at significance level α = 1%
Comparison of the fitted curves and the exactly calculated curves for the number of runs as a function of the number of ones in a 200-bit package at significance level α = 1% for the Runs Test
Full decision tree implemented in the generic testing unit

3.5 Basic detector to see whether the input signal has encountered a rising edge since the last reset
High level view of TERO + Testing subsystem
Full flowchart for Bit Generation and Testing
Schematic overview of the Aggregator Block connections towards the hardware interface
Front view of the LX9 MicroBoard
Schematic overview of the Aggregator Block

List of Tables

1.1 Side by side comparison of a few existing TRNG implementation solutions
Lower and upper bounds on the number of ones for the implemented MonoBit Frequency Test, in base 10 and base 2
Numerical values for the upper and lower bounds on runs as a function of the number of ones in a 200-bit package at significance level α = 1% for the Runs Test

List of Abbreviations and Symbols

Abbreviations

FPGA      Field-Programmable Gate Array
HW        Hardware
LUT       Look-Up Table
RNG       Random Number Generator
TRNG      True-Random Number Generator
PRNG      Pseudo-Random Number Generator
CSRNG     Cryptographically Secure Random Number Generator
CEF       Complementary Error Function
PAR       Place And Route
P&R       Place and Route
R&D       Research and Development
UCF       User Constraint File
FF        Flip-Flop
TFF       Toggle Flip-Flop
pdf       Probability Density Function
PUF       Physically Unclonable Function
LED       Light Emitting Diode
UART      Universal Asynchronous Receiver Transmitter
SPI       Serial Peripheral Interface
FSM       Finite State Machine
MFT       MonoBit Frequency Test
ES        Entropy Source
RO        Ring Oscillator
OLS       Open Loop Structure
GPIO      General Purpose Input/Output
SymLink   Symbolic Link
VCP       Virtual Com Port

Symbols

α                significance level for the discussed test
α_main           significance level for the main MFT
α_catastrophic   significance level for the strict MFT
α_secondary      significance level for the Consecutive Failure Test for the main MFT
α_runs           significance level for the Runs Test

Chapter 1
Intro

1.1 A brief discussion of Random Number Generators

Random numbers are becoming increasingly important in computing and science. Encryption algorithms and keys especially require long, unpredictable strings of bits to set up secure communication and storage. The high demand for these numbers can be met in multiple ways, leading to a number of different ideas for Random Number Generators (RNG). The easiest way is to algorithmically generate so-called pseudo-random numbers. The algorithms that do this are Pseudo-Random Number Generators (PRNG) and they have one glaring weakness: no matter how well designed, they work deterministically. This means that with enough data and computing power, future numbers can be predicted, keys can be guessed and security breaches can occur. These algorithms also have to be initialized with at least one truly random number; this initialisation is called seeding. This has led to a demand for truly random numbers, and to the development of True Random Number Generators (TRNG). The goal here is to generate numbers that are inherently random and completely uncorrelated. This is almost always done by measuring some physical random process and quantizing the output. There are TRNGs that measure the thermal noise of resistors, which our best models still perceive to be white noise. Others measure nuclear decay using Geiger counters hooked up to computers, effectively translating quantum randomness into bits. Huge cloud services exist [28], based on distributed radio receivers measuring atmospheric noise. Closer to the domain of digital chips, there are many generators based on decoupled Ring Oscillators (RO), where the drift between a number of these exhibits truly random behaviour. Due to the nature of their usage, random numbers form a weak link in the chain that makes encryption secure.
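The determinism that makes PRNGs risky is trivial to demonstrate: two PRNG instances created with the same seed produce identical streams forever. A minimal Python illustration (the seed value is arbitrary):

```python
import random

victim = random.Random(1234)     # PRNG seeded once, e.g. to derive key material
attacker = random.Random(1234)   # an attacker who has recovered the seed

# The attacker predicts every "random" value the victim will ever draw.
prediction = [attacker.getrandbits(32) for _ in range(8)]
drawn = [victim.getrandbits(32) for _ in range(8)]
assert prediction == drawn
```

This is exactly why seeding matters: the entire unpredictability of a PRNG rests on the secrecy and true randomness of that one seed.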
Attackers that are able to predict or, worse, control the numbers used as the basis for the keys in the algorithms have access to all the data that was supposed to be secure. This explains why a large number of people shy away from using PRNGs, since the determinism theoretically allows for perfect modelling of the entropy source, leading to free gateways for attackers. TRNGs do not solve this problem completely, since they too can be attacked. Influencing the

surroundings in RO-based RNGs, for example, can cause these oscillators to lock on to certain frequencies, decreasing the jitter that is used as the basis for entropy extraction. An example of this can be found in [26].

Figure 1.1: General TRNG schematic and components

This is partially solved by checking the estimated randomness of an entropy source before allowing the data to reach the end user, since it is assumed that interfering with the source will cause it to show deterministic behaviour. The only way to assess the randomness is, of course, statistically. In order to do this, a number of test suites have been designed, the most important being the NIST (National Institute of Standards and Technology) [2] and DIEHARD [25] suites. Testing and ensuring randomness in RNGs has become so important that the most widespread standards require any TRNG implementation to have at least a few tests running on every source. Data generated by the RNG that does not satisfy the integrated tests should not be delivered to the end user, and a catastrophic failure, which most definitely compromises security, should always be signalled by an alarm. In addition to generation and testing, post-processing (PP) should be mentioned as the final part of the chain. A source might contain a lot of entropy, that is, produce actual random numbers, yet the output might still be biased towards certain values. Specifically, a binary random number generator might produce more ones than zeroes or vice versa. This bias can be counteracted by post-processing. Arithmetic PP is a small, simple algorithm that transforms the generated data in such a way that the output preserves all the random qualities of the input but shows a more uniform distribution of numbers. In this way the entropy quality of the output can be enhanced, at the cost of throughput.
A comprehensive analysis of the methodology for the design of modern hardware TRNGs has been done by Fischer [13]; a general schematic of a TRNG showing all important components in context can be seen in figure 1.1.
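One simple instance of arithmetic post-processing, used later in this text as well, is XOR-ing two independent raw bit streams: if the inputs have biases e1 and e2 (deviation of the ones-probability from 1/2), the output bias is 2·e1·e2, smaller than the worse input, at the cost of halving the throughput. A sketch:

```python
def xor_whiten(a, b):
    """XOR post-processing (whitening): combine two independent raw bit
    streams into one stream with reduced bias; throughput is halved."""
    return [x ^ y for x, y in zip(a, b)]

def bias(bits):
    """Deviation of the fraction of ones from 1/2."""
    return abs(sum(bits) / len(bits) - 0.5)
```

For example, XOR-ing a stream of bias 0.1 with an independent, perfectly balanced stream yields a balanced output.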

1.2 Linux and /dev/random

Most contemporary Linux distributions provide kernel support for programs that need random numbers. The random data is generated by harvesting environmental noise from hardware devices. The specifics differ from system to system, but examples of noise sources are drive spinning speeds, hardware component temperature changes or accumulator register access frequencies. The bits gathered from these processes are put into an entropy pool, which serves as the base for both the TRNG output and PRNG algorithm seeding. Both /dev/random and /dev/urandom are exposed as character device files that can easily be read by the relevant software. As can be seen from the Random(4) man page [23], /dev/random will only provide a number of random bytes up to the entropy estimation. Once the system deems the data pool too shallow, no more data is provided on the device file: the port blocks. This means that random bytes can only be read at a very limited speed. As an example, the speed of /dev/random on a 32-bit virtual Ubuntu distribution running on top of a 64-bit Windows 10 system was tested: it generated approximately one byte every three seconds. Virtualization of the Linux OS worsens the aggregation considerably, since the OS cannot reach all real-life hardware noise sources, but full-fledged Linux systems do not perform all that much better. The "U" in /dev/urandom stands for unlimited and indicates that the port does not block. According to the same man page, /dev/urandom fills in the gaps where not enough entropy is present with a PRNG using the earlier random data as seed. This way, reads can be issued without having to worry about the availability of random bytes, since the requested data will be generated at will. The man page warns about the dangers of using /dev/urandom for important cryptographic work, citing its theoretical vulnerabilities as a concern: [..]
As a result, in this case the returned values are theoretically vulnerable to a cryptographic attack on the algorithms used by the driver. [..] but it is theoretically possible that such an attack may exist. If this is a concern in your application, use /dev/random instead. [..]

However, there are some lively discussions of and opinions about this advice. The blocking character is seen by some as too disruptive to the responsiveness of software that depends on the numbers [15]. The linked article also questions the representation

of both /dev/random and /dev/urandom. The man page suggests that /dev/urandom only switches to the PRNG when entropy is low, giving the idea that the system works something like figure 1.2. The reality is closer to figure 1.3, where /dev/random takes data from the exact same pool as /dev/urandom, which is filled completely from a CSPRNG (Cryptographically Secure Pseudo-Random Number Generator). This means that, besides the blocking, there is no real difference between the data retrieved from /dev/random and /dev/urandom. This caused Linux systems to incorporate GetRandom [22], which is basically a system-call abstraction of /dev/urandom. Both sources have also come under attack, with papers like [21] and [9] questioning the entropy accumulation and the hash algorithms used to mix everything together. Neither paper suggests that the weaknesses of the sources are a serious cause for concern, or that the theoretical vulnerabilities could easily be translated into a targeted attack on cryptographic Linux systems. They do indicate, however, that using external entropy sources might be necessary on important servers or computers that require safe, unpredictable cryptographic keys.

1.3 Current state-of-the-art

Hardware solutions that provide a better, faster and more reliable TRNG data stream exist. These come both as plug-and-play hardware packages like [16] and [12], and as soft IP cores [17]. They offer a lot of advantages over the generic software solution of /dev/random [23] because they combine the best of both virtual device ports: they have a high enough throughput of random bits to seem non-blocking, allowing near-constant uptime of the services that require the randomness, and they do not have the theoretical weaknesses associated with using a deterministic algorithm the way a PRNG does. There are still some reasons to continue development of new external hardware random number generators. The first is the price.
Prices for ID Quantique chips [16] quickly rise above €100 a piece, and for this price you get an external randomness card of approximately 4 Mbps. Higher speeds naturally cost even more. IP cores such as [17] are mainly meant for ASIC design. This means investments have to be made for a full design and production cycle. This might be offset by the fact that the system which needs the randomness is designed and incorporated on the same chip as the one which uses the IP core. Secondly, it is not always known what mechanism or process is used to generate the random data, or, if it is known, it might not be satisfactory. For example, the UK-made Entropy Key [12] chips use a PRNG with true-entropy seeding, which is exactly the critique we gave /dev/urandom. Lastly, the chips that have been bought, or in the case of IP cores [17] designed, are almost always non-reconfigurable. This means that if a flaw is ever detected in the used algorithm or underlying process, they will have to be discarded and new chips purchased. Preferably, a cheap, reconfigurable and inherently known chip is used, which does not require the expensive production cycle of an ASIC and is widely

Figure 1.2: Wrong representation of /dev/random and /dev/urandom. Source: [15]

Figure 1.3: Correct representation of /dev/random and /dev/urandom. Source: [15]

Product             Displayed price   Pros                 Cons
/dev/random [23]    Free              Widespread, free     Slow in practice
ID Quantique [16]   €100              Guarantees entropy   Expensive
Entropy Key [12]    36                Known process        Non-guaranteed speed
IP core [17]        Unknown           Softcore IP          Requires full design

Table 1.1: Side by side comparison of a few existing TRNG implementation solutions

available. FPGAs fit this bill perfectly.

1.4 Our goals

This master's thesis revolves around designing and implementing a reconfigurable True Random Number Generator for Linux kernel integration. In this way we try to make a substitute for /dev/random at a fraction of the cost of existing hardware solutions. Cheap FPGAs were the targeted platform, and performance had to be at least in the same order of magnitude as what the market offers at this point in time. All work was done for and tested on Spartan-6 FPGAs [40]. First tests and proof of concept were done on the Digilent Atlys board [8], and the final implementation was designed for Avnet's LX9 MicroBoard [1]. In order to reach this goal, all three major blocks of a TRNG had to be made. First, entropy sources have to be designed and implemented. Following this, live testing has to be put in place to check the healthiness of said entropy sources. Finally, a post-processing algorithm has to be added and the data has to be sent to the computer to be integrated into the Linux OS.
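Before moving on to the design, the baseline is worth quantifying. The snippet below times byte-by-byte reads from a character device; pointed at /dev/random on the thesis-era test system it would show roughly three seconds per byte (the device path and byte count here are illustrative):

```python
import time

def timed_read(path, n_bytes):
    """Read n_bytes one byte at a time from a (possibly blocking)
    character device and report the elapsed wall-clock time."""
    t0 = time.monotonic()
    with open(path, "rb", buffering=0) as dev:
        data = b"".join(dev.read(1) for _ in range(n_bytes))
    return data, time.monotonic() - t0

# e.g.: data, seconds = timed_read("/dev/random", 16)
```

On an entropy-starved machine each single-byte read can stall until the pool refills, which is exactly the blocking behaviour described in section 1.2.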


Chapter 2

Entropy sources

In this chapter, a few possible entropy sources will be introduced and evaluated. Every solution has to meet one important criterion: it has to be implementable on FPGAs. Other criteria include reliability, area and ease of implementation. Two rejected ideas will be explained superficially: ring oscillator TRNG structures [32] [37] [5] and open loop structures [6]. The idea eventually decided upon is the Transition Effect Ring Oscillator (TERO), which was mainly developed by M. Varchola et al. [33] [10] [35] [34]. Analysis of the element they propose has been done by Haddad, Fischer, Bernard and Nicolai [14] and by Kitsos and Voyiatzis [20].

2.1 Ring Oscillators

Ring oscillators (ROs) are well-known digital circuits formed by a number of logic gates connected in a circular structure: the final signal is fed back to the first gate. The logic gates can be anything; the only requirement for oscillatory behaviour is that the number of inverting gates is odd. A very simple RO structure is shown in figure 2.1. Independent ROs which are not put in a feedback loop are called free-running ring oscillators. The frequency at which these oscillate is not controlled, and the signals found at the outputs have to be described as electric noise with randomly varying frequencies or phases [3]. Taking a number of these ROs and having them run independently gives access to this randomness by comparing the outputs and sampling at the desired frequency, as proposed by Sunar [32]. This comparison can easily be implemented by XOR-ing the outputs [18]. Sampling is just feeding this data to a clocked register, which results in the eventual circuitry seen in figure 2.2. The entropy harvested from this kind of system depends heavily on the number of oscillators used per comparison, as seen in research done by Fischer et al. [5].
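The sampling scheme just described can be sketched in software. The following toy model (not from the thesis; all period and jitter values are illustrative assumptions) treats each free-running RO as a square wave whose half-period picks up Gaussian jitter, XORs the current levels and samples the result at a fixed rate:

```python
import random
from functools import reduce

def ro_trng_bits(n_bits, n_ros=4, t_sample=1.0, jitter=5e-4, seed=None):
    """Toy Sunar-style RO TRNG: XOR of n_ros free-running ring oscillators,
    sampled every t_sample time units. Periods and jitter are illustrative."""
    rng = random.Random(seed)
    periods = [0.013 + 0.0017 * i for i in range(n_ros)]  # detuned nominal periods
    next_edge = list(periods)   # time of each RO's next output toggle
    level = [0] * n_ros         # current logic level of each RO
    bits, t = [], 0.0
    for _ in range(n_bits):
        t += t_sample
        for i in range(n_ros):
            # advance oscillator i up to the sampling instant, adding
            # Gaussian jitter to every half-period
            while next_edge[i] < t:
                level[i] ^= 1
                half = periods[i] / 2 + rng.gauss(0.0, jitter)
                next_edge[i] += max(half, 1e-6)
        bits.append(reduce(lambda a, b: a ^ b, level))  # XOR combination
    return bits
```

With these values the jitter accumulated between two samples is comparable to a half-period, so the sampled XOR output is close to balanced; sampling much faster would produce strongly correlated bits, which is one reason the harvested entropy depends on the number of oscillators and on the sampling rate.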
Thus, in order to have high entropy generation, a large area has to be used, since frequency-locking effects will make some of the oscillators unusable, resulting in a decrease of the amount of entropy the system produces. This can happen because ROs are

Figure 2.1: Ring Oscillator formed by three INV gates.

Figure 2.2: Ring Oscillators with XOR comparison and sampling at frequency f_s, as proposed by Sunar [32]

inherently susceptible to frequency injection attacks by power supply modulation [26]. When compared to transition effect ring oscillators, normal ring oscillators generate and accumulate entropy more slowly, as seen in papers by Varchola [33] [10]. Fischer's conclusion [5] means a rather large area is needed to secure randomness. For these reasons, it was decided not to use ring oscillators for this master thesis.

2.2 Open Loop Structures

Open Loop Structures (OLS) are TRNG elements proposed by Danger et al. [6]. These circuit components use registers that are initialized to a metastable state, and it is assumed that Gaussian electric noise decides how the outputs resolve. Metastability is induced by purposely violating setup and hold times. By taking a number of these units and again mixing the outputs using XOR gates [18], random bits can be obtained. A single cell looks like figure 2.3; the final chain is formed by linking a few of these together, with well-tuned delays to induce metastability everywhere. A very big downside to implementing such TRNGs is how difficult it is to induce

Figure 2.3: Single Open Loop Element with variable delay d_1, as proposed by Danger [6]

metastability in FPGA registers. These chips are developed to counteract unstable states where possible and are designed in such a way that they occur as little as possible. The only way to succeed is by generating the delays just right, which means that a lot of platform-specific R&D has to be done. Testament to this is the meticulous P&R work put into just one design by Lozach, Ben-Romdhane, Graba and Danger [24]. Their work translates well within most of the Virtex FPGA family, but would have to be completely redone when switching to a different family. Using metastability as a noise source can be done in an easier manner by inducing it in a specifically designed combinatorial logic loop rather than in an existing register, especially if that register is meant to eliminate the majority of metastable occurrences. This is the idea used in the TERO loops by Varchola et al., and it is what was used in this master thesis.

2.3 Transition Effect Ring Oscillators

2.3.1 General concept for TRNG use

In the same way as OLSs, TEROs induce metastability in a part of the FPGA and rely on ambient thermal noise to influence the settling behaviour. The circuitry used to achieve this forms a combinatorial logic loop with identical forward and backward paths which can both be excited. By switching the path controls at the same time, pulses are created that traverse the logic loop, creating a metastable state. A general schematic can be seen in figure 2.4. The physical model analysis done by Haddad et al. [14] shows an output waveform generated by a TERO loop when excited with an input control pulse. A rough approximation of the generated signal pulses can be found in figure 2.5 and is based on the NAND-TERO loop seen in figure 2.10. This circuit was proposed by Varchola et al. [34] as a follow-up to their earlier work [33].
The TRNG capabilities of TERO-Loops depend on the fact that pulses travelling through the loop grow shorter over time and eventually disappear, something that is

also apparent from figure 2.5. This means that the circuit eventually settles down and stops oscillating. The number of oscillations until the circuit settles on an output shows random behaviour. In order to complete the TRNG element starting from figure 2.10, we have to add at least a one-bit asynchronous counter, resulting in figure 2.6. Haddad et al. [14] have shown that the probability density function (pdf) of the number of oscillations can be described by a combination of erf-functions. It should be noted that the final value of the oscillating node cannot be used as random data, since it generally shows considerable bias to settle to a specific value. Testament to this is the fact that TERO loops have been proposed as elements for Physically Unclonable Functions (PUFs) [35]. The probability that the number of oscillations until settling N_osc is equal to q is described by the pdf p_q in the following way:

p_q = \frac{1}{2}\left[\operatorname{erf}\left(K\,\frac{1 - R^{q}\,q_0}{\sqrt{R^{2q} - 1}}\right) - \operatorname{erf}\left(K\,\frac{1 - R^{q+1}\,q_0}{\sqrt{R^{2q+2} - 1}}\right)\right]   (2.1)

K = \frac{1}{2\sqrt{2}\,\sigma_r}   (2.2)

q_0 = \frac{\log(r)}{\log(R)}   (2.3)

In these equations, R, σ_r and r are parameters that can be quantified by measurements on specific implementations of TERO loops. This model fits better than a normal Gaussian noise model, especially when the forward and backward paths have a different delay. An in-depth comparison of the model fits can be found in the work by Haddad et al. [14]. The eventual random bit is the least significant bit of the number of oscillations q in equation 2.1, so the probability p_{b=1} that this bit is 1 is equal to the probability that the number of oscillations was odd:

p_{b=1} = \sum_{k=0}^{+\infty} p_{2k+1}   (2.4)

The probability p_{b=0} that the bit is 0 is of course

p_{b=0} = 1 - p_{b=1} = \sum_{k=0}^{+\infty} p_{2k}   (2.5)

which is the probability that the number of oscillations was even.

2.3.2 XOR-AND-Loop

The first iteration of TERO loops proposed by Varchola et al.
in 2010 [33] consisted of two XOR gates and two AND gates, one of each in both the forward and backward paths, just like figure 2.7. The circuit has two stable configurations, both when the

Figure 2.4: General TERO concept with forward and backward path.

Figure 2.5: Waveforms found at the input and output nodes of TEROs at the moment of excitation. Based on Haddad et al. [14]

Figure 2.6: Full TRNG element, formed by adding an asynchronous counter (here a 1-bit counter) to the TERO loop from figure 2.10. Based on: Haddad et al. [14]

ctrl signal is high and when it is low. The pulse traversing the loop after excitation is the circuitry swinging between the two possibilities. The INV gates on the ctrl and reset lines are added so that all the internal loop signals can be routed inside one Configurable Logic Block (CLB). This way the difference in branch delay r from equation 2.3 can be made as small as possible, since the number of passes through the FPGA's internal switching matrix can be minimized. The terout signal is then fed to a TFF, just as in the proposal by Varchola et al.; see figure 2.8. Simply implementing this TERO loop in VHDL obviously causes combinatorial loops; this was the main goal, after all. Synthesis tools will try to optimize these away, by either removing the logic completely or combining multiple gates into a single Look-Up Table (LUT). This should be avoided, since it would add additional skew to the branch delays. The way to circumvent these unwanted optimizations is to instantiate and initialize each LUT manually and connect them by hand. On the Spartan-6 [40], this can be done by using Xilinx primitives, resulting in the code seen in appendix A. Just writing this code is not enough, since Place And Route (PAR) will still recognize the formed combinatorial loop and refuse to place it on the FPGA. Again, manual intervention is needed to force placement. Using the program PlanAhead [39], which is part of the Xilinx WebPack, User Constraint Files (UCFs) were generated, forcing the LUT placement. The placed logic can be seen in figure 2.9. On the Spartan-6, two neighbouring slices form one CLB [41], so this placement conforms to the requirements found by Varchola et al. [33]. A control sequence for the XOR-TERO loop is detailed in the same paper, and the output was measured. The same kind of pulse shortening as shown in figure 2.5 was found. They also found sufficient randomness in the bits generated at the TFF output.
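As a toy illustration of the TERO mechanism (a simplistic stand-in for, not a reproduction of, the physical model of Haddad et al. [14]), one can model the travelling pulse as a width that shrinks geometrically and picks up Gaussian noise on every round trip; the loop settles when the width reaches zero, and the random bit is the LSB of the round-trip count, as in equations 2.4 and 2.5. All constants below are illustrative assumptions:

```python
import random

def tero_oscillation_count(rng, width=1.0, ratio=0.985, noise=0.004):
    """Count loop traversals until the pulse width drops to zero. The pulse
    shrinks by a fixed ratio per round trip and accumulates Gaussian noise."""
    count = 0
    while width > 0:
        width = width * ratio + rng.gauss(0.0, noise)
        count += 1
        if count > 10_000:        # guard against a loop that never settles
            return None
    return count

def tero_bits(n_bits, seed=None):
    """Random bits taken as the LSB of the oscillation count."""
    rng = random.Random(seed)
    bits = []
    while len(bits) < n_bits:
        c = tero_oscillation_count(rng)
        if c is not None:         # discard the (rare) never-settling case
            bits.append(c & 1)
    return bits
```

Because the noise accumulated over hundreds of round trips spreads the settling count over many values, its parity is close to fair even though the count itself clusters around a typical value, mirroring the erf-shaped distribution described above.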
Findings

Using the XOR-TERO in practice proved difficult. The circuit in a general FPGA environment was very sensitive to both global, absolute placement (where it was put with respect to the whole FPGA architecture) and local, relative placement (where it was put with respect to other actively used logic). Badly placed TERO loops either showed deterministic data, which never passed the statistical tests in section 3.2, or never stopped oscillating, which meant the dampening detector from section 3.3.2 never signalled that a bit could be read out. In practice this resulted in TEROs not outputting any acceptable bits. Since the synthesized logic gets placed algorithmically by the PAR process, moving one TERO loop could result in other loops no longer working. After a few iterations, it was decided to switch to the NAND-TERO loop, proposed again by Varchola et al. as an improvement upon the first iteration [34].

Figure 2.7: Schematic representation of Varchola's XOR-TERO-Loop

Figure 2.8: Full TRNG element, formed by adding an asynchronous counter (here a 1-bit counter) to the XOR-AND-TERO loop from figure 2.7. Based on: Varchola et al. [33]

2.3.3 NAND-INV-Loop

The reason Varchola et al. [34] started working with the structure seen in figure 2.10 is twofold. A nice feature is that there is no longer a need for a separate reset signal to put the circuit in the same electric starting conditions, something that was required in the XOR loop; this is now merged with the control signal. The main advantage, however, lies in the elongated reset phase of the control cycle. Giving the circuit more time to fall back to its initial conditions gives better assurance that all lines will have converged to this state. This in turn results in a more reliable and robust system than what could be achieved with the XOR-AND version. This variant of the TERO loop is also more easily adjusted. By adding more inverters in both the forward and backward branches, it is possible to elongate the absolute

Figure 2.9: PlanAhead view of the placed XOR-TERO

delays, lowering the relative delay difference seen in equation 2.3. This means that a higher theoretical entropy rate can be achieved, as seen in Haddad et al. [14]. As before, the buffer LUT on the ctrl line is used so that the internal signals can all be routed inside one CLB, minimizing the passes through the switching matrix and keeping the relative differences in line length as small as possible. This way, metastability is easier to achieve. To circumvent the synthesis tool optimizations, exactly the same techniques have been used as for the XOR loop. The VHDL code using Xilinx primitives can be found in appendix A, and placement using PlanAhead can be seen in figure 2.11. In the code, a reset signal is still used, but only to clear the TFF. Resetting the internal TERO loop is still done by putting ctrl to 0, making use of the better electrical reset possibilities.

Findings

Switching from the XOR-TERO to the NAND-INV version solved most of the placement issues. The instabilities found in the original XOR-TERO are much diminished in the second version, resulting in more reliable operation. Together with the easier control cycle, these were the reasons to pick NAND-INV TEROs for the final design and implementation.

Figure 2.10: Schematic representation of Varchola's NAND-TERO-Loop

Figure 2.11: PlanAhead view of the placed NAND-TERO


Chapter 3

On the fly Testing

As mentioned before, at least a minimal live-testing system should be added to any TRNG design or implementation. This way, statistical quality degradation can be caught before data is sent to the client. For this master thesis, a number of tests based on the NIST drafts [2] were implemented in section 3.2. Two tests specific to the chosen sources were also conceived, based on the expected operation of the used entropy sources; these are discussed in section 3.3. They do not check the validity of the eventual output bit, but rather check whether metastability inside the source is achieved and has settled according to plan. The full flowchart which places both the specific and the generic testing in context can be found in figure 3.7. The tests, together with the TERO loops discussed in chapter 2, form subsystems which will be used as the main building blocks in chapter 4. A high-level view of the subsystem is visualized in figure 3.6. In this chapter the words package, chunk and stream are used interchangeably for buffered vectors of bits generated by the sources from chapter 2, where new bits get added at position 0 and older ones get shifted upward each time a new bit is added.

3.1 Error Types

When testing for randomness, two kinds of errors can occur: either random data is concluded to be non-random, which is called a Type 1 error, or non-random data is falsely labelled as random data, a Type 2 error. Because randomness and unpredictability are strict requirements for the generated data, it is seen as unacceptable to introduce more Type 2 errors than the prescribed tests inherently bring along. This means that whenever simplification of the decision levels was necessary for the implementation, it was done in such a way that more Type 1 errors and fewer Type 2 errors would occur.
In practice this comes down to being more hesitant to label data as random and setting the boundaries of acceptability stricter than the test definition requires: lower bounds were moved and/or rounded up, and upper bounds were moved and/or rounded down, in both sections 3.2.1 and 3.2.2. This was done because failing these tests

caused the data under inspection to be discarded. If the decision had been made to shut the TRNG down on failure, Type 2 errors would have been preferable instead.

3.2 Generic tests

The tests proposed by NIST [2] work on chunks of data and use hypothesis testing on these packages to determine whether they are statistically valid random data streams. Not all tests can be used in a hardware implementation, as seen by Yang [42]. For this thesis, two tests have been selected from this subset, based mainly on implementation simplicity and their necessity in the validity-verification process. For example, the MonoBit test in section 3.2.1 is a preliminary that has to be passed in order for the Runs test in section 3.2.2 to be taken into account, but also for the remaining tests detailed in the NIST Draft [2]. The length of the test blocks n was set to 200 for every test, which is about double the minimum size recommended by NIST, but still small enough to easily fit on cheap FPGAs. The decision tree for the generic testing block can be found in figure 3.4.

3.2.1 Frequency MonoBit Test

From the NIST Draft [2]: The focus of the test is the proportion of zeroes and ones for the entire sequence. The purpose of this test is to determine whether the number of ones and zeros in a sequence are approximately the same as would be expected for a truly random sequence. The test assesses the closeness of the fraction of ones to 1/2, that is, the number of ones and zeroes in a sequence should be about the same. All subsequent tests depend on the passing of this test.
The detailed explanation can be found in the draft, but the main decision rule comes down to checking inequality 3.1, which, when it holds, means the block fails the test:

α > \operatorname{erfc}\left(\frac{s_{obs}}{\sqrt{2}}\right)   (3.1)

s_{obs} = \frac{|S|}{\sqrt{n}}   (3.2)

S = X_{ones} - (n - X_{ones}) = (n - X_{zeroes}) - X_{zeroes}   (3.3)

where α is the decision level, erfc is the Complementary Error Function (CEF), n is the length of the data stream under inspection, and S in 3.2 measures the deviation of the number of ones (or zeroes) away from a perfectly balanced package. S can easily be computed using equation 3.3 once n is known or set; here X_ones and X_zeroes are the respective numbers of ones and zeroes in the package. Implementing a decent approximation of the CEF on an FPGA would consume a huge area and seems infeasible when using a smaller chip. This is circumvented by

Significance level α   Lower bound   Upper bound   Lower bound (hex)   Upper bound (hex)
1%                     82            118           0x52                0x76
≈ 2^-40                50            150           0x32                0x96

Table 3.1: Lower and upper bounds on the number of ones for the implemented MonoBit frequency test, in base 10 and base 16

using a hard-coded decision level α, decreasing flexibility but transforming equation 3.1 into two easy comparisons once the number of ones or zeroes has been tallied. A more detailed explanation and analysis has been done by Rozĩc [36].

Final design decisions

The final design for this thesis has two MonoBit frequency tests running at different decision levels: one at level α_main = 0.01 and another at a catastrophic decision level α_catastrophic of approximately 2^-40, which is what NIST recommends in Draft B [11]. If the first test fails, the tested block is discarded and not offered to be sent to the client. No error signal is given, because it is perfectly possible for a block of actual random data to fail the test; in fact, this is expected to happen about once every 100 generated packages, as that is the meaning of α. The second test (with α_catastrophic ≈ 2^-40) is quite hard to fail, so when it does fail, the conclusion is that something important has gone wrong and an error signal is given. In the current implementation this will cause the TERO to be shut down until a global reset is issued. On top of the lax (that is, α_main = 0.01) MonoBit test, an extra safety measure is implemented: if this test fails five subsequent times, the decision is also made to send out an error message and switch off the associated source. This is done because even perfectly random data that is expected to fail the test by chance still only has a chance of one in ten billion (α_secondary = α_main^5 = 10^-10) of failing five times in a row at the chosen significance level (α_main = 0.01).
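Hard-coded bounds of this kind can be reproduced offline by scanning all possible counts of ones and applying the erfc decision rule of equations 3.1-3.3 directly, for example in Python (a sketch, assuming n = 200 and the two decision levels used here):

```python
from math import erfc, sqrt

def monobit_bounds(n=200, alpha=0.01):
    """Smallest and largest number of ones for which a length-n block still
    passes the MonoBit test at significance level alpha (eqs. 3.1-3.3)."""
    passing = [x for x in range(n + 1)
               # S = 2x - n, s_obs = |S| / sqrt(n), pass if erfc >= alpha
               if erfc(abs(2 * x - n) / sqrt(n) / sqrt(2)) >= alpha]
    return min(passing), max(passing)
```

For the lax level this yields (82, 118), i.e. 0x52 and 0x76, and for α = 2^-40 it yields (50, 150), i.e. 0x32 and 0x96, so the two FPGA comparisons per test reduce to checking the tallied count of ones against these constants.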
The lower and upper bounds on the number of ones or zeroes for both the lax and the strict test can be found in table 3.1.

3.2.2 Runs Test

From the NIST Draft [2]: The focus of this test is the total number of runs in the sequence, where a run is an uninterrupted sequence of identical bits. A run of length k consists of exactly k identical bits and is bounded before and after with a bit of the opposite value. The purpose of the runs test is to determine whether the number of runs of ones and zeros of various lengths is as expected for a random sequence. In particular, this test determines

whether the oscillation between such zeros and ones is too fast or too slow.

Again, the details can be found in the draft. The main decision is made by checking inequality 3.4, which, when it holds, means the package fails the test:

α_runs > \operatorname{erfc}\left(\frac{|V_n(obs) - 2n\pi(1-\pi)|}{2\sqrt{2n}\,\pi(1-\pi)}\right)   (3.4)

\pi = \frac{\sum_{k=1}^{n} P_k}{n}   (3.5)

where α_runs again is the significance level of the test, V_n(obs) the number of observed runs, or switches between zeroes and ones, n the number of bits in the package under test, and π the proportion of ones in that package. π can be described mathematically by 3.5 if the stream is written as an array P with index k. The number of runs is easily kept track of during accumulation of a new 200-bit package: each time a newly committed bit differs from the previous one, a counter is increased. At the end of an accumulation phase, this number of runs is ideally compared against the significance level α using equation 3.4. Again, the CEF is too difficult or too large to implement on the cheaper FPGAs, so simplifications are needed. By hard-coding α_runs, equation 3.4 can be transformed into an inequality in V_n(obs) with parameter X_ones (or X_zeroes, see equation 3.3). As explained by Rozĩc [36], this gives two integer curves against which the final number of runs has to be compared: it has to fit between the lower and upper bounds of the integer solution of equation 3.4. Since the Runs test is only relevant if the MonoBit frequency test was passed, these curves have only been constructed between the lower and upper bounds of the lax test found in table 3.1.

Final design decisions

The graphs for the used curves can be found in figures 3.1 and 3.2. Simplifying equation 3.4 gives the smooth blue curves labelled calculated max/min runs. Since these numbers are not integers, they were first rounded to the nearest integer, resulting in the red rounded max/min runs curves.
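The calculated curves can be recomputed offline in the same spirit; the sketch below inverts equation 3.4 using the standard normal quantile from Python's statistics module and applies the stricter inward rounding described in section 3.1 (the fitted curves are design choices of this thesis and are not reproduced here):

```python
from math import sqrt, ceil, floor
from statistics import NormalDist

def runs_bounds(x_ones, n=200, alpha=0.01):
    """Integer lower/upper bounds on the number of runs V_n for which a
    block with x_ones ones passes the Runs test (eq. 3.4), with bounds
    rounded inward (stricter, favouring Type 1 errors)."""
    pi = x_ones / n
    mu = 2 * n * pi * (1 - pi)       # expected number of runs
    # erfc(t) >= alpha  <=>  t <= erfcinv(alpha) = inv_cdf(1 - alpha/2)/sqrt(2)
    t_max = NormalDist().inv_cdf(1 - alpha / 2) / sqrt(2)
    half_width = t_max * 2 * sqrt(2 * n) * pi * (1 - pi)
    return ceil(mu - half_width), floor(mu + half_width)
```

For a perfectly balanced 200-bit block (100 ones) this gives the band (82, 118) runs; the band shrinks and shifts as the count of ones moves away from 100, which is exactly the curve shape plotted in figures 3.1-3.3.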
After that, these rounded curves were approximated by the yellow fit max/min runs curves, which were constructed in such a way that jumps in the respective bounds coincide at the same number of ones. This was done in order to lessen the combinatorial logic needed during assessment, since it approximately cuts the decision tree for passing or failing the Runs test in half. The band of measurements that is seen as acceptable is most clearly visualized in figure 3.3, where the calculated and the fitted curves are plotted for both the lower and upper bounds as a function of the number of ones. Numerical values for the rounding and fitting can be found in table 3.2.

Figure 3.1: The upper bound for the number of runs in the Runs Test as a function of the number of ones in a 200-bit package at a significance level α = 1%

Figure 3.2: The lower bound for the number of runs in the Runs Test as a function of the number of ones in a 200-bit package at a significance level α = 1%

Figure 3.3: Comparison of the fitted curves and the exactly calculated curves for the number of runs as a function of the number of ones in a 200-bit package at a significance level α = 1% for the Runs Test

Figure 3.4: Full decision tree implemented in the generic testing unit

nb. of ones   Calculated lower bound runs   Calculated upper bound runs   Fitted lower bound runs   Fitted upper bound runs

Table 3.2: Numerical values for the upper and lower bounds on the number of runs as a function of the number of ones in a 200-bit package at a significance level of α = 1% for the Runs Test

Figure 3.5: Basic detector to see whether the input signal has encountered a rising edge since the last reset

3.3 Specific Tests

Where generic testing checks the statistical properties of the generated data, specific tests evaluate the operation of the used sources and compare it against the expected behaviour of the TERO elements described in section 2.3. In the flowchart depicted in figure 3.4, both tests discussed here fit in the Produce Bit block, since they are required to complete correctly before any bit is added to the package under test in section 3.2.

3.3.1 Oscillation Detection

The first step in generating a new bit using TERO loops is inducing metastability. This means oscillations should occur, and at least one rising edge of the TFF output signal has to be detected. Detecting this is easily achieved by using a small circuit like the one in figure 3.5: resetting the FF puts the Oscillated output to 0, and leaving the Reset signal low allows a logic 1 to be loaded into the memory element on any rising edge of the Input signal. If metastability was supposed to be induced but no rising edge is detected within an acceptable period, the system decides the associated TERO has failed catastrophically, and it is switched off until the next general system reset. A non-oscillating loop is useless for random bit generation.

3.3.2 Dampening Detection

Once the TERO starts oscillating, it is expected to do so for a finite time. Before the resulting bit can reliably be read out, the internal metastable loop has to settle down. In order to make a reliable system, this settling has to be detected and used as the enabling signal to proceed with the readout. An extra testing block has been designed to check for this expected settling. It uses an oscillation detector just like the one described in section 3.3.1, but it is reset more often, namely every few clock cycles. The decision to signal that the oscillations have stopped is made

Figure 3.6: High-level view of the TERO + testing subsystem

when, just before a reset would happen, no more oscillations have taken place and the Oscillated output of the circuit in figure 3.5 is low. A problem that could occur here is that a loop keeps on oscillating and never settles down. It was mentioned before that this happened regularly with the XOR-AND-TEROs (section 2.3.2); the NAND-INV version did not show this behaviour during testing (section 2.3.3). As can be seen in the general flowchart in figure 3.7, this problem would cause the subsystem to hang. No ERROR decision is made here, because once settling occurred, the bit would of course be eligible as random data and could still be useful.
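The dampening detector's behaviour can be modelled in software as follows (the window length and timeout are illustrative assumptions, not the actual cycle counts used on the FPGA):

```python
def wait_for_settling(osc_trace, window=8, timeout=10_000):
    """Model of the dampening detector: the oscillation flag is reset every
    `window` cycles; settling is declared when a whole window passes with no
    rising edge on the TERO output. Returns the cycle index of the decision,
    or None on timeout (a loop that never stops oscillating)."""
    last = 0
    edges = 0
    for t, level in enumerate(osc_trace[:timeout]):
        if level == 1 and last == 0:
            edges += 1                # rising edge seen in this window
        last = level
        if (t + 1) % window == 0:     # periodic reset point
            if edges == 0:
                return t              # full window without edges: settled
            edges = 0
    return None                       # still oscillating (or trace ended)
```

A trace that keeps toggling makes the function return None, which corresponds to the hang scenario in figure 3.7; a trace that dies out is flagged at the first edge-free window boundary after the oscillations stop.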

Figure 3.7: Full flowchart for Bit Generation and Testing

Chapter 4

Implementation on the LX9 MicroBoard

With the sources and tests from chapters 2 and 3 in place, a few final parts remain to be implemented on the FPGA. In order to communicate with Linux through a serial port, a simple simplex UART interface was made (section 4.1). The block that interfaces a number of copies of the system depicted in figure 3.6 to the UART is called the Aggregator and is described in section 4.2. The switches, LEDs and push-button located on the LX9 MicroBoard [1] were also used, and their designation is discussed in section 4.3. The whole system, together with the tests and entropy sources, was implemented and synthesized. Programmable bitstreams and associated MCS files for the LX9 were generated, and those MCS files were loaded into the flash memory to be automatically read into the FPGA on start-up.

4.1 FPGA side of the USB link

The UART bridge interface available on the LX9 is a CP2102/9 developed by Silicon Labs [30]. It offers baud rates up to 921600 bits/s, which gives a throughput of approximately 0.9 Mbit/s or 0.11 MBps. Since it is expected that this block will turn out to be the bottleneck of the system, these numbers give an early estimate of the expected results in chapter 5. A faster interface implementation should give a considerable increase in system performance. A solution to consider here could be the Serial Peripheral Interface (SPI), which is present on the LX9 as the programming port for the flash memory. The starting code for the UART is based on a public Git project [4], onto which a number of changes were made to make the block fit more neatly into the system.

Swapping out the Interface

Integrating a new interface into the global system has been made easy, in order to allow for the aforementioned upgrade in transmission speed. All that is important is to give a

Figure 4.1: Schematic overview of the Aggregator block connections towards the hardware interface

similar interface towards the Aggregator block. Three signals are important:

- An 8-bit wide line from the Aggregator, on which the data to be sent is offered one byte at a time
- A 1-bit wide line from the Aggregator, to request that the interface send the offered byte
- A 1-bit wide line from the interface, to signal that the offered byte can be replaced by the next one, either because it has been buffered or because the sending is complete

Figure 4.1 shows how the Aggregator and the UART interface are currently linked up.

4.2 Aggregator block

A high-level schematic of the Aggregator block is shown in figure 4.3. The most important data and control signals are drawn. The 200-bit sequences produced in chapter 3 have to be read out, buffered and chopped into bytes to be compatible with the interface on the FPGA side of the UART block. In practice this means offering the generated data byte per byte and waiting for the UART to complete a send action in between. At the same time, the TERO subsystems have to be read out, and post-processing should be applied when demanded. The chosen implementation uses two intertwined, synchronized Finite State Machines (FSMs). The first FSM is responsible for Entropy Source (ES) readout

while the second is used to communicate with the UART. Readout is done by checking whether a tested entropy-source package signals that its 200 bits of data are good to be read. If this is the case, the data is read and put into an intermediate buffer, so the entropy source can start generating random bits again. The buffer used consists of two 200-bit registers, for a total size of 400 bits. When new data has to be queued up for the UART, the UART Communication Control FSM checks whether the readout buffers hold enough data. How "enough" should be interpreted depends on whether XOR post-processing is needed. If whitening should be applied, the intermediate buffer has to be completely filled in order to generate 200 bits in the Sender buffer; in that case, the upper and lower halves of the intermediate buffer are combined by bitwise XOR-ing. If there is no need for the post-processing step, only the lower buffer has to be filled, and the data present there can be transferred to the Sender buffer without any extra manipulation. The Readout Control FSM keeps track of the two intermediate 200-bit registers and of which of the two are filled. If possible, it tries to shift useful data down to the lower register; if that one is filled, it tries to fill the upper part of the buffer. The registers are cleared when the UART Comm Control signals that the Send buffer was completely transferred to the UART unit. Depending on whether XOR post-processing is enabled, Readout Control knows whether both registers were used to generate the new Send buffer or just the lower one. All used registers are cleared, of course, since otherwise unwanted correlation in the sent data would be generated. While the Readout Control is busy with the bookkeeping of the intermediate buffers, the UART Communication Control FSM offers a byte of data from the Send register to the UART.
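The readout, optional whitening and byte-chopping path just described can be modelled behaviourally as follows (a sketch: the LSB-first byte order and the generator structure are assumptions for illustration, not taken from the VHDL):

```python
def aggregate(blocks, xor_postprocess):
    """Behavioural model of the Aggregator: consume tested 200-bit blocks
    (lists of 0/1 ints), optionally XOR two blocks together (whitening),
    then chop the resulting send buffer into bytes for the UART."""
    pending = []
    for block in blocks:
        pending.append(block)
        if len(pending) < (2 if xor_postprocess else 1):
            continue                      # intermediate buffer not yet full
        if xor_postprocess:
            lower, upper = pending
            send = [a ^ b for a, b in zip(lower, upper)]
        else:
            send = pending[0]
        pending = []                      # clear used registers after use
        for i in range(0, len(send), 8):  # chop into bytes, LSB first
            byte = 0
            for j, bit in enumerate(send[i:i + 8]):
                byte |= bit << j
            yield byte
```

Feeding two identical all-ones blocks with whitening enabled yields all-zero bytes, which illustrates why the real design must clear every used register between rounds: reusing a block would correlate the output data.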
Then a Send Request is issued to the UART block, and the FSM waits until it receives confirmation that the byte has been sent. The buffer is then shifted down eight bits so that a new byte appears at the UART input, and the process is repeated. The mentioned wait period was the reason to decide on the double-FSM implementation, since it is easier to have one wait state towards the UART block than to write the extra logic into every state of the Readout Control. When the whole Send buffer has been shifted through, the FSM waits until enough of the intermediate buffer is filled; again, "enough" depends on the need to apply post-processing. Enabling the XOR-ing will cause the Send buffer to be filled using both halves of the intermediate buffer; otherwise just the lower half is transferred.

4.3 User Control Options

The LX9 offers four switches and one push-button as board input controls [1]. Since the decision was made to keep the UART simplex, these are the only five ways to control the workings at run-time. Information regarding the inner state of the system is output through the four installed LEDs, which, together with the switches and the button, can be seen in figure 4.2. The LEDs each show information about one of the TERO packages. The information shown on the LEDs depends on the positions of the switches, as does some of the working of the FPGA. The inputs

Figure 4.2: Front view of the LX9 MicroBoard

have been configured as follows:

- The push-button: pushing the button issues a global system reset, effectively putting the FPGA back into the state it was in on start-up
- GPIO switch 1: switches the Running signal on or off, determining whether the sources generate new bits or not. Turning this off will stop generation of data; once all data that was still buffered has been sent to the OS, the transfer stream will halt until the switch is flipped again
- GPIO switches 2 and 3: allow flipping through the possible data shown on the LEDs; currently the following combinations can be used:
  - Switch 2 low, 3 low: shows error signals on the LEDs. For each lit LED, one of the four TERO loops has been diagnosed with a fatal error and switched off
  - Switch 2 high, 3 low: shows which TERO loop packages have been used to transmit data towards the OS so far
  - Switch 2 low, 3 high: shows which TERO loop package currently has approved data ready to be read out by the Readout FSM in section 4.2
  - Switch 2 high, 3 high: gives exactly the same as having both switches in their low position
- GPIO switch 4: enables or disables XOR post-processing as discussed in section 4.2

Solutions for cheaper FPGAs

All work and implementation was done for the Spartan-6 LX9, which is the second cheapest chip in the Spartan-6 line, since the LX9 MicroBoard was provided. However,