Concurrency
Gerard Tel / Jacco Bikker, November 2015 to January 2016

Welcome!

Agenda:
- Introduction
- Path Tracing
- C# and SIMD
- Vectorization
- Speculative Execution

Concurrency, lecture 7: Vectorization

Introduction

What came before:
- Hardware:
  - Multiple cores
  - Vector instructions
  - Caches, cache lines, data locality
- Patterns:
  - Map pattern
  - Patterns and hardware
- Performance

Introduction: Hardware, Low-Level Parallelism

Instruction processing by the CPU:
1. Fetch
2. Decode
3. Execute
4. Writeback

Introduction: Hardware, Low-Level Parallelism

Pipelining: parallel execution of the stages. The clock speed is determined by the most complex stage.

Introduction: Hardware, Low-Level Parallelism

Optimizing pipelining:
- simplify the stages
- increase the number of stages
- higher clock speed

Introduction: Hardware, Low-Level Parallelism

Execution units are complex and specialized:
- floating point operations
- integer operations
- memory operations

Introduction: Hardware, Low-Level Parallelism

This ultimately leads to the superscalar pipeline: per clock cycle, multiple instructions are fetched and multiple instructions are executed.

Superscalar: conditions
- Instructions do not use the same execution units
- Instructions do not need each other's results
- Flow is predictable
- Data is available

Introduction: Hardware, Low-Level Parallelism

Feeding the Beast
- Dependencies, execution units: out-of-order execution
- Flow: branch prediction
- Data: caches, registers
- Stage cost: VLIW (very long instruction word)

Introduction: Hardware, Low-Level Parallelism

Reality check: which code is faster?

    int[] arr = new int[64 * 1024 * 1024]; // 256MB

    // loop 1
    for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

    // loop 2
    for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

Loop 2 is only about 2x faster, not 16x! Cause: cache lines. With i += 16, every operation lands in a different cache line (16 * sizeof(int) == 64 bytes), so both loops still pull every cache line of the array from memory.

Introduction: Hardware, Low-Level Parallelism

Reality check: which code is faster?

    int steps = 64 * 1024 * 1024;
    int[] a = new int[4];

    // loop 1
    for (int i = 0; i < steps; i++) { a[0]++; a[0]++; a[0]++; a[0]++; }

    // loop 2
    for (int i = 0; i < steps; i++) { a[0]++; a[1]++; a[2]++; a[3]++; }

Loop 2 is about 2x faster. Cause: instruction dependencies. In loop 1 every increment needs the result of the previous one, so the CPU cannot execute them simultaneously.

Introduction: Hardware, Low-Level Parallelism

Feeding the Beast
- Dependencies, execution units: out-of-order execution
- Flow: branch prediction
- Data: caches, registers
- Stage cost: VLIW (very long instruction word)

Murphy
- Under-utilization: bubble
- Branch misprediction: pipeline clear
- Cache miss: stall, hyperthread

Introduction: Hardware, VLIW

Vector instructions:

    Vector4 a = { 1, PI, e, 2 };
    Vector4 b = { 2, 2, 4, 4 };
    Vector4 c = a * b;

SIMD: Single Instruction, Multiple Data

Idea:
- Function A4 consists solely of instructions that process 4 items
- Executing A4 then takes just as many instructions as A
- The throughput of A4 is four times higher

But: then we do have to apply the map pattern on a large scale.
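The slide's vector multiply can be tried out directly in C#. The sketch below is a minimal example that assumes the System.Numerics.Vector4 type introduced later in this lecture. The class and method names are made up for illustration, and whether the multiply really becomes a single SIMD instruction depends on the JIT and the hardware.

```csharp
using System;
using System.Numerics;

public static class SimdMultiplyDemo
{
    // Multiplies two 4-wide vectors element by element.
    public static Vector4 Multiply()
    {
        var a = new Vector4(1f, (float)Math.PI, (float)Math.E, 2f);
        var b = new Vector4(2f, 2f, 4f, 4f);
        // One vector multiply takes the place of four scalar multiplies.
        return a * b;
    }

    public static void Main()
    {
        Vector4 c = Multiply();
        // c.X is 1 * 2 and c.W is 2 * 4.
        Console.WriteLine($"{c.X} {c.Y} {c.Z} {c.W}");
    }
}
```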

Path Tracing: Large Scale

Energy transport: from light source to camera, via surfaces
- Light source: sky, area light
- Camera: via a pixel on a virtual screen (sensor)
- Surfaces: diffuse, specular, glass, emissive

Intuitive, as in nature: follow a quantity of photons.
Not intuitive, but practical: reverse the paths (camera, through pixel, to light).

Path Tracing: Large Scale

Paths:
- Vertices: point on the camera, points on geometry, point on the light
- Connected by rays

Ray: P(t) = O + tD
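As a small illustration of the ray equation, the sketch below evaluates P(t) = O + tD with System.Numerics.Vector3. The helper name PointAt is an assumption made for this example; it is not part of the course template.

```csharp
using System;
using System.Numerics;

public static class RayDemo
{
    // Evaluates P(t) = O + t * D for a ray with origin O and direction D.
    public static Vector3 PointAt(Vector3 origin, Vector3 direction, float t)
    {
        return origin + t * direction;
    }

    public static void Main()
    {
        var O = new Vector3(0f, 1f, 0f);
        var D = Vector3.Normalize(new Vector3(0f, 0f, 2f)); // unit vector along +z
        Vector3 P = PointAt(O, D, 5f);                      // five units along the ray: (0, 1, 5)
        Console.WriteLine(P);
    }
}
```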

Path Tracing: Large Scale

    // feed me with primary ray for a pixel
    Color Trace( vec3 O, vec3 D )
    {
        // find intersection point, normal, material
        I, N, mat = Intersect( O, D );
        // terminate path if we found a light
        if (mat.islight()) return mat.emissive;
        // otherwise, bounce randomly and recurse
        vec3 R = RandomReflection( N );
        return mat.color * dot( N, R ) * Trace( I, R );
    }

Path Tracing: Large Scale

Optimizing path tracing:
- Multi-threading: opportunities on a large scale, embarrassingly parallel
- Better results with fewer paths: variance reduction
- Instruction-level parallelism: concurrency within a single thread

C# and SIMD

C++:

    __m128 x4 = _mm_set_ps( 1, 5, 3.14f, 0.1f );
    __m128 y4 = _mm_set_ps( 2, 2, 2, 2 );
    __m128 result4 = _mm_add_ps( x4, y4 );
    // result: 3, 7, 5.14, 2.1

C#: ?

C# and SIMD: RyuJIT

Required for Windows 7 / VS2013:
- .NET 4.6
- NuGet 2.8.6 (you may need to uninstall 2.8.5 first)
- Install-Package System.Numerics.Vectors -Pre
- x64

Note: much of the information on the internet discusses older versions of RyuJIT. You do not need any environment variables.

C# and SIMD: System.Numerics.Vectors

    namespace System.Numerics
    {
        public struct Vector3 : IEquatable<Vector3>, IFormattable
        {
            public float X;
            public float Y;
            public float Z;
            public Vector3(float value);
            public Vector3(Vector2 value, float z);
            public Vector3(float x, float y, float z);
            public static Vector3 operator -(Vector3 value);
            public static Vector3 operator -(Vector3 left, Vector3 right);
            public static bool operator !=(Vector3 left, Vector3 right);
            public static Vector3 operator *(float left, Vector3 right);
            public static Vector3 operator *(Vector3 left, float right);
            public static Vector3 operator *(Vector3 left, Vector3 right);
            public static Vector3 operator /(Vector3 value1, float value2);
            public static Vector3 operator /(Vector3 left, Vector3 right);
            public static Vector3 operator +(Vector3 left, Vector3 right);
            public static bool operator ==(Vector3 left, Vector3 right);
            // ...
        }
    }

C# and SIMD: System.Numerics.Vectors

Example, C# code:

    Vector3 D = Vector3.Normalize( T - P );

Assembler:

    vsubps    xmm0,xmm1,xmm6            ; subtract
    vmovaps   xmm1,xmm0
    vdpps     xmm1,xmm1,xmm0,0f1h       ; dot
    vcvtss2sd xmm1,xmm1,xmm1            ; float to double
    vsqrtsd   xmm1,xmm0,xmm1            ; square root
    vcvtsd2ss xmm1,xmm1,xmm1            ; double to float
    vmovupd   xmmword ptr [rsp+40h],xmm0
    vmovss    xmm0,dword ptr [7F90A518h]
    vdivss    xmm0,xmm0,xmm1            ; division
    ...

C# and SIMD

Done? No.

    Vector3 D = Vector3.Normalize( T - P );

    Vector3 A = T - P           // 75%
    float B = dot( A, A )       // 75%
    Vector3 C = { B, B, B }     // 75%
    Vector3 D = A / C           // 75%

A Vector3 fills only three of the four SIMD lanes, so each step uses at most 75% of the available width.

Vectorization: Scalar Flow

    Vector3 D = Vector3.Normalize( T - P );

Written out as scalar operations, this flow runs four times, once for each of the streams 0, 1, 2 and 3:

    A = T.X - P.X
    B = T.Y - P.Y
    C = T.Z - P.Z
    D = A * A
    E = B * B
    F = C * C
    F += E
    F += D
    G = sqrt( F )
    D.X = A / G
    D.Y = B / G
    D.Z = C / G

Vectorization: Scalar Flow

Optimal normalization with SIMD: the input is 4 vectors.

Vectorization: Scalar Flow

    scalar (A)           vectorized (A4)
    A = T.X - P.X        A4 = TX4 - PX4
    B = T.Y - P.Y        B4 = TY4 - PY4
    C = T.Z - P.Z        C4 = TZ4 - PZ4
    D = A * A            D4 = A4 * A4
    E = B * B            E4 = B4 * B4
    F = C * C            F4 = C4 * C4
    F += E               F4 += E4
    F += D               F4 += D4
    G = sqrt( F )        G4 = sqrt4( F4 )
    D.X = A / G          DX4 = A4 / G4
    D.Y = B / G          DY4 = B4 / G4
    D.Z = C / G          DZ4 = C4 / G4
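The vectorized column maps almost one-to-one onto C#. The sketch below is one possible rendering, using System.Numerics.Vector4 as a fixed 4-lane type. The method name Normalize4 and the test data are assumptions for illustration only.

```csharp
using System;
using System.Numerics;

public static class Normalize4Demo
{
    // Normalizes four 3D vectors at once.
    // Each Vector4 holds one component (x, y or z) of four different vectors.
    public static void Normalize4(ref Vector4 x4, ref Vector4 y4, ref Vector4 z4)
    {
        // Four lengths computed in one pass, matching G4 = sqrt4( F4 ).
        Vector4 len4 = Vector4.SquareRoot(x4 * x4 + y4 * y4 + z4 * z4);
        x4 /= len4;
        y4 /= len4;
        z4 /= len4;
    }

    public static void Main()
    {
        // Four input vectors: (3,0,4), (0,2,0), (1,2,2), (0,0,5).
        var x4 = new Vector4(3f, 0f, 1f, 0f);
        var y4 = new Vector4(0f, 2f, 2f, 0f);
        var z4 = new Vector4(4f, 0f, 2f, 5f);
        Normalize4(ref x4, ref y4, ref z4);
        // Lane 0 now holds approximately (0.6, 0, 0.8).
        Console.WriteLine($"{x4.X} {y4.X} {z4.X}");
    }
}
```

Note that none of the four lanes is wasted here, in contrast to the 75% utilization of the Vector3 version.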

Vectorization: Path Tracing with SIMD

    class Ray
    {
        Vector3 O;
        Vector3 D;
        float t, u, v;
        Vector3 N;
        int objidx;
    }
    Ray[] ray = new Ray[4];

    class Ray4
    {
        Vector4 OX4, OY4, OZ4;
        Vector4 DX4, DY4, DZ4;
        Vector4 t4, u4, v4;
        Vector4 NX4, NY4, NZ4;
        Vector4i objidx;
    }
    Ray4 ray;

    OX4 = new Vector4( ray[0].O.X, ray[1].O.X, ray[2].O.X, ray[3].O.X );
    OY4 = new Vector4( ray[0].O.Y, ray[1].O.Y, ray[2].O.Y, ray[3].O.Y );
    OZ4 = new Vector4( ray[0].O.Z, ray[1].O.Z, ray[2].O.Z, ray[3].O.Z );

Vectorization: SIMD Data

    class Ray
    {
        Vector3 O;
        Vector3 D;
        float t, u, v;
        Vector3 N;
        int objidx;
    }
    Ray[] ray = new Ray[4];

AoS: Array of Structures

Vectorization: SIMD Data

    class Ray4
    {
        Vector4 OX4, OY4, OZ4;
        Vector4 DX4, DY4, DZ4;
        Vector4 t4, u4, v4;
        Vector4 NX4, NY4, NZ4;
        Vector4i objidx;
    }
    Ray4 ray;

SoA: Structure of Arrays

Vectorization: SIMD Data

    class Ray4
    {
        Vector<float> OX4, OY4, OZ4;
        Vector<float> DX4, DY4, DZ4;
        Vector<float> t4, u4, v4;
        Vector<float> NX4, NY4, NZ4;
        Vector<int> objidx;
    }
    Ray4 ray;

SoA: Structure of Arrays
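Going from AoS to SoA is a regrouping of the same numbers. The sketch below shows that step for four ray origins. ToSoA is a hypothetical helper name, and Vector4 is used so that the lane count is always exactly four.

```csharp
using System;
using System.Numerics;

public static class SoADemo
{
    // Hypothetical helper: regroups four Vector3 values (AoS)
    // into one Vector4 per component (SoA).
    public static void ToSoA(Vector3[] aos, out Vector4 x4, out Vector4 y4, out Vector4 z4)
    {
        if (aos.Length != 4) throw new ArgumentException("expected exactly 4 vectors");
        x4 = new Vector4(aos[0].X, aos[1].X, aos[2].X, aos[3].X);
        y4 = new Vector4(aos[0].Y, aos[1].Y, aos[2].Y, aos[3].Y);
        z4 = new Vector4(aos[0].Z, aos[1].Z, aos[2].Z, aos[3].Z);
    }

    public static void Main()
    {
        var origins = new[]
        {
            new Vector3(1f, 2f, 3f),
            new Vector3(4f, 5f, 6f),
            new Vector3(7f, 8f, 9f),
            new Vector3(10f, 11f, 12f)
        };
        ToSoA(origins, out Vector4 ox4, out Vector4 oy4, out Vector4 oz4);
        // ox4 now holds the four x components: 1, 4, 7, 10.
        Console.WriteLine($"{ox4.X} {ox4.Y} {ox4.Z} {ox4.W}");
    }
}
```

With Vector<float>, as on the slide, the number of lanes is Vector<float>.Count, which depends on the hardware and is not guaranteed to be 4.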

Vectorization

    public Ray Generate( Random rng, int x, int y )
    {
        float r0 = (float)rng.NextDouble();
        float r1 = (float)rng.NextDouble();
        float r2 = (float)rng.NextDouble() - 0.5f;
        float r3 = (float)rng.NextDouble() - 0.5f;
        // calculate sub-pixel ray target position on screen plane
        float u = ((float)x + r0) / (float)screenwidth;
        float v = ((float)y + r1) / (float)screenheight;
        Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1);
        // calculate position on aperture
        Vector3 P = pos + lenssize * (r2 * right + r3 * up);
        // calculate ray direction
        Vector3 D = Vector3.Normalize( T - P );
        // return new primary ray
        return new Ray( P, D, 1e34f );
    }

Vectorization

We convert Generate into Generate4 step by step. First the random numbers: four per variable, one for each ray.

    public Ray4 Generate4( Random rng, int x, int y )
    {
        // float r0 = (float)rng.NextDouble();
        // float r1 = (float)rng.NextDouble();
        // float r2 = (float)rng.NextDouble() - 0.5f;
        // float r3 = (float)rng.NextDouble() - 0.5f;
        float[] r0 = { (float)rng.NextDouble(), (float)rng.NextDouble(),
                       (float)rng.NextDouble(), (float)rng.NextDouble() };
        Vector<float> r0_4 = new Vector<float>( r0 );
        float[] r1 = { (float)rng.NextDouble(), (float)rng.NextDouble(),
                       (float)rng.NextDouble(), (float)rng.NextDouble() };
        Vector<float> r1_4 = new Vector<float>( r1 );
        float[] r2 = { (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f,
                       (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f };
        Vector<float> r2_4 = new Vector<float>( r2 );
        float[] r3 = { (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f,
                       (float)rng.NextDouble() - 0.5f, (float)rng.NextDouble() - 0.5f };
        Vector<float> r3_4 = new Vector<float>( r3 );
        ...
    }

Vectorization

    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // calculate sub-pixel ray target position on screen plane
        // float u = ((float)x + r0) / (float)screenwidth;
        // float v = ((float)y + r1) / (float)screenheight;
        float[] values = { x, x + 1, x + 2, x + 3 };
        Vector<float> x4 = new Vector<float>( values );
        Vector<float> y4 = new Vector<float>( y );
        Vector<float> u4 = (x4 + r0_4) / screenwidth4;
        Vector<float> v4 = (y4 + r1_4) / screenheight4;
        ...
    }

Vectorization

    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // Vector3 T = p1 + u * (p2 - p1) + v * (p3 - p1);
        Vector<float> Tx4 = p1x4 + u4 * (p2x4 - p1x4) + v4 * (p3x4 - p1x4);
        Vector<float> Ty4 = p1y4 + u4 * (p2y4 - p1y4) + v4 * (p3y4 - p1y4);
        Vector<float> Tz4 = p1z4 + u4 * (p2z4 - p1z4) + v4 * (p3z4 - p1z4);
        ...
    }

Vectorization

    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // Vector3 P = pos + lenssize * (r2 * right + r3 * up);
        Vector<float> Px4 = posx4 + lenssize4 * (r2_4 * rightx4 + r3_4 * upx4);
        Vector<float> Py4 = posy4 + lenssize4 * (r2_4 * righty4 + r3_4 * upy4);
        Vector<float> Pz4 = posz4 + lenssize4 * (r2_4 * rightz4 + r3_4 * upz4);
        ...
    }

Vectorization

The direction components are named dx4, dy4 and dz4 here, because x4 and y4 already hold the pixel coordinates earlier in the method.

    public Ray4 Generate4( Random rng, int x, int y )
    {
        ...
        // Vector3 D = Vector3.Normalize( T - P );
        Vector<float> dx4 = Tx4 - Px4;
        Vector<float> dy4 = Ty4 - Py4;
        Vector<float> dz4 = Tz4 - Pz4;
        Vector<float> len4 = Vector.SquareRoot<float>( dx4 * dx4 + dy4 * dy4 + dz4 * dz4 );
        dx4 /= len4;
        dy4 /= len4;
        dz4 /= len4;
        // return four new primary rays
        Ray4 r4 = new Ray4();
        r4.OX4 = Px4; r4.OY4 = Py4; r4.OZ4 = Pz4;
        r4.DX4 = dx4; r4.DY4 = dy4; r4.DZ4 = dz4;
        r4.t4 = new Vector<float>( 1e34f );
        return r4;
    }

Vectorization: Digest

- Vectorization starts with identifying a scalar flow. Executed with vector arithmetic, we run 4 independent, identical flows in parallel.
- The data must suit the vector flow: where we first used pos.x, we now use pos.x4 (and so not pos.xyzw!); where we first used the constant PI, we now use { PI, PI, PI, PI }. We call this data layout structure of arrays.
- The theoretical speedup is 4x. But:
  - There is almost always some overhead to get the data into the right format.
  - The C# JIT compiler is not optimal.
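The broadcast constant and the formatting overhead can both be seen in a small map over an array. The sketch below is an illustration, not code from the course: the Circumferences function and its data are made up. It reads the lane count from Vector<float>.Count rather than assuming 4, since that number depends on the hardware (for example 4 with SSE, 8 with AVX2).

```csharp
using System;
using System.Numerics;

public static class BroadcastDemo
{
    // Computes 2 * PI * r for every radius, Vector<float>.Count values at a time.
    public static float[] Circumferences(float[] radii)
    {
        int width = Vector<float>.Count;
        var result = new float[radii.Length];
        // The scalar constant becomes { 2PI, 2PI, ... }: one copy per lane.
        var twoPi = new Vector<float>(2f * (float)Math.PI);

        int i = 0;
        for (; i <= radii.Length - width; i += width)
        {
            var r = new Vector<float>(radii, i);   // load 'width' consecutive floats
            (twoPi * r).CopyTo(result, i);         // store 'width' results
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < radii.Length; i++)
            result[i] = 2f * (float)Math.PI * radii[i];
        return result;
    }

    public static void Main()
    {
        float[] c = Circumferences(new float[] { 1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f });
        Console.WriteLine($"lanes: {Vector<float>.Count}, accelerated: {Vector.IsHardwareAccelerated}");
        Console.WriteLine(string.Join(" ", c));
    }
}
```

Because the input is already a contiguous float array, loading it costs little here. With AoS data the regrouping step is the overhead the digest refers to.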

Speculative Execution: Broken Streams

What do we do when not all streams execute the same code?

    void IntersectPlane( int idx, Plane plane, Ray ray )
    {
        float OdotN = Vector3.Dot( ray.O, plane.n );
        float DdotN = Vector3.Dot( ray.D, plane.n );
        float t = -(OdotN + plane.d) / DdotN;
        if ((t <= 0) || (t > ray.t)) return;
        ray.N = plane.n;
        ray.objidx = idx;
        ray.t = t;
    }

Speculative Execution: Broken Streams

What do we do when not all streams execute the same code?

    void IntersectPlane4( int idx, Plane plane, Ray4 rays )
    {
        Vector<float> OdotN4 = Dot4(... );
        Vector<float> DdotN4 = Dot4(... );
        Vector<float> t4 = -(OdotN4 + plane.d4) / DdotN4;
        // problem: the outcome of this test can differ per stream
        if ((t4 <= 0) || (t4 > rays.t4)) return;
        rays.n4 =...;
        rays.objidx = new Vector<int>( idx );
        rays.t4 = t4;
    }

Speculative Execution: Broken Streams

    Vector<int> mask = Vector.LessThan( t4, rays.t4 );
    rays.t4 = Vector.ConditionalSelect( mask, t4, rays.t4 );

Speculative Execution: Broken Streams

    void IntersectPlane4( int idx, Plane plane, Ray4 rays )
    {
        Vector<float> OdotN4 = Dot4(... );
        Vector<float> DdotN4 = Dot4(... );
        Vector<float> t4 = -(OdotN4 + plane.d4) / DdotN4;
        // if ((t4 <= 0) || (t4 > rays.t4)) return;
        Vector<int> mask1 = Vector.GreaterThan( t4, Vector<float>.Zero );
        Vector<int> mask2 = Vector.LessThan( t4, rays.t4 );
        Vector<int> mask = Vector.BitwiseAnd( mask1, mask2 );
        // ConditionalSelect takes its second argument where the mask is set
        // (a valid hit) and its third argument elsewhere (keep the old value)
        rays.NX4 = Vector.ConditionalSelect( mask, new Vector<float>( plane.n.X ), rays.NX4 );
        rays.NY4 = Vector.ConditionalSelect( mask, new Vector<float>( plane.n.Y ), rays.NY4 );
        rays.NZ4 = Vector.ConditionalSelect( mask, new Vector<float>( plane.n.Z ), rays.NZ4 );
        rays.t4 = Vector.ConditionalSelect( mask, t4, rays.t4 );
        rays.objidx = Vector.ConditionalSelect( mask, new Vector<int>( idx ), rays.objidx );
    }
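The masking step can be tested in isolation. The sketch below keeps, per lane, the new distance only when it is positive and closer than the current one. AcceptHits and the test values are assumptions for this example. The Vector.GreaterThan, Vector.LessThan, Vector.BitwiseAnd and Vector.ConditionalSelect calls are the System.Numerics functions used on the slide.

```csharp
using System;
using System.Numerics;

public static class MaskDemo
{
    // Per lane: returns 'candidate' if 0 < candidate < current, otherwise 'current'.
    // There is no branch; every lane computes both outcomes and the mask picks one.
    public static Vector<float> AcceptHits(Vector<float> candidate, Vector<float> current)
    {
        Vector<int> positive = Vector.GreaterThan(candidate, Vector<float>.Zero);
        Vector<int> closer = Vector.LessThan(candidate, current);
        Vector<int> mask = Vector.BitwiseAnd(positive, closer);
        return Vector.ConditionalSelect(mask, candidate, current);
    }

    public static void Main()
    {
        int n = Vector<float>.Count; // lane count depends on the hardware
        var cand = new float[n];
        var cur = new float[n];
        for (int i = 0; i < n; i++)
        {
            cur[i] = 10f;
            cand[i] = (i % 2 == 0) ? 5f : -1f; // even lanes hit, odd lanes miss
        }
        Vector<float> best = AcceptHits(new Vector<float>(cand), new Vector<float>(cur));
        // Even lanes now hold 5, odd lanes keep 10.
        for (int i = 0; i < n; i++) Console.Write(best[i] + " ");
        Console.WriteLine();
    }
}
```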

Speculative Execution: Digest

- Switching streams off is not possible.
- We can make sure that actions no longer have any effect, by means of masking.
- This is an example of speculative execution.

Speculative execution has an important advantage: there is no branching code, and therefore no branch misprediction. As a result, SIMD code can in some cases be more than 4x faster: superlinear speedup.

Practical

Extra assignment: not to be handed in!

Literature

Read:
- Book, chapter 2.3
- These slides
- SIMD-Enabled Vector Types with C#, Gastón Hillar, Dr. Dobb's, 2014: http://www.drdobbs.com/architecture-and-design/simd-enabled-vector-types-with-c/240168888
- The JIT finally proposed. JIT and SIMD are getting married, Immo Landwerth, Microsoft, 2014: http://blogs.msdn.com/b/dotnet/archive/2014/04/07/the-jit-finally-proposed-jit-and-simd-are-getting-married.aspx

END of Vectorization

Next lecture: Concurrency in The Last of Us

Bonus Slides: Cache & VLIW, C# Performance Tests

Paste in template:

    Stopwatch timer = new Stopwatch();
    int[] arr = new int[64 * 1024 * 1024]; // 256MB

    public void Tick()
    {
        // clear the screen
        screen.clear( 0 );
        int elapsed1, elapsed2;
        // cache test
        // loop 1: touches every element
        timer.Reset();
        timer.Start();
        for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
        timer.Stop();
        elapsed1 = (int)timer.ElapsedMilliseconds;
        // loop 2: touches one element per 64-byte cache line
        timer.Reset();
        timer.Start();
        for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
        timer.Stop();
        elapsed2 = (int)timer.ElapsedMilliseconds;
        // show timings
        string tm = "loop 1: " + elapsed1.ToString();
        screen.print( tm, 10, 10, 0xffffff );
        tm = "loop 2: " + elapsed2.ToString();
        screen.print( tm, 10, 30, 0xffffff );
    }

Paste in template:

    Stopwatch timer = new Stopwatch();
    int[] arr = new int[4];

    public void Tick()
    {
        // clear the screen
        screen.clear( 0 );
        int elapsed1, elapsed2;
        int steps = 64 * 1024 * 1024;
        // loop 1: four dependent increments of the same element
        timer.Reset();
        timer.Start();
        for (int i = 0; i < steps; i++) { arr[0]++; arr[0]++; arr[0]++; arr[0]++; }
        timer.Stop();
        elapsed1 = (int)timer.ElapsedMilliseconds;
        // loop 2: four independent increments
        timer.Reset();
        timer.Start();
        for (int i = 0; i < steps; i++) { arr[0]++; arr[1]++; arr[2]++; arr[3]++; }
        timer.Stop();
        elapsed2 = (int)timer.ElapsedMilliseconds;
        // show timings
        string tm = "loop 1: " + elapsed1.ToString();
        screen.print( tm, 10, 10, 0xffffff );
        tm = "loop 2: " + elapsed2.ToString();
        screen.print( tm, 10, 30, 0xffffff );
    }