jerome Posted September 8, 2017

Hi,

People usually love the Solid Particle System (aka SPS). Some of them sometimes ask for new features, like the ability to extend it once created (coming soon), or for some extra speed through the ability to disable some computations. I made some study of how things could get faster. The short answer is: go to a lower level in the implementation (replace the arrays of objects with typed arrays of floats, for instance), then, if possible, use other processes (GPU or workers).

Well, here is the current status of my prototypes, so you can compare the differences on your computer and browser. The SPS of reference is really big, filled with 40K (yes, 40,000!) boxes and tetrahedrons. It's far more than we usually ask of an SPS with animated solid particles in the PG examples you could find in the forum posts. So your browser may suffer a bit...

Reference legacy SPS: http://jerome.bousquie.fr/BJS/test/spsReference.html

Then comes the lighter typed-array-based version: http://jerome.bousquie.fr/BJS/test/spsBuffer.html

As you can notice, it's a bit faster. Not only because of the usage of buffers/typed arrays, but also because it has fewer features than the legacy SPS for now. Let's go on...

[EDIT] (from here on, you need a browser with SharedArrayBuffer enabled)

Here comes its new friend, the worker-based SPS: http://jerome.bousquie.fr/BJS/test/spsProtoWorker.html

This one is really faster. In this version, the particle logic (what the user wants the particles to do) is still in the main thread, and the worker only computes the transformations (the final vertex coordinates from the particle rotations, positions, scaling values, etc.).

At last, here's the second worker version: http://jerome.bousquie.fr/BJS/test/spsProtoWorker2.html

It looks faster... at least on my browsers. In this last version, the particle logic is deported into the worker. The main thread only handles updating the mesh from the vertex buffers.

In both worker versions, the worker computations are decoupled from the render loop. This means that the worker computes, then notifies the main thread that it has finished, and the main thread just orders it to compute again, whatever the render loop is currently doing at that moment. The render loop just reads the data currently updated by the worker in a shared buffer (shared between the worker and the main thread).

Next study step to come (not soon): the GPU-based SPS.

If you run the tests, please wait a while until the frame counter stabilizes to its average value.
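A minimal sketch of the decoupled pattern described above (one worker only, a hypothetical spsWorker.js file, made-up message names, and scene, mesh and nbVertices assumed to already exist; this is not the actual prototype code). The worker writes into the shared buffer on its own cycle, and the render loop just copies that buffer into the mesh:

// main thread
var floatCount = nbVertices * 3;                      // positions only, for the sketch
var shared = new SharedArrayBuffer(floatCount * 4);   // 4 bytes per float32
var sharedView = new Float32Array(shared);
var positions = new Float32Array(floatCount);         // regular buffer handed to the GPU
var worker = new Worker("spsWorker.js");              // hypothetical file name

worker.postMessage({ msg: "init", buffer: shared });  // the SharedArrayBuffer is shared, not copied
worker.onmessage = function(e) {
    if (e.data.msg === "done") {
        worker.postMessage({ msg: "compute" });       // re-arm the worker, independently of the render loop
    }
};
worker.postMessage({ msg: "compute" });

scene.registerBeforeRender(function() {
    positions.set(sharedView);                        // copy shared buffer -> vertex buffer
    mesh.updateVerticesData(BABYLON.VertexBuffer.PositionKind, positions);
});

// spsWorker.js
var view;
onmessage = function(e) {
    if (e.data.msg === "init") {
        view = new Float32Array(e.data.buffer);
    } else if (e.data.msg === "compute") {
        // apply the particle rotations, positions and scalings, write the final coordinates into view
        postMessage({ msg: "done" });
    }
};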
Pryme8 Posted September 8, 2017

Great work bud!
Wingnut Posted September 8, 2017

Nice hackin', J! I'm getting some ReferenceError: SharedArrayBuffer is not defined - spsProtoWorker2.js:137 ...in the last 2 demos. I probably need to adjust my Firefox somehow.
JCPalmer Posted September 8, 2017

I think SharedArrayBuffer is not implemented everywhere yet.
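For what it's worth, a plain feature test can catch this before creating the worker version (just an illustration):

if (typeof SharedArrayBuffer === "undefined") {
    console.warn("SharedArrayBuffer is not available, falling back to the legacy SPS");
    // build the classic mono-threaded SPS instead
}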
jerome Posted September 8, 2017

yep, just use the latest versions or enable it in the browser settings.

To understand the dimensions we're talking about in these examples: 20K boxes, each having 24 vertices = 480K vertices, plus 20K tetrahedrons, each having 12 vertices = 240K vertices. A total of 720K vertices, each with 3 floats (x, y, z) for the position and 3 floats for the normal, so 4.32 million coordinates to be scaled, rotated (a quaternion and a rotation matrix computed per particle each step!) and translated on each update, CPU side.
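The same count, spelled out:

var boxVertices   = 20000 * 24;                   // 480,000
var tetraVertices = 20000 * 12;                   // 240,000
var totalVertices = boxVertices + tetraVertices;  // 720,000
var coordinates   = totalVertices * (3 + 3);      // positions + normals = 4,320,000 floats per update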
JCPalmer Posted September 8, 2017

Interesting. Do you know if a shared buffer is limited to just the UI and a single worker? If it can work with multiple workers, you might be able to get some kind of a thread pool where each core works on a piece.
JCPalmer Posted September 8, 2017

You know, I tried this on my Sony tablet and got 11 fps with the worker, so it works on Android. With legacy, the update is so slow there was no time to update the FPS reading. Need to check iOS. Not holding my breath.
JCPalmer Posted September 8, 2017

It actually worked on an iPad Air 2, A8 processor. About 58-60 with worker. 2-8 with legacy.
jerome Posted September 8, 2017

#1: SharedArrayBuffers also work with multiple workers. I tried them, but I faced a big issue with data synchronization: when a worker updates a part of the common buffer, the updated value is only accessible to the other threads (including the main thread) some time later... unpredictable in a render loop. The W3C and JS guys added a feature called Atomics ( https://github.com/tc39/ecmascript_sharedmem/blob/master/TUTORIAL.md ) to make operations atomic and to guarantee the read value is the last updated one. Unfortunately, Atomics only works with integer arrays... so no way to manage float coordinates. I went crazy after hours testing this. If you change the value of the variable workerNb in my code, you'll see the data synchronization issue live.

Note also that the VertexBuffer used by the GPU can't be shared directly among workers. We need a shared buffer for the workers, then copy it into the vertex buffer before rendering. I will probably make a third prototype sharing a buffer per worker... with 2, 3 or more workers (some browsers limit the number to 4), each computing simultaneously only a subpart of the array, in order to check whether it's worth it. This will actually be my next step before the GPU experiments, if they ever come one day...

#2: 40K is really a huge number for solid particles. In the PGs, you will hardly find examples with particle numbers above 6K, as particles can easily be recycled when they get out of the frustum. Maybe you could test the worker examples with, say, 10 or 12K particles (just copy/paste the 3 files onto your server and change the value of the variable particleNb), a number a legacy SPS would normally not animate at 60 fps.
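To illustrate the Atomics limitation mentioned above (just a sketch, not code from the prototypes):

var sab  = new SharedArrayBuffer(1024);
var ints = new Int32Array(sab);
Atomics.store(ints, 0, 42);        // atomic write, guaranteed visible to the other threads
var v = Atomics.load(ints, 0);     // guaranteed to read the last stored value

var floats = new Float32Array(sab);
// Atomics.store(floats, 0, 1.5);  // throws: Atomics only accepts integer typed array views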
adam Posted September 8, 2017

29 minutes ago, jerome said: Unfortunately, Atomics only works with IntArray... so no way to manage float coordinates.

Could you multiply the float values by something like 100 before storing them into the int array and then divide the int value by 100 to get back to your float value?
jerome Posted September 8, 2017

yes, this is a workaround, but it quickly limits the smoothness, because the precision is then fixed forever: say everything is multiplied/divided by 1000 (factor 1000); then wherever the camera is located, wherever each vertex is positioned, rotated or scaled, they are all bound to some virtual 3D grid. Said differently, this technique reduces all the possible values to a finite pool: the integer values storable in the array.

intValue1 = (floatValue1 * 1000)|0 // cast to int with a precision of 1000, stored in the shared array
receivedFloatValue = intValue1 / 1000 // right ?

Well, plenty of different float values for the initial floatValue1 would give the same intermediate intValue1, so the same final received value. Ex: 1.0, 1.0001, 1.0002, 1.0003, etc., but also 1.00011, 1.000101, 1.000999999, etc. They all end up at the same final value: 1.0.

If all the particles have a size around, say, 1.0, can be as close as 0.1 to each other, and the camera can come as close as 0.01 to them, this really matters. Not sure I'm very clear.
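To make the collision concrete (factor 1000, i.e. 3 decimals kept; just an illustration):

function toInt(f)   { return (f * 1000) | 0; }   // cast to int, precision 1/1000
function toFloat(i) { return i / 1000; }

toFloat(toInt(1.0));           // 1.0
toFloat(toInt(1.0001));        // 1.0   -> same stored value as 1.0
toFloat(toInt(1.000999999));   // 1.0   -> same stored value again
toFloat(toInt(1.2345678));     // 1.234 -> everything snaps to a 0.001 grid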
jerome Posted September 11, 2017

clearer:
JCPalmer Posted September 11, 2017

As for floats, I round all the geometry coming out of Blender to 4 decimals. It seems more than sufficient. We are also using 32-bit integers, not 16-bit. The largest value for normals is 1, right?

I am wondering out loud about a worker to compute normals. In my morphing implementation, when a new target is being computed, the target normals are also calculated. Then each frame, I interpolate the normals as well as the positions. This would mean that starting a morph on the fly is at least one frame late. Since I have a queue-based animation system anyway, this might not be a big hurdle.

In fact, for QI.Automaton, I export those 24 tiny FACE targets. In the process of construction, I create all the expressions (speech visemes are also expressions) from their smaller parts. Doing that might reduce my load time. Could combine that with read-ahead, since it would not be tying up the UI much at all. Just thinking out loud.
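The per-frame interpolation described above, in its simplest form (a sketch with hypothetical buffer names, t going from 0 to 1 over the morph):

function lerpBuffer(from, to, t, out) {
    for (var i = 0; i < out.length; i++) {
        out[i] = from[i] + (to[i] - from[i]) * t;
    }
}
// lerpBuffer(startPositions, targetPositions, t, currentPositions);
// lerpBuffer(startNormals,   targetNormals,   t, currentNormals);  // re-normalize the normals afterwards if needed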
jerome Posted September 11, 2017

All this is really interesting, needs tests and gives me ideas. Will check tomorrow as soon as I can get some free time.
adam Posted September 11, 2017

It obviously depends on how much precision you need and the scale of your world.

boxes with size 1: https://playground.babylonjs.com/#MTER2F#2
boxes with size .1: https://playground.babylonjs.com/#MTER2F#1
boxes with size .01: https://playground.babylonjs.com/#MTER2F

Good for hacking or your own project where you know the constraints. Probably not a good fit for SPS.

EDIT: those first 3 examples were 3 decimal places. Here is 4 decimal places:

boxes with size .1: https://playground.babylonjs.com/#MTER2F#3
boxes with size .01: https://playground.babylonjs.com/#MTER2F#4

This might work for SPS. Also remember that camera.minZ defaults to 1. In most cases you wouldn't be close enough to see this level of detailed movement.
jerome Posted September 12, 2017

really smart case study
jerome Posted September 12, 2017

ok, back. I was about to implement some high-precision float-to-integer casting in order to use the painful Atomics feature. Painful, because it implies managing the way each buffer element is updated, then tagged as asleep in one thread and watched until awoken in the other thread... this for 7 million array elements!

Well, before diving into those complex programming constraints, I re-checked one last time my initial code, which shouldn't make any real concurrent access to the same part of the memory, because each worker was supposed to read and write only in its dedicated portion of the buffer. Moreover, the main thread was only supposed to read the shared buffer, whether the data were up to date or not. And I found a tiny bug in the way I used the indexes to split the buffer between the workers... fixed. So no more flickering now. And no need for integer casting and all the Atomics stuff.

Everything is here: http://jerome.bousquie.fr/BJS/test/SPSWorker/

There's a folder for each version: one or two workers. Just click on the html file. The difference between version 1 and version 2 is that version 1 implements the particle logic in the main thread, whereas version 2 implements the particle logic in each worker. Theoretically, version 2 should be faster.

On average machines, the FPS should be up to 20 times higher than with the typed-array mono-threaded version or the legacy SPS:

Reference legacy SPS: http://jerome.bousquie.fr/BJS/test/spsReference.html
Lighter typed-array-based version: http://jerome.bousquie.fr/BJS/test/spsBuffer.html

On fast machines, the FPS is almost always around 60. This is expected because the render loop is decoupled from the worker computations. So don't really mind the FPS when it's close to 60, and don't compare the particle speed between the two versions either: the velocity step is clocked on the render loop in version 1, but on the worker cycle in version 2, so they may differ. So what to compare then? Maybe the smoothness with which the particles evolve... whether they are jerky or not.

I also tried with 4 or 6 workers but could not get any gain. Actually, as the vertex buffer, used by the GPU in the end, can't be shared among workers, we have to copy the buffer shared between the workers into the vertex buffer before rendering. This means a loop to copy 7M elements in the main thread anyway... and this loop (as well as the particle logic, when it runs in the main thread) can slow down the main thread, and so the FPS.

As usual, please wait for a while until the FPS meter stabilizes.
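The fix boils down to giving each worker its own, non-overlapping slice of the shared array, roughly like this (hypothetical variable names, not the actual prototype code):

var workerNb = 2;
var floatsPerWorker = Math.ceil(totalFloats / workerNb);
for (var w = 0; w < workerNb; w++) {
    var start = w * floatsPerWorker;
    var end = Math.min(start + floatsPerWorker, totalFloats);
    workers[w].postMessage({ msg: "init", buffer: shared, start: start, end: end });
}
// each worker only writes view[i] for start <= i < end, so no Atomics are needed,
// and the main thread copies the whole shared view into the vertex buffer before rendering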
JohnK Posted September 12, 2017

On a low-end Windows laptop: legacy 7 fps, new 10 fps, but the new version is much less stuttery.
jerome Posted September 12, 2017

Only 10 fps with http://jerome.bousquie.fr/BJS/test/SPSWorker/twoWorkers/spsProtoWorker2.html ? rats! On Linux, Chrome is really faster than FF... I get 60 fps in full screen on my powerful work machine: GPU Nvidia Quadro K620, CPU Intel Xeon E3 (4 x 3.1 GHz). Anyway, 40K is certainly far too big a number...

[EDIT]: I just slowed down the particle speed.
jerome Posted September 12, 2017

Heavy stress tests here (beware, these need a very powerful PC):

http://jerome.bousquie.fr/BJS/test/SPSWorker/Ultimate/spsProtoWorker3.html
This one runs at 60 fps in Chrome here... 200K (yes: 200,000!) transparent quad particles and 4 workers.

[EDIT] http://jerome.bousquie.fr/BJS/test/SPSWorker/Ultimate/spsProtoWorker4.html
250K triangles, 6 workers, 60 fps in full screen in Chrome on my muscle work computer.
Wingnut Posted September 12, 2017

Hi guys. Quick interrupt. Those not seeing the demos in Firefox: enter about:config in the browser URL field... and set javascript.options.shared_memory to true. Thanks for the info, JCPalmer and Jerome. Great demos, interesting tests!
JohnK Posted September 12, 2017

3 hours ago, jerome said: http://jerome.bousquie.fr/BJS/test/SPSWorker/Ultimate/spsProtoWorker4.html

Am getting 25 fps with the above using FF on my low-end Windows laptop (Intel HD integrated graphics). Am getting 4 fps on my low-end Android tablet. I'm impressed, as I have seen apparently less intensive PGs fail on the Android.
JCPalmer Posted September 12, 2017

I get 59-60 FPS for the last 2 on Chrome & Firefox (i5 & GTX 1050 @ 2560 x 1600 res). Edge fails due to shared memory. I am not sure using more workers than the number of cores is good, though. Wonder if there is a way to find out how many there are in JS?
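(There is a standard hint for this, navigator.hardwareConcurrency, though it is not supported by every browser at this time. A possible use:)

var cores = navigator.hardwareConcurrency || 4;   // logical cores, with a fallback guess
var workerNb = Math.max(1, cores - 1);            // keep one core free for the main/render thread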
jerome Posted September 15, 2017

Hi, third step: the GPU SPS http://jerome.bousquie.fr/BJS/test/SPSGpu/spsShader.html

This one runs at 60 fps in Chrome on my muscle work computer. As usual, 40K solid particles (boxes + tetras). This is a mono-threaded JS version, but all the particle transformations (translation, rotation, scaling) are computed GPU side. I didn't code a nice fragment shader managing the light reflection for now.

Next step to come: the worker GPU SPS. Not sure we can get more performance anyway, because the particle logic (the user's custom behavior) has to remain in the main thread.
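Not jerome's actual shader, just a rough sketch of the idea: per-particle translation, rotation (quaternion) and scaling passed as extra vertex attributes (the pPosition / pRotation / pScaling names are made up) and applied in the vertex shader:

BABYLON.Effect.ShadersStore["spsSketchVertexShader"] =
    "precision highp float;\n" +
    "attribute vec3 position;\n" +
    "attribute vec3 pPosition;\n" +   // per-particle translation (assumed attribute)
    "attribute vec4 pRotation;\n" +   // per-particle quaternion (assumed attribute)
    "attribute vec3 pScaling;\n" +    // per-particle scaling (assumed attribute)
    "uniform mat4 worldViewProjection;\n" +
    "void main(void) {\n" +
    "    vec3 v = position * pScaling;\n" +
    "    v = v + 2.0 * cross(pRotation.xyz, cross(pRotation.xyz, v) + pRotation.w * v);\n" +  // quaternion rotation
    "    gl_Position = worldViewProjection * vec4(v + pPosition, 1.0);\n" +
    "}";

The per-particle attributes would then be declared in a ShaderMaterial's attributes option and filled with one value per vertex of each particle.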
JCPalmer Posted September 15, 2017

Full screen, it runs @ 59 fps on FF. About 25% CPU total across all cores. Did not test Chrome, since you already did. Since it wasn't using workers, I tried Edge, but it could not compile the effect. [Attached: vertex shader, fragment shader & error message] The error does not mean anything to me.