
SPS experiments


jerome


Hi,

People usually love the Solid Particle System (aka SPS). Some of them ask for new features, like the ability to extend it once created (coming soon), or for some extra speed through the ability to disable some computations.

I did some research on how things could get faster. The short answer is: go to a lower level in the implementation (replace the arrays of objects with typed arrays of floats, for instance), then, if possible, offload the work to other processes (GPU or workers).
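Just to illustrate what "lower level" means here, a rough sketch with made-up names (not the actual SPS code) :

// legacy style : one JS object per particle, many property lookups and GC pressure
var particles = [];
for (var i = 0; i < particleNb; i++) {
    particles.push({ position: new BABYLON.Vector3(0, 0, 0), rotation: new BABYLON.Vector3(0, 0, 0), scaling: new BABYLON.Vector3(1, 1, 1) });
}

// lower level style : one flat Float32Array, 9 floats per particle (position, rotation, scaling)
var stride = 9;
var particleData = new Float32Array(particleNb * stride);
for (var j = 0; j < particleNb; j++) {
    particleData[j * stride + 6] = 1.0;    // scaling x
    particleData[j * stride + 7] = 1.0;    // scaling y
    particleData[j * stride + 8] = 1.0;    // scaling z
}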

Well, here is the current status of my prototypes, so you can compare the differences on your own computer and browser.

The reference SPS is really big, filled with 40K (yes, 40,000!) boxes and tetrahedrons. That's far more than we usually ask of an SPS with animated solid particles in the PG examples you can find in the forum posts. So your browser may suffer a bit...

Reference legacy SPS : http://jerome.bousquie.fr/BJS/test/spsReference.html

Then comes the lighter typed array based version : http://jerome.bousquie.fr/BJS/test/spsBuffer.html 

As you can notice, it's a bit faster, not only because of the use of buffers/typed arrays, but also because it has fewer features than the legacy SPS for now.

Let's go on...

[EDIT] (from here on, you need a browser with SharedArrayBuffer enabled)

Here comes its new friend, the worker based SPS : http://jerome.bousquie.fr/BJS/test/spsProtoWorker.html

This one is really faster. In this version, the particle logic (what the user wants the particles to do) is still in the main thread and the worker only computes the transformations (the final vertex coordinates from the particle rotations, positions, scaling values, etc.).
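Roughly, what the worker does for each particle boils down to this kind of computation (a hedged sketch with hypothetical names, not the prototype's actual code) : model holds the particle's base vertex positions, positions is the shared output array and offset is the particle's first float index.

function transformParticle(model, sx, sy, sz, qx, qy, qz, qw, px, py, pz, positions, offset) {
    // rotation matrix coefficients derived from the particle quaternion
    var xx = qx * qx, yy = qy * qy, zz = qz * qz;
    var xy = qx * qy, xz = qx * qz, yz = qy * qz;
    var wx = qw * qx, wy = qw * qy, wz = qw * qz;
    for (var v = 0; v < model.length; v += 3) {
        // scale the model vertex, rotate it, then translate it to the particle position
        var x = model[v] * sx, y = model[v + 1] * sy, z = model[v + 2] * sz;
        positions[offset + v]     = px + x * (1.0 - 2.0 * (yy + zz)) + y * 2.0 * (xy - wz) + z * 2.0 * (xz + wy);
        positions[offset + v + 1] = py + x * 2.0 * (xy + wz) + y * (1.0 - 2.0 * (xx + zz)) + z * 2.0 * (yz - wx);
        positions[offset + v + 2] = pz + x * 2.0 * (xz - wy) + y * 2.0 * (yz + wx) + z * (1.0 - 2.0 * (xx + yy));
    }
}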

Finally, here's the second worker version: http://jerome.bousquie.fr/BJS/test/spsProtoWorker2.html

It looks faster ... at least on my browsers.

In this last version, the particle logic is moved into the worker. The main thread only takes care of updating the mesh from the vertex buffers.

In both worker versions, the worker computations are decoupled from the render loop. This means that the worker computes, then notifies the main thread that it has finished, and the main thread simply orders it to compute again, whatever the render loop is doing at that moment. The render loop just reads the data currently updated by the worker in a shared buffer (shared between the worker and the main thread).
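The decoupling itself is just a small message loop, something along these lines (hypothetical message names, simplified) :

// main thread : launch the worker and let it cycle on its own, independently of the render loop
var sab = new SharedArrayBuffer(totalVertexCount * 3 * Float32Array.BYTES_PER_ELEMENT);
var sharedPositions = new Float32Array(sab);              // the same memory is visible from the worker
var worker = new Worker('spsWorker.js');                  // hypothetical file name
worker.onmessage = function () {
    worker.postMessage({ cmd: 'update' });                // the worker just finished : ask for a new computation at once
};
worker.postMessage({ cmd: 'init', buffer: sab });         // pass the SharedArrayBuffer once, no copy involved
// meanwhile, the render loop simply reads sharedPositions whenever it renders a frame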

Next study step to come (not soon) : the GPU based SPS

 

If you run the tests, please wait a while until the frame counter stabilizes at its average value.

 

 

 

 


yep, just use the latest versions or enable it in the browser settings

To understand the dimensions we're talking about in these examples:

20 K boxes, each having 24 vertices = 480K vertices

+ 20K tetrahedrons, each having 12 vertices = 240K vertices

A total of 720K vertices, each with 3 floats (x, y, z) for the position and 3 floats for the normal, so 4.32 million coordinates to be scaled, rotated (a quaternion and a rotation matrix computed per particle each step!) and translated on each update, CPU side.


#1 SharedArrayBuffers also work with multiple workers. I tried them, but I faced a big issue with data synchronization: when a worker updates a part of the common buffer, the updated value only becomes visible to the other threads (including the main thread) some time later... unpredictable in a render loop.

The W3C and JS guys added a feature called Atomics ( https://github.com/tc39/ecmascript_sharedmem/blob/master/TUTORIAL.md ) to make operations atomic and to guarantee that the read value is the last updated one. Unfortunately, Atomics only works with integer typed arrays... so no way to manage float coordinates. I went crazy after hours of testing this. If you change the value of the variable workerNb in my code, you'll see the data synchronization issue live.
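To illustrate the constraint, a minimal sketch (not the prototype code) :

var sab = new SharedArrayBuffer(1024);
var ints = new Int32Array(sab);             // integer view : accepted by Atomics
var floats = new Float32Array(sab);         // float view : rejected by Atomics

Atomics.store(ints, 0, 42);                 // the write is guaranteed to be visible to the other threads
var value = Atomics.load(ints, 0);          // the read is guaranteed to return the last stored value
// Atomics.store(floats, 0, 1.5);           // would throw : no atomic access to float coordinates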

Note also that the VertexBuffer used by the GPU can't be shared directly among workers. We need a shared buffer for the workers, then we copy it into the vertex buffer before rendering.

I will probably make a third prototype sharing a buffer per worker... with 2, 3, or more workers (some browsers limit the number to 4), each computing simultaneously only a subpart of the array, in order to check whether it's worth it. This will actually be my next step before the GPU experiments, if they ever come one day...

#2: 40K is really a huge number for solid particles. In the PGs, you will hardly find examples with a particle count above 6K, as particles can easily be recycled when they get out of the frustum.

Maybe you could test the worker examples with, say, 10 or 12K particles (just copy/paste the 3 files onto your server and change the value of the variable particleNb), a count that a legacy SPS would normally not animate at 60 fps.


29 minutes ago, jerome said:

Unfortunately, Atomics only works with integer typed arrays... so no way to manage float coordinates.

Could you multiply the float values by something like 100 before storing them into the int array and then divide the int value by 100 to get back to your float value?


Yes, this is a workaround, but it quickly limits the smoothness because the precision is then fixed forever: say everything is multiplied/divided by 1000 (factor 1000), then wherever the camera is located and wherever each vertex is positioned, rotated or scaled, they are all bound to some virtual 3D grid.

Said differently, this technique reduces all the possible values to a finite pool: the integer values storable in the array.

intValue1 = (floatValue1 * 1000)|0      // cast to int with a precision of 1000, stored in the shared array

receivedFloatValue = intValue1 / 1000   // right ?

Well, plenty of different initial floatValue1 values would give the same intermediate intValue1, so the same final received value.

e.g. 1.0, 1.0001, 1.0002, 1.0003, etc., but also 1.00011, 1.000101, 1.000999999, etc. all end up at the same final value: 1.0
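You can check it quickly in a console (same cast as above, factor 1000) :

var factor = 1000;
console.log(((1.0 * factor) | 0) / factor);           // 1
console.log(((1.0001 * factor) | 0) / factor);        // 1
console.log(((1.000999999 * factor) | 0) / factor);   // 1 : three different floats, one single received value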

If all the particles have a size around, say, 1.0, can be as little as 0.1 away from one another, and the camera can get as close as 0.01 to them, this really matters.

Not sure I'm very clear. 


As for floats, I round all the geometry coming out of Blender to 4 decimals. It seems more than sufficient. We are also using 32 bit integers, not 16 bit. The largest value for normals is 1, right? I am wondering out loud about a worker to compute normals. In my morphing implementation, when a new target is being computed, the target normals are also calculated. Then each frame, I interpolate the normals as well as the positions. This will mean that starting a morph on the fly is at least one frame late. Since I have a queue based animation system anyway, this might not be a big hurdle. In fact, for QI.Automaton, I export those 24 tiny FACE targets. In the process of construction, I create all the expressions (speech visemes are also expressions) from their smaller parts. Doing that might reduce my load time. I could combine that with read ahead, since it would not be tying up the UI much at all.

Just thinking out loud.


It obviously depends on how much precision you need and the scale of your world.

boxes with size 1:

https://playground.babylonjs.com/#MTER2F#2

boxes with size .1:

https://playground.babylonjs.com/#MTER2F#1

boxes with size .01:

https://playground.babylonjs.com/#MTER2F

Good for hacking or your own project where you know the constraints.  Probably not a good fit for SPS. 

EDIT:

those first 3 examples were 3 decimal places.

Here is 4 decimal places:

boxes with size .1:

https://playground.babylonjs.com/#MTER2F#3

boxes with size .01

https://playground.babylonjs.com/#MTER2F#4

This might work for SPS.  Also remember that camera.minZ defaults to 1.  In most cases you wouldn't be close enough to see this level of detailed movement.

 


ok back.

I was about to implement some high precision float to integer casting in order to use the painful Atomics feature. Painful, because it means managing the way each buffer element is updated, then tagged as asleep in one thread and watched until awakened in the other thread... this for 7 million array elements!

Well, before diving into those complex programming constraints, I re-checked one last time my initial code, which shouldn't make real concurrent accesses to the same part of the memory, because each worker was supposed to read and write only its dedicated portion of the buffer. Moreover, the main thread was only supposed to read the shared buffer, whether the data were up to date or not.

And I found a tiny bug in the way I used the indexes to split the buffer between the workers... fixed. So no more flickering now. And no need for integer casting and all the Atomics stuff.
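The partitioning idea is simply to give each worker its own, non-overlapping slice of the shared array; a sketch with made-up names (the real code may differ) :

var floatsPerParticle = verticesPerParticle * 3;                 // 3 position floats per vertex
var particlesPerWorker = (particleNb / workerNb) | 0;
for (var w = 0; w < workerNb; w++) {
    var firstParticle = w * particlesPerWorker;
    var lastParticle = (w === workerNb - 1) ? particleNb : firstParticle + particlesPerWorker;
    workers[w].postMessage({
        cmd: 'init',
        buffer: sab,                                             // the same SharedArrayBuffer for every worker
        startIndex: firstParticle * floatsPerParticle,           // first float index this worker may write
        endIndex: lastParticle * floatsPerParticle               // exclusive upper bound of its slice
    });
}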

Everything is here : http://jerome.bousquie.fr/BJS/test/SPSWorker/

There's a folder for each version: one or two workers. Just click on the html file. The difference between versions 1 and 2 is just that version 1 implements the particle logic in the main thread, whereas version 2 implements the particle logic in each worker.

Theoretically, the version 2 should be faster.

On average machines, the FPS should be up to 20 times higher than with the mono-threaded typed array version or the legacy SPS:

Reference legacy SPS : http://jerome.bousquie.fr/BJS/test/spsReference.html

Lighter typed array based version : http://jerome.bousquie.fr/BJS/test/spsBuffer.html 

On fast machines, the FPS is almost always around 60. This is expected because the render loop is decoupled from the worker computations. So don't pay too much attention to the FPS when it's close to 60, and don't compare the particle speed between both versions either: the velocity step is clocked on the render loop in version 1, but on the worker cycle in version 2, so they might differ.

So what to compare then? Maybe the smoothness with which the particles evolve... are they jerky or not?

I also tried with 4 or 6 workers but could not get any gain. Actually, as the vertex buffer finally used by the GPU can't be shared among workers, we have to copy the buffer shared between the workers into the vertex buffer before rendering. This means a loop to copy 7M elements in the main thread anyway... this loop (as well as the particle logic when it runs in the main thread) can slow down the main thread, and thus the FPS.
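That final copy is basically this, done every frame in the render loop (a sketch where positions is a regular Float32Array and sharedPositions is the view on the SharedArrayBuffer; the real update code may differ) :

scene.registerBeforeRender(function () {
    positions.set(sharedPositions);       // bulk copy of the shared data into a regular Float32Array
    mesh.updateVerticesData(BABYLON.VertexBuffer.PositionKind, positions, false, false);
});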

As usual, please wait for a while until the FPS meter stabilizes.

 

 

 

 

 


Only 10 fps with http://jerome.bousquie.fr/BJS/test/SPSWorker/twoWorkers/spsProtoWorker2.html  ? rats !

On Linux, Chrome is really faster than FF... I get 60 fps in full screen on my powerful work machine: GPU Nvidia Quadro K620, CPU Intel Xeon E3 (4 x 3.1 GHz).

Anyway, 40K is certainly far too big a number...

[EDIT] : I just slowed down the particle speed


Heavy stress test here (beware, these need a very powerful PC) : http://jerome.bousquie.fr/BJS/test/SPSWorker/Ultimate/spsProtoWorker3.html

This one runs at 60 fps in Chrome here... 200K (yes: 200,000!) transparent quad particles and 4 workers.

 

[EDIT] http://jerome.bousquie.fr/BJS/test/SPSWorker/Ultimate/spsProtoWorker4.html

250K triangles, 6 workers, 60 fps in full screen in Chrome on my muscle work computer.


3 hours ago, jerome said:

I'm getting 25 fps with the above using FF on my low end Windows laptop (Intel HD integrated graphics).

I'm getting 4 fps on my low end Android tablet.

I'm impressed as I have seen apparently less intensive PGs fail on the Android.


Hi, third step : GPU SPS

http://jerome.bousquie.fr/BJS/test/SPSGpu/spsShader.html

This one runs at 60 fps in Chrome on my muscle work computer.

As usual, 40K solid particles (boxes + tetras). This is a mono-threaded JS version, but all the particle transformations (translation, rotation, scaling) are computed GPU side. I didn't code a nice fragment shader managing the light reflection for now.
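One possible way to push the transformations to the GPU (just a sketch of the principle, not necessarily what this prototype does) is to give each vertex its particle's position, quaternion and scaling as extra attributes and to apply them in a vertex shader :

BABYLON.Effect.ShadersStore['spsVertexShader'] =
    'precision highp float;' +
    'attribute vec3 position;' +
    'attribute vec3 particlePosition;' +          // per-vertex copy of the particle translation
    'attribute vec4 particleQuaternion;' +        // per-vertex copy of the particle rotation
    'attribute vec3 particleScaling;' +           // per-vertex copy of the particle scaling
    'uniform mat4 worldViewProjection;' +
    'vec3 rotate(vec4 q, vec3 v) { return v + 2.0 * cross(q.xyz, cross(q.xyz, v) + q.w * v); }' +
    'void main() {' +
    '    vec3 p = rotate(particleQuaternion, position * particleScaling) + particlePosition;' +
    '    gl_Position = worldViewProjection * vec4(p, 1.0);' +
    '}';
BABYLON.Effect.ShadersStore['spsFragmentShader'] =
    'precision highp float; void main() { gl_FragColor = vec4(1.0); }';   // flat white, no lighting yet
var mat = new BABYLON.ShaderMaterial('sps', scene, { vertex: 'sps', fragment: 'sps' },
    { attributes: ['position', 'particlePosition', 'particleQuaternion', 'particleScaling'], uniforms: ['worldViewProjection'] });
// then assign mat to the big SPS mesh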

Next step to come: the worker GPU SPS. Not sure we can gain even more performance anyway, because the particle logic (the user's custom behavior) has to remain in the main thread.

