
SPS experiments


jerome


Hi,

People usually love the Solid Particle System (aka SPS). Some of them ask for new features, like the ability to extend it once created (coming soon), or for some extra speed through the ability to disable some computations.

I did some research on how things could get faster. The short answer is: go to a lower level in the implementation (replace the arrays of objects with typed arrays of floats, for instance), then, if possible, offload the work to other processes (GPU or workers).
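Just to illustrate what "lower level" means here, a rough sketch with made-up names (not the actual SPS code) :

// legacy style : one JS object per particle, many property lookups and GC pressure
var particles = [];
for (var i = 0; i < particleNb; i++) {
    particles.push({ position: new BABYLON.Vector3(0, 0, 0), rotation: new BABYLON.Vector3(0, 0, 0), scaling: new BABYLON.Vector3(1, 1, 1) });
}

// lower level style : one flat Float32Array, 9 floats per particle (position, rotation, scaling)
var stride = 9;
var particleData = new Float32Array(particleNb * stride);
for (var j = 0; j < particleNb; j++) {
    particleData[j * stride + 6] = 1.0;    // scaling x
    particleData[j * stride + 7] = 1.0;    // scaling y
    particleData[j * stride + 8] = 1.0;    // scaling z
}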

Well, here is the current status of my prototypes, so you can compare the differences on your own computer and browser.

The reference SPS is really big, filled with 40K (yes, 40,000!) boxes and tetrahedrons. That's far more than we usually ask of an SPS with animated solid particles in the PG examples you can find in the forum posts. So your browser may suffer a bit...

Reference legacy SPS : http://jerome.bousquie.fr/BJS/test/spsReference.html

Then comes the lighter typed array based version : http://jerome.bousquie.fr/BJS/test/spsBuffer.html 

As you can notice, it's a bit faster, not only because of the use of buffers/typed arrays, but also because it has fewer features than the legacy SPS for now.

Let's go on...

[EDIT] (from here on, you need a browser with SharedArrayBuffer enabled)

Here comes its new friend, the worker based SPS : http://jerome.bousquie.fr/BJS/test/spsProtoWorker.html

This one is really faster. In this version, the particle logic (what the user wants the particles to do) is still in the main thread and the worker only computes the transformations (the final vertex coordinates from the particle rotations, positions, scaling values, etc.).
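Roughly, what the worker does for each particle boils down to this kind of computation (a hedged sketch with hypothetical names, not the prototype's actual code) : model holds the particle's base vertex positions, positions is the shared output array and offset is the particle's first float index.

function transformParticle(model, sx, sy, sz, qx, qy, qz, qw, px, py, pz, positions, offset) {
    // rotation matrix coefficients derived from the particle quaternion
    var xx = qx * qx, yy = qy * qy, zz = qz * qz;
    var xy = qx * qy, xz = qx * qz, yz = qy * qz;
    var wx = qw * qx, wy = qw * qy, wz = qw * qz;
    for (var v = 0; v < model.length; v += 3) {
        // scale the model vertex, rotate it, then translate it to the particle position
        var x = model[v] * sx, y = model[v + 1] * sy, z = model[v + 2] * sz;
        positions[offset + v]     = px + x * (1.0 - 2.0 * (yy + zz)) + y * 2.0 * (xy - wz) + z * 2.0 * (xz + wy);
        positions[offset + v + 1] = py + x * 2.0 * (xy + wz) + y * (1.0 - 2.0 * (xx + zz)) + z * 2.0 * (yz - wx);
        positions[offset + v + 2] = pz + x * 2.0 * (xz - wy) + y * 2.0 * (yz + wx) + z * (1.0 - 2.0 * (xx + yy));
    }
}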

Finally, here's the second worker version: http://jerome.bousquie.fr/BJS/test/spsProtoWorker2.html

It looks faster ... at least on my browsers.

In this last version, the particle logic is moved into the worker. The main thread only takes care of updating the mesh from the vertex buffers.

In both worker versions, the worker computations are decoupled from the render loop. This means that the worker computes, then notifies the main thread that it has finished, and the main thread simply orders it to compute again, whatever the render loop is doing at that moment. The render loop just reads the data currently updated by the worker in a shared buffer (shared between the worker and the main thread).
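The decoupling itself is just a small message loop, something along these lines (hypothetical message names, simplified) :

// main thread : launch the worker and let it cycle on its own, independently of the render loop
var sab = new SharedArrayBuffer(totalVertexCount * 3 * Float32Array.BYTES_PER_ELEMENT);
var sharedPositions = new Float32Array(sab);              // the same memory is visible from the worker
var worker = new Worker('spsWorker.js');                  // hypothetical file name
worker.onmessage = function () {
    worker.postMessage({ cmd: 'update' });                // the worker just finished : ask for a new computation at once
};
worker.postMessage({ cmd: 'init', buffer: sab });         // pass the SharedArrayBuffer once, no copy involved
// meanwhile, the render loop simply reads sharedPositions whenever it renders a frame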

Next study step to come (not soon) : the GPU based SPS

 

If you run the tests, please wait a while until the frame counter stabilizes at its average value.

 

 

 

 


yep, just use the latest versions or enable it in the browser settings

To understand the dimensions we're talking about in these examples:

20 K boxes, each having 24 vertices = 480K vertices

+ 20K tetrahedrons, each having 12 vertices = 240K vertices

A total of 720K vertices, each with 3 floats (x, y, z) for the position and 3 floats for the normal, so 4.32 million coordinates to be scaled, rotated (a quaternion and a rotation matrix computed per particle each step!) and translated on each update, CPU side.


#1 SharedArrayBuffers also work with multiple workers. I tried them, but I faced a big issue with data synchronization: when a worker updates a part of the common buffer, the updated value only becomes visible to the other threads (including the main thread) some time later... unpredictable in a render loop.

The W3C and JS guys added a feature called Atomics ( https://github.com/tc39/ecmascript_sharedmem/blob/master/TUTORIAL.md ) to make operations atomic and to guarantee that the read value is the last updated one. Unfortunately, Atomics only works with integer typed arrays... so no way to manage float coordinates. I went crazy after hours of testing this. If you change the value of the variable workerNb in my code, you'll see the data synchronization issue live.
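To illustrate the constraint, a minimal sketch (not the prototype code) :

var sab = new SharedArrayBuffer(1024);
var ints = new Int32Array(sab);             // integer view : accepted by Atomics
var floats = new Float32Array(sab);         // float view : rejected by Atomics

Atomics.store(ints, 0, 42);                 // the write is guaranteed to be visible to the other threads
var value = Atomics.load(ints, 0);          // the read is guaranteed to return the last stored value
// Atomics.store(floats, 0, 1.5);           // would throw : no atomic access to float coordinates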

Note also that the VertexBuffer used by the GPU can't be shared directly among workers. We need a shared buffer for the workers, then we copy it into the vertex buffer before rendering.

I will probably make a third prototype sharing a buffer per worker... with 2, 3, or more workers (some browsers limit the number to 4), each computing simultaneously only a subpart of the array, in order to check whether it's worth it. This will actually be my next step before the GPU experiments, if they ever come one day...

#2: 40K is really a huge number for solid particles. In the PGs, you will hardly find examples with a particle count above 6K, as particles can easily be recycled when they get out of the frustum.

Maybe you could test the worker examples with, say, 10 or 12K particles (just copy/paste the 3 files onto your server and change the value of the variable particleNb), a count that a legacy SPS would normally not animate at 60 fps.


29 minutes ago, jerome said:

Unfortunately, Atomics only works with integer typed arrays... so no way to manage float coordinates.

Could you multiply the float values by something like 100 before storing them into the int array and then divide the int value by 100 to get back to your float value?


Yes, this is a workaround, but it quickly limits the smoothness because the precision is then fixed forever: say everything is multiplied/divided by 1000 (factor 1000), then wherever the camera is located and wherever each vertex is positioned, rotated or scaled, they are all bound to some virtual 3D grid.

Said differently, this technique reduces all the possible values to a finite pool: the integer values storable in the array.

intValue1 = (floatValue1 * 1000)|0      // cast to int with a precision of 1000, stored in the shared array

receivedFloatValue = intValue1 / 1000   // right ?

Well, plenty of different initial floatValue1 values would give the same intermediate intValue1, so the same final received value.

e.g. 1.0, 1.0001, 1.0002, 1.0003, etc., but also 1.00011, 1.000101, 1.000999999, etc. all end up at the same final value: 1.0
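You can check it quickly in a console (same cast as above, factor 1000) :

var factor = 1000;
console.log(((1.0 * factor) | 0) / factor);           // 1
console.log(((1.0001 * factor) | 0) / factor);        // 1
console.log(((1.000999999 * factor) | 0) / factor);   // 1 : three different floats, one single received value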

If all the particles have a size around, say, 1.0, can be as little as 0.1 away from one another, and the camera can get as close as 0.01 to them, this really matters.

Not sure I'm very clear. 


As for floats, I round all the geometry coming out of Blender to 4 decimals. It seems more than sufficient. We are also using 32 bit integers, not 16 bit. The largest value for normals is 1, right? I am wondering out loud about a worker to compute normals. In my morphing implementation, when a new target is being computed, the target normals are also calculated. Then each frame, I interpolate the normals as well as the positions. This will mean that starting a morph on the fly is at least one frame late. Since I have a queue based animation system anyway, this might not be a big hurdle. In fact, for QI.Automaton, I export those 24 tiny FACE targets. In the process of construction, I create all the expressions (speech visemes are also expressions) from their smaller parts. Doing that might reduce my load time. I could combine that with read ahead, since it would not be tying up the UI much at all.

Just thinking out loud.


It obviously depends on how much precision you need and the scale of your world.

boxes with size 1:

https://playground.babylonjs.com/#MTER2F#2

boxes with size .1:

https://playground.babylonjs.com/#MTER2F#1

boxes with size .01:

https://playground.babylonjs.com/#MTER2F

Good for hacking or your own project where you know the constraints.  Probably not a good fit for SPS. 

EDIT:

those first 3 examples were 3 decimal places.

Here is 4 decimal places:

boxes with size .1:

https://playground.babylonjs.com/#MTER2F#3

boxes with size .01

https://playground.babylonjs.com/#MTER2F#4

This might work for SPS.  Also remember that camera.minZ defaults to 1.  In most cases you wouldn't be close enough to see this level of detailed movement.

 


ok back.

I was about to implement some high precision float to integer casting in order to use the painful Atomics feature. Painful, because it means managing the way each buffer element is updated, then tagged as asleep in one thread and watched until awakened in the other thread... this for 7 million array elements!

Well, before diving into those complex programming constraints, I re-checked one last time my initial code, which shouldn't make real concurrent accesses to the same part of the memory, because each worker was supposed to read and write only its dedicated portion of the buffer. Moreover, the main thread was only supposed to read the shared buffer, whether the data were up to date or not.

And I found a tiny bug in the way I used the indexes to split the buffer between the workers... fixed. So no more flickering now. And no need for integer casting and all the Atomics stuff.
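The partitioning idea is simply to give each worker its own, non-overlapping slice of the shared array; a sketch with made-up names (the real code may differ) :

var floatsPerParticle = verticesPerParticle * 3;                 // 3 position floats per vertex
var particlesPerWorker = (particleNb / workerNb) | 0;
for (var w = 0; w < workerNb; w++) {
    var firstParticle = w * particlesPerWorker;
    var lastParticle = (w === workerNb - 1) ? particleNb : firstParticle + particlesPerWorker;
    workers[w].postMessage({
        cmd: 'init',
        buffer: sab,                                             // the same SharedArrayBuffer for every worker
        startIndex: firstParticle * floatsPerParticle,           // first float index this worker may write
        endIndex: lastParticle * floatsPerParticle               // exclusive upper bound of its slice
    });
}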

Everything is here : http://jerome.bousquie.fr/BJS/test/SPSWorker/

There's a folder for each version: one or two workers. Just click on the html file. The difference between versions 1 and 2 is just that version 1 implements the particle logic in the main thread, whereas version 2 implements the particle logic in each worker.

Theoretically, the version 2 should be faster.

On average machines, the FPS should be up to 20 times higher than with the mono-threaded typed array version or the legacy SPS:

Reference legacy SPS : http://jerome.bousquie.fr/BJS/test/spsReference.html

Lighter typed array based version : http://jerome.bousquie.fr/BJS/test/spsBuffer.html 

On fast machines, the FPS is almost always around 60. This is expected because the render loop is decoupled from the worker computations. So don't pay too much attention to the FPS when it's close to 60, and don't compare the particle speed between both versions either: the velocity step is clocked on the render loop in version 1, but on the worker cycle in version 2, so they might differ.

So what to compare then? Maybe the smoothness with which the particles evolve... are they jerky or not?

I also tried with 4 or 6 workers but could not get any gain. Actually, as the vertex buffer finally used by the GPU can't be shared among workers, we have to copy the buffer shared between the workers into the vertex buffer before rendering. This means a loop to copy 7M elements in the main thread anyway... this loop (as well as the particle logic when it runs in the main thread) can slow down the main thread, and thus the FPS.
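That final copy is basically this, done every frame in the render loop (a sketch where positions is a regular Float32Array and sharedPositions is the view on the SharedArrayBuffer; the real update code may differ) :

scene.registerBeforeRender(function () {
    positions.set(sharedPositions);       // bulk copy of the shared data into a regular Float32Array
    mesh.updateVerticesData(BABYLON.VertexBuffer.PositionKind, positions, false, false);
});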

As usual, please wait for a while until the FPS meter stabilizes.

 

 

 

 

 


Only 10 fps with http://jerome.bousquie.fr/BJS/test/SPSWorker/twoWorkers/spsProtoWorker2.html  ? rats !

On Linux, Chrome is really faster than FF... I get 60 fps in full screen on my powerful work machine: GPU Nvidia Quadro K620, CPU Intel Xeon E3 (4 x 3.1 GHz).

Anyway, 40K is certainly far too big a number...

[EDIT] : I just slowed down the particle speed


Heavy stress test here (beware, these need a very powerful PC) : http://jerome.bousquie.fr/BJS/test/SPSWorker/Ultimate/spsProtoWorker3.html

This one runs at 60 fps in Chrome here... 200K (yes: 200,000!) transparent quad particles and 4 workers.

 

[EDIT] http://jerome.bousquie.fr/BJS/test/SPSWorker/Ultimate/spsProtoWorker4.html

250K triangles, 6 workers, 60 fps in full screen in Chrome on my muscle work computer.


3 hours ago, jerome said:

I'm getting 25 fps with the above using FF on my low end Windows laptop (Intel HD integrated graphics).

I'm getting 4 fps on my low end Android tablet.

I'm impressed as I have seen apparently less intensive PGs fail on the Android.


Hi, third step : GPU SPS

http://jerome.bousquie.fr/BJS/test/SPSGpu/spsShader.html

This one runs at 60 fps in Chrome on my muscle work computer.

As usual, 40K solid particles (boxes + tetras). This is a mono-threaded JS version, but all the particle transformations (translation, rotation, scaling) are computed GPU side. I didn't code a nice fragment shader managing the light reflection for now.
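One possible way to push the transformations to the GPU (just a sketch of the principle, not necessarily what this prototype does) is to give each vertex its particle's position, quaternion and scaling as extra attributes and to apply them in a vertex shader :

BABYLON.Effect.ShadersStore['spsVertexShader'] =
    'precision highp float;' +
    'attribute vec3 position;' +
    'attribute vec3 particlePosition;' +          // per-vertex copy of the particle translation
    'attribute vec4 particleQuaternion;' +        // per-vertex copy of the particle rotation
    'attribute vec3 particleScaling;' +           // per-vertex copy of the particle scaling
    'uniform mat4 worldViewProjection;' +
    'vec3 rotate(vec4 q, vec3 v) { return v + 2.0 * cross(q.xyz, cross(q.xyz, v) + q.w * v); }' +
    'void main() {' +
    '    vec3 p = rotate(particleQuaternion, position * particleScaling) + particlePosition;' +
    '    gl_Position = worldViewProjection * vec4(p, 1.0);' +
    '}';
BABYLON.Effect.ShadersStore['spsFragmentShader'] =
    'precision highp float; void main() { gl_FragColor = vec4(1.0); }';   // flat white, no lighting yet
var mat = new BABYLON.ShaderMaterial('sps', scene, { vertex: 'sps', fragment: 'sps' },
    { attributes: ['position', 'particlePosition', 'particleQuaternion', 'particleScaling'], uniforms: ['worldViewProjection'] });
// then assign mat to the big SPS mesh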

Next step to come: the worker GPU SPS. Not sure we can gain even more performance anyway, because the particle logic (the user's custom behavior) has to remain in the main thread.

