Jump to content

Optimizing computeNormals()


Recommended Posts

(Pulled out of another thread: http://www.html5gamedevs.com/topic/15154-mesh-based-particle-system/?p=87087)

(There is also an older thread here: http://www.html5gamedevs.com/topic/14166-computenormals-optimization/)


The current computeNormals() function is very neat and tidy, but as it seems like it gets called a lot, I wonder if that is at the expense of speed/memory.


First, I wondered what this is doing:

normals[i1 * 3] = normals[i1 * 3] || 0.0;normals[i1 * 3 + 1] = normals[i1 * 3 + 1] || 0.0;normals[i1 * 3 + 2] = normals[i1 * 3 + 2] || 0.0;//repeat for i2 and i3

It seems to be guarding against bad data? Does bad data ever exist under normal usage? Could this clean-up operation be moved to another function, which could be called at the top of computeNormals(), and then a boolean parameter to say "skip the clean-up, I know my data is clean"?


Next question was about this code:

Vector3.FromFloatsToRef(normals[i1 * 3], normals[i1 * 3 + 1], normals[i1 * 3 + 2], vertexNormali1);//ditto for i2 and i3 into vertexNormali2 and vertexNormali3vertexNormali1 = vertexNormali1.addInPlace(faceNormal);//ditto for vertexNormali2 and vertexNormali3normals[i1 * 3] = vertexNormali1.x;normals[i1 * 3 + 1] = vertexNormali1.y;normals[i1 * 3 + 2] = vertexNormali1.z;//ditto for i2 and i3

The two functions being called actually don't do anything; they look like this:

Vector3.FromFloatsToRef = function (x, y, z, result) {    result.x = x;    result.y = y;    result.z = z;};
Vector3.prototype.addInPlace = function (otherVector) {    this.x += otherVector.x;    this.y += otherVector.y;    this.z += otherVector.z;    return this;};

My inlining we get:

vertexNormali1.x = normals[i1 * 3] + faceNormal.x;vertexNormali1.y = normals[i1 * 3 + 1] + faceNormal.y;vertexNormali1.z = normals[i1 * 3 + 2] + faceNormal.z;//Repeat for i2 and i3

But they are then just assigned back, and vertexNormali1 is not used anywhere else! So we can do away with it completely, and the code reduces to:

normals[i1 * 3] += faceNormal.x;normals[i1 * 3 + 1] += faceNormal.y;normals[i1 * 3 + 2] += faceNormal.z;

OR, if the || 0.0 code has to be done in this loop, then it can be:

normals[i1 * 3] = (normals[i1 * 3] || 0.0) + faceNormal.x;normals[i1 * 3 +1] = (normals[i1 * 3 +1] || 0.0) + faceNormal.y;normals[i1 * 3 +2] = (normals[i1 * 3 +2] || 0.0) + faceNormal.z;//Repeat for i2 and i3

So, fewer lines, more clarity, saving 6 function calls, possibly saving the creation of a few temporaries, and doing away with three local vars. Have I broken anything?  :-)




Link to comment
Share on other sites



AFAIRemember, the || 0.0 assignation was done not to prevent bad values, but to set initial values in a loop where some values weren't defined then.


You are right on the rest : we could get rid off internal Vector3 methods and re-implement them in computeNormals(). We could also re-implement the cross vector product and the normalize method which are finally only coordinates multiplications and additions.

I asked these questions to myself when re-coding the old algo.


Actually it has absolutely no visible impact on the execution. There are far more other bottlenecks in the BJS process, like making draw calls, etc. So the gain isn't pertinent.


The real gain might be in terms of object creation. If we don't create 9 Vector3 each call, the GC won't handle them. It is better for it to deal with scalar type variables (floats, integers).

So we could get rid off the creation of 9 vectors3 and recode everything with only additions and multiplications. Won't be very DRY.


The problem to me is that I am not sure that these 9 vectors3 are the cause of the GC.

Nine vector3 objects are just a speck of dust for the GC. We shouldn't even remark them in the CPU profiler.

Link to comment
Share on other sites

Looks good to me.


The "|| 0" check is because the passed-in normals array may start out empty, or not large enough. It would probably be quicker to do a preliminary loop filling it with zeroes to the correct sizebefore the main loop that sums normals together for each point. That would avoid the array reads, as well as the (presumably unwanted) weird results if someone passed in an array that wasn't zeroed.

Link to comment
Share on other sites

We don't really care about the fact that someone could pass a wrong array as computeNormals() is a kind of internal tool, although it is public and static.

AFAIR, reducing the number of passes when iterating on big arrays had a very sensitive effect.


In other words, a single for{} loop on 30K elements doing two assignations is faster than 2 for{} loops each making one assignation.

That's what I remember when I was working on it ...

Maybe I am wrong.


I mean that computeNormals() can be optimized again with direct calculations and maybe float32Array usage but the gains to get are now quite small.


I really need to chek if the GC activity is really related to computeNormals() or not.

Many tests on the ribbon example can't make me conclude... the browser (or the CPU profiler) doesn't react the same each test with or without freezeNormals(). I stressed this example for days in many browsers and didn't notice any real GC issue until now.

So I feel a little confused.


Imho, the current computeNormals() should be enough to deal properly with the GC. So there might be another issue somewhere else... if we can call this an issue because it runs pretty well at 60 fps.


I think I will go on with this state (for the SPS) and, only finally and if needed, check if deep optimizations can get things better.

Link to comment
Share on other sites

Jerome, I don't think qqdarren was talking about the GC so much as the number of reads/writes to object properties and array elements. Modern VMs do a lot of optimizing so it's hard to predict what will be fast, but in general we can expect that math on local vars will usually be done in registers, while code like:

normals[i] += myvec.x;

usually means the VM has to look up property i from array normals, read it, then look up property x from myvec, read that, then do the math, then assign the array value. Basically there are a lot more reads and writes than necessary.


Also, re:GC, there can be cases where innocuous-looking code might (behind the scenes) force the VM to, say, convert a number from int representation into a boxed double. Even if the value doesn't outlive the function call, if that double went onto the heap then it can lead to GCs. It varies by VM so it's hard to say anything for sure without experimenting.



To that end, here is a simple A/B test of the current compute normals compared to one which zeroes out the normals array and then makes the changes darren suggested:


To compare just run a cpu profile and compare the time spent in "ComputeNormals" vs "ComputeNormals2". On my machine in chrome the latter is roughly twice as fast, but I think it could be improved more, e.g. by changing the normalization at the end to do the math directly rather than via Vector3 methods.



Link to comment
Share on other sites

Seems to be a good optimization (and really simple to implement).


I just couldn't see the difference until now on my machine with 1000 segments. Both functions ran at 60 fps. I need to check the CPU profiler.

If you get an improvement of twice the speed it's really worth it ...! :)


I really need to check this in deep.

Link to comment
Share on other sites

My interest was in why performance on Firefox is worse than Chrome. E.g. perhaps the Chrome JS interpreter is doing the inlining and Firefox isn't. So we should inline for it (except where there is an obvious downside, such as dramatically increasing the number of lines). Here inlining showed the temporary wasn't even needed, so the code also got simpler.


BTW, regarding the DRY principle: this has to be balanced against another principle, called "Don't Over-Engineer". That is judged by experience and common-sense. You should only be making small functions for everything if you are working in Java Middleware with an over-zealous manager who has put in place every metric he could find, and you are therefore forbidden from having functions with more than six lines. Another sign you are in over-engineered-coding-hell is you are being forced to write unit tests for the getters and setters, just so you can get 100% code coverage. ;-)

Link to comment
Share on other sites


promise, I will go in under-engineering !


Actually, I was just lazy (as usual) and used the BJS vector3 functions instead of re-implemented them (cross vector product, etc) because I thought they were no real gain.

Need to check gain in fact and if it has an impact on the GC... or not (maybe other cause)


I know the FF VM is far behind the Chrome one (and the IE one under Windows too). Moreover Chrome (and IE, I guess) has a VM process or thread per browser tab, whereas FF shares the JS VM among all open tabs.

Link to comment
Share on other sites

Hm. Well, I'm reasonably familiar with v8, and there, it's usually best to just write straightforward JS and not worry about optimization until you know which functions are hot. v8 will do a fair amount of inlining and so on on its own, provided your functions meet certain criteria, so it's often best to just worry about meeting those criteria, and letting it work its magic.


But if you can isolate stress tests with real world data where the it spends 90% of its time in one function, then it's sometimes worthwhile to temporarily throw DRY out the window and micro-optimize the hottest code path. Personally I think this is fine for stuff that's intended to be the "internals" of something like BJS, which "users" shouldn't normally need to look at. 


With all that said I know very little about FF. I know they have an engine (IonMonkey) that does a lot of the same optimizations as v8 but I don't know if FF uses by default yet. And for IE I know even less.

Link to comment
Share on other sites

Well the good new is that we could probably get some micro-seconds and maybe some GC rest by recoding (again) some parts of computeNormals()


The bad new is that I will have to do this  :lol:


Well, I will go on for now a little on SPS work (+ my real job  :P ) and then only I will review computeNormals()

Link to comment
Share on other sites

Hey there,


you only talk about "micro-seconds".

I don't know your testing machines, but for me this means a lot, because I always watch my applications running on

slow android smartphones.

And there "micro-seconds" can result in 1ms per frame, which is a lot.


So never stop investigating in any performance stuff  :D

So just wanted to say thanks! 

Link to comment
Share on other sites

  • 2 months later...
  • 2 months later...

As ASM.js tests aren't as promising as expected, I'm about to do a final optimization on ComputeNormals() :


it will now only handle scalar values (and no more js objects) and reduce again the temporary variables used


This improvement will probably give just a really tiny gain and would have an small effect only when the function is called within the render loop on huge meshes needed to be updated.


I did it because I just couldn't let things yet optimizable but not optimized :P

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.


  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Create New...