Does anyone have a video or post that explains the optimization part for the original paper? I understand most of it but that part and can’t seem to wrap my head around it.
Just glossed over the paper but it seems, in principle, simple enough (though rather brilliant IMHO).
Essentially they're doing what you do when you train a neural network, only that instead of adjusting weights connecting "neurons", you adjust the shape and position of gaussians, and the coefficients of spherical harmonics for the colors.
This requires the rendering step to be differentiable, so that you can back-propagate the error between the rendering and the ground-truth image.
The next key step is to adjust the number of gaussians every N iterations: either fill in detail by cloning a gaussian in an area that is undercovered, or split a gaussian in an area that is overcovered.
They use the gradient of the view-space position to determine whether more detail is needed, i.e. those gaussians the optimizer wants to move significantly across the screen seem to be in a region with not enough detail.
They then use the covariance of the gaussians to decide whether to split or to clone: gaussians with large variance get split, the others cloned.
They also remove gaussians which are almost entirely transparent, no point in keeping those around.
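The clone / split / prune heuristic described above can be sketched roughly like this (all names, thresholds, and the shrink factor are invented for illustration; this is not the paper's actual code):

```python
import numpy as np

def densify_and_prune(positions, scales, opacities, pos_grads,
                      grad_threshold=0.0002, scale_threshold=0.01,
                      opacity_threshold=0.005):
    """Rough sketch of the periodic density-control step.
    All thresholds here are invented for illustration."""
    # Prune: drop gaussians that are almost entirely transparent.
    keep = opacities > opacity_threshold
    positions, scales, opacities, pos_grads = (
        a[keep] for a in (positions, scales, opacities, pos_grads))

    # Large view-space position gradients signal missing detail.
    needs_detail = np.linalg.norm(pos_grads, axis=1) > grad_threshold
    large = scales.max(axis=1) > scale_threshold

    clone = needs_detail & ~large   # small blob: duplicate it
    split = needs_detail & large    # large blob: break it up

    new_positions = np.concatenate([positions,
                                    positions[clone] + pos_grads[clone],
                                    positions[split]])
    new_scales = np.concatenate([scales,
                                 scales[clone],
                                 scales[split] / 1.6])  # shrink factor is a guess
    new_opacities = np.concatenate([opacities,
                                    opacities[clone],
                                    opacities[split]])
    return new_positions, new_scales, new_opacities
```

The copies of a cloned blob are nudged along the position gradient so they don't sit exactly on top of each other, matching the intuition above.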
That's my understanding at least, after a first time gloss-through.
> Essentially they're doing what you do when you train a neural network, only that instead of adjusting weights connecting "neurons", you adjust the shape and position of gaussians, and the coefficients of spherical harmonics for the colors.
My brain:
> They're providing inverse reactive current to generate unilateral phase detractors, automatically synchronizing cardinal gram meters.
Heh. For those who haven't dabbled much with neural nets, the key aspect here is backpropagation[1]. If you want to optimize a process, you typically change the parameters (turn a knob or change a number) and see how the output reacts. If it changed too much, you reduce the parameter, and so on. This is a forwards process.
The idea in backpropagation is instead to mathematically relate a change in output to a change in the parameters. You figure out how much you need to change the parameters to change the output a desired amount. Hence the "back" in the name: since you want to control the output, "steering" it in the direction you want, you go backwards through the process to figure out how much you need to change the parameters.
Instead of "if I turn the knob 15 degrees the temperature goes up 20 degrees", you want "in order to increase the temperature 20 degrees the knob must be turned 15 degrees".
By comparing the output with a reference, you get how much the output needs to change to match the reference, and by using the backpropagation technique you can then relate that to how much you need to change the parameters.
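To make the knob example concrete: if you know (or can estimate) the local slope relating output to parameter, going "backwards" is just dividing the desired output change by that slope. A hypothetical sketch using only the numbers from the example above:

```python
# Forward knowledge: turning the knob 15 degrees raises the temperature
# 20 degrees, so the local slope is 20/15 degrees of temperature per
# degree of knob.
slope = 20.0 / 15.0

# Backwards question: how far must the knob turn to raise the
# temperature by 20 degrees? Divide the desired change by the slope.
desired_temp_change = 20.0
knob_change = desired_temp_change / slope
print(knob_change)  # approximately 15 degrees of knob
```

Backpropagation chains this same divide-by-the-slope reasoning through every stage of a multi-step process.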
In neural nets the parameters are the so-called weights of the connections between the layers in the model. However the idea is quite general so here they've applied it to optimizing the size, shape, position and color of (gaussian) blobs, which when rendered on top of each other blend to form an image.
Changing a blob's position, say, might make it better for one pixel but worse for another. So instead of making one big change to the parameters, you take small iterative steps. This is the so-called training phase. Over time, the hope is that the output error decreases steadily.
edit: while backpropagation is quite general as such, as I alluded to earlier, it does require that the operation behaves sufficiently nicely, so to speak. That's one reason for using gaussians over, say, spheres. Gaussians have nice smooth properties. Spheres have an edge, the surface, which introduces a sudden change. Backpropagation works best with smooth changes.
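As a toy illustration of what the training phase does, here's a 1D version: fit a single gaussian "blob" to a reference signal by gradient descent on its position and width. The learning rate, iteration count, and target values are all invented:

```python
import numpy as np

xs = np.linspace(-3, 3, 100)
# Reference "image": a gaussian centered at 0.7 with width 0.5.
target = np.exp(-((xs - 0.7) ** 2) / (2 * 0.5 ** 2))

mu, sigma = 0.0, 1.0   # initial blob position and size (deliberately wrong)
lr = 0.05
for _ in range(2000):
    pred = np.exp(-((xs - mu) ** 2) / (2 * sigma ** 2))
    err = pred - target
    # Analytic gradients of the squared error w.r.t. mu and sigma.
    dmu = np.sum(2 * err * pred * (xs - mu) / sigma ** 2)
    dsigma = np.sum(2 * err * pred * (xs - mu) ** 2 / sigma ** 3)
    mu -= lr * dmu / len(xs)
    sigma -= lr * dsigma / len(xs)

print(mu, sigma)  # should end near mu ~ 0.7, sigma ~ 0.5
```

The real system does the same thing in 3D with many blobs at once, and with color and opacity as extra parameters, but the "nudge each parameter against its gradient" step is the same idea.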
Just to add some detail regarding the "blob optimization" phase.
The algorithm that recovers the camera positions from the reference images also gives you a sparse cloud of points (it places pixels from the images in 3D space). Use those points as the centers of the initial blobs, and give each blob an initial size. This is almost certainly not enough detail, but it's a start.
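A hypothetical sketch of that initialization: one blob per point in the sparse cloud, sized from the distance to its nearest neighbour (the real implementation details may well differ):

```python
import numpy as np

def init_blobs(points):
    """Seed one isotropic blob per sparse-cloud point.
    Brute-force nearest neighbour; real code would use a k-d tree."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore distance to self
    nearest = d.min(axis=1)              # distance to closest other point
    scales = np.repeat(nearest[:, None], 3, axis=1)  # isotropic initial size
    opacities = np.full(n, 0.5)          # arbitrary starting opacity
    return points.copy(), scales, opacities
```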
Then you run the "training" for a while, optimizing the position and shape of the blobs. Then you try to optimize the number of blobs. The key aspect here is to determine where more detail is needed.
To do so, they exploit the fact that they already have derivatives of several properties, including the screen position of each blob. If the previous training pass tried to move a given blob a significant distance on the screen, they take that as a signal that the backpropagation is struggling to cover that area.
They then densify, either by cloning or by splitting, depending on whether the blob is large.
If it's small, they assume there's detail it can't fill in, so they duplicate the blob and move the copy slightly in the direction the optimizer wanted to move the source, so the two don't overlap exactly.
If the blob is large, they assume the detail is finer than the blob can represent, so they split it up, calculating the properties of the new blobs so that they best cover the volume the source blob covered.
This process of training followed by blob optimization is repeated until the error is low enough or doesn't change enough, suggesting it converged or a failure to converge respectively.
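The overall schedule (train for a while, adjust the blob count, then check for convergence or stagnation) might be skeletonized like this, with `training_step` and `densify` passed in as stand-ins for the real machinery:

```python
def fit(blobs, images, training_step, densify,
        max_rounds=100, densify_every=100,
        tol=1e-4, min_improvement=1e-6):
    """Hypothetical outer loop: alternate training with density control,
    stopping once the error is small (converged) or stops improving
    (failed to converge). All thresholds are made up."""
    prev_err = float("inf")
    for _ in range(max_rounds):
        err = prev_err
        for _ in range(densify_every):
            err = training_step(blobs, images)  # render, compare, backprop
        blobs = densify(blobs, err)             # clone / split / prune
        if err < tol or prev_err - err < min_improvement:
            break
        prev_err = err
    return blobs
```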
Thank you. This was much more approachable for someone like myself that has little background (a few undergrad courses) in both machine learning and computer vision concepts.
I was just about to ask why not use a sphere. Since this could be thought of as a NN, I guess it will be folded into NNs someday, and the splitting and merging could then be compared with dropout.
I'm no expert, but my immediate thoughts are that evaluating a gaussian blob is very simple, it's just an exponential of a distance. The edge of a sphere makes it more complicated to compute, hence slower.
For backpropagation, the derivatives of a gaussian are smooth, while they're not for a sphere, again because of the edge.
Now, if you want to use a sphere you probably will do something like using an opacity falloff similar to ReLU[1], making it transparent at the edge.
This should make it smooth enough, I guess, but I imagine you'd still have the more complicated rendering. Though I may be mistaken.
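The difference shows up in toy opacity profiles: a gaussian falls off smoothly everywhere, a hard sphere jumps at its surface, and a ReLU-style linear fade (a made-up compromise) is continuous but still has kinks where its derivative jumps:

```python
import numpy as np

def gaussian_opacity(d, sigma=1.0):
    # Smooth everywhere; derivatives exist at every distance d.
    return np.exp(-d**2 / (2 * sigma**2))

def hard_sphere_opacity(d, radius=1.0):
    # Discontinuous step at the surface; bad for backpropagation.
    return np.where(d <= radius, 1.0, 0.0)

def relu_sphere_opacity(d, radius=1.0, falloff=0.2):
    # Linearly fade from 1 to 0 over [radius - falloff, radius].
    # Continuous, but the derivative still jumps at both kinks.
    return np.clip((radius - d) / falloff, 0.0, 1.0)
```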
I still read comments like those, though: there's a chance I might make sense of a word! But I did find myself laughing as I read the original post, thinking about how it sounds like word salad.
The objects being optimized are the parameters of a 3D gaussian; just imagine a blob changing shape. Those parameters are optimized instead of the weights of a neural network.
What parts confuse you? There are a few steps in the optimization. There are lots of papers on differentiable rendering, but for the pruning of gaussians and their actual treatment, I don't think there's a blog post.