it's been proven that all models learned by gradient descent are approximately equivalent to kernel machines. interpolation isn't generalization: if there's a new input sufficiently different from the training data, the behaviour is unknown
Can you say what that says about the behavior described with the modular arithmetic in the article?
And, in particular, how should I interpret the fact that different hyperparameters determined whether runs that obtained equally high accuracy on the training data got good or bad scores on the test data, through that "view it as a kernel machine/interpolation" lens?
My understanding is that the behavior in at least one of those "models learned by gradient descent are equivalent to [some other model]" papers, works by constructing something which is based on the entire training history of the network. Is that the kernel machines one, or some other one?
if you train a model on modular arithmetic, it can only learn what's in the training data. if all of the examples are of the form a + b mod 10, it isn't likely to generalize to be able to solve a + b mod 12. a human can learn the rule and figure it out; a model can't. that's why a diverse training set is so important. it's possible to train a model to approximate any function, but whether the approximation is accurate outside of the datapoints you trained on is not reliable, as far as I understand.
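to make that concrete, here's a toy sketch (my own illustration, not from the article or any paper): a pure "memorize and interpolate" model, implemented as 1-nearest-neighbour lookup, trained only on a + b mod 10 pairs. it nails in-distribution queries but can only ever emit mod-10 labels, so a mod-12 question is answered wrongly no matter what.

```python
# a pure interpolation "model": 1-nearest-neighbour over memorized pairs.
# all names and numbers here are illustrative, not from any real experiment.

def nearest_neighbour_predict(train, query):
    """Return the label of the training pair closest to the query (a, b)."""
    qa, qb = query
    (_, label) = min(
        train,
        key=lambda item: (item[0][0] - qa) ** 2 + (item[0][1] - qb) ** 2,
    )
    return label

# training set: every pair (a, b) in 0..9, labelled (a + b) mod 10
train = [((a, b), (a + b) % 10) for a in range(10) for b in range(10)]

# in-distribution: the answer is literally memorized
print(nearest_neighbour_predict(train, (3, 4)))  # 7, i.e. (3 + 4) % 10

# "new rule" question: 7 + 7 mod 12 should be 2, but the model can only
# reproduce mod-10 labels, so it answers 4
print(nearest_neighbour_predict(train, (7, 7)))  # 4, not (7 + 7) % 12 == 2
```

a human who learned the *rule* would switch modulus effortlessly; the lookup model has nothing to switch.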
different hyperparameters can give a model that is over- or underfit, but this helps the model interpolate, not generalize. it can know all the answers similar to the training data, not answers different from it
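the interpolate-vs-generalize gap shows up even in the simplest kernel machine. below is a minimal sketch (my own toy, with made-up numbers) of Nadaraya-Watson kernel regression fit to y = x², which is one standard form of kernel-weighted interpolation: between training points the prediction is close, but past the edge of the training data it just flattens toward the nearest memorized value.

```python
import math

def kernel_regress(xs, ys, x, bandwidth=0.5):
    """Gaussian-kernel weighted average of training targets (Nadaraya-Watson)."""
    weights = [math.exp(-((x - xi) / bandwidth) ** 2) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [i * 0.5 for i in range(21)]   # training inputs 0.0, 0.5, ..., 10.0
ys = [x ** 2 for x in xs]           # training targets y = x^2

# interpolation, between training points: close to 3.25^2 = 10.5625
print(kernel_regress(xs, ys, 3.25))

# extrapolation, far outside the training range: the prediction collapses
# to roughly the nearest training target (10^2 = 100), nowhere near 20^2 = 400
print(kernel_regress(xs, ys, 20.0))
```

no hyperparameter choice (bandwidth here, stand-in for the knobs in the article) fixes the second case; it only changes how smoothly the model interpolates inside the data it has seen.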