Model capacity large enough to perfectly memorize/interpolate the data may not be a bad thing. A phenomenon known as "deep double descent" shows that increasing model capacity relative to the dataset size can reduce generalization error even after the model achieves perfect training performance (see work by Mikhail Belkin [0] and empirical demonstrations on large deep learning tasks by researchers from Harvard/OpenAI [1]). Other work argues that memorization is critical to good performance on real-world tasks where the data distribution is often long-tailed [2]: to perform well at tasks where the training set contains only 1 or 2 examples, it's best to memorize those labels rather than extrapolate from other data (which a lower-capacity model may prefer to do).
I think there's something critical about the implicit data universe being considered in the training and test data, and in these randomized datasets. Memorizing elephants isn't necessarily a bad thing if you can be assured there actually are elephants in the data, or if your job is to reproduce data that has highly non-random, low-entropy (in an abstract sense) features.
I think where the phenomenon in this paper, and deep double descent, start to clash with intuition is the more realistic case where the adversarial data universe is structured, not random, but does not conform to the observed training target label alphabet (to borrow a term loosely from the IT literature). That is, it's interesting to know that these models can perfectly reproduce random data, but generalizing from training to test data isn't interesting in a real-world sense if both are constrained by some implicit features of the data universe involved in the modeling process (e.g., that the non-elephant data only differs randomly from the elephant data, or doesn't contain any non-elephants that aren't represented in the training data). So then you end up with this: https://www.theverge.com/2020/9/20/21447998/twitter-photo-pr...
I guess it seems to me there are a lot of implicit assumptions about the data space and what's actually being inferred in a lot of these DL models. The insight about SGD is useful, but maybe only underscores certain things, and seems to get lost in some of the discussion about deep double descent. Even Rademacher complexity isn't taken with respect to the entire data space, just over a uniformly random sampling of it -- so it will underrepresent the corners of the data space that are highly nonuniform and low-entropy, which is exactly where the trouble lies.
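To illustrate the sample-dependence point: empirical Rademacher complexity is computed on whatever sample you happen to draw. A quick sketch (dimensions and distributions are my own illustrative choices) using the unit-norm linear class, where the supremum over w conveniently has a closed form, shows the estimate coming out quite different on a uniform sample versus a tight, low-entropy "corner" of the same cube:

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_rademacher(X, n_draws=2000):
    # Empirical Rademacher complexity of the unit-norm linear class
    # {x -> <w, x> : ||w||_2 <= 1}.  For this class the supremum over w
    # has a closed form: ||(1/n) * sum_i sigma_i * x_i||_2,
    # so we just average that norm over random sign vectors.
    n = X.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return np.mean(np.linalg.norm(sigmas @ X / n, axis=1))

n, d = 200, 10
X_uniform = rng.uniform(-1, 1, (n, d))              # uniform over the cube
X_corner = 1.0 + 0.05 * rng.uniform(-1, 1, (n, d))  # tight, low-entropy corner
r_uniform = empirical_rademacher(X_uniform)
r_corner = empirical_rademacher(X_corner)
print(r_uniform, r_corner)
```

A uniformly drawn sample simply never probes that corner, so the complexity number it yields says little about behavior there.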
There's lots of fascinating stuff in this area of research, lots to say, glad to see it here on HN again.
Thanks for the paper in your [2] link; it's an interesting one.
If you think about it, the high-level idea makes some intuitive sense. Since NNs are known to be "equivalent" to kernel methods - which are, to oversimplify, essentially nearest-neighbour plus a similarity function - the ability of NNs to memorise specific training examples can be analogised to adding another "neighbour" to interpolate from. So maybe it's not too surprising that NNs which can do this can have better generalisation performance. (Although it is still surprising, since I certainly wouldn't have predicted it a priori!)
Really, what is the difference between a "feature" and "memorising a data point"? It seems only a matter of scale - how much of the input space is "relevant" to the learned representation.
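The "nearest-neighbour + similarity function" picture above is basically Nadaraya-Watson kernel regression: predictions are similarity-weighted averages over the stored training points, so a memorised example literally is one more neighbour. A minimal sketch (RBF similarity and all bandwidths/sizes are my own illustrative choices, not anything from the papers):

```python
import numpy as np

rng = np.random.default_rng(1)

# "Memorised" training set: the model just stores these points.
x_train = rng.uniform(-1, 1, 30)
y_train = np.sin(3 * x_train)

def rbf(a, b, bandwidth=0.2):
    # Similarity function: closer points get exponentially more weight.
    return np.exp(-((a - b) ** 2) / (2 * bandwidth**2))

def predict(x):
    # Similarity-weighted average over every stored example
    # (Nadaraya-Watson kernel regression).
    w = rbf(x, x_train)
    return np.sum(w * y_train) / np.sum(w)

x_query = 0.5
print(predict(x_query), np.sin(3 * x_query))
```

Adding a training point to `x_train`/`y_train` changes nothing structural about the model; it just contributes one more weighted term, which is the sense in which memorisation and interpolation blur together here.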
> Since NNs are known to be "equivalent" to kernel methods
I remember reading this somewhere, but I can't find the exact paper. Do you think you could link me to it?
Also, I seem to recall the paper didn't provide a method to actually construct the equivalent kernel. Do you know of any work since then that actually constructs an equivalent kernel method?
There is also this, which seems to have been produced in parallel - closely related to the first but not exactly the same: https://arxiv.org/abs/1804.11271
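On actually constructing a kernel: for a finite network you can always compute the empirical "tangent kernel" at initialisation, K(x, x') = grad_f(x) . grad_f(x'), where the gradients are with respect to the parameters; the NTK line of work studies the infinite-width limit of this object. A hedged sketch for a tiny one-hidden-layer ReLU net on 1-D inputs (width, scaling, and initialisation are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

m = 5000                     # hidden width; the kernel stabilises as m grows
w = rng.standard_normal(m)   # input-to-hidden weights (1-D inputs)
a = rng.standard_normal(m)   # hidden-to-output weights

def param_grad(x):
    # Gradient of f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j * x)
    # with respect to all parameters (a, w), stacked into one vector.
    pre = w * x
    act = np.maximum(pre, 0.0)
    da = act / np.sqrt(m)
    dw = a * (pre > 0) * x / np.sqrt(m)
    return np.concatenate([da, dw])

def ntk(x1, x2):
    # Empirical tangent kernel: inner product of parameter gradients.
    return param_grad(x1) @ param_grad(x2)

print(ntk(0.5, 0.5), ntk(0.5, -0.5))
```

With a Gram matrix built from `ntk` you can run ordinary kernel regression, which is one concrete sense of "equivalent kernel method" for a given (wide) network.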
[0]: https://arxiv.org/abs/1812.11118
[1]: https://openai.com/blog/deep-double-descent/
[2]: https://arxiv.org/abs/2008.03703