This happens in smaller models because they reach parameter saturation very quickly. With modern LLMs and current datasets, it is hard to even get to that point: the training compute budget only allows a handful of epochs over the data (sometimes less than one). Overtraining GPT-4 the way you could overtrain a convnet from the last decade would take tremendous resources and time.
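For a rough sense of why, here's a back-of-the-envelope sketch in Python. The corpus and training-token counts are assumptions for illustration (real GPT-4 figures aren't public); the ImageNet numbers reflect common practice for convnets of that era:

```python
# Back-of-the-envelope: how many epochs does a large LLM actually see?
# The LLM numbers below are illustrative assumptions, not published figures.

dataset_tokens = 10e12   # assumed size of the deduplicated training corpus
training_tokens = 6e12   # assumed total tokens processed during training

epochs = training_tokens / dataset_tokens
print(f"Effective epochs over the corpus: {epochs:.2f}")  # -> 0.60

# Contrast with a last-decade convnet: ImageNet-1k (~1.28M images) was
# commonly trained for ~90 epochs, so the model revisits every sample
# dozens of times -- exactly the regime where overfitting becomes easy.
imagenet_images = 1.28e6
convnet_epochs = 90
samples_seen = imagenet_images * convnet_epochs
print(f"Samples seen by the convnet: {samples_seen:.2e}")  # ~1.15e8
```

The point of the contrast: memorization needs repeated exposure to the same samples, and at under one epoch an LLM simply never gets it.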