I think the point being made is that a deep network (GPT-3, with its 175B parameters, was the example) will, by virtue of its size, not have 'local optima' in the traditional sense, because 'local' loses its usual meaning in such high-dimensional spaces. As the number of dimensions increases, there are ever more directions along which to move away from a given parameter set, toward either better or worse ones, so a critical point is far more likely to be a saddle point than a true minimum. Optimization algorithms therefore don't have to worry much about being trapped in a local optimum. Also, because there are many good parameter sets, not all parameters even need to be used to get good results, which is why processes like distillation can work.
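A rough way to see why true local minima become rare as dimension grows (this is the standard saddle-point argument from random matrix theory, not anything specific to GPT-3, and the random-Hessian model here is an illustrative assumption): model the Hessian at a critical point as a random symmetric matrix and measure how often *all* of its eigenvalues are positive, which is the condition for that point to be a local minimum rather than a saddle.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_local_minima(dim, trials=2000):
    """Estimate the fraction of random symmetric 'Hessians' of size
    dim x dim whose eigenvalues are all positive, i.e. the fraction of
    critical points that would be true local minima under this model."""
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        h = (a + a.T) / 2  # symmetrize to get a valid Hessian-like matrix
        # eigvalsh returns eigenvalues in ascending order; if the smallest
        # is positive, every descent direction is ruled out (a minimum)
        if np.linalg.eigvalsh(h)[0] > 0:
            count += 1
    return count / trials

for d in (1, 2, 4, 8):
    print(d, frac_local_minima(d))
```

In one dimension roughly half of the critical points are minima, but the fraction collapses quickly as the dimension rises, since every additional eigenvalue is another chance for a descent direction to exist. At 175B dimensions, essentially every critical point an optimizer meets is a saddle it can escape.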