I think the general consensus (at least from my interactions) is this: a local minimum requires the gradient to vanish, and when you have many dimensions, it's unlikely that all of its components are 0 at the same point. Coupled with modern optimization methods (primarily momentum), this nudges the result toward a shallow valley as opposed to a spiky minimum. The leap of faith is equating shallow = general and spiky = overfitted.
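
To make the intuition behind that leap of faith concrete, here is a toy sketch, not a proof. It assumes (my assumption, not an established fact) that the test loss looks like the training loss shifted sideways by a small amount, and it uses two made-up 1D quadratics, `sharp_loss` and `flat_loss`, as stand-ins for a spiky and a shallow minimum. Both have their minimum at `w = 0` with the same training loss.

```python
def sharp_loss(w):
    # Hypothetical "spiky" minimum at w = 0 (curvature 100).
    return 50.0 * w ** 2


def flat_loss(w):
    # Hypothetical "shallow" minimum at w = 0 (curvature 1).
    return 0.5 * w ** 2


def loss_increase(loss, w_star, shift):
    # Assumption: the test landscape is the training landscape moved by
    # `shift`, while the weights stay at the training minimum w_star.
    return loss(w_star - shift) - loss(w_star)


shift = 0.1  # illustrative train/test mismatch, chosen arbitrarily
sharp_gap = loss_increase(sharp_loss, 0.0, shift)
flat_gap = loss_increase(flat_loss, 0.0, shift)

print("loss increase at spiky minimum:  ", sharp_gap)
print("loss increase at shallow minimum:", flat_gap)
```

Under that shift assumption the spiky minimum pays a much larger penalty for the same mismatch, which is the usual argument for shallow = general. Whether real train/test differences actually behave like a small shift is exactly the part that has to be taken on faith.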