Think of the model as a point in high-dimensional space that you're trying to trap inside a cage. (Corresponding to it sitting at a local minimum, surrounded by higher points in every direction.) So you're trying to build a cube in n dimensions, which has 2n faces. Now imagine that each of those faces is independently present or absent, with probability p of being present. To actually trap a model like GPT-3, where n is about 175 billion, p^350,000,000,000 has to be high enough that at some point during training you observe a case where every single face is present.