The question you asked is: “I simply cannot make sense of what you write in item #1 in your ‘avoidance’ section. What am I missing?” What I described in item #1 is, in fact, model drift. In the underspecification paper, the authors identify model drift as a related issue that also degrades a model’s performance once it is put into practice, so I included it as a takeaway here. It is not the only takeaway I mention, though, and I hope that one takeaway covering this related concept doesn’t leave you thinking there is nothing new or original to be gained from the Google paper. (Model drift isn’t related to tips 3, 4, or 5 that I described, for example.)
You shared “It is well known that parameter space has many local minima that will generate models that perform nearly equally well on ‘small’ validation sets.” That does describe the issue of underspecification. The authors say “A ML pipeline is underspecified if there are many [models] that a pipeline could return with similar predictive risk.”
Note that this isn’t necessarily tied to “small” validation sets. Of course, most things in machine learning improve with larger sample sizes, all else being equal, but that is only one part of the problem.
The authors later say that underspecification causes problems when the models “encode substantially different inductive biases that result in different generalization behavior on distributions that differ from [the training distribution].”
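To make that symptom concrete, here is a minimal sketch (my own illustration, not an experiment from the paper): retrain the same pipeline with different random seeds, confirm that the iid validation scores agree closely, then evaluate the same models on a shifted copy of the data. The synthetic dataset, the MLP pipeline, and the feature rescaling used as a stand-in “shift” are all assumptions made for the sketch.

```python
# Illustrative only: same pipeline, different random seeds, similar iid
# validation accuracy, potentially divergent behavior under distribution shift.
# The synthetic data, model choice, and the "shift" are invented for this sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# Crude stand-in for a deployment distribution that differs from training:
# rescale a few features of the validation set.
X_shifted = X_val.copy()
X_shifted[:, :3] *= 2.0

for seed in range(5):
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=seed)
    model.fit(X_train, y_train)
    print(f"seed={seed}  iid val acc={model.score(X_val, y_val):.3f}  "
          f"shifted acc={model.score(X_shifted, y_val):.3f}")
```

If the iid scores cluster tightly while the shifted scores spread out, that spread is exactly what the quoted sentence describes: models that look equivalent on the held-out set but carry different inductive biases and generalize differently off-distribution.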
This is part of why I advocate ensuring that your training distribution is as similar as possible to your deployment distribution. You may find this “obvious,” but I think it bears repeating because many, many deployed models are trained on data that doesn’t mirror the deployment distribution as closely as it could. Thinking critically about this may lead to an easier solution than learning about, developing, and using stress tests to detect underspecification (what I lay out in item #3). A gap between training and deployment distributions is also a problem that increased sample size is unlikely to fix; in fact, more data may improve performance on the iid held-out test/validation sets without addressing generalization beyond the training distribution. That would boost confidence in the model pre-deployment while the gap between pre-deployment and post-deployment performance grows even larger.
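As a first check on whether your training data actually mirrors deployment, something as simple as a per-feature two-sample test can surface a mismatch before you ship. This is a minimal sketch, assuming you can pull a sample of recent deployment inputs; the synthetic arrays, feature names, and the 0.01 threshold are placeholders of mine, not a recommendation from the paper.

```python
# Minimal sketch of checking whether deployment data drifts from training
# data, feature by feature, with a two-sample Kolmogorov-Smirnov test.
# The arrays, feature names, and the 0.01 threshold are illustrative
# assumptions, not a prescription from the paper.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
feature_names = ["age", "income", "tenure"]                  # hypothetical features
X_train = rng.normal(loc=0.0, scale=1.0, size=(10_000, 3))   # stand-in training sample
X_deploy = rng.normal(loc=0.3, scale=1.2, size=(2_000, 3))   # deliberately shifted

for j, name in enumerate(feature_names):
    stat, p_value = ks_2samp(X_train[:, j], X_deploy[:, j])
    flag = "DRIFT?" if p_value < 0.01 else "ok"
    print(f"{name:8s}  KS={stat:.3f}  p={p_value:.2e}  {flag}")
```

In practice you would replace the synthetic arrays with real training and production samples, and treat a flagged feature as a prompt to investigate, not proof of a problem.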