That's a valid question! I believe the main issue lies with your assumption: "Assuming that our training, validation, and test data is sampled from the same distribution as the 'real world' distribution."
In academic settings, it's easy to wave this assumption away.
But in many practical cases, this isn't a trivial thing to do. In NLP, new slang develops. In finance, the economic conditions in your training data almost certainly won't hold when you forecast forward, and the same is true of most problems with a temporal component. If you're deploying a computer vision app, it will be used on many different phones in many different environments, and it's nearly impossible to ensure that your training data reflects the conditions under which the app will actually be used. Each of these models will probably perform worse when deployed than when evaluated on a held-out validation or test set.
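To make the temporal point concrete, here's a toy sketch (my own construction, not from the paper) where the feature-label relationship drifts over time. A random split mixes past and future together and reports a rosier number than a chronological split, which is closer to how the model would actually be used:

```python
# Toy example: a label whose relationship with the features drifts over time.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
t = np.arange(n)
drift = t / n  # grows from 0 to 1, like a slowly changing economic regime

X = rng.normal(size=(n, 2))
# Early on, the label depends mostly on feature 0; later, mostly on feature 1.
logits = X[:, 0] * (1 - drift) + X[:, 1] * drift
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Random (iid-style) split: past and future are mixed together.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
iid_acc = LogisticRegression().fit(Xtr, ytr).score(Xte, yte)

# Chronological split: train on the past, evaluate on the future.
cut = int(0.75 * n)
chrono_acc = LogisticRegression().fit(X[:cut], y[:cut]).score(X[cut:], y[cut:])

print(f"random-split accuracy:        {iid_acc:.3f}")
print(f"chronological-split accuracy: {chrono_acc:.3f}")  # typically lower
```

The exact numbers will vary, but the chronological score is reliably the lower one, because the model is being graded on a regime it never saw during training.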
The researchers identify underspecification when many distinct models achieve nearly identical "iid held-out performance" yet can behave very differently once deployed. By "iid held-out performance," they mean performance measured on data drawn from the same pool as the training data: the training/validation/test sets are pulled from the same distribution and then randomly split.
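Concretely, "iid held-out" evaluation looks something like this (a minimal sketch; the synthetic dataset and the 60/20/20 split are arbitrary choices of mine):

```python
# One pool of data, randomly partitioned into train/validation/test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# All three sets are draws from the same distribution, so validation/test
# scores estimate performance on *that* distribution, not on whatever
# distribution the model meets after deployment.
```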
Perhaps your statement "the essence of the problem lies in these insufficiently large validation sets" isn't inaccurate so much as imprecise. The problem is less the size of the validation set and more the representativeness (or lack thereof) of the training/validation/test data: it doesn't mirror the "real world" distribution. I personally find the computer vision app example the most intuitive here. If I want to detect bubbles in water, I may take pictures in the sinks in my home. But the lighting and the sinks in my home are likely to be different from the lighting and sinks in other people's homes. This means that if I take my self-collected bubble data and split it into training/validation/test sets, my model's performance on held-out validation data will likely be an overconfident statement about how it will perform "in the real world." We can try to gather images from a random sample of sinks under varying lighting conditions, but this is exceptionally challenging and almost certainly won't be as representative as needed.
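Here's a toy simulation of that sink scenario (entirely my own construction; the "brightness" feature and its shortcut correlation with the label are made up for illustration):

```python
# The model is trained and validated on one "environment" (my lighting,
# my sinks) but deployed in another. The iid held-out score is optimistic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def make_data(n, brightness):
    """Pretend image features: overall brightness plus a bubble-related cue."""
    bubble = rng.integers(0, 2, size=n)  # 1 = bubbles present
    # In my home, brightness is constant, so this feature happens to
    # separate the classes cleanly and the model can lean on it.
    x1 = brightness + 0.5 * bubble + rng.normal(scale=0.3, size=n)
    x2 = bubble + rng.normal(scale=1.0, size=n)  # noisier but genuine cue
    return np.column_stack([x1, x2]), bubble

# All of my self-collected data comes from one lighting condition.
X_home, y_home = make_data(2000, brightness=1.0)
Xtr, Xte, ytr, yte = train_test_split(X_home, y_home, test_size=0.25, random_state=1)
clf = LogisticRegression().fit(Xtr, ytr)

# Someone else's home: different lighting, same bubbles.
X_other, y_other = make_data(2000, brightness=-1.0)

print(f"held-out accuracy (my home):     {clf.score(Xte, yte):.3f}")
print(f"deployed accuracy (other homes): {clf.score(X_other, y_other):.3f}")
```

The held-out number looks good because the validation images share the quirks of my home; under different lighting, the shortcut the model learned stops working and its accuracy drops toward chance.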