How Much Data Do I Need?

I have discussed data issues in several previous articles. People are often confused about how much data they really need. In particular, I frequently hear the refrain “Simulation requires so much data, but I don’t have enough data to feed it.” So let’s examine a situation where you have, say, 40% of the data you would like to have in order to make a sound decision, and look at the choices.

1) You can possibly defer the decision. In many cases no decision is a decision in itself because the decision will get made by the situation or by others involved. But if you truly do have the opportunity to wait and collect more data before making the decision, then you must measure the cost of waiting against the potential better decision that you might make with better data. But either way, after waiting you still have all of the following options available.

2) Use “seat of the pants” judgment and just decide based on what you know. This approach compounds the lack of data by also ignoring problem complexity and ignoring any analytic approach. (Ironically enough this approach often ignores the data you do have.) You make a totally subjective call, often heavily biased by politics. There is no doubt that some highly experienced people can make judgment calls that are fairly good. But it is also true that many judgment calls turn out to be poor and could have benefited greatly from a more analytical and objective approach.

3) Use a spreadsheet or other analytical approach that doesn’t require so much data. On the surface this sounds like a good idea and in fact, there is a set of problems for which spreadsheets are certainly the best (or at least an appropriate) choice. But for the modeling problems we typically come across, spreadsheets have two very significant limitations: they cannot deal with system complexity and they cannot adequately deal with system variability. With this approach you are simply “wishing away” the need for the missing data. You are not only making the decision without that data, but you are pretending that the missing data is not important to your decision. An oversimplified model that doesn’t consider variability or system complexity and ignores the missing data … doesn’t sound like the makings of a good decision.

4) Simulate with the data you have. No model is ever perfect. Your intent is generally to build a model to meet your project objectives to the best of your ability given the time, resources, and data available. We can probably all agree that better and more complete data results in a more accurate, complete, and robust model. But model value is not true/false (valuable or worthless); rather, it is a graduated scale of increasing value. Referring back to that variability problem, it is much better to model with estimates of variability than to just use a constant. Likewise, a model based on 40% of the data won’t provide nearly the results of one with all of the desired data, but it will still outperform the analytical techniques that are not only missing that same data, but are also missing the system complexity and variability.
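To make the point about variability concrete, here is a minimal sketch in Python of a single-server, first-come-first-served queue. Everything in it is an assumption chosen for illustration: the 10-minute mean time between arrivals, the 8-minute mean service time, the exponential distributions used as the "estimate of variability", and the helper name average_wait. It is not taken from any simulation product.

```python
import random

def average_wait(interarrival, service, n_customers=50_000, seed=1):
    """Average time in queue for a single-server FIFO queue (Lindley recursion)."""
    rng = random.Random(seed)
    wait = 0.0   # time the current customer spends waiting in queue
    total = 0.0
    for _ in range(n_customers):
        total += wait
        # Next customer's wait = current wait + current service time
        # minus the gap until the next arrival, never below zero.
        wait = max(0.0, wait + service(rng) - interarrival(rng))
    return total / n_customers

MEAN_INTERARRIVAL = 10.0  # minutes (assumed for illustration)
MEAN_SERVICE = 8.0        # minutes (assumed for illustration)

# "Constant" model: every arrival gap and service time equals its average.
constant_model = average_wait(
    lambda rng: MEAN_INTERARRIVAL,
    lambda rng: MEAN_SERVICE,
)

# Same averages, but with estimated variability (exponential is an assumption).
variable_model = average_wait(
    lambda rng: rng.expovariate(1 / MEAN_INTERARRIVAL),
    lambda rng: rng.expovariate(1 / MEAN_SERVICE),
)

print(f"Average wait using constants:   {constant_model:.1f} minutes")
print(f"Average wait using variability: {variable_model:.1f} minutes")
```

With constants, the server always finishes before the next customer arrives, so the model reports no waiting at all. With the same averages plus variability, customers wait a substantial amount of time (queueing theory puts the long-run average at 32 minutes for these assumed inputs). Even a rough estimate of variability tells a very different story than the constant does.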

And unlike the other approaches, simulation does not ignore the missing data; it can also help you identify the impact of that missing data and prioritize the opportunities to collect more. For example, some products have features that will help you assess the impact of guesses on your key outputs (KPIs). They also have features that can help assess where you should put your data collection efforts to expand sample or small data sets to most improve your model accuracy. And all simulations provide what-if capability you can use to evaluate best and worst case possibilities.
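The idea of assessing the impact of guesses can be sketched in a few lines of Python. This is a generic one-at-a-time what-if analysis, not the feature of any particular product. The queue model, the baseline values, and the low/high guess ranges are all hypothetical: each guessed input is moved to its best and worst case while the other stays at its baseline, and the inputs are then ranked by how far they swing the KPI (average wait).

```python
import random

def average_wait(mean_interarrival, mean_service, n_customers=50_000, seed=1):
    """Average time in queue for a single-server FIFO queue with exponential times."""
    rng = random.Random(seed)  # same seed each run, so scenarios share random numbers
    wait = total = 0.0
    for _ in range(n_customers):
        total += wait
        service = rng.expovariate(1 / mean_service)
        gap = rng.expovariate(1 / mean_interarrival)
        wait = max(0.0, wait + service - gap)
    return total / n_customers

# Hypothetical baseline guesses and the range we believe each could fall in.
baseline = {"mean_interarrival": 10.0, "mean_service": 7.0}
guess_ranges = {
    "mean_interarrival": (9.5, 10.5),  # fairly well known (assumed)
    "mean_service": (6.0, 8.0),        # mostly a guess (assumed)
}

# Move one input at a time to its low and high value and record the KPI swing.
swings = {}
for name, (low, high) in guess_ranges.items():
    results = [average_wait(**{**baseline, name: value}) for value in (low, high)]
    swings[name] = max(results) - min(results)

# The input with the largest swing is the best candidate for more data collection.
for name, swing in sorted(swings.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: average wait swings by {swing:.1f} minutes")
```

With these assumed ranges, the service-time guess moves the KPI far more than the arrival guess does, which suggests that data collection effort would be best spent on service times first.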

Perfection is the enemy of success. You can’t stop making decisions while you wait for perfect data. But you can use tools that are resilient enough to provide value with limited data, especially if those same tools will help you better understand the value of both the existing and the missing data.

Happy modeling!

Dave Sturrock
VP Operations – Simio LLC
