Archive for the ‘Data Issues’ Category

How Much Data Do I Need?

Saturday, June 14th, 2014

I have discussed data issues in several previous articles. People are often confused about how much data they really need. In particular, I frequently hear the refrain “Simulation requires so much data, but I don’t have enough data to feed it.” So let’s examine a situation where you have, say, 40% of the data you would like to have in order to make a sound decision, and consider the choices.

1) You can possibly defer the decision. In many cases no decision is a decision in itself because the decision will get made by the situation or by others involved. But if you truly do have the opportunity to wait and collect more data before making the decision, then you must measure the cost of waiting against the potential better decision that you might make with better data. But either way, after waiting you still have all of the following options available.

2) Use “seat of the pants” judgment and just decide based on what you know. This approach compounds the lack of data by also ignoring problem complexity and ignoring any analytic approach. (Ironically enough this approach often ignores the data you do have.) You make a totally subjective call, often heavily biased by politics. There is no doubt that some highly experienced people can make judgment calls that are fairly good. But it is also true that many judgment calls turn out to be poor and could have benefited greatly from a more analytical and objective approach.

3) Use a spreadsheet or other analytical approach that doesn’t require so much data. On the surface this sounds like a good idea, and in fact there is a set of problems for which spreadsheets are certainly the best (or at least an appropriate) choice. But for the modeling problems we typically come across, spreadsheets have two very significant limitations: they cannot deal with system complexity and they cannot adequately deal with system variability. With this approach you are simply “wishing away” the need for the missing data. You are not only making the decision without that data, but you are pretending that the missing data is not important to your decision. An oversimplified model that doesn’t consider variability or system complexity and ignores the missing data … doesn’t sound like the makings of a good decision.

4) Simulate with the data you have. No model is ever perfect. Your intent is generally to build a model that meets your project objectives to the best of your ability given the time, resources, and data available. We can probably all agree that better and more complete data results in a more accurate, complete, and robust model. But model value is not binary (valuable or worthless); rather, it is a graduated scale of increasing value. Referring back to that variability problem, it is much better to model with estimates of variability than to just use a constant. Likewise, a model based on 40% of the data won’t provide nearly the results of one with all of the desired data, but it will still outperform the analytical techniques that are not only missing that same data, but are also missing the system complexity and variability.
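The point about variability versus constants can be illustrated with a minimal sketch (a toy single-server queue of my own, not any particular product’s model). It holds the mean service time fixed at 0.8 and changes only how much the service time varies; all of the specific numbers are assumptions chosen for illustration.

```python
import random

def avg_wait(service_sampler, n=100_000, seed=42):
    """Mean customer wait in a single-server queue via the Lindley recursion.
    Arrivals are Poisson with rate 1; only the service-time model varies."""
    rng = random.Random(seed)
    wait = total = 0.0
    for _ in range(n):
        gap = rng.expovariate(1.0)                      # time since previous arrival
        wait = max(0.0, wait + service_sampler(rng) - gap)
        total += wait
    return total / n

# The same 0.8 mean service time, modeled three ways:
w_const = avg_wait(lambda rng: 0.8)                            # no variability
w_tri   = avg_wait(lambda rng: rng.triangular(0.2, 1.4, 0.8))  # estimated spread
w_expo  = avg_wait(lambda rng: rng.expovariate(1.0 / 0.8))     # high variability
```

Even though all three models share the same average service time, the constant-time version understates congestion: queueing theory tells us waiting times grow with service-time variance, which is exactly the effect a constant hides.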

And unlike the other approaches, simulation does not ignore the missing data; it can also help you identify the impact and prioritize the opportunities to collect more data. For example, some products have features that will help you assess the impact of guesses on your key outputs (KPIs). They also have features that can help assess where you should put your data collection efforts to expand sparse or small data sets to most improve your model accuracy. And all simulations provide what-if capability you can use to evaluate best- and worst-case possibilities.

Perfection is the enemy of success. You can’t stop making decisions while you wait for perfect data. But you can use tools that are resilient enough to provide value with limited data. Especially if those same tools will help you better understand the value of both the existing and the missing data.

Happy modeling!

Dave Sturrock
VP Operations – Simio LLC

Data Collection Basics Part 2

Sunday, September 28th, 2008

Last week in Data Collection Basics (Part 1) I discussed data collection, introducing the topics of identifying required data and then locating or creating that data. Once you have some data, you typically need to do some analysis on it before you can use it effectively.

Select Distribution. Typically, input data to a simulation model is specified as a distribution. If you have estimated data, you must select the most appropriate distribution (for example, a minimum time, typical time, and maximum time may be represented as a Triangular distribution). If you have actual data, then you will need to run a statistical analysis on it. Many software products (some generic and some simulation-specific) are available to help you with selecting (fitting) a distribution and its shape parameters, and even with cleaning the data to eliminate bad observations.
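As a small sketch of the Triangular case (using Python’s standard library as a stand-in for a dedicated fitting tool, with made-up estimates of 2, 4, and 8 minutes):

```python
import random
import statistics

# Estimated data: minimum, typical (mode), and maximum processing times in minutes.
LOW, MODE, HIGH = 2.0, 4.0, 8.0

rng = random.Random(7)
samples = [rng.triangular(LOW, HIGH, MODE) for _ in range(100_000)]

# Sanity check: the mean of a Triangular distribution is (min + mode + max) / 3.
theoretical_mean = (LOW + MODE + HIGH) / 3
sample_mean = statistics.fmean(samples)
```

The same three estimates would be entered directly as a Triangular distribution in most simulation tools; sampling it yourself like this is mainly useful for sanity-checking that the estimates produce a plausible spread.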

Analyze Sensitivity. Once you have some data you can build it into your model and start making trial runs. Particularly if you have relied on an estimate, you might want to run your model with values above and below the estimated values to determine system sensitivity to that parameter. If you find that the system is sensitive to an estimated value (i.e., the results change significantly with a change to the input parameter), then you can determine if it is worth a greater investment to obtain a more reliable value. This is one potential solution to the problems of bias and inaccuracy discussed in the initial article. But more than that, it is also a good way to iteratively determine how much time to spend on your input data.
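The run-above-and-below pattern can be sketched as follows. The model here is a deliberately simple stand-in (daily output of one machine with a guessed mean repair time); every number in it is an assumption for illustration.

```python
import random

def simulate_throughput(mean_repair_hours, n_days=10_000, seed=5):
    """Toy model: daily output of a machine producing 10 parts/hour over an
    8-hour day, less one random breakdown whose repair time was estimated."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_days):
        downtime = min(8.0, rng.expovariate(1.0 / mean_repair_hours))
        total += (8.0 - downtime) * 10
    return total / n_days

estimate = 1.5  # estimated mean repair time (hours) -- a guess from the floor
# Run the model 20% below, at, and 20% above the estimate:
results = {factor: simulate_throughput(estimate * factor)
           for factor in (0.8, 1.0, 1.2)}
```

If the three outputs barely differ, the rough estimate is good enough; if they swing widely, that parameter is worth a real data-collection effort.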

Adjust Detail. Sometimes the quality of the available data can help you determine the appropriate level of detail for a model. If the data you intend to use is not very good, then there is little point to building a highly detailed model. This is not to imply that such a model is of no value; after all, every model is just a representation or estimate of reality – no model will be perfect. But it is important to represent to your stakeholders the relative accuracy of the model and its underlying data.

This was a quick overview of some steps to data collection. Whole textbook chapters have been written about each of these, so be sure to look for greater detail when you are ready.

Dave Sturrock
VP Products – Simio LLC

Data Collection Basics

Sunday, September 21st, 2008

Even though the people responsible for building models are often the “data collection people”, I know very few associates who think this is a particularly enjoyable part of their job. But data collection is a necessary part of most simulation projects. An early task in each simulation project should be to identify what data will be needed and how that data will be obtained.

Identify Data. There are many different types of data that you will potentially need. Like other aspects of simulation, identifying required data is best done iteratively. Start by looking at the major areas of your model: arrival sections, processing sections, storage areas, departure areas, internal movement, and similar aspects. For each area, then consider the key parameters necessary to describe it. For example, in an arrival area: What is arriving? Are there many different types of entities? Do they each have descriptive attributes that are important? Do you expect the arrivals to follow some type of a time-based pattern? Considering questions such as these will also help you define the model and modeling approach and, iteratively, yield more detail on the exact data required.
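That last question about a time-based arrival pattern has real modeling consequences: if arrivals peak at certain hours, a flat arrival rate misses it. One standard way to generate such arrivals is thinning; the sketch below is a generic illustration (the rate pattern, peak window, and numbers are all invented for the example).

```python
import random

def nonstationary_arrivals(rate_fn, rate_max, horizon, seed=1):
    """Generate arrival times from a time-varying Poisson process by thinning:
    propose candidates at the peak rate, then accept each with probability
    rate_fn(t) / rate_max. Requires rate_fn(t) <= rate_max everywhere."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return times
        if rng.random() < rate_fn(t) / rate_max:
            times.append(t)

# Hypothetical pattern: 10 arrivals/hour, rising to 18/hour mid-shift (hours 3-5).
rate = lambda t: 18.0 if 3.0 <= t <= 5.0 else 10.0
arrivals = nonstationary_arrivals(rate, rate_max=18.0, horizon=8.0)
```

The data-collection implication is that you need arrival counts by time of day (or shift), not just an overall average, before you can estimate a rate function like this.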

Locate Data. With the current level of automation and electronic tracking, data has become much more readily available. If it’s an existing system, there may already be data that is routinely collected. If it is a new system, the vendor may have access to data collected on similar systems. In either case, the existence of data does not necessarily make your job easy. For example, perhaps you are interested in a processing time on an operation, and that processing time is automatically captured. But what may not be obvious is exactly what that number represents. Does it (sometimes) include time when the process was failed (perhaps short failures are embedded but long failures are not)? Does it (sometimes) include time when an operator went on break and forgot to properly log out? Detecting and cleaning such situations can be a tedious and frustrating part of using existing data.

Create Data. If the data you need does not exist or cannot be appropriately cleaned, you must often create it. On an existing system, the most accurate method is to electronically capture the data or have manual studies done to determine it. Either of these can be very expensive. An alternate approach is to get estimates from people who know – people running or managing the operation. Although fast and inexpensive, this may introduce bias and inaccuracy. Likewise on a system that does not yet exist, you may need to rely on specifications provided by a vendor, again possibly introducing bias and inaccuracy. More on dealing with this situation later.

This was a quick overview of some initial steps to consider in data collection. Next week I will discuss some additional steps on what to do next with that data. Until then, Happy Modeling!

Dave Sturrock
VP Products – Simio LLC