How much data is too much?

When we design information systems, particularly when working with stakeholders who have had difficulty accessing data in the past, it can be very tempting to collect every piece of data we can think of, just because we now have a tool that can capture and store it. But we have to resist that temptation, or we'll end up with systems that are too bloated to maintain and with unwieldy amounts of data that are impossible to analyze meaningfully.

We've all seen forms that ask for so much information it's exhausting to even think about filling them out. What incentive do we have to complete such forms, or to not rush through them as quickly as possible? The data entry person who fires up a bloated information system has the exact same reaction. When faced with a seemingly endless number of fields to complete, she might be tempted to skip some, or to enter fake data if all fields are required. She typically doesn't know which fields are critical to the analyst. So bad data goes in, and bad analysis comes out. The system runs the risk of being abandoned, either by the people trying to maintain it, by the people trying to get information out of it, or by everyone.

It's better, when first designing the system, to ask "why" about each data field that is proposed. Why is it necessary to collect this? What report will require that data item to be complete? How will you use this piece of data to make better decisions? That's why we typically ask stakeholders to come up with their most critical policy and management questions that they have been unable to answer because they couldn't access the pertinent data. This process narrows the types of data that the system needs to collect to only the most critical pieces of information and helps us avoid "data smog" that can actually keep analysts from making good data-driven decisions.
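One way to make the "why" test concrete is to write down, for each proposed field, which critical question it helps answer, and to defer any field that answers none of them. The sketch below is only an illustration of that idea: the field names, the questions, and the `select_fields` helper are all made up for this example and do not come from any real system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProposedField:
    """A data field someone has asked for (hypothetical structure)."""
    name: str
    # The stakeholder questions this field would help answer; may be empty.
    answers: set[str] = field(default_factory=set)


def select_fields(proposed, critical_questions):
    """Split proposed fields into those tied to at least one critical
    question (collect now) and those tied to none (defer)."""
    keep, defer = [], []
    for f in proposed:
        if f.answers & critical_questions:
            keep.append(f.name)
        else:
            defer.append(f.name)
    return keep, defer


# Hypothetical questions the stakeholders said they could not answer before.
critical_questions = {
    "Where are the staffing gaps?",
    "How many licenses expire this year?",
}

# Hypothetical wish list from a requirements brainstorm.
proposed = [
    ProposedField("facility", {"Where are the staffing gaps?"}),
    ProposedField("license_expiry", {"How many licenses expire this year?"}),
    ProposedField("favorite_color"),
    ProposedField("shoe_size", {"What uniforms should we order?"}),
]

keep, defer = select_fields(proposed, critical_questions)
print("collect now:", keep)
print("defer:", defer)
```

A field that lands in the "defer" list isn't necessarily useless. It just hasn't yet been tied to a question anyone has said they need answered, so it is a candidate for a later version rather than the first one.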

Even with this process, it's difficult to control the kid-in-a-candy-store mentality. Sitting down with stakeholders and brainstorming requirements for a new module often results in calls for everything but the kitchen sink. Just because we can collect a lot of data doesn't mean we should.

I think it's better to take a minimalist approach, especially when first introducing an information system to an organization that hasn't used one before. Real-world use of the system will reveal which critical pieces of data are still missing, and those fields can be added in a later version, or by an organization with a customized need. It is better to risk the system being too small than too large.

What do you think? How much data is too much?