By Dr. John R. Talburt

In academia we say quality and quantity are equally important to enterprise data. In business we would say, “It depends.” However, the real answer is that this is a false choice, like the question of “what happened to the missing dollar?” in the bellboy story. According to the fundamental principles of analyzing information quality, “quantity” is actually one of the dimensions of data quality.
In the Wang-Strong framework of data quality dimensions, “amount of data” is one of the dimensions. This means the quantity of data is one of the characteristics of quality. So from this perspective, it is not a question of quality or quantity; it is really a question of how important quantity is as a measure when analyzing information quality.
Dimensions of Information Quality
All other things being equal, increasing the amount of data increases the overall quality of the data. If the other dimensions of data quality such as accuracy, value added, completeness, and consistency are good, then having more of it is also good. However, as we know in real life, all other things are usually not equal.
Large amounts of data (Big Data) are often associated with poor quality in these other dimensions. Just look back to the roots of data quality in data warehousing. Putting together data from different systems to create the large warehouse of data is what exposed the inconsistency, lack of standards, and generally poor condition of data across the enterprise. Building these data warehouses is what started us on this journey to data quality in the first place.
Another reason big data is often associated with poor data quality is unstructured social media data. Social media offers many challenges in other dimensions such as understandability, believability, and relevance. Here again, the ultimate objective is often to integrate the social media data with structured internal data. That puts us back into the same inconsistent representation scenario as the early days of data warehousing.
Finally, even in those cases where large quantities of data are from the same source or where different sources have been harmonized, challenges still exist. Perhaps the most overlooked is Bayes Theorem of conditional probability, which drives much of the analysis searching for those hidden, golden nuggets of insight everyone seems to believe exist in Big Data.
Bayes Theorem describes how the probability of something changes as we gather new evidence about it. If P(A) represents the probability of A, and D represents new data or new evidence, then Bayes Theorem tells us how to calculate a factor F so that F times P(A) gives us P(A|D), the new probability of A given that we know D. Whether A is a credit card transaction being fraudulent or a stock price going up, we are always searching for that new data or new test that will give us the answer.
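To make the factor F concrete, here is a minimal sketch in Python (the bayes_update function and its parameter names are illustrative, not part of the original discussion) that computes F = P(D|A) / P(D) and applies it to the prior:

def bayes_update(prior_a, p_d_given_a, p_d_given_not_a):
    """Return (F, P(A|D)) given the prior P(A) and new evidence D."""
    # Total probability of observing the evidence D, whether or not A holds
    p_d = p_d_given_a * prior_a + p_d_given_not_a * (1.0 - prior_a)
    factor = p_d_given_a / p_d       # F = P(D|A) / P(D)
    posterior = factor * prior_a     # P(A|D) = F * P(A)
    return factor, posterior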
Practical Applications of Conditional Probability
Regardless of the calculation of F, every data scientist and business analyst should be aware of a consequence of Bayes Theorem: the smaller the probability of A, the more precise the test must be to reliably detect A. The classic example of this is called the “deadly disease.” Suppose that A is a rare but fatal disease occurring in only one out of every ten thousand persons. This means that P(A) = 0.01%. Now suppose a researcher has found a blood test that is 99% accurate in the sense that if a patient has the disease the test will signal positive 99% of the time, thus giving a false negative signal only 1% of the time. Conversely, if a patient doesn’t have the disease, the test will signal a false positive only 1% of the time.
Now suppose you take the blood test and it comes up positive. What is the probability you actually have this fatal disease?
Well, don’t panic; the answer is that a positive result means you have slightly less than a 1% chance of actually having the disease. How can that be? The test is 99% accurate, so shouldn’t it be almost certain that you have the disease? The answer is no. The takeaway is that we can’t reliably detect something that only occurs 0.01% of the time with a tool whose error rate is 1%.
The error rate of the test must be an order of magnitude smaller than the probability of the effect being detected. In this example, if the blood test were 99.999% accurate, then testing positive would mean you have a 91% chance of having the disease. Even at 99.99% test accuracy, the probability of having the disease given a positive test is only 50%.
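For readers who want to check those figures, the following short Python sketch (again illustrative, reusing the hypothetical bayes_update helper from the earlier snippet) reproduces the roughly 1%, 50%, and 91% posteriors for the three test accuracies:

prior = 0.0001  # one in ten thousand, so P(A) = 0.01%

for accuracy in (0.99, 0.9999, 0.99999):
    error_rate = 1.0 - accuracy  # false-positive (and false-negative) rate
    _, posterior = bayes_update(prior, accuracy, error_rate)
    print(f"{accuracy:.3%} accurate test -> "
          f"{posterior:.2%} chance of disease after a positive result")

# Prints approximately 0.98%, 50.00%, and 90.91%, matching the figures above.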
It turns out that in both academic and business settings, quantity can be important when analyzing information quality, but only in addition to the other dimensions of data quality. Accuracy, completeness, consistency, and more are all necessary to solve problems using data.
For more information on how to best leverage your enterprise data, contact Black Oak Analytics at 501-379-8008.