Data analysis demands common sense
January 10, 2017
How do you translate the ever-increasing amount of available data into sensible decisions? That demands good data analysis, which must certainly not be taken for granted.
The detective Sherlock Holmes hit the nail on the head: ‘It’s human nature to see only what we expect to see.’ This quote is one of the most important warnings for anyone tackling Big Data. We must prevent data analysis from causing accidents through poor interpretation. So data analysis demands switching off preconceptions and expectations and switching on common sense.
If you simply collect enough data and carry out enough different analyses, a striking correlation will almost always surface by chance alone. Anyone who indiscriminately misuses a computer's processing power will arrive at countless irrelevant or even misleading conclusions. We need to be on our guard here, certainly at a time when Big Data is portrayed as a machine into which you can pour large amounts of data at will and, once the machine has done its number crunching, obtain ready-made solutions. Reality is far more complex. Being able to model large quantities of data only makes sense if we don't forget the data's context, because without context, data loses its value. Realizing successful applications in the field of Big Data is often a process fraught with setbacks: a process of blood, sweat and tears.
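The point about striking correlations surfacing on their own can be illustrated with a small simulation. This is a minimal sketch, not something taken from the article: the function names, the number of variables and the sample size are all assumptions chosen for illustration. It fills a table with pure random noise and then searches every pair of columns for the strongest correlation.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables=200, n_observations=20, seed=0):
    """Search unrelated random columns for the most correlated pair.

    All parameter values are illustrative assumptions. Every column is
    independent noise, so any correlation found is a coincidence.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    # Score every pair of columns exactly once.
    scored = (
        (pearson(columns[i], columns[j]), (i, j))
        for i, j in combinations(range(n_variables), 2)
    )
    # Keep the pair with the largest absolute correlation.
    r, pair = max(scored, key=lambda item: abs(item[0]))
    return pair, r


if __name__ == "__main__":
    pair, r = strongest_chance_correlation()
    print(f"Strongest 'relationship' in pure noise: columns {pair}, r = {r:.2f}")
```

With 200 columns there are 19,900 pairs to compare, so a strong-looking correlation is practically guaranteed even though no real relationship exists in the data. The more analyses you run, the more such coincidences you will find.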
Avoid wrong conclusions
A good data scientist is acutely aware of the risks involved and has been trained to be very critical. The central issue here is Simpson's paradox, in which a pattern in the combined data weakens or reverses once the data is split by a relevant factor. An example makes this paradox clear to non-statisticians. Statistics show that seamen who fell overboard without a life jacket were rescued more often than seamen who wore one. This contradicts intuition, but a more detailed analysis explains it. Seamen evidently chose to wear a life jacket primarily in poor weather, conditions in which rescue was difficult or sometimes impossible. Of course, the example does not lead to the conclusion that you should take off your life jacket to increase your chances of survival.
The example shows how important the data context is for drawing responsible conclusions. It is essential to stay alert: data analysis must not be done frivolously, and patterns must not simply be translated into conclusions. This typifies the world in which a data scientist operates: if you combine data in a convenient way, improbable results will appear. A wrong conclusion can prove to be life-threatening. It wouldn't be the first time that a decision not to wear life jackets was taken on the basis of a data analysis, metaphorically speaking.
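The life jacket example can be made concrete with a few numbers. The counts below are invented purely for illustration; the article gives no figures, and the function name and table layout are likewise assumptions. The sketch compares rescue rates with and without a life jacket, first over all incidents and then separately per weather condition.

```python
# Hypothetical counts, invented for illustration only:
# (weather, wore_jacket) -> (fallen overboard, rescued)
INCIDENTS = {
    ("good", False): (900, 810),
    ("good", True): (100, 95),
    ("bad", False): (100, 20),
    ("bad", True): (900, 360),
}


def rescue_rate(wore_jacket, weather=None):
    """Share of seamen rescued, optionally restricted to one weather type."""
    rows = [
        counts
        for (w, jacket), counts in INCIDENTS.items()
        if jacket == wore_jacket and (weather is None or w == weather)
    ]
    overboard = sum(fallen for fallen, _ in rows)
    rescued = sum(saved for _, saved in rows)
    return rescued / overboard


if __name__ == "__main__":
    # Aggregated over all weather: no jacket appears to be safer.
    print("all weather :", rescue_rate(False), "vs", rescue_rate(True))
    # Split by weather: the jacket is safer in both conditions.
    for weather in ("good", "bad"):
        print(
            f"{weather:<12}:",
            rescue_rate(False, weather),
            "vs",
            rescue_rate(True, weather),
        )
```

In these made-up numbers, seamen without a jacket are rescued more often overall, yet within good weather and within bad weather the jacket wearers do better. The reversal arises only because most jackets are worn in bad weather, which is exactly the context that the aggregated figure hides.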
Sought: good data scientists
Now that Big Data is becoming more intertwined with society, we must avoid drawing unfortunate conclusions. So we should not leave data analysis entirely to a powerful computer that can do some very clever things. Number crunching is, after all, only the (relatively simple) start of an analysis, and its success depends on a data scientist with the competence of a Sherlock Holmes. The difficulty lies mainly in comprehending and interpreting the results, possibly yielding interesting insights for the client.
So it is about training data scientists who are able to deal properly with Simpson's paradox, making them sharp and critical when interpreting data. This will enable them to seek real causal relationships without tunnel vision, relationships that others can also follow.
Those good data scientists are no luxury. Certainly in an environment with large amounts of data (big and messy), relationships between variables are often not as easy to understand as in the relatively simple case of the life jackets. That underlines why data scientists should be more than clever boys and girls who are good at statistics. They must be able to cope with setbacks when seeking and comprehending significant results. And they must be better able than anyone to visualize the context from which the data has been taken. All this means that groundbreaking Big Data projects will never be delivered on a routine basis. Because even if the economic research models are sound and the conclusions are statistically valid, that is still no reason to switch off your common sense.