Is all Big Data Analyze-able?
By Tricia Aanderud, Director of Data Visualization Practice, Zencos Consulting
Many business users are excited about the potential of Big Data to evolve their organizations. However, you cannot expect a dataset to unlock any mysteries until you know the question that needs to be answered. The following rules may seem obvious, but there are caveats to each. Note that these rules apply to your question no matter what size your data is.
Rule 1: Big Data must be Contextual
By its very nature, big data is just that – big, and it continues to grow. What may not be immediately obvious is that only a small portion, if any, of the data may be useful or meaningful. Suppose you were trying to determine the influence of daily events on the stock market in 1945. How much data from a newspaper would be useful? Maybe only the content on the front page? Maybe just the headlines?
More than likely, an analyst would toss aside most of the ad content, color stories, and even the classified section – but not before reviewing it.
Users often get it backwards and try to form a question after reviewing the data. They then find that much of the big data is useless or just leads to more questions. They fail to consider whether the data meets their needs and whether it actually answers the question.
Rule 2: Any Data must be Quantifiable
Even if the data is huge and unstructured, it still needs organization before you can analyze it. Big data is not always numeric columns – sometimes it is unstructured content such as video or PowerPoint presentations. When you look at the dataset, you have to determine how to categorize and quantify it. Depending on your question, the data may not be able to provide the analytics you need. Thus, a promising dataset turns into a waste of time.
Obviously you want to know and understand the origin of the data before doing any analysis – details such as where, when, how, and why it was collected are important to know. One way to understand a dataset is through its metadata, which describes each column. Imagine the confusion of receiving a dataset that contained a list of customers and addresses, only to find out later that the data was collected 50 years ago for a different reason.
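One way to make the where, when, how, and why explicit is to record them alongside the dataset and check them before any analysis begins. The following is only a minimal sketch in Python: the `DatasetMetadata` fields, the `provenance_warnings` function, and the five-year age threshold are all hypothetical choices made for illustration, not an established standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical provenance record kept alongside a dataset."""
    source: str         # where the data came from
    collected_on: date  # when it was collected
    method: str         # how it was collected
    purpose: str        # why it was collected


def provenance_warnings(meta: DatasetMetadata, today: date,
                        max_age_years: int = 5) -> List[str]:
    """Return human-readable warnings about a dataset's origin.

    The max_age_years default is an arbitrary assumption; choose a
    threshold that suits the question being asked.
    """
    warnings = []
    age_years = (today - meta.collected_on).days / 365.25
    if age_years > max_age_years:
        warnings.append(f"data is about {age_years:.0f} years old")
    # Flag any provenance field that was left blank.
    for field_name in ("source", "method", "purpose"):
        if not getattr(meta, field_name).strip():
            warnings.append(f"missing {field_name}")
    return warnings


# Made-up example resembling the old customer list described above.
customer_list = DatasetMetadata(
    source="customer mailing list",
    collected_on=date(1970, 1, 1),
    method="paper forms",
    purpose="",
)
issues = provenance_warnings(customer_list, today=date(2020, 1, 1))
for issue in issues:
    print("WARNING:", issue)
```

A check like this does not prove the data is usable; it only surfaces the questions about origin that should be answered before the analysis starts.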
Rule 3: Data must be Accurate
There is no faster way to lose an audience than with inaccurate data. You have to build trust with your message. After you determine the data can answer your question, this may be the most important rule. You must establish a data governance methodology to confirm the data is accurate, reasonable, consistent, and valid. Otherwise, all of your analysis will do nothing more than confirm a misleading conclusion.
There was a funny example of inaccurate data in the British Medical Journal Report that showed many hospitals were admitting men in the obstetric ward. The report stated “Even more striking; between 15,000 and 20,000 men have been admitted to obstetric wards each year since 2003…”
While it is easy to understand that the data entry person was not paying attention or that the system makes it difficult to be accurate, it does call into question what other errors are present in the data. This particular error is easy to identify and correct. But what if you want to forecast which sex is more likely to be admitted for a certain condition? Your whole dataset just became invalid because you do not know how often the gender is inaccurate.
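A cross-field consistency check is one way an error like this could be caught before it reaches an analysis. The sketch below is illustrative only: the record layout (`patient_id`, `sex`, `ward`), the `FEMALE_ONLY_WARDS` rule, and the sample records are all made up and are not taken from the report.

```python
from typing import Dict, List

# Hypothetical rule: wards where only female patients are expected.
FEMALE_ONLY_WARDS = {"obstetrics"}


def find_inconsistent_admissions(
    records: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return records whose sex and ward fields contradict each other.

    A record with a missing or blank sex value in a female-only ward
    is flagged as well, since it cannot be confirmed as consistent.
    """
    flagged = []
    for record in records:
        ward = record.get("ward", "").strip().lower()
        sex = record.get("sex", "").strip().lower()
        if ward in FEMALE_ONLY_WARDS and sex != "female":
            flagged.append(record)
    return flagged


# Made-up sample records for illustration only.
admissions = [
    {"patient_id": "A1", "sex": "female", "ward": "obstetrics"},
    {"patient_id": "A2", "sex": "male", "ward": "obstetrics"},
    {"patient_id": "A3", "sex": "male", "ward": "cardiology"},
]

flagged = find_inconsistent_admissions(admissions)
error_rate = len(flagged) / len(admissions)
print(f"{len(flagged)} of {len(admissions)} records flagged ({error_rate:.0%})")
```

Note that a check like this finds only the errors that contradict a known rule. A wrongly coded sex in a ward that admits everyone would pass unnoticed, which is why the true error rate stays unknown.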
While it is tempting to jump on the big data bandwagon, make sure you understand the question you want answered or the problem you need to resolve before investing a lot of time. Breaking any of the rules above will lead you down the wrong analysis path.