Having a clean dataset is critical for big data, as poor data quality leads to poor-quality insights.
Data cleaning is the process of ensuring that your data is correct, consistent and usable by identifying errors or corruption in the data and correcting, deleting, or manually processing them as needed. It also involves looking into systems to prevent the same errors from happening again.
“Without a systematic way to start and keep data clean, bad data will happen.” — Donato Diorio
Maintaining data quality is a difficult but necessary task. CrowdFlower, provider of a “data enrichment” platform for data scientists, surveyed about 80 data scientists and found that they spend:
60% of their time organizing and cleaning data
19% of their time collecting datasets
9% of their time mining the data for patterns
3% of their time training datasets
4% of their time refining algorithms
5% of their time on other tasks
Data is often combined from multiple places, with different sources or parties providing the input. As a result, data can be messy, and data cleaning is required to ensure the data has integrity before analysis is performed.
Additionally, organisations should find ways to create clean data at the point of collection, such as setting high standards and procedures during input, so that the data cleaning effort is reduced.
Criteria and activities to have a clean dataset
Six criteria used to determine if there is a clean dataset are as follows:
1) Validity – does the data meet the rules? Example: typically, the salary of a non-manager should not be higher than a manager's
2) Accuracy – is the data correct? Example: the postcode of an area should be correct
3) Completeness – is any data missing? Example: all fields should be filled, otherwise the analysis can be skewed
4) Consistency – does the data match up? Example: data should be consistent regardless of source
5) Uniformity – are the same units used throughout? This ensures like-for-like comparison
6) Timeliness – is the data up to date? If big data is collected automatically, this is less of a concern. Example: updated names of leaders at organisations/companies
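Checks like these can be automated. The sketch below spot-checks the first two criteria (validity and completeness) on a handful of records; the field names, values, and the salary rule itself are made up for illustration, not from any real dataset.

```python
# Hypothetical employee records; "role", "salary" and "postcode" are
# illustrative field names only.
records = [
    {"role": "manager", "salary": 90000, "postcode": "2000"},
    {"role": "staff",   "salary": 95000, "postcode": "3000"},
    {"role": "staff",   "salary": 48000, "postcode": None},
]

# Validity: non-managers should not out-earn the highest-paid manager
manager_max = max(r["salary"] for r in records if r["role"] == "manager")
invalid = [r for r in records
           if r["role"] != "manager" and r["salary"] > manager_max]

# Completeness: records with any missing field
incomplete = [r for r in records if any(v is None for v in r.values())]

print(len(invalid), len(incomplete))
```

Flagged records can then be routed for correction or manual review rather than silently dropped.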
Having a clean dataset involves the following activities:
Getting rid of extra spaces
Selecting and treating all blank cells (or missing values)
Converting numbers stored as text into numbers
Removing duplicates
Highlighting errors
Changing text to lower/upper/proper case
Performing spell check
Filtering outliers
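Several of the activities above can be combined into one cleaning pass. This is a minimal pure-Python sketch covering trimming extra spaces, converting numbers stored as text, fixing case, and removing duplicates; the records and field names are invented for illustration.

```python
# Hypothetical raw records with messy whitespace, inconsistent case,
# numbers stored as text, and a duplicate.
raw = [
    {"name": "  alice ", "amount": "100"},
    {"name": "Bob",      "amount": "250"},
    {"name": "bob ",     "amount": "250"},
]

clean, seen = [], set()
for rec in raw:
    name = rec["name"].strip().title()      # trim extra spaces, proper case
    amount = float(rec["amount"].strip())   # numbers stored as text -> numbers
    key = (name, amount)
    if key not in seen:                     # remove duplicates after normalizing
        seen.add(key)
        clean.append({"name": name, "amount": amount})

print(clean)
```

Note that deduplication runs after normalization: "Bob" and "bob " only become recognizable duplicates once spacing and case are standardized.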
Data sparseness and formatting inconsistencies are usually the biggest challenges in getting clean data.
When it comes to outliers, they are innocent until proven guilty. You should never remove an outlier just because it is a "big number" — that big number could be very informative for the model.
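In that spirit, a common approach is to flag suspect values for review rather than delete them. The sketch below uses the interquartile-range (IQR) rule; the 1.5×IQR multiplier is a widely used convention, not something prescribed by this post, and the sample salaries are made up.

```python
# Flag (not delete) outliers using the interquartile-range rule.
def iqr_flags(values, k=1.5):
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # linear interpolation between closest ranks
        pos = q * (n - 1)
        lo = int(pos)
        frac = pos - lo
        return s[lo] if lo + 1 >= n else s[lo] + frac * (s[lo + 1] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

salaries = [48, 52, 50, 49, 51, 250]  # one suspiciously large value
print(iqr_flags(salaries))
```

Whether a flagged value is an error or a genuinely informative extreme is then a judgment call for the analyst.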
It is also important not to ignore missing values. The fact that a value was missing is informative in itself. Removing every missing value, or simply replacing every missing value with an imputed one, can introduce bias into the model. Note that imputation does not necessarily produce better results. Therefore, extra attention should be paid to how missing values are handled.
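One way to keep the "was missing" information while still producing a usable column is to pair an imputed value with an indicator flag. This is a minimal sketch using median imputation; the sample ages are invented, and median is only one of several possible imputation strategies.

```python
# Hypothetical ages with missing entries (None). We keep a "was missing"
# indicator alongside a median-imputed value, so the fact of missingness
# is preserved as information rather than erased.
ages = [34, None, 29, 41, None, 37]

present = sorted(a for a in ages if a is not None)
mid = len(present) // 2
median = (present[mid] if len(present) % 2
          else (present[mid - 1] + present[mid]) / 2)

was_missing = [a is None for a in ages]
imputed = [median if a is None else a for a in ages]

print(imputed, was_missing)
```

A model can then learn from both columns: the imputed value and whether it was originally absent.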
Why is a clean dataset important?
1) A clean dataset leads to the right conclusions and insights, and hence to better decisions. In the long run, it allows companies to trust their data, enabling quicker and more informed decisions.
It comes down to the principle of 'garbage in, garbage out': incorrect or poor-quality input will produce faulty output.
2) It increases the productivity and efficiency of employees. It ensures that employees make the best use of their work hours, because they will not be performing the wrong analysis or making the wrong decisions. It also provides accurate real-time visibility and allows employees to take corrective action in real time.
3) Organizations that keep business-critical information at a high quality gain a significant competitive advantage in their markets, because they are able to adjust their operations to changing circumstances quickly.
We earlier wrote about how big data analytics improves efficiency. Data can be used for various purposes, such as improving processes or spotting opportunities for expansion and growth. Therefore, it is important to have a clean dataset.
Data is an integral part of analytics. Managing and ensuring that the data is clean plays an important role in ensuring the quality of analytics. Clean data is simply data with a high level of accuracy “that you can make more sense out of". Better data beats fancier algorithms.
However, companies that are busy growing their operations often struggle to keep their databases in shape. One smart move is to outsource database cleaning and management, which lets companies take advantage of extra resources in a low-cost, low-risk way.
What is the greatest challenge you have faced from not having a clean dataset? Share it with us by leaving a comment. If you require more information on data analytics, or want to outsource it, contact us. Subscribe to our newsletter for regular updates.
Did you find this blog post helpful? Share the post! Have feedback or other ideas? We'd love to hear from you.
References
Alvira Swalin, "How to Handle Missing Data", Towards Data Science, 31 January 2018. https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
DeZyre, "Why Data Preparation Is an Important Part of Data Science", 7 April 2016. https://www.dezyre.com/article/why-data-preparation-is-an-important-part-of-data-science/242
Geotab, "6 Steps for Data Cleaning and Why It Matters", 24 May 2018. https://www.geotab.com/blog/data-cleaning/