Around 2.5 trillion bytes of data are generated worldwide every day. Yet even though data have become widely accessible, their quality remains problematic. Incorrect and unreliable data are a major burden for everyone who works with data. According to Kissmetrics, poor-quality data may cost companies as much as 20% of their profit. Incorrect data are also a significant obstacle to business development.
Why are my data poor quality? 7 main reasons
Data errors can have many causes, and usually there is more than one. Let's take a look at the most common:
1. Outdated, obsolete data
Data, especially those collected in the field, are obtained from different places and at different frequencies, so there is no guarantee that all of them relate to the same time period. Moreover, in large projects, data collection can take so long that by the end of the process the data are already outdated.
This is common for manual data collection processes that cannot keep up with constant change, for example the manual collection of field data about water supply, energy, or telecom networks.
2. Different data models
Manual collection is an obvious cause of differences in data models, even when the data are gathered by the same person. It's remarkable how many ways there are to write a street name. Take the 4th of July Avenue: it can also be written as Fourth of July Avenue, 4 July Ave, 4-th of July Ave, and so on. A small change in notation that easily goes unnoticed makes the data inconsistent, and data processing systems treat the variants as completely different information.
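A common defense is to normalize names to one canonical form before comparing records. The sketch below is purely illustrative (the ordinal and suffix tables are invented and far from exhaustive), but it shows the idea for the street-name example above:

```python
import re

# Hypothetical normalization rules: map common street-name variants to
# one canonical form so different spellings compare as equal.
ORDINALS = {"first": "1st", "second": "2nd", "third": "3rd", "fourth": "4th"}
SUFFIXES = {"ave": "avenue", "av": "avenue", "st": "street", "str": "street"}

def normalize_street(name: str) -> str:
    s = name.lower()
    # join split ordinals like "4-th" or "4 th" into "4th"
    s = re.sub(r"(\d+)[\s\-]*(st|nd|rd|th)\b", r"\1\2", s)
    # drop remaining punctuation
    s = re.sub(r"[^\w\s]", " ", s)
    tokens = [SUFFIXES.get(ORDINALS.get(t, t), ORDINALS.get(t, t))
              for t in s.split()]
    return " ".join(tokens)

# "4th of July Avenue", "Fourth of July Ave" and "4-th of July Ave"
# all normalize to the same string: "4th of july avenue"
```

Real systems use full address-standardization libraries with locale-specific rules, but even a small rule set like this catches a surprising share of notation variants.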
Humans aren't the only ones to blame for the lack of data consistency. Differences in data models often stem from updates to IT systems, especially major updates that move the software up several versions. The updated system may introduce new attributes, so the migrated data become incomplete.
The problem of different data models is often related to company mergers. Mergers require the integration of databases used by the companies that combine into a single entity.
Reasons for differences in data models:
- changes in data notation,
- updates to IT systems,
- merging the systems of different companies.
3. Lack of benchmark
It’s crucial to define the right benchmark that allows you to verify data reliability, particularly when using open data.
Besides their unquestionable advantage of accessibility, open data carry a significant risk of errors. This is because they are updated by a wide audience. An example of an open data platform is OpenStreetMap.
To avoid falsified analysis results, it’s worth comparing several datasets from different sources. This allows you to capture common parts as well as parts with the most significant differences. For example, you can compare data from OpenStreetMap and Topographic Object Database.
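Cross-checking sources can be as simple as set operations on a comparable key. The records below are invented; in practice they would come from e.g. OpenStreetMap and a national topographic database:

```python
# Hypothetical street-name sets extracted from two independent sources.
osm = {"Main Street", "Oak Avenue", "4th of July Avenue"}
topo = {"Main Street", "Oak Avenue", "Elm Street"}

common = osm & topo    # present in both sources: likely reliable
suspect = osm ^ topo   # present in only one source: needs verification

print(sorted(common))   # ['Main Street', 'Oak Avenue']
print(sorted(suspect))  # ['4th of July Avenue', 'Elm Street']
```

The common part becomes your working benchmark, while the symmetric difference is a ready-made list of records to verify first.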
4. Too much trust in external data sources
A common mistake is having too much trust in data from external sources.
Usually such data are verified and their quality is satisfactory, but mailing address databases show the limits of that assumption. They contain many addresses, yet usually only a small portion is truly useful. You often don't know how the data were collected or whether they are up to date, complete, and consistent, and there is no guarantee that they will support your operations.
During data analysis, you should also consider the context and time when data were produced as well as who collected them and for what purpose. Bias may impact data even at the collection stage and this can influence the analysis results.
5. Many data sources in a company
Data sources may be dispersed even within a single organization. Different data types may be collected in different ways so there can be a lack of a consistent data model or format.
Problems arise when you try to integrate such data. A particular record (e.g. one describing a client) can appear repeatedly in the central system that gathers data from several sources. You therefore need to decide which database or system is the master, with the others only supplementing the main data source, and then delete the redundant records.
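One way to express such a precedence rule is to merge records source by source, letting the master overwrite the rest. The sources and field names below are assumptions for illustration:

```python
# Hypothetical client record held in two systems; the first entry is
# treated as the master source (e.g. the CRM), the second as supplementary.
sources_by_priority = [
    {"id": 42, "name": "Acme Ltd", "phone": None},            # master
    {"id": 42, "name": "ACME Limited", "phone": "555-0101"},  # secondary
]

merged = {}
for record in reversed(sources_by_priority):  # apply lowest priority first
    for key, value in record.items():
        if value is not None:
            merged[key] = value  # higher-priority values overwrite later

# merged keeps the master's name but fills the phone from the secondary
# source: {'id': 42, 'name': 'Acme Ltd', 'phone': '555-0101'}
```

The design choice here is that a master system wins on every attribute it actually has, while gaps are filled from lower-priority sources instead of being left empty.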
6. Duplicated records
Often, after merging several data sources, the final dataset contains duplicated records. That’s not a problem if they are identical, as all you need to do is delete repeated data. It gets tricky if records differ in just a single attribute, a small detail such as a digit in a phone number. Then you don’t know which one is correct and which one should be deleted. In this situation, you need to perform additional data verification.
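A simple way to separate exact duplicates from these tricky near-duplicates is to count the fields on which two records disagree. The records and field names below are invented for illustration:

```python
# Flag near-duplicates for manual review instead of deleting them blindly.
def diff_fields(a: dict, b: dict) -> list:
    """Return the keys on which two records disagree."""
    return [k for k in a if a.get(k) != b.get(k)]

r1 = {"name": "Jan Kowalski", "city": "Warsaw", "phone": "600100200"}
r2 = {"name": "Jan Kowalski", "city": "Warsaw", "phone": "600100201"}

differences = diff_fields(r1, r2)
if not differences:
    pass  # exact duplicate: safe to delete one copy
elif len(differences) == 1:
    print("near-duplicate, verify field:", differences[0])  # -> phone
```

Zero differing fields means a safe deletion; exactly one points you to the attribute that needs external verification, as described above.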
7. Human errors
We already mentioned human errors. They appear when data are transcribed into a system or database: invalid attribute values, typos, or inconsistencies in notation caused by differences between languages (e.g. a period versus a comma as the decimal separator).
This type of error results from fatigue or distraction during repetitive, tedious tasks. Errors may also occur when an employee lacks the skills required to complete a task, for example when they don't know how to fill out a certain form.
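The decimal-separator case is worth a tiny sketch, because a naive `float()` call simply rejects comma-notated values:

```python
# Minimal sketch: accept both "3.14" and "3,14" before parsing.
# It assumes values carry no thousands separators; "1,234" would be
# read as 1.234 here, so real pipelines need locale-aware parsing.
def parse_decimal(text: str) -> float:
    return float(text.strip().replace(",", "."))

parse_decimal("3,14")  # 3.14, where float("3,14") would raise ValueError
```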
The most common data errors
A general category of data errors relates to attributes:
- missing or unknown values,
- missing diacritics,
- different notations of a given attribute, e.g. avenue – ave – av, street – st – str,
- inconsistent data models across sources, e.g. George Washington Boulevard vs Washington Boulevard,
- a missing identification number (ID),
- different data formats and/or different units.
Our poll results show that low data quality is usually a result of human errors or obsolete data from different sources.
Spatial data errors
It's striking how many companies and institutions are starting to put their spatial data to use. This trend will keep growing, so it's important to take care of your geospatial data.
Reminder – spatial data are data that carry geolocation information in addition to the usual list of attributes.
All of the data errors described above can be true for both spatial and non-spatial data.
In both cases, there can be missing attributes, invalid values, typos, etc. that can result from outdated systems, human errors, different data sources, duplicated objects from database integrations, etc.
Unfortunately, spatial data can carry additional, specific errors.
The most common spatial data errors
The most common spatial data errors include:
- unclosed polygons,
- lines that don’t reach points,
- lines that cross themselves,
- incorrectly placed vertices or intersections,
- invalid geometry type,
- incorrectly defined model scheme,
- invalid units or coordinate systems,
- inconsistent networks and lack of links between objects.
An example is the challenge of classifying soils in cross-border areas: during validation, some soil profiles may overlap. Another source of error is incorrect use of generalization techniques or parameters for vector data, e.g. a smoothing or simplification parameter that is too large.
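Two of the errors listed above are easy to sketch in plain Python. The helpers and coordinates below are illustrative: one verifies that a polygon ring is closed, the other detects a polyline that properly crosses itself (touching or shared endpoints are deliberately ignored):

```python
def is_closed(ring):
    # a valid polygon ring repeats its first vertex at the end
    return len(ring) >= 4 and ring[0] == ring[-1]

def segments_cross(p1, p2, p3, p4):
    # proper-crossing test via orientation signs; collinear touches ignored
    def orient(a, b, c):
        v = (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])
        return (v > 0) - (v < 0)
    o1, o2 = orient(p1, p2, p3), orient(p1, p2, p4)
    o3, o4 = orient(p3, p4, p1), orient(p3, p4, p2)
    return o1 != o2 and o3 != o4 and 0 not in (o1, o2, o3, o4)

def self_crosses(line):
    segs = list(zip(line, line[1:]))
    for i in range(len(segs)):
        for j in range(i + 2, len(segs)):  # skip adjacent segments
            if segments_cross(*segs[i], *segs[j]):
                return True
    return False

# A "bowtie" polyline crosses itself at (1, 1):
self_crosses([(0, 0), (2, 2), (2, 0), (0, 2)])  # True
```

Production GIS tools (and libraries such as Shapely or PostGIS validity checks) cover far more cases, but even checks this simple catch unclosed rings and self-crossing lines before they poison downstream analysis.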
Where do spatial data errors come from?
Sources of spatial data errors are similar to the ones we already mentioned before. They may occur because:
- data sources are created by inexperienced people who make many mistakes,
- data aren't verified before being widely shared,
- the information used is outdated,
- data were created in a system or model that newer systems don't support and can't properly read,
- the person working with spatial data isn't experienced enough to know, e.g., which coordinate system to use or which generalization or classification technique is best for a given dataset.
What are the consequences of using poor-quality data?
First and foremost, using poor-quality data results in equally poor-quality work.
Analyses of data containing errors lead you to incorrect conclusions, so every decision you make based on those analyses is also wrong. That is a particularly poor strategy in an ever more competitive market, where the survival of many companies depends on making the right decisions.
In the case of both spatial and non-spatial data, errors seriously disrupt and slow down work, resulting in delayed projects, unhappy clients, financial losses, and lost partnerships.
In our poll, we asked participants about the most common consequences of working with poor-quality data. Their responses point to work inefficiency, incorrect analyses, and lost sales opportunities.
Spatial data errors can have consequences far more severe than business losses.
Sometimes, human life depends on spatial data quality.
The most popular use of spatial data is GPS navigation, which emergency services rely on to reach incident locations. In such situations, every second of delay may be someone's last. Incorrect data may send an ambulance to the wrong place first, extending the waiting time for the people who need help.
There is another, less dramatic example that still shows the impact of incorrect data. A construction company working with maps that contain incorrect information may accidentally damage power lines or water and gas pipelines. This is both dangerous and troublesome for the people left without gas, water, or electricity.
Now you know the sources and types of both spatial and non-spatial data errors. Poor-quality data can negatively affect business operations and the everyday lives of many people, so it's worth taking care of data quality before using data in analyses and projects. This ensures the reliability of your work and of the analyses behind crucial strategic decisions.
Stay tuned for the next article to learn how to eliminate data errors. Follow us on LinkedIn so you don’t miss it!