Cleaning data before saving
When building data warehouses, too little attention is still paid to cleaning the information that goes into them. The implicit assumption seems to be that the bigger the warehouse, the better. In reality, this is a sure way to turn a data warehouse into a waste pit.
You need to clean up your data! After all, information is heterogeneous and almost always comes from multiple sources. It is precisely the existence of multiple data collection points that makes the cleansing process so complex and so relevant.
Mistakes will always be made, and it is not possible to eliminate them completely. Sometimes it may make more sense to put up with them than to spend resources on eliminating them. In general, however, you must try by all means to reduce the number of errors to an acceptable level. The methods used for analysis are already fraught with inaccuracies, so why make them worse?
The psychological aspect of the problem must also be considered. If the analyst or decision maker does not have confidence in the numbers they get from the repository, they will rely on information from other sources. This significantly reduces the value of the repository.
Types of errors
Simple errors, such as type mismatches or differences in input formats and encodings, are not covered in this article. They arise when information comes from multiple sources that use different conventions to denote the same facts.
A typical example of such an error is the specification of the gender of a person. For example, sex can be specified as M/F/D or 0/1/2.
Errors of this type are handled by specifying recoding rules and type conversions, so they are relatively easy to fix. We are interested in more complicated issues that cannot be resolved by such simple means.
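As a minimal sketch of such a recoding rule, the gender example above can be handled with a lookup table; the canonical output values and the assignment of digits to genders here are illustrative assumptions, not a standard:

```python
# Recoding table: map each source system's convention onto one
# canonical encoding before the record enters the warehouse.
# The digit-to-gender assignment below is an assumed example.
GENDER_RECODE = {
    # source A uses letters
    "M": "male", "F": "female", "D": "diverse",
    # source B uses digits (arriving as strings after extraction)
    "0": "male", "1": "female", "2": "diverse",
}

def recode_gender(raw):
    """Return the canonical gender value, or None for unknown codes."""
    return GENDER_RECODE.get(str(raw).strip().upper())
```

Unknown codes deliberately map to None rather than a guess, so they can be routed to manual review instead of silently polluting the warehouse.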
There are quite a few types of complex errors. In addition to the general-purpose errors, there are also those that are unique to a particular field or task. However, let's consider those that are independent of the task:
- Inconsistent information;
- Incomplete data;
- Anomalous values;
- Noisy data;
- Incorrect data.
There are proven methods to solve each of these problems. Of course, errors can be corrected manually, but this becomes problematic when dealing with large amounts of data. So let's look at ways to solve these problems in an automated fashion with minimal manual involvement.
Inconsistent information
The first thing to decide is what exactly counts as a contradiction. Strange as it may sound, this is not a trivial task. For example, a different spelling of a person's name is not necessarily an error; it may result from an official name change.
Once we have determined what is considered a contradiction and found such records, there are several possible actions:
- If multiple inconsistent records are found, either delete them all or keep a single one chosen by a simple rule, such as the most recent entry. This method is trivial and therefore easy to implement; sometimes it is sufficient.
- Correct inconsistent data with statistics. For example, calculate how often each of the conflicting values occurs and keep the most likely one. In most cases, this method leads to more correct results.
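Both rules can be sketched in a few lines; the record structure and the `updated_at` field name are assumptions made for illustration:

```python
from collections import Counter

def resolve_by_recency(records):
    """Trivial rule: keep the record with the latest timestamp.
    Assumes each record carries an 'updated_at' field (hypothetical name)."""
    return max(records, key=lambda r: r["updated_at"])

def resolve_by_frequency(values):
    """Statistical rule: keep the value that occurs most often
    among the conflicting variants."""
    return Counter(values).most_common(1)[0][0]
```

For example, given two spellings of the same customer name, the first rule keeps whichever was recorded last, while the second keeps whichever spelling dominates across all records.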
Incomplete data
This problem is found in most data warehouses. Many forecasting methods assume that data arrives in a steady, uninterrupted stream; in practice, this is rarely the case. As a result, data warehouse applications such as forecasting often work poorly or only with significant limitations. The following methods can be used to mitigate the problem:
- Approximation: if data is missing at some point, we take its neighborhood, calculate the value at that point from the known neighbors, and add the resulting record to the repository. This works well for ordered data, such as information about daily product sales.
- Determining the most plausible value: here the entire data set is considered rather than the neighborhood of a single point. This method is used for unordered information, i.e. in cases where we cannot determine the neighborhood of the point in question.
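A hedged sketch of both ideas, assuming linear interpolation stands in for "known formulas" on ordered series and the mode stands in for "most plausible value" on unordered data (the helper names are my own):

```python
from statistics import mode

def interpolate_gaps(series):
    """Ordered data: fill None gaps by linear interpolation between
    the nearest known neighbours; gaps at the ends are left untouched."""
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            # nearest known value to the left (may itself be interpolated)
            lo = next((j for j in range(i - 1, -1, -1)
                       if filled[j] is not None), None)
            # nearest originally known value to the right
            hi = next((j for j in range(i + 1, len(series))
                       if series[j] is not None), None)
            if lo is not None and hi is not None:
                t = (i - lo) / (hi - lo)
                filled[i] = filled[lo] + t * (series[hi] - filled[lo])
    return filled

def fill_with_mode(values):
    """Unordered data: replace None with the most frequent observed value."""
    m = mode([v for v in values if v is not None])
    return [m if v is None else v for v in values]
```

For daily sales, `interpolate_gaps` reconstructs a missing day from its neighbors; for a categorical attribute with no natural order, `fill_with_mode` substitutes the value that is most plausible overall.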
Anomalous values
Often there are events or values that do not fit the overall picture, for example a product price ten times higher than the average. Such values are best corrected, because the analysis algorithms know nothing about the nature of the underlying processes and perceive any anomaly as a perfectly normal value. This significantly biases the model, since an occasional failure or success is mistaken for a pattern.
There is a remedy for this problem as well: robust estimation. Robust methods are resistant to strong disturbances; a classic example is the median filter.
We evaluate the available data and apply one of the following measures to any value that deviates beyond the specified tolerance:
- Anomalous values are removed;
- Anomalous values are replaced by the nearest limits.
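Both measures, plus the median filter mentioned above, can be sketched as follows; the window size and tolerance limits are illustrative choices:

```python
from statistics import median

def clip_outliers(values, low, high):
    """Replace anomalous values with the nearest allowed limit."""
    return [min(max(v, low), high) for v in values]

def median_filter(values, window=3):
    """Robust smoothing: each point becomes the median of its window,
    so a single spike cannot drag the estimate away."""
    half = window // 2
    return [median(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]
```

Removing anomalies outright is just a filter (`[v for v in values if low <= v <= high]`); clipping, as shown, preserves the record while bounding its influence.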
Noisy data
During analysis we almost always encounter noise. In most cases the noise carries no useful information and only prevents us from seeing the picture clearly. There are several methods to deal with this phenomenon:
- Spectral analysis: we can use it to cut off the high-frequency components of the data, since noise is a fast, insignificant variation around the main signal. By changing the width of the retained spectrum, we can choose what kind of noise to remove.
- Autoregressive methods: widely used in time series analysis, they amount to finding a function that describes the process as signal plus noise. The noise can then be removed, and the underlying signal is preserved.
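As the simplest possible illustration of the low-pass idea behind spectral filtering — not either of the two methods named above, but the same principle in its crudest form — a moving average suppresses fast oscillations and keeps the slow trend:

```python
def moving_average(signal, window=3):
    """Crude low-pass smoothing: averaging over a sliding window
    damps fast, small oscillations (noise) and keeps the slow trend.
    The window size is an illustrative choice."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Widening the window removes lower-frequency wiggles too, which mirrors the trade-off described above: the width of the retained spectrum decides what counts as noise.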
Incorrect data
This is an extensive topic, because errors of this kind come in many forms: typos, intentional data corruption, format errors, and errors introduced by the data-entry application.
Tried and tested methods exist for most of these problems. Some are obvious, such as validating formats before data enters the repository. Others are more challenging; typos, for example, can be corrected by matching values against thesauri or reference vocabularies. In any case, these errors must be cleaned up as well.
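One common way to implement thesaurus-based typo correction is fuzzy matching against a reference vocabulary; the city list and the similarity cutoff below are assumed examples:

```python
from difflib import get_close_matches

# A hypothetical reference vocabulary (thesaurus) of valid values.
VALID_CITIES = ["Berlin", "Hamburg", "Munich", "Cologne"]

def correct_typo(value, vocabulary, cutoff=0.8):
    """Snap a misspelled value to the closest vocabulary entry;
    return the value unchanged if nothing is similar enough."""
    matches = get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else value
```

Leaving dissimilar values unchanged (rather than forcing the nearest match) avoids turning one kind of incorrect data into another.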
Conclusion
Unclean data is a very serious problem; it can ruin all efforts to populate a data warehouse. It is important to remember that cleansing is not a one-time job but an ongoing activity. Ideally, dedicated procedures should be built into the ETL process so that data is cleansed before it is loaded into the repository.
Data that is modified during cleansing must be flagged so that this can be taken into account in subsequent analysis. Otherwise there is a risk of relying on it as real information, which could lead to wrong conclusions.
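A minimal way to carry such flags, assuming a simple parallel-list representation (a real warehouse would use an extra column or audit table):

```python
def impute_with_flag(values, fill):
    """Fill gaps, but record which positions were modified so that
    later analysis can separate real measurements from imputed ones."""
    filled, imputed = [], []
    for v in values:
        filled.append(fill if v is None else v)
        imputed.append(v is None)
    return filled, imputed
```

An analyst can then weight or exclude flagged values instead of treating them as observed facts.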
The solutions described above are not the only ones. There is a whole range of other processing methods that can help improve data quality, from expert systems to neural networks. Keep in mind that cleansing methods can be strongly tied to a specific domain: what is noise for one task may be highly valuable information for another.
In this light, the ETL process should be seen as more than importing data with minimal transformation and loading it into the repository. Data cleansing must be a mandatory step, because the value of a warehouse is determined not so much by the amount of data as by the quality of the information collected.