The Challenge

Now more than ever, companies need data to improve their operations. Logistics companies in particular have many opportunities to gather massive amounts of data for operations, data analytics, and ultimately continuous improvement. Keep in mind, however, that more data does not necessarily translate into new insights, especially when there are data quality concerns.

Insightful data analysis depends on accurate and meaningful data. Poor data quality can have many causes, including human error during data entry and process errors stemming from integrations between multiple systems.

Organizations might be hesitant to get started with analytics due to data issues, but overcoming data quality challenges is part of the journey toward analytical maturity. If your organization increases data visibility to identify and fix data quality issues at the source, and makes the necessary data exclusions and corrections, you will be ready to use analytics to drive continuous improvement.


The Role of Anomaly Detection

Before you can improve data quality, you must isolate the trends or causes behind inaccurate data. ORTEC uses various methods of anomaly detection (the identification of data points that deviate from normal behavior) to isolate data quality issues.

The first method we use is statistics-based outlier detection (see the graph below titled *Histogram — Realized Duration). With this method, we identify data points that fall outside an expected range, for example values extremely far from the average. This type of detection is useful for numerical data such as delivery sizes or stop durations, fields that are often corrupted by user error.
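
As a rough illustration, the sketch below applies a simple interquartile-range (IQR) rule to flag extreme stop durations. The column name, sample data, and the 1.5 multiplier are assumptions for the example, not ORTEC's actual implementation.

```python
# A minimal sketch of statistics-based outlier detection using the IQR rule.
# Column name ("realized_duration") and the k=1.5 multiplier are illustrative assumptions.
import pandas as pd

def flag_statistical_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Mark rows whose value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    df = df.copy()
    df["is_outlier"] = ~df[column].between(lower, upper)
    return df

# Example: stop durations in minutes, with one value far from the rest.
stops = pd.DataFrame({"realized_duration": [12, 15, 14, 11, 13, 140]})
print(flag_statistical_outliers(stops, "realized_duration"))
```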

Next, we can perform actual vs. planned outlier detection (see the graph below titled **Realized duration vs. Planned duration). This method finds data points where the actual value differs significantly from the planned value. A large discrepancy between actual and planned could indicate a data input error.
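
A minimal sketch of this idea, assuming hypothetical planned_duration and realized_duration columns and an illustrative 50% deviation threshold:

```python
# Illustrative actual-vs-planned outlier detection: flag stops whose realized
# duration deviates from the planned duration by more than a relative threshold.
import pandas as pd

def flag_plan_deviations(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    df = df.copy()
    df["relative_deviation"] = (
        (df["realized_duration"] - df["planned_duration"]).abs() / df["planned_duration"]
    )
    df["is_outlier"] = df["relative_deviation"] > threshold
    return df

stops = pd.DataFrame({
    "planned_duration": [15, 20, 10, 30],
    "realized_duration": [14, 22, 45, 29],   # the 45-minute stop planned for 10 stands out
})
print(flag_plan_deviations(stops))
```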

Last, we employ a manual form of outlier detection, a method that can be informed by both analytics and business rules. Here we define upper and lower bounds on various fields based on business-specific knowledge and raise alerts when data points fall outside those bounds. For example, the business knows that routes never exceed a certain duration, so anything above that upper bound can be flagged as an outlier.
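
The sketch below shows what such rule-based bounds could look like in practice; the field names and limits are placeholders that would come from business knowledge.

```python
# Rule-based bounds: the limits below are assumed placeholders standing in for
# business knowledge (e.g., "a route never exceeds 10 hours").
import pandas as pd

BUSINESS_RULES = {
    "route_duration_hours": (0.5, 10.0),   # assumed bounds, per business input
    "delivery_size_pallets": (1, 26),
}

def apply_business_rules(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    df = df.copy()
    for column, (lower, upper) in rules.items():
        if column in df.columns:
            df[f"{column}_alert"] = ~df[column].between(lower, upper)
    return df

routes = pd.DataFrame({"route_duration_hours": [6.5, 8.0, 14.2]})
print(apply_business_rules(routes, BUSINESS_RULES))
```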

*This histogram shows realized durations that are much larger than the average value; these should be investigated.

**This graph shows data points where the realized duration differs drastically from the planned duration; these should be analyzed.


Fix the Issues at the Source

Once we have identified the causes of data anomalies, the next step is to fix the data issues at the source, which is the sustainable option for long-term analytics. This step depends on how the data is gathered, but we can highlight some helpful approaches for data that is inaccurate because of manual entry. When data is incorrect because of human error, we should design out the possibility of human error wherever possible.

For example, if stop durations are inaccurate because they depend on driver feedback, geofencing-based methods could capture realizations automatically when a driver arrives at and departs from pre-defined locations. As another example, if a driver must enter delivery sizes at a customer, data validation can ask the driver to double-check extreme values, as in the sketch below. As data collection issues are resolved, you will see your data quality improve over time.
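
Here is a hedged sketch of what that entry-time validation could look like; the threshold and the comparison against the planned size are assumptions, and a real implementation would live in the driver app.

```python
# Entry-time validation sketch: before accepting a delivery size, ask for
# confirmation when the entered value looks extreme relative to the plan.
def validate_delivery_size(entered_size: float, planned_size: float,
                           max_ratio: float = 3.0) -> bool:
    """Return True if the entry looks plausible, False if it needs confirmation."""
    if planned_size > 0 and entered_size > max_ratio * planned_size:
        return False  # e.g., prompt the driver: "You entered 5000, planned was 50. Confirm?"
    return True

print(validate_delivery_size(entered_size=5000, planned_size=50))   # False -> double-check
print(validate_delivery_size(entered_size=48, planned_size=50))     # True
```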

How to Fix Poor Data Quality for Analytics

When it is too late to fix the process that causes the bad data, we must find a way to handle inaccurate data in analytics.

The first option is to exclude data that may skew the analysis. For example, if drivers often add extra zeros to the delivery size by accident, the process should be fixed so this cannot happen in the first place. For historical data, however, these values are already skewed, so one option would be to exclude deliveries over a certain implausible size. There is a risk with this approach: the total delivery volume would now be under-reported.
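
For illustration, a simple exclusion filter might look like the following; the 1,000-unit cutoff and column name are made up for the example, and the excluded rows represent volume that will no longer be reported.

```python
# Illustrative exclusion rule: drop deliveries above an implausible size
# before analysis, keeping track of how many rows were removed.
import pandas as pd

deliveries = pd.DataFrame({"delivery_size": [120, 90, 9000, 110]})  # 9000 likely has extra zeros
cutoff = 1_000
clean = deliveries[deliveries["delivery_size"] <= cutoff]
excluded = deliveries[deliveries["delivery_size"] > cutoff]
print(f"Kept {len(clean)} rows, excluded {len(excluded)} rows")
```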

The next best option is to correct the incorrect data, and there are a few ways to do this. First, you could manually override incorrect data points with values you know to be correct or closer to correct. This is helpful for the rare data issue but can be tedious. Next, you can create rules to fix the data. For example, if a delivery size is much too large, as in the example above, replace it with the planned value, since that is a more likely value. This can be automated, making it less tedious than manual overrides while still producing more accurate data than simply excluding it.
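
A small sketch of such a correction rule, assuming hypothetical planned_size and realized_size columns and an illustrative plausibility threshold:

```python
# Rule-based correction sketch: where the realized delivery size is implausibly
# large, fall back to the planned value as the best available estimate.
import pandas as pd

def correct_with_plan(df: pd.DataFrame, max_plausible: float) -> pd.DataFrame:
    df = df.copy()
    implausible = df["realized_size"] > max_plausible
    df.loc[implausible, "realized_size"] = df.loc[implausible, "planned_size"]
    return df

deliveries = pd.DataFrame({"planned_size": [100, 80], "realized_size": [110, 8000]})
print(correct_with_plan(deliveries, max_plausible=1_000))
```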

Lastly, if the corrected values exist in some other system – for example, an invoicing system – try integrating this system back into the analytics system to get the most accurate set of analytics data.
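
For example, corrected quantities could be joined back onto the analytics data by order number, preferring the other system's value where one exists. The table and column names below are assumptions for illustration.

```python
# Hypothetical sketch of pulling corrected values from another system (e.g. invoicing):
# join invoiced quantities onto the analytics data and prefer them where present.
import pandas as pd

analytics = pd.DataFrame({"order_id": [1, 2, 3], "delivery_size": [120, 9000, 75]})
invoicing = pd.DataFrame({"order_id": [2, 3], "invoiced_size": [90, 75]})

merged = analytics.merge(invoicing, on="order_id", how="left")
merged["delivery_size"] = merged["invoiced_size"].fillna(merged["delivery_size"])
merged = merged.drop(columns="invoiced_size")
print(merged)
```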

Conclusion

Data quality issues are not a blocker to getting started with analytics in your organization, but improving data quality is the first step. Visibility into your data and its quality sets the organization up for success in analytics-driven continuous improvement. Starting with outlier identification gives you insight into where problems exist and where to focus your efforts. From there you can improve data quality at the source, and finally make historical data ready for analytics-driven insights. By systematically understanding and improving data quality, your organization will be ready to get started with analytics.