by Matthew Baas
A quick guide to our second place solution for the Zindi UmojaHack South Africa Yassir ETA prediction challenge.
TL;DR: In late July 2020, the Zindi competition platform hosted a one-day hackathon in South Africa to predict the estimated time of arrival for deliveries by the Yassir company in Tunisia. A few friends and I teamed up and achieved second place in the hackathon with a CatBoost model trained on several carefully engineered features. This post is a quick summary of how our solution worked, and assumes a passing familiarity with feature engineering.
The initial data provided to us contained 120 000 entries of deliveries, each only providing us with 6 values:
Timestamp - Time that the trip was started (as an epoch timestamp in seconds)
Origin_lat, Origin_lon - Origin (in degrees)
Destination_lat, Destination_lon - Destination (in degrees)
Trip_distance - Distance in meters that the delivery driver traveled from origin to destination.
Using these variables, we need to predict ETA, the estimated trip time in seconds. Often, many deliveries happen in a single day.
In addition, we are given daily weather data for the general region where the delivery data is sampled from. The weather data consists of 9 variables describing the max/min pressure, temperature, wind and rainfall. The usefulness of such data is somewhat suspect, however, since only a single value is given for an entire day while the true weather likely varies significantly throughout the day.
Now on to the spice of our solution – data augmentation. The largest factor in our model’s performance was how we crafted the input to get some useful additional features while not using any data aside from the fields we have just mentioned.
We derived 38 additional features, which can be broken up into a few categories:
Here we made use of the trusty functions from the fastai library. In particular, we used fastai’s add_datepart() and legacy fastai V1’s add_cyclic_datepart() functions to add various time-derived fields from the epoch timestamp given.
The additional features are either integers, such as the day of week or day of year, or floats such as the cosine of the hour, or booleans such as whether it is the start of a month. For our purposes, to keep things simple, we cast everything to a float.
Since most of the data spans only a few months, features based on the month of the year would not be very useful, because some months in the test set might not have been seen during training. So, after filtering down the derived fields those fastai functions gave us, we obtained the following 29 additional features for each delivery:
'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'TimestampIs_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'TimestampIs_year_start', 'Hour', 'Minute', 'Second', 'Epoch_Elapsed', 'weekday_cos', 'weekday_sin', 'day_month_cos', 'day_month_sin', 'month_year_cos', 'month_year_sin', 'day_year_cos', 'day_year_sin', 'hour_cos', 'hour_sin', 'clock_cos', 'clock_sin', 'minute_cos', 'minute_sin', 'second_cos', 'second_sin'
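As a rough illustration of what these time-derived fields look like, here is a minimal standard-library sketch. It is not fastai's implementation: the helper names `cyclic` and `time_features` are hypothetical, only a handful of the fields are shown, and the timestamp is assumed to be interpreted in UTC.

```python
import math
from datetime import datetime, timezone


def cyclic(value, period):
    """Map a periodic value onto the unit circle as a (cos, sin) pair."""
    angle = 2 * math.pi * value / period
    return math.cos(angle), math.sin(angle)


def time_features(epoch_seconds):
    """Derive a few calendar and cyclic features from an epoch timestamp.

    Assumption: the timestamp is interpreted in UTC.
    """
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    hour_cos, hour_sin = cyclic(dt.hour, 24)
    weekday_cos, weekday_sin = cyclic(dt.weekday(), 7)
    return {
        # Everything is cast to float, as described above.
        "Day": float(dt.day),
        "Dayofweek": float(dt.weekday()),  # Monday = 0
        "Dayofyear": float(dt.timetuple().tm_yday),
        "Hour": float(dt.hour),
        "Is_year_end": float(dt.month == 12 and dt.day == 31),
        "hour_cos": hour_cos,
        "hour_sin": hour_sin,
        "weekday_cos": weekday_cos,
        "weekday_sin": weekday_sin,
    }


# Example: midnight UTC on 1 August 2020, a Saturday.
features = time_features(1596240000)
```

The cosine/sine pairs keep values that are close in time close in feature space, so hour 23 ends up next to hour 0 instead of 23 units away.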
Here is where things started to get interesting. Since we are not allowed to consult a map of the area of Tunisia where the deliveries are, we cannot easily group deliveries by delivery zone, but that sounds like a useful thing to do.
So, what we did was separately train a k-means clustering algorithm on the start and end (latitude, longitude) tuples and use the assigned cluster index of each delivery as an additional feature.
Concretely, we derived 3 features (the cluster index from the trained k-means model) from 3 k-means models:
A model trained on (Origin_lat, Origin_lon, Destination_lat, Destination_lon) to get a rough clustering of common driving routes.
A model trained on (Origin_lat, Origin_lon) corresponding to the starting location, to get an idea of the starting zone of deliveries.
A model trained on (Destination_lat, Destination_lon) corresponding to the ending location of deliveries, to get an idea of the zone a delivery ends in.
From our experiments on a validation set, we found that using 15 clusters for all 3 models worked best. Also, surprisingly, treating these cluster index features as purely floating point numbers for the CatBoost model (explained later) seemed to work better than treating them as categorical variables. We are not too sure why this was – it could well have been because our categorical settings for the CatBoost model were misguided – but it certainly did give a slight performance boost.
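The three cluster-index features can be sketched roughly as follows. This assumes scikit-learn's `KMeans`, which may or may not be what was used in the hackathon. The coordinates are synthetic placeholders and the helper name `cluster_index` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in coordinates (columns: latitude, longitude), not real data.
origin = rng.uniform([36.7, 10.0], [36.9, 10.3], size=(200, 2))
dest = rng.uniform([36.7, 10.0], [36.9, 10.3], size=(200, 2))


def cluster_index(points, n_clusters=15, seed=0):
    """Fit k-means on the points and return each row's cluster index as a float."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(points).astype(float)


# One model on the full 4-D route, one on origins, one on destinations.
route_cluster = cluster_index(np.hstack([origin, dest]))
origin_cluster = cluster_index(origin)
dest_cluster = cluster_index(dest)
```

In a real pipeline you would fit each k-means model once and then call `predict` on the test deliveries, so that train and test rows share the same cluster numbering.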
We added two bearing features: the bearing from the origin latitude-longitude point to the destination latitude-longitude point, and a ‘reverse’ bearing from the destination point back to the origin point. These angles, in radians, give an indication of the direction of travel and seemed to help quite a bit.
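For reference, one common way to compute such a bearing is the initial great-circle bearing formula. The sketch below uses that formula as an assumption, since the exact formula we used is not spelled out here. The function name `bearing` is illustrative.

```python
import math


def bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2.

    Inputs are in degrees; the result is in radians in (-pi, pi],
    measured clockwise from north.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.atan2(y, x)


# Forward and 'reverse' bearings for a trip heading due east along the equator.
forward = bearing(0.0, 0.0, 0.0, 1.0)
reverse = bearing(0.0, 1.0, 0.0, 0.0)
```

Over the short distances of a city delivery the reverse bearing is close to the forward bearing shifted by pi, so the two features carry overlapping information.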
We then went on to add two more features based on simple math operations of the most important existing features. Namely, we added the inverse of the trip distance as a feature, which appeared to help the model learn better.
We also added a feature constructed by the multiplication of the trip distance and the destination longitude. This feature was added while trying different permutations of operations on the base features, and it happened to turn out that adding this particular feature improved our score non-trivially. We are not too sure if there is any specific meaning to the trip distance multiplied by the destination longitude, but it appeared to work xD.
We added another feature based on the idea that if the distance the delivery driver will travel (the Trip_distance) is much longer than the straight-line Euclidean distance between the starting and ending point, it likely means that there is some disruption or traffic along the direct route, indicating a longer travel time.
We thus add a feature as a rough approximation of the sinuosity coefficient by taking the ratio of the Euclidean distance between starting and ending latitude-longitude points and the actual distance of the path travelled (Trip_distance). This feature also proved fairly useful.
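The two arithmetic features and the sinuosity-style ratio can be sketched together in a few lines. Several details here are assumptions made for illustration: the feature names are made up, the Euclidean distance is taken directly in degree space, and a small epsilon guards against division by zero.

```python
import math


def simple_features(origin_lat, origin_lon, dest_lat, dest_lon, trip_distance):
    """Inverse distance, distance x destination longitude, and a sinuosity ratio."""
    eps = 1e-9  # assumption: avoids dividing by zero for zero-length trips
    # Assumption: plain Euclidean distance in degree space, not in meters.
    euclid = math.hypot(dest_lat - origin_lat, dest_lon - origin_lon)
    return {
        "inv_distance": 1.0 / (trip_distance + eps),
        "distance_x_dest_lon": trip_distance * dest_lon,
        "sinuosity": euclid / (trip_distance + eps),
    }


# Toy example with made-up numbers (a 3-4-5 triangle in degree space).
f = simple_features(0.0, 0.0, 3.0, 4.0, 10.0)
```

Because the numerator is in degrees and Trip_distance is in meters, the ratio is only meaningful up to a constant scale, which a tree-based model does not mind.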
A final feature we added was actually the non-negative matrix factorization (NMF) of all the other fields, including derived fields. Concretely, we used NMF to reduce the dimension of all other features to a single dimension, and used this projected scalar as a final additional input to the model. This is what really helped push forward our model’s performance into the top 3.
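A minimal sketch of this idea, assuming scikit-learn's `NMF`. NMF requires non-negative input, so the sketch rescales every column to [0, 1] first; that rescaling step is an assumption of this example, and the random matrix is only a stand-in for the real feature table.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # stand-in for the full feature matrix

# Assumption: rescale each column to [0, 1] so that NMF accepts the input
# (features such as the sine/cosine encodings are otherwise negative).
X_nonneg = np.clip(MinMaxScaler().fit_transform(X), 0.0, None)

# Project every row onto a single non-negative component.
nmf = NMF(n_components=1, init="nndsvda", max_iter=500, random_state=0)
nmf_feature = nmf.fit_transform(X_nonneg)[:, 0]
```

As with the clustering models, the fitted scaler and NMF model would be reused with `transform` on the test set rather than refit.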
As an aside, the competition organizers with Yassir and Zindi were very epic with this competition and there were no missing data values in either the weather or delivery information. This also meant that there were no missing entries in any of the derived fields! Very epic indeed; no missing data in the entire competition.
All of the additional features are added to one very big pandas DataFrame object, which is used as the input for training the model. The model we used was a CatBoostRegressor from the CatBoost gradient boosting library.
We simply followed the typical use case examples of CatBoost and trained it accordingly using an RMSE loss metric. There was nothing special about the model itself. For experimentation we split the input data into 80% for training and 20% for validation (with the validation 20% being chronologically the most recent 20% of deliveries in the provided data).
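The chronological 80/20 split and the RMSE metric can be sketched without any libraries. The helper names `chronological_split` and `rmse` are hypothetical and the rows are toy data. The actual training used CatBoost on the full DataFrame.

```python
import math


def chronological_split(rows, timestamp_key="Timestamp", valid_fraction=0.2):
    """Sort rows by time and hold out the most recent fraction for validation."""
    ordered = sorted(rows, key=lambda r: r[timestamp_key])
    n_valid = int(round(len(ordered) * valid_fraction))
    cut = len(ordered) - n_valid
    return ordered[:cut], ordered[cut:]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy rows with shuffled timestamps.
rows = [{"Timestamp": t, "ETA": 100 + t} for t in (5, 1, 9, 3, 7, 2, 8, 4, 6, 0)]
train, valid = chronological_split(rows)
```

Holding out the latest deliveries rather than a random sample mimics the test setting, where the model predicts trips that happen after the ones it was trained on.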
And, unfortunately, as is often the case with these kinds of competitions, we used ensembling to get that final 1% improvement at the end. We trained a few models with slightly different parameters and took the average prediction result for the ETA.
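The averaging step itself is simple. Below is a model-agnostic sketch: `average_predictions` is a hypothetical helper that works with any fitted objects exposing a `predict` method, and `ConstantModel` is a dummy stand-in used only so the example runs without training anything.

```python
def average_predictions(models, X):
    """Average the per-row predictions of several fitted models."""
    all_preds = [list(m.predict(X)) for m in models]
    return [sum(row) / len(row) for row in zip(*all_preds)]


class ConstantModel:
    """Hypothetical stand-in for a fitted regressor with a predict() method."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


# Three 'models' that disagree slightly; the ensemble takes their mean.
ensemble = [ConstantModel(90.0), ConstantModel(100.0), ConstantModel(110.0)]
preds = average_predictions(ensemble, X=[[0.0], [1.0]])
```

With real models, the list would hold a few trained regressors that differ slightly in their parameters.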
That’s it! Nothing else, nothing spooky. Fairly simple, only some special feature engineering going on here. Hopefully some of the features we derived have given you some ideas for future data science endeavors with tabular data :).
Have a great day.