Setting up the input/output
First I split the data up into hourly chunks with the
rolling_window method. Then I run a 2D rolling window to pull enough hours to include both my predictor value and response values. Note that there’s extra data in the middle. I drop this middle section and assign the X and y arrays via
np.split(). Then I reshape the arrays so that the hours are no longer divided. I’m using the
rolling_window method that I mentioned yesterday (using strides) and a 2D version that doesn’t use strides as defined here.
data = egz.df_energy['Main (kW)']
bins = egz.rolling_window(data,4,4)
windows = egz.rolling_window2D(bins,24*7*5+1)
X,_,y = np.split(windows,[24*7*4,24*7*5],1)
X = X.reshape(X.shape[0],-1)
y = y.reshape(y.shape[0],-1)
I can declutter this process by getting rid of the step where I break into hour-long bins (although that section did make it a bit easier to follow):
windows = egz.rolling_window(data,4*(24*7*5+1),4)
X,_,y = np.split(windows,[4*24*7*4,4*24*7*5],1)
See below for a more readable edit
My X array of 4-week predictor values starts at 00:00:00 on 8/5/15, and my y array holds the hour-long predictions 5 weeks from then, starting on 9/9/15 at 00:00:00.
4 weeks of prediction data, 1 week gap, 1 hour of target data.
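To make the window layout concrete, here is a minimal sketch on toy data. It does not use the egz helpers. Instead it assumes NumPy's sliding_window_view (NumPy 1.20 or later) as a stand-in for egz.rolling_window, whose exact behavior may differ, and make_windows is a hypothetical name used only for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(data, input_size, gap_size, output_size):
    """Hypothetical stand-in for the rolling window + np.split steps."""
    total = input_size + gap_size + output_size
    # Build every window of `total` samples, then keep one window
    # per `output_size` samples so the targets do not overlap.
    windows = sliding_window_view(data, total)[::output_size]
    # Split the columns into predictors, the unused gap, and targets.
    X, _, y = np.split(windows, [input_size, input_size + gap_size], axis=1)
    return X, y


# Toy series: 20 samples, 6 inputs, a 2-sample gap, 2 outputs per window.
data = np.arange(20.0)
X, y = make_windows(data, input_size=6, gap_size=2, output_size=2)
print(X.shape, y.shape)  # (6, 6) (6, 2)
print(X[0], y[0])        # first window: inputs 0..5, targets 8 and 9
```

With the real data the same call would use input_size = 4*24*7*4, gap_size = 4*24*7 and output_size = 4, assuming four 15-minute samples per hour.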
Apparently sklearn doesn’t like it when you give it null values. For now I’ll just fillna(0), although this may impact the accuracy of the model. I’d be better off filling in values from the week before to give a better estimate of the actual usage that occurred then (or averaging the value from the week before with the interpolated values to also reflect that day’s context).
Here’s a really big plot of all the test samples compared to their target values. Since the predictions are each hour-long bins of data, I plotted them by raveling the predicted array.
Measuring the fit
I first tried the Mean Absolute Percent Error to measure the fit. However, the outlier days when the model didn’t know that there was no school threw this measure off entirely (it was over 4 billion percent).
To reduce the impact of these outlier days I instead measured the Median Absolute Percent Error which gave a much more reasonable value. Note the addition of a
1e-9 term so that the system doesn’t try to divide by zero.
np.median((np.abs(((y_test+1e-9) - (pipe.predict(X_test)+1e-9))) / (y_test+1e-9)))
>>> 0.16049520499889364
To put this in comparison, the MIT report from yesterday measured a MAPE of about 0.12 for RF and 0.14 for ANN, so we’re not far off. And this is only trained on data about previous consumption, so it doesn’t have information about holidays yet.
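The median APE calculation can be wrapped in a small helper. The sketch below is an illustration rather than the code behind the number above: median_ape and the toy arrays are made up for this example. It also shows how a single near-zero target can dominate the mean while leaving the median almost untouched.

```python
import numpy as np


def median_ape(y_true, y_pred, eps=1e-9):
    """Median absolute percent error as a fraction (0.16 means 16%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # eps keeps the denominator away from zero, like the 1e-9 term above.
    return np.median(np.abs(y_true - y_pred) / (y_true + eps))


# Toy values (not real meter data): the third target is exactly zero.
y_true = np.array([10.0, 20.0, 0.0, 40.0])
y_pred = np.array([11.0, 18.0, 5.0, 40.0])

ape = np.abs(y_true - y_pred) / (y_true + 1e-9)
mean_ape = ape.mean()                  # blown up by the zero target
med_ape = median_ape(y_true, y_pred)   # stays close to 0.1
print(mean_ape, med_ape)
```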
I tweaked the ranges of some of the window variables to see how it affected the predictions. To do so I added some more variables which also makes the assignment more readable:
input_size = 4*24*7*4
gap_size = 4*24*7
output_size = 4
windows = egz.rolling_window(data,input_size+gap_size+output_size,output_size)
X,_,y = np.split(windows,[input_size,input_size+gap_size],1)
|Input size|Gap size|Output size|Median APE|
|---|---|---|---|
|4 weeks|1 week|1 hour|0.1605|
|1 week|1 week|1 hour|0.1718|
|4 weeks|1 week|1 day|0.1597|
|8 weeks|1 week|1 day|0.1717|
|8 weeks|1 day|1 hour|0.1439|
|8 weeks|1 day|1 day|0.1569|
|4 weeks|1 day|1 day|0.1593|
|4 weeks|1 day|1 hour|0.1325|
|1 week|1 day|1 hour|0.1526|
|1 day|1 hour|1 hour|0.0794|
Note that due to the random nature of the model, the accuracy is subject to change across trials (but as an ensemble method it should be somewhat consistent).
As expected, the model with the smallest gap (1 hour) had the best accuracy, but with a gap that small the forecast is no longer very useful. From what I can tell, the winner is 4 weeks of input, a 1 day gap, and 1 hour predicted. That way the forecast will be available a day beforehand. We could also present the user with an option to run a forecast of their desired length, although they would have to wait for the server to process their request.
Better handling of null values
Instead of replacing the nulls with 0 kW, I can replace them with last week’s value with
I actually tried using the
'W' offset alias instead of calculating the week index interval but it wasn’t behaving as I expected.
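One way to fill nulls from the previous week is to shift the series by a fixed number of rows. The sketch below is not necessarily the code used here. It assumes regular 15-minute readings with no missing rows, and fill_from_last_week and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

SAMPLES_PER_WEEK = 4 * 24 * 7  # assumes regular 15-minute readings


def fill_from_last_week(series, periods=SAMPLES_PER_WEEK):
    """Fill nulls with the reading taken `periods` rows earlier."""
    # shift() moves values forward by `periods` rows, so each row lines up
    # with the value from one week before; fillna() matches on the index.
    return series.fillna(series.shift(periods))


# Toy series (not real meter data): two weeks of increasing values.
index = pd.date_range("2015-08-05", periods=2 * SAMPLES_PER_WEEK, freq="15min")
usage = pd.Series(np.arange(len(index), dtype=float), index=index)
usage.iloc[700] = np.nan  # a gap in the second week

filled = fill_from_last_week(usage)
print(filled.iloc[700])   # 28.0, the value from row 700 - 672
```

Nulls in the first week have nothing to copy from and stay null, as do nulls whose previous-week value is also null, so a fallback such as fillna(0) may still be needed afterwards.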
The training process is rather slow, even if I drastically reduce the number of samples. It appears the biggest time-suck is the large number of features that need to be tracked.
The feature_importances_ property of a trained RF estimator shows the relative importance of your features. Here’s what the distribution looks like:
As you can see, almost all the features have close to zero importance. Here’s the top 10 values of the sorted list:
The importance drops off very quickly. Only a couple of sparse values are significant predictors (the top contenders appear to be from a few weeks in the past).
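Ranking the importances and translating a feature index back into "how long before the target" takes only a few lines. This sketch uses a randomly generated placeholder array instead of a trained estimator's feature_importances_, so the printed numbers are not results from this model. The helper names are hypothetical, and four samples per hour is assumed.

```python
import numpy as np


def top_features(importances, n=10):
    """Return the indices and values of the n largest importances."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1][:n]  # largest first
    return order, importances[order]


def lag_in_hours(feature_idx, input_size, gap_size, samples_per_hour=4):
    """Hours between a predictor sample and the first target sample."""
    return (input_size + gap_size - np.asarray(feature_idx)) / samples_per_hour


input_size = 4 * 24 * 7 * 4  # 4 weeks of 15-minute samples
gap_size = 4 * 24 * 7        # 1 week gap

# Placeholder importances: random and heavily skewed, NOT model output.
rng = np.random.default_rng(0)
raw = rng.random(input_size) ** 20
importances = raw / raw.sum()

idx, vals = top_features(importances)
for i, v, lag in zip(idx, vals, lag_in_hours(idx, input_size, gap_size)):
    print(f"feature {i}: importance {v:.4f}, {lag:.2f} hours before target")
```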
By changing the
max_features argument to something like
'log2' I was able to increase the training speed at the expense of the median APE (0.13 vs 0.15).
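The speedup makes sense given how few features each split gets to consider under these options. The sketch below reproduces the arithmetic as I understand scikit-learn's handling of the string values. features_per_split is a hypothetical helper, not a scikit-learn function, and the feature count assumes 4 weeks of 15-minute samples.

```python
import math


def features_per_split(n_features, max_features=None):
    """Candidate features per split for a few max_features settings."""
    if max_features is None:
        return n_features  # consider every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported option: {max_features!r}")


n_features = 4 * 24 * 7 * 4  # assumed: 4 weeks of 15-minute samples
for option in (None, "sqrt", "log2"):
    print(option, features_per_split(n_features, option))
# None 2688
# sqrt 51
# log2 11
```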
Here’s an example output using
'sqrt' max features: