GEFCom 2014 Electricity Price

The Global Energy Forecasting Competition (GEFCom) is a competition organized by a team led by Dr. Tao Hong that invites submissions from around the world for forecasting energy demand. GEFCom was first held in 2012 on Kaggle, and the second GEFCom was held in 2014 on CrowdANALYTIX. Tangent Works participated in the 2017 competition using TIM and was among the winning teams. Before that competition, we first tried TIM on the 2014 problems. This solution shows how to use TIM to solve one of them.

Problem description

The probabilistic price forecasting track asked contestants to forecast the probability distribution (as quantiles) of the electricity price for one zone, on a rolling basis, 24 hours ahead. Contestants provided forecasts for 15 rounds, each covering a different day, and additional price and load data was released in each round.
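GEFCom2014 scored quantile forecasts with the pinball loss, averaged over the quantile levels 0.01 to 0.99. The sketch below shows that score for a single hour in plain Python. The function names and the toy numbers are illustrative only; they are not part of TIM or of the competition files.

```python
def pinball_loss(actual, quantile_forecast, tau):
    """Pinball loss of one quantile forecast at level tau (0 < tau < 1)."""
    if actual >= quantile_forecast:
        return tau * (actual - quantile_forecast)
    return (1.0 - tau) * (quantile_forecast - actual)


def mean_pinball_loss(actual, quantile_forecasts):
    """Average pinball loss over a dict {tau: forecast} for a single observation."""
    losses = [pinball_loss(actual, q, tau) for tau, q in quantile_forecasts.items()]
    return sum(losses) / len(losses)


# Toy example (made-up forecasts): three quantiles of the price for one hour.
forecasts = {0.1: 30.0, 0.5: 40.0, 0.9: 55.0}
score = mean_pinball_loss(43.17, forecasts)
print(round(score, 4))
```

A lower score is better. The loss penalizes a high quantile more when the actual price ends up above it than when it ends up below it, and the reverse for a low quantile.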

Data Recommendation Template

In the price forecasting track of the competition that we will look at, only 2 predictors were available: forecasts of electricity loads from two different zones. In general, however, more predictors influence the price and should be included if possible. These are mostly meteorological data, and their influence varies from country to country depending on the composition of the energy sources. For example, in countries with a large share of renewables, solar-related forecasts (such as global horizontal irradiation) matter a lot.

TIM Setup

TIM requires no configuration of its mathematical internals and works well in business user mode. All that is required from the user is the desired prediction horizon.

Demo using Python API Client

Set up Python Libraries

In [1]:
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json

import tim_client

Credentials and logging

(Do not forget to fill in your credentials in the credentials.json file)
In [2]:
with open('credentials.json') as f:
    credentials_json = json.load(f)                     # loading the credentials from credentials.json

TIM_URL = 'https://timws.tangent.works/v4/api'          # URL to which the requests are sent

SAVE_JSON = False                                       # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/'                            # folder where the requests and responses are stored

LOGGING_LEVEL = 'INFO'
In [3]:
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
In [4]:
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)

api_client.save_json = SAVE_JSON
api_client.json_saving_folder_path = JSON_SAVING_FOLDER
[INFO] 2020-10-29 10:27:05,369 - tim_client.api_client:save_json:74 - Saving JSONs functionality has been disabled
[INFO] 2020-10-29 10:27:05,372 - tim_client.api_client:json_saving_folder_path:89 - JSON destination folder changed to logs

Specify configuration

The model is built on the range from 2011-01-01 00:00:00 to 2013-06-15 23:00:00. The rest (the last 4416 samples) is left out for validation, which we achieve by setting the "backtest length" to 4416. The proper way of emulating the competition setup would be to create 14 different models, each with its own day-ahead forecast (the building period would grow with new data every time). For simplicity, however, we keep the building period static. To get better insight into the model, we also request extended importances and prediction intervals.
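The rolling setup mentioned above can be sketched as an expanding-window schedule. The helper below is a hypothetical illustration in plain Python: it does not call TIM, and it assumes, purely for illustration, that the forecast days are consecutive and start right after the static building period.

```python
from datetime import datetime, timedelta


def rolling_origins(first_forecast_day, rounds):
    """Yield (build_end, forecast_start, forecast_end) for each round.

    Each round would rebuild the model on all data up to 23:00 of the
    previous day and forecast the 24 hours of the following day.
    """
    for k in range(rounds):
        forecast_start = first_forecast_day + timedelta(days=k)
        build_end = forecast_start - timedelta(hours=1)       # last known hour, 23:00
        forecast_end = forecast_start + timedelta(hours=23)   # last forecasted hour
        yield build_end, forecast_start, forecast_end


# Assumed start: the day after the static building period used in this notebook.
schedule = list(rolling_origins(datetime(2013, 6, 16), rounds=14))
for build_end, start, end in schedule[:2]:
    print(build_end, "->", start, "to", end)
```

Each tuple tells you where one round's building period would end and which 24 hours it would forecast.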

In [8]:
configuration_backtest = {
    'usage': {
        'predictionTo': {
            'baseUnit': 'Day',                # units that are used for specifying the prediction horizon length (one of 'Day', 'Hour', 'QuarterHour', 'Sample')
            'offset': 1                       # number of units we want to predict into the future (24 hours in this case)
        },
        'backtestLength': 4416                 # number of samples that are used for backtesting (note that these samples are excluded from model building period)
    },
    "predictionIntervals": {
        "confidenceLevel": 90                  # confidence level of the prediction intervals (in %)
    },
    'extendedOutputConfiguration': {
        'returnExtendedImportances': True      # flag that specifies if the importances of features are returned in the response
    }
}
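As a sanity check on `backtestLength`, the arithmetic can be reproduced in plain Python. This sketch assumes that the last known price is stamped 2013-12-16 23:00:00, which is what the data preview in this notebook suggests, and that the hourly series has no gaps.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)

build_end = datetime(2013, 6, 15, 23)      # last timestamp of the model building period
last_target = datetime(2013, 12, 16, 23)   # assumed last timestamp with a known price

# Hourly samples strictly after build_end, up to and including last_target.
backtest_length = (last_target - build_end) // HOUR
print(backtest_length)   # 4416
```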

Data description

The dataset used in this example has an hourly sampling rate and contains data from 2011-01-01 to 2013-12-17.

Target

The target variable is the electricity price, measured hourly from the beginning of 2011 until late 2013. The final day in the dataset has no price values, because it is the day to be forecast.

Predictor candidates

Zonal load forecasts from 2 different areas.

Timestamp

The timestamp is the first column, and each value marks the beginning of the period it corresponds to; i.e. ‘Price’ in the row with timestamp 2011-01-01 00:00:00 is the average price during the period between 2011-01-01 00:00:00 and 2011-01-01 01:00:00.

Forecasting scenario

In this example we will simulate the day-ahead scenario used in the competition. Each day at 23:00 we wish to have forecasts for each hour of the next day. The last known target value is from that same hour (23:00), and both load predictors are available for every hour of the prediction horizon. To let TIM know that this is how the model would be used in production, we simply supply the dataset in a form that represents the real situation (as can be seen in the view below; notice the NaN values representing missing data for the following day we wish to forecast).
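The availability pattern described above can be checked before the data is sent to TIM. The helper below is a hypothetical sketch in plain Python and is not part of tim_client. Rows are `(timestamp, price, total_load, zonal_load)` tuples, with `None` standing in for NaN, and all values are made up.

```python
from datetime import datetime, timedelta


def is_day_ahead_shape(rows, horizon=24):
    """Check that the last `horizon` rows lack a price but have both predictors,
    and that the last known price is stamped 23:00."""
    if len(rows) <= horizon:
        return False
    history, future = rows[:-horizon], rows[-horizon:]
    last_known = history[-1]
    return (
        last_known[1] is not None
        and last_known[0].hour == 23
        and all(price is None for _, price, _, _ in future)
        and all(total is not None and zonal is not None for _, _, total, zonal in future)
    )


# Toy data: two known days followed by one day to forecast.
start = datetime(2013, 12, 15)
rows = [
    (start + timedelta(hours=h), 40.0 if h < 48 else None, 20000, 6000)
    for h in range(72)
]
print(is_day_ahead_shape(rows))   # True
```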

In [9]:
data = tim_client.load_dataset_from_csv_file('data.csv', sep=',')                                  # loading data from data.csv
data                                                                                               # quick look at the data
Out[9]:
       TIMESTAMP            Price  P_Forecast Total Load  P_Forecast Zonal Load
0      2011-01-01 00:00:00  43.17  15187                  5091
1      2011-01-01 01:00:00  36.24  14464                  4918
2      2011-01-01 02:00:00  34.64  13940                  4763
3      2011-01-01 03:00:00  33.76  13609                  4660
4      2011-01-01 04:00:00  33.08  13391                  4599
...    ...                  ...    ...                    ...
25963  2013-12-17 19:00:00  NaN    23091                  7167
25964  2013-12-17 20:00:00  NaN    22504                  6958
25965  2013-12-17 21:00:00  NaN    21538                  6707
25966  2013-12-17 22:00:00  NaN    20025                  6316
25967  2013-12-17 23:00:00  NaN    18306                  5812

25968 rows × 4 columns

Run TIM

In [10]:
backtest = api_client.prediction_build_model_predict(data, configuration_backtest)                 # running the RTInstantML forecasting using data and defined configuration
backtest.status                                                                                    # status of the job
Out[10]:
'Finished'

Visualize backtesting

In [16]:
from plotly.subplots import make_subplots                                                       # explicit import, so the submodule is guaranteed to be loaded
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.02)                   # plot initialization

fig.add_trace(go.Scatter(x = data.loc[:, "TIMESTAMP"], y=data.loc[:, "Price"],
                         name = "target", line=dict(color='black')), row=1, col=1)              # plotting the target variable

fig.add_trace(go.Scatter(x = backtest.prediction.index,
                         y = backtest.prediction.loc[:, 'Prediction'],
                         name = "production forecast",
                         line = dict(color='purple')), row=1, col=1)                            # plotting production prediction

fig.add_trace(go.Scatter(x = backtest.prediction_intervals_upper_values.index,
                         y = backtest.prediction_intervals_upper_values.loc[:, 'UpperValues'],
                         marker = dict(color="#444"),
                         line = dict(width=0),
                         showlegend = False), row=1, col=1)
fig.add_trace(go.Scatter(x = backtest.prediction_intervals_lower_values.index,
                         y = backtest.prediction_intervals_lower_values.loc[:, 'LowerValues'],
                         fill = 'tonexty',
                         line = dict(width=0),
                         showlegend = False), row=1, col=1)                                     # plotting prediction intervals

fig.add_trace(go.Scatter(x = backtest.aggregated_predictions[0]['values'].index,
                         y = backtest.aggregated_predictions[0]['values'].loc[:, 'Prediction'],
                         name = "in-sample MAPE: " + str(round(backtest.aggregated_predictions[0]['accuracyMetrics']['MAPE'], 2)),
                         line=dict(color='goldenrod')), row=1, col=1)                           # plotting in-sample prediction

fig.add_trace(go.Scatter(x = backtest.aggregated_predictions[1]['values'].index,
                         y = backtest.aggregated_predictions[1]['values'].loc[:, 'Prediction'],
                         name = "out-of-sample MAPE: " + str(round(backtest.aggregated_predictions[1]['accuracyMetrics']['MAPE'], 2)),
                         line = dict(color='red')), row=1, col=1)                               # plotting out-of-sample prediction

fig.add_trace(go.Scatter(x = data.loc[:, "TIMESTAMP"], y=data.loc[:, "P_Forecast Total Load"],
                         name = "P_Forecast Total Load", line=dict(color='forestgreen')), row=2, col=1)   # plotting the predictor P_Forecast Total Load

fig.update_layout(height=600, width=1000,
                  title_text="Backtesting, modelling difficulty: "
                  + str(round(backtest.data_difficulty, 2)) + "%" )                             # update size and title of the plot

fig.show()
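The MAPE values shown in the legend come from TIM's response. For reference, the sketch below shows the usual definition of MAPE together with the empirical coverage of a prediction interval, in plain Python on made-up numbers. Whether TIM computes its metric in exactly this way is an assumption.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error in %, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actuals, predictions) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def coverage(actuals, lower, upper):
    """Share of actuals (in %) that fall inside [lower, upper]."""
    hits = sum(lo <= a <= up for a, lo, up in zip(actuals, lower, upper))
    return 100.0 * hits / len(actuals)


# Made-up hourly prices, point forecasts and interval bounds.
actuals = [40.0, 50.0, 80.0, 100.0]
predictions = [44.0, 45.0, 80.0, 90.0]
lower = [35.0, 40.0, 60.0, 70.0]
upper = [50.0, 48.0, 95.0, 110.0]

print(round(mape(actuals, predictions), 2))   # 7.5
print(coverage(actuals, lower, upper))        # 75.0
```

With a 90% confidence level, a well-calibrated interval would be expected to cover roughly 90% of the out-of-sample actuals.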

Visualize predictor and feature importances

In [17]:
simple_importances = backtest.predictors_importances['simpleImportances']                                                                # get predictor importances
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True)                                           # sort by importance
extended_importances = backtest.predictors_importances['extendedImportances']                                                            # get feature importances
extended_importances = sorted(extended_importances, key = lambda i: i['importance'], reverse=True)                                       # sort by importance

si_df = pd.DataFrame(index=np.arange(len(simple_importances)), columns = ['predictor name', 'predictor importance (%)'])                 # initialize predictor importances dataframe
ei_df = pd.DataFrame(index=np.arange(len(extended_importances)), columns = ['feature name', 'feature importance (%)', 'time', 'type'])   # initialize feature importances dataframe
In [18]:
for (i, si) in enumerate(simple_importances):
    si_df.loc[i, 'predictor name'] = si['predictorName']                   # get predictor name
    si_df.loc[i, 'predictor importance (%)'] = si['importance']            # get importance of the predictor

for (i, ei) in enumerate(extended_importances):
    ei_df.loc[i, 'feature name'] = ei['termName']                          # get feature name
    ei_df.loc[i, 'feature importance (%)'] = ei['importance']              # get importance of the feature
    ei_df.loc[i, 'time'] = ei['time']                                      # get time of the day to which the feature corresponds
    ei_df.loc[i, 'type'] = ei['type']                                      # get type of the feature
In [19]:
si_df.head()                                                               # predictor importances data frame
Out[19]:
   predictor name         predictor importance (%)
0  Price                  66.07
1  P_Forecast Total Load  17.01
2  P_Forecast Zonal Load  16.92
In [20]:
fig = go.Figure(go.Bar(x=si_df['predictor name'], y=si_df['predictor importance (%)']))      # plot the bar chart
fig.update_layout(height=400,                                                                # update size, title and axis titles of the chart
                  width=600,
                  title_text="Importances of predictors",
                  xaxis_title="Predictor name",
                  yaxis_title="Predictor importance (%)")
fig.show()
In [21]:
ei_df.head()                                                               # first few of the feature importances
Out[21]:
   feature name  feature importance (%)  time      type
0  Price(t-1)    37.97                   00:00:00  TargetAndTargetTransformation
1  Price(t-2)    32.23                   01:00:00  TargetAndTargetTransformation
2  Price(t-22)   31.09                   21:00:00  TargetAndTargetTransformation
3  Price(t-24)   30.01                   23:00:00  TargetAndTargetTransformation
4  Price(t-24)   29.76                   22:00:00  TargetAndTargetTransformation
In [22]:
time = '12:00:00'                                                                            # time for which the feature importances are visualized
fig = go.Figure(go.Bar(x=ei_df[ei_df['time'] == time]['feature name'],                       # plot the bar chart
                       y=ei_df[ei_df['time'] == time]['feature importance (%)']))
fig.update_layout(height=700,                                                                # update size, title and axis titles of the chart
                  width=1000,
                  title_text="Importances of features (for {})".format(time),
                  xaxis_title="Feature name",
                  yaxis_title="Feature importance (%)")
fig.show()