Portfolio wind production

Problem Description

In this solution template we describe forecasting of the aggregated production of multiple wind farms. There are two possible cases; this template covers the second one.

  1. The production of each individual wind farm is available, and the production of the whole portfolio is the sum of the individual productions. In this case an individual model and forecast can be generated for each wind farm (see Single Asset Wind Forecasting) and then summed up to obtain the production forecast of the whole portfolio. With this approach it is easy to account for scheduled maintenance of individual wind farms or to adapt to changes in portfolio size, because only active wind farms are summed up.
  2. Only the production of the whole portfolio is available. In this case a single model is generated for the whole portfolio, taking the locations of the individual wind farms and the meteorological situation there into account. This is the scenario described in this solution template. Typical horizons in wind production forecasting range from a few hours ahead to a few days ahead (usually up to 36 or 48 hours). The data sampling rate is mostly 15 minutes, 30 minutes or hourly. Typical metrics used to evaluate forecast quality are MAE and RMSE; to express the error in percent, nMAE, nRMSE, rMAE or rRMSE are used.
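The normalized metrics mentioned above scale the absolute error by a reference value, typically the installed capacity (nMAE, nRMSE) or the mean production (rMAE, rRMSE). A minimal sketch of these metrics in NumPy; the function name and the exact normalization conventions are our own:

```python
import numpy as np

def error_metrics(actual, forecast, installed_capacity):
    """Compute MAE, RMSE and their normalized/relative variants (in %)."""
    errors = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    return {
        'MAE': mae,
        'RMSE': rmse,
        'nMAE (%)': 100 * mae / installed_capacity,    # normalized by installed capacity
        'nRMSE (%)': 100 * rmse / installed_capacity,
        'rMAE (%)': 100 * mae / np.mean(actual),       # relative to mean production
        'rRMSE (%)': 100 * rmse / np.mean(actual),
    }

# toy numbers: three half-hours of actual vs forecasted production (MW)
metrics = error_metrics([100, 200, 300], [110, 190, 280], installed_capacity=500)
```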

Data Recommendation Template

It is essential for wind production forecasting to have a good wind speed forecast. The most important forecasts are wind speed and wind direction at the hub height of the individual wind turbines. The key is to find GPS coordinates that represent the portfolio best; this task does not have an exact solution. We recommend taking the locations of individual wind farms and their installed capacities into account. Our best practice for finding the GPS coordinates for meteo data is to cluster the wind farm locations weighted by their installed capacity and use the centroids of these clusters as the GPS coordinates for the meteo data. The number of clusters that gives the best results, however, remains an open question. Best practice is to use historical actuals of the meteo predictors for model building and meteorological forecasts for the out-of-sample validation.
Other meteo predictors such as wind gusts, temperature, irradiation and pressure may improve the models; they are recommended only for further fine-tuning. In general, the typical situation is that the portfolio changes over time. Our current best practice is to choose the last stable part of the history for training.
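The capacity-weighted clustering described above can be sketched as a plain k-means in which each centroid is the capacity-weighted mean of its member farms. The farm locations and installed capacities below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical wind farm locations (lat, lon) and installed capacities in MW
locations = np.array([
    [57.5, -4.2], [57.1, -3.9], [55.9, -3.2],
    [53.4, -1.5], [53.0, -1.1], [51.6, -3.8],
])
capacity = np.array([120.0, 80.0, 60.0, 150.0, 90.0, 110.0])

def weighted_kmeans(points, weights, k, iterations=50):
    """Plain k-means where centroids are weighted means of their members."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # assign each farm to its nearest centroid
        labels = np.argmin(((points[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            members = labels == j
            if members.any():
                # capacity-weighted mean pulls centroids toward large farms
                centroids[j] = np.average(points[members], axis=0, weights=weights[members])
    return centroids

meteo_coordinates = weighted_kmeans(locations, capacity, k=3)  # GPS points for meteo data
```

The centroids in `meteo_coordinates` are the coordinates at which meteo data would be requested.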

TIM Setup

TIM requires no setup of its mathematical internals and works well in business user mode. All that is required from the user is to tell TIM the forecasting routine and the desired prediction horizon. TIM can automatically learn that there is no weekly pattern; in some cases, however (e.g. short datasets), this can be difficult to learn, and therefore we recommend switching off the weekday dictionary.

Demo using Python API Client

Set up Python Libraries

In [1]:
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json

import tim_client

Credentials and logging

(Do not forget to fill in your credentials in the credentials.json file)
In [2]:
with open('credentials.json') as f:
    credentials_json = json.load(f)                     # loading the credentials from credentials.json

TIM_URL = 'https://timws.tangent.works/v4/api'          # URL to which the requests are sent

SAVE_JSON = False                                       # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/'                            # folder where the requests and responses are stored

LOGGING_LEVEL = 'INFO'
In [3]:
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
In [4]:
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)

api_client.save_json = SAVE_JSON
api_client.json_saving_folder_path = JSON_SAVING_FOLDER
[INFO] 2020-10-29 22:43:49,177 - tim_client.api_client:save_json:74 - Saving JSONs functionality has been disabled
[INFO] 2020-10-29 22:43:49,179 - tim_client.api_client:json_saving_folder_path:89 - JSON destination folder changed to logs

Specify configuration

In this example we simulate a day-ahead scenario. Each day at 09:15 we wish to have a forecast for each half-hour up until the end of the next day, so we set "predictionTo" to 77 samples. The model is built using the range between 2018-01-01 00:00:00 and 2019-06-30 23:30:00. Out-of-sample forecasts are made in the range between 2019-07-01 00:00:00 and 2019-08-14 09:00:00 (the last 2131 samples). To get better insights from our model we also request extended importances and prediction intervals.

In [5]:
configuration_backtest = {
    'usage': {
        'predictionTo': {
            'baseUnit': 'Sample',                # units that are used for specifying the prediction horizon length (one of 'Day', 'Hour', 'QuarterHour', 'Sample')
            'offset': 77                       # number of units we want to predict into the future (77 half-hour samples, i.e. up to the end of the next day)
        },
        'backtestLength': 2131                 # number of samples that are used for backtesting (note that these samples are excluded from model building period)
    },
    "predictionIntervals": {
        "confidenceLevel": 90                  # confidence level of the prediction intervals (in %)
    },
    'extendedOutputConfiguration': {
        'returnExtendedImportances': True      # flag that specifies if the importances of features are returned in the response
    }
}
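The 77-sample offset follows from the half-hourly sampling rate: the forecast covers every half-hour after the last known target value (09:00) up to 23:30 of the next day. A quick sanity check of this arithmetic with pandas, using illustrative timestamps from the backtest period:

```python
import pandas as pd

last_target = pd.Timestamp('2019-08-14 09:00:00')   # last known target value
horizon_end = pd.Timestamp('2019-08-15 23:30:00')   # end of the next day

# number of half-hourly samples between the two timestamps
offset = int((horizon_end - last_target) / pd.Timedelta(minutes=30))
print(offset)   # 77
```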

Data description

Dataset used in this example has half-hourly sampling rate and contains data from 2018-01-01 00:00:00 to 2019-08-15 23:30:00.

Target

Data used in this example are from the UK. Production data are available and can be downloaded from the web page https://www2.bmreports.com/bmrs/?q=generation/windforcast/out-turn. The sum of the production of all wind farms in the UK is our target; it is the second column in the CSV file, right after the timestamp column, and is named Quantity. The data have half-hourly granularity.

Predictor candidates

We use 10 GPS coordinates across the UK. The predictors are wind speed at heights of 100 m and 120 m and wind direction at a height of 100 m for each of the 10 GPS coordinates. In this demo we use historical actuals for model building and meteo forecasts for out-of-sample forecasting. The CSV file contains historical actuals merged with forecasts of the meteo predictors: up to the timestamp 2019-06-30 23:30:00 it contains historical actuals (used for model building), and from 2019-07-01 00:00:00 onward it contains meteo forecasts (used for out-of-sample validation).

Forecasting scenario

We simulate a day-ahead scenario – each day at 09:15 we want to forecast the target until the end of the next day. We assume that values of all predictors are available till the end of the next day (the end of the prediction horizon). The last value of the target is from 09:00. To let TIM know that this is how it would be used in production, we can simply use the dataset in a form that represents the real situation (as can be seen in the view below – notice the NaN values representing the missing target data for the period we wish to forecast). Since the out-of-sample part of the dataset contains meteorological forecasts rather than actuals, the validation is representative of what can be achieved in production.
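Preparing the dataset in this form amounts to blanking out the target after the forecast origin while keeping the predictor columns filled. A sketch with pandas; the toy frame and cut-off timestamp below are illustrative, with column names following the demo dataset:

```python
import numpy as np
import pandas as pd

# toy half-hourly frame standing in for the real data.csv
index = pd.date_range('2019-08-14 08:00:00', periods=6, freq='30min')
df = pd.DataFrame({
    'Date': index,
    'Quantity': [8100.0, 8150.0, 8190.0, 8230.0, 8260.0, 8300.0],
    'GPS1_wind_speed_100m_ms': [4.1, 4.3, 4.2, 4.4, 4.6, 4.5],
})

forecast_origin = pd.Timestamp('2019-08-14 09:00:00')   # last known target value

# keep the predictor forecasts but blank out the target after the forecast origin
df.loc[df['Date'] > forecast_origin, 'Quantity'] = np.nan
```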

In [6]:
data = tim_client.load_dataset_from_csv_file('data.csv', sep=',')                                  # loading data from data.csv
data                                                                                               # quick look at the data
Out[6]:
Date Quantity GPS1_wind_speed_100m_ms GPS1_wind_speed_120m_ms GPS1_wind_dir_100m_d GPS2_wind_speed_100m_ms GPS2_wind_speed_120m_ms GPS2_wind_dir_100m_d GPS3_wind_speed_100m_ms GPS3_wind_speed_120m_ms ... GPS7_wind_dir_100m_d GPS8_wind_speed_100m_ms GPS8_wind_speed_120m_ms GPS8_wind_dir_100m_d GPS9_wind_speed_100m_ms GPS9_wind_speed_120m_ms GPS9_wind_dir_100m_d GPS10_wind_speed_100m_ms GPS10_wind_speed_120m_ms GPS10_wind_dir_100m_d
0 2018-01-01 00:00:00 8596.0 4.4 4.4 221.4 17.0 17.1 250.9 10.4 10.3 ... 252.1 12.3 12.7 275.9 20.6 20.9 257.9 12.2 12.2 259.8
1 2018-01-01 00:30:00 8750.0 4.5 4.5 218.3 17.4 17.6 245.8 9.3 9.2 ... 249.8 11.2 11.6 266.5 19.0 19.3 257.9 11.3 11.3 259.8
2 2018-01-01 01:00:00 8631.0 4.6 4.6 215.4 17.9 18.1 241.0 8.2 8.1 ... 247.5 10.1 10.5 254.9 17.4 17.7 257.8 10.4 10.4 259.7
3 2018-01-01 01:30:00 8595.0 3.7 3.7 216.4 17.8 18.0 239.8 8.4 8.3 ... 244.8 9.7 10.1 246.3 16.6 16.9 257.0 10.2 10.2 252.6
4 2018-01-01 02:00:00 8437.0 2.7 2.7 218.0 17.8 17.9 238.5 8.6 8.6 ... 242.0 9.2 9.6 236.8 15.9 16.1 256.0 10.0 10.0 245.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28411 2019-08-15 21:30:00 NaN 4.2 4.2 208.1 10.1 10.8 212.3 4.3 4.4 ... 213.6 4.1 4.4 240.9 8.8 8.9 290.8 7.9 7.9 233.4
28412 2019-08-15 22:00:00 NaN 4.3 4.3 209.1 10.2 11.0 212.1 6.6 6.6 ... 209.5 4.4 4.7 234.5 8.5 8.5 290.4 8.1 8.1 224.2
28413 2019-08-15 22:30:00 NaN 4.5 4.5 207.2 11.8 12.6 214.5 5.5 5.6 ... 208.1 4.9 5.2 216.4 8.3 8.3 291.6 8.8 8.8 210.9
28414 2019-08-15 23:00:00 NaN 4.8 4.8 205.4 13.4 14.2 216.4 4.4 4.5 ... 206.8 5.5 5.7 201.9 8.2 8.2 292.8 9.6 9.6 199.8
28415 2019-08-15 23:30:00 NaN 4.7 4.7 207.7 13.6 14.3 217.1 6.0 6.1 ... 202.5 6.5 6.7 187.7 7.5 7.4 290.6 10.9 10.9 200.1

28416 rows × 32 columns

Run TIM

In [7]:
backtest = api_client.prediction_build_model_predict(data, configuration_backtest)                 # running the RTInstantML forecasting using data and defined configuration
backtest.status                                                                                    # status of the job
Out[7]:
'Finished'

Visualize backtesting

In [8]:
fig = plt.subplots.make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.02)      # plot initialization

fig.add_trace(go.Scatter(x = data.loc[:, "Date"], y=data.loc[:, "Quantity"],
                         name = "target", line=dict(color='black')), row=1, col=1)              # plotting the target variable

fig.add_trace(go.Scatter(x = backtest.prediction.index,
                         y = backtest.prediction.loc[:, 'Prediction'],
                         name = "production forecast",
                         line = dict(color='purple')), row=1, col=1)                            # plotting production prediction

fig.add_trace(go.Scatter(x = backtest.prediction_intervals_upper_values.index,
                         y = backtest.prediction_intervals_upper_values.loc[:, 'UpperValues'],
                         marker = dict(color="#444"),
                         line = dict(width=0),
                         showlegend = False), row=1, col=1)
fig.add_trace(go.Scatter(x = backtest.prediction_intervals_lower_values.index,
                         y = backtest.prediction_intervals_lower_values.loc[:, 'LowerValues'],
                         fill = 'tonexty',
                         line = dict(width=0),
                         showlegend = False), row=1, col=1)                                     # plotting confidence intervals

fig.add_trace(go.Scatter(x = backtest.aggregated_predictions[0]['values'].index,
                         y = backtest.aggregated_predictions[0]['values'].loc[:, 'Prediction'],
                         name = "in-sample MAE: " + str(round(backtest.aggregated_predictions[0]['accuracyMetrics']['MAE'], 2)),
                         line=dict(color='goldenrod')), row=1, col=1)                           # plotting in-sample prediction

fig.add_trace(go.Scatter(x = backtest.aggregated_predictions[1]['values'].index,
                         y = backtest.aggregated_predictions[1]['values'].loc[:, 'Prediction'],
                         name = "out-of-sample MAE: " + str(round(backtest.aggregated_predictions[1]['accuracyMetrics']['MAE'], 2)),
                         line = dict(color='red')), row=1, col=1)                               # plotting out-of-sample prediction

fig.add_trace(go.Scatter(x = data.loc[:, "Date"], y=data.loc[:, "GPS1_wind_speed_100m_ms"],
                         name = "GPS1_wind_speed_100m_ms", line=dict(color='forestgreen')), row=2, col=1)   # plotting the predictor GPS1_wind_speed_100m_ms

fig.update_layout(height=600, width=1000,
                  title_text="Backtesting, modelling difficulty: "
                  + str(round(backtest.data_difficulty, 2)) + "%" )                             # update size and title of the plot

fig.show()

Visualize predictor and feature importances

In [9]:
simple_importances = backtest.predictors_importances['simpleImportances']                                                                # get predictor importances
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True)                                           # sort by importance
extended_importances = backtest.predictors_importances['extendedImportances']                                                            # get feature importances
extended_importances = sorted(extended_importances, key = lambda i: i['importance'], reverse=True)                                       # sort by importance

si_df = pd.DataFrame(index=np.arange(len(simple_importances)), columns = ['predictor name', 'predictor importance (%)'])                 # initialize predictor importances dataframe
ei_df = pd.DataFrame(index=np.arange(len(extended_importances)), columns = ['feature name', 'feature importance (%)', 'time', 'type'])   # initialize feature importances dataframe
In [10]:
for (i, si) in enumerate(simple_importances):
    si_df.loc[i, 'predictor name'] = si['predictorName']                   # get predictor name
    si_df.loc[i, 'predictor importance (%)'] = si['importance']            # get importance of the predictor

for (i, ei) in enumerate(extended_importances):
    ei_df.loc[i, 'feature name'] = ei['termName']                          # get feature name
    ei_df.loc[i, 'feature importance (%)'] = ei['importance']              # get importance of the feature
    ei_df.loc[i, 'time'] = ei['time']                                      # get time of the day to which the feature corresponds
    ei_df.loc[i, 'type'] = ei['type']                                      # get type of the feature
In [11]:
si_df.head()                                                               # predictor importances data frame
Out[11]:
predictor name predictor importance (%)
0 GPS2_wind_speed_120m_ms 17.15
1 Quantity 15.35
2 GPS5_wind_speed_120m_ms 11.49
3 GPS6_wind_speed_120m_ms 9.81
4 GPS9_wind_speed_120m_ms 7.13
In [12]:
fig = go.Figure(go.Bar(x=si_df['predictor name'], y=si_df['predictor importance (%)']))      # plot the bar chart
fig.update_layout(height=400,                                                                # update size, title and axis titles of the chart
                  width=600,
                  title_text="Importances of predictors",
                  xaxis_title="Predictor name",
                  yaxis_title="Predictor importance (%)")
fig.show()
In [13]:
ei_df.head()                                                               # first few of the feature importances
Out[13]:
feature name feature importance (%) time type
0 Quantity(t-4) 38 [4] TargetAndTargetTransformation
1 Quantity(t-3) 34.33 [3] TargetAndTargetTransformation
2 Quantity(t-2) 33.7 [2] TargetAndTargetTransformation
3 Quantity(t-5) 33.7 [5] TargetAndTargetTransformation
4 Quantity(t-7) 32.47 [7] TargetAndTargetTransformation
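Beyond inspecting a single timestamp, the extended importances can also be aggregated by feature type to see how much of the model rests on target lags versus meteo predictors. A minimal sketch with mock records shaped like the entries in TIM's response (the 'Predictor' type label here is illustrative):

```python
import pandas as pd

# mock extended importances shaped like the entries returned by TIM
extended_importances = [
    {'termName': 'Quantity(t-4)', 'importance': 38.0, 'time': '[4]', 'type': 'TargetAndTargetTransformation'},
    {'termName': 'Quantity(t-3)', 'importance': 34.33, 'time': '[3]', 'type': 'TargetAndTargetTransformation'},
    {'termName': 'GPS2_wind_speed_120m_ms(t)', 'importance': 17.15, 'time': '[1]', 'type': 'Predictor'},
    {'termName': 'GPS5_wind_speed_120m_ms(t)', 'importance': 11.49, 'time': '[1]', 'type': 'Predictor'},
]

ei_df = pd.DataFrame(extended_importances)

# total importance contributed by each feature type, largest first
by_type = ei_df.groupby('type')['importance'].sum().sort_values(ascending=False)
```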
In [14]:
time = '[1]'                                                                            # time for which the feature importances are visualized
fig = go.Figure(go.Bar(x=ei_df[ei_df['time'] == time]['feature name'],                       # plot the bar chart
                       y=ei_df[ei_df['time'] == time]['feature importance (%)']))
fig.update_layout(height=700,                                                                # update size, title and axis titles of the chart
                  width=1000,
                  title_text="Importances of features (for {}-sample ahead forecast)".format(time),
                  xaxis_title="Feature name",
                  yaxis_title="Feature importance (%)")
fig.show()