Given the huge volume of payment card transactions (debit or credit) processed every day around the globe, recognizing suspicious transactions by hand is practically impossible, so assistance from ML/AI in this use case is inevitable.
From an ML perspective, this problem can be framed as a classification problem.
Business objective: Increase security of card transactions
Value: Decrease costs associated with fraudulent transactions
KPI: Lower cost in a given timeframe
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json
import datetime
from sklearn.metrics import confusion_matrix, recall_score, precision_score
import scikitplot as skplt
import tim_client
Credentials and logging
(Do not forget to fill in your credentials in the credentials.json file)
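The structure of credentials.json is not shown here; based on how it is read below, it is expected to contain a license key, e-mail and password (placeholder values):
{
    "license_key": "YOUR_LICENSE_KEY",
    "email": "YOUR_EMAIL",
    "password": "YOUR_PASSWORD"
}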
with open('credentials.json') as f:
credentials_json = json.load(f) # loading the credentials from credentials.json
TIM_URL = 'https://timws.tangent.works/v4/api' # URL to which the requests are sent
SAVE_JSON = False # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/' # folder where the requests and responses are stored
LOGGING_LEVEL = 'INFO'
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)
api_client.save_json = SAVE_JSON
api_client.json_saving_folder_path = JSON_SAVING_FOLDER
[INFO] 2021-02-16 09:00:27,996 - tim_client.api_client:save_json:66 - Saving JSONs functionality has been disabled
[INFO] 2021-02-16 09:00:27,997 - tim_client.api_client:json_saving_folder_path:75 - JSON destination folder changed to logs
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It is highly imbalanced.
"Class" is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
Even though the data contains a timestamp sampled at regular intervals, it does not represent the real time of the transaction; in our case it serves as an "index" for each record so the data can be ingested by the TIM engine and properly evaluated during back-testing.
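For illustration, this is roughly how such an artificial, regularly sampled index could be attached to a plain transactions table before sending it to TIM; the file name, start date and frequency below are made up, and the published CSV files already contain the index:
import pandas as pd
raw = pd.read_csv('transactions.csv')  # hypothetical file without a usable timestamp
raw.insert(0, 'Time', pd.date_range('2011-12-01', periods=len(raw), freq='min'))  # artificial, regularly sampled index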
Column name | Description | Type | Availability |
---|---|---|---|
Time | Index in the form of a timestamp | Timestamp column | |
Class | Binary target (0 or 1) | Target | Sample+0 |
V1...V28 | Results of a PCA transformation; due to confidentiality, the original features and background information about the data were not provided | Predictor | Sample+1 |
Amount | Transaction amount | Predictor | Sample+1 |
If we want TIM to do classification, the very last record of the target must be kept empty (NaN/None). TIM will use all available predictors to classify that record. Furthermore, this situation is replicated to calculate results for all out-of-sample records during back-testing.
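A minimal sketch of that preparation step, for a pandas DataFrame that uses this notebook's column names (the CSV files used below already come with the last Class value left empty):
import numpy as np
import pandas as pd
def leave_last_target_empty( df: pd.DataFrame, target_column: str = 'Class' ) -> pd.DataFrame:
    # blank the last target value so TIM treats that record as the one to classify
    df = df.copy()
    df.loc[ df.index[-1], target_column ] = np.nan
    return df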
The dataset was split into several CSV files that can be downloaded here.
The original dataset was collected by the Machine Learning Group of ULB (Université Libre de Bruxelles) and published on Kaggle.
datasets = ['data1.csv','data2.csv','data3.csv']
data = tim_client.load_dataset_from_csv_file( datasets[2], sep=',')
data.tail()
Time | Class | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
94930 | 2012-01-19 23:08:00 | 0.0 | 2.039560 | -0.175233 | -1.196825 | 0.234580 | -0.008713 | -0.726571 | 0.017050 | -0.118228 | ... | -0.256922 | -0.268048 | -0.717211 | 0.297930 | -0.359769 | -0.315610 | 0.201114 | -0.080826 | -0.075071 | 2.68 |
94931 | 2012-01-19 23:09:00 | 0.0 | 0.120316 | 0.931005 | -0.546012 | -0.745097 | 1.130314 | -0.235973 | 0.812722 | 0.115093 | ... | 0.000676 | -0.314205 | -0.808520 | 0.050343 | 0.102800 | -0.435870 | 0.124079 | 0.217940 | 0.068803 | 2.69 |
94932 | 2012-01-19 23:10:00 | 0.0 | -11.881118 | 10.071785 | -9.834783 | -2.066656 | -5.364473 | -2.606837 | -4.918215 | 7.305334 | ... | 1.475829 | 0.213454 | 0.111864 | 1.014480 | -0.509348 | 1.436807 | 0.250034 | 0.943651 | 0.823731 | 0.77 |
94933 | 2012-01-19 23:11:00 | 0.0 | -0.732789 | -0.055080 | 2.035030 | -0.738589 | 0.868229 | 1.058415 | 0.024330 | 0.294869 | ... | 0.059616 | 0.214205 | 0.924384 | 0.012463 | -1.016226 | -0.606624 | -0.395255 | 0.068472 | -0.053527 | 24.79 |
94934 | 2012-01-19 23:12:00 | NaN | 1.919565 | -0.301254 | -3.249640 | -0.557828 | 2.630515 | 3.031260 | -0.296827 | 0.708417 | ... | 0.001396 | 0.232045 | 0.578229 | -0.037501 | 0.640134 | 0.265745 | -0.087371 | 0.004455 | -0.026561 | 67.88 |
5 rows × 31 columns
data.shape
(94935, 31)
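Only the third file is used in this example. If the splits were meant to be analysed together, they could in principle be loaded with the same helper and concatenated (a sketch, assuming the files share identical columns and consecutive timestamps):
frames = [ tim_client.load_dataset_from_csv_file( f, sep=',' ) for f in datasets ]
full_data = pd.concat( frames, ignore_index=True )
full_data.shape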
target_column = 'Class'
timestamp_column = 'Time'
The parameters that need to be set are the prediction horizon (predictionTo), the back-test length (backtestLength), and allowOffsets.
We also ask the engine for additional output to see details of sub-models, so we define the extendedOutputConfiguration parameter as well.
30% of the data will be used for the out-of-sample interval.
backtest_length = int( data.shape[0] * .3 )
backtest_length
28480
configuration_backtest = {
'usage': {
'predictionTo': {
'baseUnit': 'Sample',
'offset': 1
},
'backtestLength': backtest_length
},
'allowOffsets': False,
'extendedOutputConfiguration': {
'returnExtendedImportances': True
}
}
Proportion of classes for the in-sample interval.
data.iloc[:-backtest_length][ target_column ].value_counts()
0.0    66355
1.0      100
Name: Class, dtype: int64
Proportion of classes for the out-of-sample interval.
data.iloc[-backtest_length:][ target_column ].value_counts()
0.0    28457
1.0       22
Name: Class, dtype: int64
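To put the imbalance into a single number, the fraud rate of each interval can be derived from the counts above:
in_sample_counts = data.iloc[:-backtest_length][ target_column ].value_counts()
out_of_sample_counts = data.iloc[-backtest_length:][ target_column ].value_counts()
print( 'In-sample fraud rate: {:.4%}'.format( in_sample_counts.get( 1.0, 0 ) / in_sample_counts.sum() ) )
print( 'Out-of-sample fraud rate: {:.4%}'.format( out_of_sample_counts.get( 1.0, 0 ) / out_of_sample_counts.sum() ) )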
backtest = api_client.prediction_build_model_predict(data, configuration_backtest)
backtest.status
'Finished'
backtest.result_explanations
[]
out_of_sample_predictions = backtest.aggregated_predictions[1]['values'] # 1 points to the out-of-sample interval
out_of_sample_predictions.rename( columns = {'Prediction': target_column+'_pred'}, inplace=True)
out_of_sample_timestamps = out_of_sample_predictions.index.tolist()
evaluation_data = data.copy()
evaluation_data[ timestamp_column ] = pd.to_datetime( data[ timestamp_column ] ).dt.tz_localize('UTC')
evaluation_data = evaluation_data[ evaluation_data[ timestamp_column ].isin( out_of_sample_timestamps ) ]
evaluation_data.set_index( timestamp_column ,inplace=True)
evaluation_data = evaluation_data[ [ target_column ] ]
def encode_class( x ):
    # threshold the predicted probability at 0.5 to get a binary class
    if x < .5: return 0
    return 1
evaluation_data = evaluation_data.join( out_of_sample_predictions )
evaluation_data[ target_column+'_pred_p' ] = evaluation_data[ target_column+'_pred' ]
evaluation_data[ target_column+'_pred' ] = evaluation_data[ target_column+'_pred' ].apply( encode_class )
evaluation_data
Class | Class_pred | Class_pred_p | |
---|---|---|---|
Time | |||
2011-12-31 04:32:00+00:00 | 0.0 | 0 | 0.000170 |
2011-12-31 04:33:00+00:00 | 0.0 | 0 | 0.002300 |
2011-12-31 04:34:00+00:00 | 0.0 | 0 | 0.003942 |
2011-12-31 04:35:00+00:00 | 0.0 | 0 | 0.002690 |
2011-12-31 04:36:00+00:00 | 0.0 | 0 | 0.000556 |
... | ... | ... | ... |
2012-01-19 23:07:00+00:00 | 0.0 | 0 | 0.002937 |
2012-01-19 23:08:00+00:00 | 0.0 | 0 | 0.000000 |
2012-01-19 23:09:00+00:00 | 0.0 | 0 | 0.000000 |
2012-01-19 23:10:00+00:00 | 0.0 | 0 | 0.000000 |
2012-01-19 23:11:00+00:00 | 0.0 | 0 | 0.000000 |
28480 rows × 3 columns
evaluation_data[target_column].value_counts()
0.0    28458
1.0       22
Name: Class, dtype: int64
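With actual and predicted classes aligned in evaluation_data, the metrics imported at the top of the notebook can be applied directly; a short sketch using the columns created above:
y_true = evaluation_data[ target_column ].astype( int )
y_pred = evaluation_data[ target_column+'_pred' ]
print( confusion_matrix( y_true, y_pred ) )
print( 'Recall:', recall_score( y_true, y_pred ) )
print( 'Precision:', precision_score( y_true, y_pred ) )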
Simple and extended importances are available so you can see to what extent each predictor contributes to explaining the variance of the target variable.
simple_importances = backtest.predictors_importances['simpleImportances']
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True)
simple_importances = pd.DataFrame.from_dict( simple_importances )
# simple_importances
fig = go.Figure()
fig.add_trace(go.Bar( x = simple_importances['predictorName'],
y = simple_importances['importance'] )
)
fig.update_layout(
title='Simple importances'
)
fig.show()
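Because returnExtendedImportances was set to True in the configuration, a per-model breakdown should be available as well; a sketch, assuming it is exposed under the 'extendedImportances' key of predictors_importances:
extended_importances = backtest.predictors_importances.get( 'extendedImportances', [] )
extended_importances = pd.DataFrame.from_dict( extended_importances )
extended_importances.head()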