From Kaggle to Enterprise Machine Learning

“Kaggle interacts with business process”

On 11 July 2018, Data Science Milan organized an event about Kaggle at Cerved. Kaggle is a platform well known in the data science community, where you can find datasets, learn data science through exercises and compete with other data scientists. And that's not all: if you win, you can earn prize money or even a job!


“Kaggle – State of the art ML”, by Alberto Danese, Cerved

What is Kaggle?

It’s the biggest predictive modelling competition platform in the world, founded in 2010 and acquired by Google in 2017.

Companies come to the platform looking for predictive solutions to some of their problems, and data scientists from all over the world compete to propose the best-performing algorithms.

The platform lets companies recruit the best data scientists and lets researchers try out new technologies: Keras and XGBoost were tested on Kaggle before they became successful, and the same is happening with LightGBM.

How does it work?

Companies make real datasets available on the platform, with anonymized features, split into a training set and a test set. The training set includes the outcome; the test set does not, because it is used to evaluate the predictive models: 20%-30% of it feeds the public leaderboard and the rest feeds the private leaderboard.
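The public/private split described above can be sketched in a few lines of Python. This is only an illustration, not Kaggle's actual implementation: the 25% public share, the accuracy metric, the toy labels and the function names `split_leaderboard` and `accuracy` are all assumptions made for the example.

```python
import random

def split_leaderboard(test_ids, public_share=0.25, seed=0):
    """Randomly assign test rows to the public or the private leaderboard."""
    ids = list(test_ids)
    random.Random(seed).shuffle(ids)  # fixed seed: the split stays the same for every submission
    n_public = round(len(ids) * public_share)
    return set(ids[:n_public]), set(ids[n_public:])

def accuracy(predictions, truth, ids):
    """Share of correct predictions among the given row ids."""
    return sum(predictions[i] == truth[i] for i in ids) / len(ids)

# Toy data: hidden labels kept by the organizer and one participant's submission,
# which is wrong on every tenth row.
truth = {i: i % 2 for i in range(100)}
submission = {i: (i % 2 if i % 10 else 1 - i % 2) for i in range(100)}

public_ids, private_ids = split_leaderboard(truth)
public_score = accuracy(submission, truth, public_ids)    # visible during the competition
private_score = accuracy(submission, truth, private_ids)  # used for the final ranking
```

Because the two scores are computed on disjoint rows, a model tuned too closely to the public score can still drop on the private leaderboard.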

You can build your predictive models in R, Python or Julia and submit the solution as a CSV file; kernels are also available to run your code and share it with everyone.
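As a rough sketch of what producing a submission file can look like in Python: the column names `id` and `prediction` and the helper `write_submission` are hypothetical, since each competition defines its own required format.

```python
import csv
import io

def write_submission(predictions, out):
    """Write (row id, predicted value) pairs as a two-column CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "prediction"])  # hypothetical header, check the competition rules
    for row_id, value in sorted(predictions.items()):
        writer.writerow([row_id, f"{value:.4f}"])

# An in-memory buffer stands in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
write_submission({3: 0.071, 1: 0.9123, 2: 0.25}, buffer)
submission_text = buffer.getvalue()
```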

A Kaggler's job is focused on the machine learning activity, whereas a data scientist works across a wider process that includes machine learning but runs from the definition of the problem, through the identification of data and algorithms and the engineering of the solution with pipelines and deployment, up to storytelling.

Is it worth it?

In Alberto Danese's opinion, Kaggle is worth trying because it is a very good data science platform where you can learn machine learning, try out solutions, understand what works and what doesn't, look at the available code, and so on. The only drawback is that it takes time, because you compete with Kagglers from all over the world.

A repository of past challenges, "Winning solutions of Kaggle competitions", was recently published.

Look at the video.




“Credit Scoring – ML in a regulated environment”, by Giovanni Tessiore, Cerved

The second talk presented a business case on how machine learning is applied in the real world: credit scoring at the Cerved rating agency.

Credit scoring is a statistical model that combines several financial characteristics to assess the default risk of an enterprise, summarizing a customer's creditworthiness in a single score.

It works within a regulated framework, Basel II/III: an internationally agreed set of measures developed by the Basel Committee on Banking Supervision regarding banks' capital requirements, under which banks must set aside capital in proportion to the risk they take on, as assessed by a rating tool.

Basel II/III is structured in “three pillars”: minimum capital requirements, supervisory review and market discipline.

Pillar I provides three approaches to evaluating credit risk: standardized, foundation and advanced.

In the first approach, banks don’t develop any internal model and use ratings from external agencies to determine the minimum capital requirements; in the third approach, banks develop an internal model to estimate the expected loss (EL).


The Expected Loss is the amount expected to be lost on a credit risk exposure within a one-year timeframe. It is built from the following three components.

PD: Probability of Default provides a likelihood assessment that a counterparty will be unable to pay back its debt obligations within a specified timeframe.

EAD: Exposure At Default is the expected outstanding amount following a default by a counterparty, taking into account any credit risk mitigation, drawn balances, undrawn amounts of commitments and contingent exposures.

LGD: Loss Given Default is the estimated loss on an exposure following a default by the counterparty. It’s the share of an asset that is lost when a borrower defaults. The recovery rate is defined as (1-LGD), the share of an asset that is recovered when a borrower defaults.
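These three components are commonly combined as EL = PD × EAD × LGD. The text above does not spell the formula out, so take the following as the standard textbook relation rather than the speaker's exact formulation; the function names and the loan figures are purely illustrative.

```python
def expected_loss(pd, ead, lgd):
    """Expected loss as the product PD x EAD x LGD."""
    if not 0.0 <= pd <= 1.0 or not 0.0 <= lgd <= 1.0:
        raise ValueError("PD and LGD must lie in [0, 1]")
    if ead < 0:
        raise ValueError("EAD must be non-negative")
    return pd * ead * lgd

def recovery_rate(lgd):
    """Share of the exposure recovered after a default, i.e. 1 - LGD."""
    return 1.0 - lgd

# Illustrative loan: 2% probability of default, 100,000 exposure, 45% loss given default.
loan_el = expected_loss(pd=0.02, ead=100_000, lgd=0.45)
```

For this made-up loan the expected loss is about 900, that is 0.9% of the exposure.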

The advanced approach also requires the Unexpected Loss, calculated through formulas provided by the supervisory regulator.

The output of the model is a master scale of rating classes, each linked to a probability of default.
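A master scale can be pictured as a lookup from a PD to a rating class. The class labels and PD boundaries below are hypothetical, chosen only to show the mechanism; real master scales are defined by each institution.

```python
from bisect import bisect_left

# Hypothetical master scale: upper PD bound (inclusive) of each class, best to worst.
UPPER_BOUNDS = [0.001, 0.005, 0.02, 0.08, 0.25, 1.0]
CLASSES = ["A", "B", "C", "D", "E", "F"]

def rating_class(pd):
    """Return the first class whose upper PD bound is not below the given PD."""
    if not 0.0 <= pd <= 1.0:
        raise ValueError("PD must lie in [0, 1]")
    return CLASSES[bisect_left(UPPER_BOUNDS, pd)]

ratings = [rating_class(p) for p in (0.0005, 0.001, 0.03, 0.5)]
```

Keeping the bounds sorted lets `bisect_left` find the class with a binary search instead of a chain of comparisons.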

As in Kaggle competitions, the goal is to use a machine learning model to estimate the probability of default, with accuracy/AUC as the evaluation metric. But while in a Kaggle competition you only need to optimize that metric, in the real world you must also respect rules defined by the regulator: a calibrated PD in an appropriate range, a robust model, the same parameters used to evaluate all counterparties, a transparent and understandable model, and good-quality data.
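The difference between optimizing AUC and delivering a calibrated PD can be shown with a small Python sketch. The ten labels and PDs are invented toy values, and `calibration_gap` is just one simple way (an assumption of this example) to compare the average predicted PD with the observed default rate.

```python
def auc(scores, labels):
    """Probability that a random defaulter scores higher than a random non-defaulter."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count as half
    return wins / (len(pos) * len(neg))

def calibration_gap(scores, labels):
    """Mean predicted PD minus the observed default rate (0 means calibrated on average)."""
    return sum(scores) / len(scores) - sum(labels) / len(labels)

# Toy portfolio: 1 marks a default. The PDs average 0.20, matching the 20% default rate.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
pds = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.38, 0.60]

# A monotone transform keeps the ranking (same AUC) but inflates the PD level.
inflated = [p ** 0.5 for p in pds]

same_ranking = auc(pds, labels) == auc(inflated, labels)
gap_original = calibration_gap(pds, labels)
gap_inflated = calibration_gap(inflated, labels)
```

Both sets of scores would rank identically in an AUC-based competition, yet only the first one gives PDs whose average matches the observed default rate.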

In a regulated sector, unsupervised machine learning (cluster analysis, correlation analysis, component analysis) can be used to decide how many models to build for each market segment.

Feature selection is used to evaluate the variables and tailor the model, and supervised machine learning can be used as a benchmark to improve the model's performance.

Both traditional and modern approaches, including econometric ones, can be used to define the PD master scale and its calibration.

Look at the video.


Author: Claudio Giancaterino

Actuary & Data Science Enthusiast
