The post 3D Point Cloud Analysis using Deep Learning appeared first on Data Science Milan.

On 17th October 2018 at Buildo, Data Science Milan organized an event about 3D image processing. Deep learning on 2D images has achieved good results on classification tasks, thanks to Convolutional Neural Networks and the availability of data. Now the amount of 3D data is growing fast as well.

__“3D Point Cloud Analysis using Deep Learning”, by SK Reddy, Chief Product Officer AI in Hexagon__

Several technologies used to manage 3D point clouds were presented in this talk. But what exactly is a point cloud?

A point cloud is a set of points in a three-dimensional coordinate system. It is a very accurate digital record of an object or space, saved as a large number of points that cover the surfaces of the captured object.

Tasks on point clouds face both neural network challenges (the data are an unstructured grid, unsuited to standard CNN filters; the model must be invariant to permutations of the points; the number of points changes depending on the sensor used) and data challenges (scanned models bring missing data, sensors introduce noise, and rotations of the same object produce different point clouds).

The Octree-based Convolutional Neural Network for 3D shape analysis (O-CNN) is built upon the octree representation of 3D shapes. It takes as input the averaged normal vectors of a sampled 3D model and performs 3D CNN operations only on the octants occupied by the shape surface. O-CNN supports numerous CNN architectures and works for 3D shapes in different representations. Check out the GitHub repository.

The architectural approach of PointNet is the use of a single symmetric function: max pooling. The network learns a set of optimization functions/criteria that select informative points of the point cloud and encode the reason for their selection. The last fully connected layers of the network aggregate these learnt optimal values into a global descriptor for the entire shape (shape classification), or they are used to predict per-point labels (shape segmentation). Check out the code.
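PointNet's key property, invariance to the order of the input points, comes precisely from that symmetric max-pooling step. It can be illustrated in a few lines (a toy sketch, not the actual PointNet code; the hand-written per-point feature map stands in for the network's shared MLP):

```python
import random

def per_point_features(p):
    # Stand-in for PointNet's shared MLP: map each 3D point to a feature vector.
    x, y, z = p
    return [x + y + z, x * y * z, max(x, y, z)]

def global_descriptor(points):
    # The symmetric function: element-wise max pooling over all point features.
    feats = [per_point_features(p) for p in points]
    return [max(col) for col in zip(*feats)]

cloud = [(0.1, 0.2, 0.3), (0.9, 0.1, 0.4), (0.5, 0.5, 0.5)]
shuffled = cloud[:]
random.shuffle(shuffled)

# The descriptor is identical for any permutation of the input points.
assert global_descriptor(cloud) == global_descriptor(shuffled)
```

Because max pooling ignores ordering, shuffling the points never changes the descriptor, which is exactly what an unordered point cloud requires.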

SPLATNet is based on an architecture that processes point clouds without any pre-processing: it takes point clouds as input and computes hierarchical and spatially-aware features with lattice filters. It also allows easy mapping of 2D information into 3D and vice versa. Apply the code.

Check out the video of the event.

The post Deep Time-to-Failure: predicting failures, churns and customer lifetime with RNN appeared first on Data Science Milan.

On 20th September 2018 at Spirit De Milan, Data Science Milan organized an event as part of IBM #Party Cloud: Deep Time to Failure.

Machinery and customers are assets for companies, and both are subject to failure: breakdowns for machinery and churn for customers.

__“Traditional Survival Analysis”, by Gianmario Spacagna, Chief Scientist at Cubeyou__

Predicting failures requires survival analysis, and in the first part of his talk Gianmario explained the traditional methods for it.

Survival analysis is used to analyse data in which the time until an event is of interest. The response is often referred to as the failure time, survival time or event time.

The survival function S(t) gives the probability that a subject will survive past time t and has the following properties:

-Monotonically decreasing;

-Right-continuous;

-The probability of surviving past time 0 is 1; as time goes to infinity, the survival curve goes to 0.

In theory, the survival function is smooth. In practice, we observe events on a discrete time scale (days, weeks, etc.).

The survival model can be described by the hazard function, h(t), that is the instantaneous rate at which events occur, given no previous events, or by the cumulative hazard function H(t) that describes the accumulated risk up to time t.

Given any one of S(t), H(t) or h(t), it is possible to derive the other two, and from them the time-to-failure, namely the remaining working time of a device or other product.
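For example, for a Weibull model these relationships are S(t) = exp(-H(t)) and h(t) = H'(t); a quick numeric sketch (parameter values are illustrative):

```python
import math

# Weibull example with scale alpha and shape beta (illustrative values).
alpha, beta = 2.0, 1.5

def H(t):   # cumulative hazard
    return (t / alpha) ** beta

def h(t):   # hazard rate, the derivative of H
    return (beta / alpha) * (t / alpha) ** (beta - 1)

def S(t):   # survival function: S(t) = exp(-H(t)), so H(t) = -log S(t)
    return math.exp(-H(t))

t = 1.3
# h matches the numerical derivative of H, and S is recovered from H.
assert abs(h(t) - (H(t + 1e-6) - H(t - 1e-6)) / 2e-6) < 1e-5
assert abs(H(t) + math.log(S(t))) < 1e-12
```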

With incomplete raw data (truncated or censored), naive empirical estimators will not produce good results. In this scenario two techniques are available: the Kaplan-Meier product-limit estimator, which generates a survival distribution function, and the Nelson-Aalen estimator, which generates a cumulative hazard function.
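A minimal pure-Python sketch of the Kaplan-Meier product-limit estimator (a simplified illustration assuming right-censored data only; libraries such as lifelines provide production implementations):

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier product-limit estimate of S(t) at each distinct event time.

    durations: event or censoring time for each subject
    observed:  1 if the event occurred, 0 if the subject was censored
    """
    s, curve = 1.0, []
    for t in sorted({d for d, o in zip(durations, observed) if o}):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        s *= 1.0 - events / at_risk
        curve.append((t, s))
    return curve

# Five subjects; the one with duration 4 is censored, so it leaves the
# risk set without counting as an event.
print(kaplan_meier([2, 3, 4, 5, 6], [1, 1, 0, 1, 1]))
```

Censoring is exactly what makes this estimator necessary: the censored subject still counts in the risk set before time 4, but never as an event.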

The survival distribution can also be estimated by making parametric assumptions: for this task the Weibull distribution was used, as it fits many real-world use cases.

These are examples of univariate analysis, useful when the predictor variable is categorical.

An alternative method is the Cox proportional hazards regression analysis, which works both for quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model can assess simultaneously the effect of several risk factors on survival time. The idea behind the Cox model is to separate the estimation of the heterogeneity parameters on one hand from the baseline hazard function on the other. When the proportional hazards assumption is not satisfied, it is possible to turn to Aalen’s additive model, whose coefficients can be parametric, semiparametric or nonparametric.

__“Time-to-failure using Weibull and Recurrent Neural Network (RNN)”, by Gianmario Spacagna, Chief Scientist at Cubeyou__

In the second part of the talk Gianmario went deep into the wtte-rnn approach (Weibull time-to-event RNN).

In time-to-failure modelling, the Weibull distribution gives a failure rate proportional to a power of time. It is flexible and described by two parameters, α and β: the first is the scale parameter of the distribution and the second is the shape parameter.

-β<1 indicates that the failure rate decreases over time;

-β=1 indicates that the failure rate is constant over time and the distribution reduces to an exponential distribution;

-β>1 indicates that the failure rate increases with time;

-β=2: the distribution reduces to a Rayleigh distribution;

-3.5<β<4: the shape approximates a Gaussian distribution.

The task is to estimate α and β with Recurrent Neural Networks.
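Concretely, the training objective behind this idea is the censoring-aware Weibull log-likelihood; below is a sketch where a crude grid search stands in for the RNN, whose job is to output the α and β that maximize this same likelihood (data and grid values are illustrative):

```python
import math

def weibull_log_lik(t, observed, alpha, beta):
    """Log-likelihood of one (possibly right-censored) observation under
    Weibull(alpha, beta): observed events contribute log h(t) - H(t),
    censored ones only the survival term -H(t)."""
    H = (t / alpha) ** beta                                   # cumulative hazard
    log_h = math.log(beta / alpha) + (beta - 1) * math.log(t / alpha)
    return observed * log_h - H

# Toy data: three observed failure times plus one censored observation.
data = [(1.2, 1), (0.7, 1), (2.5, 1), (3.0, 0)]

def total_ll(alpha, beta):
    return sum(weibull_log_lik(t, o, alpha, beta) for t, o in data)

# A crude grid search stands in for the network: the RNN's job is to output,
# per sequence, the (alpha, beta) that maximize this same likelihood.
grid = [(a / 10, b / 10) for a in range(5, 51) for b in range(5, 51)]
alpha_hat, beta_hat = max(grid, key=lambda p: total_ll(*p))
print(alpha_hat, beta_hat)
```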

Recurrent neural networks are a kind of neural network where outputs from previous time steps are taken as inputs for the current time step; this one-step delay creates a cycle in the network.

RNNs are fit and make predictions over many time steps.

Considering multiple time steps of input (X(t), X(t+1), …), internal state (u(t), u(t+1), …) and output (y(t), y(t+1), …), the cycle can be unfolded: the outputs y(t) and internal states u(t) from the previous time step are passed into the network as inputs for the next time step, and the network does not change between the unfolded time steps. The same weights are used for each time step; only the outputs and the internal states differ.
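The unfolding described above can be written out directly; a hand-rolled toy cell with one scalar input, one state and one output per step (weights are illustrative, and a real RNN uses learned weight matrices):

```python
import math

# The SAME weights are reused at every unfolded time step;
# only the internal state u and the output y change.
w_x, w_u, w_y = 0.5, 0.8, 1.2   # shared weights (illustrative values)

def rnn_forward(xs):
    u, ys = 0.0, []
    for x in xs:                           # one iteration = one unfolded time step
        u = math.tanh(w_x * x + w_u * u)   # new state from input + previous state
        ys.append(w_y * u)                 # output for this time step
    return ys

print(rnn_forward([1.0, 0.0, -1.0]))
```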

Gianmario showed how wtte-rnn works through a practical application: the NASA jet engine degradation dataset.

Read and apply the code from the tutorial.

The post From Kaggle to Enterprise Machine Learning appeared first on Data Science Milan.

On 11th July 2018 at Cerved, Data Science Milan organized an event about Kaggle. Kaggle is a platform well known in the data science community, where you can find datasets, learn data science through exercises and compete with other data scientists; and not only that: if you win, you can gain either money or a job!

__“Kaggle – State of the art ML”, by Alberto Danese, Cerved__

What is Kaggle?

It is the biggest predictive modelling competition platform in the world, founded in 2010 and acquired by Google in 2017.

On this platform, enterprises look for predictive solutions to some of their problems through competitions in which data scientists from all over the world propose the best-performing algorithms.

The platform allows companies to recruit the best scientists, and researchers use it to trial new technologies: Keras and XGBoost were tested on Kaggle before their success, and the same is happening with LightGBM.

How does it work?

Companies make real datasets available on the platform with anonymized features, split into a training set and a test set. The training set includes the outcome, while the test set does not, because it is used to evaluate the predictive models: 20%-30% of it feeds the public leaderboard and the rest the private leaderboard.

You can build your predictive models with the R, Python or Julia programming languages and submit the solution as a CSV file; kernels are also available to run your code and share it with everyone.

The job of a kaggler is focused mostly on the machine learning activity, while a data scientist works in a wider process that embeds machine learning but spans from the definition of the problem to the identification of data and algorithms, the engineering of the solution with pipelines and its deployment, all the way to storytelling.

Is it worth it?

In Alberto Danese's opinion, Kaggle is worth trying because it is a very good data science platform where you can learn machine learning, try solutions, understand what works and what doesn't, look at available code, and so on. The only weakness is that it requires time, because you need to compete with kagglers from all over the world.

Winning solutions of Kaggle competitions, a repository about past challenges, was recently published.

Look at the video.

__“Credit Scoring – ML in a regulated environment”, by Giovanni Tessiore, Cerved__

The second talk showed a business case of machine learning applied in the real world: credit scoring at the Cerved rating agency.

Credit scoring is a statistical model that combines several financial characteristics into a single score to evaluate the default risk of an enterprise and assess a customer's creditworthiness.

It works in a regulated framework: Basel II/III, an internationally agreed set of measures developed by the Basel Committee on Banking Supervision regarding the capital requirements of banks, according to which banks must set aside proportional shares of capital based on the risk assumed, as evaluated by a rating tool.

Basel II/III is structured in “three pillars”: minimum capital requirements, supervisory review and market discipline.

In Pillar I there are three approaches to evaluate credit risk: standard, foundation and advanced.

In the standard approach banks do not develop any internal model and use ratings from external agencies for the minimum capital requirements; in the advanced approach, instead, banks develop an internal model to evaluate the expected loss (EL).

EL = PD x EAD x LGD

The Expected Loss is the amount expected to be lost on a credit risk exposure within a year timeframe.

PD: Probability of Default provides a likelihood assessment that a counterparty will be unable to pay back its debt obligations within a specified timeframe.

EAD: Exposure At Default is the outstanding amount expected at the time of a default by a counterparty, taking account of any credit risk mitigation, drawn balances, undrawn amounts of commitments and contingent exposures.

LGD: Loss Given Default is the estimated loss on an exposure following a default by the counterparty. It is the share of an asset that is lost when a borrower defaults. The recovery rate is defined as (1-LGD), the share of an asset that is recovered when a borrower defaults.
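Putting the three components together, the Expected Loss formula is a one-liner; the figures below are purely illustrative:

```python
# Expected Loss on a single exposure (figures are purely illustrative):
pd_ = 0.02          # probability of default within one year (2%)
ead = 1_000_000.0   # exposure at default, in euros
lgd = 0.45          # share of the exposure lost if default occurs (45%)

expected_loss = pd_ * ead * lgd
print(round(expected_loss, 2))  # → 9000.0 euros to set aside for this exposure
```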

The advanced model also requires the Unexpected Loss, calculated through formulas provided by the Supervision Regulator.

The output of the model is a master scale of classes linked with a probability of default score.

As in Kaggle competitions, the goal is to use a machine learning model to calculate the probability of default, with accuracy/AUC as the evaluation metric. But while in a Kaggle competition you only need to optimize accuracy, in the real world you must respect rules defined by the Regulator: a calibrated PD in an appropriate range, robustness of the model, the same parameters to evaluate all counterparties, a transparent model, good quality of the data used, and understandability.

In a regulated sector, unsupervised machine learning can be used to decide how many models to build for each target of the market, through cluster analysis, correlation analysis and component analysis.

Feature selection is used to evaluate the variables that tailor the model, and supervised machine learning can be used as a benchmark to improve it.

Both traditional and modern approaches can be used to define the PD master scale and its calibration, also using econometric approaches.

Look at the video.

The post 50 Shades of Text – Leveraging Natural Language Processing (NLP) appeared first on Data Science Milan.

On 21st June 2018 at Buildo, Data Science Milan organized an event on a trending topic: Natural Language Processing (NLP). Nowadays we find many applications of NLP, such as machine translation (Google Translate), question answering (chatbots), web and application search (Amazon), lexical semantics (thesauri), sentiment analysis (Cambridge Analytica) and natural language generation (Reddit bots).

__“50 Shades of Text – Leveraging Natural Language Processing (NLP) to validate, improve, and expand the functionalities of a product”, by Alessandro Panebianco, Grainger__

What do we mean by natural language processing?

Natural language processing is a branch of artificial intelligence representing a bridge between humans and computers; it can be broadly defined as the automatic manipulation of natural language, like speech and text, by software. There are many ways to represent words in NLP, because you cannot feed text data directly to machine learning algorithms.

The first step is to transform raw text into numerical features by vectorizing words, and there are several techniques:

-Bag of words: a way of extracting features from text as input for machine learning. It represents text by describing the occurrence of words from a vocabulary built on the corpus, encoded as a count or binary vector. It is called a "bag" of words because it does not care about the order or structure of the words in the corpus.

-Hashing trick: a hash function can be used to map data of arbitrary size to a fixed-size set of numbers. The hashing trick (or feature hashing) consists of applying a hash function to the features and using the hash values directly as indices: the same input always gives the same output. A binary score or a count can then be used to score each word. Hashing is a one-way process, which can sometimes be a problem because you cannot go back from the output to the input space, and collisions between mapped features can happen.

-TF-IDF: another approach is to rescale the frequency of words by how often they appear across all documents; this approach is called Term Frequency – Inverse Document Frequency. Term frequency scores the frequency of a word in the current document; inverse document frequency scores how rare the word is across the corpus. With TF-IDF, terms are weighted so that the scores highlight words carrying useful information.
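A minimal pure-Python sketch of the three techniques (md5 stands in for the fast non-cryptographic hashes real libraries use, and this is one common TF-IDF variant; scikit-learn, for instance, adds smoothing):

```python
import hashlib
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})

# 1. Bag of words: one count per vocabulary word, order ignored.
def bow(doc):
    return [doc.count(w) for w in vocab]

# 2. Hashing trick: tokens are mapped straight to bucket indices,
#    so no vocabulary needs to be stored (collisions are possible).
def hashed_bow(doc, n_buckets=8):
    vec = [0] * n_buckets
    for tok in doc:
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % n_buckets] += 1
    return vec

# 3. TF-IDF: term frequency rescaled by how rare the word is in the corpus.
def tf_idf(doc):
    return [
        (doc.count(w) / len(doc)) * math.log(len(docs) / sum(w in d for d in docs))
        for w in vocab
    ]

print(bow(docs[0]))
print(hashed_bow(docs[0]))
print([round(x, 3) for x in tf_idf(docs[0])])
```

Note how "the", which appears in most documents, gets a lower TF-IDF weight than "cat" even though it occurs more often.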

The second level is word embedding, whose goal is to generate vectors encoding semantics: individual words are represented by vectors in a predefined vector space. For word embedding too there are several techniques:

-Word2vec: a neural network that tries to maximize the probability of seeing a word given its context window; similarity between words is then measured as the cosine similarity between their vectors. The task is achieved by two learning models: the continuous bag-of-words (CBOW) model, which tries to predict a word from its context, and the continuous skip-gram model, which tries to predict the context from a word.

-GloVe: an extension of word2vec that, instead of predicting words, constructs an explicit word-context (word co-occurrence) matrix using statistics across the whole text corpus, with a computational boost; it also uses cosine similarity.

-FastText: a library for learning word embeddings and text classification created by Facebook's AI Research (FAIR). It achieves the same accuracy as the previous models but with better performance, which can be summed up by this relationship:

FastText : Word Embeddings = XGBoost : Random Forest
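All of these models measure word relatedness with cosine similarity; a sketch with hypothetical toy embeddings (real ones have hundreds of dimensions):

```python
import math

def cosine(u, v):
    # Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-d embeddings (real ones have hundreds of dimensions).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
banana = [0.1, 0.2, 0.9]

assert cosine(king, queen) > cosine(king, banana)
```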

The last level is sentence embeddings, whose goal is to represent more than single words by vectors; in this case too, several models are available:

-Doc2vec: it works in the same way as word2vec, but over a network of paragraphs and words, so a sentence can be thought of as another, document-unique word. Distributed Memory (DM) is analogous to the CBOW model, and Distributed Bag of Words (DBOW) to skip-gram.

-CNNs: Convolutional Neural Networks were born for computer vision and have more recently been applied to Natural Language Processing problems. They are basically composed of several layers of convolutions with nonlinear activation functions applied to the results. Convolutions over the input layer compute the output, resulting in local connections where each region of the input is connected to a neuron of the output; each layer applies different filters and combines their results. The process starts by stacking words together into a matrix; filters scan the words, max pooling highlights the most important ones, and an LSTM layer can keep the word order.

-LSTMs: Long Short-Term Memory networks are Recurrent Neural Networks with some additional features, including a memory cell for every time step. One application of RNNs is Google search, linking a query with an item. An LSTM layer creates a new output from the input words, giving relevance to word order, while the following filter layers give relevance to the most important local features.

After the presentation, a demo was shown using a dataset from Kaggle and GloVe vectors, with the repository code available on GitHub.

The post Operations Research and Optimization: Improving Decisions from Data appeared first on Data Science Milan.

On 24th May 2018 at Mikamai, Data Science Milan organized another interesting meetup, this time about operations research: the application of scientific methods, techniques and tools in search of optimal solutions to problems.

__“Operations Research &amp; Optimization: A New Dimension to Data Science”, by Andrea Taverna, Università degli Studi di Milano__

Data science (DS) and operations research (OR) can be seen as complementary: the first is more focused on data, on how to extract information and knowledge from it to take decisions; the second evaluates decisions and models them as a process, with the goal of finding optimal solutions. In this sense, operations research can be considered a new dimension of data science.

While the growth of operations research has a flat trend, data science and machine learning have been growing strongly in recent years; but looking at the analytics maturity model from studies by PWC, Gartner and SAS, we are moving towards prescriptive analytics, and operations research is positioned exactly in this fourth stage.

__“Optimized Assignment Patterns in Mobile Edge Cloud Networks”, by Alberto Ceselli, Università degli Studi di Milano__

Machine learning and operations research can interact in three ways:

-in machine learning there are sub-problems which are optimization problems;

-replacing some heuristic methods with exact methods;

-solving prescriptive analytical questions.

An application of prescriptive analytics is developed for a Mobile Edge Computing network (MEC): given an existing MEC with virtualization facilities of limited capacity, and a set of mobile Access Points (AP) whose data traffic demand changes over time, the aim is to find plans for assigning AP traffic to MEC facilities, satisfying each AP's demand without exceeding any MEC facility's capacity.

In the data-driven architecture there are two fundamental components: pre-processing and optimization. The first maps the problem; the optimization component then solves it by mathematical programming.

__”Optimization modeling in Python”, by Marco Casazza, Università degli Studi di Milano__

The last talk showcased “Pyomo”, a Python module that allows users to formulate optimization problems in the Python language.

The first application regarded the knapsack problem: given a set of items, each with a weight and a value, determine which items to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible; it can be seen as a profit maximization problem.
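Pyomo hands the formulation to an external solver; as a solver-free illustration of the knapsack problem itself, here is a plain-Python dynamic-programming sketch (the numbers are illustrative):

```python
def knapsack(items, capacity):
    """0/1 knapsack by dynamic programming: best[c] is the highest value
    achievable with total weight at most c."""
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Go right-to-left so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]

# (weight, value) pairs, illustrative numbers.
items = [(3, 60), (2, 50), (4, 70), (1, 30)]
print(knapsack(items, 5))  # → 110 (take the items of weight 2 and 3)
```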

The second application showed a flight assignment problem: given a set of flights and the crews of an airline company, the goal is to create a weekly plan minimizing the overall cost.

The last example explained a drone surveillance problem: given a number of camera-equipped drones and an area to be monitored, the goal is to optimize the number of drones needed to cover the whole area.

The post TensorFlow Dev Summit 2018 viewing party appeared first on Data Science Milan.

On 22nd May 2018 at Fintech District, Data Science Milan, in collaboration with Google and BCG Italy, gathered its community to view together some of the main talks of the TensorFlow Dev Summit 2018, held last March in Mountain View (CA). TensorFlow is an open source library for machine learning that started as a library for deep learning and neural networks; it is now a machine learning platform collecting many algorithms, with the goal of making them easier to use.

__“Keynote”__

TensorFlow represents a revolution in the field of machine learning and helps to build artificial intelligence applications; problems that were impossible to solve before are now solved using this technology.

TensorFlow has added value to many different areas, such as astronomy (discovering a new planet), healthcare (helping to assess a person's risk of cardiovascular disease by looking at scans of the human eye), aviation (predicting the trajectory of a flight) and many other applications.

TensorFlow is at the forefront of machine learning, making it all possible; it is a platform that can solve challenging problems for all of us, it is powerful, scalable and its popularity has grown in the last two years with several innovations, such as TensorFlow Hub, a library to help developers share and reuse models.

Look at the video

__“Machine Learning in JavaScript”__

TensorFlow.js is an open-source library you can use to define, train, and run machine learning models entirely in the browser, using JavaScript and a high-level layers API.

The talk showed TensorFlow Playground, an in-browser visualization of a small neural network that displays in real time all the internals of the network while it trains.

The browser has become a development environment where you can share what you build with anyone via a simple link; people opening your app don't have to install any drivers, and the browser gives access to sensors like the microphone, camera and accelerometer, making apps highly interactive.

In the livestream Nikhil Thorat and Daniel Smilkov trained a model to control a pac-man game using computer vision and a webcam, entirely in the browser.

Look at the video

__“TensorFlow Lite”__

TensorFlow Lite is a lightweight library and set of tools for doing machine learning on embedded and small platforms, with a different architecture from TensorFlow's: an interpreter runs on-device, there is a set of optimized kernels, and there are interfaces you can use to take advantage of hardware acceleration when it is available.

It is cross-platform: it supports Android and iOS, and also Raspberry Pi and most other devices running Linux.

In the workflow, you take a trained TensorFlow model, convert it to the TensorFlow Lite format using a converter, and then update your apps to invoke the interpreter through the Java or C++ APIs.

Look at the video

__“Applied AI at The Coca-Cola Company”__

TensorFlow allowed Coca-Cola North America to move its loyalty marketing programs to a mobile-web platform. The pipeline starts with pin-code recognition from a bottle cap by an OCR (Optical Character Recognition) system, applying CNNs (Convolutional Neural Networks) built with TensorFlow to train on and predict strings from images containing small character sets with lots of variance. An active learning system with a feedback loop in the user interface allows the model to gradually improve by returning corrected predictions to the training pipeline.

Look at the video

Reviewed by Fabio Concina

The post Reinforcement Learning Workshop appeared first on Data Science Milan.

The event was presented by **Orobix** (an Italian engineering company focused on building artificial intelligence-powered systems) and hosted by **Buildo**.

Luca Antiga, CEO at Orobix, introduced the basics of Reinforcement Learning.

RL rose to popularity when DeepMind, which wasn't owned by Google yet, published a paper in Nature (https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf).

In the paper, DeepMind coupled reinforcement learning with deep learning. The algorithm could equal or exceed human performance on some Atari games, using only raw pixels as input to devise a strategy; the agent had no previous knowledge of the game rules.

Luca introduced the concepts of *Agent*, *State*, *Action*, *Environment* and *Reward*, which are all foundational to the theory of RL.

He then explained the concepts of Markov Decision Process, policy, value function and q-value function, and how it quickly becomes infeasible to compute optimal policies, hence the need for function approximation.
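Before function approximation enters the picture, the q-value idea can be tried exactly on a toy problem; a tabular Q-learning sketch on a tiny deterministic chain (environment and hyper-parameters are illustrative):

```python
import random

random.seed(0)

# A tiny deterministic chain MDP: states 0..3, actions 0 (left) and 1 (right);
# reaching state 3 yields reward 1 and ends the episode.
def step(state, action):
    nxt = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3

q = {(s, a): 0.0 for s in range(4) for a in (0, 1)}
lr, gamma, eps = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate

for _ in range(500):
    s = 0
    done = False
    while not done:
        if random.random() < eps or q[(s, 0)] == q[(s, 1)]:
            a = random.choice((0, 1))          # explore, or break ties randomly
        else:
            a = max((0, 1), key=lambda act: q[(s, act)])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(q[(s2, 0)], q[(s2, 1)])
        q[(s, a)] += lr * (target - q[(s, a)])  # temporal-difference update
        s = s2

# The learned policy moves right in every non-terminal state.
assert all(q[(s, 1)] > q[(s, 0)] for s in range(3))
```

With only 8 state-action pairs a table suffices; it is when the state space explodes (e.g. raw Atari pixels) that the table must be replaced by a function approximator such as a deep network.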

For a detailed introduction on the topic one can look at the following references:

UCL course by David Silver (Google Deepmind):

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

Richard S. Sutton and Andrew G. Barto textbook:

http://incompleteideas.net/book/bookdraft2017nov5.pdf

Daniele Cortinovis, a physicist by training and Data Scientist at Orobix, then gave a great overview of the process of training an agent on some classic examples like the cart-pole problem, Atari Breakout and Atari Pong, using PyTorch and OpenAI Gym.

The post General Data Protection Regulation (GDPR): a data science perspective appeared first on Data Science Milan.

Reviewed by Fabio Concina

On 25th May 2018, the General Data Protection Regulation (GDPR) will become fully enforceable in the European Union. This new regulation succeeds the Data Protection Directive, a two-decade-old directive that has grown increasingly outdated with the explosion of information available online. The technology landscape was very different 20 years ago, when the Directive was first adopted. Today, with the widespread use of social media, apps and the internet generally, personal data is being shared and transferred across borders more than ever before, and many felt that the Directive was due for a review.

On 15th March 2018 at Buildo there was an event organized by Data Science Milan about the GDPR.

__“GDPR: keep calm and be compliant”, by Anna Capoluongo, Studio Legale Capoluongo Law Firm__

The GDPR applies to every process involving personal data carried out by subjects (data controllers or data processors) operating in the European Union. A major change regards the accountability principle for personal data protection: information needs to be processed lawfully, fairly and in a transparent manner. Relevant aspects of the accountability principle are data protection by Design and by Default, clear roles and responsibilities, the assessment of risks and the adoption of measures suitable to mitigate them. There is also a new function: the Data Protection Officer (DPO), who is at the heart of the process of implementing the accountability principle and is responsible for data protection. This role is mandatory only for public administrations, for large-scale monitoring activities and for the treatment of sensitive personal data. The DPO is the point of contact between the company and the Supervisory Authority.

The violation of personal data (a so-called "data breach") is any event that puts at risk the personal data held by the data controller: a breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorised disclosure of, or access to, personal data transmitted, stored or otherwise processed. When a violation occurs, the data controller must notify the Supervisory Authority; all data breaches have to be reported within 72 hours, and any delay in the notification must be justified. If the risk is really high, the notification must also reach the individuals concerned, and a register of violations is mandatory. Administrative fines are divided into two brackets: up to 10 million euro or 2% of the total worldwide annual turnover, and up to 20 million euro or 4% of the total worldwide annual turnover.

__“How to comply with the GDPR in practice, with and without iubenda”, by Andrea Giannangelo, founder of iubenda__

Iubenda is a service that allows you to quickly create privacy policies and terms and conditions. It covers not only websites but also mobile apps and Facebook apps.

Andrea carried on with the GDPR topic, stating that companies must prepare internal documentation following certain guidelines, indicating security measures, how and which data to process, the assignment of responsibilities, and response times in case of requests by customers. For example, companies must have an internal document with instructions in case of a data breach, and a document with instructions for when customers want to exercise their rights. Alongside all these documents there is a corporate risk register: a sort of corporate privacy statement. It is a written document, also in electronic format, which contains a series of information concerning the processing activities performed by the data controller. The register is mandatory for companies or organizations with more than 250 employees. It contains information on the data processed and the interested parties involved, the purposes, the security measures, where data are stored and for how long, and so on. The data protection assessment is a document to draw up when companies start a new project, recording which data are involved and why, with an assessment of the risks and all the measures to mitigate them. Regarding the web and cookies, a privacy statement requires the owner's data, the purposes, the third parties involved, the legal basis and so on; it is important to store offline forms and documents signed by customers, because a dispute could arise about their request to receive promotions.

__“Turn GDPR into an added-value for your business”, by Andy Petrella, Kensu__

Andy introduced a perspective on data science catalogs and data science governance tools, and on how the GDPR can add value to the enterprise.

Data science is an umbrella over all activities on data, and data pipelines connect those activities from input to output, transforming data and involving several assumptions and technologies: an end-to-end processing line to solve one problem, to take one decision. Data science governance checks that data activity meets precise standards and involves monitoring production data activity: how accurate the model is, what the patterns are. This process involves technologies, users (who is responsible), sources, data and processing. Many tools use the data and the number of processing activities keeps growing, so all this information is connected in a "data flow"; in this way it is possible to create a map, as a graph of tools and processes, and know which data regard transactions, which data regard customers, and so on. With this map it is possible to assess governance activities such as impact analysis, dependency analysis, pipeline optimization and data/model recommendation. The accountability principle of the GDPR requires implementing adequate technical solutions and internal audits of processing activities; with data science governance you can monitor activities (e.g. machine learning performance), building a process registry with all the data involved and the tasks pursued. In this way, transparent reports of activities across the whole chain of processing can be produced.
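The "data flow" map described above can be pictured as a directed graph; a toy sketch (all node names are hypothetical) of how such a map supports impact analysis:

```python
# Hypothetical data-flow graph: each node is a dataset, tool or model,
# and edges point downstream.
flow = {
    "crm_export": ["customer_table"],
    "customer_table": ["churn_features", "marketing_report"],
    "churn_features": ["churn_model"],
    "churn_model": [],
    "marketing_report": [],
}

def impact(node):
    """Impact analysis: everything downstream of a node, found by DFS."""
    seen, stack = set(), [node]
    while stack:
        for nxt in flow[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# If customer_table contains personal data, these assets are affected too:
print(sorted(impact("customer_table")))  # → ['churn_features', 'churn_model', 'marketing_report']
```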

The post General Data Protection Regulation (GDPR): a data science perspective appeared first on Data Science Milan.



On 12th February 2018 at Buildo, Data Science Milan and Buildo opened the 2018 season of data science events with talks about online banking fraud detection.

Financial fraud is a broad term with several potential meanings; it can be defined as the intentional use of illegal methods to obtain financial gain. There are many different types of financial fraud, from credit card fraud to automobile insurance fraud, and advancements in modern technologies such as the internet and mobile have led to an increase in financial fraud.

__“What is Banksealer” by Daniele Gallingani, Buildo__

Daniele presented Banksealer, an online banking fraud and anomaly detection framework used by analysts as a decision support system. It started in 2016 from research at Politecnico di Milano, sponsored by Secure Network and by Buildo.

It can be described as a decision support system for IT security teams: by aggregating historical transaction data, it summarizes each customer’s interaction with the e-banking system and, using advanced statistical and machine learning techniques, flags whether, and how, a transaction is atypical.

Typical frauds in a banking scenario range from phishing and credential database compromise to more advanced techniques.

The tool provides a real-time ranking: transactions with a high score are blocked, while a low score lets them proceed; it works as a device integrated with the bank’s infrastructure.

Banksealer can be described as explainable machine learning with graphs: a dashboard visualizes much useful information for the analyst, and a separate window shows a top list of the most anomalous transactions.

__“Banksealer Algorithms and Architecture” by Claudio Caletti, Buildo__

In the second speech Claudio Caletti talked about software architecture and algorithms implemented in Banksealer.

Banksealer is a system that moves transactions through different states; the main entities in the tool are precisely the transactions: bank transfers, payments, prepaid card transactions, phone recharges and so on.

Transactions arrive as input from the bank, are processed by the trained machine learning algorithms and labelled with a score, and the scored transactions are then forwarded as output to external systems.

The process can be split into three blocks: a block exposed to external systems, which imports transactions from the banks in raw format and exports the scored transactions; a data block made of a relational database and Elasticsearch; and the Banksealer core, made of the machine learning models and the user interface (front end and back end).

All services in Banksealer are written in Scala because it is a type-safe language. Components are split into generic and specific ones: the generic components are the core of the system, identical for all banks, while the specific components are ad-hoc drivers built for each bank.

The Banksealer approach is based on three main algorithms; the first one, the local profile, is the most important.

The local profile works on single customers: it defines each user’s individual spending pattern to evaluate the anomaly of each new transaction. During training, transactions are aggregated by customer and each feature distribution is approximated by a histogram.

The anomaly score of each new transaction is calculated with the HBOS (Histogram-Based Outlier Score) method, which computes the log-likelihood of a transaction according to the learned marginal distributions. The HBOS score is a weighted sum of the normalized frequency histograms of each feature, where the weighting coefficients are tuned by the analyst and, in the upgraded version, calculated by a genetic algorithm.
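As a rough sketch (not the production implementation), an HBOS-style score can be computed by approximating each feature with a histogram and summing weighted negative log-densities:

```python
import numpy as np

def fit_histograms(X, bins=10):
    """Approximate each feature's marginal distribution with a histogram."""
    return [np.histogram(X[:, j], bins=bins, density=True)
            for j in range(X.shape[1])]

def hbos_score(x, hists, weights=None, eps=1e-9):
    """Weighted sum of negative log-densities: higher means more anomalous."""
    if weights is None:
        weights = np.ones(len(hists))
    score = 0.0
    for w, (density, edges), xj in zip(weights, hists, x):
        # locate the bin containing xj, clipping values outside the range
        i = int(np.clip(np.searchsorted(edges, xj) - 1, 0, len(density) - 1))
        score += w * -np.log(density[i] + eps)
    return score
```

Because each feature is scored independently, the whole computation is linear in the number of features, which is where HBOS gets its speed.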

HBOS assumes independence between features, which makes it much faster than multivariate approaches at the cost of some precision: in fact, it performs poorly on local outlier problems.

The second algorithm, the global profile, is useful for new users: it defines “classes” of spending patterns and mitigates the undertraining problem. Each user is represented by six components: total number of transactions, average transaction amount, total amount, average time span between subsequent transactions, number of transactions executed from overseas countries, and number of transactions to overseas recipients.

Customer profiles are clustered with DBSCAN (Density-Based Spatial Clustering of Applications with Noise) using the Mahalanobis distance. For each global profile, the CBLOF (Cluster-Based Local Outlier Factor) anomaly score is calculated, which tells the analyst how uncommon a spending pattern is with respect to the closest customers. It detects how much the user profile deviates from the dense cluster of “normal” users: small clusters are considered outliers with respect to large clusters.
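A minimal sketch of the clustering step, on synthetic profiles; the six features, `eps` value and thresholds are illustrative assumptions, not the talk’s settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# 500 synthetic global profiles with six features (transaction counts,
# amounts, time spans, overseas counts...)
profiles = rng.normal(size=(500, 6))

# The Mahalanobis distance rescales the features by their covariance, so a
# count feature and an amount feature contribute comparably
VI = np.linalg.inv(np.cov(profiles.T))
db = DBSCAN(eps=3.5, min_samples=10, metric="mahalanobis",
            metric_params={"VI": VI}).fit(profiles)

labels = db.labels_   # cluster id per user; -1 marks noise (candidate outliers)
```

A CBLOF-style score would then rate each user by the size of its cluster and by its distance to the centroid of the nearest large cluster.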

The third algorithm, the temporal profile, deals with frauds that exploit many transactions made within a time window, by comparing the current spending profile with the user’s history. During training, the mean and standard deviation of aggregated features are calculated for each customer: total amount, and total and maximum daily number of transactions. At runtime, the cumulative value of each feature is calculated for each user and compared against the previously computed metrics.
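A toy sketch of that runtime check; the feature values below are invented for illustration:

```python
import numpy as np

def fit_temporal_profile(history):
    """history: per-window aggregates for one user, shape (n_windows, n_features)."""
    return history.mean(axis=0), history.std(axis=0, ddof=1)

def temporal_anomaly(current, mean, std, eps=1e-9):
    """Per-feature z-scores of the current window against the user's history."""
    return np.abs(current - mean) / (std + eps)

# columns: total amount, transactions per day, max transactions in a day
history = np.array([[1200.0, 4, 2],
                    [ 900.0, 3, 2],
                    [1100.0, 5, 3],
                    [1000.0, 4, 2]])
mean, std = fit_temporal_profile(history)
z = temporal_anomaly(np.array([5000.0, 20, 12]), mean, std)
```

A cumulative total many standard deviations above the user’s historical mean pushes the transaction up the ranking.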

All these algorithms are merged into a single output ranking score.

Unlike other similar tools, which are black boxes, Banksealer can be considered a white box, because the analyst understands what is going on. It is not completely automated, it is meant to be easy to deploy, and it achieves a good false positive ratio.

The tool is mainly built around the histogram-based HBOS algorithm, which is easy for the analyst to understand, and the ranking score helps the analyst manage the number of reported transactions in the presence of false positives.

References:

https://hal.archives-ouvertes.fr/hal-01370386v1

https://www.sciencedirect.com/science/article/pii/S0167404815000437

https://link.springer.com/chapter/10.1007/978-3-319-60080-2_17

The post Banksealer: a decision support system for online banking fraud analysis appeared first on Data Science Milan.


On 11th December 2017 at YoRoom, Data Science Milan and Data Reply closed a full year of data science events with talks on price optimization.

Pricing is one of the most important strategic levers companies use to define their competitive position: it can increase revenues, has a positive impact on renewal rates, gives more visibility to new products, and improves customer satisfaction.

The price is the economic value of a good or service, expressed in currency at a given time and place; it varies according to changes in supply and demand, and can be expressed as the ratio between the total revenue the company expects from that good and the quantity produced.

Pricing optimization is a technique that uses data analysis to predict how potential buyers respond to different prices for a company’s products and services across different channels.

__“Optimal discount strategy for products in close-out phase” by Ilaria Gianoli, Data Reply__

Ilaria shared her experience in optimizing the close-out strategy for a multinational retail leader: identifying the optimal discount strategy for products in their close-out phase, as a trade-off between margin loss and inventory cost. The solution is split into three steps: first, collecting all sales information to build a time series and a forecast model; then, developing an elasticity model at both product level and hierarchical level; finally, choosing the optimal discount that maximizes the difference between margin and inventory fee.

Several algorithms are used across these steps: linear regression to develop the elasticity model, with DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to cluster the products. DBSCAN is a density-based clustering algorithm: given a set of points in some space, it groups points that are closely packed together (with many nearby neighbours), marking as outliers points that lie alone in low-density regions. For time series forecasting, either ARIMA (AutoRegressive Integrated Moving Average) or FFNN (feed-forward neural network) models are used, depending on whether there are at least two seasonalities. The whole process is managed using R and Cloudera tools.
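A common way to fit such an elasticity model, sketched here with invented numbers rather than the project’s data, is a log-log linear regression, where the slope is the price elasticity of demand:

```python
import numpy as np

# hypothetical weekly observations for one product: price and units sold
price = np.array([10.0, 9.0, 8.0, 7.0, 6.0, 5.0])
units = np.array([100., 115., 140., 170., 210., 260.])

# log-log linear regression: log(q) = a + b*log(p), where the slope b is
# the price elasticity of demand (expected to be negative)
b, a = np.polyfit(np.log(price), np.log(units), 1)

def forecast_units(p):
    """Predicted demand at price p under the fitted elasticity model."""
    return np.exp(a + b * np.log(p))
```

The optimal-discount step can then compare, for each candidate discount, the margin implied by the forecast demand against the inventory fee saved.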

The algorithm was tested on two families of goods: frequently sold products (high-rotating) and rarely sold products (low-rotating). Results were good for both in terms of KPIs compared with the previous situation, where no algorithm was in place: a reduction of coverage days in the stores and of inventory costs, matched by an improvement in revenues due to increased sales and a reduction of fees, with the saved resources invested in other projects. There were also qualitative results: more visibility for the new products introduced to substitute the older ones, better space allocation of products in the stores, and homogeneity among stores, because the solution answers specific needs while remaining general.

__“Online pricing: from theory to application” by Giovanni Corradini, Data Reply__

Giovanni showed the Multi-Armed Bandit algorithm used in e-commerce by a ticket-selling company, whose task is to choose the best price to maximize revenue. The price optimization comes from the trade-off between exploration and exploitation: exploration means searching for the best among several prices; exploitation means proposing the current best price to make revenue.

The Multi-Armed Bandit (MAB) problem is a fundamental dynamic optimization problem in reinforcement learning: the decision maker faces a set of possible decisions (arms). Typically, each arm has a stationary reward distribution unknown to the decision maker, who selects arms with the goal of maximizing the cumulative reward. Suppose there are n rounds; in each round the decision maker receives a request and offers a price (an arm), with the resulting payoff from the customer: the proposed price if the customer buys the ticket, and nothing otherwise.

One of the simplest possible algorithms for trading off exploration and exploitation is the epsilon-Greedy algorithm. A greedy algorithm always takes whatever action seems best at the present moment, even when that decision might have bad long-term consequences. The epsilon in the algorithm’s name refers to the probability that the algorithm explores instead of exploiting. It is one of the easiest bandit algorithms because it balances the two opposing goals of exploration and exploitation with a coin-flip-like mechanism.
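A minimal epsilon-greedy sketch for the pricing setting; the candidate prices and the purchase-probability simulator are invented for illustration:

```python
import random

def epsilon_greedy(prices, n_rounds, epsilon, buy_prob):
    """Minimal epsilon-greedy over candidate prices.
    buy_prob(p) -> purchase probability at price p (the simulator)."""
    counts = [0] * len(prices)
    values = [0.0] * len(prices)   # running average revenue per arm
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(len(prices))                     # explore
        else:
            arm = max(range(len(prices)), key=lambda i: values[i])  # exploit
        reward = prices[arm] if random.random() < buy_prob(prices[arm]) else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
    return values, counts
```

With a linear demand simulator such as `buy_prob = lambda p: 1 - p / 40` over prices `[10, 20, 30]`, the middle price has the highest expected revenue, and the running averages converge towards it.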

This algorithm has one systematic weakness: it does not keep track of how much it knows about each of the available arms. One can do better with an algorithm that pays attention not only to what it knows, but also to how much it knows: the Upper Confidence Bound (UCB) algorithm, which works as follows. In each time period, the algorithm assigns each arm a so-called UCB value, the sum of its expected reward and a potential value from experimentation, then plays the arm with the highest value. The decision maker observes a noisy reward and updates these values for each arm.
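A sketch of UCB1, the classic instance of this idea. Rewards are normalised to [0, 1], as UCB1’s confidence bonus assumes, and the prices and purchase probabilities are again invented:

```python
import math
import random

def ucb1(prices, n_rounds, buy_prob):
    """UCB1 over candidate prices: play each arm once, then always play the
    arm with the highest mean reward plus confidence bonus."""
    n, top = len(prices), max(prices)
    counts = [0] * n
    values = [0.0] * n
    def pull(i):
        # normalise revenue to [0, 1] so the confidence bonus is well scaled
        reward = prices[i] / top if random.random() < buy_prob(prices[i]) else 0.0
        counts[i] += 1
        values[i] += (reward - values[i]) / counts[i]
    for i in range(n):                      # initialisation: one pull per arm
        pull(i)
    for t in range(n, n_rounds):
        bonus = [math.sqrt(2 * math.log(t) / counts[i]) for i in range(n)]
        pull(max(range(n), key=lambda i: values[i] + bonus[i]))
    return values, counts
```

The bonus term shrinks as an arm is pulled more often, so under-explored arms are revisited automatically instead of relying on a fixed exploration rate.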

In this e-commerce context, part of the provider’s revenue comes from metasearch engines, and the provider has no direct access to the user. The solution therefore used simulated user requests, implemented with C++ and Python tools. Results come from two different non-stationary environments, switched between over 15,000 interactions, comparing the ORAT (Online Risk Averse Tree) algorithm against UCB1: ORAT gains over UCB1 at the beginning, before the two level off. The business benefits are measured by increased revenues and customer satisfaction, together with a decreasing cost of maintenance and analytic know-how of the process.

__“Renewal Price Optimization for Subscription products” by Riccardo Lorenzon, Data Reply__

Riccardo presented an application of subscription renewal pricing optimization models for a company in the publishing industry, whose task is to decide the optimal prices for renewal subscription products, given some boundaries and objectives provided by the customer.

The solution is developed in three steps: first, collecting all customer features into a database; then, developing an elasticity model; finally, choosing the optimal price for each contract given the input KPIs. The elasticity curves are produced by a logistic regression with elastic net regularization, as in a churn model, but instead of the binary output (0-1), the predicted probability is used as the metric to predict renewals. The whole process is managed with well-known data science tools: R, Python and Cloudera for data preparation.
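A sketch of such an elasticity model with scikit-learn; the features and the simulated renewal behaviour are invented, and the elastic net penalty requires the `saga` solver:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
# hypothetical per-contract features: proposed renewal price, customer
# tenure in years, and a usage index
price = rng.uniform(50, 150, n)
tenure = rng.uniform(0, 10, n)
usage = rng.uniform(0, 1, n)
X = np.column_stack([price, tenure, usage])
# simulate renewals: a higher price lowers the renewal probability
logit = 3.0 - 0.04 * price + 0.1 * tenure + 1.0 * usage
y = rng.random(n) < 1 / (1 + np.exp(-logit))

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, max_iter=5000),
).fit(X, y)

def renewal_curve(prices, tenure=5.0, usage=0.5):
    """Predicted renewal probability as a function of the proposed price."""
    Xq = np.column_stack([prices, np.full_like(prices, tenure),
                          np.full_like(prices, usage)])
    return model.predict_proba(Xq)[:, 1]
```

The predicted probability, rather than the 0-1 class, is the elasticity curve fed into the optimization step.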

The optimization problem works as follows: given an elasticity curve for each customer, marketing sets targets on the global set of customers. One customer may end up paying more and another less, trading retention on the price-sensitive customers against extra revenue gained from the others.
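A toy sketch of that per-customer choice, with invented logistic curves, candidate prices and target: pick, for each customer, the candidate price that maximizes expected revenue, then check the resulting average renewal rate against the marketing target.

```python
import math

def logistic_curve(a, b):
    """Hypothetical elasticity curve: renewal probability as a function of price."""
    return lambda p: 1 / (1 + math.exp(-(a - b * p)))

def optimal_price(curve, candidates):
    """Price maximizing expected revenue p * P(renew | p) for one customer."""
    return max(candidates, key=lambda p: p * curve(p))

customers = [logistic_curve(6.0, 0.05), logistic_curve(3.0, 0.04)]
candidates = [60, 80, 100, 120]
plan = [optimal_price(c, candidates) for c in customers]
avg_renewal = sum(c(p) for c, p in zip(customers, plan)) / len(plan)
```

If `avg_renewal` falls below the marketing target, prices are lowered for the most price-sensitive customers until the constraint is met, which is exactly the trade-off described above.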

Two scenarios were tested, with marketing targets as input and KPIs as output, with deltas compared either to the previous month or the previous year depending on the time frame, and against the previous situation without the algorithm. In the first scenario the target was to maximize margin; in the second, to maximize the renewal rate. In the first, profit margin and revenue increased with a steady renewal rate; in the second, profit margin and renewal rate increased while revenue decreased. The solution gives many quantitative benefits, improving economic KPIs such as revenues, margins, sales volume, renewal rate, discount rates and process automation, and qualitative benefits: it helps marketing focus on customer needs.

The post Pricing Optimization: Close-out, Online and Renewal strategies by DataReply appeared first on Data Science Milan.
