Written by Claudio Giancaterino and reviewed by Fabio Concina
On 15th May, ICTeam hosted two talks about Big Data. They focused on using the Spark framework to scale deep learning and on using the R environment for Big Data processing.
Apache Spark is an open source cluster computing framework that provides a unified, distributed programming interface for several languages (Java, Python, Scala and R).
The first talk addressed the use of deep learning for image caption generation.
The approach presented started from the IT architecture: one Big Data cluster (Spark) with four virtual Cloudera nodes, all using the same Python distribution (Anaconda) for deep learning scoring.
The question, then, is how to move data from the cluster into deep learning training that runs on multiple devices. The solution was to train in parallel on each device, using the TensorFlow library to get the most out of the algorithm.
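The talk summary does not say how the parallel training was organised, so the following is only a minimal sketch of one common scheme, data parallelism, written in plain Python rather than TensorFlow. The one-parameter linear model, the function names and the learning rate are all assumptions made for illustration: each "device" computes a gradient on its own shard of the batch, and the averaged gradient updates the shared weight.

```python
def gradient(w, shard):
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(batch, num_devices):
    """Deal the batch out to the devices in round-robin order."""
    return [batch[i::num_devices] for i in range(num_devices)]


def parallel_step(w, batch, num_devices, learning_rate):
    """One training step: per-device gradients, then an averaged update."""
    shards = split(batch, num_devices)
    # On real hardware each gradient would be computed on its own device;
    # here the "devices" are simulated one after the other.
    grads = [gradient(w, shard) for shard in shards]
    return w - learning_rate * sum(grads) / len(grads)


# Hypothetical data generated from y = 3 * x, so the ideal weight is 3.0.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = 0.0
for _ in range(200):
    w = parallel_step(w, data, num_devices=2, learning_rate=0.01)
print(round(w, 4))
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, so the result matches single-device training while the work is spread across devices.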
Deep learning is now a recurring topic in data science: it is used in contexts such as computer vision, speech recognition and natural language processing. With deep learning, input data flows through hierarchical layers, and in each layer the data is transformed with the aim of minimizing the error. A deep learning architecture is a neural network built from a large collection of connected simple units called “artificial neurons”, so named because each unit is meant to mimic a biological neuron in the brain.
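To make the idea of data flowing through layers of artificial neurons concrete, here is a small sketch in plain Python. The weights, the sigmoid activation and the squared-error measure are assumptions chosen for illustration and were not part of the talk.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum followed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows, biases):
    """A layer is a group of neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Pass the data through each layer in turn."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations


def squared_error(prediction, target):
    """The quantity that training would try to minimize."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


# Hypothetical two-layer network: 2 inputs -> 2 hidden neurons -> 1 output.
network = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),
    ([[0.3, 0.7]], [-0.1]),
]
output = forward([1.0, 2.0], network)
error = squared_error(output, [1.0])
print(output, error)
```

Training consists of adjusting the weights and biases so that this error shrinks over many examples.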
The question is: how can you train the model? Help for training deep learning models in a distributed setting comes from TensorFlow, an interface for expressing machine learning algorithms and an implementation for executing them. It uses data flow graphs to represent computation, shared state, and the operations that mutate that state. Data passes through “nodes” in which it is processed (for example, input data combined through regression) to allow “learning” and the development of neural networks. The library provides a Python API as well as Java and C++ ones. It can operate at large scale and in heterogeneous environments. It was open sourced by Google in 2015.
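The data flow idea can be illustrated without TensorFlow itself. The sketch below is a toy graph in plain Python, not TensorFlow's API: the `Node`, `Variable`, `add`, `mul` and `assign` names are hypothetical. It shows the three ingredients mentioned above: computation as nodes, shared state as a variable, and an operation that mutates that state.

```python
class Node:
    """A graph node: an operation applied to the results of its inputs."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)

    def run(self):
        return self.op(*(node.run() for node in self.inputs))


class Variable(Node):
    """Shared state: a node whose value persists between runs."""

    def __init__(self, value):
        super().__init__(lambda: self.value)
        self.value = value


def constant(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, (a, b))


def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))


def assign(variable, node):
    """An operation that mutates the variable when it is run."""
    def op(new_value):
        variable.value = new_value
        return new_value
    return Node(op, (node,))


w = Variable(2.0)
x = constant(3.0)
y = add(mul(w, x), constant(1.0))         # y = w * x + 1
update = assign(w, add(w, constant(0.5))) # w <- w + 0.5

before = y.run()
update.run()
after = y.run()
print(before, after)
```

Building the graph computes nothing. Values are produced only when a node is run, and running `update` changes what every later run of `y` returns.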
The example presented by Luca Grazioli concerned image caption generation, a method that translates a picture into text. How does it work? In a first step, Inception, every image is classified into a series of objects (those inside the picture), each linked with its likelihood. The second step is an LSTM (Long Short-Term Memory) network that builds phrases by adding more words at each successive step.
So the baseline model used to create image captions is a generative recurrent neural network in which the output word at time step t−1 becomes the input word at time step t.
To read a picture, a scoring function is defined in Python and applied through a map-partitions operation, which yields the likelihood of the words useful for describing the image.
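The feedback loop, in which each generated word becomes the next input, can be sketched as follows. This is a deliberately simplified stand-in: a real LSTM also carries a hidden state and the image features, whereas here a hypothetical lookup table of next-word probabilities plays the role of the network, and the most likely word is picked greedily.

```python
START, END = "<start>", "<end>"

# Hypothetical next-word probabilities standing in for the trained network.
NEXT_WORD = {
    START: {"a": 0.9, "the": 0.1},
    "a": {"dog": 0.6, "cat": 0.4},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"runs": 0.7, END: 0.3},
    "cat": {"sleeps": 0.8, END: 0.2},
    "runs": {END: 1.0},
    "sleeps": {END: 1.0},
}


def generate_caption(next_word, max_len=10):
    """Greedy decoding: the word emitted at step t-1 is the input at step t."""
    word = START
    caption = []
    for _ in range(max_len):
        candidates = next_word[word]
        word = max(candidates, key=candidates.get)  # most likely next word
        if word == END:
            break
        caption.append(word)
    return caption


caption = generate_caption(NEXT_WORD)
print(" ".join(caption))
```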
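The map-partitions pattern can be sketched in plain Python. Spark's Python API has an `RDD.mapPartitions` method that applies a function once per partition, and the toy `map_partitions` helper below only imitates that behaviour locally. The model loader, the image identifiers and the word likelihoods are hypothetical. The point of the pattern is that an expensive model is loaded once per partition rather than once per image.

```python
load_calls = []  # records every model load, to show how often it happens


def load_model():
    """Hypothetical stand-in for loading a trained captioning model."""
    load_calls.append(1)

    def model(image_id):
        # A real model would return likelihoods computed from the pixels.
        return {"dog": 0.8, "grass": 0.2}

    return model


def score_partition(image_ids):
    """Scoring function: load the model once, then score every image."""
    model = load_model()
    for image_id in image_ids:
        yield image_id, model(image_id)


def split_into_partitions(items, num_partitions):
    return [items[i::num_partitions] for i in range(num_partitions)]


def map_partitions(partitions, func):
    """Local imitation of applying func once per partition."""
    return [list(func(iter(partition))) for partition in partitions]


images = [f"img_{i}" for i in range(8)]
partitions = split_into_partitions(images, 4)  # e.g. one per cluster node
scored = map_partitions(partitions, score_partition)
results = [pair for part in scored for pair in part]
print(len(results), "images scored with", len(load_calls), "model loads")
```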
In the second talk, Serena Signorelli presented a new resource for handling Big Data with the Spark framework in the R environment.
In the data science process with Big Data, R has a weakness: it requires a large amount of memory to process data of that size. Until last year, one solution was to use SparkR.
SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. It provides a distributed data frame implementation that supports operations such as selection, filtering and aggregation on large datasets. SparkR also supports distributed machine learning using MLlib. It was natively included in Spark after version 1.6.2, but it requires using another programming language because it is pre-configured on Spark clusters.
Recently, a really interesting solution for R users has been developed as a CRAN package: sparklyr by RStudio. With it you can access and analyze the data inside the cluster and bring only the results into R. The package is easy to install in R and can also download and install Spark. Because sparklyr is an interface for Apache Spark, it can connect to both local instances of Spark and remote Spark clusters. With this tool the data science process changes: the analysis is carried out using Spark (Hive tables, machine learning libraries, SQL).
The strength of this package is its syntax, which is almost the same as plain R. It is based on three pillars: dplyr, Spark machine learning algorithms and extensions.
The sparklyr package provides a complete dplyr backend for data manipulation, analysis and visualization; it translates R code into Spark SQL. It has two characteristics: it uses the same operators, such as the pipe from the magrittr package, and it never pulls data into R unless the user explicitly asks for it.
The second pillar covers three families of functions for machine learning pipelines, provided by the Spark ML package:
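The two characteristics above, chained operations translated into SQL and no data transfer until requested, can be illustrated with a small Python sketch. It is not sparklyr's API: the `LazyTable` class, the table and column names, and the fake `execute` function are all hypothetical and only show the lazy-translation idea.

```python
class LazyTable:
    """Records operations and turns them into SQL only when asked."""

    def __init__(self, name, columns="*", conditions=()):
        self.name = name
        self.columns = columns
        self.conditions = tuple(conditions)

    def filter(self, condition):
        return LazyTable(self.name, self.columns,
                         self.conditions + (condition,))

    def select(self, *columns):
        return LazyTable(self.name, ", ".join(columns), self.conditions)

    def to_sql(self):
        sql = f"SELECT {self.columns} FROM {self.name}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql

    def collect(self, execute):
        """The only step that runs the query and brings results back."""
        return execute(self.to_sql())


executed = []  # every SQL string actually sent to the (fake) cluster


def execute(sql):
    executed.append(sql)
    return [("AA", 42)]  # hypothetical result rows


query = LazyTable("flights").filter("dep_delay > 15").select("carrier", "dep_delay")
sql = query.to_sql()
sent_before_collect = len(executed)  # still 0: nothing has run yet
rows = query.collect(execute)
print(sql)
```

Chaining `filter` and `select` only builds a description of the query. The cluster is contacted once, when `collect` is called.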
-Machine learning algorithms for analyzing data;
-Feature transformers for manipulating individual features;
-Functions for manipulating SparkDataFrames.
The last pillar concerns the extensions that can be created to call the full Spark API.
One strength of SparkR compared to sparklyr is support for UDFs (User-Defined Functions), which are available in SparkR but not yet in sparklyr.
Author: Claudio Giancaterino
Actuary & Data Science Enthusiast