50 Shades of Text – Leveraging Natural Language Processing (NLP)

“Towards human-like comprehension of texts/languages by computers”

On 21st June 2018 at Buildo, Data Science Milan organized an event on a fashionable topic: Natural Language Processing (NLP). Nowadays we find many applications of NLP, such as machine translation (Google Translate), question answering (chatbots), web and application search (Amazon), lexical semantics (Thesaurus), sentiment analysis (Cambridge Analytica), and natural language generation (Reddit bots).

“50 Shades of Text – Leveraging Natural Language Processing (NLP) to validate, improve, and expand the functionalities of a product”, by Alessandro Panebianco, Grainger

What is the meaning of natural language processing?

Natural language processing is a branch of artificial intelligence that acts as a bridge between humans and computers; it can be broadly defined as the automatic manipulation of natural language, like speech and text, by software. There are many ways to represent words in NLP, because text data cannot be used directly by machine learning algorithms.

The first step is to transform raw text into numerical features by vectorizing words, and there are several techniques (a short sketch of all three follows the list):

-Bag of words: a way of extracting features from text as input for machine learning. It represents a document by the occurrence of words from a vocabulary built over the corpus, typically encoded as a binary vector. It is called “bag of words” because it does not care about the order or structure of the words in the corpus.

-Hashing trick: a hash function can be used to map data of arbitrary size to a fixed-size vector of numbers. The hashing trick, or feature hashing, consists of applying a hash function to the features and using the hashed values directly as indices: the same input always gives the same output. A binary score or a count can then be used to score each word. A hash function is a one-way process, which can sometimes be a problem because you cannot go back from the output to the input space, and collisions between different inputs can happen.

-TF-IDF: another approach is to rescale the frequency of words by how often they appear across all documents; this approach is called Term Frequency – Inverse Document Frequency. Term Frequency scores how frequently a word appears in the current document; Inverse Document Frequency scores how rare the word is across the corpus. With the tf-idf technique, terms are weighted so that the highest scores highlight the words carrying the most useful information.
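
These three vectorizers are all available in scikit-learn. Below is a minimal sketch under that assumption; the toy corpus and the number of hash features are purely illustrative.

```python
# Vectorizing raw text with scikit-learn: bag of words, hashing trick, tf-idf.
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

corpus = [
    "natural language processing bridges humans and computers",
    "machine learning algorithms cannot use raw text directly",
]

# Bag of words: binary occurrence of each vocabulary term per document
bow = CountVectorizer(binary=True)
X_bow = bow.fit_transform(corpus)

# Hashing trick: terms are mapped to a fixed number of columns by a hash
# function, so no vocabulary is stored (collisions are possible)
hashing = HashingVectorizer(n_features=32, alternate_sign=False)
X_hash = hashing.transform(corpus)

# TF-IDF: term frequency rescaled by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(X_bow.shape, X_hash.shape, X_tfidf.shape)
```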

The second level is word embeddings, where the goal is to generate vectors that encode semantics: individual words are represented by vectors in a predefined vector space. For word embeddings, too, there are several techniques (sketches follow the list):

-Word2vec: a neural network that tries to maximize the probability of seeing a word within a context window, with the similarity between two word vectors measured by cosine similarity. This task is achieved by two learning models: the continuous bag-of-words (CBOW) model, which tries to predict a word from its context, and the continuous skip-gram model, which tries to predict the context from a word (see the word2vec sketch after this list).

-GloVe: an evolution of the word2vec idea, it constructs an explicit word-context, or word co-occurrence, matrix using statistics across the whole text corpus instead of predicting words, which boosts results in terms of computation, and it also uses cosine similarity to compare vectors (see the GloVe sketch after this list).

-FastText: a library for learning word embeddings and text classification created by Facebook's AI Research (FAIR) lab. It reaches the same accuracy as the previous models but with better performance, which can be explained by this relationship (see the fastText sketch after this list):

FastText : Word Embeddings = XGBoost : Random Forest
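
A minimal word2vec sketch, assuming gensim 4.x and a tiny tokenized toy corpus chosen here for illustration:

```python
# Training word2vec with gensim: sg=0 selects CBOW, sg=1 selects skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["natural", "language", "processing"],
    ["machine", "learning", "on", "text"],
    ["language", "models", "learn", "word", "vectors"],
]

# CBOW: predict a word from its context window
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
# Skip-gram: predict the context from a word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Cosine similarity between two learned word vectors
print(cbow.wv.similarity("language", "word"))
```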
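
GloVe vectors are usually consumed pre-trained rather than retrained. A minimal sketch, assuming gensim's downloader API and one of its published GloVe sets:

```python
# Loading pre-trained GloVe vectors (built from word co-occurrence statistics)
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors

# Cosine similarity between co-occurrence-based word vectors
print(glove.similarity("king", "queen"))
print(glove.most_similar("paris", topn=3))
```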
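
A minimal fastText sketch, again assuming gensim; thanks to character n-grams it can also build a vector for a word never seen in training:

```python
# Training fastText embeddings with gensim (subword-aware word vectors)
from gensim.models import FastText

sentences = [
    ["natural", "language", "processing"],
    ["fasttext", "uses", "character", "ngrams"],
]

model = FastText(sentences, vector_size=50, window=2, min_count=1)

# Unlike word2vec, a vector can be composed for an out-of-vocabulary word
print(model.wv["languages"][:5])
```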

The last level is sentence embeddings; with this approach the goal is to represent more than single words as vectors, and in this case, too, several models are available (sketches follow the list):

-Doc2vec: it works in the same way as word2vec, but the network is trained on paragraphs and words together, so a sentence or document can be thought of as another word, one that is document-unique. There are two variants: Distributed Memory (DM), similar to the CBOW model, and Distributed Bag of Words (DBOW), similar to skip-gram in word2vec (see the doc2vec sketch after this list).

-CNNs: Convolutional Neural Networks were born for computer vision and have more recently been applied to problems in Natural Language Processing. They are basically composed of several layers of convolutions with nonlinear activation functions applied to the results. Convolutions are used over the input layer to compute the output, which results in local connections, where each region of the input is connected to a neuron of the output. Each layer applies different filters and combines their results. For text, the process starts by stacking word vectors together into a matrix; filters scan windows of words, max pooling highlights the most important features, and an LSTM layer can be added to keep the word order (see the CNN sketch after this list).

-LSTM: Long Short-Term Memory networks are fancy Recurrent Neural Networks with some additional features, among which a memory cell carried across time steps. One application of RNNs is Google search, which links a query to an item. The LSTM layer turns the input words into a new output that gives relevance to word order, and the subsequent filter layers give relevance to the most important local features (see the LSTM sketch after this list).
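
A minimal doc2vec sketch, assuming gensim 4.x; each document gets its own tag, and the toy documents are purely illustrative:

```python
# Learning document vectors alongside word vectors with gensim's Doc2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["natural", "language", "processing"], tags=["doc_0"]),
    TaggedDocument(words=["sentence", "level", "vectors"], tags=["doc_1"]),
]

# dm=1 -> Distributed Memory (similar to CBOW); dm=0 -> DBOW (similar to skip-gram)
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=20)

# The learned vector for the first document
print(model.dv["doc_0"][:5])
```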
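
A minimal sketch of a convolutional text classifier in Keras, assuming a binary label; the vocabulary size, sequence length, and layer sizes are illustrative choices, not the speaker's exact architecture:

```python
# 1D CNN over embedded word sequences: filters scan word windows,
# global max pooling keeps the strongest filter responses.
from tensorflow.keras import layers, models

vocab_size, seq_len = 10_000, 100

cnn = models.Sequential([
    layers.Input(shape=(seq_len,)),                           # word-index sequences
    layers.Embedding(input_dim=vocab_size, output_dim=100),   # stack word vectors into a matrix
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                    # binary classification head
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn.summary()
```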
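
And a minimal LSTM counterpart under the same assumptions, where the recurrent layer reads the embedded words in order and carries a memory cell across time steps:

```python
# LSTM over embedded word sequences: an order-aware encoding of the text
from tensorflow.keras import layers, models

vocab_size, seq_len = 10_000, 100

lstm = models.Sequential([
    layers.Input(shape=(seq_len,)),                           # word-index sequences
    layers.Embedding(input_dim=vocab_size, output_dim=100),
    layers.LSTM(64),                                          # order-aware sequence encoding
    layers.Dense(1, activation="sigmoid"),                    # binary classification head
])
lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
lstm.summary()
```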

After the presentation, a demo was shown using a dataset from Kaggle and GloVe vectors, with the repository code available on GitHub.

Author: Claudio Giancaterino

Data science is the new gold
Actuary & Data Science Enthusiast
