How to feed text data to a machine?
By converting words/text into vector form. But while converting text into vectors, we must ensure that the true similarity between texts is preserved, i.e., semantically similar texts should map to nearby vectors.
Different Techniques:
- BOW (Bag of Words)
- TF-IDF
- W2V (Word2Vec)
- Average W2V
- TF-IDF weighted W2V
Bag of Words:
The first step is to construct a dictionary (vocabulary): the set of all unique words in the corpus.
Each sentence is then converted into a d-dimensional vector, where d is the number of unique words in the dictionary. Each cell of the vector corresponds to one word of the dictionary.
The vector can be very sparse because a real-world corpus contains a large number of unique words.
For example, for the sentence "This pasta is very tasty and affordable", the cell corresponding to the word "this" is given the value 1, at the position of "this" among the d unique words. Here 1 indicates how many times the word "this" appeared in the review we are currently processing.
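A minimal sketch of count-based BOW using scikit-learn's CountVectorizer (the review sentences are illustrative, not from a real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative reviews (hypothetical examples).
reviews = [
    "This pasta is very tasty and affordable",
    "This pasta is very very tasty",
]

vectorizer = CountVectorizer()            # builds the dictionary of unique words
bow = vectorizer.fit_transform(reviews)   # sparse d-dimensional count vectors

print(vectorizer.get_feature_names_out()) # the d unique words (the "dictionary")
print(bow.toarray())                      # each row is one review's count vector
```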
Does the Similarity rule stand well?
No. In Bag of Words, consider an example like the one below.
Though the two sentences are almost opposite in meaning, with BOW the distance between their vectors is not large. From this we can see that BOW essentially only tells us how different two sentences are in terms of the words they use, not in meaning.
NOTE:
Another variation of BOW is Binary BOW, where instead of counting the occurrences of each word we simply place 1 or 0 based on whether the word exists in the sentence.
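The same CountVectorizer sketch from above can produce Binary BOW by passing binary=True:

```python
from sklearn.feature_extraction.text import CountVectorizer

binary_vectorizer = CountVectorizer(binary=True)   # 1/0 presence instead of counts
binary_bow = binary_vectorizer.fit_transform(["This pasta is very very tasty"])
print(binary_bow.toarray())                        # "very" contributes 1, not 2
```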
Techniques to improve the text representation:
STOP-WORD REMOVAL:
In English, we have words like "this", "is", "an", "the", "and", etc. which do not contribute much meaning to a sentence, especially when we use techniques like BOW, so we remove them.
BRINGING EVERYTHING TO LOWER CASE:
Words like "Pasta" and "pasta" would otherwise be treated as different words, making the vector larger and less meaningful, so we convert everything to lower case.
STEMMING:
In English, words like "beautiful" and "beauty" convey essentially the same meaning, so we can reduce them to a common root form. This approach is called stemming.
Types of Stemming:
- Porter Stemmer
- Snowball Stemmer
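A small sketch using NLTK's Porter and Snowball stemmers (assuming NLTK is installed):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["beautiful", "beauty", "tasty", "tastes"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))

# Both stemmers reduce "beautiful" and "beauty" to a common root (e.g. "beauti"),
# so the two words map to the same cell in the BOW vector.
```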
Lemmatization:
Lemmatization is similar to stemming, but it maps each word to its dictionary form (lemma) using vocabulary and morphology (e.g. "studies" → "study"), so the output is always a valid word rather than a chopped-off root.
Uni-gram/Bi-gram/N-gram:
With simple stop-word removal we face the following issue:
In English, words like 'very' and 'not' are also treated as stopwords, which causes problems in scenarios like the one above. If we remove both 'not' and 'very' from two sentences such as "this pasta is very tasty" and "this pasta is not tasty", their vectors become identical, which should not happen. To avoid this we use the concept of bi-grams and n-grams.
Bi-grams:
In a bi-gram representation, each pair of consecutive words becomes one cell of the d-dimensional vector. Similarly, we have n-grams. But note that dimensionality increases quickly when we use n-grams.
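A sketch of uni-gram plus bi-gram BOW with CountVectorizer; note how the bi-gram "not tasty" keeps the two illustrative sentences distinguishable:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["this pasta is tasty", "this pasta is not tasty"]   # illustrative pair

# ngram_range=(1, 2) keeps uni-grams and adds bi-grams such as "not tasty"
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectors = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(vectors.toarray())   # the rows now differ thanks to the bi-gram cells
```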
TF-IDF:
TF-IDF (term frequency - inverse document frequency) weights each word by how often it occurs inside the current document (TF) and how rare it is across the whole corpus (IDF ≈ log(N / n_w), where N is the number of documents and n_w the number of documents containing word w). Common but uninformative words get low weights, while rare, discriminative words get high weights.
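A minimal TF-IDF sketch with scikit-learn (the review sentences are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "this pasta is tasty and affordable",
    "this pasta is not tasty",
    "the pizza is affordable",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(reviews)    # rows are TF-IDF weighted vectors

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))         # words appearing in every review, like "is", get the lowest IDF
```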
Database and Representation:
Our NLP algorithms rely on a vocabulary consisting of a set of English words, where <UNK> denotes any unknown word that is not in the vocabulary.
One-hot encoding is one way to represent these words in a format the system can process: each word becomes a vector with a 1 at its index in the vocabulary and 0 everywhere else.
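A sketch of one-hot encoding over a tiny hypothetical vocabulary:

```python
import numpy as np

# Tiny illustrative vocabulary; a real one could hold 10,000+ words plus <UNK>.
vocab = ["a", "glass", "i", "juice", "of", "orange", "want", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index.get(word, word_to_index["<UNK>"])] = 1.0
    return vec

print(one_hot("orange"))    # 1 only at the index of "orange"
print(one_hot("durian"))    # an unseen word falls back to <UNK>
```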
The issues with the above representation:
Due to this kind of representation, the model treats every word as an isolated, unrelated symbol and cannot learn the relations among them (the inner product of any two different one-hot vectors is zero). For example, even if the model is trained on the sentence "I want a glass of orange juice", it cannot generalize and complete a question like "I want a glass of apple ____" with "juice".
So we need a more efficient representation that lets the model make this kind of inference.
Alternative representation (featurized representation):
In the image above, we can see that instead of representing words as simple one-hot vectors, we can represent them with feature vectors. In the image, assume the vocabulary has 10,000 words and the words considered are man, woman, king, queen, apple, etc.
The features can be anything. For example, gender could be one feature: for man the value could be -1 and for woman +1; for king, gender is clearly related to man, so the value is close to -0.95; for apple, gender does not apply, so the value is close to 0.
In this way, every feature gets a value for each word in the vocabulary.
Here the author assumes 300 features and a vocabulary of 10,000 words, so the embedding matrix has size 300 × 10,000.
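A sketch of how the embedding matrix E relates to one-hot vectors: multiplying E by a word's one-hot vector just selects that word's column (in practice this is done by indexing rather than matrix multiplication). Sizes and the word index are illustrative assumptions.

```python
import numpy as np

n_features, vocab_size = 300, 10_000
rng = np.random.default_rng(0)
E = rng.normal(size=(n_features, vocab_size))     # embedding matrix (to be learned)

orange_index = 6257                                # hypothetical index of "orange"
o_orange = np.zeros(vocab_size)
o_orange[orange_index] = 1.0                       # one-hot vector for "orange"

e_orange = E @ o_orange                            # 300-dim embedding of "orange"
assert np.allclose(e_orange, E[:, orange_index])   # same as simply taking a column of E
```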
Tackling the problem:
Now, since the vectors representing apple and orange are very similar, there is a high chance that the model can make this kind of inference (e.g. completing "apple ____" with "juice").
Note that these feature values are not hand-designed; it is our task to learn them during training.
When we visualize these feature vectors they can be seen as follows
Before plotting, the 300-dimensional vectors are reduced to 2D (typically with an algorithm such as t-SNE). We can then see how related words group together in clusters.
How these embeddings are useful in NLP:
Sometimes words like "durian cultivator" may not appear in our training data, but because durian is close to fruits like apple and orange in the embedding space, and cultivator is close to farmer, we can still solve these named entity recognition problems.
These embeddings can be trained on your own dataset if it is large enough; otherwise you can use transfer learning and reuse embeddings pre-trained on a large corpus.
Analogies of word embeddings:
Given these embeddings, we can answer analogy questions such as "man is to woman as king is to ____". Using vector differences, we can simply identify the matching word.
The differences are not exactly equal, but they are very close. Formally, we look for the word w that maximizes sim(e_w, e_king − e_man + e_woman).
The most commonly used similarity metric is cosine similarity. We can also use Euclidean distance, but then we minimize the distance instead of maximizing the similarity.
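A sketch of the analogy search with cosine similarity over a toy embedding dictionary (random vectors here, purely illustrative):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings; real ones would come from training or a pre-trained model.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}

def solve_analogy(a: str, b: str, c: str) -> str:
    """Find the word w maximizing sim(e_w, e_c - e_a + e_b), i.e. a : b :: c : w."""
    target = embeddings[c] - embeddings[a] + embeddings[b]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

print(solve_analogy("man", "woman", "king"))   # with trained embeddings: "queen"
```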
Use cases:
Learning the word embedding matrix:
How we use the embedding matrix:
As shown in the image, to train on a sentence like "I want a glass of orange ____", each context word is looked up in the embedding matrix, the resulting embedding vectors are concatenated into one input vector, and this is fed to a neural network with a softmax output over the vocabulary. During training, both the network weights and the embedding matrix itself are learned.
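A minimal sketch (in PyTorch, with made-up sizes) of this fixed-window neural language model: look up embeddings, concatenate them, and predict the blank word with a softmax.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 10_000, 300, 6    # illustrative sizes

class FixedWindowLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # the matrix E being learned
        self.hidden = nn.Linear(context_len * embed_dim, 128)
        self.out = nn.Linear(128, vocab_size)                  # softmax over the vocabulary

    def forward(self, context_ids):                  # context_ids: (batch, context_len)
        e = self.embedding(context_ids)              # (batch, context_len, embed_dim)
        e = e.flatten(start_dim=1)                   # concatenate the context embeddings
        return self.out(torch.relu(self.hidden(e)))  # logits; CrossEntropyLoss applies softmax

model = FixedWindowLM()
dummy_context = torch.randint(0, vocab_size, (2, context_len))  # e.g. "I want a glass of orange"
logits = model(dummy_context)                        # predict the blank word
print(logits.shape)                                  # (2, 10000)
```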
Training:
Continuous BOW:
In general, if the context window size is 5, we take the middle word as the target and the surrounding words as the input. In the image above, the 5×7 matrix before the softmax layer is the word embedding matrix obtained after training: for a vocabulary of 7 words, we use 5 embedding features per word.
Note that the weights of this layer are nothing but the word embedding matrix that we extract from the trained network.
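A sketch of how CBOW training pairs are built from a sentence (window size and sentence are illustrative): the surrounding words are the input and the middle word is the target.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for CBOW training."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        if context:
            yield context, target

sentence = "i want a glass of orange juice".split()
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
# e.g. ['a', 'glass', 'orange', 'juice'] -> 'of'
```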
Skip-Grams:
As we can see in the picture above, we first pick a word (e.g. "orange") as the context. Then, within a window of ± N words around it (for example ± 4), we pick another word at random, keep it as the target, and train the model to predict that target from the context word.
In general, it can be visualized as above, which is the reverse of Continuous BOW.
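A sketch of skip-gram pair sampling, the mirror image of the CBOW sketch above: one context word in, one randomly chosen nearby word as the target.

```python
import random

def skipgram_pairs(tokens, window=4, pairs_per_word=2, seed=0):
    """Yield (context_word, target_word) pairs for skip-gram training."""
    rng = random.Random(seed)
    for i, context in enumerate(tokens):
        neighbours = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        for _ in range(min(pairs_per_word, len(neighbours))):
            yield context, rng.choice(neighbours)    # random word within +/- window

sentence = "i want a glass of orange juice to go along with my cereal".split()
for context, target in list(skipgram_pairs(sentence))[:5]:
    print(context, "->", target)
```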
Problems with the traditional skip-gram model:
Computing the softmax over the entire vocabulary is expensive, and it has to be done again and again for every (context, target) pair. So we can use a hierarchical softmax to reduce this cost.
Negative Sampling:
In this approach, we take a context word and one word from within a ± 10-word span around it, make that the target word, and label the pair 1 (positive). Then, for the same context word, we pick k random words from the dictionary and label those pairs 0 (negative). The k value is chosen as shown in the image (roughly, larger for small datasets and smaller for large ones). The working of this model is as follows:
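A sketch of the negative-sampling objective: instead of a full softmax, train k + 1 independent logistic classifiers per context word. Sizes, the uniform negative sampling, and the training loop are simplified assumptions for illustration; the word indices in the last line are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, k, lr = 10_000, 300, 5, 0.01
E_context = rng.normal(0, 0.1, (vocab_size, embed_dim))   # context ("input") embeddings
E_target = rng.normal(0, 0.1, (vocab_size, embed_dim))    # target ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(context_id, positive_id):
    """One SGD step: 1 positive pair plus k random negative pairs."""
    negative_ids = rng.integers(0, vocab_size, size=k)     # simplified uniform sampling
    for target_id, label in [(positive_id, 1.0)] + [(n, 0.0) for n in negative_ids]:
        c = E_context[context_id].copy()
        score = sigmoid(c @ E_target[target_id])            # logistic classifier for this pair
        grad = score - label                                # gradient of the logistic loss
        E_context[context_id] -= lr * grad * E_target[target_id]
        E_target[target_id] -= lr * grad * c

train_pair(context_id=6257, positive_id=4834)               # e.g. ("orange", "juice")
```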
GloVe Algorithm:
NOTE: the major takeaway here is that the GloVe algorithm focuses on the global statistics of words. In skip-gram and CBOW we get word embeddings based only on the local context surrounding each word, whereas GloVe considers how every pair of words co-occurs across the entire corpus.
This helps to tackle the scenarios below:
Assume that the word "apple" is used as a tech company in the initial sentences, But later its context might be a fruit. if we consider a context-based scenario we may not be able to properly catch its multiple context existence.
The main idea of the glove algorithm is to consider how many times a target word has appeared in the span of n words around the context word. That is represented in the above image. In the above image, Xij = Xji happens when the span is large. if span is the immediate next value then this can hold true.
This is how the co-occurrence matrix is calculated. Now let's see how these values help us understand context.
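A sketch of building the co-occurrence matrix X from a toy corpus with a symmetric window (sentences and window size are illustrative):

```python
from collections import defaultdict

def co_occurrence(corpus, window=2):
    """Count X[i][j]: how often word j appears within +/- window of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    X[word][tokens[j]] += 1
    return X

corpus = ["i want a glass of orange juice", "i want a glass of apple juice"]
X = co_occurrence(corpus)
print(X["orange"]["juice"], X["juice"]["orange"])   # symmetric window => equal counts
```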
Assume these values come from a large training corpus.
The model is defined and trained as follows:
Here our objective is to minimize, over all word pairs i and j, the quantity f(X_ij) · (θ_iᵀ e_j + b_i + b'_j − log X_ij)², by adjusting the embedding parameters θ and e (and the biases) for each pair of words i and j. The weighting function f(x) is zero when X_ij = 0, which prevents log(X_ij) from being undefined; apart from that, it also assigns appropriate weights so that very frequent words do not dominate and rare words are not over-weighted.
Since θ and e play symmetric roles, the final embedding of each word can be taken as the average of its two vectors: e_w^(final) = (e_w + θ_w) / 2.
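A sketch of the GloVe objective and one gradient step in NumPy (tiny illustrative sizes; the biases are included as in the formula above, and the final line shows the θ/e averaging):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, lr = 7, 5, 0.05
theta = rng.normal(0, 0.1, (vocab_size, embed_dim))
e = rng.normal(0, 0.1, (vocab_size, embed_dim))
b, b_prime = np.zeros(vocab_size), np.zeros(vocab_size)

def f(x, x_max=100, alpha=0.75):
    """GloVe weighting function: 0 when x == 0, capped at 1 for very frequent pairs."""
    return (x / x_max) ** alpha if 0 < x < x_max else float(x > 0)

def step(i, j, X_ij):
    """One SGD step on the term f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2."""
    if X_ij == 0:
        return
    diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X_ij)
    grad = 2 * f(X_ij) * diff
    theta_i_old = theta[i].copy()
    theta[i] -= lr * grad * e[j]
    e[j] -= lr * grad * theta_i_old
    b[i] -= lr * grad
    b_prime[j] -= lr * grad

step(i=2, j=5, X_ij=17)                    # hypothetical co-occurrence count
embedding = (theta + e) / 2                # final embeddings: average of theta and e
```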
NOTE: the individual dimensions of the embeddings learned in practice are not human-interpretable features like "gender" or "royalty"; they are arbitrary directions in the embedding space.
Sentiment analysis use case:
The simplest approach for this use case is the averaging method shown below.
But the problem with this simple approach is that, for a sentence where a positive word like "good" is repeated multiple times even though the overall meaning is negative, the average is misinterpreted as positive sentiment.
So we can use an RNN instead, which takes word order into account and avoids this problem.
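A sketch of the simple averaging baseline discussed above: average the word embeddings and feed the result to a softmax classifier (the embeddings and classifier weights are random placeholders, and the rating classes are an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_classes = 50, 5                       # e.g. 1-5 star ratings
embeddings = {w: rng.normal(size=embed_dim)
              for w in "completely lacking in good taste service and ambience".split()}
W, b = rng.normal(size=(n_classes, embed_dim)), np.zeros(n_classes)   # untrained classifier

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def predict(sentence):
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    avg = np.mean(vectors, axis=0)                 # word order is lost at this point
    return softmax(W @ avg + b)                    # probabilities over the rating classes

print(predict("completely lacking in good taste good service and good ambience"))
# Repeating "good" pulls the average toward "good" regardless of "lacking".
```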
BIAS PROBLEMS:
Solution:
Note that, to identify the right words to neutralize, we train a simple classifier that finds the words (doctor, babysitter, ...) that are not intrinsically gendered but could possibly be affected by bias.
The neutralization step, in simple words, projects words like doctor and babysitter onto the unbiased axis (the direction orthogonal to the bias direction) so that the bias component no longer affects them.
Similarly, the equalization step makes word pairs like grandmother and grandfather equidistant from the unbiased axis, so that both relate equally to words like babysitter. As we can see, the initial (blue) points show that grandmother is closer to babysitter than grandfather is, which creates bias; moving the pair to the violet points makes them equidistant from "babysitter".
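A sketch of the neutralize step under the usual formulation (subtract the projection onto the bias direction g); the vectors here are random placeholders rather than trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["he", "she", "doctor", "babysitter"]}

g = emb["she"] - emb["he"]                     # (rough) gender/bias direction

def neutralize(word_vec, bias_dir):
    """Remove the component of word_vec that lies along the bias direction."""
    projection = (word_vec @ bias_dir) / (bias_dir @ bias_dir) * bias_dir
    return word_vec - projection

for word in ["doctor", "babysitter"]:          # words flagged by the bias classifier
    emb[word] = neutralize(emb[word], g)
    print(word, "bias component after neutralizing:", round(float(emb[word] @ g), 6))
# Both are ~0: the words now lie on the unbiased axis.
```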
SUMMARY OF WORD EMBEDDING:
- Though all these word-embedding concepts seem overwhelming at first, let's break everything into a short summary.
- Initially, building an NLP language model was framed as: "given a sequence of words, what could the next word possibly be?".
- The basic word-to-vector approaches (one-hot encoding, plain BOW) failed to capture semantic and syntactic meaning and to generalize across contexts.
- So a different approach, the feature vector or word embedding vector, was introduced to solve this.
- Then the question arises of how it is trained. For this, two approaches were defined, Continuous BOW and Skip-gram, and later GloVe.
- CBOW and skip-gram are the reverse of each other; their training processes can be viewed above. One simple catch: although the training is done through a neural network, we do not really care about the network's predictions; we extract the learned weights as the embedding matrix.
- However, these algorithms are computationally expensive, so hierarchical softmax and negative sampling were introduced to tackle that.
- Later, to capture how a word's context can differ across the entire corpus, GloVe was introduced, based on global co-occurrence counts, together with a weighting function to control the contribution of frequent and rare word pairs.
- Eventually, bias issues (e.g. gender bias) were found in the embeddings, and techniques such as neutralization and equalization were developed to address them.