In this section I have explored Siamese networks applied to natural language processing and fundamentals of Trax. I learned how to implement models with different architectures.
In this section, we will:
- Learn about Siamese networks
- Understand how the triplet loss works
- Understand how to evaluate accuracy
- Use cosine similarity between the model's outputted vectors
- Use the data generator to get batches of questions
- Predict using our own model
We will start by preprocessing the data and padding the questions. We will then build a classifier that identifies whether two questions are the same or not. Our model will take in the two question embeddings, run them through an LSTM, and then compare the outputs of the two subnetworks using cosine similarity.
We will be using the Quora question answer dataset to build a model that could identify similar questions. This is a useful task because we don't want to have several versions of the same question posted.
You will now convert every question to a tensor, or an array of numbers, using our vocabulary built above.
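As a rough illustration of this encoding step, the sketch below builds a small vocabulary and maps a question to a list of integers. It makes several assumptions that are not in the text above: a plain whitespace split stands in for a real tokenizer, unknown words map to index 0, and the padding token is called '<PAD>' with index 1.

```python
from collections import defaultdict

# Toy training questions; a whitespace split stands in for a real tokenizer.
train_questions = ["how old are you", "what is your age"]

# Unknown words fall back to index 0; '<PAD>' (assumed name) gets index 1.
vocab = defaultdict(lambda: 0)
vocab["<PAD>"] = 1
for question in train_questions:
    for word in question.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

def encode(question, vocab):
    """Map each token of a question to its integer index in the vocabulary."""
    return [vocab[word] for word in question.split()]

# 'cat' was never seen during vocabulary construction, so it maps to 0.
encoded = encode("how old is your cat", vocab)
print(encoded)
```

Words that were not seen while building the vocabulary all collapse to the same index, so the model cannot tell them apart.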
A Siamese network is a neural network which uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. The Siamese network we are about to implement looks like this:
We get the question embedding, run it through an LSTM layer, normalize v_1 and v_2, and finally use a triplet loss (explained below) to get the corresponding cosine similarity for each pair of questions. As usual, we will start by importing the data set.
The triplet loss makes use of a baseline (anchor) input that is compared to a positive (truthy) input and a negative (falsy) input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. In math terms, we are trying to minimize the following loss.
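A standard way to write the triplet loss, using the symbols A, P, N and \alpha that are defined next, is given here. The name f for the function that maps an input to its output vector is an assumption introduced for this formula.

```latex
\mathcal{L}(A, P, N) = \max\left(\lVert f(A) - f(P) \rVert^{2} - \lVert f(A) - f(N) \rVert^{2} + \alpha,\; 0\right)
```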
A is the anchor input, for example q1_1, P the duplicate input, for example, q2_1, and N the negative input (the non duplicate question), for example q2_2.
\alpha is a margin; we can think about it as a safety net, or by how much we want to push the duplicates from the non duplicates.
Instructions: Implement the Siamese function below. We should use all the objects explained below.
To implement this model, we will be using trax. Concretely, we will be using the following functions.
- tl.Serial: Combinator that applies layers serially (by function composition), which lets us set up the overall structure of the feedforward network. We can pass in the layers as arguments to Serial, separated by commas. For example: tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))
- tl.Embedding: Maps discrete tokens to vectors. Its weights have shape (vocabulary length X dimension of output vectors). The dimension of output vectors (also called d_feature) is the number of elements in the word embedding. In tl.Embedding(vocab_size, d_feature), vocab_size is the number of unique words in the given vocabulary and d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
- tl.LSTM: The LSTM layer. It leverages another Trax layer called LSTMCell. The number of units should be specified and should match the number of elements in the word embedding. tl.LSTM(n_units) builds an LSTM layer of n_units.
- tl.Mean: Computes the mean across a desired axis. Mean uses one tensor axis to form groups of values and replaces each group with the mean value of that group. tl.Mean(axis=1) takes the mean over columns.
- tl.Fn: Layer with no weights that applies the function f, which should be specified using a lambda syntax. Here it is used for cosine similarity: tl.Fn('Normalize', lambda x: normalize(x)) returns a layer with no weights that applies the function f.
- tl.Parallel: A combinator layer (like Serial) that applies a list of layers in parallel to its inputs.
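To make the data flow through these layers concrete, here is a framework-free NumPy sketch of the same idea. It is not the Trax implementation: the LSTM is left out, the embedding table is random rather than trained, and the names embedding, encode and siamese are hypothetical. It only shows the embedding lookup, the mean over axis 1, the normalization, and the fact that both inputs go through the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_feature = 20, 8

# One embedding table shared by both branches: the sharing is what makes it "Siamese".
embedding = rng.normal(size=(vocab_size, d_feature))

def normalize(x):
    """Scale each row to unit L2 norm so that dot products equal cosine similarities."""
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

def encode(token_ids):
    """Embedding lookup, then mean over the sequence axis, then L2 normalization."""
    embedded = embedding[token_ids]   # shape (batch, seq_len, d_feature)
    pooled = embedded.mean(axis=1)    # shape (batch, d_feature)
    return normalize(pooled)

def siamese(q1, q2):
    """Apply the same encoder, with the same weights, to both inputs."""
    return encode(q1), encode(q2)

q1 = np.array([[2, 3, 4, 1], [5, 6, 1, 1]])
q2 = np.array([[2, 3, 4, 1], [7, 8, 9, 1]])
v1, v2 = siamese(q1, q2)
print(v1.shape, v2.shape)
```

Because the weights are shared, identical questions always produce identical vectors, and so a cosine similarity of 1.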
We will now implement the TripletLoss.
As explained in the lecture, loss is composed of two terms. One term utilizes the mean of all the non duplicates, the second utilizes the closest negative. Our loss expression is then:
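One way to write the two terms described above is given here. The notation is assumed for this sketch: s(A, P) is the cosine similarity of a duplicate pair, mean_neg is the mean similarity of the anchor to all non duplicates in the batch, and closest_neg is the similarity to the closest non duplicate.

```latex
\begin{aligned}
\mathcal{L}_1 &= \max\left(\text{mean\_neg} - s(A, P) + \alpha,\; 0\right) \\
\mathcal{L}_2 &= \max\left(\text{closest\_neg} - s(A, P) + \alpha,\; 0\right) \\
\mathcal{L} &= \operatorname{mean}\left(\mathcal{L}_1 + \mathcal{L}_2\right)
\end{aligned}
```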
Instructions (Brief): Here is a list of things we should do:
- As this will be run inside trax, use fastnp.xyz when using any xyz numpy function
- Use fastnp.dot to calculate the similarity matrix v_1 v_2^T of dimension batch_size x batch_size
- Take the score of the duplicates on the diagonal with fastnp.diagonal
- Use the trax functions fastnp.eye and fastnp.maximum for the identity matrix and the maximum.
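The steps above can be sketched with plain NumPy in place of fastnp. This is a minimal sketch under stated assumptions, not the graded implementation: the function name triplet_loss, the default margin of 0.25, and the trick of subtracting 2 on the diagonal to exclude the duplicates from the search for the closest negative are all choices made for this example.

```python
import numpy as np

def triplet_loss(v1, v2, margin=0.25):
    """Batch triplet loss on L2-normalized vectors v1, v2 of shape (batch, d)."""
    scores = np.dot(v1, v2.T)                 # (batch, batch) similarity matrix
    batch_size = len(scores)
    positive = np.diagonal(scores)            # similarity of each duplicate pair
    # Push the diagonal below any possible cosine value so it is never the row maximum.
    negative_without_positive = scores - 2.0 * np.eye(batch_size)
    closest_negative = negative_without_positive.max(axis=1)
    # Zero the diagonal so duplicates do not contribute to the mean of the negatives.
    negative_zero_on_duplicate = scores * (1.0 - np.eye(batch_size))
    mean_negative = np.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)
    loss_closest = np.maximum(0.0, margin - positive + closest_negative)
    loss_mean = np.maximum(0.0, margin - positive + mean_negative)
    return float(np.mean(loss_closest + loss_mean))

# Duplicates perfectly aligned, non duplicates orthogonal: no loss.
aligned = np.eye(2)
loss_good = triplet_loss(aligned, aligned)

# Each anchor matches the wrong question exactly: large loss.
swapped = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_bad = triplet_loss(aligned, swapped)
print(loss_good, loss_bad)
```

The two toy batches show the extremes: the loss is zero when every duplicate pair beats all negatives by at least the margin, and it grows when a negative is more similar to the anchor than the duplicate is.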
Now we are going to train our model. As usual, we have to define the cost function and the optimizer. We also have to feed in the built model. Before going into the training, we will use a special data setup. We will define the inputs using the data generator we built above. The lambda function acts as a seed to remember the last batch that was given. Run the cell below to get the question pairs inputs.
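For readers who want a picture of what such a data generator does, here is a simplified sketch. It is hypothetical: the name simple_data_generator, the pad id of 1, and padding to the longest question in each batch are assumptions for this example, and it does not shuffle.

```python
import numpy as np

def simple_data_generator(Q1, Q2, batch_size, pad=1):
    """Yield (q1_batch, q2_batch) padded arrays forever, cycling through the data."""
    idx = 0
    n = len(Q1)
    while True:
        batch1, batch2 = [], []
        for _ in range(batch_size):
            batch1.append(Q1[idx])
            batch2.append(Q2[idx])
            idx = (idx + 1) % n   # wrap around at the end of the data
        # Pad both sides of the batch to one common length.
        max_len = max(len(q) for q in batch1 + batch2)
        b1 = np.array([q + [pad] * (max_len - len(q)) for q in batch1])
        b2 = np.array([q + [pad] * (max_len - len(q)) for q in batch2])
        yield b1, b2

Q1 = [[4, 5, 6], [7]]
Q2 = [[8, 9], [10, 11, 12, 13]]
gen = simple_data_generator(Q1, Q2, batch_size=2)
b1, b2 = next(gen)
print(b1)
print(b2)
```

Since the generator keeps its position between calls to next, each call continues from where the previous batch ended.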
We will now write a function that takes in our model and trains it. To train our model we have to decide how many times we want to iterate over the entire data set; each iteration is defined as an epoch. For each epoch, we have to go over all the data, using our training iterator.
Instructions: Implement the train_model below to train the neural network above. Here is a list of things we should do, as already shown in lecture 7:
- Create TrainTask and EvalTask
- Create the training loop trax.supervised.training.Loop
- Pass in the following depending on the context (train_task or eval_task):
  - labeled_data=generator
  - metrics=[TripletLoss()]
  - loss_layer=TripletLoss()
  - optimizer=trax.optimizers.Adam with learning rate of 0.01
  - lr_schedule=lr_schedule
  - output_dir=output_dir
We will be using our triplet loss function with the Adam optimizer. Please read the trax documentation to get a full understanding.
This function should return a training.Loop object. To read more about this check the docs.
In this section we will learn how to evaluate a Siamese network. We will start by loading a pretrained model and then we will use it to predict.
To determine the accuracy of the model, we will utilize the test set that was configured earlier. While in training we used only positive examples, the test data, Q1_test, Q2_test and y_test, is set up as pairs of questions, some of which are duplicates and some of which are not. This routine will run all the test question pairs through the model, compute the cosine similarity of each pair, threshold it, and compare the result to y_test, the correct response from the data set. The results are accumulated to produce an accuracy.
Instructions
- Loop through the incoming data in batch_size chunks
- Use the data generator to load q1, q2 a batch at a time. Don't forget to set shuffle=False!
- Copy a batch_size chunk of y into y_test
- Compute v1, v2 using the model
- For each element of the batch:
  - compute the cosine similarity d of each pair of entries, v1[j], v2[j]
  - determine if d > threshold
  - increment accuracy if that result matches the expected result (y_test[j])
- Compute the final accuracy and return it
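The evaluation loop above can be sketched as follows. This is a simplified version under stated assumptions: it slices and pads the batches directly instead of calling the data generator, and toy_model is a hypothetical stand-in for the trained network that returns normalized bag-of-tokens vectors, so the example runs without Trax.

```python
import numpy as np

def pad_batch(questions, pad=1):
    """Right-pad encoded questions with the pad id to the longest length in the batch."""
    max_len = max(len(q) for q in questions)
    return np.array([q + [pad] * (max_len - len(q)) for q in questions])

def toy_model(inputs, vocab_size=10, pad=1):
    """Hypothetical stand-in for the trained network: normalized bag-of-tokens vectors."""
    def encode(batch):
        counts = np.zeros((len(batch), vocab_size))
        for row, tokens in enumerate(batch):
            for tok in tokens:
                if tok != pad:
                    counts[row, tok] += 1.0
        return counts / np.linalg.norm(counts, axis=1, keepdims=True)
    q1, q2 = inputs
    return encode(q1), encode(q2)

def classify(test_Q1, test_Q2, y, threshold, model, batch_size=2, pad=1):
    """Return the fraction of pairs whose thresholded cosine similarity matches y."""
    correct = 0
    for i in range(0, len(test_Q1), batch_size):
        q1 = pad_batch(test_Q1[i:i + batch_size], pad)
        q2 = pad_batch(test_Q2[i:i + batch_size], pad)
        y_batch = y[i:i + batch_size]
        v1, v2 = model((q1, q2))
        for j in range(len(y_batch)):
            d = float(np.dot(v1[j], v2[j]))   # cosine similarity of unit vectors
            correct += int(int(d > threshold) == y_batch[j])
    return correct / len(test_Q1)

Q1 = [[2, 3], [4, 5, 6], [7]]
Q2 = [[2, 3], [8, 9], [7]]
y = [1, 0, 0]   # the third label is deliberately "wrong" for the toy model
accuracy = classify(Q1, Q2, y, threshold=0.7, model=toy_model)
print(accuracy)
```

The threshold is a free parameter: raising it makes the classifier more conservative about calling two questions duplicates.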
Due to some limitations of this environment, running classify multiple times may result in the kernel failing. If that happens, choose Restart Kernel & Clear Output and then run from the top. During development, consider using a smaller set of data to reduce the number of calls to model().
In this section we will test the model with our own questions. We will write a function predict which takes two questions as input and returns 1 or 0 depending on whether the question pair is a duplicate or not.
But first, we build a reverse vocabulary that allows us to map encoded questions back to words:
Write a function predict that takes in two questions, the model, and the vocabulary and returns whether the questions are duplicates (1) or not duplicates (0) given a similarity threshold.
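A reverse vocabulary can be built by swapping keys and values. The sketch below uses a small hand-written vocab for illustration; the helper name decode and the '<UNK>' placeholder for unknown indices are assumptions.

```python
# Small illustrative vocabulary mapping words to indices.
vocab = {"<PAD>": 1, "how": 2, "old": 3, "are": 4, "you": 5}

# Swap keys and values to map indices back to words.
rev_vocab = {index: word for word, index in vocab.items()}

def decode(encoded_question, rev_vocab):
    """Map a list of indices back to words; unknown indices become '<UNK>'."""
    return [rev_vocab.get(index, "<UNK>") for index in encoded_question]

decoded = decode([2, 3, 4, 5, 1], rev_vocab)
print(decoded)
```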
Instructions:
- Tokenize our question using nltk.word_tokenize
- Create Q1, Q2 by encoding our questions as a list of numbers using vocab
- Pad Q1, Q2 with next(data_generator([Q1], [Q2],1,vocab['']))
- Use model() to create v1, v2
- Compute the cosine similarity (dot product) of v1, v2
- Compute res by comparing d to the threshold
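A minimal sketch of these steps follows. Several things are assumptions made so the example runs on its own: a lowercase whitespace split replaces nltk.word_tokenize, padding is done inline rather than through the data generator, the padding token is assumed to be '<PAD>', unknown words map to 0, and toy_model is a hypothetical stand-in for the trained network.

```python
import numpy as np

def toy_model(inputs, vocab_size=16, pad=1):
    """Hypothetical stand-in for the trained network: normalized bag-of-tokens vectors."""
    def encode(batch):
        counts = np.zeros((len(batch), vocab_size))
        for row, tokens in enumerate(batch):
            for tok in tokens:
                if tok != pad:
                    counts[row, tok] += 1.0
        return counts / np.linalg.norm(counts, axis=1, keepdims=True)
    q1, q2 = inputs
    return encode(q1), encode(q2)

def predict(question1, question2, threshold, model, vocab, pad_token="<PAD>"):
    """Return 1 if the two questions are judged duplicates, otherwise 0."""
    # Tokenize (whitespace split as a stand-in) and encode with the vocabulary.
    Q1 = [vocab.get(word, 0) for word in question1.lower().split()]
    Q2 = [vocab.get(word, 0) for word in question2.lower().split()]
    # Pad both questions to the same length and add a batch dimension of 1.
    pad = vocab[pad_token]
    max_len = max(len(Q1), len(Q2))
    Q1 = np.array([Q1 + [pad] * (max_len - len(Q1))])
    Q2 = np.array([Q2 + [pad] * (max_len - len(Q2))])
    v1, v2 = model((Q1, Q2))
    d = float(np.dot(v1[0], v2[0]))   # cosine similarity of unit vectors
    return int(d > threshold)

vocab = {"<PAD>": 1, "how": 2, "old": 3, "are": 4, "you": 5,
         "what": 6, "is": 7, "your": 8, "age": 9}
same = predict("How old are you", "how old are you", 0.7, toy_model, vocab)
different = predict("how old are you", "what is your age", 0.7, toy_model, vocab)
print(same, different)
```

The second pair shows the limit of the toy model: it shares no words, so a bag-of-tokens encoder scores it as different even though the meaning is the same, which is the kind of case the trained Siamese network is meant to handle.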
Siamese networks are important and useful. Questions on Quora and other platforms have often been asked before, and we can use Siamese networks to detect and avoid duplicate questions.





