How To Create a ChatBot With tf-seq2seq For Free!

Viacheslav Kovalevskyi
Deep Learning as I See It
6 min read · Jun 2, 2018


written by: Viacheslav Kovalevskyi, Gautam Goswami and Younghee Kwon. Disclaimer: Our opinions are our own.

tf-seq2seq is a new framework based on TensorFlow that can be used for a variety of tasks where seq2seq models are useful. Let me quote the authors of the framework:

tf-seq2seq is a general-purpose encoder-decoder framework for Tensorflow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.

In this article we will be using it to train a chatbot. More precisely, we will be following the neural machine translation (NMT) tutorial. If you are wondering how an NMT model can be used for a chatbot, please see my previous article (“Own ChatBot Based on Recurrent Neural Network for 6$/6 hours and ~100 lines of code”).

At this point one may ask: okay, I see what this is all about, but how is it possible to train a model for free? Google recently announced that they are giving away one Nvidia K80 GPU for 12 hours for free with their new service, Colab. Essentially, Colab is a custom version of Jupyter Notebook.

So looks like we have both components:

  • a model that we want to train, and
  • a service with an Nvidia K80 GPU that we will use for the actual training.

Let’s begin our journey…

Preparing The Colab Notebook

We are ready to use the Colab service and create our first notebook. At the time of this writing, Colab defaults to a “hello world” notebook that is good for familiarizing yourself with the environment. Let’s create a brand new notebook with Python 3 support:

As soon as the notebook is created we need to change the runtime type:

And set it to GPU:
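To double-check that the GPU backend is actually active, you can run a quick sanity check in a cell (this is my addition, not part of the original flow); on a GPU runtime it should list a Tesla K80:

!nvidia-smi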

Our notebook is ready and we can proceed with the second step:

Preparing The Training Data

Let us start with preparing the data. We will be using the same old script that we have utilized in all the previous articles. Only this time we will be doing all of this in Colab.

We will be using the NLTK library, which requires some “pre-warming”:
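A minimal pre-warming cell could look like this (a sketch; I am assuming the converter needs the standard punkt tokenizer data, so adjust the downloads if the script asks for other resources):

%%bash
# Make sure NLTK itself is present and fetch the tokenizer data
pip3 install -q nltk
python3 -c "import nltk; nltk.download('punkt')"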

Now we can clone the repository and run the logic that prepares the training data:

%%bash
rm -rf dialog_converter
git clone https://github.com/b0noI/dialog_converter.git
cd dialog_converter
git checkout b9cc7b7d82a959c80e5048b18e956841233c7688
python3 ./converter.py
ls

Now we have raw dialog data for training. One of the key differences this time around is that we are going to use a more sophisticated dictionary. To explain the difference, here is a quote from the original tutorial:

… learning a model based on words has a couple of drawbacks. Because NMT models output a probability distribution over words, they can become very slow with a large number of possible words. If you include misspellings and derived words in your vocabulary, the number of possible words is essentially infinite and we need to impose an artificial limit on how many of the most common words we want our model to handle. This is also called the vocabulary size and is typically set to something in the range of 10,000 to 100,000. Another drawback of training on word tokens is that the model does not learn about common “stems” of words. For example, it would consider “loved” and “loving” as completely separate classes despite their common root.

One way to handle the open vocabulary issue is to learn subword units for a given text. For example, the word “loved” may be split up into “lov” and “ed”, while “loving” would be split up into “lov” and “ing”. This allows the model to generalize to new words, while also resulting in a smaller vocabulary size. There are several techniques for learning such subword units, including Byte Pair Encoding (BPE), which is what we used in this tutorial. To generate a BPE for a given text, you can follow the instructions in the official subword-nmt repository:

So our next logical steps are:

  • Get all the required software that can learn a BPE vocabulary from training text
  • Convert training data to BPE and create a vocabulary
  • Convert all text with the vocabulary.

A. Get All The Required Software That Can Learn a BPE Vocabulary From Training Text

We will fetch the subword-nmt scripts in order to perform the required manipulations. To do so, run the following command in a cell:

%%bash
rm -rf subword-nmt
git clone https://github.com/b0noI/subword-nmt.git
cd subword-nmt
git checkout dbe97c8f95f14d06b2e46b8053e2e2f9b9bf804e

Now we are finally ready for the next step:

B. Convert Training Data To BPE and Create a Vocabulary

We will execute this in the following three steps.

Step 1: This step creates a vocabulary based on the input training data and the specified vocabulary size. It creates code.bpe, which stores the learned BPE codes (basically a “compressed trie” of all the words in the training data). It also writes the most frequent words from the training data, along with their frequencies, to the files vocab.train.bpe.{a,b}.

%%bash
# Create unique words (vocabulary) from training data
subword-nmt/learn_joint_bpe_and_vocab.py --input dialog_converter/train.a dialog_converter/train.b -s 50000 -o code.bpe --write-vocabulary vocab.train.bpe.a vocab.train.bpe.b

Step 2: Our training data contains a few tabs that are not needed in the vocabulary, so let’s clean the vocabulary up.

%%bash
# Remove the tab from vocabulary
sed -i '/\t/d' ./vocab.train.bpe.a
sed -i '/\t/d' ./vocab.train.bpe.b

Step 3: The output files vocab.train.bpe.{a,b} contain a list of words along with their frequencies. tf-seq2seq takes a plain set of words as input, so we can get rid of the frequencies.

%%bash
# Remove the frequency from vocabulary
cat vocab.train.bpe.a | cut -f1 --delimiter=' ' > revocab.train.bpe.a
cat vocab.train.bpe.b | cut -f1 --delimiter=' ' > revocab.train.bpe.b
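An optional sanity check (my addition, not part of the original steps) makes the difference clear: vocab.train.bpe.{a,b} contain “word frequency” pairs, while revocab.train.bpe.{a,b} should contain just one word per line:

%%bash
# Compare the two formats and count the resulting vocabulary entries
head -3 vocab.train.bpe.a
head -3 revocab.train.bpe.a
wc -l revocab.train.bpe.a revocab.train.bpe.b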

C. Convert All Text With the Vocabulary

This cell re-applies the learned BPE codes, together with the vocabularies we just created, to each of our raw files:

%%bash
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.a --vocabulary-threshold 5 < dialog_converter/train.a > train.bpe.a
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.b --vocabulary-threshold 5 < dialog_converter/train.b > train.bpe.b
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.a --vocabulary-threshold 5 < dialog_converter/test.a > test.bpe.a
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.b --vocabulary-threshold 5 < dialog_converter/test.b > test.bpe.b
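If you peek at the encoded files (an optional check), rare words should now be split into subword units joined by the “@@” separator that subword-nmt uses by default, e.g. something like “lov@@ ed”; the exact splits depend on the learned codes:

%%bash
# Inspect a few BPE-encoded training lines
head -3 train.bpe.a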

Preparation For Training

Step 1: Download the nmt code

%%bash
rm -rf /content/nmt_model
rm -rf nmt
git clone https://github.com/tensorflow/nmt/

Step 2: Copy all the files required for training to one place. This includes the training data, the test data, and the vocabulary (just the set of words).

%%bash
mkdir -p /content/nmt_model
cp dialog_converter/train.a /content/nmt_model
cp dialog_converter/train.b /content/nmt_model
cp dialog_converter/test.a /content/nmt_model
cp dialog_converter/test.b /content/nmt_model
cp revocab.train.bpe.a /content/nmt_model
cp revocab.train.bpe.b /content/nmt_model
cp train.bpe.a /content/nmt_model
cp test.bpe.a /content/nmt_model
cp train.bpe.b /content/nmt_model
cp test.bpe.b /content/nmt_model
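Before starting the training it is worth confirming that everything landed in /content/nmt_model, since a missing vocabulary or data file is the most common reason for the training command to fail (again, just a sanity check):

%%bash
ls -la /content/nmt_model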

Start the training:

!cd nmt && python3 -m nmt.nmt \
--src=a --tgt=b \
--vocab_prefix=/content/nmt_model/revocab.train.bpe \
--train_prefix=/content/nmt_model/train.bpe \
--dev_prefix=/content/nmt_model/test.bpe \
--test_prefix=/content/nmt_model/test.bpe \
--out_dir=/content/nmt_model \
--num_train_steps=45000000 \
--steps_per_stats=100000 \
--num_layers=2 \
--num_units=128 \
--batch_size=16 \
--num_gpus=1 \
--dropout=0.2 \
--learning_rate=0.2 \
--metrics=bleu

Several things to note here:

  • num_train_steps — the number of steps the network will take before stopping. Make it big: it is always better to have to stop training manually than to have the network stop when you did not expect it to;
  • steps_per_stats — how often the network outputs stats. Keep in mind that outputting stats takes time, so you need to find a balance between printing them too often and training the model completely in the dark;
  • metrics — the metric used to compare two sentences when evaluating model quality (BLEU in our case);

The important thing here is to use “!” and not “%%bash”. “%%bash” waits until the cell has completely finished executing before showing any output, which in our case means almost forever because of the training. “!”, on the other hand, shows the output as it is produced.

As soon as the training finishes its first epoch, you can start chatting with the model.

Be sure to kill the training if it is still in progress, otherwise you will not be able to chat with the model.

Next, copy-paste the following code into a cell (or into a file, let’s say chat.sh, under the nmt directory and run it as ./chat.sh; you might need to change the permissions with chmod +x chat.sh). The chat function takes the message you want to send as its argument.

%%bash
pwd
cd nmt
touch /content/output
chat () {
  echo $1 > /content/input
  $HOME/subword-nmt/apply_bpe.py -c $HOME/code.bpe --vocabulary $HOME/vocab.train.bpe.a --vocabulary-threshold 5 < /content/input > /content/input.bpe
  cd $HOME/nmt
  python -m nmt.nmt --out_dir=/content/nmt_model --inference_input_file=/content/input.bpe --inference_output_file=/content/output > /dev/null 2>&1
  cat /content/output
}
chat "hi"

PS: We are still playing with different model configurations, so stay tuned as we update the article. At some point in the future we will also upload a pre-trained model!

Last update: July 7, 2018
