Note: The following post was originally posted on Medium on Sept. 8, 2019

When we first started deploying neural network models in production at Talentpair, they were small (-ish), they only ran occasionally, and most importantly there were only a few of them. But as one proved its worth, another soon followed, and our naive deployment strategy started to run into resource limits. Inevitably we were repeating some expensive work in the vectorization steps, as many of our models used the same base word vectors. We needed a way to keep the core business logic humming along, yet serve our ever-growing deep learning pipelines as well. Enter *Tensorflow Serving*. This is the tale of the first step on that journey. How does one bundle up existing models for TF Serving? How can we move as much as possible into the model server? We need a new kind of model. A *Meta Model*, if you will.

**tl;dr**: So how do we get from a preserved Keras model (.h5 format) to a Tensorflow saved_model?

**code**: Implementations of all the code in the post can be found in the 2 notebooks contained in this repo.

With the arrival of TF 2.0, its simplified API, and deep integration with other elements of TF Extended (TFX), a good deal of this post will be gleefully irrelevant soon. But for those of you running flavors of TF 1.x in production, hopefully this will be of some use.

As Talentpair is primarily a technical recruiting company, our corpus is skewed toward that domain. As such, we trained our own word vector model on our full corpus and kept the top 300,000 tokens. The models themselves were built with early versions of Keras. They are simple 1-D CNNs that expect a tensor of fixed shape: 1400 word vectors, each 100-D, so 1400 x 100. All of the pre-processing of text, tokenization, lookup of word vectors, and padding/truncating happened in Python. Carrying all of the word vectors in either Redis or memory was not cheap.
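To put rough numbers on "not cheap", here is a back-of-the-envelope sketch. The vocabulary size and shapes come from the paragraph above; the 4-byte float32 storage is an assumption made only for illustration, and real overhead (Python objects, Redis serialization) would add to it.

```python
VOCAB_SIZE = 300_000   # tokens kept from the corpus
EMBEDDING_DIM = 100    # each word vector is 100-D
SEQ_LEN = 1400         # word vectors per model input

# Assumption: every value is stored as a 4-byte float32.
BYTES_PER_VALUE = 4

table_bytes = VOCAB_SIZE * EMBEDDING_DIM * BYTES_PER_VALUE
input_bytes = SEQ_LEN * EMBEDDING_DIM * BYTES_PER_VALUE

print("word vector table: {:.0f} MB".format(table_bytes / 1e6))
print("one model input:   {:.0f} KB".format(input_bytes / 1e3))
```

Under that assumption the raw table is about 120 MB and each vectorized document is about 560 KB, before any per-object overhead.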

We set out to move the model inference to Tensorflow Serving. Having a dedicated service for inference would free the API server from carrying around the growing model library and associated computational resources with all of those inferences.

But could we go further? Each model expected a matrix created from the stack of word vectors associated with each token, but this lookup of the word vectors and creation of the matrix is not cheap in terms of memory and CPU. We leveraged Keras’ embedding layer to bake our existing word vectors into the Tensorflow graph itself.

With the model thusly bundled we just needed a path to export this as a Tensorflow SavedModel, the required format for Tensorflow Serving. This post walks through these last 2 steps.

**Quick Toy Model to Work With**

There are plenty of resources that cover building simple neural networks, so we won’t cover it again here. However, we do want to reconstruct a simple model in the format of our original classifiers, so the rest of the story makes sense.

Let’s start with some requirements:

```
pip3 install gensim spacy keras==2.2.4 numpy tqdm
python3 -m spacy download en_core_web_sm
```

Some imports and constants:

```
import glob
import os
import pickle
from random import shuffle

import keras
import numpy as np
import spacy

np.random.seed(0)
nlp = spacy.load("en_core_web_sm")  # We'll use this for tokenization
MAX_LEN = 100  # Let's keep training small
x = np.random.random((50, ))  # We'll only use the 50d word vectors
OOV_VECTOR = x / np.linalg.norm(x)  # Random unit vector for out-of-vocabulary tokens
PAD_VECTOR = np.zeros((50, ))  # Zero vector for padding
```

The padding and out of vocabulary (OOV) vectors will need to be held constant, for later inference. We will pickle the OOV vector. We can always just make another zeros vector for padding, so there is no reason to store it.

```
os.makedirs('data', exist_ok=True)  # Make sure the target directory exists
with open('data/oov_vector.p', 'wb') as f:
    pickle.dump(OOV_VECTOR, f)
```
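As a quick check that pickling preserves the vector exactly, the round trip can be sketched as below. The temporary path and the local random generator are stand-ins chosen so the snippet runs on its own; they are not part of the original notebooks.

```python
import os
import pickle
import tempfile

import numpy as np

# Rebuild a unit-length random vector the same way as above.
rng = np.random.RandomState(0)
x = rng.random_sample((50,))
oov_vector = x / np.linalg.norm(x)

# Hypothetical scratch location, used instead of data/oov_vector.p
path = os.path.join(tempfile.mkdtemp(), "oov_vector.p")
with open(path, "wb") as f:
    pickle.dump(oov_vector, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored vector should be identical and still unit length.
print(np.array_equal(oov_vector, restored))
print(np.isclose(np.linalg.norm(restored), 1.0))
```

The point of freezing this vector is that the model saw it for every unknown token during training, so inference has to feed it the very same values.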

Next we’ll grab the IMDB Sentiment Dataset (1) from here. You’ll need to unpack it into a directory of your choosing. Here we’ll call it `data/`.

And we’ll also grab the Glove (2) word vectors from here. We can unpack that in `data/` too.

```
def load_imdb_data(dir_path):
    pos_filepath = os.path.join(dir_path, 'pos/*.txt')
    neg_filepath = os.path.join(dir_path, 'neg/*.txt')
    examples = []
    # Positive reviews get the label 1
    files = glob.glob(pos_filepath)
    for file in files:
        with open(file, 'r') as f:
            examples.append((f.read(), 1))
    # Negative reviews get the label 0
    files = glob.glob(neg_filepath)
    for file in files:
        with open(file, 'r') as f:
            examples.append((f.read(), 0))
    shuffle(examples)
    return examples


train = load_imdb_data('data/aclImdb/train')
# Loaded in case you want to use this later as a val or test set
test = load_imdb_data('data/aclImdb/test')
print(f'Number of train examples: {len(train)}')
print(f'Number of test examples: {len(test)}')
```

```
Number of train examples: 25000
Number of test examples: 25000
```

Then we create a “loadable” word vector file from the earlier Glove download and load that model into memory:

```
def load_vecs(filepath):
    vecs = {}
    with open(filepath) as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array([float(val) for val in split_line[1:]])
            vecs[word] = embedding
    return vecs


wv_model = load_vecs('data/glove.6B.50d.txt')
```

Each line of the file is a vocabulary token followed by the Glove vector representation of that token. So for each line we split on whitespace, use the first entry as the key in our lookup dictionary (vecs), and set the value to an np.array of the remaining entries, converted from strings to floats.
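To make the parsing concrete, here is a minimal sketch on a single made-up line. The helper name `parse_glove_line` is hypothetical, and the sample line is shortened to three values for readability (a real 50d line carries 50).

```python
import numpy as np

# A made-up, shortened line in the "token value value ..." layout.
sample_line = "python 0.5897 -0.55043 -1.0106\n"


def parse_glove_line(line):
    """Split one line into (token, vector)."""
    parts = line.split()
    return parts[0], np.array([float(val) for val in parts[1:]])


word, vector = parse_glove_line(sample_line)
print(word, vector.shape)
```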

Side note, you can also use gensim’s KeyedVectors (in the code block below), but I’m trying to replicate the journey we started on years ago, so we’ll stick with the code from above for now.

```
from gensim.models.keyedvectors import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec(glove_input_file="data/glove.6B.50d.txt", word2vec_output_file="data/gensim_glove_vectors.txt")
wv_model = KeyedVectors.load_word2vec_format("data/gensim_glove_vectors.txt", binary=False)
```

And just to make sure our tokenizer and word vectors aren’t nonsensical:

```
# Lowercase first: the glove.6B vocabulary is uncased
tokenized_sample = nlp('Hey there, tokenize me.'.lower())
print([x for x in tokenized_sample])
print(wv_model[str(tokenized_sample[0])])
```

Which should give us:

```
[hey, there, ,, tokenize, me, .]
[-0.7001 0.36781 0.34424 -0.42318 -0.046018 -0.66072 -0.33993
0.18271 -0.92863 0.5684 -0.43819 0.70827 -0.47459 -0.079269
1.0187 0.2213 0.43073 0.76719 0.18774 -0.49214 -0.53063
0.56379 0.63571 0.64622 1.2649 -0.82901 -1.3903 0.3749
0.61316 -1.5994 1.3005 0.64347 -0.58004 1.0372 -0.27156
-0.43382 0.8554 -0.8967 0.80176 -0.33333 -0.17654 -0.12277
-0.70508 -0.28412 0.71149 -0.13487 0.049514 -0.8134 0.34293
1.0381]
```

Cool. A list of 6 tokens and a 50d vector for ‘hey’. Now we want some preprocessing and a function to use our *PAD* and *OOV* vectors, so that we end up with an array that contains exactly *MAX_LEN* (100 in this case) vectors.

```
def preprocess_and_tokenize(line):
    tokens = nlp(line.lower())
    return [x for x in tokens if x]


def vectorize_pad_sample(example, wv_model):
    line, target = example
    vectors = []
    tokens = preprocess_and_tokenize(line)
    for token in tokens:
        try:
            vectors.append(wv_model[str(token)])
        except KeyError:
            # Unknown tokens all share the frozen OOV vector
            vectors.append(OOV_VECTOR)
    # Pad short samples with zeros; long samples are truncated on return
    pad_len = MAX_LEN - len(vectors)
    if pad_len > 0:
        vectors.extend([PAD_VECTOR] * pad_len)
    return (np.array(vectors[:MAX_LEN]), target)
```

And a quick sanity check:

```
text_with_target = ('python or else ' * 32, 1)  # To mimic the imdb format
x, y = vectorize_pad_sample(text_with_target, wv_model)
print(x.shape)
```

`(100, 50)`

Then we prepare the dataset for training. Again, this step is solely to replicate our journey. There are newer tools to do this in a cleaner way.

```
from tqdm import tqdm  # Progress bar for the loop below

X_train = []
y_train = []
for example in tqdm(train):
    x, y = vectorize_pad_sample(example, wv_model)
    X_train.append(x)
    y_train.append(y)
X_train = np.array(X_train)
y_train = np.array(y_train)
```

To recreate one of our classifiers, we use Keras (not tensorflow.keras, that will come later):

```
input_shape = X_train[0].shape
_input = keras.layers.Input(input_shape)
x = keras.layers.Conv1D(25, 3, activation='relu')(_input)
x = keras.layers.MaxPooling1D(2)(x)
x = keras.layers.Conv1D(50, 3, activation='relu')(x)
x = keras.layers.MaxPooling1D(2)(x)
x = keras.layers.Conv1D(100, 3, activation='relu')(x)
x = keras.layers.GlobalMaxPooling1D()(x)
x = keras.layers.Dense(60, activation='relu')(x)
x = keras.layers.Dropout(.2)(x)
x = keras.layers.Dense(1, activation='sigmoid', name='final_output')(x)
model = keras.models.Model(_input, x)
```

So our model will be a simple convolutional model over each 100 x 50 tensor and output a single value between 0 and 1.
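For anyone wanting to trace how the sequence length shrinks through that stack, here is a small sketch. It assumes the Keras defaults used above: stride 1 and 'valid' padding for `Conv1D`, and a stride equal to the pool size for `MaxPooling1D`. The helper names are made up for this illustration.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of MaxPooling1D with 'valid' padding and stride == pool_size."""
    return length // pool_size


steps = []
length = 100  # MAX_LEN tokens going in
for layer, size in [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2), ("conv", 3)]:
    if layer == "conv":
        length = conv1d_out_len(length, size)
    else:
        length = maxpool1d_out_len(length, size)
    steps.append(length)

print(steps)  # [98, 49, 47, 23, 21]
```

`GlobalMaxPooling1D` then collapses the remaining 21 steps into one value per filter, so the dense layers see a flat vector of 100 features.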

```
weights_file = 'tmp_model_1_weights'
callbacks = [
    keras.callbacks.ModelCheckpoint(weights_file,
                                    monitor='val_loss',
                                    save_best_only=True,  # Keep only the best weights so we can reload them later
                                    save_weights_only=True),
    keras.callbacks.EarlyStopping(patience=3,
                                  monitor='val_loss')
]
model.compile(keras.optimizers.RMSprop(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train,
          y_train,
          validation_split=0.2,
          batch_size=32,
          epochs=30,  # `nb_epoch` is the deprecated Keras 1 name for this argument
          callbacks=callbacks)
```

Here is where you would normally tune your hyperparameters and really dial the model in, but as this is a post about converting existing models we’ll skip that and work with the toy model as it comes out.

```
model.load_weights(weights_file)  # Reload the best weights saved
model.save('model_1.h5')
```

This is where we found ourselves after 12 months of iterations: a pile of models, mostly on the scale of the toy model above, but some a fair bit larger. Each was doing its pre-processing and inference in a Python Celery worker alongside the business logic of the app. In the early days of our experimentation, this was a fine approach. It allowed the data science team to develop alongside the rest of the application and move at their cadence, critical for a team of our size. But as with all things, growth brings some pain. And this was the first place we looked to alleviate it.

**Enter Tensorflow-Serving**

Tensorflow-Serving is a highly optimized serving package that is part of the ever-growing Tensorflow ecosystem. It allows for setting up a dedicated micro-service (perhaps a misnomer) to efficiently manage Tensorflow models specifically for inference. *Model versioning*, *request handling*, *batching*, etc. are all handled out of the box. We won’t cover how to set up Tensorflow-Serving here, but this tutorial will walk you through setting it up via Docker. The important piece of information for our current purposes is that the model needs to be in Tensorflow *SavedModel* format.

So how do we get from a preserved Keras model (.h5 format) to a Tensorflow *SavedModel*?

**Important Note**: The code snippets in the sections above and below are in two distinct notebooks for a reason. In the section below we will begin working with Tensorflow *Session*s and as they maintain state in sometimes unintuitive ways we need to be very explicit about how we instantiate them and re-instantiate them should something go wrong. That being said, as you work with the code below, remember that should anything go haywire, just rerunning the cell (should you be in a notebook) may lead to unexpected errors. There is a saying in the Data Science research community, “Restart & Run All or it didn’t happen,” referring to restarting the notebook’s kernel. The use case here is slightly different, but the sentiment is just as valid.

Let’s start with everything we will need:

```
import os
import pickle
from copy import copy
import keras
import numpy as np
import tensorflow as tf
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models.keyedvectors import KeyedVectors
```

We are adding gensim here as the KeyedVectors model allows us some quick wins for handling the word vectors in building up our new model, as you’ll see below. Also note that gensim has a get_keras_embedding utility method that covers some of the functionality described below. However, as we are working with a specific, frozen Out of Vocabulary vector, it seemed more explicit to do this step manually.

Next we need to load the model we created in the first section. And for this we need to use the Keras package, not to be confused with the tf.keras API we will use in later steps. This differentiation is crucial.

`orig_model = keras.models.load_model('model_1.h5')`

And we load the same vector file from the last script.

```
glove2word2vec(glove_input_file="data/glove.6B.50d.txt", word2vec_output_file="data/gensim_glove_vectors.txt")
wv = KeyedVectors.load_word2vec_format("data/gensim_glove_vectors.txt", binary=False)
```

Now we make a quick test so we can verify the output of the model doesn’t change after we package it up.

```
test = []
for i in range(100):
    test.append(wv['python'])
test = np.array(test).reshape(1, 100, 50)  # One sample of 100 identical word vectors
print(orig_model.predict(test))
```

`array([[0.4573401]], dtype=float32)`

This next step turns out to be necessary. Here is the first place a model created in Keras will start to conflict with models created via the tf.keras API. As such, the model needs to be converted into tf.keras.

We want to take a model and extract its structure and weights separately. Then we can use the *tf.keras* API to re-instantiate it. While the APIs of tf.keras and keras are almost identical by design, a great deal of what goes on under the hood is completely different. This will be even more relevant with the release of Tensorflow 2.0. It is designed with tf.keras as its core modeling API, and a lot of the downstream tasks (such as *SavedModel*) rely on different instantiations of the various objects in the pipeline.

```
def convert_to_tf(model, name):
    # Sanitize the name so it is safe to use in file paths
    model.name = '{}'.format(str(name).lower()
                             .replace('(', '')
                             .replace(')', '')
                             .replace(',', '')
                             .replace('/', '_')
                             .replace(' ', '_'))
    # Export the architecture (JSON) and the weights separately from Keras
    js = model.to_json()
    filepath = '/tmp/{}_weights.h5'.format(model.name)
    model.save_weights(filepath)
    # Rebuild the same model with tf.keras and load the weights back in
    new_model = tf.keras.models.model_from_json(js)
    new_model.load_weights(filepath)
    models_dir = 'models'
    if not os.path.exists(models_dir):
        os.makedirs(models_dir)
    new_model.save('{}/{}.h5'.format(models_dir, model.name))
    return new_model


tf_orig_model = convert_to_tf(orig_model, 'cnn_1')
```

Now the magic. Well it seems like magic, as this was the hard part. We want to take our word vectors and wrap them into a tensorflow layer to use within the model. Note everything from here on out will be *tf.keras* instead of “classic” keras.

```
def create_wv_embedding_layer(wv, oov_vector_file, pad_string='--PAD--', max_sequence_length=100):
    """
    From an existing gensim word2vec model, create a tf.keras embedding layer
    to use ahead of pre-trained models in TF Serving.

    The pad string will be assigned a vector of zeros.
    The Out of Vocabulary vector would've been predefined when the original
    model was created. We need to keep it around.

    When saving a combined model:

        signature = tf.saved_model.signature_def_utils.predict_signature_def(
            inputs={'image': model.input}, outputs={'scores': model.output}
        )
        builder = tf.saved_model.builder.SavedModelBuilder('/tmp/wvmodel7')
        builder.add_meta_graph_and_variables(
            sess=K.get_session(),
            tags=[tf.saved_model.tag_constants.SERVING],
            signature_def_map={tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature},
            legacy_init_op=tf.tables_initializer()
        )
        builder.save()

    Args:
        wv: gensim word vector model
        oov_vector_file: str
        pad_string: str
        max_sequence_length: int

    Returns:
        tf.keras model that will take tokens as strings and return a matrix of word vectors
    """
    # Get the oov token
    with open(oov_vector_file, 'rb') as f:
        oov_vec = pickle.load(f)

    # Construct new index to word mapping
    index2word = copy(wv.index2word)
    index2word.append(pad_string)

    # Build the embedding matrix
    embedding_dim = wv[wv.index2word[0]].shape[0]
    embedding_matrix = np.zeros((len(wv.index2word) + 2, embedding_dim))  # +2 because 1 for pad token, 1 for oov
    # use the original lookup index as it doesn't have a pad token
    for i, word in enumerate(wv.index2word):
        embedding_vector = wv[word]
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    # Set the last entry to the oov_token
    embedding_matrix[len(embedding_matrix) - 1] = oov_vec

    # Build the tf lookup table
    lookup_op = tf.contrib.lookup.index_table_from_tensor(tf.constant(index2word),  # not wv.index2word as we want the pad token
                                                          default_value=len(embedding_matrix) - 1,  # points to the oov/unk token
                                                          name='lookup_op')

    # Build the tf.keras Embedding layer
    embedding_layer = tf.keras.layers.Embedding(len(wv.index2word) + 2,
                                                embedding_dim,
                                                weights=[embedding_matrix],
                                                input_length=max_sequence_length,
                                                trainable=False,
                                                name='embedding_layer')

    # Build a model that takes tokens
    _input = tf.keras.layers.Input(shape=(max_sequence_length,), dtype=tf.string, name='wv_in')
    x = tf.keras.layers.Lambda(lookup_op.lookup, output_shape=(max_sequence_length,), name='lambda_lookup')(_input)
    x = embedding_layer(x)
    wv_model = tf.keras.models.Model(_input, x)  # Everything in tf.keras please

    # Initialize the lookup table in the current tf.keras session
    lookup_op.init.run(session=tf.keras.backend.get_session())

    return wv_model


OOV_VECTOR_FILE = 'data/oov_vector.p'
embedding = create_wv_embedding_layer(wv, OOV_VECTOR_FILE, max_sequence_length=100)
print(embedding.summary())
```

```
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
wv_in (InputLayer) (None, 100) 0
_________________________________________________________________
lambda_lookup (Lambda) (None, 100) 0
_________________________________________________________________
embedding_layer (Embedding) (None, 100, 50) 20000100
=================================================================
Total params: 20,000,100
Trainable params: 0
Non-trainable params: 20,000,100
_________________________________________________________________
```

A lot is going on here, but there are a few things to notice.

1) We are regathering the preserved out of vocabulary vector and adding it to the lookup dictionary.

2) We are building a reference matrix out of all of the word vectors including a zero vector for PADs and the OOV vector.

3) We use `tf.contrib.lookup.index_table_from_tensor` to create a lookup table within the graph itself.

4) We use `tf.keras.layers.Embedding` to hold this newly constructed matrix.

5) We initialize the lookup table. Notice the reference to the current tf.keras.backend session. This is where things can go screwy if this code has run once and you try to rerun it later. The session can easily get out of a correct state. So proceed carefully.
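The index layout is easier to see outside of Tensorflow. The sketch below mimics the same bookkeeping in plain NumPy with a made-up three-word vocabulary and 4-d vectors. It is an illustration of the layout only, not the graph code itself: a dict with a default stands in for the lookup table, and row indexing stands in for the Embedding layer.

```python
import numpy as np

vocab = ["the", "python", "code"]  # stand-in for wv.index2word
pad_string = "--PAD--"
embedding_dim = 4

rng = np.random.RandomState(0)
word_vectors = rng.random_sample((len(vocab), embedding_dim))  # stand-in word vectors
oov_vector = np.ones(embedding_dim)                            # stand-in frozen OOV vector

# Rows 0..N-1: real words, row N: pad (zeros), row N+1: OOV
index2word = vocab + [pad_string]
matrix = np.zeros((len(vocab) + 2, embedding_dim))
matrix[:len(vocab)] = word_vectors
matrix[-1] = oov_vector
oov_index = len(matrix) - 1

word2index = {word: i for i, word in enumerate(index2word)}


def embed(tokens):
    """Map tokens to row ids (unknown tokens fall back to the OOV row), then gather rows."""
    ids = [word2index.get(token, oov_index) for token in tokens]
    return matrix[ids]


out = embed(["python", "unknownword", pad_string])
print(out.shape)
```

The pad string lands on the all-zeros row because it was appended to the vocabulary, while anything not in the vocabulary falls through to the last row.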

Let’s print a sample from our new layer just to see that we are on the right track.

```
vec = embedding.predict(np.array([['python'] * 100]))
print(vec)
```

```
array([[[ 0.5897 , -0.55043, -1.0106 , ..., 0.15425, -0.93256,
-0.15025],
[ 0.5897 , -0.55043, -1.0106 , ..., 0.15425, -0.93256,
-0.15025],
[ 0.5897 , -0.55043, -1.0106 , ..., 0.15425, -0.93256,
-0.15025],
...,
[ 0.5897 , -0.55043, -1.0106 , ..., 0.15425, -0.93256,
-0.15025],
[ 0.5897 , -0.55043, -1.0106 , ..., 0.15425, -0.93256,
-0.15025],
[ 0.5897 , -0.55043, -1.0106 , ..., 0.15425, -0.93256,
-0.15025]]], dtype=float32)
```

We then need to wrap this in a new model along with the original pre-trained model.

```
MAX_TOKENS = 100
_input = tf.keras.layers.Input(shape=(MAX_TOKENS,), dtype=tf.string)
x = embedding(_input)
y = tf_orig_model(x)
# for extra fun you can stack as many models as you want here as long as they have the same input requirements
# z = tf_orig_OTHER_MODEL(x)
# zz = tf_orig_YET_OTHER_MODEL(x)
model = tf.keras.models.Model(_input, y)
# model = tf.keras.models.Model(_input, [y, z, zz])
legacy_init_op = tf.group([tf.tables_initializer(), tf.local_variables_initializer()], name='legacy_init_op')
init_op = tf.group(tf.local_variables_initializer())
init_op.run(session=tf.keras.backend.get_session())
```

And finally save the model in *SavedModel* format.

```
tf.keras.backend.set_learning_phase(0)
outputs = {'blog_model': model.outputs[0]} # If you have multiple outputs this should be a dictionary comprehension
signature = tf.saved_model.signature_def_utils.predict_signature_def(
inputs={'tokens': model.input}, outputs=outputs
)
builder = tf.saved_model.builder.SavedModelBuilder('/tmp/blog_model')
builder.add_meta_graph_and_variables(
sess=tf.keras.backend.get_session(),
tags=[tf.saved_model.tag_constants.SERVING],
signature_def_map={tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature},
legacy_init_op=legacy_init_op
)
builder.save()
```

```
WARNING:tensorflow:From .virtualenvs/tf_blog_wrapper/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:205: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
WARNING:tensorflow:From <ipython-input-17-ad6197f05a85>:11: calling SavedModelBuilder.add_meta_graph_and_variables (from tensorflow.python.saved_model.builder_impl) with legacy_init_op is deprecated and will be removed in a future version.
Instructions for updating:
Pass your op to the equivalent parameter main_op instead.
INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: /tmp/blog_model/saved_model.pb
b'/tmp/blog_model/saved_model.pb'
```

Note the comment above in the definition of *outputs* for the signature definition. Here we just have one output generated, so we use the first. But if you have multiple ‘original models’ being wrapped here, this *outputs* definition should gather a key/value pair for each of those. The TF-server will return the JSON with the inference results under the keys you provide here.
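For reference, a request against Tensorflow Serving's REST predict endpoint for this signature might be assembled as below. This is a sketch under assumptions: the host, port 8501, and the model name `blog_model` are placeholders for whatever your server is configured with, and nothing is actually sent so the snippet runs without a server.

```python
import json

MAX_TOKENS = 100
MODEL_NAME = "blog_model"  # hypothetical name the server was started with
HOST = "localhost:8501"    # assumed REST host and port

tokens = ["python"] * MAX_TOKENS

url = "http://{}/v1/models/{}:predict".format(HOST, MODEL_NAME)
payload = {
    # Value of DEFAULT_SERVING_SIGNATURE_DEF_KEY used when saving the model
    "signature_name": "serving_default",
    # One instance per sample; 'tokens' is the input key from the signature
    "instances": [{"tokens": tokens}],
}
body = json.dumps(payload)
print(url)
```

Posting `body` to `url` with an HTTP client of your choice (for example `requests.post(url, data=body)`) would be the next step.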

And with that we can test the final model with some tokens. Unfortunately the tokenization isn’t moved into the model here, but with Tensorflow-Extended (TFX) more and more of this will move into the graph and thus can be shifted to the Tensorflow Server.

```
text = 'python ' * 100
tokens = text.split()
print(np.array([tokens]).shape)
```

`(1, 100)`

And then infer and compare to our original output.

`print(model.predict(np.array(tokens).reshape((1, -1))))`

`array([[0.4573401]], dtype=float32)`

A perfect match!

Now you have your *SavedModel* version of a classic Keras model, complete with the embedding lookup in the graph. Note the *.pb* file is just part of what is generated by *SavedModel*. The *.pb* file and the directory containing the variables (and assets if there are any) are what constitute the entire saved model for Tensorflow Serving, so all of these files need to be migrated to the Tensorflow Server.
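One practical detail when migrating the files: Tensorflow Serving looks for numbered version subdirectories under a model's base path. The sketch below copies an export into such a layout. The scratch directories and empty placeholder files are stand-ins so it runs anywhere; in practice the source would be the real export (here `/tmp/blog_model`) and the target your server's model directory.

```python
import os
import shutil
import tempfile

scratch = tempfile.mkdtemp()

# Stand-in for the real export directory, filled with empty placeholder files.
export_dir = os.path.join(scratch, "blog_model")
os.makedirs(os.path.join(export_dir, "variables"))
for name in ("saved_model.pb",
             os.path.join("variables", "variables.index"),
             os.path.join("variables", "variables.data-00000-of-00001")):
    open(os.path.join(export_dir, name), "wb").close()

# Copy everything into <serving root>/<model name>/<version>/
serving_root = os.path.join(scratch, "serving_models")  # hypothetical server model dir
version = 1
target = os.path.join(serving_root, "blog_model", str(version))
shutil.copytree(export_dir, target)

for root, _dirs, files in os.walk(serving_root):
    for name in sorted(files):
        print(os.path.relpath(os.path.join(root, name), serving_root))
```

Bumping the version number for a new export lets the server pick up the newer model without the older one being overwritten.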

**Conclusion**

So what did we achieve here? The setup of the toy model aside, we were able to move both the model inference and the memory-intensive vector lookup off of the core infrastructure. In actual deployment terms, there are 36 classifiers that all share the same initial matrix input, all created on top of the same word vector model. So while this toy model was quite small, moving 36 sequential inferences off of the API server was a huge relief. And by bundling them into this single *meta model* we can make a single request to the TF-server and get all 36 results in approximately 60ms. That is over a 10x speedup over inferring them locally and sequentially, even with the added network request.

What did we miss? Unfortunately the tokenization and padding are still happening ahead of the request on the API server. This is less than ideal, as the core infrastructure should only care about raw data (the initial text) and the result. New features coming in Tensorflow 2.0, TFX, and TF-Text will clean all of this up. So there is still work to be done in getting this last part into the *Meta Model*, but progress has been made. And hopefully this is helpful to some of you still running TF 1.X in production.
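Until then, the client-side step stays small. A minimal sketch of it is below, assuming the `--PAD--` pad string and 100-token length used when building the embedding model. The whitespace split is only a stand-in for a real tokenizer (the post uses spaCy), and `tokenize_and_pad` is a hypothetical helper name.

```python
MAX_TOKENS = 100
PAD_STRING = "--PAD--"  # must match the pad string baked into the lookup table


def tokenize_and_pad(text, max_tokens=MAX_TOKENS, pad_string=PAD_STRING):
    """Lowercase, split on whitespace, then truncate or pad to a fixed length."""
    tokens = text.lower().split()  # stand-in tokenizer
    tokens = tokens[:max_tokens]
    return tokens + [pad_string] * (max_tokens - len(tokens))


padded = tokenize_and_pad("Python or else")
print(len(padded), padded[:4])
```

Because the pad string maps to the zero vector inside the graph, the API server only ever needs to ship a fixed-length list of strings.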

Special thanks to Jonathan Mackrory, Machine Learning Engineer at Talentpair, for the genius idea and key insights throughout.

Footnotes:

(1) @InProceedings{maas-EtAl:2011:ACL-HLT2011,

author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},

title = {Learning Word Vectors for Sentiment Analysis},

booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},

month = {June},

year = {2011},

address = {Portland, Oregon, USA},

publisher = {Association for Computational Linguistics},

pages = {142–150},

url = {http://www.aclweb.org/anthology/P11-1015}

}

(2) @inproceedings{pennington2014glove,

author = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},

booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},

title = {GloVe: Global Vectors for Word Representation},

year = {2014},

pages = {1532–1543},

url = {http://www.aclweb.org/anthology/D14-1162},

}