Serving Word Vectors for Distributed Computations

Word vectors are amazing! Vectors for everything!

And that is why we use them for various deep learning tasks at Talentpair. What makes them amazing? Each word in the model is represented by a vector, and each vector captures the semantic meaning of the word. Therefore, related terms sit close to each other, while unrelated terms are far apart. At Talentpair, we use word models to determine related skills and job titles. If you want to learn more about word vectors, we recommend this intro by Adrian Colyer.

Word vectors can be memory-hungry

Companies like Google and Facebook have open sourced their word models. Google started in 2013 with its model based on Google news articles and Facebook followed with its FastText model in 2016. Loading these general purpose models takes time and consumes a fair amount of memory.

Loading the word model into the instance memory
htop memory consumption before and after loading the word vector model with 3M tokens and 300-dim vectors

After loading the full Google Word2Vec model (3 million tokens with 300 dim vectors), the word model consumed 64% of the available memory of the AWS instance (in our case an m4.large).
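A quick back-of-the-envelope calculation shows why: 3 million tokens at 300 float32 dimensions each already add up to several gigabytes before any Python object overhead.

```python
# Rough memory estimate for the raw Google News vectors alone
# (float32 values only, ignoring Python object and dict overhead)
tokens = 3_000_000
dims = 300
bytes_per_float32 = 4

raw_bytes = tokens * dims * bytes_per_float32
print(f"{raw_bytes / 1e9:.1f} GB")  # 3.6 GB
```

On an 8 GB instance like the m4.large, that raw payload plus overhead easily accounts for the ~64% memory consumption shown above.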

Loading an additional word model or running some memory intensive computations will quickly lead to this disappointing error …

Darn, MemoryError …

Due to the large memory consumption and the rather long time needed to load a word model into an instance’s memory, word models did not play nicely with scaling celery workers up and down efficiently. We noticed that we often only used the raw vector representation of the tokens rather than the full Gensim word vector API for calculating similar tokens. At Talentpair, we classify job titles, convert job descriptions into semantic vectors, etc. One of the preprocessing steps converts a tokenized document into a list of token vectors. This list of word vectors, i.e. a matrix, is then the input to our deep learning models.
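That preprocessing step can be sketched with a toy model (the 4-dim vectors and the helper name below are purely illustrative; the real model uses 300 dimensions):

```python
import numpy as np

# Toy 4-dimensional "word model" standing in for the real 300-dim vectors
word_model = {
    "senior": np.array([0.1, 0.4, 0.0, 0.2], dtype=np.float32),
    "python": np.array([0.9, 0.1, 0.3, 0.5], dtype=np.float32),
    "developer": np.array([0.2, 0.8, 0.6, 0.1], dtype=np.float32),
}

def document_to_matrix(tokens, model):
    """Stack the vector of every known token into an (n_tokens, dim) matrix."""
    return np.vstack([model[t] for t in tokens if t in model])

matrix = document_to_matrix(["senior", "python", "developer"], word_model)
print(matrix.shape)  # (3, 4)
```

Note that only the raw vectors are needed here; none of Gensim's similarity methods are involved.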

Reducing the loading time and the memory consumption of large word models

Using pre-trained word models requires that we download the 1.6 GB compressed weights file to our data science instances, unpack it, and load it into memory before being able to use it. That isn’t just time-consuming, it also requires a significant amount of instance memory.

One obvious fix is to limit the loaded word model to the top X tokens. Gensim’s API provides a loading argument, limit, which allows you to restrict the vocabulary loaded. Limiting the word model speeds up model loading and reduces the memory footprint. However, you might miss important tokens which are rare, but significant.
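If that trade-off is acceptable, the limited load is a one-liner; as a sketch (the file name below is an assumption about where you stored the Google News download):

```python
from gensim.models import KeyedVectors

# Load only the 500,000 most frequent tokens instead of all 3 million.
# Adjust the file name to your download location.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz",
    binary=True,
    limit=500_000,
)
```
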

Our alternative was using an in-memory database which runs on a separate instance. All workers access the database to obtain the token vectors.

So, instead of loading one word model into every name space of every worker …

Memory consumption if one worker loads the word model into its own memory

… we allow every worker to access a Redis database which contains the word vectors.

Multiple workers accessing the same Redis database

Hacking Gensim, a bit …

We are using the Gensim package for loading pre-trained word models and for training our own. We envisioned an object API which mimics Gensim’s API, in order to allow plug-and-play with our existing classification models.

Redis seemed to be a good choice since the key-value database is known for fast and efficient lookups. AWS also offers Redis instances as a managed service, bonus!

The result of our implementation was a class we named RedisKeyedVectors, which inherits from Gensim’s KeyedVectors class. Therefore, all of its methods are available. However, we override a few (important) methods and, more importantly, some class attributes like syn0 and syn0norm which we can’t support via the Redis instance.

All vectors (which are numpy arrays) are pickled, and the compressed version is then stored in the Redis DB. Yes, every vector gets extracted, pickled, bzip-compressed, and stored in Redis. When a vector is requested, the opposite happens: the value is retrieved from Redis, shipped back to the worker, decompressed, and unpickled into a numpy array.
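The round trip for a single vector boils down to two standard-library calls in each direction:

```python
import bz2
import pickle

import numpy as np

vector = np.random.rand(300).astype(np.float32)

# What gets stored in Redis: a bz2-compressed pickle of the numpy array
stored = bz2.compress(pickle.dumps(vector))

# What the worker does on lookup: decompress, then unpickle
restored = pickle.loads(bz2.decompress(stored))

assert np.array_equal(vector, restored)
print(len(pickle.dumps(vector)), "bytes pickled,", len(stored), "bytes stored")
```
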

Individual word vectors are loaded from the word model, pickled, compressed and then stored in the Redis DB

One word about our Redis setup: At Talentpair, we have set up multiple Redis databases and created a dedicated db called “word2vec”. Within a database, Redis stores everything as key-value pairs.

token : b'BZh91AY&SY\xc2"\x11)\x00\x02\x0c ...

Since we need to support multiple word vector models through the same database, we added a prefix to the token to distinguish between different word models.

<key>+<token> : pickled, compressed vector

GOOGLE_W2V:car : b'BZh91AY&SY\xc2"\x11)\x00\x02\x0c ...
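The prefixing scheme can be illustrated with a plain dict standing in for the Redis db (the prefixes and helper below are illustrative):

```python
store = {}  # dict standing in for the Redis "word2vec" database

def make_key(model_prefix, token):
    """Build the combined <key>+<token> key, e.g. 'GOOGLE_W2V:' + 'car'."""
    return model_prefix + token

store[make_key("GOOGLE_W2V:", "car")] = b"<pickled, compressed vector>"
store[make_key("FASTTEXT:", "car")] = b"<pickled, compressed vector>"

# The same token can live in the db once per model, without collisions
print(sorted(store))  # ['FASTTEXT:car', 'GOOGLE_W2V:car']
```
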

import bz2
import pickle

import numpy as np
from django.conf import settings
from django_redis import get_redis_connection
from gensim.models.keyedvectors import KeyedVectors

from .constants import GOOGLE_WORD2VEC_MODEL_NAME
from .redis import load_word2vec_model_into_redis, query_redis


class RedisKeyedVectors(KeyedVectors):
    """ Class to imitate gensim's KeyedVectors, but instead of getting the vectors
    from memory, the vectors will be retrieved from a redis db """

    def __init__(self, key=GOOGLE_WORD2VEC_MODEL_NAME):
        self.rs = get_redis_connection(alias='word2vec')
        self.syn0 = []
        self.syn0norm = None
        self.index2word = []
        self.key = key

    @classmethod
    def check_vocab_len(cls, key=GOOGLE_WORD2VEC_MODEL_NAME, **kwargs):
        rs = get_redis_connection(alias='word2vec')
        return len(list(rs.scan_iter(key + "*")))

    @classmethod
    def load_word2vec_format(cls, **kwargs):
        raise NotImplementedError(
            "You can't load a word model that way. It needs to be pre-loaded into redis")

    def save(self, *args, **kwargs):
        raise NotImplementedError("You can't write back to Redis that way.")

    def save_word2vec_format(self, **kwargs):
        raise NotImplementedError("You can't write back to Redis that way.")

    def word_vec(self, word, **kwargs):
        """ This method mimics the word_vec method from the Gensim KeyedVectors class.
        Instead of looking the vector up in an in-memory dict, it
        - requests the value from the redis instance, where the key is a combination of
          the word vector model key and the word itself
        - decompresses it
        - and finally unpickles it
        :param word: string
        :returns: numpy array of dim of the word vector model (for Google: 300, 1)
        """
        try:
            return pickle.loads(bz2.decompress(query_redis(self.rs, word)))
        except TypeError:
            return None

    def __getitem__(self, words):
        """ returns numpy array for a single word or vstack for multiple words """
        if isinstance(words, str):
            # allow calls like trained_model['Chief Executive Officer']
            return self.word_vec(words)
        return np.vstack([self.word_vec(word) for word in words])

    def __contains__(self, word):
        """ built-in method to quickly check whether a word is available in redis """
        return self.rs.exists(self.key + word)

The word vectors get pre-loaded once and remain in the Redis DB until the next model update. The function below loads an entire word model into a Redis database.

import bz2
import pickle

from django.conf import settings
from django_redis import get_redis_connection
from tqdm import tqdm

from .constants import GOOGLE_WORD2VEC_MODEL_NAME


def load_word2vec_into_redis(rs, wvmodel, key=GOOGLE_WORD2VEC_MODEL_NAME):
    """ This function loops over all available words in the loaded word2vec model and loads
    them into the redis instance via the rs object.
    :param rs: redis connection object from django_redis
    :param wvmodel: word vector model loaded into the memory of this machine.
      Once the loading is completed, the memory will be available again.
    :param key: prefix for the redis keys
    """
    print("Update Word2Vec model in redis ...")
    for word in tqdm(list(wvmodel.vocab.keys())):
        rs.set(key + word, bz2.compress(pickle.dumps(wvmodel[word])))
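The loader's behavior can be sketched against in-memory stand-ins for both the Redis connection and the word model (the FakeRedis class and dict-backed model below are purely illustrative):

```python
import bz2
import pickle

import numpy as np

class FakeRedis:
    """Minimal stand-in for the django_redis connection used above."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

# A tiny dict-backed "word model" standing in for the real Gensim model
wvmodel = {
    "car": np.array([0.1, 0.2], dtype=np.float32),
    "truck": np.array([0.3, 0.4], dtype=np.float32),
}

rs = FakeRedis()
for word, vec in wvmodel.items():
    rs.set("GOOGLE_W2V:" + word, bz2.compress(pickle.dumps(vec)))

# Lookup round trip, as a worker would do it
restored = pickle.loads(bz2.decompress(rs.get("GOOGLE_W2V:car")))
print(restored)  # [0.1 0.2]
```
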

Advantages of this Approach

By exporting the word models to the external Redis instance, we can take advantage of a few benefits.

  1. Easy Scaling and Deployment
    In case we need to quickly scale up our worker instances, we don’t need to download the word model from a repository and load it into the instance’s memory. The vectors persist in the Redis database. By connecting to the pre-loaded Redis db, the word vectors are immediately available once the deployment to our worker instances is completed.
    Some of our deep learning models use the general word models. By accessing the Redis instance, our GPU instances also have access to the word models right after instance creation.
  2. Faster Scaling and Deployment
    The biggest plus is the immediate availability of the word models. Initializing a word model takes a few minutes. Starting up every worker without the distributed setup would require initializing the word models, which would mean stopping all celery queues for up to 10 minutes. Waiting 10 minutes while scaling the number of nodes is too long; at the moment of scaling, the new instances need to be online asap. With this distributed word model setup, the models are available instantaneously and no model initialization is required.
  3. Smaller Memory Foot Print
    With our Redis setup, we are able to run more workers on the same AWS instance. Ultimately, this allows us to use full word models on smaller instances, which saves us money in the long run.
  4. Cost Savings
    Let’s compare two scenarios (all costs are for on-demand instances in US-West-2): A) 6 instances with 8 GB each (t2.large) to keep the word model in each instance’s memory: 6 instances at $0.0928/hour = **$0.5568/hour**. B) One instance with 8 GB (t2.large) to hold 4 workers plus one suitable ElastiCache Redis instance (cache.m3.xlarge): one EC2 instance at $0.0928/hour plus one ElastiCache instance at $0.364/hour = **$0.4568/hour**. Such a setup could save up to $876/year for the outlined scenario.
    In most scenarios, projects already use Redis databases for other purposes, so the cost of the ElastiCache instance would be distributed across those other use cases. Considering other Redis applications, the cost savings will be much higher than the difference outlined above.

Disadvantages of this Approach

  1. Load time per word vector
    One of the major downsides of the Redis setup is the latency of the token lookup. Converting a token into a word vector now requires a round trip: worker instance -> Redis instance -> Redis overhead -> memory of the Redis instance, and back.
    Loading the word vector for the word “car” from the model in the instance’s memory
    Loading the word vector for the word “car” from the Redis database
    The resulting increase in lookup time of almost 1000x isn’t insignificant!
  2. Limited word vector API
    Gensim’s word vector API provides other methods which let you look up the most similar terms or calculate the distance between tokens. These methods require a very specific data structure which can’t be supported by Redis: Gensim calculates the cosine distance of a token against a matrix of all tokens, which we can’t easily hold in a Redis db.
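The serialization half of the lookup-latency gap from point 1 (ignoring the network round trip entirely) can be illustrated with a small, purely illustrative benchmark:

```python
import bz2
import pickle
import time

import numpy as np

vector = np.random.rand(300).astype(np.float32)
blob = bz2.compress(pickle.dumps(vector))

# Baseline: plain in-memory dict lookup, as with a locally loaded model
store = {"car": vector}
start = time.perf_counter()
for _ in range(1000):
    _ = store["car"]
dict_time = time.perf_counter() - start

# Redis-style lookup cost without the network: decompress + unpickle each time
start = time.perf_counter()
for _ in range(1000):
    _ = pickle.loads(bz2.decompress(blob))
redis_time = time.perf_counter() - start

print(f"dict lookup: {dict_time:.6f}s, decompress+unpickle: {redis_time:.6f}s")
```

Even before network latency is added, decompressing and unpickling dominates the plain dict lookup by orders of magnitude.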


This setup isn’t for everyone. The breakeven for the costs shows pretty clearly that you’ll need a good number of instances running single workers to warrant this setup.

However, if you run a large number of workers in your data science pipelines which need to convert tokens into word vectors or you are scaling your number of workers frequently, the distributed word vector setup can be beneficial.

If you have questions about the steps, or suggestions on how to simplify them or improve performance, please leave us a comment.
