This article describes the steps we took while training Doc2Vec as part of Go Bold Day, from acquiring and preprocessing the data to training the model. Time is of the essence during hackathons, so we used parallel processing to save time when training the model. It was quite interesting to see the results.
I recently trained Doc2Vec as part of an internal hackathon at Mercari US called "Go Bold Day" (GBD). This is a two-day event at Mercari US in which you drop what you are working on and instead collaborate with others to experiment with new ideas and get out of your comfort zone. For my GBD submission, I collaborated with a colleague, Vamshi Teja, who works on a different machine learning team at Mercari, to build embeddings from scratch using Mercari data.
Every item listed on Mercari has an item title and an item description. We decided to train word embeddings from scratch on Mercari data, i.e. item titles and descriptions, to compare their performance and quality against embeddings generated from general language corpora.
Although Doc2Vec has been superseded by newer techniques like GloVe, fastText, ELMo, and more recently Transformer-based approaches, Doc2Vec is fast to train and can generate both word embeddings and document embeddings. This made Doc2Vec a good fit for our use case.
At Mercari, we store most of our analytics data in BigQuery, a cloud-based data warehouse. BigQuery exposes an SQL-like interface: data is organized into datasets, which contain tables on which SQL queries can be executed.
We were, however, going to train models on Vertex AI Workbench, a cloud-based Jupyter-Notebook-as-a-Service platform on GCP. The Jupyter Notebook IDE allows for rapid iteration while developing a machine learning model, while Vertex AI Workbench lets machine learning developers quickly switch from a small cloud machine for prototyping to a much larger machine for multiprocessing.
To move data from BigQuery to Vertex AI Workbench, we first created an intermediate table in BigQuery containing only the subset of data that we would be training on. This intermediate table was then downloaded into Vertex AI Workbench.
The following lines show the query used for creating the dataset.
SELECT name,
description
FROM `project_id.dataset.item_analytics`
WHERE status = "sold_out"
AND created BETWEEN "yyyy-mm-dd 00:00:00" AND "yyyy-mm-dd 23:59:59"
The query results were then exported to an intermediate table using the Cloud Console UI.
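The same export can also be scripted with the BigQuery client library instead of clicking through the console. The snippet below is only a rough sketch of that approach; the project, dataset, and table names are placeholders, and our actual export for GBD was done through the UI.
from google.cloud import bigquery

client = bigquery.Client()

# Write the query results into the intermediate table (placeholder names)
job_config = bigquery.QueryJobConfig(
    destination="project_id.dataset.intermediate_table",
    write_disposition="WRITE_TRUNCATE",
)
sql = """
SELECT name, description
FROM `project_id.dataset.item_analytics`
WHERE status = "sold_out"
  AND created BETWEEN "yyyy-mm-dd 00:00:00" AND "yyyy-mm-dd 23:59:59"
"""
client.query(sql, job_config=job_config).result()  # blocks until the job finishes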
To speed up downloading the data from BigQuery to the notebook, we used the BigQuery Storage API along with the official BigQuery client library. The following code snippet demonstrates using the BigQuery client with the BigQuery Storage API to download the data from BigQuery as a pandas dataframe.
from google.cloud import bigquery

bqclient = bigquery.Client()
table_ref = 'project_id.dataset.intermediate_table'
table = bigquery.TableReference.from_string(table_ref)

# Stream the table contents into a pandas dataframe via the BigQuery Storage API
rows = bqclient.list_rows(table)
dataframe = rows.to_dataframe(create_bqstorage_client=True)

# Persist the dataframe to disk in feather format
dataframe.to_feather("dataset.feather")
We chose the feather format to persist the dataframe to disk because it has good I/O performance. The data would not be retained for long, so disk space was not much of a concern.
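Reading the dataset back in a later session is then a one-liner with pandas (which uses pyarrow for feather I/O):
import pandas as pd

dataframe = pd.read_feather("dataset.feather")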
Item names and descriptions often contain information that is not useful and adds noise to the model training process, so we preprocessed the data by removing non-ASCII characters, punctuation, extra whitespace, and digits.
The following snippet shows the preprocessing code applied to each document, where a document is the concatenation of an item name and description.
import re
import string

import spacy

nlp = spacy.load("en_core_web_sm")

# Translation table: map unusual whitespace to spaces and delete punctuation
bad_whitespace = "\t\n\r\x0b\x0c"
trans_table = str.maketrans(bad_whitespace, " " * len(bad_whitespace), string.punctuation)


def preprocess(x):
    # Ignore non-ascii
    x = x.encode('ascii', 'ignore').decode()
    # Lowercase
    x = x.lower()
    # Remove non-space whitespace and delete punctuation
    x = x.translate(trans_table)
    # Delete digits
    x = re.sub(r"\d+", '', x)
    # Collapse multiple spaces
    x = ' '.join(x.split())
    # Lemmatize and drop stop words
    x = ' '.join(token.lemma_ for token in nlp(x)
                 if token.lemma_.lower() not in nlp.Defaults.stop_words)
    return x
For example, if we had the sentence:
Elizabeth and James Nirvana Amethyst EDP 1.7 oz bottle
It would be converted to:
elizabeth james nirvana amethyst edp oz bottle
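As a quick sanity check, the same transformation can be reproduced by calling preprocess directly (the exact lemmas can vary slightly with the spaCy model version):
print(preprocess("Elizabeth and James Nirvana Amethyst EDP 1.7 oz bottle"))
# elizabeth james nirvana amethyst edp oz bottle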
Once the dataset was acquired, we converted the pandas dataframe to a list of strings so that it would be easier to multiprocess. In previous experience with multiprocessing pandas dataframes using tools like Modin or Dask, I found that multiprocessing didn't work well when the dataframe contained string data, so we decided it would be much simpler to do Python multiprocessing over a list of strings.
We concatenated the dataframe columns name and description and converted the result to a Python list.
dataframe["text"] = dataframe["name"] + ' ' + dataframe["description"]
documents = dataframe["text"].tolist()
With all documents in a Python list, we parallelized the preprocessing of all documents. Additionally, we used tqdm to print a progress bar.
import multiprocessing

import tqdm

pool = multiprocessing.Pool(processes=200)
processed = list(tqdm.tqdm(pool.imap(preprocess, documents), total=len(documents)))

with open('corpus.txt', 'w') as f:
    for item in processed:
        f.write("%s\n" % item)
Since this process is compute intensive and easy to parallelize, we resized the Jupyter notebook instance to a larger machine type with several virtual CPUs and set the number of processes equal to the number of virtual CPUs.
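One way to keep the worker count in sync with the instance size, rather than hard-coding a number, is to use the CPU count reported by the standard library; a small sketch:
import multiprocessing

import tqdm

# Match the number of worker processes to the instance's virtual CPUs
n_workers = multiprocessing.cpu_count()
with multiprocessing.Pool(processes=n_workers) as pool:
    processed = list(tqdm.tqdm(pool.imap(preprocess, documents), total=len(documents)))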
We can train a Doc2Vec model in parallel with the gensim library. First, we ensured that gensim was using its compiled C extension; otherwise, training can be significantly slower.
import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"
Since we were short on time in a two-day hackathon, we additionally created intermediate checkpoints so that, if training took too long, we could still fall back to the checkpointed model.
import time

from gensim.models import Doc2Vec
from gensim.models.callbacks import CallbackAny2Vec
from gensim.test.utils import get_tmpfile


class callback(CallbackAny2Vec):
    """Callback to print progress and save a checkpoint after each epoch."""

    def __init__(self):
        self.path_prefix = "d2v"
        self.epoch = 0

    def on_epoch_start(self, model):
        print(f"Start:{self.epoch} epoch at {time.ctime()}")

    def on_epoch_end(self, model):
        print(f"Done:{self.epoch} epoch at {time.ctime()}")
        output_path = get_tmpfile(f"{self.path_prefix}_epoch{self.epoch}.model")
        model.save(output_path)
        print("Saved:", output_path)
        self.epoch += 1
Most blogs I found online explain the training process using the documents parameter (as is common when training Word2Vec); however, the gensim documentation recommends the corpus_file mode instead for a performance boost, so we used the corpus_file parameter.
corpus = "corpus.txt"
model = Doc2Vec(corpus_file=corpus, vector_size=100, window=5, min_count=20, workers=14, callbacks=[callback()])
print("Building Vocab")
model.build_vocab(corpus_file=corpus,)
print("Built Vocab")
print(f"Number of Documents:{model.corpus_count}")
print(f"Total Number of Words:{model.corpus_total_words}")
model.train(
corpus_file=corpus,
total_examples=model.corpus_count,
epochs=10,
total_words=model.corpus_total_words,
callbacks=[callback()],
)
model.save("model.doc2vec")
We discovered that the training process for Doc2Vec with gensim is not as parallelizable as the preprocessing code shown earlier. Instead, the performance boost seemed to max out at around 14-16 workers; adding more workers didn't improve performance any further and instead seemed to make it worse.
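Once training finished, the saved model (or any of the epoch checkpoints) could be loaded back to infer a vector for a new, preprocessed document. A minimal sketch using the gensim API:
from gensim.models import Doc2Vec

model = Doc2Vec.load("model.doc2vec")
tokens = preprocess("Elizabeth and James Nirvana Amethyst EDP 1.7 oz bottle").split()
vector = model.infer_vector(tokens)  # 100-dimensional document embedding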
Once the model was trained, we explored the document and word embeddings by building a Streamlit app. The results for the document embeddings were not positive. Mercari documents are relatively small compared to Wikipedia articles, which contain much more text, so I believe increasing the number of epochs might help improve performance.
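The app itself was essentially a thin wrapper around the trained model. A minimal sketch of that kind of Streamlit app (the widget labels and layout here are illustrative, not our exact app) might look like this:
import streamlit as st
from gensim.models import Doc2Vec

model = Doc2Vec.load("model.doc2vec")

word = st.text_input("Query word", "lipstick")
if word in model.wv:
    # Show the top 5 most similar words and their cosine similarities
    st.table(model.wv.most_similar(word, topn=5))
else:
    st.write("Word not in vocabulary")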
We evaluated the word embeddings by looking for similar words in the embedding space. Since we had a large number of documents in the training set, the model was able to learn useful word vector representations. This was especially true for marketplace concept words such as discount, sales, and defects. Our results were also good for popular categories in the marketplace such as fashion goods, electronics, and cosmetics. We also evaluated the word embeddings on analogies, but they didn't perform that well there.
We can find similar words by using word vectors in the doc2vec model:
model.wv.most_similar(word, topn=top_N)
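Analogy queries use the same interface with positive and negative word lists; for example (the words here are only illustrative):
# Vector-arithmetic analogy: which word is to "woman" as "king" is to "man"?
model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=top_N)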
The following list shows some of our results for the top 5 similar words given a query word:
winter -> summer, fall, spring, beach, outfit
lipstick -> lipgloss, lip, lippie, eyeshadow, eyeliner
shoes -> sneakers, oxfords, boots, slipons, kswiss
perfume -> parfum, scent, spray, fragrance, lotion
defect -> flaw, damage, imperfection, blemish, assume
sale -> sell, list, price, currently, value
xs -> xsmall, medium, xl, xxs, short
xmas -> christmas, christma, holiday, christmasholiday, halloween
red -> white, orange, purple, black, green
batwoman -> catwoman, supergirl, huntress, deathstroke, shazam
batman -> superman, joker, dc, comic, marvel
girls -> womans, juniors, siz, ladies, ladie
boys -> kids, shorts, shirts, men, pants
ps -> playstation, xbox, game, console, psp
laptop -> tablet, computer, compartment, ipad, phone
keyboard -> ipad, keypad, macbook, trackpad, touchpad
ram -> arcade, drive, ddr, ghz, onboard
naruto -> goku, anime, itachi, dbz, dragonball
pokemon -> pokmon, charmander, pikachu, kanto, pokeball
ocean -> sea, wave, sunset, dream, forest
india -> portugal, guatemala, indonesia, handwoven, italy
fbi -> agency, phil, cia, investigation, coates
jfk -> president, kennedy, roosevelt, presidential, political
glass -> crystal, frosted, wood, clear, holder
The internal hackathon was a great way to collaborate with colleagues from other teams and work on something outside of my day-to-day work. It was quite interesting to see the results for similar words with word embeddings generated from Mercari marketplace data. I look forward to participating in future company hackathons and experimenting with these embeddings for downstream machine learning tasks.