This article describes the steps we took while training Doc2Vec as part of Go Bold Day, from acquiring and preprocessing the data to training the model. Time is of the essence during hackathons, so we used parallel processing to save time when training the model. It was quite interesting to see the results.
I recently trained Doc2Vec as part of an internal hackathon at Mercari US called "Go Bold Day" (GBD). This is a two-day event at Mercari US in which you drop what you are working on and instead collaborate with others to experiment with new ideas and get out of your comfort zone. For my GBD submission, I collaborated with a colleague, Vamshi Teja, who works on a different machine learning team at Mercari, to build embeddings from scratch using Mercari data.
Every item listed on Mercari has an item title and an item description. We decided to train word embeddings from scratch on Mercari data, i.e. item titles and descriptions, to compare their performance and quality against embeddings generated from general language corpora.
Although Doc2Vec has been superseded by newer techniques like GloVe, fastText, ELMo, and more recently Transformer-based approaches, Doc2Vec is fast to train and can generate both word embeddings and document embeddings. This made Doc2Vec a good fit for our use case.
At Mercari, we store most of our analytics data in BigQuery, a cloud-based data warehouse. BigQuery exposes an SQL-like interface: data is organized into datasets, which contain tables on which SQL queries can be executed.
We were, however, going to train models on Vertex AI Workbench, a cloud-based Jupyter-Notebook-as-a-Service platform on GCP. The Jupyter Notebook IDE allows for rapid iteration while developing a machine learning model, while Vertex AI Workbench lets machine learning developers quickly switch from a small cloud machine for prototyping to a much larger machine for multiprocessing.
To move data from BigQuery to Vertex AI Workbench, we first created an intermediate table in BigQuery containing only the subset of data that we would be training on. This intermediate table was then downloaded into Vertex AI Workbench.
The following lines show the query used for creating the dataset.
SELECT name,
description
FROM `project_id.dataset.item_analytics`
WHERE status = "sold_out"
AND created BETWEEN "yyyy-mm-dd 00:00:00" AND "yyyy-mm-dd 23:59:59"
The query results were then exported to an intermediate table using the Cloud Console UI.
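The same export can also be scripted with the BigQuery client library instead of clicking through the console. The snippet below is only a rough sketch of that approach; the project, dataset, and table names are placeholders, and our actual export for GBD was done through the UI.
from google.cloud import bigquery

client = bigquery.Client()

# Write the query results into the intermediate table (placeholder names)
job_config = bigquery.QueryJobConfig(
    destination="project_id.dataset.intermediate_table",
    write_disposition="WRITE_TRUNCATE",
)
sql = """
SELECT name, description
FROM `project_id.dataset.item_analytics`
WHERE status = "sold_out"
  AND created BETWEEN "yyyy-mm-dd 00:00:00" AND "yyyy-mm-dd 23:59:59"
"""
client.query(sql, job_config=job_config).result()  # blocks until the job finishes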
To speed up downloading the data from BigQuery to the notebook, we used the BigQuery Storage API along with the official BigQuery client library. The following code snippet demonstrates using the BigQuery client with the BigQuery Storage API to download the data from BigQuery as a pandas dataframe.
from google.cloud import bigquery

bqclient = bigquery.Client()
table_ref = 'project_id.dataset.intermediate_table'
table = bigquery.TableReference.from_string(table_ref)

# Stream the table contents into a pandas dataframe via the BigQuery Storage API
rows = bqclient.list_rows(table)
dataframe = rows.to_dataframe(create_bqstorage_client=True)

# Persist the dataframe to disk in feather format
dataframe.to_feather("dataset.feather")
We chose the feather format to persist the dataframe to disk because it has good I/O performance. The data would not be retained for long, so disk space was not much of a concern.
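Reading the dataset back in a later session is then a one-liner with pandas (which uses pyarrow for feather I/O):
import pandas as pd

dataframe = pd.read_feather("dataset.feather")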
Item names and descriptions often contain information that is not useful and adds noise to the model training process, so we preprocessed the data by removing non-ASCII characters, punctuation, extra whitespace, and digits.
The following snippet shows the preprocessing code applied to each document, where a document is the concatenation of an item name and description.
import re
import string

import spacy

nlp = spacy.load("en_core_web_sm")

# Translation table: map unusual whitespace to spaces and delete punctuation
bad_whitespace = "\t\n\r\x0b\x0c"
trans_table = str.maketrans(bad_whitespace, " " * len(bad_whitespace), string.punctuation)


def preprocess(x):
    # Ignore non-ascii
    x = x.encode('ascii', 'ignore').decode()
    # Lowercase
    x = x.lower()
    # Remove non-space whitespace and delete punctuation
    x = x.translate(trans_table)
    # Delete digits
    x = re.sub(r"\d+", '', x)
    # Collapse multiple spaces
    x = ' '.join(x.split())
    # Lemmatize and drop stop words
    x = ' '.join(token.lemma_ for token in nlp(x)
                 if token.lemma_.lower() not in nlp.Defaults.stop_words)
    return x
For example, if we had the sentence:
Elizabeth and James Nirvana Amethyst EDP 1.7 oz bottle
It would be converted to:
elizabeth james nirvana amethyst edp oz bottle
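As a quick sanity check, the same transformation can be reproduced by calling preprocess directly (the exact lemmas can vary slightly with the spaCy model version):
print(preprocess("Elizabeth and James Nirvana Amethyst EDP 1.7 oz bottle"))
# elizabeth james nirvana amethyst edp oz bottle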
Once the dataset was acquired, we converted the pandas dataframe to a list of strings so that it would be easier to multiprocess. In previous experience with multiprocessing pandas dataframes using tools like Modin or Dask, I found that multiprocessing didn't work well when the dataframe contained string data, so we decided it would be much simpler to do Python multiprocessing over a list of strings.
We concatenated the dataframe columns name and description and converted the result to a Python list.
dataframe["text"] = dataframe["name"] + ' ' + dataframe["description"]
documents = dataframe["text"].tolist()
With all documents in a Python list, we parallelized the preprocessing of all documents. Additionally, we used tqdm to print a progress bar.
import multiprocessing

import tqdm

pool = multiprocessing.Pool(processes=200)
processed = list(tqdm.tqdm(pool.imap(preprocess, documents), total=len(documents)))

with open('corpus.txt', 'w') as f:
    for item in processed:
        f.write("%s\n" % item)
Since this process is compute intensive and easy to parallelize, we resized the Jupyter notebook instance to a larger machine type with several virtual CPUs and set the number of processes equal to the number of virtual CPUs.
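One way to keep the worker count in sync with the instance size, rather than hard-coding a number, is to use the CPU count reported by the standard library; a small sketch:
import multiprocessing

import tqdm

# Match the number of worker processes to the instance's virtual CPUs
n_workers = multiprocessing.cpu_count()
with multiprocessing.Pool(processes=n_workers) as pool:
    processed = list(tqdm.tqdm(pool.imap(preprocess, documents), total=len(documents)))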
We can train a Doc2Vec model in parallel with the gensim library. First, we ensured that gensim was using its compiled C extension; otherwise, training can be significantly slower.
import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"
Since we were short on time in a two-day hackathon, we additionally created intermediate checkpoints so that, if training took too long, we could still fall back to the checkpointed model.
import time

from gensim.models import Doc2Vec
from gensim.models.callbacks import CallbackAny2Vec
from gensim.test.utils import get_tmpfile


class callback(CallbackAny2Vec):
    """Callback to print progress and save a checkpoint after each epoch."""

    def __init__(self):
        self.path_prefix = "d2v"
        self.epoch = 0

    def on_epoch_start(self, model):
        print(f"Start:{self.epoch} epoch at {time.ctime()}")

    def on_epoch_end(self, model):
        print(f"Done:{self.epoch} epoch at {time.ctime()}")
        output_path = get_tmpfile(f"{self.path_prefix}_epoch{self.epoch}.model")
        model.save(output_path)
        print("Saved:", output_path)
        self.epoch += 1
Most blogs I found online explain the training process using the documents parameter (as is common when training Word2Vec); however, the gensim documentation recommends the corpus_file mode instead for a performance boost, so we used the corpus_file parameter.
corpus = "corpus.txt"
model = Doc2Vec(corpus_file=corpus, vector_size=100, window=5, min_count=20, workers=14, callbacks=[callback()])
print("Building Vocab")
model.build_vocab(corpus_file=corpus,)
print("Built Vocab")
print(f"Number of Documents:{model.corpus_count}")
print(f"Total Number of Words:{model.corpus_total_words}")
model.train(
corpus_file=corpus,
total_examples=model.corpus_count,
epochs=10,
total_words=model.corpus_total_words,
callbacks=[callback()],
)
model.save("model.doc2vec")
We discovered that the training process for Doc2Vec with gensim is not as parallelizable as the preprocessing code shown earlier. Instead, the performance boost seemed to max out at around 14-16 workers; adding more workers didn't improve performance any further and instead seemed to make it worse.
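Once training finished, the saved model (or any of the epoch checkpoints) could be loaded back to infer a vector for a new, preprocessed document. A minimal sketch using the gensim API:
from gensim.models import Doc2Vec

model = Doc2Vec.load("model.doc2vec")
tokens = preprocess("Elizabeth and James Nirvana Amethyst EDP 1.7 oz bottle").split()
vector = model.infer_vector(tokens)  # 100-dimensional document embedding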
Once the model was trained, we explored the document and word embeddings by building a Streamlit app. The results for the document embeddings were not positive. Mercari documents are relatively small compared to Wikipedia articles, which contain much more text, so I believe increasing the number of epochs might help improve performance.
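The app itself was essentially a thin wrapper around the trained model. A minimal sketch of that kind of Streamlit app (the widget labels and layout here are illustrative, not our exact app) might look like this:
import streamlit as st
from gensim.models import Doc2Vec

model = Doc2Vec.load("model.doc2vec")

word = st.text_input("Query word", "lipstick")
if word in model.wv:
    # Show the top 5 most similar words and their cosine similarities
    st.table(model.wv.most_similar(word, topn=5))
else:
    st.write("Word not in vocabulary")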
We evaluated the word embeddings by looking for similar words in the embedding space. Since we had a large number of documents in the training set, the model was able to learn useful word vector representations. This was especially true for marketplace concept words such as discount, sales, and defects. Our results were also good for popular categories in the marketplace such as fashion goods, electronics, and cosmetics. We also evaluated the word embeddings on analogies, but they didn't perform that well there.
We can find similar words by using word vectors in the doc2vec model:
model.wv.most_similar(word, topn=top_N)
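Analogy queries use the same interface with positive and negative word lists; for example (the words here are only illustrative):
# Vector-arithmetic analogy: which word is to "woman" as "king" is to "man"?
model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=top_N)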
The following list shows some of our results for the top 5 similar words given a query word:
winter -> summer, fall, spring, beach, outfit
lipstick -> lipgloss, lip, lippie, eyeshadow, eyeliner
shoes -> sneakers, oxfords, boots, slipons, kswiss
perfume -> parfum, scent, spray, fragrance, lotion
defect -> flaw, damage, imperfection, blemish, assume
sale -> sell, list, price, currently, value
xs -> xsmall, medium, xl, xxs, short
xmas -> christmas, christma, holiday, christmasholiday, halloween
red -> white, orange, purple, black, green
batwoman -> catwoman, supergirl, huntress, deathstroke, shazam
batman -> superman, joker, dc, comic, marvel
girls -> womans, juniors, siz, ladies, ladie
boys -> kids, shorts, shirts, men, pants
ps -> playstation, xbox, game, console, psp
laptop -> tablet, computer, compartment, ipad, phone
keyboard -> ipad, keypad, macbook, trackpad, touchpad
ram -> arcade, drive, ddr, ghz, onboard
naruto -> goku, anime, itachi, dbz, dragonball
pokemon -> pokmon, charmander, pikachu, kanto, pokeball
ocean -> sea, wave, sunset, dream, forest
india -> portugal, guatemala, indonesia, handwoven, italy
fbi -> agency, phil, cia, investigation, coates
jfk -> president, kennedy, roosevelt, presidential, political
glass -> crystal, frosted, wood, clear, holder
The internal hackathon was a great way to collaborate with colleagues from other teams and work on something outside of my day-to-day work. It was quite interesting to see the results for similar words with word embeddings generated from Mercari marketplace data. I look forward to participating in future company hackathons and experimenting with these embeddings for downstream machine learning tasks.