Overcoming the Memory Bottleneck in Distributed Training of Latent Variable Models of Text
Yi Yang, Alexander Yates and Doug Downey
Large unsupervised latent variable models (LVMs) of text, such as Latent
Dirichlet Allocation models or Hidden Markov Models (HMMs), are constructed
using parallel training algorithms on computational clusters. The memory
required to hold LVM parameters forms a bottleneck in training more powerful
models. In this paper, we show how the memory required for parallel LVM
training can be reduced by partitioning the training corpus to minimize the
number of unique words on any computational node. We present a greedy
document partitioning technique for the task. For large corpora, our approach
reduces memory consumption by over 50% and trains the same models up to three
times faster than existing approaches for parallel LVM training.
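To make the idea of the greedy step concrete, here is a minimal sketch of one plausible greedy heuristic for the task the abstract describes: each document is assigned to the compute node whose vocabulary it would enlarge the least, with current token load as a tie-breaker. The exact greedy criterion, document ordering, and data structures used in the paper may differ; all function and variable names below are illustrative assumptions.

```python
def greedy_partition(documents, num_nodes):
    """Assign each document (a list of word tokens) to one of `num_nodes`
    partitions, trying to keep the number of unique words per node small."""
    vocabularies = [set() for _ in range(num_nodes)]   # words already placed on each node
    loads = [0] * num_nodes                            # tokens assigned to each node so far
    assignment = []

    # Processing larger documents first gives the greedy step more information
    # early on (an assumption, not necessarily the paper's ordering).
    order = sorted(range(len(documents)), key=lambda i: -len(documents[i]))

    for doc_id in order:
        doc_vocab = set(documents[doc_id])
        # Pick the node where this document adds the fewest new unique words,
        # breaking ties by current token load to keep work roughly balanced.
        best_node = min(
            range(num_nodes),
            key=lambda n: (len(doc_vocab - vocabularies[n]), loads[n]),
        )
        vocabularies[best_node] |= doc_vocab
        loads[best_node] += len(documents[doc_id])
        assignment.append((doc_id, best_node))

    return assignment, [len(v) for v in vocabularies]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "stochastic variational inference scales".split(),
        "variational inference for topic models".split(),
    ]
    assignment, vocab_sizes = greedy_partition(docs, num_nodes=2)
    print(assignment)      # (doc_id, node) pairs
    print(vocab_sizes)     # unique words held on each node
```

Keeping per-node vocabularies small in this way directly shrinks the slice of the LVM parameter table (e.g., per-word topic or emission counts) that each node must hold in memory, which is the bottleneck the paper targets.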