Vector space models of language, or *word embeddings*, represent each word with a vector of continuous values that encodes semantic and syntactic attributes of the word. Words with similar meanings (e.g., *happy*, *pleased*) are represented with similar vectors, while dissimilar words (e.g., *happy*, *asphalt*) are distant in vector space. Several machine learning techniques, such as word2vec and GloVe, can efficiently learn word representations from unlabeled data. The resulting vectors can then be reused as input features for a wide variety of NLP applications.
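"Similar" and "distant" here are typically measured by cosine similarity between vectors. A minimal sketch with toy 4-dimensional embeddings (invented for illustration; real embeddings are learned and much longer):

```python
import numpy as np

# Toy embeddings, hand-picked so that happy/pleased align and asphalt does not.
emb = {
    "happy":   np.array([0.9, 0.8, 0.1, 0.0]),
    "pleased": np.array([0.8, 0.9, 0.2, 0.1]),
    "asphalt": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, ~0 for orthogonal ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["happy"], emb["pleased"]))  # high, near 1
print(cosine(emb["happy"], emb["asphalt"]))  # low, near 0
```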

These distributed representations offer many advantages over traditional bag-of-words techniques, such as increased generalization power and a more compact format requiring fewer trained parameters in models that consume these features as inputs. However, many factors determine the suitability of a set of embeddings for a particular task, including the domain and size of the training data, the length of the vectors, and the training algorithm and its hyperparameters.

The Neuromorphic NLP workgroup made extensive use of word embeddings as inputs to our spiking neural networks. The distributed representations are an appealing abstraction relative to one-neuron-per-word approaches. However, the 64-bit floating point continuous values produced by word2vec are unrealistic in biological neurons. We experimented with a variety of techniques for quantization to lower bit depths and for conversion to input spikes.

For training data, we used a recent download of the entire English Wikipedia. After extracting and pre-processing the text, this dataset consisted of 3,169,913,651 tokens. We varied the vector length between 64 and 256. We used word2vec to produce several alternative sets of embeddings, using the following hyperparameters:

word2vec -negative 15 -iter 5 -cbow 0 -min-count 100 -window 10

While the intrinsic quality of a set of embeddings is difficult to quantify, it is common to use an extrinsic, task-based approach to measure the utility of a given set of embeddings. We use here the vector offset word analogy evaluation proposed by Mikolov et al. (2013). For all 64-length vectors, vectors were built for words only. For 256-length vectors, vectors were learned for words *and* frequently observed phrases (e.g. "true_north" instead of "true" and "north").
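In the vector offset evaluation, a question such as "man is to king as woman is to ?" is answered by computing king − man + woman and returning the nearest vocabulary word by cosine similarity, excluding the question words. A minimal sketch (the toy 3-d embeddings are invented so the offset relation holds exactly):

```python
import numpy as np

# Toy embeddings constructed for illustration; real analogy questions are
# answered over the full learned vocabulary.
emb = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
}

def analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' via the vector offset b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):  # exclude the question words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("man", "king", "woman", emb))  # "queen"
```

Accuracy on the benchmark is the fraction of such questions answered correctly; a question is skipped (not "answered") when one of its words is out of vocabulary, which is why coverage is reported alongside accuracy below.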

| Vectors | Quantization | Notes | Analogy accuracy | Questions answered |
| --- | --- | --- | --- | --- |
| 64-length | none | | 47.33% | 99.66% |
| 64-length | 9-bit | | tbd | |
| 64-length | 4-bit | | tbd | |
| 64-length | 9-bit | TrueNorth-constrained | 44.62% | |
| 64-length | 4-bit | TrueNorth-constrained | 44.38% | |
| 256-length | none | top 64,000 words only | 71.75% | 80.67% |
| 256-length | 9-bit | top 64,000 words only | 71.65% | 80.67% |
| 256-length | 4-bit | top 64,000 words only | 69.39% | 80.67% |

The unconstrained 256-length vectors achieved 70.59% accuracy with 95.45% coverage when the vocabulary was expanded to 128,000 words. However, the 256-length vectors could not be transformed into a TrueNorth-constrained format without total loss of information. All further experiments therefore used the 64-length vectors.

When quantizing the vectors, we also clipped the top and bottom 10% of the values. Allowing the extrema to saturate let us represent the remaining values with greater precision.
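The clip-then-quantize step can be sketched as follows. This is an assumption about the exact procedure (the original does not specify whether quantiles were computed globally or per dimension); here the 10th and 90th percentiles are taken over all values, values outside that range saturate, and the remainder is mapped uniformly onto 2^bits levels:

```python
import numpy as np

def clip_and_quantize(vecs, bits, clip_frac=0.10):
    """Saturate the extreme values, then quantize uniformly to 2**bits levels."""
    lo = np.quantile(vecs, clip_frac)        # bottom 10% saturates to lo
    hi = np.quantile(vecs, 1.0 - clip_frac)  # top 10% saturates to hi
    clipped = np.clip(vecs, lo, hi)
    levels = 2 ** bits - 1
    # Map [lo, hi] onto integer codes 0..levels, then back to floats.
    codes = np.round((clipped - lo) / (hi - lo) * levels)
    return codes / levels * (hi - lo) + lo

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 64))           # stand-in for real embeddings
q4 = clip_and_quantize(vecs, bits=4)
print(len(np.unique(q4)))                    # at most 2**4 = 16 distinct values
```

Because the clipped range is narrower than the full range of the data, each of the 2^bits levels covers a smaller interval, which is the precision gain described above.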