Introduction


Figure: Extended Skip-Gram Model

Wikipedia2Vec is a tool for learning embeddings of words and entities from Wikipedia. The learned embeddings map similar words and entities close to one another in a continuous vector space.
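As a minimal sketch of how the learned embeddings can be used, the snippet below loads a trained model and queries word vectors, entity vectors, and nearest neighbors via the tool's Python API; the model file name is illustrative.

```python
from wikipedia2vec import Wikipedia2Vec

# Load a trained model file (the file name here is illustrative).
wiki2vec = Wikipedia2Vec.load("enwiki_300d.pkl")

# Word and entity embeddings live in the same vector space.
word_vec = wiki2vec.get_word_vector("tokyo")
entity_vec = wiki2vec.get_entity_vector("Tokyo")

# Nearest neighbors of an entity can include both words and entities.
for item, score in wiki2vec.most_similar(wiki2vec.get_entity("Tokyo"), 5):
    print(item, score)
```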

This tool learns embeddings of words and entities by iterating over entire Wikipedia pages and jointly optimizing the following three submodels:

- The Wikipedia link graph model, which learns entity embeddings by predicting neighboring entities in Wikipedia's link graph, an undirected graph whose nodes are entities and whose edges are the links between them.
- The word-based skip-gram model, which learns word embeddings by predicting neighboring words given each word in the text of a Wikipedia page.
- The anchor context model, which learns embeddings by predicting neighboring words given each entity referred to by an anchor link on a Wikipedia page, thereby placing similar words and entities near one another and creating interactions between word and entity embeddings.

These three submodels are based on the skip-gram model, a neural network model whose training objective is to find embeddings that are useful for predicting context items (i.e., neighboring words or entities) given a target item. For further details, please refer to this paper: Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.
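For reference, the standard skip-gram objective maximizes the log-probability of each context item given its target item. The notation below is introduced here for illustration: D is the set of target items in the training data, context(t) is the set of context items around a target t, and u_t and v_c are the target and context embeddings.

```latex
\mathcal{L} = \sum_{t \in D} \sum_{c \in \mathrm{context}(t)} \log P(c \mid t),
\qquad
P(c \mid t) = \frac{\exp(\mathbf{v}_c^{\top} \mathbf{u}_t)}
                   {\sum_{c' \in V} \exp(\mathbf{v}_{c'}^{\top} \mathbf{u}_t)}
```

In practice, the full softmax over the vocabulary V is too expensive to compute and is typically approximated, for example with negative sampling.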

Optimized Implementation for Learning Embeddings

Wikipedia2Vec is implemented in Python, and most of its code is converted into C++ using Cython to boost its performance. Linear algebraic operations required to learn embeddings are performed by Basic Linear Algebra Subprograms (BLAS). We store the embeddings as a float matrix in a shared memory space and update it in parallel using multiple processes.
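The sketch below is not the tool's actual implementation; it only illustrates the general idea of placing a float32 embedding matrix in shared memory and updating it from multiple worker processes without locking (Hogwild-style). The matrix sizes, update rule, and worker logic are assumptions made for illustration.

```python
import multiprocessing as mp
import numpy as np

VOCAB_SIZE, DIM = 1000, 100  # illustrative sizes

def init_worker(shared_array):
    # Re-wrap the shared buffer as a NumPy matrix inside each worker process.
    global embeddings
    embeddings = np.frombuffer(shared_array, dtype=np.float32).reshape(VOCAB_SIZE, DIM)

def train_chunk(indices):
    # Lock-free updates: each process writes directly to the shared matrix.
    # A random perturbation stands in for the real gradient step.
    for i in indices:
        grad = np.random.uniform(-0.01, 0.01, DIM).astype(np.float32)
        embeddings[i] += grad

if __name__ == "__main__":
    # One float32 matrix backed by shared memory, visible to all workers.
    shared = mp.RawArray("f", VOCAB_SIZE * DIM)
    chunks = np.array_split(np.arange(VOCAB_SIZE), 4)
    with mp.Pool(4, initializer=init_worker, initargs=(shared,)) as pool:
        pool.map(train_chunk, chunks)
```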

One challenge is that many entity names do not appear as links in Wikipedia. This is because Wikipedia instructs its contributors to create a link only when the name first occurs on the page. This is problematic because Wikipedia2Vec uses links as a source to learn embeddings.

To address this, our tool provides a feature that automatically generates links. It first builds a dictionary that maps each entity name to its possible referent entities, by extracting all names and the entities they refer to from the links contained in Wikipedia. Then, during training, our tool takes all words and phrases on the target page and converts each into a link to an entity if that entity is referred to by a link on the same page, or if the name has only one referent entity in the dictionary.
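The sketch below illustrates this idea under simplified assumptions: the input format of the extracted links, the function names, and the phrase matching are all hypothetical, and the tool's actual data structures and logic differ.

```python
from collections import defaultdict

def build_mention_dictionary(wiki_links):
    # wiki_links: iterable of (anchor_text, target_entity) pairs extracted
    # from all links in Wikipedia (hypothetical input format).
    mention_to_entities = defaultdict(set)
    for anchor_text, target_entity in wiki_links:
        mention_to_entities[anchor_text.lower()].add(target_entity)
    return mention_to_entities

def generate_links(page_phrases, page_link_targets, mention_to_entities):
    # page_phrases: candidate words and phrases found on the target page.
    # page_link_targets: entities already linked from this page.
    generated = []
    for phrase in page_phrases:
        candidates = mention_to_entities.get(phrase.lower(), set())
        on_page = candidates & page_link_targets
        if on_page:
            # A candidate entity is already linked on the same page.
            generated.append((phrase, next(iter(on_page))))
        elif len(candidates) == 1:
            # The name is unambiguous: exactly one referent entity is known.
            generated.append((phrase, next(iter(candidates))))
    return generated
```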