Wikipedia2Vec


Introduction

Wikipedia2Vec is a tool for obtaining embeddings (vector representations) of words and entities from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings that map words and entities into a unified continuous vector space. The resulting vectors can be used as word embeddings, as entity embeddings, or as unified embeddings of words and entities. They have been used in state-of-the-art models for tasks such as entity linking, named entity recognition, entity relatedness, and question answering.

The embeddings can be trained with a single command, using a publicly available Wikipedia dump as input (see the sketch below). The code is implemented in Python and optimized with Cython and BLAS.
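
As a rough sketch of that workflow (DUMP_FILE and OUT_FILE are placeholders for a Wikipedia dump and an output path; the full set of training options is covered in the documentation):

% pip install wikipedia2vec
% wikipedia2vec train DUMP_FILE OUT_FILE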

Pretrained embeddings for 12 languages can be downloaded from this page.
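
As a minimal sketch of querying a model from Python, assuming a trained or downloaded model file is available ('MODEL_FILE' below is a placeholder path; the method names follow the Wikipedia2Vec Python API):

from wikipedia2vec import Wikipedia2Vec

# Load a model produced by `wikipedia2vec train` or downloaded from the
# pretrained-embeddings page ('MODEL_FILE' is a placeholder path).
wiki2vec = Wikipedia2Vec.load('MODEL_FILE')

# Words and entities live in the same continuous vector space, so both
# kinds of vectors can be retrieved and compared directly.
word_vector = wiki2vec.get_word_vector('tokyo')
entity_vector = wiki2vec.get_entity_vector('Tokyo')

# Nearest neighbors of an entity may include both words and entities,
# since the two share one embedding space.
for item, score in wiki2vec.most_similar(wiki2vec.get_entity('Tokyo'), 5):
    print(item, score)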

Reference

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

@InProceedings{yamada-EtAl:2016:CoNLL,
  author    = {Yamada, Ikuya  and  Shindo, Hiroyuki  and  Takeda, Hideaki  and  Takefuji, Yoshiyasu},
  title     = {Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  booktitle = {Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  pages     = {250--259},
  publisher = {Association for Computational Linguistics}
}

License

Apache License 2.0