Learning Embeddings


First, you need to download a Wikipedia dump file (e.g., enwiki-latest-pages-articles.xml.bz2) from Wikimedia Downloads. The latest English dump file can be obtained by running the following command.

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Note that you do not need to decompress the dump file.
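To illustrate why decompression is unnecessary: bzip2 files can be read as a stream, so tools can consume the compressed XML directly. The snippet below is a small standalone illustration using Python's standard bz2 module; it is not part of wikipedia2vec.

```python
import bz2

# Tools can stream a .bz2 dump without ever writing a decompressed
# copy to disk. Toy illustration: compress a snippet of XML in memory
# and read it back, as bz2.open(path) would do for a dump file on disk.
snippet = b"<page><title>Example</title></page>"
compressed = bz2.compress(snippet)

# Decompressing the stream yields the original XML bytes.
restored = bz2.decompress(compressed)
print(restored.decode())  # <page><title>Example</title></page>
```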

Then, the embeddings can be trained from a Wikipedia dump using the train command.

% wikipedia2vec train DUMP_FILE OUT_FILE

Arguments:

- DUMP_FILE: The Wikipedia dump file downloaded above.
- OUT_FILE: The output file of the learned model.

Options:

The train command internally calls the five commands described below (namely, build_dump_db, build_dictionary, build_link_graph, build_mention_db, and train_embedding). Further, the learned model file can be converted to a text file compatible with the format of Word2vec and GloVe using the save_text command.

Building Dump Database

The build_dump_db command creates a database containing Wikipedia pages, each of which consists of the page text and the anchor links in it.

% wikipedia2vec build_dump_db DUMP_FILE OUT_FILE

Arguments:

- DUMP_FILE: The Wikipedia dump file.
- OUT_FILE: The output dump database file.

Options:
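
To make the stored structure concrete, the sketch below extracts anchor links of the form [[Target|anchor text]] from wiki markup, which is roughly the per-page information the dump database keeps alongside the text. The regular expression and page record are illustrative only, not wikipedia2vec's internals.

```python
import re

# Anchor links in wiki markup: [[Target]] or [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_anchors(wiki_text):
    """Return (target, anchor_text) pairs; the text defaults to the target."""
    return [(m.group(1), m.group(2) or m.group(1))
            for m in LINK_RE.finditer(wiki_text)]

page = "[[Tokyo]] is the capital of [[Japan|the country of Japan]]."
print(extract_anchors(page))
# [('Tokyo', 'Tokyo'), ('Japan', 'the country of Japan')]
```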

Building Dictionary

The build_dictionary command builds a dictionary of words and entities.

% wikipedia2vec build_dictionary DUMP_DB_FILE OUT_FILE

Arguments:

- DUMP_DB_FILE: The dump database file generated by the build_dump_db command.
- OUT_FILE: The output dictionary file.

Options:
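
Conceptually, the dictionary assigns indices to the words and entities that occur often enough in the dump. The following minimal sketch shows that idea; the corpus, the minimum-count threshold, and the ENTITY/ prefix used here are all hypothetical, not wikipedia2vec's actual record layout.

```python
from collections import Counter

# Toy corpus: tokenized page texts plus the entities they link to.
pages = [
    (["tokyo", "is", "the", "capital", "of", "japan"], ["Tokyo", "Japan"]),
    (["japan", "is", "an", "island", "country"], ["Japan"]),
]

word_counts, entity_counts = Counter(), Counter()
for words, entities in pages:
    word_counts.update(words)
    entity_counts.update(entities)

# Keep items above a (hypothetical) minimum count and assign indices;
# entities follow words in a single shared index space.
MIN_COUNT = 2
vocab = [w for w, c in word_counts.items() if c >= MIN_COUNT]
vocab += [f"ENTITY/{e}" for e, c in entity_counts.items() if c >= MIN_COUNT]
index = {item: i for i, item in enumerate(vocab)}
print(index)  # {'is': 0, 'japan': 1, 'ENTITY/Japan': 2}
```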

Building Link Graph

The build_link_graph command generates a sparse matrix representing the link structure between Wikipedia entities.

% wikipedia2vec build_link_graph DUMP_DB_FILE DIC_FILE OUT_FILE

Arguments:

- DUMP_DB_FILE: The dump database file generated by the build_dump_db command.
- DIC_FILE: The dictionary file generated by the build_dictionary command.
- OUT_FILE: The output link graph file.

Options:
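
The link graph is essentially a sparse adjacency matrix over the entities in the dictionary. The sketch below builds the equivalent coordinate (row, column) representation in plain Python; the entities, indices, and links are made up for illustration.

```python
# Entities from the dictionary, mapped to row/column indices.
entity_index = {"Tokyo": 0, "Japan": 1, "Kyoto": 2}

# Link structure between Wikipedia pages.
links = [("Tokyo", "Japan"), ("Kyoto", "Japan")]

# Coordinate (COO-style) representation of the sparse adjacency matrix:
# an entry (i, j) means entity i links to entity j.
rows, cols = [], []
for a, b in links:
    i, j = entity_index[a], entity_index[b]
    rows += [i, j]   # store both directions so the matrix is symmetric
    cols += [j, i]

print(list(zip(rows, cols)))
# [(0, 1), (1, 0), (2, 1), (1, 2)]
```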

Building Mention DB (Optional)

The build_mention_db command builds a database that maps entity names (mentions) to their possible referent entities.

% wikipedia2vec build_mention_db DUMP_DB_FILE DIC_FILE OUT_FILE

Arguments:

- DUMP_DB_FILE: The dump database file generated by the build_dump_db command.
- DIC_FILE: The dictionary file generated by the build_dictionary command.
- OUT_FILE: The output mention database file.

Options:
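
A mention database can be pictured as a mapping from surface strings to candidate entities with occurrence counts, from which a prior probability for each referent can be derived. The record layout below is illustrative only; the sample mentions are made up.

```python
from collections import defaultdict

# How often each anchor text (mention) points to each entity.
mention_counts = defaultdict(dict)
observations = [
    ("Java", "Java (programming language)"),
    ("Java", "Java (programming language)"),
    ("Java", "Java (island)"),
]
for mention, entity in observations:
    counts = mention_counts[mention]
    counts[entity] = counts.get(entity, 0) + 1

def candidates(mention):
    """Return (entity, prior probability) pairs for a mention."""
    counts = mention_counts[mention]
    total = sum(counts.values())
    return [(e, c / total) for e, c in counts.items()]

# "Java" refers to the language two times out of three observations.
print(candidates("Java"))
```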

Learning Embeddings

The train_embedding command trains the word and entity embeddings.

% wikipedia2vec train_embedding DUMP_DB_FILE DIC_FILE OUT_FILE

Arguments:

- DUMP_DB_FILE: The dump database file generated by the build_dump_db command.
- DIC_FILE: The dictionary file generated by the build_dictionary command.
- OUT_FILE: The output model file.

Options:
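
Since words and entities are placed in the same vector space, relatedness between any pair of items can be measured with cosine similarity. The toy computation below shows the measure itself; the vectors are fabricated, not output of a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 3-dimensional vectors for one word and two entities.
vec_word_tokyo = [0.9, 0.1, 0.0]
vec_entity_tokyo = [0.8, 0.2, 0.1]
vec_entity_banana = [0.0, 0.1, 0.9]

# The word "tokyo" should be closer to the entity Tokyo than to Banana.
print(cosine(vec_word_tokyo, vec_entity_tokyo) >
      cosine(vec_word_tokyo, vec_entity_banana))  # True
```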

Saving Embeddings in Text Format

The save_text command outputs a trained model in text format.

% wikipedia2vec save_text MODEL_FILE OUT_FILE

Arguments:

- MODEL_FILE: The trained model file.
- OUT_FILE: The output text file.

Options:
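
The resulting file follows the Word2vec/GloVe convention of one item per line: a token followed by its vector components. The parser sketch below illustrates that layout; the sample lines are fabricated, and the exact token naming wikipedia2vec uses for entities may differ.

```python
# Each line: a token followed by its vector components, the common
# Word2vec/GloVe text layout. The sample content below is made up.
sample = """\
tokyo 0.1 0.2 0.3
japan 0.0 0.4 0.1
"""

def parse_embeddings(text):
    """Parse 'token v1 v2 ...' lines into a dict of float vectors."""
    vectors = {}
    for line in text.splitlines():
        token, *values = line.split(" ")
        vectors[token] = [float(v) for v in values]
    return vectors

vectors = parse_embeddings(sample)
print(vectors["tokyo"])  # [0.1, 0.2, 0.3]
```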