Learning Embeddings#
First, download a source Wikipedia dump file (e.g., enwiki-latest-pages-articles.xml.bz2) from Wikimedia Downloads. The latest English dump can be obtained by running the following command.
% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Note that you do not need to decompress the dump file.
Then, the embeddings can be trained from a Wikipedia dump using the train command.
% wikipedia2vec train DUMP_FILE OUT_FILE
Arguments:
- DUMP_FILE: The Wikipedia dump file
- OUT_FILE: The output file
Options:
- --dim-size: The number of dimensions of the embeddings (default: 100)
- --window: The maximum distance between the target item (word or entity) and the context word to be predicted (default: 5)
- --iteration: The number of training iterations over the Wikipedia pages (default: 5)
- --negative: The number of negative samples (default: 5)
- --lowercase/--no-lowercase: Whether to lowercase words (default: True)
- --tokenizer: The name of the tokenizer used to tokenize a text into words. Possible choices are regexp, icu, mecab, and jieba
- --sent-detect: The sentence detector used to split texts into sentences. Currently, icu is the only possible value (default: None)
- --min-word-count: A word is ignored if the total frequency of the word is less than this value (default: 10)
- --min-entity-count: An entity is ignored if the total frequency of the entity appearing as the referent of an anchor link is less than this value (default: 5)
- --min-paragraph-len: A paragraph is ignored if its length is shorter than this value (default: 5)
- --category/--no-category: Whether to include Wikipedia categories in the dictionary (default: False)
- --disambi/--no-disambi: Whether to include disambiguation entities in the dictionary (default: False)
- --link-graph/--no-link-graph: Whether to learn from the Wikipedia link graph (default: True)
- --entities-per-page: When processing each page, this many randomly chosen entities are used to predict their neighboring entities in the link graph (default: 10)
- --link-mentions: Whether to convert entity names into links (default: True)
- --min-link-prob: An entity name is ignored if the probability of the name appearing as a link is less than this value (default: 0.2)
- --min-prior-prob: An entity is not registered as a referent of an entity name if the probability of the entity name referring to the entity is less than this value (default: 0.01)
- --max-mention-len: The maximum number of characters in an entity name (default: 20)
- --init-alpha: The initial learning rate (default: 0.025)
- --min-alpha: The minimum learning rate (default: 0.0001)
- --sample: The parameter that controls the downsampling of frequent words (default: 1e-4)
- --word-neg-power: Words are negative-sampled with probability proportional to their frequency raised to this power (default: 0.75)
- --entity-neg-power: Entities are negative-sampled with probability proportional to their frequency raised to this power (default: 0)
- --pool-size: The number of worker processes (default: the number of CPUs)
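For example, 300-dimensional embeddings with a wider context window might be trained as follows; the output file name enwiki_300d.pkl is an arbitrary placeholder.
% wikipedia2vec train --dim-size 300 --window 10 enwiki-latest-pages-articles.xml.bz2 enwiki_300d.pkl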
The train command internally calls the five commands described below (namely, build-dump-db, build-dictionary, build-link-graph, build-mention-db, and train-embedding). Further, the learned model file can be converted to a text file compatible with the format of Word2vec and GloVe using the save-text command.
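For illustration, these steps, followed by save-text, might be run individually as shown below. The intermediate file names (enwiki.db, enwiki_dic.pkl, and so on) are arbitrary placeholders, and the link graph and mention DB steps are optional.
% wikipedia2vec build-dump-db enwiki-latest-pages-articles.xml.bz2 enwiki.db
% wikipedia2vec build-dictionary enwiki.db enwiki_dic.pkl
% wikipedia2vec build-link-graph enwiki.db enwiki_dic.pkl enwiki_lg.npz
% wikipedia2vec build-mention-db enwiki.db enwiki_dic.pkl enwiki_mention.pkl
% wikipedia2vec train-embedding --link-graph enwiki_lg.npz --mention-db enwiki_mention.pkl enwiki.db enwiki_dic.pkl enwiki_model.pkl
% wikipedia2vec save-text enwiki_model.pkl enwiki_model.txt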
Building Dump Database#
The build-dump-db command creates a database of Wikipedia pages, each consisting of the page text and its anchor links.
% wikipedia2vec build-dump-db DUMP_FILE OUT_FILE
Arguments:
- DUMP_FILE: The Wikipedia dump file
- OUT_FILE: The output file
Options:
- --pool-size: The number of worker processes (default: the number of CPUs)
Building Dictionary#
The build-dictionary command builds a dictionary of words and entities.
% wikipedia2vec build-dictionary DUMP_DB_FILE OUT_FILE
Arguments:
- DUMP_DB_FILE: The database file generated using the build-dump-db command
- OUT_FILE: The output file
Options:
- --lowercase/--no-lowercase: Whether to lowercase words (default: True)
- --tokenizer: The name of the tokenizer used to tokenize a text into words. Possible choices are regexp, icu, mecab, and jieba
- --min-word-count: A word is ignored if the total frequency of the word is less than this value (default: 10)
- --min-entity-count: An entity is ignored if the total frequency of the entity appearing as the referent of an anchor link is less than this value (default: 5)
- --min-paragraph-len: A paragraph is ignored if its length is shorter than this value (default: 5)
- --category/--no-category: Whether to include Wikipedia categories in the dictionary (default: False)
- --disambi/--no-disambi: Whether to include disambiguation entities in the dictionary (default: False)
- --pool-size: The number of worker processes (default: the number of CPUs)
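For a Japanese dump, for instance, the MeCab tokenizer could be selected as follows; the file names are arbitrary placeholders.
% wikipedia2vec build-dictionary --tokenizer mecab jawiki.db jawiki_dic.pkl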
Building Link Graph (Optional)#
The build-link-graph command generates a sparse matrix representing the link structure between Wikipedia entities.
% wikipedia2vec build-link-graph DUMP_DB_FILE DIC_FILE OUT_FILE
Arguments:
- DUMP_DB_FILE: The database file generated using the build-dump-db command
- DIC_FILE: The dictionary file generated by the build-dictionary command
- OUT_FILE: The output file
Options:
- --pool-size: The number of worker processes (default: the number of CPUs)
Building Mention DB (Optional)#
The build-mention-db command builds a database that maps entity names (mentions) to their possible referent entities.
% wikipedia2vec build-mention-db DUMP_DB_FILE DIC_FILE OUT_FILE
Arguments:
- DUMP_DB_FILE: The database file generated using the build-dump-db command
- DIC_FILE: The dictionary file generated by the build-dictionary command
- OUT_FILE: The output file
Options:
- --min-link-prob: An entity name is ignored if the probability of the name appearing as a link is less than this value (default: 0.2)
- --min-prior-prob: An entity is not registered as a referent of an entity name if the probability of the entity name referring to the entity is less than this value (default: 0.01)
- --max-mention-len: The maximum number of characters in an entity name (default: 20)
- --case-sensitive: Whether to detect entity names in a case-sensitive manner (default: False)
- --tokenizer: The name of the tokenizer used to tokenize a text into words. Possible choices are regexp, icu, mecab, and jieba
- --pool-size: The number of worker processes (default: the number of CPUs)
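For example, thresholds stricter than the defaults might be passed as follows; the file names are arbitrary placeholders.
% wikipedia2vec build-mention-db --min-link-prob 0.3 --min-prior-prob 0.05 enwiki.db enwiki_dic.pkl enwiki_mention.pkl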
Learning Embeddings#
The train-embedding command trains the embeddings.
% wikipedia2vec train-embedding DUMP_DB_FILE DIC_FILE OUT_FILE
Arguments:
- DUMP_DB_FILE: The database file generated using the build-dump-db command
- DIC_FILE: The dictionary file generated by the build-dictionary command
- OUT_FILE: The output file
Options:
- --link-graph: The link graph file generated using the build-link-graph command
- --mention-db: The mention DB file generated using the build-mention-db command
- --dim-size: The number of dimensions of the embeddings (default: 100)
- --window: The maximum distance between the target item (word or entity) and the context word to be predicted (default: 5)
- --iteration: The number of training iterations over the Wikipedia pages (default: 5)
- --negative: The number of negative samples (default: 5)
- --tokenizer: The name of the tokenizer used to tokenize a text into words. Possible values are regexp, icu, mecab, and jieba
- --sent-detect: The sentence detector used to split texts into sentences. Currently, icu is the only possible value (default: None)
- --entities-per-page: When processing each page, this many randomly chosen entities are used to predict their neighboring entities in the link graph (default: 10)
- --init-alpha: The initial learning rate (default: 0.025)
- --min-alpha: The minimum learning rate (default: 0.0001)
- --sample: The parameter that controls the downsampling of frequent words (default: 1e-4)
- --word-neg-power: Words are negative-sampled with probability proportional to their frequency raised to this power (default: 0.75)
- --entity-neg-power: Entities are negative-sampled with probability proportional to their frequency raised to this power (default: 0)
- --pool-size: The number of worker processes (default: the number of CPUs)
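For example, to train embeddings using the link graph and mention DB produced by the optional commands above, one might run the following; the file names are arbitrary placeholders.
% wikipedia2vec train-embedding --link-graph enwiki_lg.npz --mention-db enwiki_mention.pkl enwiki.db enwiki_dic.pkl enwiki_model.pkl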
Saving Embeddings in Text Format#
The save-text command outputs a model in text format.
% wikipedia2vec save-text MODEL_FILE OUT_FILE
Arguments:
- MODEL_FILE: The model file generated by the train-embedding command
- OUT_FILE: The output file
Options: