~erutuon/enwiktionary-translations-db

Create a database of translations from the English Wiktionary dump

clone

read-only
https://git.sr.ht/~erutuon/enwiktionary-translations-db
read/write
git@git.sr.ht:~erutuon/enwiktionary-translations-db

You can also use your local clone with git send-email.

#enwiktionary-translations-db

This is the source code for a Rust program that creates a SQLite database of translations from the pages-articles.xml.bz2 file (or the decompressed pages-articles.xml file) from the English Wiktionary dump.

#Background

Translation sections are placed under English sections in the main namespace. They are supposed to be underneath a part-of-speech header like ===Noun===. Inside the section, there are translation boxes, which start with the template {{trans-top|definition here}} or {{trans-top-see|definition here}} or {{checktrans-top}} and end with {{trans-bottom}}.
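As a rough illustration of the box structure described above, here is a minimal sketch that scans wikitext line by line and collects translation boxes. The names `BoxKind`, `TransBox`, `find_translation_boxes`, and `parse_box_start` are hypothetical and not taken from this repository. The sketch assumes each opening and closing template sits on its own line and that the definition contains no nested templates or pipes; the real program may handle this quite differently.

```rust
/// Which template opened a translation box (hypothetical names for illustration).
#[derive(Debug, PartialEq)]
enum BoxKind {
    TransTop,
    TransTopSee,
    CheckTransTop,
}

/// A translation box: how it was opened, its definition if any,
/// and the raw lines between the opening template and {{trans-bottom}}.
#[derive(Debug, PartialEq)]
struct TransBox<'a> {
    kind: BoxKind,
    definition: Option<&'a str>,
    lines: Vec<&'a str>,
}

/// Recognizes an opening template that occupies a whole line.
/// Assumption: the definition has no nested templates, so splitting on '|' is safe.
fn parse_box_start(line: &str) -> Option<(BoxKind, Option<&str>)> {
    let inner = line.strip_prefix("{{")?.strip_suffix("}}")?;
    let mut parts = inner.split('|');
    let kind = match parts.next()?.trim() {
        "trans-top" => BoxKind::TransTop,
        "trans-top-see" => BoxKind::TransTopSee,
        "checktrans-top" => BoxKind::CheckTransTop,
        _ => return None,
    };
    let definition = match kind {
        // {{checktrans-top}} carries no definition.
        BoxKind::CheckTransTop => None,
        _ => parts.next().map(str::trim).filter(|s| !s.is_empty()),
    };
    Some((kind, definition))
}

/// Collects every box that is closed by {{trans-bottom}}.
/// A box that is still open when a new one starts is silently dropped in this sketch.
fn find_translation_boxes(wikitext: &str) -> Vec<TransBox<'_>> {
    let mut boxes = Vec::new();
    let mut current: Option<TransBox<'_>> = None;
    for line in wikitext.lines() {
        let trimmed = line.trim();
        if let Some((kind, definition)) = parse_box_start(trimmed) {
            current = Some(TransBox { kind, definition, lines: Vec::new() });
        } else if trimmed.starts_with("{{trans-bottom") {
            if let Some(done) = current.take() {
                boxes.push(done);
            }
        } else if let Some(open) = current.as_mut() {
            open.lines.push(line);
        }
    }
    boxes
}
```

Calling `find_translation_boxes` on the text of a translations section returns one `TransBox` per closed box, with lines outside any box ignored.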

Translations themselves are linked with {{t}}, {{t+}}, {{t-check}}, or {{t+check}}. They must contain a language code, and can contain a term, link text, a sense ID, a script code, a transliteration, a transcription, genders, and a literal translation.
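To make the template parameters concrete, the following sketch splits one translation template into positional and named parameters. `TranslationTemplate` and `parse_translation_template` are hypothetical names, not this repository's API. It assumes the usual layout of these templates on Wiktionary (first positional parameter is the language code, second is the term, the rest are genders, and the other fields are named parameters such as `tr=`), and it does not handle nested templates, links containing pipes, or explicitly numbered parameters like `1=`.

```rust
use std::collections::HashMap;

/// One parsed translation template (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct TranslationTemplate<'a> {
    /// Template name: "t", "t+", "t-check", or "t+check".
    name: &'a str,
    /// Unnamed parameters in order; assumed to be language code, term, then genders.
    positional: Vec<&'a str>,
    /// Parameters written as key=value.
    named: HashMap<&'a str, &'a str>,
}

impl<'a> TranslationTemplate<'a> {
    /// The required language code (first positional parameter).
    fn language_code(&self) -> &'a str {
        self.positional[0]
    }

    /// The optional term (second positional parameter).
    fn term(&self) -> Option<&'a str> {
        self.positional.get(1).copied().filter(|s| !s.is_empty())
    }

    /// Any remaining positional parameters, assumed to be genders.
    fn genders(&self) -> &[&'a str] {
        self.positional.get(2..).unwrap_or(&[])
    }
}

/// Parses text such as "{{t+|fr|chien|m}}".
/// Returns None for other templates or when the language code is missing.
fn parse_translation_template(text: &str) -> Option<TranslationTemplate<'_>> {
    let inner = text.trim().strip_prefix("{{")?.strip_suffix("}}")?;
    let mut parts = inner.split('|');
    let name = parts.next()?.trim();
    if !matches!(name, "t" | "t+" | "t-check" | "t+check") {
        return None;
    }
    let mut positional = Vec::new();
    let mut named = HashMap::new();
    for part in parts {
        match part.split_once('=') {
            Some((key, value)) => {
                named.insert(key.trim(), value.trim());
            }
            None => positional.push(part.trim()),
        }
    }
    // The language code is the one mandatory field.
    if positional.first().map_or(true, |code| code.is_empty()) {
        return None;
    }
    Some(TranslationTemplate { name, positional, named })
}
```

Keeping positional and named parameters separate mirrors the split between required and optional fields: only the language code is checked, and everything else is looked up on demand.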

#Database schema

A translation box is represented by the translation_box table, and a translation by the translation table. All of the fields are stored in translation, except genders, which are placed in translation_gender. translation_box has a null definition if the translation box was headed by {{checktrans-top}}, and a text definition if the translation box was headed by {{trans-top}} or {{trans-top-see}}. {{trans-top-see}} has a second parameter that links to a section of another entry; this parameter isn't included in the database. translation_box has a null part of speech if the translation box did not have a valid part of speech header in the header hierarchy above it.
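One way to picture the three tables is as Rust structs in which every nullable column becomes an `Option`. This is only a sketch: the struct names follow the table names above, but the field names, the integer keys, and the `box_definition` helper are assumptions made for illustration and may not match the actual columns.

```rust
/// Row of the translation_box table (field names are assumed).
#[derive(Debug, Clone, PartialEq)]
struct TranslationBox {
    id: i64,
    /// None when no valid part-of-speech header was found above the box.
    part_of_speech: Option<String>,
    /// None when the box was headed by {{checktrans-top}}.
    definition: Option<String>,
}

/// Row of the translation table (field names are assumed).
#[derive(Debug, Clone, PartialEq)]
struct Translation {
    id: i64,
    translation_box_id: i64,
    /// The only required field of a translation template.
    language_code: String,
    term: Option<String>,
    link_text: Option<String>,
    sense_id: Option<String>,
    script_code: Option<String>,
    transliteration: Option<String>,
    transcription: Option<String>,
    literal_translation: Option<String>,
}

/// Row of the translation_gender table: one row per gender of a translation.
#[derive(Debug, Clone, PartialEq)]
struct TranslationGender {
    translation_id: i64,
    gender: String,
}

/// Applies the definition rule described above.
/// Assumption: a {{trans-top}} without a first parameter also yields None.
fn box_definition(header_template: &str, first_param: Option<&str>) -> Option<String> {
    match header_template {
        "trans-top" | "trans-top-see" => first_param.map(str::to_owned),
        // {{checktrans-top}} and anything unrecognized have no definition.
        _ => None,
    }
}
```

Putting genders in their own struct reflects that a single translation can have several of them, which is presumably why they live in a separate table.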

#Usage

cargo run --release -- parse-dump --dump-path path/to/pages-articles.xml[.bz2]

Last I tried, the program ran in about 20 minutes on my computer and used a maximum of 23 MB of memory.

#To do

  • Add automatically generated transliterations. This requires a Lua environment that can run Module:languages as well as any transliteration modules used in the translations section. The modules and their dependencies should be extracted from the same dump that the translations are to be extracted from. This requires another subcommand.
  • Add entry names. This requires a Lua environment.
  • Add automatically detected Wiktionary script codes (detected in a language-specific or cross-linguistic way). This requires a Lua environment.