This is the source code for a Rust program that creates a SQLite database of translations from the pages-articles.xml.bz2 file (or the decompressed pages-articles.xml file) from the English Wiktionary dump.
Translation sections are placed under English sections in the main namespace.
They are supposed to be underneath a part-of-speech header like ===Noun===.
Inside the section, there are translation boxes, which start with the template {{trans-top|definition here}}, {{trans-top-see|definition here}}, or {{checktrans-top}} and end with {{trans-bottom}}.
Translations themselves are linked with {{t}}, {{t+}}, {{t-check}}, or {{t+check}}. They must contain a language code, and can contain a term, link text, a sense ID, a script code, a transliteration, a transcription, genders, and a literal translation.
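To make the shape of a translation template concrete, here is a minimal sketch of pulling the fields out of one. It is not the parser this program uses. It assumes the first positional parameter is the language code, the second is the term, any further positional parameters are genders, and everything of the form name=value is a named parameter. The struct, its field names, and the function name are made up for illustration, and the naive split on | will misread templates that contain nested templates or piped links.

```rust
use std::collections::HashMap;

/// One parsed translation template. Names here are illustrative only.
#[derive(Debug, Default, PartialEq)]
struct Translation {
    language_code: String,
    term: Option<String>,
    genders: Vec<String>,
    /// Named parameters (for example `tr` or `sc`), kept as raw strings.
    named: HashMap<String, String>,
}

/// Parses text like `{{t+|fr|mot|m}}`. Returns `None` if the text is not one
/// of the four translation templates or if the language code is missing.
fn parse_translation(text: &str) -> Option<Translation> {
    let inner = text.trim().strip_prefix("{{")?.strip_suffix("}}")?;
    let mut parts = inner.split('|');
    let name = parts.next()?.trim();
    if !matches!(name, "t" | "t+" | "t-check" | "t+check") {
        return None;
    }

    let mut positional = Vec::new();
    let mut named = HashMap::new();
    for part in parts {
        match part.split_once('=') {
            // Assumption: any `=` marks a named parameter.
            Some((key, value)) => {
                named.insert(key.trim().to_string(), value.trim().to_string());
            }
            None => positional.push(part.trim().to_string()),
        }
    }

    let mut positional = positional.into_iter();
    // The language code is the only required field.
    let language_code = positional.next().filter(|code| !code.is_empty())?;
    let term = positional.next().filter(|term| !term.is_empty());
    let genders = positional.filter(|gender| !gender.is_empty()).collect();

    Some(Translation { language_code, term, genders, named })
}
```

Calling parse_translation on a template with no parameters, such as {{t}}, gives None, which matches the rule that the language code is required.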
A translation box is represented by the translation_box table, and a translation by the translation table. All of the fields are included in translation, except that genders are placed in translation_gender.
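As a rough picture of why genders get their own table: a translation can have several genders, so each one becomes a separate row pointing back at its translation. The sketch below is only an illustration. The row types, the id fields, and the to_rows function are hypothetical and do not reflect the actual column names in the database.

```rust
/// Hypothetical in-memory form of one translation before it is stored.
struct Translation {
    language_code: String,
    term: Option<String>,
    genders: Vec<String>,
}

/// Hypothetical row for the translation table (only a few fields shown).
struct TranslationRow {
    id: i64,
    box_id: i64,
    language_code: String,
    term: Option<String>,
}

/// Hypothetical row for the translation_gender table.
struct GenderRow {
    translation_id: i64,
    gender: String,
}

/// Splits one translation into a single translation row plus one gender row
/// per gender, each referring back to the translation by id.
fn to_rows(id: i64, box_id: i64, translation: &Translation) -> (TranslationRow, Vec<GenderRow>) {
    let row = TranslationRow {
        id,
        box_id,
        language_code: translation.language_code.clone(),
        term: translation.term.clone(),
    };
    let genders = translation
        .genders
        .iter()
        .map(|gender| GenderRow { translation_id: id, gender: gender.clone() })
        .collect();
    (row, genders)
}
```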
translation_box has a null definition if the translation box was headed by {{checktrans-top}}, and a text definition if the translation box was headed by {{trans-top}} or {{trans-top-see}}.
{{trans-top-see}} has a second parameter that links to a section of another entry, which isn't included in the database.
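The rules for box definitions can be sketched as a small line classifier. This is a simplified illustration, not the program's actual code: the BoxMarker enum and classify_line function are invented names, the sketch assumes each template sits alone on its line, and treating a {{trans-top}} with no parameter as an empty text definition is my assumption.

```rust
/// How a line relates to a translation box. Names are illustrative.
#[derive(Debug, PartialEq)]
enum BoxMarker {
    /// Start of a box. `None` corresponds to `{{checktrans-top}}`.
    Top { definition: Option<String> },
    /// End of a box (`{{trans-bottom}}`).
    Bottom,
}

/// Classifies a single line of wikitext. Returns `None` for lines that
/// neither open nor close a translation box.
fn classify_line(line: &str) -> Option<BoxMarker> {
    let inner = line.trim().strip_prefix("{{")?.strip_suffix("}}")?;
    let mut parts = inner.split('|');
    match parts.next()?.trim() {
        "trans-top" | "trans-top-see" => {
            // Only the first parameter is kept. The second parameter of
            // trans-top-see (the link to another entry) is dropped.
            // Assumption: a missing definition becomes an empty string.
            let definition = parts.next().unwrap_or("").trim().to_string();
            Some(BoxMarker::Top { definition: Some(definition) })
        }
        "checktrans-top" => Some(BoxMarker::Top { definition: None }),
        "trans-bottom" => Some(BoxMarker::Bottom),
        _ => None,
    }
}
```

With this sketch, {{trans-top-see|a word|lexeme#Noun}} and {{trans-top|a word}} produce the same marker, since the link target is discarded.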
translation_box has a null part of speech if the translation box did not have a valid part of speech header in the header hierarchy above it.
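One way to picture the header hierarchy lookup is a stack of the headers currently open, searched from the innermost outward. The sketch below is a guess at the idea rather than the program's real logic: the list of valid part-of-speech names is a short made-up sample, and all function names are hypothetical.

```rust
/// Part-of-speech names treated as valid in this sketch.
/// This is an assumed, incomplete sample list.
const PARTS_OF_SPEECH: &[&str] = &["Noun", "Verb", "Adjective", "Adverb", "Proper noun"];

/// Parses a wikitext header line like `===Noun===` into (level, title).
fn parse_header(line: &str) -> Option<(usize, &str)> {
    let line = line.trim();
    let level = line.chars().take_while(|&c| c == '=').count();
    let closing = line.chars().rev().take_while(|&c| c == '=').count();
    if level == 0 || level != closing || line.len() <= 2 * level {
        return None;
    }
    Some((level, line[level..line.len() - level].trim()))
}

/// Pops headers at the same or a deeper level, then pushes the new header,
/// so the stack always holds the ancestors of the current position.
fn push_header<'a>(stack: &mut Vec<(usize, &'a str)>, level: usize, title: &'a str) {
    while stack.last().map_or(false, |&(open, _)| open >= level) {
        stack.pop();
    }
    stack.push((level, title));
}

/// Returns the nearest enclosing header that is a known part of speech.
fn part_of_speech<'a>(stack: &[(usize, &'a str)]) -> Option<&'a str> {
    stack
        .iter()
        .rev()
        .map(|&(_, title)| title)
        .find(|title| PARTS_OF_SPEECH.contains(title))
}

/// Returns the part of speech (or `None`) for each translation box in the text,
/// in the order the boxes appear.
fn box_parts_of_speech(wikitext: &str) -> Vec<Option<&str>> {
    let mut stack = Vec::new();
    let mut found = Vec::new();
    for line in wikitext.lines() {
        let trimmed = line.trim_start();
        if let Some((level, title)) = parse_header(line) {
            push_header(&mut stack, level, title);
        } else if trimmed.starts_with("{{trans-top") || trimmed.starts_with("{{checktrans-top") {
            found.push(part_of_speech(&stack));
        }
    }
    found
}
```

A box under ===Noun=== gets Some("Noun"), while a box whose only ancestors are headers like ===Etymology=== gets None, which is the case that ends up as a null part of speech.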
cargo run --release -- parse-dump --dump-path path/to/pages-articles.xml[.bz2]
Last I tried, the program runs in about 20 minutes on my computer and uses a maximum of 23 MB of memory.
Module:languages as well as any transliteration modules used in the translations section.
The modules and their dependencies should be extracted from the same dump
that the translations are to be extracted from.
This requires another subcommand.