tl;dr: the lem2en NMT project is now on AWS EC2 and it's alive!
After initial success processing a Lemko-to-English translation segment using the TensorFlow (#tf) neural machine translation (#NMT) sequence-to-sequence (#seq2seq) codebase, it appeared it would take literally days or even months to finish given the available hardware (a Lenovo M30 laptop: 1.7 GHz, 4 GB RAM). The solution is cloud computing, specifically Amazon Web Services (#aws). This has been accomplished.

Training Data
Eventually, we are going to feed the machine tens of thousands of perfectly aligned, premium-quality training segments generously made available by an NGO. But first, we're going to give it something very easy to digest and see how long it takes and what the bill is.

What could be simpler than "My name is X"?
#train.lem
Называм ся Параскєва.
Называм ся Петро.
Называм ся Ваньо.
Я ся называм Катерина тераз.
Я ся называм Параска.
Я называм ся Митро.
Называм ся Ярослав.
Называм ся Мария.

The English version:
#train.en
My name is Paraskiewa.
My name is Petro.
My name is Wanio.
My name is now Kateryna.
My name is Paraska.
My name is Mytro.
My name is Jarosław.
My name is Maria.

Next, it's time to create the development and testing data. The Stanford NLP group's English-Vietnamese data, pulled from the International Workshop on Spoken Language Translation (#IWSLT2015), has a train:dev:test ratio of about 98:1:1, i.e.:
Dataset | Filename | Lines | Percent
---|---|---|---
Training | train.vi | 133317 | 97.928%
Training | train.en | 133317 | 97.928%
Development | tst2012.vi | 1553 | 1.141%
Development | tst2012.en | 1553 | 1.141%
Testing | tst2013.vi | 1268 | 0.931%
Testing | tst2013.en | 1268 | 0.931%
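The split above can be reproduced mechanically. Here is a minimal sketch (function name and fractions are my own, chosen to approximate the IWSLT-style 98:1:1 ratio) of carving a parallel corpus into train/dev/test slices:

```python
# Sketch: split a list of sentence pairs into train/dev/test at roughly
# the 98:1:1 ratio shown in the table. split_corpus is a hypothetical
# helper, not part of the nmt codebase.
def split_corpus(pairs, dev_frac=0.0114, test_frac=0.0093):
    """Return (train, dev, test) slices of a list of (src, tgt) pairs."""
    n = len(pairs)
    n_dev = max(1, round(n * dev_frac))
    n_test = max(1, round(n * test_frac))
    n_train = n - n_dev - n_test
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])

pairs = [("Я ся называм Мария.", "My name is Maria.")] * 1000
train, dev, test = split_corpus(pairs)
print(len(train), len(dev), len(test))  # 980 11 9
```

With our eight-sentence toy corpus the ratio is moot, of course; one sentence each for dev and test will do.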
Development data
#dev.lem Я ся называм Ярослав.
#dev.en My name is Jarosław.
Test data
#test.lem Я ся называм Мария.
#test.en My name is Maria.
Vocab file
Surprisingly, the Stanford NLP group's English-Vietnamese vocabulary data is just the training-data words listed (no dev or test data), without any stemming or word-frequency sorting; each word form occurs once. So recreating it for Lemko will be pretty easy. Meanwhile, I have created a program, Agni, to stem and sort by frequency if need be; it's currently operational for Hungarian.

The vocab file needs to start with the unknown token "<unk>", the starting symbol "<s>", and the end-of-sentence marker "</s>".
#vocab.lem
<unk>
<s>
</s>
Называм
ся
Параскєва
.
Петро
Ваньо
Я
называм
Катерина
тераз
Параска
Митро
Ярослав
Мария

And the same for the English:
#vocab.en
<unk>
<s>
</s>
My
name
is
Paraskiewa
.
Petro
Wanio
now
Kateryna
Paraska
Mytro
Jarosław
Maria
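Building such a file is a one-pass job: collect each word form once from the (pre-tokenized) training data and prepend the three special tokens. A minimal sketch, assuming the helper name and sample lines are mine:

```python
# Sketch of generating a vocab file in the format described above:
# training-data word forms listed once, no stemming, no frequency
# sorting, prefixed with the three special tokens. build_vocab is a
# hypothetical helper, not part of the nmt codebase.
def build_vocab(train_lines):
    seen = []
    for line in train_lines:
        for token in line.split():
            if token not in seen:   # keep first occurrence only
                seen.append(token)
    return ["<unk>", "<s>", "</s>"] + seen

# Two pre-tokenized training lines (punctuation split off, as in the
# vocab file above, where "." is its own entry).
train_lem = [
    "Называм ся Параскєва .",
    "Я ся называм Катерина тераз .",
]
vocab = build_vocab(train_lem)
print("\n".join(vocab))
```

Writing the result to /tmp/nmt_data/vocab.lem, one token per line, gives the format the trainer expects.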
Output
Now, it says, "Perform external evaluation." But that will have to wait until next week ;-)

For all you language hackers out there, here's the secret sauce:
tmux
source activate tensorflow_p36
git clone https://github.com/tensorflow/nmt/
wget https://raw.githubusercontent.com/pgleasonjr/nmt/master/scripts/download_lemko.sh
sudo chmod +x download_lemko.sh
./download_lemko.sh /tmp/nmt_data
vim nmt/utils/misc_utils.py   (fix line 34; ZZ)
mkdir /tmp/nmt_model
cd nmt
python -m nmt.nmt \
    --src=lem --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/dev \
    --test_prefix=/tmp/nmt_data/dev \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu

tmux pane splits:
C-a "   split vertically (top/bottom)
C-a %   split horizontally (left/right)
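One sanity check worth running before the training command: every parallel file pair must have the same number of lines, or source and target sentences will be misaligned. A quick sketch (the helper and demo files are mine, standing in for the real /tmp/nmt_data contents):

```python
# Hypothetical pre-flight check: a parallel pair like train.lem and
# train.en must have equal line counts before training starts.
import os
import tempfile

def same_line_count(path_a, path_b):
    with open(path_a, encoding="utf-8") as fa, \
         open(path_b, encoding="utf-8") as fb:
        return sum(1 for _ in fa) == sum(1 for _ in fb)

# Demo with throwaway files standing in for the real data directory.
d = tempfile.mkdtemp()
lem = os.path.join(d, "train.lem")
en = os.path.join(d, "train.en")
with open(lem, "w", encoding="utf-8") as f:
    f.write("Называм ся Петро .\nНазывам ся Мария .\n")
with open(en, "w", encoding="utf-8") as f:
    f.write("My name is Petro .\nMy name is Maria .\n")
print(same_line_count(lem, en))  # True
```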