Summer Project Done in 48 Hours Flat!

I have discovered a way to replace all undesired translations of a word with a desired one. Coding it took me less than two light days, not all summer. Without further ado:

glossary.py

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# The above encoding declaration is required and the file must be saved as UTF-8

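The declaration tells Python 2 that the source file is encoded in UTF-8 (Unicode Transformation Format, 8-bit), which is what lets us put Cyrillic string literals in the code. Moving forward, it would be better to switch to Python 3, where UTF-8 is the default source encoding and strings are Unicode by default.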

glossary.py (continued)

glossary = {}
source_term = "свет"
#please use dictionary form
desired_translation = "light"
glossary.update({source_term:desired_translation})

This creates a new Python dictionary with the source_term as the key and the desired_translation as the value. In this example, we are using the Russian source_term свет and the desired_translation 'light'. Other, undesired translations include 'world' and 'shine'. This will become important later: translating свет as "world" where "light" is meant produces nonsense, and Google Translate still does it!
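For comparison, here is roughly how the same glossary might look in Python 3, where strings are Unicode by default and no encoding declaration is needed. This is only a sketch, not the code used in this project, and the `lookup` helper is a hypothetical name added for illustration.

```python
# Python 3 sketch: str is Unicode, so Cyrillic keys need no special handling.
glossary = {}
source_term = "свет"  # please use dictionary form
desired_translation = "light"
glossary[source_term] = desired_translation


def lookup(term, default=None):
    """Hypothetical helper: return the desired translation for a source term."""
    return glossary.get(term, default)


print(lookup("свет"))  # light
```

Terms that are not in the glossary come back as `None` (or whatever `default` is passed), so a caller can tell "no entry" apart from a real translation.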

Next, we have the file that does the actual replacing.

replacer.py

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# The above encoding declaration is required and the file must be saved as UTF-8

Again, until we upgrade to Python 3, some fancy footwork is needed to accommodate non-Latin text.
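To give an idea of how much of that footwork goes away after the upgrade, here is a small Python 3 sketch of writing and reading Cyrillic text with the built-in `open` and an explicit encoding. The temporary directory and the sample sentence are assumptions made so the snippet can run on its own.

```python
import os
import tempfile

# Sample text mixing Cyrillic and Latin characters (made up for this sketch).
text = "свет means light"

# Write to a throwaway location so no real file is touched.
path = os.path.join(tempfile.mkdtemp(), "target.text")

# In Python 3, open() takes an encoding argument directly; no codecs module needed.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()

print(round_trip == text)  # True
```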

replacer.py (continued)

# fetch desired translation from glossary and make case insensitive
from glossary import desired_translation
desired_translation_capitalized = desired_translation.capitalize()
desired_translation_allcaps = desired_translation.upper()
desired_translation_lowercase = desired_translation.lower()

First, we import desired_translation from the glossary.py module shown above. Then, we save sentence-case, all-caps, and lowercase versions of it for later use.
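As a quick illustration of what those three string methods produce, here is a standalone Python 3 snippet. The second example phrase is made up; it is there to show that `capitalize()` lowercases everything after the first character, which matters for multi-word translations.

```python
desired_translation = "light"

variants = {
    "capitalized": desired_translation.capitalize(),
    "allcaps": desired_translation.upper(),
    "lowercase": desired_translation.lower(),
}

for name, value in variants.items():
    print(name, value)
# capitalized Light
# allcaps LIGHT
# lowercase light

# capitalize() also lowercases the rest of the string:
print("New York".capitalize())  # New york
```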

Next, we use the crawler I created from scratch in under a day last week to fetch all the possible translations of the Russian source_term:

replacer.py (continued)

# crawl all translations from an online dictionary
from glossary import source_term
from scrapers import multitran
translations = multitran(source_term)

In the above example, all translations of Russian свет are fetched from Multitran (Мультитран), a popular Russian-English dictionary, using my handcrafted crawler code.
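The scrapers module itself is not shown in this post. For anyone who wants to try the remaining steps offline, a stand-in with the same shape might look like the Python 3 sketch below. Everything in it is an assumption: the `Translation` type, the `fake_multitran` name, and the hard-coded word list. The only thing taken from the post is that each result exposes a `.text` attribute.

```python
from collections import namedtuple

# Hypothetical result type: the real crawler's objects are only known to have .text
Translation = namedtuple("Translation", ["text"])


def fake_multitran(term):
    """Hypothetical offline stand-in for the crawler; returns canned results."""
    canned = {"свет": ["light", "world", "shine"]}
    return [Translation(text=word) for word in canned.get(term, [])]


translations = fake_multitran("свет")
print([t.text for t in translations])  # ['light', 'world', 'shine']
```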

Next, we remove the desired_translation from our list of translations to be replaced, leaving only the undesired_translations on the list. We also remove the all-caps, sentence-case, and lowercase versions for good measure:

replacer.py (continued)

# remove desired translation from list of words to be replaced, regardless of capitalization
# (build a new list: removing items from a list while looping over it skips elements)
translations = [translation for translation in translations
                if translation.text.lower() != desired_translation_lowercase]

Again, we don't want any capitalized versions slipping through.
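One thing to watch for in this step: removing items from a list while looping over that same list can skip elements, so building a new, filtered list is the safer pattern. The Python 3 sketch below shows both the filter and the pitfall. The `Translation` type and the sample words are assumptions made for the demo.

```python
from collections import namedtuple

# Hypothetical stand-in for the crawler's result objects (they expose .text).
Translation = namedtuple("Translation", ["text"])

desired_translation = "light"
translations = [Translation(t) for t in ["Light", "light", "world", "LIGHT", "shine"]]

# Safe: compare lowercased text once and build a new list.
undesired = [t for t in translations if t.text.lower() != desired_translation.lower()]
print([t.text for t in undesired])  # ['world', 'shine']

# Pitfall: removing while iterating shifts the list under the loop.
items = ["Light", "light", "world"]
for item in items:
    if item.lower() == "light":
        items.remove(item)
print(items)  # ['light', 'world'] -- the second match was skipped
```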

And, finally, we just do it:

replacer.py (continued)

import codecs

# replace undesired translations, while preserving capitalization and final -(')s
with codecs.open('target.text', encoding='utf-8') as infile:
    filedata = infile.read()

for undesired_translation in translations:
    # skip anything that is contained in the desired translation itself
    if undesired_translation.text not in desired_translation:
        print undesired_translation.text
        filedata = filedata.replace(undesired_translation.text.capitalize(), desired_translation_capitalized)
        filedata = filedata.replace(undesired_translation.text.upper(), desired_translation_allcaps)
        filedata = filedata.replace(undesired_translation.text.lower(), desired_translation_lowercase)

# codecs.open encodes to UTF-8 on write and the with block closes the file
with codecs.open('processed.text', 'w', encoding='utf-8') as outfile:
    outfile.write(filedata)
# all done!
# Copyright 2017 Peter Charles Gleason

We open target.text as a UTF-8 encoded file, loop through all the undesired_translations, and replace each one with the desired_translation, whether it's capitalized, lowercase, or all caps. We write the result to processed.text, close it, and we're #all done!
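One caveat with plain `replace`: it works on substrings, so an undesired translation such as "world" would also be rewritten inside "worldwide". A possible refinement, sketched here in Python 3, uses a word-boundary regular expression and picks the casing from the matched text. The function name `replace_preserving_case` is hypothetical, and this is an alternative technique rather than the code used in this project.

```python
import re


def replace_preserving_case(text, undesired, desired):
    """Hypothetical helper: whole-word replace that mirrors the matched casing."""
    # Match the word on its own, optionally followed by a final s or 's,
    # which the lookahead leaves in place.
    pattern = re.compile(
        r"\b" + re.escape(undesired) + r"(?=(?:'?s)?\b)", re.IGNORECASE
    )

    def match_case(match):
        word = match.group(0)
        if word.isupper():
            return desired.upper()
        if word[0].isupper():
            return desired.capitalize()
        return desired.lower()

    return pattern.sub(match_case, text)


sample = "The WORLD is bright; worldwide, the world's worlds shine."
print(replace_preserving_case(sample, "world", "light"))
# The LIGHT is bright; worldwide, the light's lights shine.
```

Running it once per undesired translation would give the same loop shape as before, with "worldwide" and similar words left alone.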

Behold:

Pete's Slavic Artificial Intelligence Neural Machine Translation NLP Lab
