
I have discovered a way to replace all undesired translations of a word with a desired one. Coding it took me less than two light days, not all summer. Without further ado:
glossary.py
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# The above encoding declaration is required and the file must be saved as UTF-8
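The declaration tells Python 2 to read the source file as UTF-8 (Unicode Transformation Format, 8-bit), which is what lets Cyrillic string literals appear directly in the code. Moving forward, it would be good to switch to Python 3, which assumes UTF-8 source files by default and treats every string as Unicode.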
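To make the characters-versus-bytes distinction concrete, here is a small sketch written for Python 3 (an assumption on my part, since the listings in this post target Python 2). It only shows how the same word looks as text and as UTF-8 bytes; it is not part of glossary.py.

```python
# Python 3 sketch: one Russian word seen as text and as UTF-8 bytes.
term = "свет"                    # a str holds Unicode characters
encoded = term.encode("utf-8")   # bytes, as they are stored in the file

print(len(term))      # number of characters
print(len(encoded))   # number of bytes (each Cyrillic letter takes two)

# Decoding the bytes gives back the original text.
decoded = encoded.decode("utf-8")
print(decoded == term)
```

The word is four characters long but eight bytes on disk, which is why a Python 2 script needs to be told which encoding to expect.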
glossary.py (continued)
glossary = {}
source_term = "свет"
# please use the dictionary form (lemma) of the word
desired_translation = "light"
glossary.update({source_term:desired_translation})
This creates a new Python dictionary with the source_term as the key and the desired_translation as the value. In this example, the source_term is the Russian свет and the desired_translation is 'light'. Other, undesired translations include 'world' and 'shine'. This will become important later: rendering свет as "world" where "light" is meant produces nonsense, and Google Translate still does it!
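Since the glossary is an ordinary dictionary, it could hold more than one term. Here is a minimal Python 3 sketch of that idea; the second entry and the lookup helper are my own illustrative assumptions and do not appear in glossary.py.

```python
# Sketch: a glossary with several entries and a safe lookup.
glossary = {}
glossary["свет"] = "light"   # the pair used in this post
glossary["мир"] = "peace"    # hypothetical extra entry, for illustration only

def lookup(term, default=None):
    """Return the desired translation for a dictionary-form source term.

    Falls back to `default` when the term has no glossary entry.
    """
    return glossary.get(term, default)

print(lookup("свет"))
print(lookup("тьма", "<no entry>"))
```

Using dict.get means an unknown term returns a fallback value rather than raising KeyError, which seems the friendlier behaviour for a glossary that grows over time.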
Next, we have the file that does the actual replacing.
replacer.py
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# The above encoding declaration is required and the file must be saved as UTF-8
Again, until we upgrade to Python 3, some fancy footwork will be needed to accommodate non-Latin text.
replacer.py
# fetch desired translation from glossary and make case insensitive
from glossary import desired_translation
desired_translation_capitalized = desired_translation.capitalize()
desired_translation_allcaps = desired_translation.upper()
desired_translation_lowercase = desired_translation.lower()
First, we import the desired_translation from the glossary.py module showcased above. Then, we save sentence-case, all-caps, and lowercase versions of it for later use.
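The three case variants can also be produced by one small helper. This is a Python 3 sketch, and the function name case_variants is my own invention rather than anything in replacer.py.

```python
# Sketch: build the three case variants of a word in one place.
def case_variants(word):
    """Return the sentence-case, all-caps and lowercase forms of `word`."""
    return {
        "capitalized": word.capitalize(),  # first letter upper, rest lower
        "allcaps": word.upper(),
        "lowercase": word.lower(),
    }

variants = case_variants("light")
print(variants)
```

Note that str.capitalize() also lowercases everything after the first letter, so a mixed-case input such as "LiGHT" comes out as "Light".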
Next, we use the crawler I created from scratch in under a day last week to fetch all the possible translations of the Russian source_term:
replacer.py
# crawl all translations from an online dictionary
from glossary import source_term
from scrapers import multitran
translations = multitran(source_term)
In the above example, all translations of Russian свет are fetched from Multitran (Мультитран), a popular Russian-English dictionary, using my handcrafted crawler code.
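The rest of replacer.py only relies on each fetched translation having a .text attribute. For trying things out offline, a stand-in along the following lines might do. It is a hypothetical stub in Python 3, not my crawler: the names Translation and fake_multitran are made up here, and the canned list contains only the three translations mentioned earlier in this post.

```python
from collections import namedtuple

# Hypothetical stand-in for the crawler's results: the only thing the
# replacing code needs from each item is a .text attribute.
Translation = namedtuple("Translation", ["text"])

def fake_multitran(source_term):
    """Offline stub: return canned translations instead of crawling."""
    canned = {"свет": ["light", "world", "shine"]}
    return [Translation(text) for text in canned.get(source_term, [])]

translations = fake_multitran("свет")
print([t.text for t in translations])
```

A stub like this makes it possible to test the filtering and replacing steps without hitting the dictionary site on every run.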
Next, we remove the desired_translation from our list of translations to be replaced, leaving only the undesired_translations on the list. We also remove the all-caps, sentence-case, and lowercase versions for good measure:
replacer.py
# remove desired translation from list of words to be replaced, regardless of capitalization
# (build a new list rather than calling remove() inside the loop, which skips items)
translations = [translation for translation in translations
                if translation.text.lower() != desired_translation.lower()]
Again, we don't want any capitalized versions slipping through.
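One pitfall worth spelling out when filtering a list in Python: removing items from a list while looping over it makes the loop skip the item right after each removal. The Python 3 sketch below uses plain strings (an assumption made for brevity; the real items carry a .text attribute) to compare that approach with building a new list.

```python
# Sketch: why filtering with remove() inside a loop can miss items.
candidates = ["Light", "light", "world", "shine"]
desired = "light"

# Approach 1: remove while iterating. After "Light" is removed, the
# remaining items shift left and the loop steps over "light".
buggy = list(candidates)
for item in buggy:
    if item.lower() == desired:
        buggy.remove(item)

# Approach 2: build a new list, comparing in lowercase only.
filtered = [item for item in candidates if item.lower() != desired]

print(buggy)
print(filtered)
```

The first list still contains "light", while the second holds only the undesired translations. Lowercasing both sides of the comparison is enough to catch every capitalization.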
And, finally, we just do it:
replacer.py
import codecs
#replace undesired translations, while preserving capitalization and final -(')s
with codecs.open('target.text', encoding='utf-8') as file:
    filedata = file.read()
for undesired_translation in translations:
    # skip anything that is merely a substring of the desired translation
    if undesired_translation.text not in desired_translation:
        print undesired_translation.text
        filedata = filedata.replace(undesired_translation.text.capitalize(), desired_translation_capitalized)
        filedata = filedata.replace(undesired_translation.text.upper(), desired_translation_allcaps)
        filedata = filedata.replace(undesired_translation.text.lower(), desired_translation_lowercase)
file = open("processed.text", "w")
file.write(filedata.encode('utf-8'))
file.close()
# all done!
# Copyright 2017 Peter Charles Gleason
We open target.text as a UTF-8 file and run a loop that goes through all the undesired_translations, replacing each with the desired_translation in the matching case: capitalized, all-caps, or lowercase. We then write the result to processed.text, close the file, and we're #all done!
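Plain string replacement also touches words that merely contain an undesired translation, such as "worldwide". A possible refinement, sketched here in Python 3 with the standard re module, is to match whole words only while still keeping capitalization and a final -s or -'s. The helper names match_case and replace_whole_words are my own and are not part of replacer.py.

```python
import re

def match_case(template, word):
    """Return `word` cased like `template`: all caps, capitalized, or lower."""
    if template.isupper():
        return word.upper()
    if template[:1].isupper():
        return word.capitalize()
    return word.lower()

def replace_whole_words(text, undesired, desired):
    """Replace `undesired` with `desired` only where it is a whole word.

    An optional plural "s" is captured and kept; a following "'s" is left
    untouched because the apostrophe already ends the word.
    """
    pattern = re.compile(r"\b(" + re.escape(undesired) + r")(s?)\b",
                         re.IGNORECASE)

    def swap(match):
        return match_case(match.group(1), desired) + match.group(2)

    return pattern.sub(swap, text)

sample = "The World is dark. Two worlds, a worldwide WORLD, the world's end."
result = replace_whole_words(sample, "world", "light")
print(result)
# The Light is dark. Two lights, a worldwide LIGHT, the light's end.
```

The word boundary \b keeps "worldwide" intact, and the case of each match decides the case of its replacement, so there is no need for three separate passes.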
Behold: