I have discovered a way to replace all undesired translations of a word with a desired one. Coding it took me less than two light days, not all summer. Without further ado:
glossary.py
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# The above encoding declaration is required and the file must be saved as UTF-8
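The declaration tells Python 2 to read the source file as UTF-8 (Unicode Transformation Format, 8-bit), which is what lets us put Cyrillic text directly in the code. Python 3 assumes UTF-8 source files by default, so moving to it would make this step unnecessary.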
glossary.py (continued)
glossary = {}
source_term = "свет"  # please use dictionary form
desired_translation = "light"
glossary.update({source_term: desired_translation})
This creates a new Python dictionary with the source_term as the key and the desired_translation as the value. In this example, we are using the Russian source_term свет and the desired_translation 'light'. Other, undesired translations include 'world' and 'shine'. (This will become important later: rendering свет as "world" where "light" is meant creates nonsense, and Google Translate still does it!)
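As a rough sketch of where this could go, the same dictionary can hold more than one term and be queried by source term. The second entry and the helper name `desired_for` below are made up for illustration and are not part of glossary.py; the `u""` prefixes keep the snippet working under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: a glossary holding more than one term.
glossary = {
    u"свет": u"light",
    u"мир": u"peace",  # hypothetical extra entry, not from the original glossary.py
}


def desired_for(term, default=None):
    """Return the preferred translation for a dictionary-form term, if any."""
    return glossary.get(term, default)


print(desired_for(u"свет"))  # light
print(desired_for(u"тьма"))  # None (term not in the glossary)
```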
Next, we have the file that does the actual replacing.
replacer.py
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# The above encoding declaration is required and the file must be saved as UTF-8
Again, until we upgrade to Python 3, some fancy footwork will be needed to accommodate non-Latin text.
replacer.py
# fetch desired translation from glossary and make case insensitive
from glossary import desired_translation
desired_translation_capitalized = desired_translation.capitalize()
desired_translation_allcaps = desired_translation.upper()
desired_translation_lowercase = desired_translation.lower()
First, we import the desired_translation from the glossary.py program showcased above. Then, we save all-caps, sentence-case, and lowercase versions of it for later use.
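The three variants can also be produced by one small helper. This is just a sketch; the function name `case_variants` is an assumption and does not appear in replacer.py. Note that `str.capitalize()` lowercases everything after the first character, which matters for multi-word or proper-noun translations.

```python
def case_variants(word):
    """Return (capitalized, all-caps, lowercase) forms of a word."""
    return word.capitalize(), word.upper(), word.lower()


print("%s %s %s" % case_variants("light"))  # Light LIGHT light

# capitalize() also lowercases the rest of the string:
print("New York".capitalize())  # New york
```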
Next, we use the crawler I created from scratch in under a day last week to fetch all the possible translations of the Russian source_term:
replacer.py
# crawl all translations from an online dictionary
from glossary import source_term
from scrapers import multitran
translations = multitran(source_term)
In the above example, all translations of Russian свет are fetched from Multitran (Мультитран), a popular Russian-English dictionary, using my handcrafted crawler code.
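The crawler itself is not shown here, but the rest of the script only relies on each result having a `.text` attribute. For trying things out offline, a stand-in along these lines would do. Everything in it (`Translation`, `fake_multitran`, the canned list) is hypothetical and is not the real `scrapers.multitran`; the three sample translations are the ones mentioned earlier in this post.

```python
# -*- coding: utf-8 -*-
from collections import namedtuple

# Hypothetical stand-in for the objects the crawler returns; the only
# property the rest of replacer.py uses is a `.text` attribute.
Translation = namedtuple("Translation", ["text"])


def fake_multitran(term):
    """Offline substitute for the crawler (an assumption, not the real code)."""
    canned = {u"свет": [u"light", u"world", u"shine"]}
    return [Translation(text=t) for t in canned.get(term, [])]


translations = fake_multitran(u"свет")
print([t.text for t in translations])
```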
Next, we remove the desired_translation from our list of translations to be replaced, leaving only the undesired_translations on the list. We also remove the all-caps, sentence-case, and lowercase versions for good measure:
replacer.py
# remove desired translation from list of words to be replaced, regardless of capitalization
# (compare lowercased text, and build a new list rather than removing while iterating)
translations = [translation for translation in translations
                if translation.text.lower() != desired_translation.lower()]
Again, we don't want any capitalized versions slipping through.
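One pitfall with this kind of filtering is calling `remove()` on a list while looping over that same list: the items shift left and the element right after each removed one is never examined. The sketch below uses plain strings instead of crawler objects (an assumption made to keep it self-contained) to show the difference.

```python
desired = "light"
candidates = ["Light", "light", "world", "shine"]

# Removing inside the loop shifts later items left, so the element right
# after each removed one is skipped.
buggy = list(candidates)
for word in buggy:
    if word.lower() == desired:
        buggy.remove(word)

# Building a new list inspects every element exactly once.
safe = [word for word in candidates if word.lower() != desired]

print(buggy)  # ['light', 'world', 'shine']
print(safe)   # ['world', 'shine']
```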
And, finally, we just do it:
replacer.py
import codecs

# replace undesired translations, while preserving capitalization and final -(')s
with codecs.open('target.text', encoding='utf-8') as file:
    filedata = file.read()

for undesired_translation in translations:
    if undesired_translation.text not in desired_translation:
        print undesired_translation.text
        filedata = filedata.replace(undesired_translation.text.capitalize(), desired_translation_capitalized)
        filedata = filedata.replace(undesired_translation.text.upper(), desired_translation_allcaps)
        filedata = filedata.replace(undesired_translation.text.lower(), desired_translation_lowercase)

file = open("processed.text", "w")
file.write(filedata.encode('utf-8'))
file.close()
# all done!
# Copyright 2017 Peter Charles Gleason

We open target.text as a file with UTF-8 encoding, run a loop that goes through all the undesired_translations and replaces each of them with the desired_translation, whether it's capitalized, lowercase, or all-caps. We write the result to processed.text, close the file, and we're #all done!
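Plain `str.replace` also rewrites the undesired word when it sits inside a longer one ("worldwide" would become "lightwide"). A possible alternative, sketched here and not the method used above, is a regular expression with word boundaries that still lets a final -s or -'s through. The function name `replace_term` and the sample sentence are assumptions for illustration.

```python
# -*- coding: utf-8 -*-
import re


def replace_term(text, undesired, desired):
    """Replace whole-word matches of `undesired`, copying their case pattern."""
    # \b anchors the start of the word; the lookahead accepts the bare word,
    # a plural -s, or a possessive 's, but not a longer word.
    pattern = re.compile(r"\b%s(?=s?\b)" % re.escape(undesired),
                         re.IGNORECASE | re.UNICODE)

    def match_case(match):
        found = match.group(0)
        if found.isupper():
            return desired.upper()
        if found[0].isupper():
            return desired.capitalize()
        return desired.lower()

    return pattern.sub(match_case, text)


sample = u"World without end. The WORLD's worlds are worldwide."
print(replace_term(sample, u"world", u"light"))
# Light without end. The LIGHT's lights are worldwide.
```

The trade-off is that irregular plurals and other inflected forms are left alone, so they would still need their own glossary entries.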
Behold: