I was able to build a translation scraper from scratch in less than a day. It uses Requests to download a URL and Beautiful Soup to parse the HTML, with Beautiful Soup's select function finding all translations on a Multitran (Мультитран) entry page. It wasn't the easiest project, because Multitran still uses 1990s-style tables without much markup to hook onto. But I found a way in by carefully studying the hrefs of the anchor tags on the translation entries. So that's our secret sauce.
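To show what that secret sauce looks like in isolation, here's a minimal sketch run against a tiny, made-up HTML fragment (the markup below is an assumption imitating Multitran's table style, not a real page): translation links all carry "m.exe?t=" in their href, while navigation links don't, so a substring attribute selector picks out only the translations.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating Multitran's table markup: one navigation
# link ("m.exe?a=...") and two translation links ("m.exe?t=...").
sample = """
<table>
  <tr><td><a href="m.exe?a=110">settings</a></td></tr>
  <tr><td><a href="m.exe?t=1234_2_1">spacecraft</a></td>
      <td><a href="m.exe?t=5678_2_1">space vehicle</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample, 'html.parser')
# The substring selector matches only anchors whose href contains "m.exe?t="
translations = [a.text for a in soup.select('a[href*="m.exe?t="]')]
print(translations)  # → ['spacecraft', 'space vehicle']
```

Note that the attribute value is quoted inside the selector; unquoted `?` and `=` characters would trip up the CSS parser.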
For now, it outputs the translations to the console; later, I'll collect them into a list so I can replace undesired translations of Russian terms with the desired ones.
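The planned replacement step could be sketched like this (a rough idea only; the mapping and the scraped list below are hypothetical placeholders, not real Multitran data):

```python
# Hypothetical mapping from undesired renderings to preferred ones.
preferred = {'space vehicle': 'spacecraft'}

# Stand-in for the list of translations the scraper would collect.
scraped = ['spacecraft', 'space vehicle', 'cosmic aircraft']

# Swap each undesired translation for its preferred form, leaving others as-is.
cleaned = [preferred.get(t, t) for t in scraped]
print(cleaned)  # → ['spacecraft', 'spacecraft', 'cosmic aircraft']
```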
import requests
from bs4 import BeautifulSoup

# Edit the URL manually until an import function is developed.
url = 'http://www.multitran.ru/c/m.exe?l1=2&l2=1&s=%EA%EE%F1%EC%E8%F7%E5%F1%EA%E8%E9%20%EB%E5%F2%E0%F2%E5%EB%FC%ED%FB%E9%20%E0%EF%EF%E0%F0%E0%F2'

r = requests.get(url)
print(r.status_code)  # a status code of 200 means that everything is okay

soup = BeautifulSoup(r.content, 'html.parser')
# The secret sauce: every translation link contains "m.exe?t=" in its href.
translations = soup.select('a[href*="m.exe?t="]')

for translation in translations:
    print(translation.text)  # prints out all translations
# That's a wrap!
# Copyright Peter Charles Gleason, 2017
Onward and upward!