We can use Python to automatically analyze the sentiment of Reddit posts (i.e. sentiment analysis, a.k.a. "opinion mining," a.k.a. "emotion artificial intelligence"). This may have practical applications for cryptocurrency traders: machine learning could be used to look for correlations between price movements and sentiment. Today we are going to "scrape" the top posts from r/ethtrader, the Ethereum-trading community on the social news site Reddit, because those links have already been scored for relevance by humans via Reddit's voting system. Here's the full script (the play-by-play follows below):
import requests
from bs4 import BeautifulSoup
# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client
language_client = language.Client()

url = 'https://www.reddit.com/r/ethtrader/top/?sort=top&t=all'
# Use a fresh User-agent since Reddit rejects the default Python one
r = requests.get(url, headers={'User-agent': 'youllneverguess'})
print(r.status_code)  # a status code of 200 means that everything is okay

soup = BeautifulSoup(r.content, 'html.parser')
siteTable = soup.find("div", {"id": "siteTable"})
hits = siteTable.find_all("div", {"class": "thing"})

i = 0
for hit in hits:
    i = i + 1
    print("-------------------------------------------")
    username = hit.find("a", {"class": "author"})
    datetime = hit.find("time")['datetime']
    score = hit.find("div", {"class": "score unvoted"}).text
    for link in hit.find_all('a', href=True):
        if "https://www.reddit.com/r/" in link['href']:
            link = link['href']
            followed = requests.get(link, headers={'User-agent': 'ethscraper 1.0'})
            linksoup = BeautifulSoup(followed.content, 'html.parser')
            content = linksoup.find("div", {"class": "content"})
            paragraphs = content.find_all("p")
            text = ""
            for paragraph in paragraphs:
                text = text + " " + paragraph.text
            document = language_client.document_from_text(text)
            sentiment = document.analyze_sentiment().sentiment
            print(i, "| Username:", username.string, "| Date & Time:", datetime,
                  "| Votes:", score, "|",
                  'Sentiment: {}, Magnitude: {}'.format(sentiment.score, sentiment.magnitude))
And now for the play-by-play:
import requests
from bs4 import BeautifulSoup
# Imports the Google Cloud client library
from google.cloud import language
We start off by importing the requests library (to do the actual scraping), Beautiful Soup 4 (to parse the HTML), and the Google Cloud client library, which contains the cutting-edge Sentiment Analyzer. All three are available from PyPI via pip.
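One caveat before following along: language.Client() comes from the early 0.x releases of google-cloud-language; later versions restructured the API around LanguageServiceClient. A quick sanity check you can run (the <0.27 version pin is my assumption about where the break happened):

from google.cloud import language

# The 0.x client library exposed language.Client(); newer releases do not
if not hasattr(language, 'Client'):
    raise ImportError("This walkthrough assumes the 0.x API; "
                      "try: pip install 'google-cloud-language<0.27'")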
# Instantiates a client
language_client = language.Client()
This is necessary for the Google Cloud client library Sentiment Analyzer.
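The client only works once Google Cloud credentials are configured, so if instantiation fails with an authentication error, point the library at your service-account key first (a minimal sketch; the key path is a placeholder):

import os

# Placeholder path to the service-account JSON key downloaded from the
# Google Cloud Console; the client library reads this environment variable
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/keyfile.json'

from google.cloud import language
language_client = language.Client()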
url = 'https://www.reddit.com/r/ethtrader/top/?sort=top&t=all'
# Use a fresh User-agent since Reddit rejects the default Python one
r = requests.get(url, headers={'User-agent': 'youllneverguess'})
print(r.status_code)  # a status code of 200 means that everything is okay

soup = BeautifulSoup(r.content, 'html.parser')
siteTable = soup.find("div", {"id": "siteTable"})
hits = siteTable.find_all("div", {"class": "thing"})
This "scrapes" the target website and then parses it using Beautiful Soup. Any child divs of the siteTable div are posts on our topic, i.e. "hits". You have to dig through the target website's HTML source to find the elements to scrape against. We also have to change the User-agent name, since Reddit rejects the default Python one (they must get hit with requests using the default user agent all the time).
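Reddit also rate-limits aggressive scrapers, so before trusting the parse it is worth checking that we actually got the listing page (a defensive sketch, not part of the original script):

# Fail loudly instead of crashing later on a None lookup
if r.status_code != 200:
    # 429 here usually means Reddit is rate-limiting us; back off and retry
    raise RuntimeError('Request failed with status {}'.format(r.status_code))

siteTable = soup.find("div", {"id": "siteTable"})
if siteTable is None:
    # Reddit served something other than the listing, or the layout changed
    raise RuntimeError('siteTable div not found in the fetched page')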
i = 0
for hit in hits:
    i = i + 1
    print("-------------------------------------------")
    username = hit.find("a", {"class": "author"})
    datetime = hit.find("time")['datetime']
    score = hit.find("div", {"class": "score unvoted"}).text
Above, we scrape the name of the user who made the submission, the date and time of submission (for later comparison to the subsequent change in price), and the number of Reddit votes (so we have a human-determined score that's hard to fake).
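Two of these fields can be missing in practice: deleted accounts have no author link, and Reddit hides the scores of very fresh posts. A small guard keeps the loop from crashing (a defensive sketch under those assumptions, not part of the original script):

# find() returns None when an element is absent, so guard before dereferencing
author_tag = hit.find("a", {"class": "author"})
username = author_tag.string if author_tag is not None else '[deleted]'

score_div = hit.find("div", {"class": "score unvoted"})
try:
    votes = int(score_div.text)
except (AttributeError, ValueError):
    votes = None  # score element missing, hidden, or not a plain integer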
Wait, there's more:
for link in hit.find_all('a', href=True):
    if "https://www.reddit.com/r/" in link['href']:
        link = link['href']
        followed = requests.get(link, headers={'User-agent': 'ethscraper 1.0'})
        linksoup = BeautifulSoup(followed.content, 'html.parser')
        content = linksoup.find("div", {"class": "content"})
        paragraphs = content.find_all("p")
        text = ""
        for paragraph in paragraphs:
            text = text + " " + paragraph.text
        document = language_client.document_from_text(text)
        sentiment = document.analyze_sentiment().sentiment
        print(i, "| Username:", username.string, "| Date & Time:", datetime,
              "| Votes:", score, "|",
              'Sentiment: {}, Magnitude: {}'.format(sentiment.score, sentiment.magnitude))
Here we follow the front-page link to the discussion page and use artificial intelligence (AI) to score the attitude toward the topic at hand on a scale from -1 (negative) to 1 (positive). The AI also scores the magnitude (i.e. the overall strength of emotion) on an unbounded scale starting at zero.
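The two numbers are easier to reason about together: a near-zero score with a large magnitude usually indicates mixed emotions rather than no emotion. A small helper can turn the pair into a label (the thresholds are arbitrary assumptions of mine, not part of the Google Cloud API):

def describe_sentiment(score, magnitude):
    # Bucket the analyzer's output; thresholds chosen for illustration only
    if score > 0.25:
        return 'positive'
    if score < -0.25:
        return 'negative'
    # near-zero score: high magnitude suggests mixed feelings, low means neutral
    return 'mixed' if magnitude > 1.0 else 'neutral'

You could append describe_sentiment(sentiment.score, sentiment.magnitude) to the print statement above to make the output easier to skim.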
Voilà: one line per post, with the username, timestamp, vote count, sentiment score, and magnitude.
This data is going to be used for machine learning to determine correlations between sentiment and market price, if any.
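For that machine-learning step it helps to persist the rows rather than just printing them. A minimal sketch using Python's csv module (the filename and column layout are my own assumptions):

import csv

with open('ethtrader_sentiment.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['username', 'datetime', 'votes', 'sentiment', 'magnitude'])
    # inside the scraping loop, replace the print with:
    #     writer.writerow([username.string, datetime, score,
    #                      sentiment.score, sentiment.magnitude])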
Comments are welcome, and they're a good way to garner free publicity for your website or venture.