This tutorial guides you in setting up a system for collecting Tweets. Not in Apache Spark or Apache Flink, but just in Python + Tweepy. In many use cases, just a single computing node can collect enough Tweets to draw decent conclusions. In future blog posts, I will explain how to collect Tweets using a cluster (and with either Apache Spark or Apache Flink). But for now, lets focus on a simple Pythonic harvester! If you are interested in scraping a website, you should definitely read this article.
Tweets are extremely useful for gathering opinions of thousands of people on a particular topic over time. Sadly, Twitter has revoked access to old Tweets (however, this Python package is still capable of doing so by making use of Twitter search functionality). Therefore, many developers harvest Tweets by using Twitters Streaming API and store them on their computing nodes. If you have enough computing nodes, you could consider collecting Tweets by using a cluster and cluster software, such as Apache Spark or Apache Flink. But if you have a small scale project, one Python script will be enough. In this tutorial, we will build a small Python script for retrieving and storing Tweets from the Streaming API.
Setting up an account
The first thing we need, is an access token for accessing the Twitter API. This can simply be done by visiting apps.twitter.com. Sadly, the policy and terms of service of Twitter changes frequently, so it is hard to explain and update the sign up process on Twitter every time Twitter changes something.
Writing the code
First, make sure you have installed the Python package tweepy. Now we can start writing our code! We will fetch Tweets containing “Python”:
import tweepy import json # Specify the account credentials in the following variables: consumer_key = 'INSERT CONSUMER KEY HERE' consumer_secret = 'INSERT CONSUMER SECRET HERE' access_token = 'INSERT ACCESS TOKEN HERE' access_token_secret = 'INSERT ACCESS TOKEN SECRET HERE' # This listener will print out all Tweets it receives class PrintListener(tweepy.StreamListener): def on_data(self, data): # Decode the JSON data tweet = json.loads(data) # Print out the Tweet print('@%s: %s' % (tweet['user']['screen_name'], tweet['text'].encode('ascii', 'ignore'))) def on_error(self, status): print(status) if __name__ == '__main__': listener = PrintListener() # Show system message print('I will now print Tweets containing "Python"! ==>') # Authenticate auth = tweepy.OAuthHandler(consumer_key, consumer_secret) auth.set_access_token(access_token, access_token_secret) # Connect the stream to our listener stream = tweepy.Stream(auth, listener) stream.filter(track=['Python'])
In the code, the comments describe what the code does.
It was fairly easy to setup a Tweet harvester! You only need to sign up on Twitter and write a few lines of code using Python and Tweepy. If you have any questions or comments on this articles, please send me a comment below!