Gathering Tweets with Python

This tutorial guides you through setting up a system for collecting Tweets, not with Apache Spark or Apache Flink, but in plain Python. In many use cases, a single computing node can collect enough Tweets to draw decent conclusions. In future blog posts, I will explain how to collect Tweets using a cluster (with either Apache Spark or Apache Flink). But for now, let's focus on a simple Pythonic harvester!

Tweets are extremely useful for gathering the opinions of thousands of people on a particular topic over time. Sadly, Twitter has revoked access to old Tweets (however, this Python package is still capable of retrieving them). Therefore, many developers harvest Tweets through Twitter's Streaming API and store them on their own computing nodes. If you have enough computing nodes, you could consider collecting Tweets with a cluster and cluster software such as Apache Spark or Apache Flink. But if you have a small-scale project, one Python script will be enough. In this tutorial, we will build a small Python script for retrieving and storing Tweets from the Streaming API.

Setting up an account

The first thing we need is an access token for the Twitter API. You can get one by visiting apps.twitter.com.
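
The script in the next section uses four values from your Twitter app: a consumer key, a consumer secret, an access token and an access token secret. If you would rather not paste them into the source file, one option is to read them from environment variables. The following is only a sketch: the variable names and the helper load_credentials are hypothetical and not part of tweepy or the Twitter API.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup
CREDENTIAL_VARS = (
    'TWITTER_CONSUMER_KEY',
    'TWITTER_CONSUMER_SECRET',
    'TWITTER_ACCESS_TOKEN',
    'TWITTER_ACCESS_TOKEN_SECRET',
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    # Collect the names of all variables that are unset or empty
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}
```

The returned values could then be assigned to the credential variables of the script instead of hard-coded strings.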

Writing the code

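First, make sure you have installed the Python package tweepy. The code below uses the StreamListener class, which is part of tweepy 3.x and was removed in tweepy 4.0. Now we can start writing our code! We will fetch Tweets containing “Python”: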

import tweepy
import json

# Specify the account credentials in the following variables:
consumer_key = 'INSERT CONSUMER KEY HERE'
consumer_secret = 'INSERT CONSUMER SECRET HERE'
access_token = 'INSERT ACCESS TOKEN HERE'
access_token_secret = 'INSERT ACCESS TOKEN SECRET HERE'


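# This listener will print out all Tweets it receives
class PrintListener(tweepy.StreamListener):
    def on_data(self, data):
        # Decode the JSON data
        tweet = json.loads(data)

        # Skip stream messages that are not Tweets (e.g. delete or limit notices)
        if 'text' not in tweet or 'user' not in tweet:
            return True

        # Drop non-ASCII characters, then decode back to a string so that
        # Python 3 prints text instead of a bytes literal (b'...')
        text = tweet['text'].encode('ascii', 'ignore').decode('ascii')

        # Print out the Tweet
        print('@%s: %s' % (tweet['user']['screen_name'], text))
        return True

    def on_error(self, status):
        # Print the HTTP status code of the error
        print(status)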


if __name__ == '__main__':
    listener = PrintListener()

    # Show system message
    print('I will now print Tweets containing "Python"! ==>')

    # Authenticate
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # Connect the stream to our listener
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=['Python'])

Collecting Tweets

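The comments in the code describe each step. In short, the script authenticates with your credentials, opens a stream filtered on the keyword “Python”, and hands every incoming message to PrintListener, which decodes the JSON and prints the author's screen name followed by the Tweet text.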
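
The listener above only prints Tweets. To keep them, one simple option is to append every Tweet to a file, one JSON document per line. The following is a minimal sketch under that assumption: the function name append_tweet and the file name tweets.jsonl are hypothetical, and the check for the 'text' and 'user' keys is just a simple heuristic for telling Tweets apart from other stream messages.

```python
import json


def append_tweet(data, path='tweets.jsonl'):
    """Append one raw stream message to a JSON Lines file.

    Returns True if the message looked like a Tweet and was stored,
    False otherwise.
    """
    message = json.loads(data)

    # Assumption: only messages with both 'text' and 'user' are Tweets
    if 'text' not in message or 'user' not in message:
        return False

    # Open in append mode so earlier Tweets are kept
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(message) + '\n')
    return True
```

Calling append_tweet(data) from on_data would store each Tweet as it arrives. Reading the file back is then a matter of calling json.loads on every line.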

Conclusion

It was fairly easy to set up a Tweet harvester! If you have any questions or comments on this article, please leave a comment below!

Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: mail@kevinjacobs.nl.