Tweets.

Scrape Tweets from Twitter using Python and Tweepy

This tutorial guides you in setting up a system for collecting Tweets. Not in Apache Spark or Apache Flink, but just in Python + Tweepy. In many use cases, just a single computing node can collect enough Tweets to draw decent conclusions. In future blog posts, I will explain how to collect Tweets using a cluster (and with either Apache Spark or Apache Flink). But for now, lets focus on a simple Pythonic harvester! If you are interested in scraping a website, you should definitely read this article.

Tweets are extremely useful for gathering opinions of thousands of people on a particular topic over time. Sadly, Twitter has revoked access to old Tweets (however, this Python package is still capable of doing so by making use of Twitter search functionality).  Therefore, many developers harvest Tweets by using Twitters Streaming API and store them on their computing nodes. If you have enough computing nodes, you could consider collecting Tweets by using a cluster and cluster software, such as Apache Spark or Apache Flink. But if you have a small scale project, one Python script will be enough. In this tutorial, we will build a small Python script for retrieving and storing Tweets from the Streaming API.

Setting up an account

The first thing we need, is an access token for accessing the Twitter API. This can simply be done by visiting apps.twitter.com. Sadly, the policy and terms of service of Twitter changes frequently, so it is hard to explain and update the sign up process on Twitter every time Twitter changes something.

Writing the code

First, make sure you have installed the Python package tweepy. Now we can start writing our code! We will fetch Tweets containing “Python”:

import tweepy
import json

# Specify the account credentials in the following variables:
consumer_key = 'INSERT CONSUMER KEY HERE'
consumer_secret = 'INSERT CONSUMER SECRET HERE'
access_token = 'INSERT ACCESS TOKEN HERE'
access_token_secret = 'INSERT ACCESS TOKEN SECRET HERE'


# This listener will print out all Tweets it receives
class PrintListener(tweepy.StreamListener):
    def on_data(self, data):
        # Decode the JSON data
        tweet = json.loads(data)

        # Print out the Tweet
        print('@%s: %s' % (tweet['user']['screen_name'], tweet['text'].encode('ascii', 'ignore')))

    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    listener = PrintListener()

    # Show system message
    print('I will now print Tweets containing "Python"! ==>')

    # Authenticate
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # Connect the stream to our listener
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=['Python'])
Collecting Tweets

In the code, the comments describe what the code does.

Conclusion

It was fairly easy to setup a Tweet harvester! You only need to sign up on Twitter and write a few lines of code using Python and Tweepy. If you have any questions or comments on this articles, please send me a comment below!


Kevin Jacobs

Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: kevin8080nl@gmail.com. Are you interested in Data Science coaching in which I will answer all your Data Science related questions? Then please visit the Data Science coaching page!