Mining Twitter Data in Python using Tweepy

Introduction

In this post we will extract Twitter data using a Python library called Tweepy (for a quick tutorial on Tweepy, read this post). We will learn how to obtain Twitter credentials for API access, then set up a Twitter stream using Tweepy to fetch public tweets. After some preprocessing we will save these tweets and perform example operations, such as sentiment analysis using the TextBlob library. If you want to learn or revise the concepts behind these libraries, see the posts mentioned below:

Setup

Installation

We will be using several libraries: Tweepy, NLTK, and TextBlob. NLTK handles our basic text transformations such as tokenization and stopword removal, while TextBlob is used for sentiment analysis. Sentiment analysis is also possible with NLTK alone, but we use TextBlob because it exposes a ready-made sentiment property; the main goal of this article is not to build and train a production-grade sentiment analyser, but to understand how to extract Twitter data and process it into a form suitable for further analysis. Finally, Tweepy gives us access to the Twitter API for tweet extraction.

pip install tweepy
pip install nltk
pip install textblob

NLTK Setup

Before using NLTK we have to download various corpora and lexicons. To download these, open your terminal/cmd and enter the commands below.

$ python
>>> import nltk
>>> nltk.download()

After calling nltk.download:

  1. If you are on Windows or macOS, a window will open for downloading the various NLTK corpora and lexicons. Select “all” and click the Download button.
  2. If you are on a Linux machine, a CLI interface will most likely appear in your terminal. First type “d” to download, then enter “all” to download everything NLTK requires.

Note that NLTK’s download operation may take a long time depending on your internet connection speed; the full collection is approximately 1.3 GB.
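If you would rather not download the full 1.3 GB, a lighter option is to fetch only the resources this post actually uses. As an assumption here, the stopword list is the only strict requirement for our preprocessing, with punkt included for TextBlob’s tokenizers:

import nltk

#download only the resources used in this post
nltk.download('stopwords')
nltk.download('punkt')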

Twitter API Credentials

  1. Visit this URL – https://apps.twitter.com/.
  2. Log in to your account, if you are not already logged in.
  3. Click on Create New App.
  4. Fill in the form; if you don’t have a website, just enter any placeholder website name. After agreeing to the Terms and Conditions, click Create your Twitter application.
  5. You will be redirected to your application page. Click the Keys and Access Tokens tab.
  6. Copy the Consumer Key and Consumer Secret. These will be used in our application.
  7. Click the Generate Access Token button, then copy your Access Token and Access Token Secret. These values will also be used in our application.

Imports

import pickle
import pprint
import json
import string
import re
import sys

import tweepy
import nltk
import textblob

Twitter OAuth Authentication

CONSUMER_KEY = 'YOUR CONSUMER KEY GOES HERE'
CONSUMER_SECRET = 'YOUR CONSUMER SECRET GOES HERE'

The next step is creating an OAuthHandler instance, into which we pass our consumer key and secret.

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
ACCESS_TOKEN = 'TOKEN GOES HERE'
ACCESS_TOKEN_SECRET = 'SECRET GOES HERE'

Set the access token on our OAuthHandler.

auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
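Before streaming, it is worth confirming that the credentials actually work. A minimal sanity check, assuming a Tweepy 3.x-style REST client:

#optional sanity check: verify_credentials() raises TweepError on bad keys
api = tweepy.API(auth)
try:
    user = api.verify_credentials()
    print('Authenticated as @{0}'.format(user.screen_name))
except tweepy.TweepError:
    print('Authentication failed - check your keys and tokens')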

Streaming Twitter Data

Note that we won’t go into every detail of setting up a stream, because this has already been explained in this post. The steps to set up a Twitter stream are:

  1. Create a class inheriting from tweepy.StreamListener.
  2. Using that class, create a tweepy.Stream object.
  3. Connect to the Twitter API using the Stream.

Implementing tweepy.StreamListener

from tweepy import StreamListener
from tweepy import Stream
import json

tweets = []

class TwitterStreamListener(StreamListener):
    def __init__(self, maxTweets=1000):
        super().__init__()
        self.maxTweets = maxTweets
        self.nonEngCount = 0
        self.count = 0

    #this method is called every time we receive a tweet
    def on_data(self, data):
        #global list used to store tweets for further access in the Examples section
        global tweets

        #parsing the JSON string of the tweet
        tweet = json.loads(data)

        #Main Logic
        #NOTE: the code for this part is available in the Main Logic section below

    def on_status(self, status):
        print(status)

    def printStatus(self):
        print('English Tweets : {0}'.format(self.count))
        print('Non-English Tweets : {0}'.format(self.nonEngCount))

        #using escape sequences to print the status in the same position each time
        for i in range(2):
            sys.stdout.write('\033[F') #move cursor 1 line up
            sys.stdout.write('\033[K') #clear to end of line

Creating tweepy.Stream object

listener = TwitterStreamListener(100)
stream = Stream(auth, listener)

Fetching Tweets

We will be fetching tweets about the Indian Prime Minister, Mr. Narendra Modi.

stream.filter(track=['modi'])
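Since we only keep English tweets anyway (see the Main Logic section below), note that Tweepy’s filter also accepts a languages parameter, which lets Twitter do this filtering server-side and saves bandwidth:

#alternative: let Twitter filter by language server-side
stream.filter(track=['modi'], languages=['en'])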

Main Logic

Note that this code is not independent; it is a continuation of the “Implementing tweepy.StreamListener” section, i.e. it gets inserted inside the on_data method of the TwitterStreamListener we implemented. It is written separately for convenience and explanation purposes.

Different steps that we will be doing are:

  1. For now we will work only on English tweets. We first check whether the ‘lang’ attribute exists in our tweet JSON; if it is ‘en’ we continue, otherwise we just count the tweet as non-English.
  2. We then get the tweet text using the ‘text’ key in the tweet JSON, unless the tweet is truncated, in which case we get the full text from the ‘full_text’ key inside ‘extended_tweet’.
  3. We append the tweet text to our global tweets list, which will later be used for sentiment analysis.
  4. With each tweet we receive, we print the running status: how many English tweets have been appended and how many non-English tweets we have encountered.

Main logic is:

if 'lang' in tweet:
    if tweet['lang'] == 'en':
        self.count += 1

        #getting tweet text (truncated tweets carry the full text
        #under 'extended_tweet'; a missing key raises KeyError)
        try:
            tweetText = tweet['extended_tweet']['full_text']
        except KeyError:
            tweetText = tweet['text']

        #appending tweet text
        tweets.append(tweetText)

        #printing status
        self.printStatus()

        #disconnecting the stream once we have collected enough tweets
        if self.count >= self.maxTweets:
            return False
    else:
        self.nonEngCount += 1

        #printing status
        self.printStatus()

Saving Tweets to a File (Optional)

Note: If you want to write these tweets to a file, you can do that too. Just make sure that, instead of saving every tweet you receive, you save in batches of 100 or 1000 or so tweets; otherwise repeated disk access will make your application painfully slow.

To save tweets, create a method in our TwitterStreamListener that accepts a list of tweets and writes it to a file. The filename is passed to the listener class’s __init__ method, where the file is opened in append (‘a’) mode.

__init__ for TwitterStreamListener

def __init__(self, filename):
    super().__init__()
    self.file = open(filename, 'a') #open the output file in append mode
    self.savedTweetsCount = 0
    self.count = 0
    self.tweets = []

on_data() Method

In the on_data method of tweepy.StreamListener, append incoming tweets to the temporary list and flush it to disk every 100 tweets.

def on_data(self, data):
    tweet = json.loads(data)

    self.count += 1
    self.tweets.append(tweet['text'])

    #saving every 100 tweets
    if self.count % 100 == 0:
        self.saveTweets()
        self.tweets = []

    #stopping the stream by returning False once we have received 1000 tweets
    if self.count == 1000:
        self.file.close()
        return False

Method to Save Tweets

While saving each tweet we have to write the ‘\n’ character manually, and afterwards don’t forget to flush the file stream.

def saveTweets(self):
    for tweet in self.tweets:
        self.file.write(tweet)
        self.file.write('\n')
    self.file.flush()
    self.savedTweetsCount += len(self.tweets)
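Wiring it together, the file-saving variant is used just like the earlier listener (the filename ‘tweets.txt’ is only an example):

#example usage of the file-saving listener
listener = TwitterStreamListener('tweets.txt')
stream = Stream(auth, listener)
stream.filter(track=['modi'])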

Tweet Preprocessing

Tokenization

We first need to tokenize our tweets. We could simply split the text on spaces, or use NLTK’s tokenization functions, but we will use a slightly better solution based on regular expressions. The expressions below recognize HTML tags, @-mentions, hashtags, URLs and numbers as single tokens, so that they can be identified (and filtered out where irrelevant) instead of being broken into meaningless pieces.

regex_str = [
 r'<[^>]+>', # HTML tags
 r'(?:@[\w_]+)', # @-mentions
 r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
 r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
 r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
 r'(?:[\w_]+)', # other words
 r'(?:\S)' # anything else
]

#compiling our regular expression
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)

def tokenize(tweet):
    return tokens_re.findall(tweet)
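As a quick sanity check, here is what tokenize produces on a made-up tweet; notice that the mention, hashtag and URL survive as single tokens, ready to be filtered in the next step:

sample = 'RT @user: Modi announces #DigitalIndia https://t.co/abc123'
print(tokenize(sample))
#['RT', '@user', ':', 'Modi', 'announces', '#DigitalIndia', 'https://t.co/abc123']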

 

Removing Irrelevant Terms

In the tokenization step above we isolated most of the noisy pieces such as URLs, hashtags and numbers. Now we will remove stopwords from our tweets.

Removing Stopwords

Stopwords are extremely common words like ‘the’ and ‘of’ that would skew any analysis, because they carry little value for operations like sentiment analysis.

import string
from nltk.corpus import stopwords

punctuation = list(string.punctuation)

swords = stopwords.words('english') + punctuation + ['rt', '...']

#precompute the stopword set once for fast membership tests
stopwordSet = set(swords)

def processTweets(tweets):
    processedTweets = []

    for tweet in tweets:
        #tokenizing tweet
        tweet = tokenize(tweet)

        #removing stopwords
        tweet = [term for term in tweet if term.lower() not in stopwordSet]

        #appending processed tweet
        processedTweets.append(tweet)

    return processedTweets
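For example, assuming the global tweets list collected by the stream earlier:

#run the collected raw tweets through the preprocessing pipeline
processedTweets = processTweets(tweets)
print(processedTweets[0]) #list of tokens for the first tweet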

Saving Processed Tweets

We will use Python’s built-in serialization library, pickle, to persist our processed tweets. The ability to save tweets is necessary because in a real-life scenario we will be working with huge amounts of Twitter data, and loading and reprocessing it every time would be very expensive. Therefore we store the processed data whenever we will need it again in the future.

def saveTweets(filename, tweets):
    with open(filename, 'wb') as file:
        pickle.dump(tweets, file, pickle.HIGHEST_PROTOCOL)

After saving tweets, of course, we also need to be able to load them back; this process is called deserialization.

def loadTweets(filename):
    with open(filename, 'rb') as file:
        return pickle.load(file)
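Putting the two together, a quick round trip looks like this (the filename here is just an example):

#persist the processed tweets and read them back
saveTweets('processed_tweets.pkl', processedTweets)
restored = loadTweets('processed_tweets.pkl')
assert restored == processedTweets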

Examples

Sentiment Analysis

We will use a library called colorama to print colored text using escape sequences. To learn more about colorama, see this post: CLI colors using ASCII Escape sequences with Colorama in Python. In this example we will iterate over our list of tweets and:

  • Print the tweet in RED if its sentiment is negative.
  • Print the tweet in GREEN if its sentiment is positive.
  • Print the tweet in the default color if its polarity is neutral.

from colorama import init, Fore

#initializing colorama; autoreset restores the default color after each print
init(autoreset=True)

for tweet in tweets:
    blob = textblob.TextBlob(tweet)
    if blob.sentiment.polarity > 0:
        print(Fore.GREEN + tweet)
    elif blob.sentiment.polarity < 0:
        print(Fore.RED + tweet)
    else:
        print(tweet)
