3. Sentiment analysis

3.1. Importing textblob

As we mentioned at the beginning of this workshop, textblob will allow us to do sentiment analysis in a very simple way. We will also use Python's re library, which is used to work with regular expressions. For this, I'll provide two utility functions: a) one to clean the text (meaning that mentions, links and any character that is not alphanumeric are replaced with spaces), and b) one to classify the polarity of each tweet after its text has been cleaned. I won't explain in detail how the cleaning function works, since it would take too long and the regular expressions involved are better understood by reading the official re documentation.

The code that I'm providing is:

from textblob import TextBlob
import re

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    links and special characters using regex.
    '''
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
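
Just to get a feeling for what these two functions do, we can try them on a made-up example (the tweet below is invented purely for illustration and is not part of the dataset):

# An invented example tweet, just to illustrate the two utilities:
example = "What a great day! Thanks @SomeUser, more info at https://example.com"

print(clean_tweet(example))
# Expected: 'What a great day Thanks more info at'

print(analize_sentiment(example))
# Expected: 1, since 'great' carries a positive polarity in textblob's lexicon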

The way it works is that textblob already provides a trained analyzer (cool, right?). Textblob can work with different machine learning models used in natural language processing. If you want to train your own classifier (or at least check how it works), feel free to check the following link. It might be relevant since we're working with a pre-trained model (for which we don't know the data that was used to train it).
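
As a rough sketch of what training your own classifier looks like, textblob includes a NaiveBayesClassifier that learns from labeled examples. The tiny training set below is invented purely to show the API, so don't expect it to be accurate:

from textblob.classifiers import NaiveBayesClassifier

# A toy, hand-labeled training set (invented for illustration only):
train = [
    ("I love this, it is amazing", "pos"),
    ("This is great news", "pos"),
    ("What a terrible decision", "neg"),
    ("I am very disappointed", "neg"),
]

# Train the classifier and classify a new piece of text:
cl = NaiveBayesClassifier(train)
print(cl.classify("This is really great"))   # expected: 'pos'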

Anyway, getting back to the code, we will just add an extra column to our data. This column will contain the result of the sentiment analysis, and we can display the dataframe to see the update:

# We create a column with the result of the analysis:
data['SA'] = np.array([ analize_sentiment(tweet) for tweet in data['Tweets'] ])

# We display the updated dataframe with the new column:
display(data.head(10))

Obtaining the new output:

|   | Tweets | len | ID | Date | Source | Likes | RTs | SA |
|---|--------|-----|----|------|--------|-------|-----|----|
| 0 | On behalf of @FLOTUS Melania & myself, THA... | 144 | 903778130850131970 | 2017-09-02 00:34:32 | Twitter for iPhone | 24572 | 5585 | 1 |
| 1 | I will be going to Texas and Louisiana tomorro... | 132 | 903770196388831233 | 2017-09-02 00:03:00 | Twitter for iPhone | 44748 | 8825 | 1 |
| 2 | Stock Market up 5 months in a row! | 34 | 903766326631698432 | 2017-09-01 23:47:38 | Twitter for iPhone | 44518 | 9134 | 0 |
| 3 | 'President Donald J. Trump Proclaims September... | 140 | 903705867891204096 | 2017-09-01 19:47:23 | Media Studio | 47009 | 15127 | 0 |
| 4 | Texas is healing fast thanks to all of the gre... | 143 | 903603043714957312 | 2017-09-01 12:58:48 | Twitter for iPhone | 77680 | 15398 | 1 |
| 5 | ...get things done at a record clip. Many big ... | 113 | 903600265420578819 | 2017-09-01 12:47:46 | Twitter for iPhone | 54664 | 11424 | 1 |
| 6 | General John Kelly is doing a great job as Chi... | 140 | 903597166249246720 | 2017-09-01 12:35:27 | Twitter for iPhone | 59840 | 11678 | 1 |
| 7 | Wow, looks like James Comey exonerated Hillary... | 130 | 903587428488839170 | 2017-09-01 11:56:45 | Twitter for iPhone | 110667 | 35936 | 1 |
| 8 | THANK YOU to all of the incredible HEROES in T... | 110 | 903348312421670912 | 2017-08-31 20:06:35 | Twitter for iPhone | 112012 | 29064 | 1 |
| 9 | RT @FoxNews: .@KellyannePolls on Harvey recove... | 140 | 903234878124249090 | 2017-08-31 12:35:50 | Twitter for iPhone | 0 | 6638 | 0 |

As we can see, the last column contains the sentiment analysis (SA). We now just need to check the results.

3.2. Analyzing the results

To have a simple way to verify the results, we will count the number of neutral, positive and negative tweets and extract the percentages.

# We construct lists with classified tweets:

pos_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] > 0]
neu_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] == 0]
neg_tweets = [ tweet for index, tweet in enumerate(data['Tweets']) if data['SA'][index] < 0]
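
Since the labels are already stored in the dataframe, the same lists can also be built with pandas boolean indexing; this is just an equivalent alternative to the comprehensions above:

# Equivalent filtering using boolean masks on the dataframe:
pos_tweets = data[data['SA'] > 0]['Tweets'].tolist()
neu_tweets = data[data['SA'] == 0]['Tweets'].tolist()
neg_tweets = data[data['SA'] < 0]['Tweets'].tolist()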

Now that we have the lists, we just print the percentages:

# We print percentages:

print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(data['Tweets'])))
print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(data['Tweets'])))
print("Percentage de negative tweets: {}%".format(len(neg_tweets)*100/len(data['Tweets'])))

Obtaining the following result:

Percentage of positive tweets: 51.0%
Percentage of neutral tweets: 27.0%
Percentage of negative tweets: 22.0%
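
For reference, pandas can produce the same percentages in a single line with value_counts (the exact numbers will of course depend on the tweets fetched when you run this):

# Percentage of each sentiment label (1, 0, -1) in the dataframe:
print(data['SA'].value_counts(normalize=True) * 100)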

We have to consider that we're working only with the 200 most recent tweets from D. Trump (last updated: September 2nd). For more accurate results we could consider more tweets. An interesting exercise (and an invitation to the readers) is to analyze the polarity of the tweets from each source separately; it might turn out that, by considering only the tweets from one source, the polarity comes out more positive or more negative. A minimal sketch of how this could look is given below. Anyway, I hope you find this interesting.
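
A minimal sketch of that per-source comparison, assuming the same data dataframe we have been using, could be:

# Share of positive / neutral / negative tweets within each source:
print(data.groupby('Source')['SA'].value_counts(normalize=True) * 100)

# Or simply the average sentiment label per source:
print(data.groupby('Source')['SA'].mean())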

As we saw, we can extract, manipulate, visualize and analyze data in a very simple way with Python. I hope this leaves the reader with some curiosity for further exploration using these tools.

Go back to 2. Visualization and basic statistics
Go next to 4. References