Twitter is a rich source of a user’s interests: the public bio, Tweets, people followed, Retweets and favorites. What if we could process all this information in real time to build awesome web apps that personalize content based on a user’s Twitter profile?
MonkeyLearn (@monkeylearn) is a technology platform that enables this type of deep app/site customization. Here we will show you how to process a Twitter user’s public information to power that kind of customization, as well as other kinds of intelligent applications.
As prerequisites, you need Twitter API credentials from a registered Twitter app, and a MonkeyLearn account with its API token.
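In the snippets below we assume those credentials live in module-level constants. Here’s a minimal sketch; the placeholder values and the base URL are assumptions, so check your own app settings and MonkeyLearn’s API docs:

# Hypothetical configuration: replace the placeholders with your own credentials.
TWITTER_CONSUMER_KEY = 'your-consumer-key'
TWITTER_CONSUMER_SECRET = 'your-consumer-secret'
TWITTER_ACCESS_TOKEN_KEY = 'your-access-token'
TWITTER_ACCESS_TOKEN_SECRET = 'your-access-token-secret'

MONKEYLEARN_TOKEN = 'your-monkeylearn-api-token'
# Assumed base URL; check MonkeyLearn's current API docs.
MONKEYLEARN_CLASSIFIER_BASE_URL = 'https://api.monkeylearn.com/v1/categorizer/'
MONKEYLEARN_TOPIC_CLASSIFIER_ID = 'cl_XXXXXXXX'  # Use your own classifier id.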
Overview
You can get the full source code here.
Gather user data
First, we create a tweepy API object with our Twitter API key credentials:
# tweepy is used to call the Twitter API from Python
import tweepy
import re

# Authenticate to Twitter API
auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN_KEY, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
Once we have a Twitter client, we retrieve the user’s Tweets and favorites, keeping only text-heavy Tweets and calculating a relevance score for each:
def get_tweets(api, twitter_user, tweet_type='timeline', max_tweets=200, min_words=5):
    tweets = []
    full_tweets = []
    step = 200  # Maximum value is 200.
    for start in xrange(0, max_tweets, step):
        # Maximum of `step` tweets, or the remaining to reach max_tweets.
        count = min(step, max_tweets - start)
        kwargs = {'count': count}
        if full_tweets:
            last_id = full_tweets[-1].id
            kwargs['max_id'] = last_id - 1
        if tweet_type == 'timeline':
            current = api.user_timeline(twitter_user, **kwargs)
        else:
            current = api.favorites(twitter_user, **kwargs)
        full_tweets.extend(current)
    for tweet in full_tweets:
        # Strip URLs from the tweet text.
        text = re.sub(r'(https?://\S+)', '', tweet.text)
        # Calculate a "score" of tweet relevance/information quality.
        score = tweet.favorite_count + tweet.retweet_count
        if tweet.in_reply_to_status_id_str:
            score -= 15
        # Only keep tweets with more than min_words words.
        if len(re.split(r'[^0-9A-Za-z]+', text)) > min_words:
            tweets.append((text, score))
    return tweets
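For instance, to gather both a user’s timeline and their favorites (the handle here is just an example):

# Example usage with a hypothetical handle.
timeline_tweets = get_tweets(api, 'monkeylearn', tweet_type='timeline', max_tweets=400)
favorite_tweets = get_tweets(api, 'monkeylearn', tweet_type='favorites')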
In the provided source code, you’ll also see us go one step further and include the descriptions of the user’s friends in our content corpus, as sketched below.
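A minimal sketch of that step, assuming we only need each friend’s bio text; the helper name is ours, and tweepy’s Cursor handles the pagination:

def get_friends_descriptions(api, twitter_user, max_friends=200):
    # Collect the non-empty bios of the accounts the user follows.
    descriptions = []
    for friend in tweepy.Cursor(api.friends, screen_name=twitter_user).items(max_friends):
        if friend.description:
            descriptions.append(friend.description.strip())
    return descriptions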
Filter on language
The next step is to filter the Tweets and descriptions down to English. We can do this easily using MonkeyLearn’s API, classifying text in batch mode:
import requests
import json

# This is a handy function to classify a list of texts in batch mode (much faster).
def classify_batch(text_list, classifier_id):
    """
    Batch classify texts.

    text_list -- list of texts to be classified
    classifier_id -- id of the MonkeyLearn classifier to be applied to the texts
    """
    results = []
    step = 250
    for start in xrange(0, len(text_list), step):
        end = start + step
        data = {'text_list': text_list[start:end]}
        response = requests.post(
            MONKEYLEARN_CLASSIFIER_BASE_URL + classifier_id + '/classify_batch_text/',
            data=json.dumps(data),
            headers={
                'Authorization': 'Token {}'.format(MONKEYLEARN_TOKEN),
                'Content-Type': 'application/json'
            })
        try:
            results.extend(response.json()['result'])
        except:
            print response.text
            raise
    return results
If you need additional language support, MonkeyLearn has a number of language classifiers, including Spanish, French and many others. See the filter_language() method in our source code for how to swap in your desired language.
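For reference, here’s a minimal sketch of such a filter. It assumes the language classifier’s id lives in a MONKEYLEARN_LANGUAGE_CLASSIFIER_ID constant (a name of ours) and that each result’s top label is the language name:

def filter_language(texts, language='English'):
    # Keep only the texts whose predicted language matches `language`.
    # Assumes result[0]['label'] holds the language name.
    results = classify_batch(texts, MONKEYLEARN_LANGUAGE_CLASSIFIER_ID)
    return [text for text, result in zip(texts, results)
            if result[0]['label'] == language]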
Detect categories
Now that we have a list of Tweets and descriptions in English, we can use a MonkeyLearn topic classifier to categorize the text and create a histogram of the most popular categories for the user:
from collections import Counter

def category_histogram(texts, short_texts):
    # Classify the bios and tweets with MonkeyLearn's topic classifier.
    topics = classify_batch(texts, MONKEYLEARN_TOPIC_CLASSIFIER_ID)
    # The histogram will keep the counters of how many texts fall in
    # a given category.
    histogram = Counter()
    samples = {}
    for classification, text, short_text in zip(topics, texts, short_texts):
        # Join the parent and child category names in one string.
        category = classification[0]['label'] + '/' + classification[1]['label']
        probability = (classification[0]['probability'] *
                       classification[1]['probability'])
        MIN_PROB = 0.3
        # Discard texts whose predicted topic has a probability below the threshold.
        if probability < MIN_PROB:
            continue
        # Increment the category counter.
        histogram[category] += 1
        # Store the texts by category.
        samples.setdefault(category, []).append((short_text, text))
    return histogram, samples

# Classify the expanded tweets using MonkeyLearn, return the histogram.
tweets_histogram, tweets_categorized = category_histogram(expanded_tweets, tweets_english)

# Classify the expanded bios of the followed users using MonkeyLearn, return the histogram.
descriptions_histogram, descriptions_categorized = category_histogram(expanded_descriptions, descriptions_english)
Display the most popular categories
The above histogram counts how much Tweet activity a user has in each category. Using matplotlib, we create a pie chart that shows the distribution:
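A minimal sketch of that chart, assuming we plot the most common categories from tweets_histogram:

import matplotlib.pyplot as plt

# Plot the 10 most common categories as a pie chart.
top = tweets_histogram.most_common(10)
labels = [category for category, count in top]
counts = [count for category, count in top]

plt.figure(figsize=(8, 8))
plt.pie(counts, labels=labels, autopct='%1.1f%%')
plt.axis('equal')  # Draw the pie as a circle.
plt.title('Tweet activity by category')
plt.show()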
The previous pie chart represents my own interests, and it’s a pretty accurate breakdown given my Twitter activity. I’m a software engineer and geek, so I’m very interested in Computers & Internet/Programming. And since I’m an entrepreneur, Business & Finance/Small businesses shows up as well.
Extract keywords from a given category
The pie chart offers a high-level summary of a user’s interests. We can dig deeper and find the specific interests within each category. To do that, we’ll use MonkeyLearn’s keyword extractor to highlight the most important terms in each category.
First, for each category, we’ll join all the content:
joined_texts = {}
for category in tweets_categorized:
    if category not in top_categories:
        continue
    # Index 0 of each (short_text, text) tuple holds the text we want to join.
    expanded = 0
    joined_texts[category] = u' '.join(map(lambda t: t[expanded], tweets_categorized[category]))
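The top_categories set used above isn’t defined in the snippets shown. One simple way to build it from the histogram; the top-6 cutoff is just an example:

# Keep only the most frequent categories; the cutoff is arbitrary.
top_categories = set(category for category, count in tweets_histogram.most_common(6))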
We then use MonkeyLearn to extract keywords for each category, keeping only the top 20 by relevance:
keywords = dict(zip(joined_texts.keys(), extract_keywords(joined_texts.values(), 20)))

for cat, kw in keywords.iteritems():
    # Sort the keywords by relevance and keep only their text.
    top_relevant = map(
        lambda x: x.get('keyword'),
        sorted(kw, key=lambda x: float(x.get('relevance')), reverse=True)
    )
    print u"{}: {}".format(cat, u", ".join(top_relevant))
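The extract_keywords() helper follows the same batch pattern as classify_batch(). Here’s a sketch under the assumption that the extractor endpoint mirrors the classifier’s; the constant names and the URL path are ours, so check MonkeyLearn’s API docs for the exact endpoint:

def extract_keywords(text_list, max_keywords):
    # Batch keyword extraction; the endpoint below is an assumption.
    results = []
    step = 250
    for start in xrange(0, len(text_list), step):
        data = {'text_list': text_list[start:start + step], 'max_keywords': max_keywords}
        response = requests.post(
            MONKEYLEARN_EXTRACTOR_BASE_URL + KEYWORD_EXTRACTOR_ID + '/extract_batch_text/',
            data=json.dumps(data),
            headers={
                'Authorization': 'Token {}'.format(MONKEYLEARN_TOKEN),
                'Content-Type': 'application/json'
            })
        results.extend(response.json()['result'])
    return results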
The following clouds show the keywords that represent the Computers & Internet and Business & Finance categories, respectively:
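To reproduce clouds like these, the wordcloud package works well. A minimal sketch that weights each keyword by its relevance, assuming a recent wordcloud release and using one of the category keys above as an example:

from wordcloud import WordCloud

# Build a cloud for one category, sizing words by keyword relevance.
frequencies = {k['keyword']: float(k['relevance'])
               for k in keywords['Computers & Internet/Programming']}
cloud = WordCloud(width=800, height=400, background_color='white')
cloud.generate_from_frequencies(frequencies)
cloud.to_file('programming_keywords.png')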
As another data point, you can see the pie chart and word cloud for Katy Perry, in which we identify Events & Special Occasions and Entertainment as key categories, which makes sense given her career and busy event schedule.
Conclusion
Using the Twitter API and MonkeyLearn, it’s simple to classify Tweets and user descriptions and extract relevant information from them. Together they offer useful insights into an individual’s interests, which can power a variety of applications.
We encourage you to try the Twitter API, sign up for MonkeyLearn, and discover new applications with the programming language you love.
Credits
A huge thanks to Agustin Azzinari and Rodrigo Stecanella for their contributions to the source code and Federico Pascual and Martin Alcala Rubi for their writing and editing.