Twitter is a great source of information as well as a fantastic communication tool on specific event. In our – academic – world conferences are one of the most commonly occurring events, where a significant degree of communication between participants now happen on Twitter. Last week Dublin was hosting the Research Data Alliance Plenary 3 with over 400 participants. It was organised by Insight Centre and the DRI, as well as other partners from Australia, Europe and US. We were assisting the organising committee with the event’s social media, part of which we demonstrated the top tweeters for each day and also the top tweeters and the top retweeteds for the whole conference. In addition to this, I presented an analysis and visualisation of the conversation between #RDAPlenary tweeters. Here I am sharing with you how these were done, in case you find them interesting or wanted to do the same for your other events.
In this blog post I take you step by step through what I did for (1) computing top tweeters and retweeters, and (2) analysing and visualising the #RDAPlenary social network.
- Twitter: You need a Twitter account
- ScraperWiki: You need a ScraperWiki account
- MS Excel – used only for top tweeter identification and not necessary for social network analysis
Getting #RDAPlenary tweets
The first step in this journey here is to download tweets from Twitter. Twitter doesn’t provide free access to all the tweets posted on Twitter. It, however, gives us access to a selection of tweets via the Twitter APIs and searching for specific hashtags, keywords or users. Access to all Twitter data, the Twitter firehose, is very much restricted and also expensive.
There are various ways to collect tweets. Programmers usually write their own scripts for this purpose, but here we use a very easy tool for this, which doesn’t require any programming skills and that’s the fabulous ScraperWiki.
1. Sign in to ScraperWiki.
2. Select ‘create a new dataset on the page you are directed to, called ‘data hub’.
3. Select ‘Search for Tweets’ to search for tweets you are interested in.
4. The next step is to tell ScraperWiki what you are interested in. Here you can use various advance search operators supported by Twitter. The easiest, however, is to search for a hashtag or a keyword. For our purpose we searched for #RDAPlenary to collect the conference tweets.
You may be asked to authorise the ScraperWiki to access your Twitter account and then ScraperWiki will start downloading all tweets it can find and access. You can also tick a box for collecting future tweets matching the hashtag.
5. After a short while ScraperWiki will show you that the tweets are collected and ready for you. You can now see collected tweets on the ‘View in a Table’ tab.
6. You will now have the option to download the tweet dataset as a spreadsheet or a CSV file. Download as spreadsheet. You may also use the the #RDAPlenary tweet dataset that I collected and used for RDA (#RDAPlenary tweets from 25 March 2014, 6.00 am.m. until 28 March 2014 3.00 p.m.).
Open the spreadsheet you just saved and have a look at the data. You will soon find yourself familiar with various columns in the spreadsheet, such as ‘created_at’, ‘text’, ‘screen_name’ and ‘hashtags’. Feels good?!
Top Tweeters of #RDAPlenary
During the conference we presented the top tweeters and the most retweeteds. For this I used MS Excel.
In the conference we were mainly focused on the original tweets and thus I deleted the retweets (starting with RT) from the datasets.
1. Select the ‘text’ column and select sort option as shown in the picture. This will sort the body of the text alphabetically and thus put all the tweets starting with RT together.
2. Select the ones starting with RT and delete them all.
Now we have a dataset of original tweets. Save it somewhere safe for further social network anlaysis, and use a copy of this clean dataset of original tweets for computing top tweeters in Excel.
We are now ready to calculate top tweeters.
3. Select all the data in the spreadsheet (Ctrl+A) and the select Pivot Table from Data tab. Pivot tables are a great way of summarising data in Excel.
This will open up a new work sheet for you with a guessed Pivot table. You don’t want that table and need to make changes to the pivot table and turn it to what you need.
4. What you want is to have the user handles in your first column and the count of each user’s tweets in front of their twitter handle. You should have a ‘Pivot Table Builder’ window automatically opened or you should be able to access it under Pivot Table>Data section. This builder helps you to choose the Row and column lables and values for your pivot table.
5. Scroll down the Field Name list and un-tick all the boxes, so that we can start fresh. You can also drag from Row Labels, Column Labels and value boxes and drop them in the big Field name box.
Now we have an empty pivot table.
6. Drag ‘screen name’ from Field name box to Row Labels box.
7. Drag ‘text’ from Field name box to Values box.
8. Click on the i on the right side of the dropped text under Value and make sure ‘Count’ is selected. This counts the tweets for each row, i.e. each Twitter handle.
Now you just need to sort your pivot table to see the top tweeter.
9. Select the values under Total column until the Grand total, leaving the Grand total value out. Then choose sort option.
Yay! You now have the top tweeters here computed for you.
10. You can find out the top retweeters in the same manner. For this in your pivot table choose ‘retweet_count’ and choose ‘Sum’ function for it, and then sort again. This will give you the sum of retweets for each user.
Data manipulation using OpenRefine
Now we want to get ready for some exciting social network analysis and find out how much conversation has been going on during #RDAPlenary. I used Gephi for this, but before being able to send the data to Gephi for network analysis and visualisation, we need to transform data into a Gephi friendly shape. For this I used OpenRefine, which is a great tool for data cleansing and manipulation.
Extracting mentions using OpenRefine
What we need to do is to extract mentions from each tweet to be able to draw the conversation between people identified by their twitter handle (@mentions).
1.Open OpenRefine with OpenRefine default options.
2. Load your original cleaned dataset (the one you removed RTs from and saved to put aside) into OpenRefine using “create project”.
What we want now is to transform the data into a format that Gephi likes, and that is a format that has two columns: (1) user handle (screen name) and (2) the handle of the second person mentioned in the user’s tweet.
Ok, first thing we need to do is to extract mentions, start with @, from the tweets.
3. Before anything let’s unify mentions by lowercasing all the tweets’ texts. Select ‘edit cells>common transforms>to lowercase’ from the column options menu of the text column (the arrow next to the column name).
Great! Now we need to extract all the mentions from the tweet texts. For this we have to take several steps at once.
First we need to split all the words. Twitter seperates words by anything that is not a letter, number, dash (-), or underscore (_).
4. Select “edit columns – add column based on this column” and choose the column mentions. The expression that will split the column’s text into words is:
Next we need to filter the list that for taking the mentions out, i.e. everything that starts with a @.
To filter the list, we use the function filter(). The filter() function wants your list first, then the name of a variable to assign to each column, and then something that checks whether or not it should be included.
5. To filter for each person mention, we can amend the above expression as follows:
6. We now need to join the list by appending the .join(“,”) function. This joins the list into a single string of text by inserting commas in case there is more than one mention in one tweet.
7. Now add this column to the dataset.
We now want to remove the columns that we don’t need, and bring the two columns ‘screen name’ and ‘mentions’ into position.
8. Delete the columns that we don’t need and reposition ‘screen_name’ and ‘mention’ columns if required.
Great, now we have two columns left, justas we were wishing for: the user handle and a comma separated list on mentions in a specific tweet from that handle.
9. For Gephi, we’ll need to have each user-mention pair in one row. This we need to split the multi valued column into several rows. Do so by ‘edit cells>split multi-valued cells’, and split by comma (,).
Boom, done! Oh, wait, but we now have a number of rows with blank screen_name!
10. Do not worry! Use ‘edit cells – fill down’ to add them to all the empty rows.
11. You now notice that there is a difference between the screen_names and the mentions: mentions start with @ (for users) and are in lowercase.
Let’s first lowercase the letters using “edit cells – common transforms – to lowercase”, as we did before.
12. Then use “edit cells – Transform” to add the @ in before the screen_name using “@”+value.
13. Viola! You have now formatted the file for Gephi. Download it as CSV (Comma-separated value) using the ‘Export’ button on the top right of your OpenRefine window.
Social Network Analysis using Gephi
We are now ready to start our social network analysis in Gephi.
1. Start Gephi and choose ‘new project’.
2. Open the CSV with ‘file>open’.
3. Select “directed” and leave the defaults.
Since Gephi takes all the rows and columns into account, we need to remove the column headers.
4. Change the view to the ‘data laboratory’ view.
5. Remove the first two nodes (screen_name and mentions): mark them, then right click and select ‘delete all’.
6. Change back to the ‘Overview’.
This is a pretty ugly and unreadable graph! This is because we haven’t applied any layouts options.
7. It is easy: In the layout window on the left, select ‘ForceAtlas’. ForceAtlas is a simple algorithm, which groups connected nodes closer together.
Let ForceAtlas run for a while and stop it when you feel better about the graph.
8. But how do we know which node is who? Let’s enable the labels.
Ok, we now have the labels, i.e. user handles, but they are not clear to read, and we don’t know which labels are more important than others.
9. To scale the labels by how many connections a label has, select label size on the top right of the Ranking window; Then choose ‘degree’ as a parameter.
10. Click on ‘apply’ and play with the parameters (minimum and maximum size) as you see fit (make sure you apply every time when you make changes).
11. Cool, but we still can’t read a thing! Luckily there is a layout called ‘label adjust’. This layout will move nodes so the labels don’t overlap. Let it run for a while.
12. For a final view switch to Preview tab and select refresh (make sure show label is selected). You can change the presets and Node values as you wish and refresh to see the result.
14. When you are happy with the result export your graph to your preferred format. Pretty, isn’t it?!
Here you can find the exact network visulisation I presented at the RDA Plenary 3 closing session and the complete #RDAPlenary tweet dataset that I used for it. More on RDA Plenary social media can also be found here.
This work was inspired by this School of Data work.