This is Your Life! (brought to you by LinkedIn & Python & Gephi)

Alas, this totally awesome feature of the LinkedIn API is scheduled to be discontinued sometime in May of 2015. My hope is that LinkedIn either changes its mind or allows a partner to continue lending the data to aspiring analysts (such as myself). Regardless, I wanted to take a moment to share a bit about my experience exploring network data with some extraordinarily powerful tools and the soon-to-be-discontinued LinkedIn API.

Before I begin, I need to give the majority of the credit for this experience to the author of this post on Linkurio. The author goes by the pseudonym Jean, and that’s all I’ve been able to find out. Thank you, Jean, whoever you are! Also, thanks to Thomas Cabrol for compiling some excellent code as a starting point.

Process

Ok, to start we must begin by creating an API account, which can be done by visiting the LinkedIn developer URL: https://www.linkedin.com/secure/developer

Add a new application, then record the API key and secret and the user token and secret; they are needed in the Python code below. Note: if you are unable to retrieve the user tokens, the Linkurio post explains another way to collect them. Here is the code I ran:

[su_heading]Step 1: Connect to LinkedIn API[/su_heading]

#!/usr/bin/env python
# encoding: utf-8
"""
linkedin-2-query.py

Created by Thomas Cabrol on 2012-12-03.
Copyright (c) 2012 dataiku. All rights reserved.

Building the LinkedIn Graph
"""
#Note: run this first, and then run cleaner.py

from __future__ import division, print_function

import os
os.chdir("C://Users//Peter//Documents//code//linkedin")

print(os.getcwd()) 
import oauth2 as oauth
import urlparse
import simplejson
import codecs
 
 


# Define CONSUMER_KEY, CONSUMER_SECRET,  
# USER_TOKEN, and USER_SECRET from the credentials 
# provided in your LinkedIn application
CONSUMER_KEY = 'xxxx'
CONSUMER_SECRET = 'xxxx'
USER_TOKEN = 'xxxxx'
USER_SECRET = 'xxxxx'
 
 

OAUTH_TOKEN = USER_TOKEN
OAUTH_TOKEN_SECRET = USER_SECRET
 
OUTPUT = "linked.csv"
 

consumer = oauth.Consumer(key=CONSUMER_KEY, secret=CONSUMER_SECRET)
token = oauth.Token(key=OAUTH_TOKEN, secret=OAUTH_TOKEN_SECRET)
client = oauth.Client(consumer, token)
# Fetch first degree connections
resp, content = client.request('http://api.linkedin.com/v1/people/~/connections?format=json')
results = simplejson.loads(content)    
# File that will store the results
output = codecs.open(OUTPUT, 'w', 'utf-8')
# Loop through the 1st degree connections and see how they connect to each other
for result in results["values"]:
    con = "%s %s" % (result["firstName"].replace(",", " "), result["lastName"].replace(",", " "))
    output.write("Peter Eliason," con "n")
    # This is the trick, use the search API to get related connections
    u = "https://api.linkedin.com/v1/people/%s:(relation-to-viewer:(related-connections))?format=json" % result["id"]
    resp, content = client.request(u)
    rels = simplejson.loads(content)
    try:
        for rel in rels['relationToViewer']['relatedConnections']['values']:
            sec = "%s %s" % (rel["firstName"].replace(",", " "), rel["lastName"].replace(",", " "))
            output.write(con + "," + sec + "\n")
    except:
        pass

output.close()

 

This code will connect to my LinkedIn account and DOWNLOAD my ENTIRE 1st and 2nd degree network!  Spooky, but awesome.

Now that the data has been downloaded, the next step is to clean it up. We can run the script below to remove bad characters and set everything to lowercase:

 

[su_heading]Step 2: Cleaner code[/su_heading]



#!/usr/bin/env python
# encoding: utf-8
"""
linkedin-3-cleaner.py

Created by Thomas Cabrol on 2012-12-04.
Copyright (c) 2012 dataiku. All rights reserved.

Clean up and dedup the LinkedIn graph
"""
# Note: run linkedin-2-query.py first

from __future__ import division, print_function

import os
os.chdir("C://Users//Peter//Documents//code//linkedin")

print(os.getcwd())

import codecs
from unidecode import unidecode
from operator import itemgetter
 
INPUT = 'linked.csv'
OUTPUT = 'linkedin_total.csv'
 
def stringify(chain):
    # Simple utility to build the nodes labels
    allowed = '0123456789abcdefghijklmnopqrstuvwxyz_'
    c = unidecode(chain.strip().lower().replace(' ', '_'))
    return ''.join([letter for letter in c if letter in allowed])
 
def cleaner():
    output = open(OUTPUT, 'w')
    # Store the edges inside a set for dedup
    edges = set()
    for line in codecs.open(INPUT, 'r', 'utf-8'):
        from_person, to_person = line.strip().split(',')
        _f = stringify(from_person)
        _t = stringify(to_person)
        # Reorder the edge tuple
        _e = tuple(sorted((_f, _t), key=itemgetter(0, 1)))
        edges.add(_e)
    for edge in edges:
        output.write(edge[0] + "," + edge[1] + "\n")
    output.close()
 
 
if __name__ == '__main__':
    cleaner() 


The next part of the Python code uses a library called NetworkX to create a file format called graphml which can be imported by a network graphing tool called Gephi. NetworkX is actually capable of far more than simply converting API files to graphml, but we’ll hold off on that tangent for another post. For now, we’ll focus on Gephi and graphml.

[su_heading]Step 3: Create graphml file[/su_heading]


# Defining and Visualizing Simple Networks (Python)

# prepare for Python version 3x features and functions
from __future__ import division, print_function

import os
os.chdir("C://Users//Peter//Documents//code//linkedin")

print(os.getcwd())


# load package into the workspace for this program
import networkx as nx
import matplotlib.pyplot as plt  # 2D plotting

# read the LinkedIn edge list, creating a NetworkX directed graph object g
f = open('linkedin_total.csv', 'rb')
g = nx.read_edgelist(f, delimiter=",", create_using = nx.DiGraph(), nodetype = str)
f.close()

nx.write_graphml(g,'linkedin_total.graphml')

Ok, so now we’ve got our graphml file. The next step is to import it into Gephi. You’ll need to download Gephi as an executable — it’s not a Python library or anything like that. It is a standalone visualization tool.

I’m a Windows user and I had problems getting Gephi to install properly. I was able to work around this by UNINSTALLING Java, and then reinstalling an old version of Java, version 7. After I did this, I was able to install Gephi without problems.

I’m told that Mac users are able to install Gephi with no problems. Figures, ha!

Now, after importing the graphml file into Gephi, I took these steps:

  1. On the left-hand side, ran “Force Atlas 2.”  It takes a LONG time for the process to complete, so I cancelled it after about 10 minutes because the visualization was close enough for my needs.
  2. Activated “Show node labels” to see who each node represented.
  3. Ran the modularity algorithm in the Statistics panel (on the right). I then went to the partition window (select Window > Partition) and chose to color the nodes according to their “Modularity class” (a rough programmatic equivalent is sketched just after this list).
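For readers who prefer to stay in Python, the clustering step can be approximated without Gephi. The sketch below assumes the python-louvain package (imported as community) is installed; it implements the same Louvain method that sits behind Gephi’s modularity statistic, and the output file name is just an example.

# A minimal sketch: computing modularity classes in Python instead of Gephi.
# Assumes the python-louvain package (imported as "community") is installed.
from __future__ import division, print_function

import networkx as nx
import community  # python-louvain

# Louvain works on undirected graphs, so read the edge list as a plain Graph
g = nx.read_edgelist('linkedin_total.csv', delimiter=",", nodetype=str)

# partition is a dict mapping each node label to a numeric "modularity class"
partition = community.best_partition(g)
print("Overall modularity:", community.modularity(partition, g))

# Write out each node's class so it can be compared with the Gephi coloring
with open('modularity_classes.csv', 'w') as out:
    for node, cls in partition.items():
        out.write(node + "," + str(cls) + "\n")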

 

I’m left with a stunning graph of my network with me listed as the center node.  Each color represents a certain cluster within my list of connections. If you look closely, you can see that some nodes have names next to them (I’ve purposefully left them obscenely small to protect the identities of my connections), but Gephi allows the analyst to zoom in and out in order to explore the network.

[Image: Gephi view of my LinkedIn network, clusters colored but unlabeled]

After only a couple minutes, it becomes blatantly clear what each of these clusters and colors represents. They’re a representation of ME and MY LIFE!  The incredibly beautiful part of this entire process is that the analysis was entirely unsupervised! I had nothing to do with directing the creation of these clusters…NetworkX and Gephi did all of that for me by themselves!

To call attention to these clusters, I’ve gone ahead and named each one below. Each cluster represents a key time and network (aka: clique) in my life.

[Image: the same Gephi view with each cluster labeled]

The Undergrad section represents all of my connections from my undergrad school, Luther College in Decorah, IA.

MSPA represents grad school connections from Northwestern University in Evanston, IL (in another state, and 10 years after undergrad, so there’s not much connection between those two networks!).

Also interesting, Best Buy had some hard years back in 2008-2010 and a lot of folks left Best Buy to join SUPERVALU, which explains the many connections between the two.

The fascinating thing about this analysis is that, through LinkedIn, I have a basic map of my Personal AND Professional life.

Conclusion

While this particular map may not be extraordinarily useful for advancing my career, it lets me reflect on the state of my network and, in essence, tells a brief story of my life.

In a business setting, however, I can see how this process could be used to identify clusters, tribes, and influencers, using relationship data to understand the influence of products, lifestyles, and consumer choices.

Twitter API

I’m finding that the Twitter API has been fairly consistent between Python environments (my home and my work machines, both running Windows 7). However, I’m running into significant issues with Twitter data encoding on my work machine. By encoding, I’m talking about which characters are interpretable or printable. As for Python IDEs, I’m running Canopy at home and Spyder on my work machine.

For collecting tweets, I have been using Tweepy. There are many other libraries available for extracting tweets from Twitter’s API, but Tweepy was very easy to install and manage. As I mentioned above, the problem now is extracting the data from the JSON file created by Tweepy so that it’s usable on my work machine. Once I figure out how to extract data on my work machine, I will post the code for both environments.
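In the meantime, here is a minimal sketch of the collection step Tweepy makes easy. The credential placeholders and the screen_name below are hypothetical, and the _json attribute is what carries the raw tweet dictionary in the Tweepy versions I’ve used; treat this as a sketch, not my final script.

import json
import tweepy

# Hypothetical credentials -- substitute the values from your Twitter app
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Pull a handful of recent tweets and store the raw JSON, one record per line.
# json.dumps escapes non-ASCII characters by default, which sidesteps most of
# the console/IDE encoding headaches mentioned above.
with open('tweets.json', 'w') as out:
    for status in tweepy.Cursor(api.user_timeline, screen_name='example_user').items(50):
        out.write(json.dumps(status._json) + '\n')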

The picture below has been particularly helpful in extracting data from the Twitter JSON record. It’s a bit dated, but much of the record structure is still true. I found the image from another post by Mike Teczno.

Here is the map of a tweet, published by Raffi Krikorian:

[Image: Raffi Krikorian’s “Map of a Tweet” diagram]
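Once the records are stored, pulling out individual fields is plain dictionary access. A quick sketch, assuming tweets were saved one JSON object per line (as in the collection sketch above) to a file named tweets.json:

from __future__ import division, print_function

import json

# Walk the stored tweets and print a few of the fields called out in the map
with open('tweets.json') as f:
    for line in f:
        tweet = json.loads(line)
        print(tweet['user']['screen_name'],
              tweet['created_at'],
              tweet['text'].replace('\n', ' '))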

Parsing Exact Target (aka SalesForce Marketing Cloud) List Detective Import Error Files

When sending emails from ExactTarget (aka: SalesForce Marketing Cloud) — you know, we’ll just call it “ET” for short — we occasionally receive error logs containing the ET subscriber keys of a small subset of targeted customers who match profiles of other customers who have unsubscribed or been flagged as undeliverable.  Our goal is to process these log files and mark these customers as “undeliverable” in our parent system.  Since the log files are not formatted for data import, we need a parsing tool to extract the subscriber keys. I chose Python.

In the code below, I enter the path of the log file which was emailed to me from ET, and then I use re (regular expressions) to find all instances that match the subscriber key format, which is “[varchars-varchars-varchars-varchars-varchars]”.

Your version of subscriber keys will undoubtedly look different than mine, but you should be able to modify the regular expression in the re.compile() call to search for the right format. More info is available in the documentation for Python’s re module.

Let me know what you think!

 
I have included sample files with fake data for reference:
Input File (9989FFA6-BD29-8DBA-B712-C6E8ED32F0X9 Results 988623122.txt)
Output File (cleanFile.txt)

 

 
[su_heading]Python Code[/su_heading]


# -*- coding: utf-8 -*-
"""
Created on Wed Feb  4 13:59:50 2015

@author: peliason

Purpose: Parse out subscriber key from ET List detective.

"""

from __future__ import division, print_function



##PARAMETERS
inputFile = "C://Users//peliason//Documents//reference//python//ET parse subscriber//for post//9989FFA6-BD29-8DBA-B712-C6E8ED32F0X9 Results 988623122.txt"

outputFile = "C://Users//peliason//Documents//reference//python//ET parse subscriber//for post//cleanFile.txt"
###END PARAMETERS





import re
import codecs


#initialize output list
found = []

# Regular expression matching the bracketed subscriber key format described above
subscriberKeySearch = re.compile(r'\[([A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+)\]')

## Open the file with read only permit
f = codecs.open(inputFile, 'r','utf-16')

##IMPORTANT: if your input file is utf-8, then you should be able to run this 
#   file open code instead (without the codecs class):
#   f = open(inputFile, 'r')


## Read the first line 
line = f.readline()

## till the file is empty
while line:
    
    found.extend(subscriberKeySearch.findall(line))
    
    line = f.readline()
f.close()

#write the list to the outputFile
myOutput = open(outputFile, 'w')
for ele in found:
    myOutput.write(ele + '\n')
    
myOutput.close()

