Advice for aspiring data scientists: Learn SQL

My advice to data scientists hoping to get into IT or business: Learn SQL.


Pronounced “sequel,” SQL has been around since the 70s and remains the most popular language for interacting with databases today. Please note the subtle choice of words here…I said “interacting with databases,” not “interacting with data.” There continue to be strides made by the open source community and other tool sets that make working with data a breeze, but the work of bringing the data down from the database largely requires the work of someone who knows a thing or two about SQL.

My understanding is that SQL was originally created for marketers to be able to interact with databases on their own! The fire didn’t really catch though, and technical resources and analysts soon became the bearers of the skill over time.

“But I’m a data scientist, can’t someone else get the data for me?”

Sure, I get that point of view. But there are also some limitations to one’s productivity by abstaining from learning SQL. For example:

  1. You will be limited to only organizations large enough to staff a full-time database extract resource to pull your data down for you. Not every organization can afford to specialize data extraction apart from data science. By learning SQL, you will be able to expand your prospective workplace search by spanning into both roles.
  2. There will be misunderstandings between you and your data extraction partner, in terms of what data you need for your work. I did not say “there may be…” I said “there will be.” By learning SQL, you will be forced to understand the cardinality and sources of the data at the level of detail you require. Even if you are fortunate enough to have a data extraction partner on staff to support your data needs, you will find that your communication will benefit from a strong grasp of SQL. For example, stating that you need “sum of sales and count of distinct transactions grouped by region” will make your data partner’s life much, much easier.
  3. You will have to rely on someone else’s competing bandwidth. Imagine your boss demanding updated model results by end-of-day, but your data extraction partner is on vacation or simply busy with another project. How does that make you look? Even if that situation only happens a few times, imagine how your performance looks compared to a peer data scientist who knows SQL and can consistently meet deadlines.


My prediction is that SQL will likely be around for another 20 years, so learning it now will keep you relevant and valuable in the short-term and the long-term.

As far as data querying go (eg: SELECT statements), SQL syntax is generally common between platforms, so a Microsoft SQL Server version of SQL will be relatively similar to an Oracle version of SQL. You may not be able to completely copy a giant SQL statement and run it in another system, but it will be pretty close. As a data scientist, of all the SQL you write, 95% or more will be data querying, so don’t get too worked up about learning data INSERT or UPDATE statements across platforms.

Even some of the most advanced massively parallel processing (MPP) systems are using SQL, for example Redshift, Google BigQuery, Netezza, Vertica, and Aster Data. Many of these are relatively new platforms who have chosen to support SQL.

“But what about the new data management platforms that don’t use SQL, like Hadoop and NoSQL?”

Sure. I would say to learn those platforms, as well! The point I am trying to make here is it that it will do your career well for you to become autonomous at extracting and providing your own data. Data will never become less abundant, and by enabling yourself to pull the data from its source, you become significantly more valuable to an employer.

When you need a friend, MatchIt.

I’ve recently fallen in love with a great R package called MatchIt, and I’m going to talk about it in an upcoming Minneanalytics conference, Big Data Tech 2017.

MatchIt was created by Ho, King, Imea, and Stuart with the original intent of measuring political figures’ behavior regarding particular topics. Interestingly, a powerful side effect of this usage is that the tool can be used to help find a “like” sub-population between two different samples, whereby recreating similar distributions and attributes across multiple populations. Not surprisingly, this has a direct impact to retail testing.

A “mocked” version of my code to perform the matching can be found below. Here, I’m identifying 5 “matched” control stores for each test store. My poor choice for a blog post title refers to the tool “finding a friend” or finding a match for each specified test store.

If you’d like to see the original documentation by the authors, a great reference can be found here: It includes a deep description of what’s happening within the tool. Everything made a lot more sense to me after reading the doc.

UPDATE: I’ve found a much better parameter for the MatchIt script when using it for match-pair purposes. The distance=”mahalanobis” parameter will actually choose nearest neighbors instead of matching on propensity score which is the default method. I much prefer my results with the mahalanobis parameter. Model code updated below to reflect this addition.

#Assign "test / control" flag
testdlrsDF <- filter(subsetDF,Focus.Market != "CONTROL")
controldlrsDF <- filter(subsetDF,Focus.Market == "CONTROL")
testdlrsDF$treatment <- 1
controldlrsDF$treatment <- 0
pairsDF <- union(testdlrsDF,controldlrsDF)

#apply Match-Pair Assignment
m.out1 <- matchit(treatment ~ latitude + metric1 + metric2 + dim1 + dim2
                  , method = "nearest", distance="mahalanobis",
                  ratio=5,data = pairsDF,replace=TRUE)

m.out2 <- m.out1$match.matrix

#construct output dataframe
pairsdlr <- data.frame(TestDealerNum = integer(), ControlDealerNum = integer())
newRow1 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,1],]$DealerNum)
newRow2 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,2],]$DealerNum)
newRow3 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,3],]$DealerNum)
newRow4 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,4],]$DealerNum)
newRow5 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,5],]$DealerNum)
#note: actual analysis used 10 samples instead of 5...

pairsdlr <- dplyr::bind_rows(pairsdlr, newRow1, newRow2, newRow3,newRow4,newRow5)

#write data
write.csv(pairsdlr, "matchpairs.csv")

A Gentle Introduction into Random Forests

I can still picture exactly where I was the first time I heard someone suggested implementing a Random Forest as a modeling technique. We had hired a contractor to help with a particular modeling problem, and I literally laughed out loud and asked the individual if he was serious.  Hearing it for the first time, the word “random” certainly does not offer much to instill confidence in predictive accuracy. “How would something random add any sort of value,” I wondered. However, I later learned Random Forests have proven to be an extremely effective modeling technique, able to protect against correlated variables and bias.

In this post, I will provide context around what Random Forests are and what value they bring to business.


Overview of Random Forests

Random Forest models are based on the Decision Tree modeling technique which is based on splits of data rather than linear correlation. Developed by Breiman (2001), the algorithm follows a Bagging technique (coincidentally, also developed by Brieman several years prior) except that in addition to randomizing bootstrapped samples of data, Random Forest also randomizes bootstrapped sampled of predictors (Figure 1).


Figure 1: Evolution of Random Forests


In Figure 1, notice how there is a single tree for the CART model. The next evolution, Bagging, employs multiple trees based on bootstrapped samples of data (James, et al, 2014). We refer to this as ensemble modeling, because we use multiple models simultaneously to determine a prediction for a single observation (Seni & Elder, 2010). Ensemble modeling has proven to occasionally yield more accurate results at the expense of model interpretability. In Bagging, notice how the top of the tree is generally the same “Important Predictor.” This leads to correlated models. The correlation can be addressed by implementing a random factor (called perturbation) which only selects a subset of predictors with each bootstrap sample. Random Forest, another ensemble method, employs this approach. In the end, these ensemble techniques combine all of their models together and export a composite score or predictor (for example, through voting), for each observation.

While it does operate under the guise of a black box, Random Forests do leave us a few minor clues as to what’s going on underneath the hood. In statistics packages, there are generally some “variable importance” plots which can be conjured once a model is fit. These plots allow us to see which variables are most “interesting” but don’t necessarily explain why they’re interesting or even give a correlation sign.  Also, if needed, we can generally extract a few actual “trees” or splits from within the model construct, but since there are generally so many trees, simply reviewing a handful of them closely would not be helpful, and in fact, may be misleading.


Value of Random Forests


The value we realize from Random Forests is that it protects against correlated variables and gives each predictor more of a chance to be recognized in the model rather than be overshadowed by a few strong or greedy predictor variables. This is awesome when there exists high multicollinearity or a high number of predictors present. Overall, these additions lead to greater predictive accuracy (Seni & Elder, 2010). The downside of the Random Forest model is that it is not interpretable to the analyst or the business. It will be very difficult to peel back the covers and determine “why” a particular observation was classified in such a manner. The business must learn to “trust” the model through cross-validation and constant model performance monitoring.



For more Random Forest fun (I can tell you are hardly able to contain your excitement), head on over to either this other author’s blog post and one more for more “gentle” conversations regarding your new favorite landmark in Statisticsland, the Random Forest.





Breiman, L. (2001). Random Forests, random features. Berkeley: University of California. 1.1, 4.4.

Chapman, C. & Feit, E. (2015). R for Marketing Research and Analytics. Switzerland: Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). Introduction to Statistical Learning. New York: Springer.

Seni, G. & Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions. Morgan & Claypool.

Economics of CRM Modeling

These are several conversations I have had with many marketing leaders; maybe you can relate?


Conversation 1

Marketing Manager: If we get $2 in incremental sales per customer mailed, we should be able to increase our list by 10,000 and then make an additional $20,000 in incremental sales, right?

Me: Customer response and incremental sales are not linear!

Conversation 2

Marketing Manager: If you build a model, it will drive incremental sales, right?

Me: We’re already mailing nearly all of our customers.

Marketing Manager: Right, but if you build a model, it will drive more sales, though, right?

Me: No, that’s not quite how it works…just a sec…here (comes back with handy-dandy economic CRM diagram)



Here’s the PDF, for people who like PDFs.

I built this little diagram to bring all the key metrics into a single place that I can then use to demonstrate the relationships between cost, file size, incremental sales, ROI, and even more importantly the NET CONTRIBUTION. Here, I define [NET CONTRIBUTION] = [Incremental Sales] – [Cost].

Many leaders get caught up on ROI, but again, ROI is not a linear thing. Once the model is built, if we invest more in going deeper in the mail file, we cannot expect to maintain the same level of ROI.  ROI might be fantastic for the first decile, but the marketing manager needs to also consider, “will I get enough volume of incremental sales from this tactic at this depth to even make the ROI worth it?”

I like this  graph because it also emphasizes that there exists some “arbitrary incremental sales ceiling.”  In other words, if we mail EVERYONE in our entire CRM database, we will probably generate a lot of incremental sales, but it will probably be at an even GREATER expense…which is why net contribution (and ROI) is nearly zero at the full size of the file.

The goal, then, of marketing analytics is to optimize on maximum net contribution. Maximum ROI will likely only be a few records translating to minimum incremental sales, and Maximum Incremental Sales will be the whole file (which will probably also be super-high cost).  So, maximum net contribution is where the focus should be.

Once our “customer selection model” is established, we can use other models that “choose offers, choose times, or choose creative” to send our targeted customers in order to improve response or incremental sales for that group. For example, these types of models could include offer or product propensity models, lifecycle modeling, or creative/behavioral segmentation.  In other words, a customer selection model that chooses “who” to target won’t necessarily increase incremental sales from one campaign to the next (if circulation depth is held constant),* but an offer, timing, or creative model might be able to improve incremental sales because the “right message” and “right time” principles are addressed.


*Hey, I know there’s an argument to be made for improving models for customer selection and uplift modeling, etc. (which would boost incremental sales slightly on the same circulation), but that’s another discussion, mm-kay?

INFP? Consider a career in analytics.

I realize that this post will only resonate with a small number of people (3% to be exact) who associate with the Meyers-Briggs personality type INFP.

For those of you who aren’t familiar, INFPs are the types of people who aren’t very task-oriented, daydream (a lot), become obsessed and forget about other important things, and may reserve some judgement because they don’t want to hurt other people’s feelings.  Sounds like a great candidate for hire, am I right!

I’ve taken the Meyers-Briggs (MBTI) test several times and am consistently scored as an INFP. Then, after the test, I’m usually provided a little blurb with “career advice for INFPs” among relationship advice (which I clearly need help with) and other things. But, within the career advice section, it always mentions social careers or art or music. Now, not surprisingly, I do enjoy music and do break out the ol’ guitar from time-to-time, but I have always been curious that there is never any mention of Analytics as a career option which has been an incredibly fulfilling option for me.

I’m usually provided a little blurb with “career advice for INFPs” … but I have always been curious that there is never any mention of Analytics as a career option which has been an incredibly fulfilling option for me.

Here are some examples:

  • As an INFP, I am innately curious. Analytics THRIVES in an environment where someone is curious. It is only when the analyst becomes familiar with the data that interesting findings become more apparent.
  • As an INFP, I need a creative outlet. Analytics allows me to be extremely creative through creating presentations, storytelling with the data I curate, choosing which techniques to experiment with and which paths to follow.
    • I find that having supportive leadership is incredibly beneficial. When leadership trusts me and allows me to take some time to explore, I feel that I’m able to find some really interesting patterns and information.
  • As an INFP, I like having time at work to be by myself and be thoughtful. As an analyst/data scientist, I will have occasional meetings with people when I have questions or need to learn about data or context around certain problems or when I need to present. But, I like being able to go back to my desk and get work done, too. And if I do need to meet, I prefer the smaller meetings where there is focus and some sort of goal.
  • As an INFP, I tend to become obsessed with a particular technology or problem. This allows me to ramp up and adapt very quickly and become an expert.
  • As an INFP, I strive to get my customers the answers they need. In analytics, I’ve found that my customers are most delighted when I’m able to respond to their questions within the same day. Therefore, jumping through hoops or diverging from process (occasionally) needs to happen in order for me to turn something around quickly. As an INFP, though, I have no guilty conscience about breaking process, because at the end of the day, if my customer is happy and is able to do her job better, then it’s better for the company.
    • Hey, you “Js” sorry about breaking the process. It’s fine. It won’t happen again. 😉
  • As an INFP, I like being able to learn how things work. That’s why writing code has been really fun for me. I am “fluent” in many data gathering, analysis, and coding languages, and I love being able to sit back and watch my code “run.” I was never in love with Mathematics in school, but now that I can see the value and how it fits within an interesting career for me, I’ve been much more interested in learning math.


I could go on and on about this. Ultimately, the reason I wanted to jot down my thoughts here is that I feel like analytics is a GREAT career option for INFPs, and I don’t feel like it is getting the attention it deserves. I’ve read several blogs from INFPs who are having a career crisis, and so I wanted to mention this in the hopes that it may be helpful for someone like me, an INFP.



(Photo credit: Daydreaming - Craig Edwards)

Rise of the Gamer

No, I’m not talking about the Super Mario fiend or the World of Warcraft junkie. I’m talking about what unifies us as a human race. “Fun.” Humans desire fun.

By now, many of us have heard that we humans have an attention span of eight seconds (although, my pal Phil definitely has less). So, as marketers, how on earth will all of our hard-earned collective creative genius be recognized and appreciated by our customer?

Do you know how many hours and revisions we went through to get that shade of gray on the nav bar just right?! More than fifty.

One answer is out there. It’s games.

And the brands that can figure it out first will thrive in this new day in age. Creating portals and processes that tie in to gaming or character growth and recognition will create two invaluable assets:

  • Consumer experiences which are fun and memorable and which will create loyalty, curiosity, and advocacy.
  • Tons of actual human response data that can be consumed and analyzed creating an infinite testbed.

Not much more needs to be said about creating the “fun experiences.” I think we all get that finding a way to entertain and reward our customers will create strong affinities. I do, however, want to touch on how these data can then be used to create incredible outcomes.

First off, a game could be used to simulate shopping behavior.

Do you remember playing Sim City? All that time spent building a city, just to start over? Think of all that data which could be collected from that game. Why did the customer choose the hydro plant instead of the coal plant (maybe the customer is an environmental segment)? Why build a baseball field before a library (maybe the customer is a sports-minded segment)?  Did only 0.001% of gamers choose to build a harbor (maybe customers don’t see the value in a harbor)?

The data can be compiled to established segments and determine product value and simulate customer choice.

Second, humans can refine your analytical models.

Anyone else annoyed with Siri? Think about if Apple had invented some sort of trivia game, like “Who wants to be a millionaire.”  The list of questions could either be curated by Apple or solicited from the human playing the game. Responses would be fed back into the model and then refined over time.  At the very least, Apple should have developed some sort of feedback loop where humans could say “no, Siri, THIS is what I meant <next action>.” This feedback loop allows the algorithm to get better over time, because let’s face it, we humans are tough to figure out (and…as Boaty McBoatface helped remind us, hard to trust).

How about “voting.” Think about how Tinder has taken off (grown in popularity, I mean). It’s so easy to quickly discern taste. A game where players can quickly decide whether or not they like a certain feature could be invaluable for companies planning to release new products or enter new markets.

I’m barely scratching the surface here, guys, but the amount of value from the data we could source from interactions with games could be ENORMOUS.

10 years ago, social data was quite the rage. Companies needed to figure it out, and now, we see social marketing managers staffed within nearly every organization with even the smallest digital presence. I think we will see the same shift in gaming.

If I may be so bold, I predict that in the next 10 years, we will begin to see Gaming Strategist roles begin to pop up in organizations and agencies, as well.  These will be individuals who understand human attention, understand what creates curiosity, mystery, reward, and prowess. This individual will partner with Loyalty teams to create memorable and growth experiences, R&D teams to create market-ready products, and Sales/CRM teams to create razor-sharp targeting profiles.


Who’s already doing it?

Waze does it for driving conditions, and now Google is using them for their maps app. Foursquare was one of the first to begin giving users badges and status. Any type of “forum” nowadays gives the user’s rank alongside their comments. Rally Health and Novu are doing it for health tracking and living a better lifestyle. Mastercard has been creating games with a charity element attached to it.


Games are fun.

And games are data. The game (“fun”) component creates interest/loyalty and the side-effect of game-play data could lead to a fascinating behavioral analysis around how particular segments (or direct customers) make decisions.

Game on.  -Pete



Originally published on LinkedIn April 7, 2016.

This is Your Life! (brought to you by LinkedIn & Python & Gephi)

Alas, this totally awesome feature from LinkedIn API will discontinue its availability sometime in May of 2015. My hopes are that LinkedIn either changes its mind or allows a partner to still continue lending the data to aspiring analysts (such as myself). Regardless, I wanted to take a moment and share a little bit about my experience with some extraordinarily powerful tools explaining network data using the soon-to-be-discontinued LinkedIn API.

Before I begin, I need to give the majority of the credit for this experience to the author of this post on Linkurio. The author goes by the pseudonym, Jean, and that’s all I’m able to find out. Thank you, Jean, whoever you are! Also, thanks to Thomas Cabrol for compiling some excellent code as a place to start.


Ok, to start we must begin by creating an API account, which can be done by visiting the LinkedIn developer URL:

Add a new application, and then record the API and User keys because they are needed in the Python code below.  Note: if you are unable to retrieve the user tokens, the Linkurio post provides an option on how to collect them. Here is the code I ran:

[su_heading]Step 1: Connect to LinkedIn API[/su_heading]

#!/usr/bin/env python
# encoding: utf-8

Created by Thomas Cabrol on 2012-12-03.
Copyright (c) 2012 dataiku. All rights reserved.

Building the LinkedIn Graph
#Note: run this first, and then run

from __future__ import division, print_function

import os

import oauth2 as oauth
import urlparse
import simplejson
import codecs

# USER_TOKEN, and USER_SECRET from the credentials 
# provided in your LinkedIn application
USER_TOKEN = xxxxx;
USER_SECRET = xxxxx;

OUTPUT = "linked.csv"

consumer = oauth.Consumer(key=CONSUMER_KEY, secret=CONSUMER_SECRET)
token = oauth.Token(key=OAUTH_TOKEN, secret=OAUTH_TOKEN_SECRET)
client = oauth.Client(consumer, token)
# Fetch first degree connections
resp, content = client.request('')
results = simplejson.loads(content)    
# File that will store the results
output =, 'w', 'utf-8')
# Loop thru the 1st degree connection and see how they connect to each other
for result in results["values"]:
    con = "%s %s" % (result["firstName"].replace(",", " "), result["lastName"].replace(",", " "))
    output.write("Peter Eliason," con "n")
    # This is the trick, use the search API to get related connections
    u = "" % result["id"]
    resp, content = client.request(u)
    rels = simplejson.loads(content)
        for rel in rels['relationToViewer']['relatedConnections']['values']:
            sec = "%s %s" % (rel["firstName"].replace(",", " "), rel["lastName"].replace(",", " "))
            output.write(con "," sec "n")


This code will connect to my LinkedIn account and DOWNLOAD my ENTIRE 1st and 2nd degree network!  Spooky, but awesome.

Now that the data has been downloaded, the next step is to clean it up.  We may run the below to remove bad characters and set everything to lowercase:


[su_heading]Step 2: Cleaner code[/su_heading]

#!/usr/bin/env python
# encoding: utf-8

Created by Thomas Cabrol on 2012-12-04.
Copyright (c) 2012 dataiku. All rights reserved.

Clean up and dedup the LinkedIn graph
#note: run first

from __future__ import division, print_function

import os


import codecs
from unidecode import unidecode
from operator import itemgetter
INPUT = 'linked.csv'
OUTPUT = 'linkedin_total.csv'
def stringify(chain):
    # Simple utility to build the nodes labels
    allowed = '0123456789abcdefghijklmnopqrstuvwxyz_'
    c = unidecode(chain.strip().lower().replace(' ', '_'))
    return ''.join([letter for letter in c if letter in allowed])
def cleaner():
    output = open(OUTPUT, 'w')
    # Store the edges inside a set for dedup
    edges = set()
    for line in, 'r', 'utf-8'):
        from_person, to_person = line.strip().split(',')
        _f = stringify(from_person)
        _t = stringify(to_person)
        # Reorder the edge tuple
        _e = tuple(sorted((_f, _t), key=itemgetter(0, 1)))
    for edge in edges:
        output.write(edge[0] "," edge[1] "n")
if __name__ == '__main__':

The next part of the Python code uses a library called NetworkX to create a file format called graphml which can be imported by a network graphing tool called Gephi. NetworkX is actually capable of far more than simply converting API files to graphml, but we’ll hold off on that tangent for another post. For now, we’ll focus on Gephi and graphml.

[su_heading]Step 3: Create graphml file[/su_heading]

# Defining and Visualizing Simple Networks (Python)

# prepare for Python version 3x features and functions
from __future__ import division, print_function

import os


# load package into the workspace for this program
import networkx as nx
import matplotlib.pyplot as plt  # 2D plotting

# read Wikipedia Votes data creating a NetworkX directed graph object g
f = open('linkedin_total.csv', 'rb')
g = nx.read_edgelist(f, delimiter=",", create_using = nx.DiGraph(), nodetype = str)


Ok, so now we’ve got our graphml file. The next step is to import it into this tool called Gephi. You’ll need to download Gephi as an executable — it’s not a Python library or anything like that. It is a standalone visualization tool.

I’m a Windows user and I had problems getting Gephi to install properly. I was able to work around this by UNINSTALLING Java, and then reinstalling an old version of Java, version 7. After I did this, I was able to install Gephi without problems.

I’m told that Mac users are able to install Gephi with no problems. Figures, ha!

Now, after importing the graphml file into Gephi, I took these steps:

  1. On the left-hand-side, ran “Force Atlas 2.”  It takes a LONG time for the process to complete, so I cancelled it after about 10 minutes because the visualization was close enough for my needs.
  2. Activated the “Show node labels” to see who each node represented
  3. Ran the modularity algorithm in the Statistics panel (on the right). I went to the partition window (select Window > Partition) and choose to color the nodes according to their “Modularity class”


I’m left with a stunning graph of my network with me listed as the center node.  Each color represents a certain cluster within my list of connections. If you look closely, you can see that some nodes have names next to them (I’ve purposefully left them obscenely small to protect the identities of my connections), but Gephi allows the analyst to zoom in and out in order to explore the network.


After only a couple minutes, it becomes blatantly clear which each of these clusters and colors represent. They’re a representation of ME and MY LIFE!  The incredibly beautiful part of this entire process was that the analysis was entirely undirected! I had nothing to do with direction the creation of these clusters…NetworkX and Gephi did all of that for me by themselves!

To call attention to each of these clusters, I’ve gone ahead and named each cluster, here.  Each cluster represents a key time and network (aka: clique) in my life.


The Undergrad section represents all of my connections from my undergrad school, Luther College in Decorah, IA.

MSPA represents grad school connections (in another state, and 10 years after undergrad, so not much connection between those two networks!) as part of Northwestern University in Evanston, IL.

Also interesting, Best Buy had some hard years back in 2008-2010 and a lot of folks left Best Buy to join SUPERVALU, which explains the many connections between the two.

The fascinating thing about this analysis, is that through LinkedIn, I have a basic map of my Personal AND Professional life.


While this particular map may not be extraordinarily useful for advancing my career, it allows me to be reflective on the state of my network, and in essence, a brief story of my life.

In a business setting, however, I can see how this process might be interesting in identifying clusters, tribes, and influencers using relationship data to understand influence of products, lifestyles, and consumer choices..

Machine Learning is not Artificial Intelligence

Remember the first time you heard the words “Big Data?” Well, there’s a new buzzword in town — “Machine Learning.”

Ok, when I say “Machine Learning,” what happens in your mind? What images have I conjured by saying “Machine Learning?” Maybe, you saw a brief shadow of a floating, intelligent, robotic metal squid, or a flying Keanu Reeves? Maybe, you heard the name “Ah-nold” or “I’ll be back” with occasional lasers flashing in the distance.

Well, I’m sorry to say that I’m here to burst your bubble. Pop! There it goes…  When we discuss within the context of statistics and analytics, Machine Learning is NOT the same thing as Artificial Intelligence.

Machine Learning isn’t even a super simple, intuitive approach to data modeling and analytics. Machine Learning basically has to do with the fact that technology has finally come so far as to allow computers to apply brute-force methods and build predictive models that were not possible 30 and even 15 years ago. You may have actually already heard of many Machine Learning algorithms — for example: Decision Trees, Neural Networks, Gradient Boosting, GenIQ, and even K-means clustering.  Many analytical tools, such as Python and R, already support these modeling techniques. The SciKit Learn package in Python offers a great tutorial in Decision Trees.

Ultimately, what I want you to walk away with is that, when we talk about statistics and analytics, Machine Learning isn’t some super-fancy, futuristic process that will enlighten all of your analytics capabilities. It is actually a set of functionality that already exists and can be drawn upon to create predictive models using heavy computer processing.

If you’re interested in learning more, I’ll recommend the book “Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data” by Bruce Ratner. He talks about many of these techniques, what they are used for, and how to avoid pitfalls..

SQL Server to SAS ODBC: The Partial Truth


There are occasions when your ODBC connection from SAS to SQL Server returns partial data without an error flag, and you’d never know that you were missing data from looking at the SAS log.


I was running a SAS script in SAS Enterprise Guide (SASEG) which uses SQL pass-through to interface and retrieve data from Microsoft SQL Server. In this instance I wanted to create a dataset of all coupon redemptions in a given time frame. Below is the SAS script I was running (grayed out a few proprietary database names just to be safe).

sas code


Checking the log, my script ran without error and created a dataset with 15,330 rows.


first sas try


Later in the day, I wanted to change some code further down in the script. Since our database is not live, I would anticipate that a single query would return the exact same result set each time it’s executed.  However, I noticed that my row counts were changing each time I ran the script.  The second time I ran the script, it returned 2902 rows, and then the third time I ran the script it returned 13,816 rows!

third sas trysecond sas try

What was going on?!?

Confused, I took the exact SQL code and executed it directly in the SQL Server Management studio.  This time I noticed an error “Arithmetic overflow error converting expression to data type bigint.” However, even though there was an error, SQL Server still returned 1737 rows.sql server output 1

I ran the exact query a second time. This time it returned 3813 rows!



It seemed that SQL server was encountering an error with a particular record and would return all rows up to the encounter, but somehow the ERROR message never made its way to SAS.  Also, since there was no “order by” clause in the query, the data results seemed to be ordering by some SQL Server clustering technique that might have been triggered by a random seed (which explains why my row counts always changed).

sql code

After removing the “cast” function, the query returned 19,994 records without error, and consistently returned 19,994 records every refresh thereafter.  I had discovered the issue, but I remain very concerned that SAS did not acknowledge the partial list and overflow error at all in the execution log.  If my row counts had not changed, I would have never realized that there was an error in the data.


Possible Solutions:

1) Execute every SQL Server script in SQL Server Management Studio prior to pasting into SAS. This could be a huge pain, because macros in pass-through scripts must be resolved prior to running.  An alternative to this would be to stress test the SQL scripts in SQL Server Management Studio, where we execute every possible permutation of the SAS query in order to try and protect against any subset.


2)  <TBD> Is there an error catching option? I have tried errorstop, stopontruncate, and exitcode options in the proc sql statement, but none of them seem to warn the user of this error. Please comment on this post if you have a suggestion, and I will update the post! Thanks.


Control Redeemer Effect

Intro to Test vs. Control

One of the beautiful things of CRM analytics is the ability to design controlled experiments. Random sampling of a homogeneous group of individuals allows CRM professionals to test the impact of a particular CRM offer or message against another offer or no offer, and then after applying some statistics conclude whether or not that offer or message drove a desired behavior, like an open, click, or conversion. This process is generally referred to as Test versus Control.


One potential pitfall that muddies the analytical waters is dealing with customers that potentially “cross-over” from the Control side to the Test side during the campaign. Take, for example, a customer who was assigned to the Control group but somehow ended up redeeming an offer that was assigned to the Test group.

*Gasp!* What do we do now? Our experiment is ruined!

Well, not to fear (sort of). One widely accepted method to handle this problem is to exclude these Control Redeemers, as they are called, from the analysis entirely. The decision to exclude them is supported by the fact that these customers are marked as “dirty” for somehow getting their hands on an offer that was intended for someone else. Herein lies the issue!

These “Control Redeemers” tend to be more engaged customers.

  • Therefore, I believe that excluding them creates an unfair bias.
  • It seems that this bias is magnified with longer campaigns.
  • In the below, “Best” refers to the best, most engaged customers, and “Sec.” refers to secondary customers, or customers who are not as likely to redeem anyway.

bar chart comparison


The longer a campaign, the higher the impact from Control Redeemers.

  • I noticed this pattern within a recent campaign I analyzed and was having a hard time explaining why results appeared so much stronger as the campaign carried on.
  • To support my hypothesis of the Control Redeemer Effect, I conducted Sales Lift calculations using two methods:
    • Calculating Test vs. Control lift including Control Redeemers in the control group.
    • Calculating Test vs. Control lift excluding Control Redeemers in the control group.
  • For a 4-week campaign, the lift is 1.2 times higher when excluding Control Redeemers.
  • For a 52-week campaign the lift is 3.7 times higher when excluding Control Redeemers.




I felt my hypothesis had been validated. As the length of the campaign increased, it allowed highly-engaged customers more of an opportunity to get their hands on a coupon (either from a friend/family member, from generous Test group customers who leave their coupons in public places, or from gracious sales associates at the store), and therefore dilute the overall sales of the control group.

There are several ways to mitigate this Control Redeemer issue.

  • Assign unique offer codes to customers. This way coupon sharing is impossible, and Test vs. Control groups stay cleaner.
  • Stratify Test vs. Control results by some sort of “Customer Value” segmentation. Some companies have a “Metals” or “Value” segmentation which ranks customers from Best to Worst. Stratifying results by this segmentation would alleviate some of the pressure.
  • Consider replacing the “control redeemer” with a composite of match-pair control customers (from a pool of control customers who have not redeemed), matching on keys from an arbitrary pre-period (say 1 month, 3 months, or a year depending on average purchase cycles).  Note: this option is likely going to be analysis-heavy.
  • If neither of these above methods are feasible, then ultimately, the “real” answer of “campaign lift” is probably somewhere in between the “including control redeemers” and “excluding control redeemers” methods.


Please let me know if you have other thoughts or ideas on this measurement methodology topic!



Sales Lift calculation (not standard, by any means, but a common one):

left { frac{[TestGroupCampaignSales]}{[TestGroupTargetedCounts]}-frac{[ControlGroupCampaignSales]}{[ControlGroupTargetedCounts]} right }  ast [TestGroupTargetedCounts]= [SalesLift]

In the above, we include Control Redeemers in the total sales lift.  To exclude Control Redeemers, we use the below (note: I had to remove the vowels from a few terms below  because there is a text limit to the equation editor):

left { frac{[TestGrpCampaignSales]}{[TestGrpTargetedCnts]}-frac{[CtlGrpCampaignSales] - [CtlRedeemerSales]}{[CtlGrpTargetedCnts]-[CtlRedeemerCnts]} right }  ast [TestGrpTargetedCnts]=[SlsLift]


Python and Web Scraping (using Scrapy)

Certainly the most extensible scripting language I have ever used, Python allows the user to build powerful programs ranging from web crawling to text mining to machine learning. With invaluable packages, NumPy and SciPy, Python is able to tackle complex modeling tasks, while at the same time, other packages such as BeautifulSoup and Scrapy allow for thorough data collection through web crawling and scraping.

In the Tableau Project below, I have provided an example (with code included on the second tab) of how web crawling and data collection work, by taking a snapshot of my old motorcycle model and comparing prices from two different markets. The data was scraped using Scrapy and exported into a CSV file which I imported into Tableau.

[su_heading]Here is the Spider code:[/su_heading]

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_mcy.items import CraigslistMcyItem
import re, string
class MySpider2(BaseSpider):
  name = "craigmcy2"
  allowed_domains = [""]
  start_urls = [" 900",
                " 900",
                " 900&s=100"]

  def parse(self, response):
      hxs = HtmlXPathSelector(response)
      titles ="//p[@class='row']")
      items = []
      for title in titles:
          item = CraigslistMcyItem()
          item ["title"] ="span[@class='txt']/span[@class='pl']/a/text()").extract()
          item ["link"] ="span[@class='txt']/span[@class='pl']/a/@href").extract()
          item ["postedDt"] ="span[@class='txt']/span[@class='pl']/time/@datetime").extract()
          item ["price"]"a[@class='i']/span[@class='price']/text()").extract()
          item ["debug"] = "" #blank for now...before, it was:"a[@class='i']").extract()
          item ["location"] = re.split('[s"] ',string.strip(str("//title/text()").extract())))
      return items	

[su_heading]Items code:[/su_heading]

from scrapy.item import Item, Field

class CraigslistMcyItem(Item):
  title = Field()
  link = Field()
  postedDt = Field()
  price = Field()
  debug = Field()
  location = Field()

[su_heading]Run code (aka “Main”):[/su_heading]

import os
import scrapy  # object-oriented framework for crawling and scraping

os.system('scrapy list & pause')
os.system('scrapy crawl craigmcy2 -o craigslist_peter.csv')


Grad School Progress

The field of analytics is constantly evolving. I have enrolled in Northwestern University’s Masters of Science, Predictive Analytics program (in Evanston, IL) to help provide me with a fresh perspective on today’s top methodologies, tools, and business case studies.  You can track my grad school progress with a gantt chart that I created using Tableau. I will keep this up-to-date until I’ve earned my degree (expected 2016).