Remember the first time you heard the words “Big Data?” Well, there’s a new buzzword in town — “Machine Learning.”

Ok, when I say “Machine Learning,” what happens in your mind? What images have I conjured by saying “Machine Learning?” Maybe, you saw a brief shadow of a floating, intelligent, robotic metal squid, or a flying Keanu Reeves? Maybe, you heard the name “Ah-nold” or “I’ll be back” with occasional lasers flashing in the distance.

Well, I’m sorry to say that I’m here to burst your bubble. Pop! There it goes… When we discuss within the context of statistics and analytics, Machine Learning is NOT the same thing as Artificial Intelligence.

Machine Learning isn’t even a super simple, intuitive approach to data modeling and analytics. Machine Learning basically has to do with the fact that technology has finally come so far as to allow computers to apply brute-force methods and build predictive models that were not possible 30 and even 15 years ago. You may have actually already heard of many Machine Learning algorithms — for example: Decision Trees, Neural Networks, Gradient Boosting, GenIQ, and even K-means clustering. Many analytical tools, such as Python and R, already support these modeling techniques. The SciKit Learn package in Python offers a great tutorial in Decision Trees.

Ultimately, what I want you to walk away with is that, when we talk about statistics and analytics, Machine Learning isn’t some super-fancy, futuristic process that will enlighten all of your analytics capabilities. It is actually a set of functionality that already exists and can be drawn upon to create predictive models using heavy computer processing.

If you’re interested in learning more, I’ll recommend the book “Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data” by Bruce Ratner. He talks about many of these techniques, what they are used for, and how to avoid pitfalls..

Issue:

There are occasions when your ODBC connection from SAS to SQL Server returns partial data without an error flag, and you’d never know that you were missing data from looking at the SAS log.

I was running a SAS script in SAS Enterprise Guide (SASEG) which uses SQL pass-through to interface and retrieve data from Microsoft SQL Server. In this instance I wanted to create a dataset of all coupon redemptions in a given time frame. Below is the SAS script I was running (grayed out a few proprietary database names just to be safe).

Checking the log, my script ran without error and created a dataset with 15,330 rows.

Later in the day, I wanted to change some code further down in the script. Since our database is not live, I would anticipate that a single query would return the exact same result set each time it’s executed. However, I noticed that my row counts were changing each time I ran the script. The second time I ran the script, it returned 2902 rows, and then the third time I ran the script it returned 13,816 rows!

What was going on?!?

Confused, I took the exact SQL code and executed it directly in the SQL Server Management studio. This time I noticed an error “Arithmetic overflow error converting expression to data type bigint.” However, even though there was an error, SQL Server still returned 1737 rows.

I ran the exact query a second time. This time it returned 3813 rows!

sql-server-output-2

It seemed that SQL server was encountering an error with a particular record and would return all rows up to the encounter, but somehow the ERROR message never made its way to SAS. Also, since there was no “order by” clause in the query, the data results seemed to be ordering by some SQL Server clustering technique that might have been triggered by a random seed (which explains why my row counts always changed).

After removing the “cast” function, the query returned 19,994 records without error, and consistently returned 19,994 records every refresh thereafter. I had discovered the issue, but I remain very concerned that SAS did not acknowledge the partial list and overflow error at all in the execution log. If my row counts had not changed, I would have never realized that there was an error in the data.

Possible Solutions:

1) Execute every SQL Server script in SQL Server Management Studio prior to pasting into SAS. This could be a huge pain, because macros in pass-through scripts must be resolved prior to running. An alternative to this would be to stress test the SQL scripts in SQL Server Management Studio, where we execute every possible permutation of the SAS query in order to try and protect against any subset.

2) <TBD> Is there an error catching option? I have tried errorstop, stopontruncate, and exitcode options in the proc sql statement, but none of them seem to warn the user of this error. Please comment on this post if you have a suggestion, and I will update the post! Thanks.

Month: March 2015

Machine Learning is not Artificial Intelligence

SQL Server to SAS ODBC: The Partial Truth

Issue:

Possible Solutions: