My advice to data scientists hoping to get into IT or business: Learn SQL.

Pronounced “sequel,” SQL has been around since the 70s and remains the most popular language for interacting with databases today. Please note the subtle choice of words here…I said “interacting with databases,” not “interacting with data.” There continue to be strides made by the open source community and other tool sets that make working with data a breeze, but the work of bringing the data down from the database largely requires the work of someone who knows a thing or two about SQL.

My understanding is that SQL was originally created for marketers to be able to interact with databases on their own! The fire didn’t really catch though, and technical resources and analysts soon became the bearers of the skill over time.

“But I’m a data scientist, can’t someone else get the data for me?”

Sure, I get that point of view. But there are also some limitations to one’s productivity by abstaining from learning SQL. For example:

You will be limited to only organizations large enough to staff a full-time database extract resource to pull your data down for you. Not every organization can afford to specialize data extraction apart from data science. By learning SQL, you will be able to expand your prospective workplace search by spanning into both roles.
There will be misunderstandings between you and your data extraction partner, in terms of what data you need for your work. I did not say “there may be…” I said “there will be.” By learning SQL, you will be forced to understand the cardinality and sources of the data at the level of detail you require. Even if you are fortunate enough to have a data extraction partner on staff to support your data needs, you will find that your communication will benefit from a strong grasp of SQL. For example, stating that you need “sum of sales and count of distinct transactions grouped by region” will make your data partner’s life much, much easier.
You will have to rely on someone else’s competing bandwidth. Imagine your boss demanding updated model results by end-of-day, but your data extraction partner is on vacation or simply busy with another project. How does that make you look? Even if that situation only happens a few times, imagine how your performance looks compared to a peer data scientist who knows SQL and can consistently meet deadlines.

My prediction is that SQL will likely be around for another 20 years, so learning it now will keep you relevant and valuable in the short-term and the long-term.

As far as data querying go (eg: SELECT statements), SQL syntax is generally common between platforms, so a Microsoft SQL Server version of SQL will be relatively similar to an Oracle version of SQL. You may not be able to completely copy a giant SQL statement and run it in another system, but it will be pretty close. As a data scientist, of all the SQL you write, 95% or more will be data querying, so don’t get too worked up about learning data INSERT or UPDATE statements across platforms.

Even some of the most advanced massively parallel processing (MPP) systems are using SQL, for example Redshift, Google BigQuery, Netezza, Vertica, and Aster Data. Many of these are relatively new platforms who have chosen to support SQL.

“But what about the new data management platforms that don’t use SQL, like Hadoop and NoSQL?”

Sure. I would say to learn those platforms, as well! The point I am trying to make here is it that it will do your career well for you to become autonomous at extracting and providing your own data. Data will never become less abundant, and by enabling yourself to pull the data from its source, you become significantly more valuable to an employer.

Issue:

There are occasions when your ODBC connection from SAS to SQL Server returns partial data without an error flag, and you’d never know that you were missing data from looking at the SAS log.

I was running a SAS script in SAS Enterprise Guide (SASEG) which uses SQL pass-through to interface and retrieve data from Microsoft SQL Server. In this instance I wanted to create a dataset of all coupon redemptions in a given time frame. Below is the SAS script I was running (grayed out a few proprietary database names just to be safe).

Checking the log, my script ran without error and created a dataset with 15,330 rows.

Later in the day, I wanted to change some code further down in the script. Since our database is not live, I would anticipate that a single query would return the exact same result set each time it’s executed. However, I noticed that my row counts were changing each time I ran the script. The second time I ran the script, it returned 2902 rows, and then the third time I ran the script it returned 13,816 rows!

What was going on?!?

Confused, I took the exact SQL code and executed it directly in the SQL Server Management studio. This time I noticed an error “Arithmetic overflow error converting expression to data type bigint.” However, even though there was an error, SQL Server still returned 1737 rows.

I ran the exact query a second time. This time it returned 3813 rows!

sql-server-output-2

It seemed that SQL server was encountering an error with a particular record and would return all rows up to the encounter, but somehow the ERROR message never made its way to SAS. Also, since there was no “order by” clause in the query, the data results seemed to be ordering by some SQL Server clustering technique that might have been triggered by a random seed (which explains why my row counts always changed).

After removing the “cast” function, the query returned 19,994 records without error, and consistently returned 19,994 records every refresh thereafter. I had discovered the issue, but I remain very concerned that SAS did not acknowledge the partial list and overflow error at all in the execution log. If my row counts had not changed, I would have never realized that there was an error in the data.

Possible Solutions:

1) Execute every SQL Server script in SQL Server Management Studio prior to pasting into SAS. This could be a huge pain, because macros in pass-through scripts must be resolved prior to running. An alternative to this would be to stress test the SQL scripts in SQL Server Management Studio, where we execute every possible permutation of the SAS query in order to try and protect against any subset.

2) <TBD> Is there an error catching option? I have tried errorstop, stopontruncate, and exitcode options in the proc sql statement, but none of them seem to warn the user of this error. Please comment on this post if you have a suggestion, and I will update the post! Thanks.

Tag: sql

Advice for aspiring data scientists: Learn SQL

SQL Server to SAS ODBC: The Partial Truth

Issue:

Possible Solutions: