Advice for aspiring data scientists: Learn SQL

My advice to data scientists hoping to get into IT or business: Learn SQL.


Pronounced “sequel,” SQL has been around since the 70s and remains the most popular language for interacting with databases today. Please note the subtle choice of words here…I said “interacting with databases,” not “interacting with data.” There continue to be strides made by the open source community and other tool sets that make working with data a breeze, but the work of bringing the data down from the database largely requires the work of someone who knows a thing or two about SQL.

My understanding is that SQL was originally created for marketers to be able to interact with databases on their own! The fire didn’t really catch though, and technical resources and analysts soon became the bearers of the skill over time.

“But I’m a data scientist, can’t someone else get the data for me?”

Sure, I get that point of view. But there are also some limitations to one’s productivity by abstaining from learning SQL. For example:

  1. You will be limited to only organizations large enough to staff a full-time database extract resource to pull your data down for you. Not every organization can afford to specialize data extraction apart from data science. By learning SQL, you will be able to expand your prospective workplace search by spanning into both roles.
  2. There will be misunderstandings between you and your data extraction partner, in terms of what data you need for your work. I did not say “there may be…” I said “there will be.” By learning SQL, you will be forced to understand the cardinality and sources of the data at the level of detail you require. Even if you are fortunate enough to have a data extraction partner on staff to support your data needs, you will find that your communication will benefit from a strong grasp of SQL. For example, stating that you need “sum of sales and count of distinct transactions grouped by region” will make your data partner’s life much, much easier.
  3. You will have to rely on someone else’s competing bandwidth. Imagine your boss demanding updated model results by end-of-day, but your data extraction partner is on vacation or simply busy with another project. How does that make you look? Even if that situation only happens a few times, imagine how your performance looks compared to a peer data scientist who knows SQL and can consistently meet deadlines.


My prediction is that SQL will likely be around for another 20 years, so learning it now will keep you relevant and valuable in the short-term and the long-term.

As far as data querying go (eg: SELECT statements), SQL syntax is generally common between platforms, so a Microsoft SQL Server version of SQL will be relatively similar to an Oracle version of SQL. You may not be able to completely copy a giant SQL statement and run it in another system, but it will be pretty close. As a data scientist, of all the SQL you write, 95% or more will be data querying, so don’t get too worked up about learning data INSERT or UPDATE statements across platforms.

Even some of the most advanced massively parallel processing (MPP) systems are using SQL, for example Redshift, Google BigQuery, Netezza, Vertica, and Aster Data. Many of these are relatively new platforms who have chosen to support SQL.

“But what about the new data management platforms that don’t use SQL, like Hadoop and NoSQL?”

Sure. I would say to learn those platforms, as well! The point I am trying to make here is it that it will do your career well for you to become autonomous at extracting and providing your own data. Data will never become less abundant, and by enabling yourself to pull the data from its source, you become significantly more valuable to an employer.