My customers have been hiring us to do data mining, machine learning, data science, and AI projects since 2007. All the very early work was done using the little-known SQL Server Analysis Services Data Mining components. Even though they were deprecated back in 2017, they still ship with SQL Server 2019. They are past their prime, however, and I can no longer recommend them for anything other than education and data exploration. For those reasons, around 2014, I started using R for data science and statistical learning, i.e. the more classical machine learning, not deep learning or anything too esoteric. R is still, by far, the best environment for advanced analytics, because it has been built for statistical computing from the ground up. As an aside, check out my recent comparison of advanced analytics in R vs Python.
Let’s get back to our customers. Many organisations keep their critical business data in a SQL database. Connecting R to SQL Server in a scalable and reliable way has always been a bit of a pain, even though ad-hoc connections are easy. Things greatly improved when Microsoft shipped SQL Server 2016 with the just-acquired Revolution Analytics R, notably the RevoScaleR packages, making it easy and highly performant not only to build ML models in R using SQL Server data, but also to deploy them at scale. Things got even better with the introduction of the T-SQL PREDICT statement in SQL Server 2017: nanosecond-scale predicting became surprisingly easy to deploy. All the pipeline engineering, especially the mundane and time-consuming data preparation, could be left to SQL experts rather than being forced into too many new and unreliable approaches. It has always been about being data-centric in the real world! As I have been pointing out for years, your choice of technology, algorithm, or platform is far less important than your data, and the quality of the question you are trying to answer with it.
The good news is that with today’s announcements all of the above will continue, and it will get better. Much better! The future is safe. Let me explain.
There are a few bits to this puzzle. My friend Niels Berglund has written about the SQL/ML internals extensively on his blog. Let me simplify by saying that there are two keys to it all. First, we want to keep data inside SQL Server while building models, for many good reasons. Second, we do not want any R, Python, Java, or other sloooow code running our predictions, if avoidable.
The first thing is accomplished by the RevoScale algorithms, the magic of switching compute contexts to RxInSqlServer, and the pointer-like objects that connect R to SQL data, i.e. RxSqlServerData, avoiding unnecessary copying. These were proprietary and required a special edition of the R engine installed as part of ML Server, or the rather useless ML Client. Why? Because you could not load that RevoScale ecosystem of packages into a plain-vanilla open-source R. Not even into Microsoft R Open, yikes! Yes, you could have used ML Services for that purpose, if you knew how, but it was inconvenient and unsuited to research-like modelling workflows. Everyone used something like RStudio (or maaaybe VS Code) and that proprietary R engine for model building.
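To make that concrete, here is a minimal sketch of what the compute-context switch looks like in RevoScaleR. The connection string, table, and model formula are hypothetical placeholders, not anyone’s real system:

```r
library(RevoScaleR)  # the proprietary package, soon to be open sourced

# Hypothetical connection string: adjust server, database, and credentials.
connStr <- "Driver=SQL Server;Server=myserver;Database=sales;Trusted_Connection=yes"

# A pointer-like reference to a SQL Server table: no data is copied into R yet.
flights <- RxSqlServerData(connectionString = connStr, table = "dbo.Flights")

# Switch the compute context: subsequent rx* calls execute inside SQL Server.
rxSetComputeContext(RxInSqlServer(connectionString = connStr, wait = TRUE))

# The model is trained next to the data; only the fitted object comes back to R.
model <- rxLogit(Late ~ DayOfWeek + Distance, data = flights)

# Back to local execution for interactive work.
rxSetComputeContext("local")
```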
The great news is that Microsoft will open source those RevoScale packages in order to make them work with the standard, open source R engine available from CRAN. Those private R engines, including Microsoft R Open (MRO), will be discontinued, saving Microsoft some cash and reducing unnecessary installations and pointless confusion. That means we no longer need to install their special version of R, which, to put it politely, has not received regular enough updates; that lack of updates has started to cause issues when using other, much newer packages. I am also looking forward to the new possibilities that open sourcing RevoScale should enable, allowing more of our own code to benefit from the Revo magic. All in all, I am glad I will be able to work in a modern notebook environment, or an IDE of my choice, using common engines, and benefit from the best of ML in SQL Server and Azure SQL.
Microsoft have committed to recompiling and updating those open sourced packages when new versions of R and SQL Server are released. That has got to be easier for them than maintaining a whole private R runtime distribution. I hope they stick to that promise, because they still control the other key to this puzzle.
That other key, as you may remember, is the super-fast PREDICT statement. To make predictions, it uses a serialised version of an ML model that you have built using one of the supported RevoScale algorithms (in Azure SQL you can also use ONNX models). The actual code that does the predicting is highly optimised C++, and it ships inside SQL Server 2017+. It has no dependency on, and no need for, any R, Python, or Java runtime! This scales well, it is easy to secure to the level SQL Server requires, and it is fast. Predicting a billion rows per second is possible. Interestingly, it requires neither ML Services nor the to-be-retired ML Server.
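Here is a minimal sketch of native scoring with PREDICT. The dbo.models and dbo.NewFlights tables, and the column names, are hypothetical; the model is assumed to have been serialised in R with rxSerializeModel():

```sql
-- Hypothetical: dbo.models holds models serialised with rxSerializeModel() in R.
DECLARE @model varbinary(max) =
    (SELECT model_object FROM dbo.models WHERE model_name = 'flight_delays');

-- Native scoring: no R, Python, or Java runtime is involved at predict time.
SELECT d.FlightID, p.Late_Pred
FROM PREDICT(MODEL = @model, DATA = dbo.NewFlights AS d)
WITH (Late_Pred float) AS p;
```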
Why do we still need ML Services, then? There is another interesting use case, one that involves R, Python, Java, or some other future analytical thing (Julia, anyone?). Many of us want the power of those languages and their runtimes, together with their excellent package ecosystems, all made available to our T-SQL code. Again, this is a data-centric workflow, and one you will find in many pragmatic enterprise architectures, stuff that runs the world. Even if data lake aficionados may think of it as an old-fashioned idea, I guarantee you will see many new systems built this way for another 25+ years. In this use case, we are prepared to compromise on some performance, which is unavoidable given the inherent disadvantage of interpreted, rather than compiled, languages (Julia excluded). To call such code, using any runtime you want, you employ sp_execute_external_script. This uses ML Services, not the discontinued ML Server. You have the advantage of not moving the data outside of the database as such, even if there is an order of magnitude more interprocess data shifting than with PREDICT. Usually, I would only use this approach if I could not get the job done using PREDICT. A task well suited to this technique is retraining, updating, or revalidating an existing model on a regular basis, which requires running a full-on R engine, exactly what ML Services provide. As of Nov 2020, Microsoft have already open sourced the ability to add custom language runtimes, paired with CREATE EXTERNAL LANGUAGE in T-SQL. That untied us from the outdated runtime versions that shipped with SQL Server vLongTimeAgo. Thank you! Those custom runtimes, however, could not work with RevoScale, grhrh. It will now become possible, for the first time, to implement this use case even in plain, open source R, thanks to the soon-to-be open-sourced RevoScale packages. It is an excellent option to have, and it vastly expands our horizon.
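As a sketch of that retraining scenario, assuming the same hypothetical tables and columns as above (a real job would add error handling and model versioning):

```sql
-- Hypothetical retraining job: all names and columns are illustrative.
DECLARE @trained_model varbinary(max);

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Runs in the ML Services R runtime, next to the data.
        model <- rxLogit(Late ~ DayOfWeek + Distance, data = training_data)
        # Serialise for native scoring with PREDICT later on.
        trained_model <- rxSerializeModel(model, realtimeScoringOnly = TRUE)',
    @input_data_1 = N'SELECT Late, DayOfWeek, Distance FROM dbo.Flights',
    @input_data_1_name = N'training_data',
    @params = N'@trained_model varbinary(max) OUTPUT',
    @trained_model = @trained_model OUTPUT;

-- Replace the old model so PREDICT picks up the fresh one.
UPDATE dbo.models
SET model_object = @trained_model
WHERE model_name = 'flight_delays';
```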
Do we lose anything when Microsoft discontinues ML Server and ends its support in mid-2022? Yes, but not many of you should be affected, or should care. The two main areas that will no longer be catered for are running RevoScale on Hadoop or Spark engines, and the operationalisation approach based on mrsdeploy. I will not miss mrsdeploy, because it has been quite eclipsed by what is possible with Azure Machine Learning as far as deploying ML web services is concerned. Some organisations, however, may need to think about how to update their code to run more natively on Hadoop and Spark if they still use RevoScale for that purpose. That was a nice thing, but as far as new projects are concerned, no one should be considering Hadoop anymore. Hadoop has been, for all intents and purposes, dead as an analytical engine for a few years. It will remain in use as a backend parallelisation technology, and for running legacy code. Apache Spark ate Hadoop. Spark is very much in vogue, and it is still growing well, which is why there are better, newer ways to use it than by relying on RevoScale. Microsoft offer several Spark platforms, notably inside SQL Server 2019 Big Data Clusters (BDC) and as Azure Databricks. Open source R will become available for SparkR and sparklyr in the upcoming version of SQL Server BDC.
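If you do need to migrate RevoScale-on-Spark code, the more native route looks something like this minimal sparklyr sketch. The local master and the built-in mtcars data are placeholders; on a BDC or Databricks cluster the connection details differ:

```r
library(sparklyr)

# Hypothetical connection: replace "local" with your cluster's master/config.
sc <- spark_connect(master = "local")

# Copy an example data set into Spark; real workloads would read from HDFS/lake.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# Train natively on Spark via MLlib rather than through RevoScale.
fit <- ml_linear_regression(cars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```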
All in all, I am pleased we got the long-overdue clarity from Microsoft about the future of the key parts of SQL-centric, data-focused machine learning. A lot of detail still has to be ironed out. Exactly which API classes will or will not be supported? RevoTreeView, or even something nicer than that, please, like in the old days of SQL DM? What is the long-term future of MRAN and checkpoint? Could PREDICT expand to support more open source model formats, pretty please? That would be very interesting. It is great that Azure SQL Managed Instance got the current, and the upcoming, ML Services, but how about giving the Azure SQL version of PREDICT parity with on-prem? It is a bit of a mess at the moment… What about MicrosoftML, with its excellent boosted trees and the OneClassSVM algorithms? What about the pre-built text and image analysis models that shipped with ML Server (nice to have, but not a biggie if we lose them)? And a good bit more detail on all of the above, please. :)
Needless to say, I will update you on anything I hear as soon as it becomes newsworthy. I expect that detail to emerge when the next version of SQL Server and the Azure SQL services get their public previews.
The key message today was very good. We will be able to run the normal, open source R engine while keeping the benefit of in-database analytics, with the speed and convenience of PREDICT. I cannot wait to see how I can combine the best of SQL Server with the rest of the open source R ecosystem in ways I could not have done before. Best of all, we can leave the hardest job, the data prep workflows, to SQL data experts, who really are the experts on that subject.
If you would like to study machine learning, SQL, and data science with me, please have a look at my existing, up-to-date classroom courses. I will also record new videos, here at Tecflix, once the next release of SQL Server is ready.