Tecflix

Python or R?

11 May 2021

The Quest for the Best in Analytics

I am regularly asked by my customers to share my thoughts on the perennial R vs Python question. The answer is usually the same: you need both! Let me explain why.

The R Project for Statistical Computing, or simply R, is a 27-year-old programming language and environment for statistical computing, machine and statistical learning, data preparation, and visualisation. It is the most popular statistical programming language, and it ranks as the 6th or the 11th most popular programming language overall according to 2020–2021 surveys such as IEEE Spectrum and the TIOBE Index. Notably, it is the only domain-specific (statistical computing) language in many popular top-20 rankings. While it is open-source software (OSS), many software vendors have produced their own versions of R. In addition to the OSS R, I use one other variant, Microsoft R, which has been part of SQL Server since version 2016, and I teach it in my classroom ML courses. R is actively developed, and it has an ecosystem of fewer than 20k packages whose developers tend to be statisticians, mathematicians, bioinformatics researchers, and academics in general. It is widely taught in university-level statistics courses and, as such, it has replaced the granddaddy of statistical software, SAS, in academia.

Python is a 30-year-old, general-purpose programming language presently (2021) ranking 1st or 3rd on the IEEE Spectrum and TIOBE indices. It is suited to many tasks, including scripting and some data preparation. However, without the help of additional environments and packages it could not be used for statistical computing, as it lacks native statistical data handling functionality. Fortunately, popular environments, notably Jupyter, and a selection of numerically optimised packages such as numpy and scipy, often installed as part of Anaconda, make it appropriate for data science, visualisation, collaboration, and machine learning, especially deep learning. While deep learning is of no immediate value to many of my business customers, machine learning is important. Python is OSS, but it should be noted that its creator and overseer, Mr Guido van Rossum, has been a Microsoft employee in the role of a Distinguished Engineer since 2020. The OSS version of Python ships in several Microsoft products, including SQL Server since version 2017, and the language has been heavily promoted and supported by the company. Because of Python’s good design, it is widely used for teaching IT. Its ecosystem includes over 300k packages. While the quality of those packages varies, they cover such a breadth of subjects that a discerning Python programmer should be able to solve every business problem. As a point of reference, the even more wildly varied ecosystem of ECMAScript/JavaScript counts over 1m Node.js packages.

In my opinion, and in line with industry trends and research, R is better suited to advanced analysis and data science than Python because of its purposeful design and structure aimed at statistical computing. It also comes with many tools and published approaches that aid a migration from SAS and other legacy analytical environments that my customers use. R also offers both a greater breadth and a greater depth of statistical techniques. R visualisation tools have been the primary publication tool of the BBC, the New York Times, Twitter, and Google for many years. Arguably, a statistician will be more productive in R, requiring fewer lines of code to achieve a goal, unless they have a programming background. In that latter case, Python will have a greater appeal, being a well-designed programming language (more elegant than R!) that is suited to many tasks and not just statistics and analytics. All of the organisations mentioned, and many other major corporations, use Python for general programming.

I strongly believe that neither R nor Python is well suited to serious, ongoing data preparation. More traditional tools, like SQL, especially with a dedicated team of data preparation professionals, should be the focus of analytical data management practices in any but the smallest of organisations. Further, both R and Python can be overly demanding of computing resources, and any production-ready code can be highly optimised by the use of dedicated runtimes and packages, such as those that come with Microsoft SQL Server 2019+, or by combining it with open-source parallelisation frameworks, like Apache Spark.
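The division of labour described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `sales` table, its columns, and the function names are hypothetical, and SQLite (from Python's standard library) stands in for a production database such as SQL Server. The SQL statement does the data preparation, and Python only analyses the prepared result.

```python
import sqlite3
import statistics


def build_demo_db():
    """Create an in-memory database with a hypothetical `sales` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (month, amount) VALUES (?, ?)",
        [
            ("2021-01", 100.0),
            ("2021-01", 50.0),
            ("2021-02", 200.0),
            ("2021-03", 120.0),
            ("2021-03", 130.0),
        ],
    )
    return conn


def prepare_monthly_totals(conn):
    """Data preparation stays in SQL: one aggregated row per month."""
    query = """
        SELECT month, SUM(amount) AS total
        FROM sales
        GROUP BY month
        ORDER BY month
    """
    return conn.execute(query).fetchall()


def summarise(totals):
    """Analysis happens in Python, on the already prepared data."""
    values = [total for _, total in totals]
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


conn = build_demo_db()
totals = prepare_monthly_totals(conn)
summary = summarise(totals)
print(totals)
print(summary)
```

In a real project the query would typically live in the database as a view or stored procedure maintained by the data preparation team, and the Python (or R) code would only read its output.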

Multiple studies, notably those by Gartner, IDC, O’Reilly, Karl Rexer, and Stack Overflow, show that the preferred toolkit of working data scientists, advanced analysts, and statisticians includes (in order of popularity) Excel, SQL, Python, and R. The experience of all of my relevant clients supports those findings: their work combines SQL with Python and/or R, often in a single project.

Both R and Python/Jupyter are suited to reproducible research, whose principles are easily implemented using the concept of interactive notebooks. I teach reproducible research on my R courses, and I use and consult on it for almost all of my customer projects nowadays. Those notebooks can be implemented locally, e.g. on a laptop, but it is easier to collaborate using a dedicated server or, if appropriate, a cloud service such as Azure Machine Learning Notebooks. On-prem, you can use a server environment composed of RStudio Server for R and JupyterHub (a multiuser version of JupyterLab) for Python. Notebooks created within either are capable of running code written in other languages, including R, Python, SQL, Stan, bash, and others. However, JupyterHub is more suited to Python-heavy projects (even if they include some R code), while RStudio Server suits R-first projects (even those containing some Python and SQL). Both allow the use of locally installed development tools like Microsoft Visual Studio Code or RStudio, and they work well through a modern web browser. Both can be obtained at no cost as OSS, or through a support-contract licence at a cost 1–2 orders of magnitude lower than legacy software. Both work with other visualisation and exploration tools, including Power BI, which is also slowly but surely moving into the space of advanced analytics and research.

There is much enthusiasm for both Python/Jupyter and R at present. If you are looking for a general-purpose programming language, Python fits the bill. If you are a data analytics geek, R is superb. However, for the best of what is out there, you should know a (good) bit about both. As of today, it would be risky to limit your choice of analytics environments to only R or only Python. Looking towards the future, it may also be worthwhile to keep an open mind about potential, upcoming analytical languages such as Julia, the convergence of other languages with analytics, or the decay of the existing ones as the industry matures.

I hope this helps you make up your mind about R and Python.

Rafal

R logo shown above © The R Foundation CC-BY-SA 4.0. Python logo shown above © Python Software Foundation.
