# R images based on the Rocker Project
If your JupyterHub users will primarily be using R, we recommend basing your user image on one of the images maintained by the Rocker Project.
## Why?
The Rocker Project provides a collection of images that come with popular R software and optimized libraries pre-installed. It is maintained by people deeply invested in the R community, and provides curated sets of base libraries with versions known to work well together.
## Step 1: Pick a base image
While rocker offers many different images to pick from, my recommendation is to base your image on the `rocker/geospatial` image. It isn't particularly large, and offers a fairly complete and comprehensive set of base packages to build on.
> **Note**
> If you particularly care about shaving off image size, you could also consider the `rocker/verse` image. At the time of writing, it is only about a 300MB reduction in size - which is not much in the context of data science images.
## Step 2: Pick a version
All the rocker images, including `rocker/geospatial`, are tagged based on the R version they use. Look in the tags page for the base image you have picked, and find the tag that corresponds to the R version you want to use. In general, R users seem to upgrade R versions fairly quickly after a new release - so it's a decent bet to just pick the latest released version in the tag list. Pick a fully specified version - so `4.2.2`, rather than `4.2` or `4`. This lets you keep the R version and base set of packages stable until explicitly bumped. You should bump the base version at least once every 6 months.
So at this point, the `Dockerfile` looks like this:

```dockerfile
FROM rocker/geospatial:4.2.2
```
## Step 3: Construct your Dockerfile to add Python
We need to install Python inside the user image, even if our JupyterHub users will not be using any Python. This is for the following reasons:

1. The `jupyterhub-singleuser` executable must be available inside the container image for the image to be used with a JupyterHub.
2. The `jupyter-rsession-proxy` package is required to get RStudio to work with JupyterHub.
While there are many different ways to install Python inside a container image, we choose to use `mamba` - a popular drop-in replacement for the `conda` package manager. Why?

1. It is how repo2docker (and hence mybinder.org) gets its Python, so the consistency is very helpful.
2. We can get any version of Python we want, not just what is supported by the underlying Linux distro. While there are other ways to get arbitrary Python versions, in my experience this has been the easiest and most stable way.
So let's add on to our `Dockerfile`!

```dockerfile
# Install conda here, to match what repo2docker does
ENV CONDA_DIR=/srv/conda

# Add our conda environment to PATH, so python, mamba and other tools are found in $PATH
ENV PATH ${CONDA_DIR}/bin:${PATH}

# RStudio doesn't actually inherit the ENV set in Dockerfiles, so we
# have to explicitly set it in Renviron.site
RUN echo "PATH=${PATH}" >> /usr/local/lib/R/etc/Renviron.site

# The terminal inside RStudio doesn't read from Renviron.site, but does read
# from /etc/profile - so we re-export PATH here.
RUN echo "export PATH=${PATH}" >> /etc/profile

# Install a specific version of mambaforge in ${CONDA_DIR}
# Pick the latest version from https://github.com/conda-forge/miniforge/releases
ENV MAMBAFORGE_VERSION=22.9.0-2
RUN echo "Installing Mambaforge..." \
    && curl -sSL "https://github.com/conda-forge/miniforge/releases/download/${MAMBAFORGE_VERSION}/Mambaforge-${MAMBAFORGE_VERSION}-Linux-$(uname -m).sh" > installer.sh \
    && /bin/bash installer.sh -u -b -p ${CONDA_DIR} \
    && rm installer.sh \
    && mamba clean -afy \
    # After installing the packages, we clean up some unnecessary files
    # to try to reduce image size - see https://jcristharif.com/conda-docker-tips.html
    && find ${CONDA_DIR} -follow -type f -name '*.a' -delete \
    && find ${CONDA_DIR} -follow -type f -name '*.pyc' -delete
```
Let’s explain what exactly is going on here.
### Setting location of the conda environment
We set a `CONDA_DIR` environment variable with the path to where we want to install the conda environment. While many conda installs are put in `/opt/conda`, we choose to put it in `/srv/conda` to match what repo2docker does.
> **Note**
> Why is it called `CONDA_DIR` when we are installing mamba? Backwards compatibility, and the fact that we are still using the conda ecosystem heavily - particularly conda-forge.
### Making tools in the conda environment available on `${PATH}`
We set the `PATH` environment variable to include `${CONDA_DIR}/bin`, so users can invoke the various software tools (like `python`) we install via `mamba` without having to specify the full path. This is also required for JupyterHub to use the image - we will install `jupyterhub-singleuser` into this conda environment later, and adding this directory to `PATH` allows JupyterHub to start `jupyterhub-singleuser` without having to know exactly where it is installed.
### Special considerations for environment variables inside RStudio
The Dockerfile ENV directive sets up environment variables inherited by all the processes started by JupyterHub inside the container. However, RStudio Server does not inherit these! This is mostly due to how RStudio Server starts - by daemonizing itself using the technique of the ancient powerful ones, double forking.
So for env variables to take effect in the R sessions started by RStudio, they need to be added to an `Renviron.site` file. Add one line per environment variable, with the value you want, and it shall be loaded by R on startup when started by RStudio. This way, if you have code in R that looks for a python (or other) executable, it will find it in the correct place.
But wait, what does `PATH=${PATH}` do?! How does that work? If you look at the documentation for the Dockerfile `RUN` directive, `RUN echo HI` is actually equivalent to `RUN ["/bin/sh", "-c", "echo HI"]`! And this `/bin/sh` does get all the environment variables declared via `ENV` up to that point, and so can expand them.

So our `RUN echo "PATH=${PATH}"` gets translated into `RUN ["/bin/sh", "-c", "echo \"PATH=${PATH}\""]`, and `sh` expands `${PATH}` to the full value of the environment variable, as it is enclosed in double quotes. This allows us to tell RStudio Server to set its `PATH` environment variable to the exact value of the `PATH` environment variable in our Dockerfile, in the most roundabout way possible :)
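The same expansion mechanism can be demonstrated outside of Docker. This is a minimal sketch (run anywhere with `/bin/sh`, not part of the image build), using an exported variable to stand in for what `ENV` does:

```shell
# CONDA_DIR here plays the role of a variable declared with ENV in the
# Dockerfile - it ends up in the environment of the /bin/sh running the command.
export CONDA_DIR=/srv/conda

# Single quotes stop the *outer* shell from expanding anything; the inner
# /bin/sh sees the double-quoted string and expands ${CONDA_DIR} itself.
LINE="$(/bin/sh -c 'echo "PATH=${CONDA_DIR}/bin:/usr/bin"')"
echo "$LINE"
# -> PATH=/srv/conda/bin:/usr/bin
```

The literal string `${CONDA_DIR}` never makes it into the output - just as it never makes it into `Renviron.site` during the build.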
So for any variable you want to set in your `Dockerfile` that should be reflected in R processes started by RStudio Server, you need to add an entry to the `Renviron.site` file.
RStudio Server has a terminal too, and that also does not inherit environment variables set with `ENV` - and since it isn't an R process, neither does it inherit environment variables set in `Renviron.site`. So instead we put it into `/etc/profile`, which is loaded by the terminal inside RStudio since it starts `bash` with the `-l` argument - which causes it to read `/etc/profile`! We use the same variable expansion trick as in the previous step. You need to do this for any env variable you want to set in your `Dockerfile` that you want reflected in terminals opened from within RStudio.

Aren't computers wonderful? :)
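The login-shell behaviour is easy to see in isolation. Here's a small sketch using a throwaway `HOME` and `~/.profile` (which a login shell also sources), so we don't have to touch the real `/etc/profile`:

```shell
# Create a fake home directory with a profile file in it.
DEMO_HOME="$(mktemp -d)"
echo 'export GREETING=hello-from-profile' > "$DEMO_HOME/.profile"

# A login shell (-l) sources profile files on startup, so GREETING is set...
HOME="$DEMO_HOME" bash -l -c 'echo "${GREETING:-unset}"'

# ...while a plain non-login shell skips them, so GREETING stays unset.
HOME="$DEMO_HOME" bash -c 'echo "${GREETING:-unset}"'

rm -rf "$DEMO_HOME"
```

The first command prints `hello-from-profile`, the second prints `unset` - the only difference is the `-l` flag that RStudio's terminal passes to `bash`.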
### Installing Mambaforge
We then use the Mambaforge installer to create a conda environment at `$CONDA_DIR`. What exactly does "Mambaforge" give us?

1. A conda environment set to install packages from the community-maintained conda-forge channel, filled with thousands of amazingly well maintained packages in Python and other languages.
2. The `mamba` package manager by default - the faster drop-in replacement for the `conda` package manager.
3. A default version of Python, based on which version of Mambaforge we picked. The Python version a given installer ships with is usually mentioned on the releases page.
Notice that bit of `$(uname -m)` in there? It automatically picks either the Intel CPU installer (x86_64) or the ARM CPU installer (aarch64) based on what architecture we are building our image for. This might not matter to you now, but ARM machines are taking over the world at an exponentially increasing rate (I am writing this on an ARM machine right now) - and by adding support for ARM from the start, we future-proof our image without a lot of extra work!
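You can run the same substitution by hand to see the selection happening. A quick sketch (the version number is just the one pinned above):

```shell
# uname -m prints the machine architecture - x86_64 on Intel/AMD,
# aarch64 on 64-bit ARM - and the installer URL is built from it.
ARCH="$(uname -m)"
URL="https://github.com/conda-forge/miniforge/releases/download/22.9.0-2/Mambaforge-22.9.0-2-Linux-${ARCH}.sh"
echo "$URL"
```

On an x86_64 machine this echoes a URL ending in `Linux-x86_64.sh`; build the image on an ARM machine, and the same Dockerfile fetches `Linux-aarch64.sh` with no changes.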
Since the Mambaforge installer was originally designed to be used on people's laptops, it does leave behind some cached files and whatnot that are of not much use in a docker container - after all, we do not expect our end users to constantly re-install packages inside the container while they are using it (for the most part), so we can save some disk space by cleaning these up before we finish the `RUN` directive. Note that these all need to be part of the same command, otherwise there will be no space saving! See this wonderful blog post for more details.
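To illustrate why the ordering matters, here is a sketch contrasting the two layouts (`some-package` is just a placeholder):

```dockerfile
# Anti-pattern: each RUN commits a layer. The files deleted by the second
# RUN still occupy space inside the first layer, so the image stays big.
RUN mamba install -y some-package
RUN mamba clean -afy

# Better: install and clean up inside the same RUN, so the committed
# layer never contains the cached files in the first place.
RUN mamba install -y some-package \
    && mamba clean -afy
```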
## Step 4: Installing Python Packages with an `environment.yml` file
Now that we have Python installed, it's time to install a few Python packages in the conda environment. First, we create an `environment.yml` file listing the packages we want installed. This is a standard format used in the conda ecosystem.
```yaml
channels:
  - conda-forge
  - nodefaults
dependencies:
  - jupyterhub-singleuser>=3.0
  - jupyter-rsession-proxy>=2
  - nbgitpuller>=1.1.1
```
This is the absolute minimal `environment.yml` file you need. Let's unpack what it has!
### `channels`
conda packages are organized into channels, which are locations where packages are stored. The most common ones are:

- The conda-forge channel, containing packages built by the massive conda-forge community. This is where we want most of our packages to come from.
- The "defaults" channel, which contains packages managed by Anaconda.com, primarily for use with their Anaconda Python Distribution. This is always present by default, but since packages present in it are also usually present in the conda-forge channel, accidental mixing can be confusing and problematic.
- The special "nodefaults" channel, which removes the "defaults" channel from being considered for package installation! You should always include this when building container images.
- The nvidia channel, containing a mix of proprietary and open source packages for use with NVIDIA's CUDA ecosystem. You don't need this unless you are building stuff that uses a GPU.
Specifying `conda-forge` and `nodefaults` is most likely all you will need!
### `dependencies`
This lists the conda packages we want to install in our environment. The packages we must add at a minimum are:

- `jupyterhub-singleuser`, required for any image to work with JupyterHub.
- `jupyter-rsession-proxy`, which provides working RStudio Server support within JupyterHub.
- `nbgitpuller`, very commonly used to distribute notebooks and other materials.

You can add additional packages here if you wish.
> **Note**
> We explicitly chose not to specify a Python version here, so we just use the version of Python that Mambaforge installs by default. This keeps the image size small! If you want a newer version of Python, I recommend changing the version of Mambaforge you are installing.
### Installing the packages into the existing environment
Now that we have the `environment.yml` file, we can add it to our `Dockerfile` and install those packages into the environment.
```dockerfile
COPY environment.yml /tmp/environment.yml
RUN mamba env update -p ${CONDA_DIR} -f /tmp/environment.yml \
    && mamba clean -afy \
    && find ${CONDA_DIR} -follow -type f -name '*.a' -delete \
    && find ${CONDA_DIR} -follow -type f -name '*.pyc' -delete
```
We also clean up after ourselves to keep the image size to a minimum - we will have to do this after every `mamba` command, for the most part!
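The `find ... -delete` lines are ordinary shell, so you can see exactly what they do with a tiny stand-in run anywhere (using a throwaway directory instead of `${CONDA_DIR}`):

```shell
# Build a throwaway tree containing a compiled-bytecode file.
DEMO="$(mktemp -d)"
mkdir -p "$DEMO/lib"
touch "$DEMO/lib/module.py" "$DEMO/lib/module.pyc"

# Same invocation shape as the Dockerfile: delete every *.pyc in the tree.
find "$DEMO" -follow -type f -name '*.pyc' -delete

ls "$DEMO/lib"   # only module.py remains; the .pyc is gone
rm -rf "$DEMO"
```

The `.pyc` files are safe to delete because Python regenerates them on demand; the `.a` static libraries are similarly dead weight in a runtime image.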
## Step 5: Install additional R Packages
While the base rocker image comes with a lot of useful libraries, you might still need to add more. While there are many ways to install R packages, I recommend using the `install2.r` command that comes with rocker by default (via the littler project).

By default, packages will be installed from packagemanager.rstudio.com, which provides optimized binary installs that complete quite quickly!
```dockerfile
RUN install2.r --skipinstalled \
    package1 \
    package2 \
    && rm -rf /tmp/downloaded_packages
```
This will fetch appropriate versions of `package1`, `package2`, etc. and install them only if they are not already installed. If you don't specify `--skipinstalled`, packages from the base image might keep getting reinstalled - wasting time and space. We clean up after ourselves by deleting the temporary directories used for storing downloaded packages before installation.
## Step 6: Install Jupyter Notebook with R support
While most R users prefer using RStudio, there are also many users of the Jupyter IRkernel for using R with Jupyter Notebooks. It is generally a good idea to support this too in images we build for R users.
### Install Jupyter frontends (JupyterLab & RetroLab)
First, we install a couple of Jupyter frontends that people can use to edit their notebooks, by adding them under `dependencies` in the `environment.yml`. Without this, there won't be any Jupyter frontends for people to use for reading / writing Jupyter Notebooks!

I recommend installing the `jupyterlab` and `retrolab` frontends. The former provides an IDE-like experience, while the latter provides a much simpler single-document experience!
With these added, the `environment.yml` would look something like:

```yaml
channels:
  - conda-forge
  - nodefaults
dependencies:
  - jupyterhub-singleuser>=3.0
  - jupyter-rsession-proxy>=2
  - nbgitpuller>=1.1.1
  - jupyterlab>=3.0
  - retrolab
```
### Install the R Kernel for Jupyter (`IRkernel`)
Now that Jupyter frontends are installed, we need to install the Jupyter R Kernel. Without this, only the default Python kernel would be available.
```dockerfile
RUN install2.r --skipinstalled IRkernel \
    && rm -rf /tmp/downloaded_packages

RUN r -e "IRkernel::installspec(prefix='${CONDA_DIR}')"
```
We first install the R package that provides the kernel, and then tell it to 'install' itself as an available kernel for the Jupyter installation we have.
### Tell Jupyter where to start
The rocker images come with a default non-root user named `rstudio`, with a writeable home directory at `/home/rstudio`. While RStudio Server knows to start in `/home/rstudio`, Jupyter does not. We have to tell it explicitly to start in `/home/rstudio`, otherwise it will start in `/` and users will not be able to create notebooks there!
```dockerfile
# Explicitly specify working directory, so Jupyter knows where to start
WORKDIR /home/rstudio
```
### Tell Jupyter to use a better shell when starting Terminals
Jupyter also has the capability to launch terminals, but will default to launching `/bin/sh` as the terminal shell. This is not quite right, as most people expect `/bin/bash` to be launched instead - so features like autocomplete and arrow keys work! We have to explicitly tell Jupyter to launch `/bin/bash` by setting the `SHELL` environment variable.
```dockerfile
ENV SHELL=/bin/bash
```
> **Note**
> RStudio Server doesn't need this to launch `/bin/bash`, so we don't need to repeat the tricks we used in Step 3 to set env vars for RStudio Server here.
## Step 7: Setup zero-to-jupyterhub configuration for home directory
When using zero-to-jupyterhub-on-k8s for running JupyterHub, the default persistent home directory for the user is mounted at `/home/jovyan`. This works for most cases, as the default user name for most container images used with JupyterHub is `jovyan`, and their home directories are set to `/home/jovyan`. However, since rocker's default username is `rstudio` and the home directory is set to `/home/rstudio`, we must explicitly tell z2jh to mount the user's persistent home directory there, with the following config:
```yaml
hub:
  config:
    KubeSpawner:
      working_dir: /home/rstudio
singleuser:
  storage:
    homeMountPath: /home/rstudio
```
This config also tells notebook extensions (such as nbgitpuller) where to start.
> **Warning**
> **NOT SETTING THIS WILL RESULT IN USER DATA LOSS.**
> Without setting this, the user's persistent home directory will continue to be mounted at `/home/jovyan`, but the user will only see `/home/rstudio` by default. This results in data loss, as `/home/rstudio`'s contents are discarded when the user's server stops! You do not want this, so remember to set this configuration!