# Customizing User Storage
For the purposes of this guide, we’ll describe “storage” as a “volume”: a location on a disk where a user’s data resides.

Kubernetes handles the creation and allocation of persistent volumes; under the hood, it uses the cloud provider’s API to issue the proper commands. Accordingly, most of our discussion around volumes will describe Kubernetes objects.
JupyterHub uses Kubernetes to manage user storage. There are two primary Kubernetes objects involved in allocating storage to pods:

- A `PersistentVolumeClaim` (`PVC`) specifies what kind of storage is required. Its configuration is specified in your `config.yaml` file.
- A `PersistentVolume` (`PV`) is the actual volume where the user’s data resides. It is created by Kubernetes using the details in a `PVC`.
As Kubernetes objects, they can be queried with the standard `kubectl` commands (e.g., `kubectl --namespace=<your-namespace> get pvc`).
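For orientation, a `PVC` is a short manifest. Here is a minimal illustrative sketch (the name and requested size are made up for this example, not the values the chart generates):

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: claim-example          # illustrative name only
spec:
  accessModes:
    - ReadWriteOnce            # each user's volume is mounted by a single pod
  resources:
    requests:
      storage: 10Gi            # how much storage is requested
```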
In JupyterHub, each user gets their own `PersistentVolumeClaim` object, representing the data attached to their account.
When a new user starts their JupyterHub server, a `PersistentVolumeClaim` is created for that user. This claim tells Kubernetes what kind of storage is needed (e.g., SSD vs. hard disk), as well as how much. Kubernetes checks whether a `PersistentVolume` object for that user exists (since this is a new user, none will exist). If no `PV` object exists, then Kubernetes will use the `PVC` to create a new `PV` object for the user.
Now that a `PV` exists for the user, Kubernetes must next attach (or “mount”) that `PV` to the user’s pod (which runs user code). Once this is accomplished, the user will have access to their `PV` within JupyterHub. Note that this all happens under the hood and automatically when a user logs in.
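If you want to see this in action for a running user, you can inspect their pod. A sketch, assuming the chart’s usual pod naming of `jupyter-<username>` (verify the actual name with `kubectl get pods`):

```bash
# The Volumes and Mounts sections show the attached PVC.
kubectl --namespace=<your-namespace> describe pod jupyter-<username>
```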
`PersistentVolumeClaim`s and `PersistentVolume`s are not deleted unless the `PersistentVolumeClaim` is explicitly deleted by the JupyterHub administrator. When a user shuts down their server, their user pod is deleted and their volume is detached from the pod, but the `PVC` and `PV` objects still exist. In the future, when the user logs back in, JupyterHub will detect that the user has a pre-existing `PVC` and will simply attach it to their new pod, rather than creating a new `PVC`.
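You can confirm this persistence yourself: after a user’s server stops, their claim and volume should still be listed. For example:

```bash
# Claims are namespaced; volumes are cluster-scoped.
kubectl --namespace=<your-namespace> get pvc
kubectl get pv
```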
## How can this process break down?
When Kubernetes uses the `PVC` to create a new user `PV`, it sends a command to the underlying API of whatever cloud provider Kubernetes is running on. Occasionally, the request for a specific `PV` might fail. For example, your account may have reached its limit of available disk space. Another common issue is a limit on the number of volumes that may be simultaneously attached to a node in your cluster. Check your cloud provider’s documentation for details on the limits of the storage resources you request.
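When provisioning fails, the error returned by the cloud provider usually surfaces as an event on the claim or the pod. A sketch of where to look:

```bash
# Provisioning errors typically appear in the Events section.
kubectl --namespace=<your-namespace> describe pvc <pvc-name>

# Or scan recent events across the namespace, newest last.
kubectl --namespace=<your-namespace> get events --sort-by=.lastTimestamp
```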
> **Note**
>
> Some cloud providers have a limited number of disks that can be attached to each node. Since JupyterHub allocates one disk per user for persistent storage, this limits the number of users that can be running on a node at any point in time. If you need users to have persistent storage and you end up hitting this limit, you must use more nodes in order to accommodate the disk for each user. In this case, we recommend allocating fewer resources per node (e.g. RAM), since you’ll have fewer users packed onto a single node.
## Configuration

Most configuration for storage is done at the cluster level and is not unique to JupyterHub. However, some options are JupyterHub-specific, and we will demonstrate here how to configure those.
Note that new `PVC`s for pre-existing users will not be created unless the old ones are destroyed. If you update your users’ `PVC` config via `config.yaml`, then any new users will have the new `PVC` created for them, but old users will not. To force an upgrade of the storage type for old users, you will need to manually delete their `PVC` (e.g. `kubectl --namespace=<your-namespace> delete pvc <pvc-name>`). This will delete all of the user’s data, so we recommend backing up their filesystem first if you want to retain it. After you delete a user’s `PVC`, a new `PVC` will be created for them upon their next log-in, according to your updated `PVC` specification.
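As a sketch of one way to back up a user’s files before deleting their claim, you can copy them out of the running user pod. This assumes the pod is named `jupyter-<username>` and that the home directory is `/home/jovyan` (the default in the standard single-user images); adjust both to match your deployment:

```bash
# Copy the user's home directory out of their running pod.
kubectl --namespace=<your-namespace> cp \
  jupyter-<username>:/home/jovyan ./backup-<username>
```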
### Type of storage provisioned

A `StorageClass` object is used to determine what kind of `PersistentVolume`s are provisioned for your users. Most popular cloud providers have a `StorageClass` marked as default. You can find your default `StorageClass` by running:

```bash
kubectl get storageclass
```

and looking for the object with `(default)` next to its name.
To change the kind of `PersistentVolume`s provisioned for your users:

1. Create a new `StorageClass` object, following the Kubernetes documentation.

2. Specify the name of the `StorageClass` you just created in `config.yaml`:

   ```yaml
   singleuser:
     storage:
       dynamic:
         storageClass: <storageclass-name>
   ```

3. Do a `helm upgrade` (see the example after this note).

Note that this will only affect new users who are logging in. We recommend you do this before users start heavily using your cluster.
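A sketch of the upgrade command, with the release name, namespace, and chart version left as placeholders to fill in from your installation:

```bash
helm upgrade <release-name> jupyterhub/jupyterhub \
  --namespace <your-namespace> \
  --values config.yaml \
  --version <chart-version>
```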
We will provide examples for popular cloud providers here, but will generally defer to the Kubernetes documentation.
#### Google Cloud

On Google Cloud, the default `StorageClass` will provision Standard Google Persistent Disks. These run on hard disks. For more performance, you may want to use SSDs. To use SSDs, you can create a new `StorageClass` by first putting the following YAML into a new file. We recommend a descriptive name such as `storageclass.yaml`, which we’ll use below:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: jupyterhub-user-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - <your-cluster-zone>
```
Replace `<your-cluster-zone>` with the zone in which you created your cluster (you can find this with `gcloud container clusters list`).

Next, create this object by running `kubectl apply -f storageclass.yaml` from the command line. The Kubernetes docs have more information on what the various fields mean. The most important field is `parameters.type`, which specifies the type of storage you wish to use. The two options are:

- `pd-ssd` makes the `StorageClass` provision SSDs.
- `pd-standard` will provision non-SSD disks.
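To confirm the new class was created, you can look it up by name:

```bash
kubectl get storageclass jupyterhub-user-ssd
```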
Once you have created this `StorageClass`, you can configure your JupyterHub’s `PVC` template with the following in your `config.yaml`:
```yaml
singleuser:
  storage:
    dynamic:
      storageClass: jupyterhub-user-ssd
```
Note that for `storageClass:` we use the name that we specified above in `metadata.name`.
### Size of storage provisioned

You can set the size of storage requested by JupyterHub in the `PVC` in your `config.yaml`:
```yaml
singleuser:
  storage:
    capacity: 2Gi
```
This will request a `2Gi` volume per user. The default requests a `10Gi` volume per user.

We recommend you use the IEC binary prefixes (Ki, Mi, Gi, etc.) for specifying how much storage you want. `2Gi` (IEC binary prefix) is (2 * 1024 * 1024 * 1024) bytes, while `2G` (SI decimal prefix) is (2 * 1000 * 1000 * 1000) bytes.
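After a `helm upgrade`, you can check the capacity that newly created claims receive; existing users keep their original claims and sizes:

```bash
# The CAPACITY column shows the size bound to each claim.
kubectl --namespace=<your-namespace> get pvc
```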
### Turn off per-user persistent storage

If you do not wish for users to have any persistent storage, it can be turned off. Edit the `config.yaml` file and set the storage type to `none`:
```yaml
singleuser:
  storage:
    type: none
```
Next, apply the changes.

After the changes are applied, new users will no longer be allocated a persistent `$HOME` directory. Any currently running users will still have access to their storage until their server is restarted. You might have to manually delete current users’ `PVC`s with `kubectl` to reclaim any cloud disks that might have been allocated. You can get a current list of `PVC`s with:
```bash
kubectl --namespace=<your-namespace> get pvc
```
You can then delete the `PVC`s you do not want with:
```bash
kubectl --namespace=<your-namespace> delete pvc <pvc-name>
```
Remember that deleting someone’s `PVC`s will delete all of their data, so do so with caution!
### Additional storage volumes

If you already have a `PersistentVolume` and `PersistentVolumeClaim` created outside of JupyterHub, you can mount them inside the user pods. For example, if you have a shared `PersistentVolumeClaim` called `jupyterhub-shared-volume`, you could mount it as `/home/shared` in all user pods:
```yaml
singleuser:
  storage:
    extraVolumes:
      - name: jupyterhub-shared
        persistentVolumeClaim:
          claimName: jupyterhub-shared-volume
    extraVolumeMounts:
      - name: jupyterhub-shared
        mountPath: /home/shared
```
Note that if you want to mount a volume into multiple pods, the volume must support a suitable access mode.
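A shared claim like the one above typically needs the `ReadWriteMany` access mode, which in turn requires a storage backend that supports it (for example, an NFS-backed `StorageClass`). A minimal sketch of such a claim; the claim name matches the example above, while the storage class name and size are assumptions:

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jupyterhub-shared-volume
  namespace: <your-namespace>
spec:
  accessModes:
    - ReadWriteMany                      # allows mounting into many pods at once
  storageClassName: <rwx-capable-class>  # assumption: a class backed by e.g. NFS
  resources:
    requests:
      storage: 10Gi                      # assumed size; adjust to your needs
```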