Setting up ZFS on Talos

Talos, being an immutable distro, is amazing, but it does come with a caveat: deviating from the defaults involves some additional steps. Talos has extension support for those cases, and there are a bunch of official and community extensions ready to go. Rolling your own is also possible, but involves a bit of a learning curve.

I wanted to convert one of my Kubernetes nodes into a ZFS-based storage box. Luckily for me, Talos has a community-maintained ZFS extension to get the ZFS kernel module and userland tools installed on the node. Unluckily, I wasn't able to locate sufficient documentation to get it working on the first try. After a bunch of trial and error (and head banging), I was able to get things figured out.

Configuring the Talos node

To customize the extensions included in a Talos node, one needs to use a purpose-built install image. There are two options: using Talos Factory (the easy way), or using imager to build it locally.

Using the Talos Factory web interface is very straightforward: just select siderolabs/zfs under the system extensions. Doing it in YAML via the HTTP API is just as easy:

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/zfs
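
For the HTTP API route, the upload can be sketched in shell roughly as follows. This is a minimal sketch, assuming the schematic above is saved as schematic.yaml (a filename I made up) and that Factory replies with a small JSON object carrying the schematic id. The helper function names are mine, not part of any Talos tooling.

```shell
# Sketch: upload a schematic to Talos Factory and derive the installer image.
# Function names and the schematic.yaml filename are illustrative assumptions.

# POST the schematic file; Factory is expected to answer with {"id":"<hash>"}.
upload_schematic() {
  curl -fsS -X POST --data-binary "@$1" https://factory.talos.dev/schematics
}

# Extract the "id" value from that JSON reply without needing jq.
schematic_id_from_json() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([0-9a-f]*\)".*/\1/p'
}

# Build the installer image reference for a schematic ID and Talos version.
installer_image() {
  printf 'factory.talos.dev/installer/%s:%s\n' "$1" "$2"
}

# Usage (needs network access):
#   id=$(upload_schematic schematic.yaml | schematic_id_from_json)
#   installer_image "$id" v1.8.3
```

The same schematic always maps to the same ID, so the resulting image reference can be kept in version control next to the node configuration.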

Once you have the installation image, you can slot it into your Talos node configuration, like so:

machine:
  install:
    image: factory.talos.dev/installer/4dd8e3a8b6203d3c14f049da8db4d3bb0d6d3e70c5e89dfcc1e709e81914f63c:v1.8.3

But, as it turns out, that on its own is not enough to get everything working. This is the part that made me lose several hours. Even though the zfs.ko module is present on the filesystem, it isn't loaded into the kernel. To make that happen, one needs to tweak the node configuration once more and use the machine.kernel.modules list to explicitly include zfs.

machine:
  install:
    image: factory.talos.dev/installer/4dd8e3a8b6203d3c14f049da8db4d3bb0d6d3e70c5e89dfcc1e709e81914f63c:v1.8.3
  kernel:
    modules:
      - name: zfs

After the above configuration has been applied to the node, the regular upgrade procedure will make everything ready for use.
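
As a rough sketch of that last step, the upgrade and a quick sanity check could look like the helpers below. The function names, the node address and the image reference in the usage notes are placeholders of mine; the talosctl subcommands themselves (upgrade, get extensions, read) are the standard ones.

```shell
# Sketch: roll the custom image out and confirm ZFS is active on the node.
# Node addresses and image references are placeholders you must supply.

# Upgrade one node to the given installer image.
upgrade_node() {
  talosctl upgrade --nodes "$1" --image "$2"
}

# List the system extensions Talos reports for the node.
list_extensions() {
  talosctl get extensions --nodes "$1"
}

# Succeed only if the zfs kernel module appears in the node's /proc/modules.
zfs_module_loaded() {
  talosctl read /proc/modules --nodes "$1" | grep -q '^zfs '
}

# Usage:
#   upgrade_node 10.0.0.5 factory.talos.dev/installer/<schematic-id>:v1.8.3
#   list_extensions 10.0.0.5
#   zfs_module_loaded 10.0.0.5 && echo "zfs module loaded"
```

If the extension shows up but the module check fails, the machine.kernel.modules entry is the first thing to double-check.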

Accessing the ZFS utilities

Even though the siderolabs/zfs extension includes the ZFS tools (zfs, zpool, ...) on the node, using them is more involved, as Talos doesn't support executing ad hoc commands on the host directly. You need a privileged root shell container on the node, and you have to execute the tools in the host's mount namespace.

First, we need to create a root shell that we can exec into. Depending on your needs, it may be better to use a DaemonSet to get the shell onto multiple nodes, but in the case of a single node, we can just launch a simple Pod, like so:

apiVersion: v1
kind: Pod
metadata:
  name: zfs-shell
spec:
  nodeName: TARGET_NODE_NAME # TODO
  hostIPC: true
  hostNetwork: true
  hostPID: true
  containers:
    - command: ["sleep", "infinity"]
      image: debian
      name: shell
      securityContext:
        privileged: true

Afterwards we can run zfs tools via the nsenter command like so:

kubectl exec pod/zfs-shell -- \
  nsenter --mount=/proc/1/ns/mnt -- \
  zpool status
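
Typing that prefix gets old quickly, so a small shell helper can hide it. This is just a convenience sketch: the on_host name and the ZFS_SHELL_POD variable are my own inventions, and it assumes the zfs-shell pod from above is running.

```shell
# Sketch: wrap the kubectl/nsenter boilerplate in one helper.
# ZFS_SHELL_POD defaults to the pod name used above; override it if yours differs.

on_host() {
  kubectl exec "pod/${ZFS_SHELL_POD:-zfs-shell}" -- \
    nsenter --mount=/proc/1/ns/mnt -- "$@"
}

# Usage:
#   on_host zpool status
#   on_host zfs list
```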

ZFS backed Persistent Volumes

Now we have a ZFS capable node in our cluster with a bunch of disks attached to it. To make this storage available in Kubernetes, we can install OpenEBS and make use of its Local PV ZFS storage engine.

First, we need to create the ZFS pool that will be used by OpenEBS. In my case, I had 6 disks and wanted to create a RAIDz2 pool on them.

kubectl exec pod/zfs-shell -- nsenter --mount=/proc/1/ns/mnt -- \
  zpool create -m legacy -f zfspv-pool raidz2 \
  /dev/disk/by-id/{DISK1,DISK2,DISK3,DISK4,DISK5,DISK6}

By default, ZFS wants to mount the pool's root filesystem under a directory on the host. Instead, we can use the -m legacy parameter to tell ZFS to leave the mounting to us. When OpenEBS creates new filesystems in the pool, it also uses the legacy option, so there's no requirement for the root filesystem to be mounted on the host either.

Once the pool is created, OpenEBS can be easily installed via its Helm chart. After that, some ZFS storage classes suitable for your workloads need to be defined. Below are two basic storage classes I use.

# Storage class for random application files
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: host-zfs-standard
provisioner: zfs.csi.openebs.io
allowVolumeExpansion: true
parameters:
  recordsize: "128k"
  compression: "lz4"
  dedup: "off"
  fstype: "zfs"
  poolname: "zfspv-pool"
# Use allowedTopologies: in case ZFS is only available
# on some of the nodes in the cluster.

---
# Storage class for file storage (documents, photos, videos, etc)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: host-zfs-files
provisioner: zfs.csi.openebs.io
allowVolumeExpansion: true
reclaimPolicy: Retain
parameters:
  recordsize: "1M"
  compression: "lz4"
  dedup: "off"
  fstype: "zfs"
  shared: "yes" # to enable ReadWriteMany access mode
  poolname: "zfspv-pool"
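
To show how a workload would consume one of these classes, here is a minimal sketch that prints a PersistentVolumeClaim manifest. The pvc_manifest helper, the claim name and the size are illustrative assumptions. Only the storage class name comes from the definitions above.

```shell
# Sketch: print a PersistentVolumeClaim manifest for a given storage class.
# The helper name, claim name and size are illustrative assumptions.

pvc_manifest() {
  # $1 = claim name, $2 = storage class, $3 = requested size (e.g. 10Gi)
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: $2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $3
EOF
}

# Usage:
#   pvc_manifest app-data host-zfs-standard 10Gi | kubectl apply -f -
```

For the host-zfs-files class, the access mode would presumably be ReadWriteMany instead, since that is what the shared parameter is there for.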

To enable taking snapshots of the ZFS-backed PVs, a volume snapshot class also needs to be defined.

kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
metadata:
  name: host-zfs-snapshot
driver: zfs.csi.openebs.io
deletionPolicy: Delete
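
With the class defined, taking a snapshot comes down to creating a VolumeSnapshot object that points at a PVC. The sketch below prints such a manifest. The helper name and the snapshot and claim names in the usage line are hypothetical, and it assumes the cluster already has the CSI snapshot CRDs and snapshot controller installed.

```shell
# Sketch: print a VolumeSnapshot manifest for an existing PVC.
# Uses the host-zfs-snapshot class defined above; all other names are placeholders.

snapshot_manifest() {
  # $1 = snapshot name, $2 = name of the PVC to snapshot
  cat <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: $1
spec:
  volumeSnapshotClassName: host-zfs-snapshot
  source:
    persistentVolumeClaimName: $2
EOF
}

# Usage:
#   snapshot_manifest app-data-snap app-data | kubectl apply -f -
```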

Monitoring with Netdata

The ZFS backed PVs should be usable now, but we're effectively flying blind. When a disk fails, we wouldn't know about it unless we manually checked the zpool status every so often. And if enough disks fail, it's sayonara to our data.
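
For completeness, the manual check could be scripted along these lines. It is a stopgap sketch rather than real monitoring: the helper names are mine, and it assumes the zfs-shell pod from earlier is still running.

```shell
# Sketch: ask the host for a pool's health and turn it into an exit status.
# Relies on the zfs-shell pod described earlier; helper names are illustrative.

pool_health() {
  kubectl exec pod/zfs-shell -- nsenter --mount=/proc/1/ns/mnt -- \
    zpool list -H -o health "$1"
}

# Succeeds only when the pool reports ONLINE (not DEGRADED, FAULTED, ...).
pool_is_healthy() {
  [ "$(pool_health "$1")" = "ONLINE" ]
}

# Usage:
#   pool_is_healthy zfspv-pool || echo "zfspv-pool needs attention"
```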

There are some Prometheus metrics exporters for ZFS, but I didn't explore that avenue, as Netdata has built-in support for ZFS data. However, it doesn't work out of the box on Talos, because the zpool command is difficult to access from a container.

I ended up creating a small utility (zfs-http-query) that runs as a DaemonSet on the ZFS nodes and exposes zpool data via a Unix socket. This allows pods that have access to that socket to query ZFS data from an unprivileged container.

With zfs-http-query in place, the Netdata Helm values.yaml can be updated to make the built-in ZFS support work:

child:
  configs:
    zfspool:
      enabled: true
      path: /etc/netdata/go.d/zfspool.conf
      # tell netdata zfspool integration to use the zpool shim from zfs-http-query
      data: |
        jobs:
          - name: zfspool
            binary_path: /opt/zfs-http-query/bin/zpool
  extraVolumeMounts:
    - name: opt-zfs-http-query
      mountPath: /opt/zfs-http-query
      readOnly: true
    - name: run-zfs-http-query
      mountPath: /var/run/zfs-http-query
      readOnly: true
  extraVolumes:
    - name: opt-zfs-http-query
      hostPath:
        type: DirectoryOrCreate
        path: /opt/zfs-http-query
        # contains bin/zpool and bin/zfs shims
    - name: run-zfs-http-query
      hostPath:
        type: DirectoryOrCreate
        path: /var/run/zfs-http-query
        # contains the unix socket used by the shims

With this in place, Netdata will have access to the ZFS pool data and can show graphs about the pool health. It can also send notifications (if properly configured) when the pool state becomes degraded, e.g. after a disk failure.

Sharing storage across nodes

Unlike Ceph or other distributed storage solutions, OpenEBS-based ZFS storage is tied to the node. When a PV is created on a certain node, it can only be accessed on that specific node and cannot be migrated to another node.

I found the csi-driver-smb project, which allows mounting Samba shares into containers. This makes it possible to work around the above limitation by hosting a Samba server on the ZFS-enabled node and using csi-driver-smb to access it from any other node in the cluster.
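
As a rough illustration of the consuming side, the sketch below prints a StorageClass for csi-driver-smb. The share address, the secret name and the secret namespace are placeholders I made up, and the secret is assumed to hold the Samba username and password. The provisioner name and parameter keys follow the csi-driver-smb conventions as I understand them, so treat them as assumptions to verify against the version you install.

```shell
# Sketch: print a StorageClass that points csi-driver-smb at a Samba share.
# Share address, secret name and namespace are placeholders, not real values.

smb_storage_class_manifest() {
  # $1 = class name, $2 = share (//host/share), $3 = secret name, $4 = secret namespace
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: $1
provisioner: smb.csi.k8s.io
parameters:
  source: $2
  csi.storage.k8s.io/provisioner-secret-name: $3
  csi.storage.k8s.io/provisioner-secret-namespace: $4
  csi.storage.k8s.io/node-stage-secret-name: $3
  csi.storage.k8s.io/node-stage-secret-namespace: $4
reclaimPolicy: Retain
volumeBindingMode: Immediate
EOF
}

# Usage:
#   smb_storage_class_manifest host-smb //samba.storage.svc.cluster.local/files \
#     smbcreds storage | kubectl apply -f -
```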

Network storage diagram