Hello, Homelab!

About a year ago, my homelab journey began. It was a combination of the Self-Hosted Show and Homelab Show podcasts and the ServeTheHome YouTube channel that gave me the push I needed to take the plunge.

I knew I wanted my lab to be Kubernetes-based. I also wanted it to help me move some of my data off the cloud. I had noticed that Google Photos was corrupting some of my older pictures/videos, so I wanted to take ownership of keeping my data safe.

As I wanted to learn more about running a Kubernetes cluster in an HA configuration, I needed a minimum of three machines. I set up a saved search on eBay to find some old 1L PCs that would be decent for my needs. Most of the good deals were based in the US or the UK, though, with ridiculous shipping prices.

After a while, I got a notification. A seller in France was selling a lot of 3x ThinkCentre M710q. The specs weren't to my liking. However, after some research online, I found that they could be upgraded to something decent. The most performant CPU for that system was the i7-7700T, which could be found in abundance on AliExpress. The M.2 slot originally meant for the Wi-Fi card could be used for a 2.5G NIC.

The eBay listing for the ThinkCentre lot was a reasonable 230€. However, after upgrading the CPUs, maxing out the RAM, adding the 2.5G NICs, and getting new NVMe drives, the total came to around 1400€. Later in the year I also got a bunch of additional 3.5" SSDs for storage, which pushed the total slightly over 2200€.

Proxmox and Ansible

Proxmox was all the rage on the podcasts I was listening to, so I wanted to take it for a spin as well. Proxmox is a Debian-based OS with a web UI for managing VMs easily. It was quick to set up, and soon I had some Debian VMs running k3s. While playing around with Proxmox (PVE), I learned about Ceph, a distributed storage solution. I set it up as well, in the most ill-advised way possible: using a partition on my NVMe drive alongside the OS partition. I also tested out the high-availability feature in PVE, along with all the other major features.

So far, everything had been set up manually. I knew I wanted something codified, so that I could recreate everything quickly without having to remember all the little details that went into setting up the machines. I picked up Ansible for that.

Over the next several weeks, I created Ansible playbooks for everything I had done manually up until then, for both Proxmox and the VMs running k3s. There were several things about Ansible that didn't sit right with me, one of which was the fairly slow playbook execution, even after I had spent a bunch of time reworking everything to be fast, in theory. But I was willing to live with these grievances; I wasn't going to redo everything in some other system (Salt or Chef)...

I planned to set up Semaphore to execute the playbooks from Git in an automated way, which would allow me to update the lab from any of my computers. There was a challenge there, though. Under certain conditions, my playbooks would automatically restart the Proxmox hosts. However, if Semaphore was running in a VM on one of those hosts, the playbook would get interrupted. I sort of solved it by running Semaphore in a separate VM that was set up in PVE for live migration. That way, the Semaphore VM would survive restarts of the underlying hosts. In the end, I wasn't happy with Semaphore's features (at the time) and decided not to use it going forward.

Going bare metal

Out of all the things I wanted to do with my homelab, nothing had a hard requirement on running in a VM. I decided to eliminate Proxmox from the equation altogether and double down on just Kubernetes. Fewer things to maintain seemed like a good idea.

After reworking my Ansible playbooks a bit, I was ready to deploy bare-metal Debian with k3s. I started to focus more on setting up the Kubernetes side of things. While doing that, I came across Talos, and it piqued my interest.

Talos is a minimal, immutable Linux distribution meant for one thing and one thing only: running Kubernetes. By this time I had already found several ways to shoot myself in the foot managing Debian and k3s via Ansible, and Talos seemed not to have those pain points. After a few experiments in VMs on my main computer, I was convinced it was time to redo my three ThinkCentre nodes yet again.

Since Talos is immutable, I had no further use for Ansible, and I stopped using it altogether. Everything was now configured on the Kubernetes side with YAML.

After some 8 months of running Talos, I couldn't be happier. Whenever there's a new version, I just let Talos do a fresh install on all the nodes and everything just works.

Longhorn? Or Ceph?

One of the first things I had to figure out in the fresh Kubernetes cluster was storage. Applications need a way to store user (my) data. Longhorn was the most recommended option in various subreddits when I got started. It was supposed to be easy to set up and low maintenance.

I did get it running fairly quickly. But since I had learned a bit about Ceph (while experimenting with PVE), I knew I wouldn't be happy with what Longhorn had to offer. For one, I wanted to make use of CephFS for shared storage between different services. And the sentiment online about Longhorn had slowly started to change; people were telling stories about how they had lost some of their data.

I ended up going with Rook, a Kubernetes operator for managing Ceph. There was a bit of a learning curve getting Rook up and running, but everything has worked fairly well since then. There has been some random troubleshooting and bug hunting, but none of it has ever brought down storage in my cluster. I'm amazed at the resilience.

Having said that, it is not the fastest storage setup, and that's completely my own doing. I'm using the cheapest consumer SSDs I could get my hands on, and most of them are connected to the machines via USB, since I have no way to add more SATA ports to my 1L boxes. Over USB? What is he thinking!? Is he mad? Hear me out: my primary use for Ceph is storing my personal data. It's more important to me that the data is safe than that it's fast. I don't plan on running a database on Ceph-backed block storage, so chasing speed is not something I want to do here.

Each 1L box has one internal SATA port available that I'm taking advantage of, and I'm also connecting 2x 3.5" drives over USB 3. That's 3 drives per node and 9 drives in total. This allows me to store valuable data with 3x replication and less valuable data (e.g. DVD backups) with 4+2 EC (erasure coding). The EC pool squeezes noticeably more usable space out of the disks: roughly two thirds of raw capacity instead of the one third you get with 3x replication.
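
As a rough back-of-the-envelope comparison, here's how the two layouts turn raw capacity into usable capacity. This is a minimal sketch: the drive size is a made-up placeholder, and it ignores Ceph metadata overhead, near-full ratios, and failure-domain placement constraints.

```python
# Back-of-the-envelope usable capacity for the two pool layouts.
# Assumes 9 equally sized OSDs; ignores Ceph metadata overhead,
# near-full ratios, and failure-domain placement constraints.

DRIVES = 9
DRIVE_TB = 4.0  # hypothetical drive size, adjust to the real disks

raw_tb = DRIVES * DRIVE_TB

# 3x replication: every object is stored on three different OSDs.
replica_usable = raw_tb / 3

# 4+2 erasure coding: each object is split into 4 data chunks plus
# 2 parity chunks, so 6 chunks of storage hold 4 chunks of data.
ec_usable = raw_tb * 4 / (4 + 2)

print(f"raw capacity:        {raw_tb:5.1f} TB")
print(f"3x replication pool: {replica_usable:5.1f} TB usable (~33%)")
print(f"4+2 EC pool:         {ec_usable:5.1f} TB usable (~67%)")
```

With these profiles, neither pool loses data if any two OSDs die; the EC pool just pays for that protection with parity chunks instead of full copies, which is where the extra usable space comes from.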

Rat's nest of USB SSDs

Over the past 6 months, I've simulated various failure conditions (killing power to a single host, pulling a drive from a host, wiping the data on some of the drives), and Ceph has not complained one bit and has successfully kept my data safe.

One thing to keep in mind is that I'm currently using only slightly above 1% of the available storage. I'm certain that as the cluster fills up, it will have a few more surprises in store for me. But by that time, I may already be running it on better hardware.

And the databases?

I'm very much Postgres-biased when it comes to databases, and I can run that off of local-path storage. Postgres itself takes care of replication in an HA setup (shout out to CloudNativePG). Combine that with regular backups to Ceph-backed S3, and I feel fairly confident there's not going to be any data loss there either.

That said, none of the databases so far contain critical personal data. So, if I am proven wrong, it won't be too painful of a lesson.
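
Most of that confidence rests on the backups actually landing where they should. As a trust-but-verify sketch (not part of my actual setup), something like this can list what's sitting in the backup bucket; the endpoint, bucket name, and prefix are hypothetical placeholders, and boto3 picks up credentials from the environment.

```python
# Quick sanity check that backups are arriving in the Ceph-backed
# (RGW) S3 bucket. Endpoint, bucket and prefix are placeholders;
# credentials come from the usual AWS_* environment variables.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.homelab.example")

resp = s3.list_objects_v2(Bucket="pg-backups", Prefix="main-cluster/", MaxKeys=20)
for obj in resp.get("Contents", []):
    print(obj["LastModified"], obj["Size"], obj["Key"])
```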

All the other Kubernetes cohabitants

In addition to Ceph, I have a whole bunch of other software running in the cluster to make it "usable" for regular applications:

And then there's the user-facing applications, of which there aren't many currently:

Tunnelling into the cluster

Even though most of my devices can access the services through Tailscale, there are some trackers that need to connect to Traccar via the public web. Ideally, I would've wanted to use Cloudflare Tunnels or a similar service, but those only support HTTP traffic. The trackers, however, use various binary protocols.

The option I settled on was to get the cheapest VPS I could find, install just the Tailscale package, and add some firewall forwarding rules. With automatic updates enabled, this setup is effectively configure once and forget.
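
The real forwarding happens at the firewall level, but as an illustration of what the VPS is doing conceptually (relaying raw, non-HTTP TCP from its public interface to the service's tailnet address), here's a toy relay in Python; the port and Tailscale IP are made-up placeholders.

```python
# Toy TCP relay illustrating what the VPS does: accept raw tracker
# traffic on the public interface and pipe it to the in-cluster
# service over the tailnet. The real setup uses kernel firewall
# forwarding rules; the port and address below are placeholders.
import asyncio

LISTEN_PORT = 5055                # tracker protocol port (placeholder)
TARGET = ("100.64.0.10", 5055)    # tailnet IP of the service (placeholder)


async def pump(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    try:
        while chunk := await reader.read(65536):
            writer.write(chunk)
            await writer.drain()
    finally:
        writer.close()


async def handle(client_r, client_w):
    upstream_r, upstream_w = await asyncio.open_connection(*TARGET)
    # Copy bytes in both directions until either side hangs up.
    await asyncio.gather(pump(client_r, upstream_w), pump(upstream_r, client_w))


async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
```

In practice, a couple of kernel-level DNAT rules do the same job with no userspace process to babysit, which is what keeps the VPS configure-once-and-forget.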

Keeping everything up to date

I knew that I had to develop the habit of keeping most of the software as recent as possible. Otherwise, upgrading severely out-of-date software would become such a headache that I'd never want to do it.

I started off keeping track of all the software in the cluster using NewReleases. I would get a weekly digest of all the updates, after which I'd have to find an evening to review the changelogs for potential dangers and update everything.

A lot of the projects mentioned above are very actively developed. This meant that every Monday the weekly email from NewReleases was fairly feature-packed. It didn't take long for the update ritual to become a chore I didn't look forward to. It was also one of the reasons I effectively stopped deploying new services on the cluster, as I didn't want to increase the burden on myself.

At some point, I decided to rethink how I approached updates. I restructured the repository containing my Kubernetes manifests and introduced Renovate. Previously, I had to update all the dependencies at once to be able to delete the digest email from NewReleases; otherwise, I would've lost track of some updates. With Renovate, each update gets its own pull request. Depending on the service and the changelog, I can decide to postpone certain updates for evenings when I have time to take care of them. The other, low-risk updates I can just merge and forget about.

Is it all worth it?

It has been a somewhat bumpy road so far. I've enjoyed learning about new things and tinkering with both the hardware and software sides of running a homelab. What I didn't anticipate were the maintenance responsibilities that come with running a "production" cluster: Why is the internet down? Oh, the Pi-hole pods are in a crash loop. This has slowed me down in moving more of my life onto self-hosted services.

But there's still plenty I want to explore, so I'm not throwing in the towel just yet. I'll take my time with it and won't rush myself. And, in doing so, I hope I'll eventually reach the goal of having all my data on my own hardware.

If you managed to get this far, this topic is probably something you're interested in. Drop me a DM on any of the social platforms I'm on, if you want to chat.