I have a fully functioning Docker VM that I run quite a few services on. It works perfectly and has given me no grief. I've been using Portainer as a very nice frontend which was a breeze to link to my Active Directory. It's just all very nice and convenient. Let's dash it in the bin and set up something more convoluted and overkill.

I'm going to replace it with Kubernetes and Rancher, and will be figuring it out as I go. As a bit of a prep work, I have gone through the Rancher training videos and will be following most of the guidelines, right up until they diverge from my requirements.

The rough plan is as follows.

  • Set up a single node "HA" Kubernetes cluster and install Rancher on it. I'm going to use Rancher Kubernetes Engine (RKE) to do this. I "could" use the Docker image of Rancher that they provide, but they also recommend that you not use this in production. Seeing as my Docker host is running some super critical services like my ebook/comic reader, I need the option of being able to become highly available. Also because the plain docker route is too simple and this site isn't called
  • With Rancher installed, I will be offloading SSL to my pfSense box. I won't be putting a load balancer in just yet.
  • Get Rancher working with Active Directory.
  • Use this Rancher install to bring up another single node "HA" cluster for running workloads on
  • Figure out how to mount my smb shares containing my "linux iso's" and "public domain" comics.
  • Convert all my docker-compose files to work with Rancher. Ok, I lie. I'll probably just do one to figure out how, and then do the rest once I've fully tested this whole thing.
  • Figure out how to backup and restore this set up. Then cause havoc and see if I can fix it.
  • Add more nodes to minion cluster. Add more workloads. See what happens if I yank a node.
  • Make my rancher cluster properly HA by adding another 2 nodes with a load balancer in front.
  • ......
  • lolno. maybe. we'll see. If I IT-TF-OUTTA-THIS then I'll do a proper migration with a fresh install that I'll try to automate as much as possible.

So the first task is to provision a VM with the following requirements.
- Docker installed. Rancher have provided some nice install scripts, so this will be simplified
- iptables and opening a bunch of ports - Rancher seems to prefer iptables. I prefer UFW because it's, well, uncomplicated. Looks like I'm going to need to get a better understanding of iptables.
- Since we're going to be using RKE, we need to make sure that we have all the requirements listed here:

That's quite a few tasks to run, and I'm pretty sure I don't want to do this manually, so the real first task is to write an ansible playbook that will do the above for me.


I made the thing

Sup. I'm back, and I did the thing. I made a few changes to my previous requirements, and did things slightly differently.

First of all, I skipped out on implementing iptables properly. Doing some research, I found that the ansible module for iptables doesn't create new rules, just modifies what was there previously. While there were other options for working around this, I decided to stick with UFW. From the depths of google, I found some comments suggesting UFW should still work fine, so let's try that for now. Worst case scenario, I modify the playbook a little and we learn something. Maybe.

Secondly, instead of re-inventing the wheel, I used the ansible-docker-role by Jeff Geerling. This illustrious fellow has written many useful Ansible playbooks (as well as a literal book on Ansible), so he definitely knows what he's doing. Just import it, and add it to the playbook.
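For anyone following along, pulling in that role is a one-liner plus a couple of small additions. A sketch of a typical layout (file names and the k8s_nodes host group are my assumptions, not gospel):

```yaml
# requirements.yml - install with: ansible-galaxy install -r requirements.yml
- src: geerlingguy.docker

# then reference the role in the playbook itself
- hosts: k8s_nodes
  become: yes
  roles:
    - geerlingguy.docker
```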

Thirdly, this guide on computingforgeeks was super helpful in figuring out how to do this.

Fourthly, the kernel modules required for RKE in the rancher documentation seem a little out of date. Specifically nf_conntrack_ipv4 and nf_nat_masquerade_ipv4. They're not included by those names anymore under Debian 10, but are still enabled so don't worry too much about those. (maybe. for now. we'll see.)

Fifthly, I have another playbook which I use to provision my "golden images" which I've not included here, mainly because I will be modifying it for better security things, but that's a topic for another day.

Anyway, here's the relevant part of my rancher prep role. Instead of removing the bit where I called the Rancher Docker Install Script, I just commented it out. While it technically worked, it wasn't utilising Ansible properly. Ya see, Ansible is declarative. When you write a playbook, you need to come at it from the angle of "this is how I want things to be" as opposed to "do this, then this". Make sense? Utilising Ansible properly means we won't be running the Docker install script each time the playbook is run.
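To illustrate the difference, here's a sketch of the same job done both ways. The shell version re-downloads and re-runs the script on every single play; the declarative version describes the end state and becomes a no-op once it's true (the script URL is deliberately left as a placeholder):

```yaml
# Imperative: runs every time the playbook does
- name: Install docker via script
  shell: curl <install-script-url> | sh

# Declarative: "docker-ce should be present" - does nothing on subsequent runs
- name: Ensure docker is installed
  apt:
    name: docker-ce
    state: present
```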

- name: Install pre-requisites
  apt:
    pkg:
        - curl
        - ufw
    state: latest
    update_cache: yes

# While this works, it's not ideal because we're not properly utilising ansible
# Install a supported version of docker using the latest rancher docker install script
#- name: Install docker using Rancher's installation script
#  shell: curl | sh
# Changed to using geerlingguy.docker because that dude knows what he's doing

# Create a new user for k8s, and add it to the docker and sudo group
- name: Add the user 'k8s' with a bash shell, appending the group 'sudo' and 'docker' to the user's groups, create a home directory for them.
  user:
    name: k8s
    shell: /bin/bash
    groups: sudo,docker
    append: yes
    create_home: true

# Create the SSH directory for this user
- name: Create the .ssh directory
  file:
    path: /home/k8s/.ssh
    state: directory

# Create a new authorised_keys file with the public key for k8s user
- name: Create a new authorised_keys file with the public key for k8s user
  copy:
    src: ./files/authorized_keys
    dest: /home/k8s/.ssh/authorized_keys
    owner: k8s
    group: k8s

# Enable passwordless sudo for k8s user
- name: Give k8s user passwordless sudo
  lineinfile:
    path: /etc/sudoers
    state: present
    line: 'k8s ALL=(ALL:ALL) NOPASSWD: ALL'
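One extra safety net worth considering when touching /etc/sudoers from Ansible: the validate parameter runs visudo against the proposed file first, so a typo can't lock you out of sudo entirely. A sketch of the same task with that added:

```yaml
- name: Give k8s user passwordless sudo (validated first)
  lineinfile:
    path: /etc/sudoers
    state: present
    line: 'k8s ALL=(ALL:ALL) NOPASSWD: ALL'
    validate: '/usr/sbin/visudo -cf %s'
```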

# This will fail because 2 of the modules are no longer present in the latest kernel of Debian 10
# I can fix this by either removing them from the list, or by adding a ignore fail to this
# Enable all the modules required according to:
# Remove the 2 known to fail
- name: Load kernel modules for RKE
  modprobe:
    name: "{{ item }}"
    state: present
  #ignore_errors: yes
  loop:
    - br_netfilter
    - ip6_udp_tunnel
    - ip_set
    - ip_set_hash_ip
    - ip_set_hash_net
    - iptable_filter
    - iptable_nat
    - iptable_mangle
    - iptable_raw
    - nf_conntrack_netlink
    - nf_conntrack
    #- nf_conntrack_ipv4
    - nf_defrag_ipv4
    - nf_nat
    - nf_nat_ipv4
    #- nf_nat_masquerade_ipv4
    - nfnetlink
    - udp_tunnel
    - veth
    - vxlan
    - x_tables
    - xt_addrtype
    - xt_conntrack
    - xt_comment
    - xt_mark
    - xt_multiport
    - xt_nat
    - xt_recent
    - xt_set
    - xt_statistic
    - xt_tcpudp

# Add these modules to /etc/modules just to be double sure
- name: Add modules to /etc/modules to be double sure
  lineinfile:
    path: /etc/modules
    line: "{{ item }}"
  loop:
    - br_netfilter
    - ip6_udp_tunnel
    - ip_set
    - ip_set_hash_ip
    - ip_set_hash_net
    - iptable_filter
    - iptable_nat
    - iptable_mangle
    - iptable_raw
    - nf_conntrack_netlink
    - nf_conntrack
    #- nf_conntrack_ipv4
    - nf_defrag_ipv4
    - nf_nat
    - nf_nat_ipv4
    #- nf_nat_masquerade_ipv4
    - nfnetlink
    - udp_tunnel
    - veth
    - vxlan
    - x_tables
    - xt_addrtype
    - xt_conntrack
    - xt_comment
    - xt_mark
    - xt_multiport
    - xt_nat
    - xt_recent
    - xt_set
    - xt_statistic
    - xt_tcpudp

# Edit /etc/ssh/sshd_config and enable AllowTcpForwarding yes
- name: Ensure AllowTcpForwarding is enabled
  lineinfile:
    path: /etc/ssh/sshd_config
    line: AllowTcpForwarding yes
    create: yes
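Note that sshd only reads its config at startup, so this change won't actually apply until the service restarts (the reboot at the end of the playbook covers that). If you wanted it applied immediately, a notify/handler pair would do it; the handler name here is my own invention:

```yaml
- name: Ensure AllowTcpForwarding is enabled
  lineinfile:
    path: /etc/ssh/sshd_config
    line: AllowTcpForwarding yes
    create: yes
  notify: restart_sshd

# Handler
- name: restart_sshd
  service:
    name: sshd
    state: restarted
```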

# Following sysctl settings must be applied - net.bridge.bridge-nf-call-iptables=1
- name: Modify sysctl entries
  sysctl:
    name: '{{ item.key }}'
    value: '{{ item.value }}'
    sysctl_set: yes
    state: present
    reload: yes
  loop:
    #- {key: net.bridge.bridge-nf-call-ip6tables, value: 1}
    - {key: net.bridge.bridge-nf-call-iptables,  value: 1}
    #- {key: net.ipv4.ip_forward,  value: 1}

# Disable SWAP because kubernetes doesn't like it  
- name: Disable SWAP in fstab since kubernetes can't work with swap enabled
  replace:
    path: /etc/fstab
    regexp: '^([^#].*?\sswap\s+.*)$'
    replace: '# \1'
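Commenting out the fstab entry only stops swap coming back at the next boot. Since the playbook reboots at the end this is enough, but if you wanted swap gone immediately on a running box, something like this would cover it (a sketch using the built-in swap fact):

```yaml
- name: Disable swap for the current session too
  command: swapoff -a
  when: ansible_swaptotal_mb > 0
```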

# Needed for Debian 10. Unsure of any other os.
- name: Change to iptables-legacy
  shell: update-alternatives --set iptables /usr/sbin/iptables-legacy && update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
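Since that's a shell task, Ansible will re-run it on every play even when nothing needs changing. The alternatives module expresses the same thing declaratively; a sketch, assuming it behaves the same on Debian 10:

```yaml
- name: Change iptables to iptables-legacy
  alternatives:
    name: iptables
    path: /usr/sbin/iptables-legacy

- name: Change ip6tables to ip6tables-legacy
  alternatives:
    name: ip6tables
    path: /usr/sbin/ip6tables-legacy
```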

# Open required ports
- name: Open required UFW TCP ports for Rancher and AD
  ufw:
    rule: allow
    proto: tcp
    port: "{{ item }}"
  loop:
    - 22
    - 80
    - 443
    - 2376
    - 2379
    - 2380
    - 3389
    - 6443
    - 9099
    - 10250
    - 10254
    - 30000:32767
    - 1024:65535
    - 53

- name: Open required UFW UDP ports for Rancher and AD
  ufw:
    rule: allow
    proto: udp
    port: "{{ item }}"
  loop:
    - 8472
    - 30000:32767
    - 1024:65535

- name: Make sure ufw is enabled
  ufw:
    state: enabled

- name: Install pre-requisites for Longhorn if running on a minion node
  apt:
    pkg:
        - curl
        - util-linux
        - grep
        - mawk
        - open-iscsi
    state: latest
    update_cache: yes
  when: '"minion" in ansible_hostname'
  notify: enable_iscsid
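Matching on a substring of the hostname works, but it couples the playbook to my naming scheme. An inventory group would express the same intent more robustly; "minions" here is a hypothetical group name:

```yaml
- name: Install pre-requisites for Longhorn on minion nodes
  apt:
    pkg:
        - open-iscsi
    state: latest
  when: "'minions' in group_names"
```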
# Perform a reboot
- name: Perform a reboot
  reboot:

# Handler
- name: enable_iscsid
  service:
    name: iscsid
    state: started
    enabled: yes
Relevant parts of the k8s-provision playbook

The only other thing to note is that with the geerlingguy.docker role, the following needs to be added to the defaults.

# Going by the Rancher Docker install scripts, 18.06.2 is the latest version of docker supported by kubernetes on Debian 10.
docker_package: "docker-{{ docker_edition }}=18.06.2~ce~3-0~debian"

# We also want to add the k8s user to the docker group
docker_users: [k8s]

With that run, we should now have a node ready for RKE. Exciting stuff. The cool thing is, with Ansible, we can run this on all the nodes we want at the same time. Neat huh?

Right, let's get RKE installed and then use that to get our "HA" single node up and ready for bidnizz.

Let's get Kuby

So, RKE. Super easy way of bringing up a kubernetes cluster. Before we begin though, we need a few things: RKE itself, a workstation we'll be installing it on, and our ssh keys. I have set myself up with a super fancy VM that runs code-server and is the big boss of all my Ansible minions. I'll be running RKE on that to do all the managings.

Let's make a RKE config file. Here's one I made earlier, all from scratch using my amazing brain skills. Be super careful with spacing and tabs because it is a yaml file.

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address:
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: k8s
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa_k8s
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    infra_container_image: ""
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.4.3-rancher1
  alpine: rancher/rke-tools:v0.1.64
  nginx_proxy: rancher/rke-tools:v0.1.64
  cert_downloader: rancher/rke-tools:v0.1.64
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.64
  kubedns: rancher/k8s-dns-kube-dns:1.15.2
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.2
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.2
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  coredns: rancher/coredns-coredns:1.6.9
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.18.8-rancher1
  flannel: rancher/coreos-flannel:v0.12.0
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.13.4
  calico_cni: rancher/calico-cni:v3.13.4
  calico_controllers: rancher/calico-kube-controllers:v3.13.4
  calico_ctl: rancher/calico-ctl:v3.13.4
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  canal_node: rancher/calico-node:v3.13.4
  canal_cni: rancher/calico-cni:v3.13.4
  canal_flannel: rancher/coreos-flannel:v0.12.0
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  weave_node: weaveworks/weave-kube:2.6.4
  weave_cni: weaveworks/weave-npc:2.6.4
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:nginx-0.32.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.6
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: ~/.ssh/id_rsa_k8s
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
restore:
  restore: false
  snapshot_name: ""
dns: null

Now do lots of reading on what all those mean and.......... yeah, ok, let's do it the easy way. After you have RKE installed, just run:

rke config

and follow the prompts. The values within the [] are the defaults, but "Dammit Hobo, I need help with the prompts" I hear you say. Well, ok, I gotchu.

[+] Cluster Level SSH Private Key Path [~/.ssh/id_rsa]:
# Provide the path to the SSH key for the cluster
[+] Number of Hosts [1]:
# You can have as many as you want. If you used the amazing ansible script I made, you could provision them all at once. Neat huh.
[+] SSH Address of host (1) [none]:
# The IP address or hostname of the host
[+] SSH Port of host (1) [22]:
# Generally this will be 22 unless you changed it.
[+] SSH Private Key Path of host () [none]:
# Provide the path to the SSH key for the host. In my particular scenario this is identical to the first one provided above.
[+] SSH User of host () [ubuntu]:
# Change this to the username associated with the private key and login.
[+] Is host () a Control Plane host (y/n)? [y]: 
# yes
[+] Is host () a Worker host (y/n)? [n]:
# yes
[+] Is host () an etcd host (y/n)? [n]:
# yes
# If you're running multiple nodes, feel free to change these out. As I'm only running one node, it will be doing all the things.
[+] Override Hostname of host () [none]:
[+] Internal IP of host () [none]:
[+] Docker socket path on host () [/var/run/docker.sock]:
# default
[+] Network Plugin Type (flannel, calico, weave, canal) [canal]:
# default
[+] Authentication Strategy [x509]:
# default
[+] Authorization Mode (rbac, none) [rbac]:
# default
[+] Kubernetes Docker image [rancher/hyperkube:v1.18.8-rancher1]:
# default
[+] Cluster domain [cluster.local]:
# change if you want, but I left it as it is.
[+] Service Cluster IP Range []:
# default
[+] Enable PodSecurityPolicy [n]:
# default
[+] Cluster Network CIDR []:
# default
[+] Cluster DNS Service IP []:
# default
[+] Add addon manifest URLs or YAML files [no]:
# default

This will generate the above file for you. Simples. This file can be modified though. You can add in more nodes or take nodes out. You can add/remove roles from each node all from this one file. Super easy.
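For example, scaling out later is just more entries under nodes and another rke up. A sketch with made-up addresses:

```yaml
nodes:
- address: 192.168.1.21
  user: k8s
  role: [controlplane, etcd]
  ssh_key_path: ~/.ssh/id_rsa_k8s
- address: 192.168.1.22
  user: k8s
  role: [worker]
  ssh_key_path: ~/.ssh/id_rsa_k8s
```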

If you don't have a password on your ssh private keys (naughty), then just type in

rke up

and you're set. You have a kubernetes cluster. However, if you're trying to be super secure and have a password on your ssh key, you're probably wondering how we go about doing this. I did at least. Anyway, there's a tool for that.

Let's run through the basics, more for me than anyone else because you probably already know this. Nerd.

# Start up ye olde agent
eval $(ssh-agent)
# Add an arbitrary key
ssh-add ~/.ssh/key

Easy. Now that we have the key stored in the memories, we can go ahead and try to bring up our cluster using the "rke up" command. The nerds among you who read through my yaml file will see that "ssh_agent_auth: false". Change this to true in your cluster.yml if you're going to be using the ssh-agent.

In the same directory as your cluster.yml file, there should be a new file called kube_config_cluster.yml. This file needs to be kept secure as it contains all the details we need to interact with our cluster. How do we interact with our cluster? Glad you asked. With kubectl. Follow the super simple instructions and get it installed. Once you've done that, we provide our kubeconfig file by running the following from the same directory as the file.

export KUBECONFIG=$(pwd)/kube_config_cluster.yml

Easy. Now let's see how badly we messed up. Try:

$ kubectl get nodes
NAME   STATUS   ROLES                      AGE   VERSION
       Ready    controlplane,etcd,worker   23m   v1.18.8


Now we can move on to installing Rancher. We'll do this using Helm. Get it installed on the same workstation you ran rke from. Once you've got it, let's add the stable chart repository for rancher.

helm repo add rancher-stable

Then, create a namespace for rancher. Namespaces in kubernetes are a way of organising resources.

kubectl create namespace cattle-system

Next up. SSL. There are a few options here, but I'll be offloading SSL to my pfSense box, or, as the rancher documentation calls it, "External TLS Termination", so we can skip installing cert-manager and go straight to installing rancher.

First though, we need to provide the certificate authority certs to rancher as a secret. To do this, get the CA certs. In my case I have a root certificate which signs an intermediate certificate, which then signs all subsequent certs. In my scenario I'll get a copy of the root and intermediate certs and join them together into a pem file. Super easy to do in the terminal.

cat intermediate.crt root.crt > cacerts.pem

Note that the order you combine your certs matters.

With that done, run the following command to provide the certs to rancher in the cattle-system namespace.

kubectl -n cattle-system create secret generic tls-ca \

Now we're ready. Let's run the following in the terminal.

helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set \
  --set tls=external \
  --set privateCA=true

Wait a bit, and hopefully, rancher will be available at the hostname you specified (assuming of course you set up the correct dns entry on your dns server.)

To check on the status of the rollout, you can use:

kubectl -n cattle-system rollout status deploy/rancher

Anyone else get Pokemon vibes when discussing cattle and rollout?

If you see the following:

deployment "rancher" exceeded its progress deadline

Try the following to see the current status.

kubectl -n cattle-system get deploy rancher

Anyways, next up I'll be configuring HAProxy on my pfSense box so I don't get any annoying browser error messages when interacting with the rancher web gui.

Lock it up

Oooh, things are getting interesting now. We have a functional install of rancher, on what can easily be upgraded to a multi node ha cluster. We gettin' fancy in here.

SSL Off-Loading is what we're all about in this little post. Little bit of back story. I'm using 2 instances of Windows Server 2019 which run both Active Directory and DNS for my internal domain. If it was up to me, I'd use pfSense for DNS, but AD is a delicate little flower and must have things going its way or else it doesn't want to play. What can ya do? Anyway, I have a virtual IP set up in pfSense which HAProxy listens on. When it receives a request for a particular sub-domain, it can do the re-directing and the ssl off-loading. This initially took a bit of time to figure out how to set up, and the only information I could find was about using Let's Encrypt and going that route. However, I only use that amazing service for publicly facing services. For everything internal, I run my own certificate authority. Cus we fancy like that. So, the hostname (set in my AD DNS) will redirect to the virtual IP. HAProxy will see this, see "rancher" at the beginning and will be like, yeah I know what to do with that. Send that request here, and wrap it in the wildcard ssl cert I have. Easy.

Let's assume we have a functioning HAProxy set up on our pfSense box. I have a guide for this which I will upload in due course. Try this link. If you see a guide, then I did it. If not....well enjoy.

This bit is pretty simple. In pfSense, we're looking under Services -> HAProxy. From here, add a new backend.

Mode: Active
Name: Rancher
Forward To: Address+Port
Port: 443
Encrypt(SSL): Yes

Health Check Method: None

With that done, go to frontend and edit the section for HTTPS_443 and add in the subdomain for rancher and tell it to use the backend we just created.

Boom. Doneski.

Active Directory

Now we get to the bit I've been dreading. Adding in Active Directory integration....

Well. I've been banging my head against this for a while now. Would you like the Aladeen news, or the Aladeen news?

The Aladeen news is that I figured out the correct settings to get AD integration working. Woooo, yay and so on. The Aladeen news however is that UFW is doing its job reaaallly well. So well in fact, that I can't get my rancher host to connect to AD without disabling it. With it enabled, you'll see:

TestAndApply Error
Error creating connection: LDAP Result Code 200 "": dial tcp: i/o timeout

Disable it -

sudo ufw disable
# or for a full reset to baseline
sudo ufw reset

and it works fine with the following settings:

Service Account Username: [email protected]
Default Domain: [blank]
Server Connection Timeout: 5000

Search Base: dc=[DOMAIN],dc=[TLD]
Object Class: person
Login Attribute: userPrincipalName
Username Attribute: name
Search Attribute: sAMAccountName|sn|givenName
Status Attribute: userAccountControl
Disabled BitMask: 2
Search Base: dc=[DOMAIN],dc=[TLD]
Object Class: group
Name Attribute: name
Search Attribute: sAMAccountName

# Obviously remove the [] when filling in details, and stick with uppercase.

I've tried opening a bunch of ports on the rancher host to see what would help. No dice. Kinda annoying tbh. Not a huge issue in the grand scheme of things because my rancher install won't be accessible externally and is covered by my pfSense box from shenanigans. Still though. Would be nice to get this to work. :/ Moving on for the time being, but I'll be back to figure this out later. At least my domain name now makes sense huh.

If for some crazy reason you've been following along and have managed to figure this out, let me know. Somehow.

Anyway, with this done, you can then restrict access to authorized users based on your active directory groups.

Some time later.......

It took a while to get to this stage. I had to bust out me special troubleshooting wall to beat my head against until I got it, but LOOK AT IT. GAZE UPON MY MINION CLUSTER!

Seriously though, this took a while to do. Firewall shenanigans and certificate shenanigans and not knowing where the FM was so I could RTFM. To save you the hassle, I just edited the steps slightly above to get to this stage much easier than I did. YOU. ARE. WELCOME.

Anyway, if you've got up to this stage, you're almost there. Click on Clusters -> Add Cluster -> From Existing Nodes -> Give it a name -> Next -> Assign all the roles you want. In my case, I have 1 node which will have all 3 roles. Copy the provided command on your provisioned machines and you'll have your own minions cluster in no time.

I will be changing the resource allocations for these 2 vms now that everything is up and running. The cluster will be reduced to 2GB RAM, and the minions cluster will be increased to 8GB.

We're finally at the "fun" stage. Migrating my docker containers over. Haven't got this far before so I'm just going to savour the moment before I inevitably run into further issues.


OK, I lied. Unintentionally.

Just a teensy weensy update here because I found some weirdness. First, Rancher does not like 2GB of RAM, so that greedy lil cow will stick with 4GB. My minions cluster has 8 though so I'll be dumping some workloads on there soon.

Also, remember how I said I got active directory with rancher working? Well. It stopped working. More accurately, it couldn't talk to it anymore? Hmmm. Wonder what could be the cause of that.

$ sudo ufw status
Status: inactive
# hmmm. doubt.
$ sudo ufw disable
Firewall stopped and disabled on system startup

Oh look. AD connection is working again. I see. I see.

$ sudo apt remove ufw
$ sudo reboot now

Oh look. It stopped working. Ok. Lemme google something real quick.

Uh huh. Yup. Yup. Ok, yeah. I not smart. UFW is a wrapper around iptables for noobs like me to be able to handle it all much easier. Let me try "disabling" iptables real quick. It's in quotes because I'm not actually disabling anything, just letting everything in/out.

$ iptables -P INPUT ACCEPT
$ iptables -P OUTPUT ACCEPT
$ iptables -P FORWARD ACCEPT
$ iptables -F

Ok, yes, AD is working again. Quick reboot aaaaaannddd.....

An unknown error occured while attempting to login. Please contact your system administrator.
Stupid system administrator not doi....oh, it means me :/

Ok Ok. This isn't a problem I can just put a little spongebob plaster on, so I may as well fix it now. What I need to be able to do is make my iptables changes persistent. Let me do some more googling.......

Well then. iptables is being replaced by nftables, which is now the default backend on Debian 10. When using iptables, you're actually using something called iptables-nft, a weird middle layer between iptables and nftables. Ok. If I'm using nftables, maybe I should just use firewalld to manage that. I remember seeing something about that on the rancher install docs........ah. ok. Well guess I won't do that. Wait, how about if I allow all traffic from the rancher node to both of my domain controllers?

Ok. I'm really going into a bit of a deep dive here, but APPARENTLY Debian 10 is compatible with Kubernetes/Rancher, but there's an issue with kubeadm and iptables and I should be using iptables-legacy. Well how de everloving flippity flop do I do that?

Oh, I see. Ok. Let's try it.

# use legacy iptables
$ sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
$ sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
# Reboot to force changes, then allow all incoming and outgoing to my domain controllers
$ sudo iptables -A INPUT -s -j ACCEPT
$ sudo iptables -A INPUT -s -j ACCEPT
$ sudo iptables -A OUTPUT -d -j ACCEPT
$ sudo iptables -A OUTPUT -d -j ACCEPT

Oh, we're in. Neat. Not to brag or anything, but this makes me a TROUBLESHOOTING GOD. BOW BEFORE ME PEASANTS!

To recap what we've just learnt here. Debian 10 uses nftables but ALSO uses the legacy iptables. When I was modifying the firewall rules through both ufw and iptables previously, I was modifying the nftable rules through a weird compatibility layer. However, for some weird reason the legacy rules were also being used to some degree? Maybe? I think? Some things were fine using the new nftable rules, but rancher wants to deal with the legacy ones which just seemed to cause a weird clusterfuck. Anyway, by changing to the legacy iptables, and adding in the firewall rules to allow all comms between my rancher node and my domain controllers, it all seems to work. Excellent. Now I think I can still use ufw to deal with these rules as long as I enable legacy mode. I'll give that a go.

While I enjoy this high of figuring out this weird issue, let me just say that this took most of my day. Granted, I did other things in between like getting pwned on Apex Legends and optimising my factorio base (it is the weekend after all), but still. Some annoying issues can take a while to figure out so don't be doing the up givings. Ya hear me?

I think my main issue here was not knowing what I didn't know (and also maybe sorta not properly reading the manuals......) but I digress. Victory snack time.

Mini postings

Final update of the day. Did another teardown and rebuild. Seriously. Having the playbooks made this set up and learning process way easier. I won't post the entire playbook, but the only thing I added was:

# Needed for Debian 10. Unsure of any other os.
- name: Change to iptables-legacy
  shell: update-alternatives --set iptables /usr/sbin/iptables-legacy && update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy

Once I've finished working on this project, I'll post a sanitised version of my playbooks and stuff in case anyone else can get some use out of them.

But yes. Everything seems to be working as it should. Firewall can stay on, ufw rules are working and Active Directory login works after reboots.

Where I wander out loud about storage

Now we need to discuss storage. More correctly, persistent storage. Here are the provided options in rancher.

Add an ephemeral volume
Add a new persistent volume (claim)
Use an existing persistent volume (claim)
Bind-mount a directory from the node
Use a secret
Use a config map
Use a certificate

Ephemeral is out but I do like that word. The easy option seems to be bind-mount a volume from the node, which could work....

What if I added a small task to my playbooks that would auto-mount a share from my nas and then provide storage via that mount? If each node had this, then it should work across all nodes....I think. Seems a bit cludgy though.

The other 2 relevant options revolve around PVCs. As in Persistent Volume Claims. From my understanding, the way this works is that you set up a persistent volume. From here you can then assign PVCs to workloads. A claim allows a workload to access a certain amount of storage. We could use this with a bind mounted directory from our nas.......? Less cludgy but still cludgy.
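To make that concrete, the PV/PVC dance looks roughly like this: an admin-created volume, plus a claim that a workload references. This is a generic hostPath sketch (names, path and sizes all invented), not my final setup:

```yaml
# A volume carved out on the node, e.g. where the nas share is mounted
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nas-volume
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /mnt/nas
---
# A claim against it that a workload can mount
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nas-claim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```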

The correct option would be to use some form of distributed storage such as GlusterFS. I could hack something together using multiple vms and then provide those access to shares from my nas, but that just seems overly convoluted. Ok, I like overly convoluted, but it's just impractical in this case. Sec, gonna see how complicated it is......

Ok, so the documentation for Gluster is pretty straight-forward. They even have Ansible playbooks to help set things up. However, Rancher requires a small change for this to work properly. Additionally, I'd need to play around with Fedora, as that's the recommended OS for Gluster. Ok. I don't fancy doing that just yet, but that's an interesting topic for another day. I'm going to go for kludge option 1. Easier than the other options, but it still provides a little bit of resilience, and the data is backed up, so it should be fine for my needs.

Where I change my mind

Nope. Changed my mind. I just found out about Longhorn. If you're noticing a theme here, it's because Longhorn was developed by the same folks that made Rancher. What is it, you ask? A solution to my distributed storage problem. And the best bit? It can be installed on the Kubernetes cluster, and the orchestration can be managed by Kubernetes. The other best bit? Installation seems pretty simple. We just have a few requirements to go through.


  1. Rancher v2.1+: I'm using version 2.4.8. Good on this front.
  2. Docker v1.13+ : I'm using 18.06.2~ce~3-0~debian so good here too.
  3. Kubernetes v1.14+ cluster with 1 or more nodes and Mount Propagation feature enabled. If your Kubernetes cluster was provisioned by Rancher v2.0.7+ or later, MountPropagation feature is enabled by default. Check your Kubernetes environment now. If MountPropagation is disabled, the Base Image feature will be disabled. : I'm only installing this on the minions cluster, which was provisioned by rancher. Good to go here as well.
  4. Make sure curl, findmnt, grep, awk and blkid have been installed on all nodes of the Kubernetes cluster. : Pretty simple to do. I'll just make an Ansible task for this. findmnt and blkid are part of util-linux. Good to know for the ansibleising. Also, awk is part of the mawk package.
  5. open-iscsi has been installed on all the nodes of the Kubernetes cluster, and iscsid daemon is running on all the nodes. : Ansible
  6. For Debian/Ubuntu, use apt-get install open-iscsi to install. : Nah, ansible
  7. A host filesystem supports file extents feature on the nodes to store the data. Currently we support: ext4 + XFS : Easy enough to check. Run "df -T". I'm good to go though. Sweet.
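Requirement 7 is easy to script, too. Here's a throwaway sketch that filters `df -T` style output for anything that isn't ext4 or XFS. I've hardcoded some sample output so you can see the shape of it; on a real node you'd pipe `df -T` straight in instead of the printf:

```shell
# Flag filesystems Longhorn can't use. Sample data stands in for real `df -T` output.
printf '%s\n' \
  'Filesystem     Type  1K-blocks    Used Available Use% Mounted on' \
  '/dev/sda1      ext4   41152736 8123456  31234567  21% /' \
  '/dev/sdb1      btrfs  20961280 1234567  19456789   6% /data' |
awk 'NR > 1 && $2 != "ext4" && $2 != "xfs" { print $7 " is " $2 " - not supported by Longhorn" }'
```

With the sample data above, only the btrfs mount gets flagged.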

So, from this, I learn that I need to make a task for Ansible that can be run on nodes that will be minions. We don't want these prerequisites installed on non-minion nodes, so a bit of segregation will be good. Oooooh. I can do this based off the hostname. Ya see, I've been using templates in vCenter when I clone the base images for these shenanigans. This allows me to apply hostnames and static IP addresses. I've been using this to apply a naming scheme for these VMs. In the case of the minion cluster, each VM has the following naming scheme "k8s-minion-xx" where xx is an integer. So all I have to do here is add a task to my playbook that only runs "when" it sees that the hostname contains the word minion. Noice.

Here it be.

- name: Install pre-requisites for Longhorn if running on a minion node
  apt:
    name:
      - curl
      - util-linux
      - grep
      - mawk
      - open-iscsi
    state: latest
    update_cache: yes
  when: '"minion" in ansible_hostname'
  notify: enable_iscsid

### handler is in a separate file but looks like this
- name: enable_iscsid
  service:
    name: iscsid
    state: started
    enabled: yes

Ok, so with that run on my node, I should be able to install longhorn through the rancher ui.

Rancher -> Change to minions cluster -> Projects/Namespaces -> Add Namespace -> "longhorn-system" -> Create

Why we do dis? According to the Longhorn install guide, "Important: Please install Longhorn chart in longhorn-system namespace only", so I figured I'd make sure the namespace exists. I could have tested to see if the namespace would be made for me, but this was easy enough to do.
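If you'd rather skip the clicking, the same thing as a manifest, which should be equivalent to what the UI does:

```yaml
# Namespace manifest - apply with `kubectl apply -f namespace.yaml`
apiVersion: v1
kind: Namespace
metadata:
  name: longhorn-system
```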

Rancher -> Change to minions cluster -> Apps -> Launch-> Longhorn -> Launch

Hmmmm. That was easy......maybe tooo easy. Everything seems to be working. Moving on.......


Let's put some workloads on this bad boi. I've been wanting to set up my Linux hosts to e-mail me with any woes they may have. Most of my services, such as my NAS, have e-mail alerts sent to me using a Gmail account from ye olde days when Google was just giving away G Suite for free. It's nice and works great, but we don't like that here. How about we give MailHog a go? It has a nice docker image that we can use to test both it and Kubernetes workloads. The command to run this via docker is:

docker run -d -e "MH_STORAGE=maildir" -v $PWD/maildir:/maildir -p 1025:1025 -p 8025:8025 mailhog/mailhog

That's cool and all, but we need it to run through the rancher interface.

Rancher -> Select Minions cluster -> Default -> Deploy

Now, as you can see, we can easily map most of the options from the provided command to what rancher requires.

Using the Longhorn storage class I added in earlier made this super easy to do. When all done, click Launch.
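For the curious, here's roughly how that docker command maps onto raw Kubernetes YAML. This is a sketch of the sort of thing Rancher generates for you, not a dump of it, and the labels and claim name are my own invention:

```yaml
# Hypothetical deployment equivalent of:
# docker run -d -e "MH_STORAGE=maildir" -v $PWD/maildir:/maildir -p 1025:1025 -p 8025:8025 mailhog/mailhog
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mailhog
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mailhog
  template:
    metadata:
      labels:
        app: mailhog
    spec:
      containers:
        - name: mailhog
          image: mailhog/mailhog
          env:
            - name: MH_STORAGE        # -e "MH_STORAGE=maildir"
              value: maildir
          ports:
            - containerPort: 1025     # -p 1025:1025 (SMTP)
            - containerPort: 8025     # -p 8025:8025 (web UI)
          volumeMounts:
            - name: mailhog-volume    # -v $PWD/maildir:/maildir
              mountPath: /maildir
      volumes:
        - name: mailhog-volume
          persistentVolumeClaim:
            claimName: mailhog-pvc    # a PVC on the longhorn storage class
```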

I look away for one minute and it decides to be a ninny. Here are the errors I'm looking at.

Warning 	FailedMount 	Unable to attach or mount volumes: unmounted volumes=[mailhog-volume], unattached volumes=[mailhog-volume default-token-bl6f5]: timed out waiting for the condition 	a few seconds ago

Warning 	FailedAttachVolume 	AttachVolume.Attach failed for volume "pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa" : rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=unable to attach volume pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa to k8s-minion-01: volume pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa not scheduled] from [http://longhorn-backend:9500/v1/volumes/pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa?action=attach] 	a few seconds ago

Warning 	FailedAttachVolume 	AttachVolume.Attach failed for volume "pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa" : rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [detail=, message=unable to attach volume pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa to k8s-minion-01: volume pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa not scheduled, code=Server Error] from [http://longhorn-backend:9500/v1/volumes/pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa?action=attach] 	2 minutes ago

Warning 	FailedAttachVolume 	AttachVolume.Attach failed for volume "pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa" : rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to attach volume pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa to k8s-minion-01: volume pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa not scheduled, code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes/pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa?action=attach] 	4 minutes ago

Normal 	Scheduled 	Successfully assigned default/mailhog-756f65f694-6fqv4 to k8s-minion-01 	5 minutes ago

Warning 	FailedScheduling 	0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims. 	5 minutes ago

Knew it all seemed way too easy. Let me get my troubleshooting plank ready and let's figure out where I went wrong.

According to the longhorn ui:

"Replica Scheduling Failure"

Hmmm. I have a hunch, but that's not really enough to go on. Now where are the logs? Well, it turns out looking at Longhorn logs is super simple. In the bottom left of the UI, there's a link to "Generate Support Bundle". Clicking that zips up a bunch of logs and downloads them. Looking through the longhorn-manager.log file and searching for the PVC, I see the following:

level=error msg="There's no available disk for replica pvc-f83e8f56-7b9f-4014-b0b7-b85d455367fa-r-b2c07a51, size 5368709120"

Okedoke. Well, the VM this is running on has plenty of space. I even cut the size of the PVC down to 2GB, but as expected, that didn't solve the problem. Anyway, on to my hunch: I think I need more nodes. What I mean is, the only error the Longhorn UI is throwing is the one regarding replication, and the logs are showing no disks available. My guess is that I need to add more nodes so that the HA storage system can actually be HA. Easy enough to test. I bring up 2 more minion VMs, run my amazing Ansible playbook, then run an ad-hoc Ansible command on both of the new nodes to get them to join my minions cluster. Easy.

I'm a real cluster now ma!

Oh, would you look at that, everything is working fine now.
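For what it's worth, I believe the other way out of that corner would have been a custom storage class with a lower replica count, since the default Longhorn class asks for 3 replicas, which can never be scheduled on a single node. Something like this, with the parameters as I understand them, so double-check the provisioner name against your Longhorn version:

```yaml
# Hypothetical storage class for single-node testing only - 1 replica means no redundancy.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica
provisioner: rancher.io/longhorn   # newer Longhorn releases use driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "30"
```

Adding nodes is the better fix anyway, since the whole point here is HA.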

I didn't get the same buzz from solving this problem, but it's nice that it works. I'll be reducing the memory allotted to minion-01 to 4GB so the other nodes don't get jealous. I'm also curious to see how my cluster will handle minion-01 going down while I edit the RAM settings. BRB.

Ok, so what I should have done is tested that with some data in the volume, but ignoring that for the time being, things worked as expected. Minion-01 had a quick nap, kubernetes did its thing and mailhog was started on another node. Looking at the longhorn ui, I can see that the volume is now attached to minion-03. Noice.

So, we've done a very simples deployment using the Rancher UI. Before I can sign off on this journey and complete the rest of my migration, I need to understand how to deploy apps in the other 2 ways: using Helm/helm charts, and manually writing a deployment by messing with some oh so fun YAML.

I kinda explained Helm and helm charts earlier. Member how I said Helm is like a package manager for Kubernetes? Well, ya. That's the long and short of it. There are plenty of helm charts available for a whole host of apps. Even some of the services I'm already running, like Heimdall, have helm charts. However, I've been wanting to deploy a Nextcloud instance in my lab so I can deploy Collabora on there. Lucky for me, there's a helm chart ready to go. Now what?

Well, go to your Rancher UI -> Apps -> Manage Catalogs -> Add Catalog -> give it a name; I called mine Nextcloud, just to be different. The catalog URL is the link to the source code. This can be found by clicking the helm chart link for Nextcloud and clicking on Source. There are 2. You want the one for Helm. I left the option for Helm Version on v2. You do you. There's a little snippet on the difference between the 2 here.

That link also has this juicy tidbit of info.

As of Rancher v2.5, the catalog system is deprecated and has been replaced with Apps and Marketplace in the Cluster Explorer.

Well then. I'm on 2.4.8 which is what got pulled when I selected stable, so no biggy. I'll give the "Apps and Marketplace" a play about when I get round to updating my rancher install.

Anyway, with that done, if you go to Apps now and click on Launch, you'll see Nextcloud just chilling there, waiting for you to tell it to get to work. Give it a good ol' fashioned click and let's do some configuratings.

If you read through the configuration section, you'll see there are quite a few things we can modify. On top of this, I'd like to utilise Longhorn for storage. I don't see any obvious clicky things associated with storage, so that'll need to be added into the deployment file before we do any launching. Here's my first attempt at this.

I feel like this is wrong and I should have been editing the values.yml file. Oh well. I'm testing this for you so you don't have to. (Future me says this does work, but play with the YAML file. It helped me better understand what was going on under the hood.)

Ok, yeah that didn't work. That's good. I mean, not as good as it working, but we're learning. So still a mini success.... brb

I fixed it.

I'm back. So I figured out the issue. Do you see where I went wrong? Yeah. You can't use the github repo as the address for the helm chart. The correct one is at

Use this instead of the repo like I said above. Good thing you're not blindly following this "guide" isn't it :D

Also, instead of providing the answers as shown above, I edited the options using the yaml file. If you Click on "Preview" near the bottom, under the answers section, you'll see a dropdown for the template files. Look through that list for


Copy and paste the whole thing into the answers section (after clicking on "Edit as YAML").

The only options I edited are:

nextcloud:
  host: nextcloud.kube.home # change to the fqdn
  username: admin # don't need to change this. Up to you.
  password: changeme # change to a better password. Like Password1

internalDatabase:
  enabled: true # I changed to false. Internal uses SQLite, which shouldn't be used in "production"

mariadb:
  enabled: false # Changed to true to deploy this instead

service:
  type: ClusterIP # Change to NodePort as I want to map an internal port to an external
  port: 8080
  loadBalancerIP: nil
  nodePort: nil # Change to a port in the 30000-32767 range you want this service to be accessible from

persistence:
  storageClass: "longhorn"
  accessMode: ReadWriteOnce
  size: 5Gi

Granted, this may not be the ideal way to do this, but all my load balancing is handled by pfSense, so this works for me. Click Launch, and after a bit, it should be all spun up. Look at it chilling all nicely next to Longhorn.

Now, if you try clicking on the nodeport, you'll probably see this

Pretty self-explanatory. Instead of modifying anything though, I forwarded the domain name and offloaded SSL. With that done, going to the FQDN makes everything work all nice.

Ok, everything seems to work. Now to test some things.

Did the longhorn storage thingy work?


Ok. Next test. Upload a file. Then add the ip addresses for the other 2 nodes to the nextcloud backend in haproxy. Turn off minion-01. See what happens when I try to access the nextcloud instance. You excited? I'm excited.

Ok. Uploaded a very important image.

Add in the ip addresses for the other nodes in the cluster

aaaand tell minion-01 to have a quick nap.



 CrashLoopBackOff: back-off 5m0s restarting failed container=nextcloud pod=nextcloud-88db9f5bb-f6ld4_nextcloud(21d67779-6520-4b83-accd-0ec4627367c0) 

I've been at this for a while. Let me recap what's happened so far.

I can get Nextcloud to work as long as I don't provision persistent storage for its data. The data for the required MariaDB is fine; that sets itself up nicely in Longhorn. Trying to make Nextcloud's data persistent just causes the deployment to fail with

pod has unbound immediate PersistentVolumeClaims.

I've tried a few things to get past this hurdle, but for the time being, I'm stuck. This is unfortunate, as I'd like this to work with Longhorn. Guess I'll be coming back to this once I figure it out.

Where I try things that don't work

There's a horrid phrase about there being more than one way to skin a cat. Firstly, why? Why was that a thing that needed to be known? I get cats are gross and disgusting, and basically the worst, but still.

Secondly, the point is: friendship with Nextcloud through helm chart ended. I have a "New and Improved" plan.

So, what I'm going to do is take a helm chart for mariadb and get that up and running. This will be my new master sql database that everything will connect to and use. Normally I'd spin up a new container for each service, but this should be fine. I'll have the data for this database replicated across 3 nodes. Highly available and dat.

With that done, I'm going to use the Nextcloud docker image, plug it in through the UI, and make it connect to the MariaDB instance. The assumption is that I can get that to work with persistent data. Maybe (most likely) I'm doing something wrong. Maybe the helm chart doesn't want to work with Longhorn for whatever reason. Maybe it's Maybelline. Who knows.

oooooooooooh. Interesting.

So, I set up a dedicated MariaDB pod. It's all running nice and smooth like.

Try doing the same with Nextcloud through the Rancher UI as a workload.

So the problem wasn't directly attached to the helm chart.

I wonder what could be the cause of these shenanigans?

Eh. I'm done for today. Toodles.

Ok ok. I'm going to try deploying one other app that needs a persistent claim through Longhorn. If it doesn't work, then I'm blaming Nextcloud for my woes.


Seems to be a weird issue with longhorn here. Going to shut my cluster down tonight. Tomorrow we'll try again.

Toodles for realsies.

Why do bad things happen to good people?

New day, new problems. Started up my cluster, or tried to, but Rancher decided it didn't want to come back online. It's ok. I'm still learning. Let's take what we know so far and scale it up a little more. This time, I'll recreate both clusters: a 3-node cluster for Rancher, and another 3-node cluster for the minions. Bear with.

Ok cool. I now have a new Rancher cluster with 3 nodes that's properly HA AND running on the latest version, 2.5.1. Added AD integration, sent an ad-hoc Ansible command to all the minions to get them set up as a new cluster, aaaaannnd.........done.

Install longhorn again, and yes the namespace for longhorn-system is nicely auto made for us if we install it from the apps section. Coolio.

Now. Let's try the whole silly nextcloud debacle one more time.

Aaaaaaaand. Nope. I hate this. Seems to potentially be an issue with longhorn that I can't quite pin down. Works fine with mariadb and mailhog, but nextcloud just does not want to play with it.

Ok, let's simplify things. Heimdall. Let's install that with persistent storage on Longhorn and see what happens.

Same problem, but look at what I dug up :3

Warning 	ProvisioningFailed 	failed to provision volume with StorageClass "longhorn": rpc error: code = InvalidArgument desc = access mode MULTI_NODE_MULTI_WRITER is not supported 	3 minutes ago

Lemme try something.....


The smartie-pants among those weirdos who've followed along until now will guess what le problème was.

For persistence for Nextcloud, I was using

accessMode: ReadWriteMany

Nextcloud did not like that.

accessMode: ReadWriteOnce

Works. Just works. I wonder if this will cause problems when I send the node it's running on to sleep? Guess we'll find out soon enough.
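Spelled out as a full claim, the working version looks something like this; the name and size are just illustrative:

```yaml
# The fix in PVC form: Longhorn volumes here want single-writer access.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-data
spec:
  storageClassName: longhorn
  accessModes:
    - ReadWriteOnce   # not ReadWriteMany - that's what triggered MULTI_NODE_MULTI_WRITER
  resources:
    requests:
      storage: 5Gi
```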

So, after all that, I'm back to where I was before: uploading super important selfies to the Nextcloud instance, sending the node to sleep, and seeing if everything is fixed automagically. Bear with.

Beautiful. Just Beautiful. Longhorn + Kubernetes is also quite cool.

Well I'm happy. Let's turn that node back on and see if everything goes back to being green and healthy looking.

They do. Everything self-healed very nicely. Bravo. Jolly good damn good bloody good show ol' chap.

Getting ever so close to being able to call this journey complete. Just one more task left to do.

Writing a deployment from scratch. Like a big boy.

The Final Countdown, do dooo dooooo

I've been doing the readings and readings. While I don't grok this, I know just enough to stumble my way through. First off, there's some important terminology we need to understand so things make sense going forward. The following is a direct copy of the definitions from here.


  • PersistentVolume — A provisioned piece of storage in the cluster (could be local or remote on AWS, for instance).
  • PersistentVolumeClaim — A request for PersistentVolume by a user or service.
  • Pod — Kubernetes does not support creating containers directly and the basic execution object in Kubernetes is the Pod, which may contain one or more containers.
  • ReplicaSet — A resource that is responsible for maintaining a stable state of replicas of a given application.
  • Deployment — Manages the container deployment cycle events, such as rolling updates, undoing changes, pausing, and resuming changes.
  • Service — Responsible for enabling communication between the applications deployed.
  • Ingress — Exposes internal services to the external world.

The definition of each of these resources consists of four main sections, as listed below:

  • apiVersion — The resource’s API versions.
  • Kind — The resource’s type.
  • Metadata — Metadata that should be attached to the resource, such as labels and a name.
  • Spec — Define the specification and configurations of the defined resource.


I've dealt with some of these terms previously while editing the various values files, so that was a good primer for this section.
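To make those four sections concrete, here's about the smallest deployment I could write from scratch. The image and names are purely for illustration:

```yaml
apiVersion: apps/v1        # the resource's API version
kind: Deployment           # the resource's type
metadata:                  # labels and a name
  name: hello
  labels:
    app: hello
spec:                      # the actual specification
  replicas: 1
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: hello
          image: nginx:alpine
```

Every resource in the list above follows this same apiVersion/kind/metadata/spec shape; only the spec contents change.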

Looking through Helm Hub, I can see that a number of the services I want to run already have charts made for them. The vast majority of the rest can be implemented very simply through the Rancher UI. For this reason, I'm drawing a line under this long-winded journey here. I'm going to migrate my services over and get this cluster officially set as production in my fluffy little cloud.

I may revisit writing a deployment from scratch, as I would like to convert the docker-compose file for the vpn container and the services that use it. I feel that's complicated enough to warrant its own post.