Thursday, June 12, 2014

Creating A Ceph Storage Cluster using old desktop computers

Introduction

What I'm using

My place of employment is getting rid of a bunch of old Dell Optiplex 780s in a computer refresh. Typically these would just go to our surplus department to be sold for cheap to anyone who wants one. Since none of that money ever makes it back to our department, it's of little consequence to my higher-ups whether they're sold or repurposed.

So I have free rein over several hundred EoL, but still modestly powerful, desktop computers, and I've grabbed four of them to work my way through the Ceph evaluation instructions. Maybe this will be a valid way to repurpose some otherwise in-the-trash hardware, or maybe it will just be a learning tool for me.

Optiplex 780 Specs:
  • Core 2 Quad processor (Q9550)
  • 4GB RAM (1066MHz)
  • 500GB-1TB 7200RPM SATA (2.0, 3Gb/s) drive
    • They came with 1TB drives, but if a drive ever failed, our replacements were often not 1TB
    • One disappointing thing is that the power supply in these units only has one SATA power connector, so I can't hook up a second drive - at least not easily.

Setup Process

I'm writing this as I go, and may or may not feel like editing it later, so bear with me - this is very much a train-of-thought post.

Note from the future: Setup has not been as quick as the quick setup guide would lead you to believe, so I'm splitting this into multiple posts. This post gets through the very basic setup - getting a cluster with two OSDs to an "active+clean" state. Further information on expanding the cluster and setting up file shares is coming soon (now available here). I'm giving this a quick once-over now, but, barring any glaring errors, it will remain largely as it is.

OS Install

I'm using CentOS 6.5 x86_64 - the minimal installer. I'm using CentOS because it's what I'm most familiar with. However, I hope to use btrfs (because this is an experiment, and what's an experiment without experimental software?), which requires a newer kernel, so I'm going to have to figure out how to do a kernel upgrade as well - something I've never done before, so that should be fun.

I'm using the minimal installer because GUIs are for jerks, etc. - but mostly because I don't want a bunch of unnecessary programs chewing up resources. I'm using 64-bit because, seriously, who uses 32-bit stuff anymore? The processor is 64-bit, but I'm not positive the Dell MoBo/BIOS is truly 64-bit. Either way it should be fine.

The only thing special I'm doing in the install process is leaving a large portion of the drive unformatted to become the btrfs partition later. I'm also not creating a swap partition, because using the HDD as RAM on a storage device seems a bit silly. (CentOS ended up creating a tmpfs partition anyway against my will - I'll probably remove that when I can be bothered.)

Partition table ended up looking like this:
  • 250MB /boot ext4
  • 10 GB / ext4
  • 8 GB /home ext4
  • ~350GB free to be used later
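
For the record, I set this up in the installer, but a rough command-line equivalent would look something like the following with parted - the device name and exact boundaries are assumptions to match the table above:

parted /dev/sda mklabel msdos
parted /dev/sda mkpart primary ext4 1MiB 251MiB        #/boot (250MB)
parted /dev/sda set 1 boot on
parted /dev/sda mkpart primary ext4 251MiB 10491MiB    #/ (10GB)
parted /dev/sda mkpart primary ext4 10491MiB 18683MiB  #/home (8GB)
#the rest of the disk is left unallocated, to become the btrfs partition later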

OS setup

Most of these steps are fairly routine, but I figured I'd include them here just for posterity's sake - and maybe so it's more evident, when something goes wrong later, what I screwed up.

vi /etc/sysconfig/network-scripts/ifcfg-eth0
#disabled NetworkManager on eth0
#set ONBOOT to yes
service network restart
#eth0 is now up and has an ip
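#for reference, the relevant lines in ifcfg-eth0 ended up roughly like this
#(the addressing below is just a placeholder - static vs DHCP depends on your network):
#    DEVICE=eth0
#    ONBOOT=yes
#    NM_CONTROLLED=no
#    BOOTPROTO=static
#    IPADDR=192.168.1.10
#    NETMASK=255.255.255.0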

#get all the latest things
yum upgrade 
#get dependencies for kernel upgrade
yum install gcc ncurses ncurses-devel

#create myself a user
useradd myuser
passwd myuser

#disable root login over ssh
vi /etc/ssh/sshd_config
#change PermitRootLogin from "yes" to "no"
service sshd restart

#so I don't have to download and scp 
yum install wget

#download kernel source
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.6.11.tar.bz2
#I'm using 3.6.11 here because that is what Ceph currently recommends -- "latest in the 3.6 stable"

#to avoid redundancy I'll just post the link to the steps I'm following for updating the kernel
#http://www.tecmint.com/kernel-3-5-released-install-compile-in-redhat-centos-and-fedora/
#Note: I had to install perl to get the compile to complete
yum install perl

#Once the new kernel is installed, reboot and press a key during the "Booting CentOS in ..." screen to show the new 3.6.11 boot option
#I edited grub.conf to make 3.6.11 the default so I don't have to remember to select it each boot
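#with GRUB legacy on CentOS 6 that just means changing the "default=" line in
#/boot/grub/grub.conf to the index of the 3.6.11 entry (entries count from 0, top down), e.g.:
#    default=0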

#Add the ceph repo - follow the instructions on the Ceph website, under "Red Hat Package Manager"
#http://ceph.com/docs/master/start/quick-start-preflight/

Cloning image

So, obviously I don't want to have to do all of the above on each machine (3 minimum), so I want to clone the disk. But it's a 500GB drive, and I don't want to wait for dd to run on each machine. So I found a possibility here that I'm going to try. The theory is to fill the empty parts of the drive with zeros so the image compresses well with gzip. This will be easy with my existing partitions, but I'll have to create a temporary partition to zero out the unused space. If I had thought of this before, I could have zeroed the disk before the install, but live and learn, I suppose.

So I created a partition, formatted it to ext4, mounted it at /temp, and then issued

cat /dev/zero | tee -a /zero.txt /home/zero.txt /temp/zero.txt

to zero out all the unused space on each partition. This took a long time. After that I ran:

rm /zero.txt /home/zero.txt /temp/zero.txt
dd if=/dev/sda bs=4M | gzip > /external/CephImage.gz

Where /external is an external drive I've attached to the machine to hold the image. This also takes a long time - a little over 3 hours, to be precise. But I ended up with an image that was 3.3GB rather than 400GB - a significant savings. Seriously, that's some ridiculous compression; I'm a little worried something will turn out corrupted when I write the image back... we'll see, I guess.

Now I plug in a bare drive and begin the opposite process

dd if=CephImage.gz bs=4M | gunzip > /dev/sdc

where /dev/sdc is an unformatted bare drive I plugged in. This, again, will take a while. I'm actually wondering if this will take longer than a standard dd, because it now has to decompress the whole thing and write it. But it's still worth it if it means having a 3GB image rather than a 400GB one.

A little longer, but not by much.
...and it boots!

It's not a great clone method; ~3 hours does not make for rapid deployment, but it should suit my purposes here. I don't know that this actually saved any time over a standard "dd if=/dev/sda of=/dev/sdc", but it does at least give me an image backup in case something happens.

Setting Up Ceph


Many hours later, I've got some cloned hard drives.

Now I install ceph-deploy on my main (admin) machine:
yum install ceph-deploy

Boot up the first node (Ceph-Node1) and follow the Preflight Checklist to get it ready. I've moved to a private network, so I set the hosts up manually in the hosts file. Then I used ceph-deploy to install to each node:
ceph-deploy new Ceph-Node1
ceph-deploy new Ceph-Node2
ceph-deploy new Ceph-Node3

At this point, as I went to set up the partitions with btrfs, I noticed I hadn't installed the btrfs userspace programs. While btrfs support is built into the kernel, the programs to actually use it are not, so I had to install those on each node (since I'm on a private, non-internetted network now, I downloaded the rpm from pkgs.org and used a flash drive to get it to each node).
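
For reference, the userspace package is btrfs-progs, so on a machine that can reach the internet this would just be the yum line below; on my disconnected nodes it was something like the rpm line instead (the filename is a placeholder for whatever version you grab from pkgs.org):

yum install btrfs-progs
rpm -ivh btrfs-progs-<version>.el6.x86_64.rpm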

Recreated /dev/sda4 by deleting and re-adding it with the full remaining space of each drive (again, this varies drive-to-drive based on what I had lying around). Then used mkfs.btrfs to format it, and edited /etc/fstab to make it mount on boot.
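
In rough commands, that's something like the following on each node (assuming the big partition is /dev/sda4 and mounting it at /ceph, which is the directory I use later):

mkfs.btrfs /dev/sda4
mkdir /ceph
#line added to /etc/fstab:
#    /dev/sda4    /ceph    btrfs    defaults    0 0
mount -a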

Hmmm, so it looks like I followed the wrong page before. "ceph-deploy new" sets up a new cluster with an initial monitor node - it doesn't install Ceph on each node - so I purged everything and started over via the instructions at the start of the Storage Cluster Quick Start guide, which I'll be following from here on.

So, now correctly, I do:

ceph-deploy new Ceph-Node1
#It knows the correct user for Ceph-Node1 via the ~/.ssh/config file

This designates Node1 as the initial monitor node for the cluster.
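
For the curious, the ~/.ssh/config on the admin machine is just one block per node, something like this (the user is whatever non-root account exists on the nodes - mine from earlier is myuser):

Host Ceph-Node1
    Hostname Ceph-Node1
    User myuser
Host Ceph-Node2
    Hostname Ceph-Node2
    User myuser
Host Ceph-Node3
    Hostname Ceph-Node3
    User myuser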

I'm not sure I'll ever understand how Linux user context works. ceph-deploy doesn't like being run as root or with sudo, so I had to log in with a non-root account, then run "su" to get root permissions - but not "su -", so I'm still in the other user's environment, just with root permissions. Trying to run it straight as root gives errors (paradoxically) saying the command must be run as root. This does actually make sense: it's the remote machine that needs root, and for whatever reason ceph-deploy won't run as root remotely if it's root locally.... anyway, so now running:

ceph-deploy install Ceph-Node1 Ceph-Node2

Gives me an error that it can't get a valid baseurl for the repo. Fantastic. I'm trying to set this up on a private, non-interneted network, and now it wants internet.

After some trying, I've decided a proxy is probably the way to go for this. Trying to resolve all the dependencies and download all the requisite .rpm files myself is proving too tiresome. Luckily I've set up proxy servers (with squid) before, so hopefully this won't be too bad. I'm not going to post all the steps involved with that; there are squid guides elsewhere, and it would just clutter this already cluttered post.

With the proxy server set up, I've found that ceph-deploy install does not appear to respect the http_proxy settings in ~/.bash_profile (I say this because I can wget things from the internet, but when ceph-deploy tries, it fails). So I've had to set proxy settings in /etc/yum.conf, /etc/wgetrc, and /root/.curlrc to get it to complete. Well, that installed it on the admin machine (the one with ceph-deploy installed), so now we've got to get it onto the nodes.... Yep, all three of those files have to be set on each node (.curlrc must be in /root), but it's working at least.
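
Roughly, the lines that need to go into each of those files look like this (the proxy address here is a placeholder for wherever your squid box lives; 3128 is squid's default port):

#/etc/yum.conf, under [main]:
#    proxy=http://192.168.1.1:3128
#/etc/wgetrc:
#    use_proxy = on
#    http_proxy = http://192.168.1.1:3128/
#    https_proxy = http://192.168.1.1:3128/
#/root/.curlrc:
#    proxy = http://192.168.1.1:3128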

OK, Ceph is installed on all nodes... now back to the storage cluster quick start to continue.

"ceph-deploy mon create-initial" runs with no issues.

"ceph-deploy osd --fs-type btrfs prepare <node>:/ceph" runs with no issues - /ceph is the directory I've mounted the btrfs partition to. I specified btrfs because it defaults to xfs. Just did this to Node1 and Node2 for now, as per the instructions.

"ceph-deploy osd activate <node>:/ceph" ran fine on Node1, but seemed to hang on Node2, eventually timing out with a "received no response in 300 seconds" type error. ...Got it: the default iptables rules were in place and apparently blocking communication between Node2 and the monitor node (Node1). Turned off iptables on all hosts and it worked. Presumably Node1 worked because it's also the monitor node, so the firewall wasn't an issue.

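The quick-and-dirty fix on each node is below; the proper fix would be to open just the ports Ceph uses (6789 for monitors, and the 6800+ range for OSDs):

service iptables stop
chkconfig iptables off
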
Followed the rest of the steps in the storage cluster quick start. Having an issue with checking cluster health from anything other than the monitor (Node1). The problem appears to be the monitor service continually shutting down because of space issues..... The monitor generated ~900MB of logs very quickly, which (combined with the other installs) filled up the '/' partition (I had only partitioned 20GB for it). Cleaned some stuff up and am trying again.

Note: I found this out by looking at /var/log/ceph/ceph-mon-Ceph-Node1.log, which showed: "<...>reached critical levels of available space on local monitor storage -- shutdown!"

Why is this log growing so fast?!? Literally several MB of logs a second.

Found (via Google) that adding a "debug paxos = 0" line to /etc/ceph/ceph.conf stops the log from logging a million messages a minute (never thought I'd get to say that non-hyperbolically). Seems like a good feature. Added that under "[global]", stopped the service, removed the current (several GB) log file, and started the service back up. The log file is a much more manageable size now.
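
So the monitor's /etc/ceph/ceph.conf now contains the fragment below, and the stop/clean/start dance was roughly the following (the "mon.Ceph-Node1" daemon name is an assumption based on my hostname - adjust to taste):

#in /etc/ceph/ceph.conf:
#    [global]
#    debug paxos = 0
service ceph stop mon.Ceph-Node1
rm /var/log/ceph/ceph-mon*.log
service ceph start mon.Ceph-Node1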

So, the mon node is online now, and I can query information about the cluster from other machines, so they're talking OK - but my cluster is still in an "active+degraded" state (and has been for 12 hours or so at this point - I went home between this paragraph and the previous one). "ceph -s" gives the following information:

192 pgs degraded, 192 pgs stuck unclean
2 osds, 2 up, 2 in

According to the wiki, "unclean" indicates that pgs (placement groups) have not been replicated the minimum number of times. It's showing both OSDs I've created so far - or I'm assuming that's what "2 up" means (checked the wiki; that is what it means) - so it seems likely that the number of replications is set too high, i.e. higher than 2.

Sure enough, running "ceph osd dump | grep 'replicated size'" showed all 3 pools (data, metadata, rbd) with a size of 3 ("size" apparently being the code for "number of replicas I should have"). So I issued the following command for each pool:

ceph osd pool set <pool> size 2
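
Or, to hit all three default pools in one go:

for pool in data metadata rbd; do ceph osd pool set $pool size 2; done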

After doing that and waiting a minute, the cluster is now showing "active", but not "active+clean" the way it's supposed to. It still has the 192 pgs stuck unclean, but no more pgs degraded.... Found a solution: shutting down the OSD on one node, leaving it down for a bit, then restarting it got the cluster to come back in a clean state... a little troubling, but what are you going to do. Here's the email archive I found the solution in.
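
With the sysvinit scripts, that's something like the following, run on the node hosting that OSD (the osd id here is an assumption - "ceph osd tree" will show which id lives where):

service ceph stop osd.1
#wait a bit, watch "ceph -w" from the monitor, then
service ceph start osd.1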

So, hurray, I have an "active+clean" cluster now, and I can continue with the "quick" start guide. The next step is adding additional OSDs and monitors. Neat. This seems like a semi-natural place for a break. Stay tuned for the post where I expand the cluster, add more monitor nodes, and set up block devices, file shares, etc.
