Glusterfs Distributed File System on Amazon EC2
A lot of people are using Amazon EC2 to build web site clusters. The EBS storage provided is quite reliable, but you still need a clustered file server to present the same set of files to all the servers.
Unfortunately AWS doesn’t support floating virtual IPs, so the usual solution of NFS servers on a virtual IP managed by Heartbeat (or similar) is simply not available. There is a cookbook for a Heath Robinson approach using vtunnel etc., but it has several problems, not least its complexity.
Fortunately there’s GlusterFS. Gluster is mainly built for very large, petabyte-scale storage problems, but it has features that make it a perfect fit as a distributed file system on Amazon EC2:
- No extra meta-data server that would also need clustering
- Highly configurable, with a “stacked filter” architecture
- Not tied to any OS or kernel modules (except fuse)
- Open Source
I use Ubuntu on EC2, so the rest of this article will focus on that, but Gluster can be used with any OS that has a reliable FUSE module.
I’ll show how to create a system with 2 file servers (known as “bricks”) in a mirrored cluster with lots of clients. All gluster config will be kept centrally on the bricks.
At the time of writing the Ubuntu packages are still on the 2.* branch (though v3.0.2 of gluster will be packaged into Ubuntu 10.04 “Lucid Lynx”), so I’ll show how to compile from source (other installation docs can be found on the gluster wiki, but they tend to be a bit out of date).
To compile version 3.0.3 from the source at http://ftp.gluster.com/pub/gluster/glusterfs:
apt-get update
apt-get -y install gcc flex bison
mkdir /mnt/kits
cd /mnt/kits
wget http://ftp.gluster.com/pub/gluster/glusterfs/3.0/3.0.3/glusterfs-3.0.3.tar.gz
tar fxz glusterfs-3.0.3.tar.gz
cd glusterfs-3.0.3
./configure && make && make install
ldconfig
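Before removing the compilers it’s worth a quick sanity check that the build actually installed (with the default prefix everything lands under /usr/local):
glusterfs --version
which glusterfsd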
Clean up the compilers:
apt-get -y remove gcc flex bison
apt-get autoremove
This is done on both the servers and the clients, as the codebase is the same for both, but on the clients we should prevent the server from starting by removing the init scripts:
# only on the clients
rm /etc/init.d/glusterfsd
rm /etc/rc?.d/*glusterfsd
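The client mounts go through FUSE, so while you’re on the clients it’s also worth checking the kernel module loads (stock Ubuntu kernels ship it as a module, but checking is cheap):
modprobe fuse
lsmod | grep fuse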
It’s also useful to put the logs in the “right” place by default on all boxes:
[ -d /usr/local/var/log/glusterfs ] && mv /usr/local/var/log/glusterfs /var/log || mkdir /var/log/glusterfs
ln -s /var/log/glusterfs /usr/local/var/log/glusterfs
And clear all config:
rm /etc/glusterfs/*
OK, that’s all the software installed; now to make it work.
As I said above, gluster is configured by building a set of “volumes” out of a stack of “translators” (there’s a sketch of the general block syntax after the two lists below).
For the server side (the bricks) we’ll use the translators:
- storage/posix
- features/locks
- performance/io-threads
- protocol/server
and for the clients:
- protocol/client
- cluster/replicate
- performance/io-threads
- performance/io-cache
(in gluster trees the root is at the bottom).
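Each translator is declared in a .vol file as a volume block and wired to the one beneath it with a subvolumes line; the general shape looks like this (the names here are just placeholders, the real files follow below):
volume my_volume_name                # placeholder name
  type category/translator-name      # e.g. storage/posix or protocol/client
  option some-option some-value      # translator-specific options
  subvolumes volume_below            # the translator(s) below this one in the stack
end-volume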
I’ll assume you’ve configured an EBS partition of the same size on both bricks and mounted each as /gfs/web/sites/export.
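If you haven’t got that far yet, it’s just a normal EBS attach, format and mount on each brick; a minimal sketch, assuming the volume shows up as /dev/sdf and you’re happy with ext3:
# /dev/sdf and ext3 are assumptions - use whatever device and filesystem you attached
mkfs -t ext3 /dev/sdf
mkdir -p /gfs/web/sites/export
echo "/dev/sdf /gfs/web/sites/export ext3 noatime 0 0" >> /etc/fstab
mount /gfs/web/sites/export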
To export the storage directory, create a file /etc/glusterfs/glusterfsd.vol on both bricks containing:
volume dir_web_sites
  type storage/posix
  option directory /gfs/web/sites/export
end-volume

volume lock_web_sites
  type features/locks
  subvolumes dir_web_sites
end-volume

volume export_web_sites
  type performance/io-threads
  option thread-count 64 # default is 1
  subvolumes lock_web_sites
end-volume

volume server-tcp
  type protocol/server
  option transport-type tcp
  option transport.socket.nodelay on
  option auth.addr.export_web_sites.allow *
  option volume-filename.web_sites /etc/glusterfs/web_sites.vol
  subvolumes export_web_sites
end-volume
NB. the IP authentication line option auth.addr.export_web_sites.allow * is safe on EC2, as you’ll be using EC2 security groups to prevent others from accessing your bricks.
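If you want to confirm the bricks are reachable from the clients (and only from them), you can probe the gluster port from a client box once you’ve restarted the servers below; the 3.0 branch listens on TCP 6996 by default, but check the listen port in the server log if you’ve changed it:
# run from a client; 6996 is the default port on the 3.0 branch (check your server log if unsure)
nc -zv brick1.my.domain 6996
nc -zv brick2.my.domain 6996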
Create another file /etc/glusterfs/web_sites.vol on both bricks containing the following (replace brick1.my.domain and brick2.my.domain with the hostnames of your bricks):
volume brick1_com_web_sites
  type protocol/client
  option transport-type tcp
  option transport.socket.nodelay on
  option remote-host brick1.my.domain
  option remote-subvolume export_web_sites
end-volume

volume brick2_com_web_sites
  type protocol/client
  option transport-type tcp
  option transport.socket.nodelay on
  option remote-host brick2.my.domain
  option remote-subvolume export_web_sites
end-volume

volume mirror_web_sites
  type cluster/replicate
  subvolumes brick1_com_web_sites brick2_com_web_sites
end-volume

volume iothreads_web_sites
  type performance/io-threads
  option thread-count 64 # default is 1
  subvolumes mirror_web_sites
end-volume

volume iocache_web_sites
  type performance/io-cache
  option cache-size 512MB # default is 32MB
  option cache-timeout 60 # default is 1 second
  subvolumes iothreads_web_sites
end-volume
and restart glusterfs on both bricks:
/etc/init.d/glusterfsd restart
Check /var/log/glusterfs/etc-glusterfs-glusterfsd.vol.log for errors.
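It’s also worth confirming the daemon is actually running and listening before moving on to the clients:
ps -C glusterfsd
netstat -tlnp | grep glusterfsd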
On the clients, add an entry to /etc/fstab to mount the gluster volume:
echo "brick1.my.domain:web_sites /web/sites glusterfs backupvolfile-server=brick2.my.domain,direct-io-mode=disable,noatime 0 0" >> /etc/fstab
Then create the mount point and mount the partition:
mkdir -p /web/sites
mount /web/sites
Check /var/log/glusterfs/web-sites.log for errors.
And you’re done!
The output of df -h should be something like this (though your sizes will be different):
bash# df -h
Filesystem Size Used Avail Use% Mounted on
...
brick1.my.domain 40G 39G 20M 0% /web/sites
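As a final check that replication is really happening, touch a file through the mount on a client and make sure it appears in the backing directory on both bricks (the file name below is just a placeholder, and this assumes you can ssh to the bricks):
touch /web/sites/gluster-test-file
ssh brick1.my.domain ls -l /gfs/web/sites/export/gluster-test-file
ssh brick2.my.domain ls -l /gfs/web/sites/export/gluster-test-file
rm /web/sites/gluster-test-file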
In another post I’ll pontificate on tuning gluster performance, why I chose this particular set of filters, and what the options mean.