- 3 minutes read #project management

Can you do it?

This post was originally written while I was at LShift / Oliver Wyman

You’re somewhere in the middle of an Agile project. As usual, once you’ve actually started development, the true nature and scope of the project are becoming clear, and the client asks ‘Can you do <some “clarified” feature>?’.

This, of course, is a linguistic trap. As programmers we can do anything, so we confidently say yes, but the question they’re really asking is “WILL you do this?”, with a side order of “as well as all the other stuff”. It’s related to the other trick question: “Can we bring this task/feature/bug into the sprint (as well as all the other stuff)?”

As Agile practitioners (all LShift lead developers are trained in DSDM), to stop project expectations/scope/costs/deadlines spiralling out of control, the conversation needs to encompass ALL four of these questions:

  • Can we do it?
  • Should we do it?
  • When should we do it?
  • What shall we displace so we have time? – or is there a case for more budget to be made available, or for a subsequent phase? (These may signify a change in business goals.)

and it’s the last question that is the hardest.

In manufacturing there is the adage: “Cheap, Quick, Good – pick two”. In Design & Manufacturing (which is what software development is) the model is “Time, Cost, Quality, Features – at most three can be fixed”.

[Diagram: DSDM Atern “pick three” – Time, Cost, Quality, Features]

You always want high quality, so the negotiation has to happen in at least one of the other three areas. But budget is often fixed and a deadline has been promised – so the variable is usually features (if newly discovered items are important, and nothing can be displaced, then new budget must be found).

NB. This is related to the Project Management Triangle though there’s a subtle difference: clients tend to think in terms of budget and deadline whereas implementation teams think in terms of features, velocity (related to team size) and deadline.

In a related post:

“reaching the very end edges of our ability to anticipate how we are going to want things to be and coping with the unknowns we encounter is part of the fundamental nature of developing software.”

True Agile is not just “morning standups and you’re done”; it’s a constant conversation between the client and developers about the relative priority of features. As the project progresses new ideas come to mind, or it becomes apparent that a feature implies more requirements than you thought (or occasionally fewer) – this is perfectly natural. An agile project team can respond to the changing requirements, but an agile project manager needs to lead the client through all four of the replanning questions.

[Comic: xkcd 1425, “Tasks”]

- 5 minutes read #programming , #tools , #water cooler

In defence of integration tests

This post was originally written while I was at LShift / Oliver Wyman

There’s a notion that ‘Integration tests are somehow rubbish and we should replace them with contract tests’ that I wish to reject.

[Embedded video by J.B. Rainsberger]

This video has some straw man arguments that are just wrong:

‘the more integration tests we have the less design feedback we get’

‘writing more integration tests which encourage me to design more sloppily’

Really? Certainly following the smell of a unit test allows one to pinpoint the fault more clearly, but does an integration test “encourage” one to design sloppily?

Notice that, in the video, an application checked with integration tests ostensibly looks like this (around 4′):

[Diagram: the system as a network of components]

whereas one checked with contract tests suddenly looks like this (around 52′):

[Diagram: the system as a tree]

Is that a magical consequence of contract tests or is the author being a little disingenuous?…

So what are integration tests good for? My claim is this:

Integration tests can be used for a formal description of the client’s specification.

So formal, in fact, that they’re executable and therefore automatically and objectively verifiable: unit tests are what keep a developer sane; integration tests prove to the client that you’ve delivered what they asked for. If written correctly they have the side benefit of making the client’s manual UAT reveal fewer functional bugs, simply because there are fewer (GUI designers will always care about the colour of a font – there’s not much that integration tests can do about that). Note that TDD can be applied variously in both cases, but that’s orthogonal to this discussion.
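For example, a story-level integration test might look something like this sketch (Python with pytest and requests; the endpoints, credentials and story are invented for illustration). It drives the deployed system end-to-end over HTTP, so a pass is objective, executable evidence that the story has been delivered:

# All names hypothetical - a sketch, not a real project's test.
import requests

BASE_URL = "http://localhost:8080"  # the running system under test

def test_registered_user_can_place_an_order():
    session = requests.Session()

    # Given a registered user who has logged in...
    resp = session.post(BASE_URL + "/login",
                        data={"user": "alice", "password": "secret"})
    assert resp.status_code == 200

    # ...when they place an order...
    resp = session.post(BASE_URL + "/orders", json={"sku": "X-123", "qty": 1})
    assert resp.status_code == 201

    # ...then it appears in their order history.
    orders = session.get(BASE_URL + "/orders").json()
    assert any(order["sku"] == "X-123" for order in orders)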

So integration tests are essential for these reasons alone, but should we use them for anything more than that?

The ideal structure of a contract test is this (40'55" in the video above):

[Diagram: the ideal structure of contract tests]

But in practice the reality is clearly this:

[Diagram: contract tests as they turn out in practice]

The client-server link is only really tested during integration – the contract tests are completely separated from the client that will actually use the supplier. You can have no formal confidence that the contract tests are testing anything useful for the client code.

Freeman and Pryce note a related example:

The team had been writing acceptance tests to capture requirements and show progress to their customer representatives. They had been writing unit tests for the classes of the system, and the internals were clean and easy to change. They had been making great progress, and the customer representatives had signed off all the implemented features on the basis of the passing acceptance tests.

But the acceptance tests did not run end-to-end – they instantiated the system’s internal objects and directly invoked their methods. The application actually did nothing at all. Its entry point contained only a single comment:

// TODO implement this

– Growing Object-Oriented Software Guided by Tests – Testing end-to-end

This discussion of contract tests makes it clear:

[Diagram: contract tests detached from the real client]

The contract tests (the icons that look like text pages) are completely unrelated to the real client (the yellow icon in the middle).

One useful observation that Stefan Smith does make is that, if your contract tests are written in a very lightweight system – i.e. one that has no or few dependencies and is simple to run – then the other, independent, team writing the supplier may find them useful as a test for their system. This could be a social lever, helping the two teams work on the contract tests together and improving communication between them. Ideally the contract tests would form part of the documentation of the supplier service – depending on configuration, the contract tests can become the integration tests of the supplier.

So which is better? Integration tests or unit tests? And where do contract tests fit in? The answer, of course, is that we need all of them and it depends.

  • Integration tests (“end-to-end”) prove that your system really does fit together properly with its dependencies and, as a side benefit, can be used to express the journeys described in the stories of the client specifications. They are “black box” and not there to exercise the multitude of code paths.
  • Unit tests (“isolation”) are “white box” and there to exercise all corners of a class. Smells from the set-up of these tests can also pressure the developer to reduce the intertwingledness of the class by refactoring.
  • Contract tests are best used if a service you depend on is so slow or flaky that you can’t rely on it in the integration tests. In these cases you have to write a Fake (“stub”) and the contract tests are the only way you can be sure your Fake matches the real supplier.

Note that nearly all unit tests of classes that have injectable dependencies use stubs, almost by definition, through the use of a tool like Mockito or some such. The services they are “Fake”-ing, though, should, in a good design, be simple enough that the stub is clear to the developer and so doesn’t need to be checked with its own contract tests (infinite recursion awaits you there). Definitions of “clear” and “obvious” are bread-and-butter to the developer debate…

So maybe this is where the confusion comes from and this is a non-debate. If your service stubs get so complex you have to call them Fakes then you need the unit tests of the service to check them – and, in fact, have to write them yourself because the source code of the supplier is inaccessible. These Fakes then enable you to write integration tests (“end-to-nearly-end”) that can be run in-memory or in your Continuous Integration system.

Contract tests are there to help you verify reality and help ensure your Fakes are correct – but they’re no guarantee.
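As a sketch of what that verification can look like (service and names are hypothetical): express the contract once and run the identical assertions against both implementations, so the Fake can’t silently drift from reality.

# The contract is written once; the Fake's subclass runs on every build,
# while a real-supplier subclass would run less often (it's slow).
import unittest

class TaxServiceContract(object):
    """Assertions every implementation of the tax service must satisfy."""
    def make_service(self):
        raise NotImplementedError

    def test_gb_has_a_rate(self):
        self.assertEqual(self.make_service().rate_for("GB"), 0.2)

    def test_unknown_country_defaults_to_zero(self):
        self.assertEqual(self.make_service().rate_for("XX"), 0.0)

class FakeTaxService(object):
    """In-memory stand-in used by the fast integration tests."""
    def rate_for(self, country):
        return {"GB": 0.2}.get(country, 0.0)

class FakeTaxServiceTest(TaxServiceContract, unittest.TestCase):
    def make_service(self):
        return FakeTaxService()

# A RealTaxServiceTest(TaxServiceContract, unittest.TestCase) subclass
# would run the same assertions against the live supplier.

if __name__ == "__main__":
    unittest.main()

Any divergence between the Fake and the real supplier then shows up as the same assertion failing on one side but not the other.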

PS. The Freeman and Pryce book is excellent – every serious software developer should read at least sections 1 and 3 of it.

Update

The Pact system is very interesting – it may even be a way of turning slow integration tests into fast unit(ish) tests.

- 4 minutes read #programming , #water cooler

What is Simple?

This post was originally written while I was at LShift / Oliver Wyman

Consider these quotes:

“Any sufficiently complicated program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp” – Greenspun’s Tenth Rule of Programming

“Once you add group by, filter, & join, you can no longer claim to have invented a new query language, only a new dialect of SQL. With worse syntax and no optimizer.” – Carlos Bueno

Recently I’ve needed to review some PHP web development frameworks and I’ve seen the same thing: a new framework is created and the founders claim it’s better because it’s “simple”! Then in v2 they add ORM, database migration, async events, CSRF protection…

It’s this word “simple” that causes all the problems. In my experience, when people say “I want something that’s simple” what they really mean is “I want something I already know or can use without learning anything”. This is a purely subjective measure, which is why (1) it’s impossible for a group of people to agree on what “simple” is, and (2) “simple” changes over time, even for the same person.

In user interface design, for example, you get over this problem by making the application flashy or inviting in some way so the user doesn’t notice that they’re learning a new system. Or you make the “first action” so easy that the user gains confidence in their ability and is willing to invest more time – there’s a reason why the on-off switch on TVs and computers is the biggest and most obvious even though it’s not the button you use the most.

There are many programming languages and systems that have gained huge popularity – Ruby, PHP, nodejs – even though most people agree they are terrible once you get past the easy cases. They are popular because they have “write-simplicity”. I once started some development in nodejs and mongodb and was fairly overjoyed at just how “simple” and expressive things were. After 3 months I hated everything about it and wished I’d used a “decent” language (autovivification in Javascript and mongodb makes for a huge number of impossible-to-find typo-bugs)!

This hatred was due to the lack of “read-simplicity”.

Is it possible for a language, framework, etc. to have both write and read simplicity?

Write Simple:

  • The type system stays out of the way.
  • APIs are easily guessable, consistent, small and complete.
  • Large library integrating with other services.
  • “Simple” syntax (some people think that prefix-notation and huge sequences of closing parentheses is simple – let’s leave that as a matter of opinion…)
  • Easily mockable for unit tests.

Read Simple:

  • Strong-enough type system so your IDE can correctly refactor code, and it’s easy to find the implementation of the method call you’re looking at. (In duck-typing languages the IDE just gives up and shows you every class with a method of that name – in RubyMine this appears to be true even if you’re calling “self”, when it should know exactly what class tree you’re expecting!)
  • Enforces good coding style that reduces or eliminates intertwingledness.
  • Anonymous and first-class functions are excellent – but while compactness can be impressive in algorithm competitions singlelineitis makes the brain hurt when maintaining real code.
  • Very easily mockable for unit tests.

(These criteria are not complete.)

The main distinction between the two lists is the existence of strict typing. Is this really the source of read-simplicity and the bane of write-simplicity? I suspect mutable vs. immutable data structures may also be in there.
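As a minimal illustration of the read-simplicity that typing buys (Python, with names invented here for the example):

class Invoice:
    def render(self):
        return "invoice #42"

def print_document(doc):
    # Duck-typed: a reader (or IDE) can't tell whose render() this calls,
    # so "find implementation" offers every render() in the codebase.
    print(doc.render())

def print_invoice(doc: Invoice):
    # Typed: this provably calls Invoice.render, so navigation and
    # automated refactoring are safe.
    print(doc.render())

print_invoice(Invoice())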

Haskell, Clojure and Erlang are all on my “to learn” list – I’ve already used assembly, Basic, POP-11, Lisp, ML, Prolog, C, Bash, Perl, Python, Javascript, PHP, Ruby and Java – will I find my goal? Ceylon looks interesting…

Why is this important? We all like to argue about which language we think is “best”, but I don’t think that’s useful unless you’re also taking into account the context in which a particular piece of software is being written.

In the startup world a large proportion of projects will fail, and it’s important to find out which as cheaply as possible – you need to drill as many test bores as possible to find the one with oil at the other end. So for your prototype or “first version” it may even be financially irresponsible not to use a write-simple language.

But once a piece of software is taking longer than a couple of months to write or it will be maintained over a long period of time (the sort of project that LShift is normally engaged in) or, particularly, if more than one person is writing it then it becomes more and more essential to use a read-simple language. Hence the desire to find a language that is both.

Update 2018: For some reason Ceylon still has not caught on, but Kotlin is superb - any Java team should transition to Kotlin…

- 4 minutes read #Mozilla , #developer

Automated javascript unit-tests with xpcshell and Hudson

I’m currently re-writing a Thunderbird plugin – and in the last few years have caught the unit-testing and test driven development bug… So, how do I make my life easy by integrating Hudson and Thunderbird?

It turned out to be surprisingly difficult; here are lots of instructions plus a download.

First job was to find a javascript interpreter and unittest framework:

  • jsunit – jsunit is no longer actively maintained and has become Jasmine.
  • Jasmine – tries to be a whole way of life, very very young, almost no documentation whatsoever.
  • jstest – no longer maintained and has a fatal version dependency conflict: jstest requires version 1.6R5 of js.jar but envjs requires 1.7R2 or later…
  • rhinounit – rhino is an implementation of javascript in java. Rhinounit has a really horrible output format that dumps the entire java call-stack when a test fails.
  • xpcshell – is a command-line version of the javascript in firefox and thunderbird. It provides a full javascript browser environment including XMLHttpRequest implementations, so envjs is not needed. Also includes runxpcshelltests.py for executing tests.

So xpcshell it is (believe me – that took much longer to research than you took to read it!).

You need to compile a mozilla thunderbird package on your Hudson server to get access to xpcshell. These instructions are boiled down from Simple Thunderbird build. Note that my version does not have debug enabled – this is deliberate and important.

apt-get build-dep thunderbird
apt-get install mercurial libasound2-dev libcurl4-openssl-dev libnotify-dev libiw-dev autoconf2.13
mkdir -p /opt/kits/thunderbird
cd /opt/kits/thunderbird

# this takes a minute or two
hg clone http://hg.mozilla.org/releases/comm-1.9.2/
cd comm-1.9.2

# this takes several minutes
python client.py checkout

# edit/create .mozconfig and enter
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdir-tb
mk_add_options MOZ_MAKE_FLAGS="-j4"
ac_add_options --enable-application=mail

# this takes ages, 2hrs on an EC2 m1.small! Come back tomorrow...
make -f client.mk

runxpcshelltests.py has a very non-standard output format, so I’ve implemented a set of plugins for TAP and jUnit output formats – download runxpcsheltests.tgz. This is a drop-in replacement for /opt/kits/thunderbird/comm-1.9.2/mozilla/testing/xpcshell (if you’ve followed the build instructions above), but you can unpack it anywhere on your Hudson server – for example, if you have a source directory then create a directory “scripts” and unpack the tgz file in it. This is also the reason for building mozilla without debug: if debug is enabled then xpcshell prints out various usage information that can’t be trapped and excluded from the formatted test output.

Create a directory test/xpcshell in your source root and create a file all.sh in it containing the following:

#!/bin/bash

D=`dirname $0`
X=$D/../../scripts/xpcshell

/usr/bin/python2.6 -u /opt/kits/thunderbird/comm-1.9.2/mozilla/config/pythonpath.py\
   -I/opt/kits/thunderbird/comm-1.9.2/mozilla/build\
   $X/runxpcshelltests.py\
   --output-type=junit --no-leaklog --no-logfiles\
   /opt/kits/thunderbird/comm-1.9.2/objdir-tb/mozilla/dist/bin/xpcshell\
   $D

Now you can add test files to that directory, e.g. test_001_pass.js:

function run_test() {
        do_check_true(true);
}

The do_check_true function effectively checks against “arg == true” so I also created a head_test_funcs.js file in that directory to add more testing functions, e.g.:

function do_check_trueish(item, stack) {
  // default to the caller's stack frame so failures point at the test
  if (!stack)
    stack = Components.stack.caller;

  var text = item + " a true-ish value?";
  if (item) {
    // _passedChecks and xpcshell_output are provided by the test harness
    ++_passedChecks;
    xpcshell_output.pass(stack, text);
  } else {
    do_throw(text, stack);
  }
}

The last step is to integrate with Hudson. Click on the Configure link in a Hudson job. In the Execute Shell section add the line

trunk/test/xpcshell/all.sh > report_xpcshell.xml

In the Post-Build Actions section tick on Publish JUnit test result report and in the Test Report XMLs section enter

report_*.xml

If you’re already using JUnit tests then you may need different output file names to suit.

Groovy!  We can now do automated unit/regression testing on plugin base classes! The next step is to figure out how to provide the xul document environment and perform functional testing like Selenium does for browsers…

NB. I’d really like a Mozilla developer to pick up runxpcsheltests.tgz and drop it into the current Mozilla system – standardised test output is an item on the mozilla software testing wishlist.

Update: the mozilla team have taken this up as bug 595866.

- One minute read #developer

restricting ubuntu/apt cassandra version to 1.1.x

I’m using the datastax version of cassandra and installed it with the command:

apt-get install cassandra=1.1.9

and, once you do that, apt-get is good about not upgrading any further at all.

But this morning I wasted several hours with hung software on my development machine until I spotted that “Ubuntu software updater” had upgraded my cassandra to 1.2.x! AGAIN! ARGH!

After some research this does the trick.

1. create a file /etc/apt/preferences.d/cassandra

2. in it add the lines:

Package: cassandra
Pin: version 1.1.*
Pin-Priority: 1000

3. apt-get update

From now on upgrades should only get the 1.1.x versions (it’s now at 1.1.11). You can check this with:

apt-cache policy cassandra

This works fine for the “user friendly” updater too.

- One minute read #Mozilla , #poetry , #technology pontification

Mozilla Haiku

In honour of the Mozilla QA Haiku list:

“Current tests are odd
Use jUnit for output
Many tools are free”

- 4 minutes read #developer , #technology pontification

Compare sql vs. nosql

There’s been a meme going around recently that SQL and relational databases are somehow “too complicated”, antiquated and “old hat” and should be replaced with something simpler and therefore more efficient.

This opinion is misguided (and perhaps slightly juvenile). Nevertheless, a kind of “NoSQL” movement formed which has created some very useful things in the Distributed Hash Table (DHT) space. (In a video on Cassandra, Eric Evans claims to have invented the term NoSQL and wishes he hadn’t!)

I hope to show that SQL and DHT (NoSQL) systems are complementary to each other and not in competition.

Useful data storage systems have “ACID” characteristics (Atomicity, Consistency, Isolation, Durability). SQL systems are very strong on Atomicity, Consistency and Isolation and can also achieve “5 nines” or more reliability in terms of Durability. But, even with highly partitioned data stores, the Consistency requirements often prove to be a bottleneck in terms of performance. This can be seen as an impact on Durability – i.e. database performance under sufficient write load can drop to a point where the database is effectively unavailable.

Sharding – completely splitting the database into isolated parts – can be used to increase performance very effectively, but Consistency, and queries that require access to the whole database, can become costly and complicated. In the latter case a proxy is usually required to submit the same query to all shards and then combine the results before returning them to the client. This can be very inefficient when making range queries.

DHT systems trade Atomicity and Consistency even further for more Durability under load (i.e. performance scaling). Strictly speaking NoSQL can be implemented by a simple hash table on a single host – e.g. Berkeley DB – but these implementations have no scaling capability so are not included in this discussion.

SQL implementations include: MySQL, Oracle, PostgreSQL, SQL Server etc. DHT implementations include: Cassandra, HBase, membase, voldemort etc. MapReduce implementations (e.g. Hadoop) are a form of DHT, but one that can trade key uniqueness for the speed of “stream/tail processing”.

The two approaches compare like this:

  • Consistency – SQL: immediate (blocking) consistency. DHT: eventual consistency – reads don’t wait for a write to completely propagate; last write wins, conflict resolution on read, etc.
  • Transactions – SQL: transactional. DHT: multiple-operation transactions must be implemented in the application.
  • Scaling – SQL: scale write performance by partitioning (utilising multiple disk spindles), with writes going to a privileged master or master cluster (which may also service reads); scale read performance by “fan out” to multiple read slaves replicating from the master. DHT: all nodes are functionally equal, with no privileged “name” or meta nodes; scale reads and writes by adding new nodes (preferably heterogeneous).
  • Data model – SQL: relational, with indexes available on multiple columns (one column optionally a “primary” unique key). DHT: non-relational, single-index key-value stores (“column family” DHT systems are just an extension of the single key).

The metric is then quite simple: if high capacity (data volume or operations per second) is required, data is only ever accessed by primary key, and eventual consistency is good enough, then you have an excellent candidate for storage in a DHT.

Other relational storage can be replaced with DHT systems, but only at the cost of denormalising the data (structuring it for reads rather than writes) – this should probably be avoided! You can, however, use a DHT to speed up an RDBMS with regard to the storage of blobs. Some RDBMSs have a separate disk space for blobs; some include them in the normal storage space along with the rest of the data. If you have a DHT to hand, another technique is to split each update into two halves: the first uses the RDBMS to store the simple, relational data and returns a primary key; the second stores the blobs in the DHT against that primary key instead of in the RDBMS. This keeps the write path, and any associated locking, in the RDBMS as short as possible.
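A sketch of that split write (Python; sqlite3 stands in for the RDBMS, and a plain dict stands in for the DHT client – a real one would be e.g. a Cassandra client with an equivalent get/put shape):

import sqlite3

dht = {}  # stand-in for the DHT: primary key -> blob

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")

def save_document(title, blob):
    # 1. A short relational write of only the simple, indexed columns,
    #    keeping the RDBMS transaction (and any locking) brief.
    cur = db.execute("INSERT INTO docs (title) VALUES (?)", (title,))
    db.commit()
    doc_id = cur.lastrowid
    # 2. Store the bulky blob in the DHT against the returned primary key.
    dht[doc_id] = blob
    return doc_id

def load_document(doc_id):
    (title,) = db.execute("SELECT title FROM docs WHERE id = ?",
                          (doc_id,)).fetchone()
    return title, dht.get(doc_id)

doc_id = save_document("report", b"...many megabytes of blob...")
print(load_document(doc_id))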

Update Sept 22nd, useful links:

- 2 minutes read #gluster , #nagios

Monitoring gluster with nagios

There’s little info on the web about how to monitor a glusterfs brick with nagios (or any other tool). There is a hint of a gluster utility script – http://www.mail-archive.com/gluster-devel@nongnu.org/msg06928.html – but it’s not available in the source package.

It also needed some updates to make it suitable for nagios. I hope the gluster devs take my version of the script, along with these instructions, and add them to the main gluster source…

These instructions are for a default nagios3 installation on ubuntu karmic with gluster 3.0.3 compiled from source, so you may need to edit this for your site.

Download this script (glfs-health.sh) and store it somewhere useful:

wget http://www.sirgroane.net/downloads/glfs-health.sh --output-document=/usr/local/bin/glfs-health.sh
chmod u+x /usr/local/bin/glfs-health.sh

Assuming a simple TCP gluster install we can set up a nagios command like this:

echo '
define command{
         command_name    check_gluster
         command_line    sudo  /usr/local/bin/glfs-health.sh  $HOSTADDRESS$ 6996 tcp $ARG1$
         }
' >> /etc/nagios-plugins/config/gluster.cfg

Notice the “sudo” in the command? This is because glfs-health.sh has to run as root. To enable this we have to add a line to /etc/sudoers:

echo "nagios  ALL=(ALL)  NOPASSWD: /usr/local/bin/glfs-health.sh" >> /etc/sudoers

Now you can construct a nagios service to monitor the bricks. For example: you’ve created a nagios hostgroup “gluster-bricks” with all the bricks in and they all export a volume “export_data”:

define service {
        hostgroup_name                  gluster-bricks
        service_description             Glusterfsd
        check_command                   check_gluster!export_data
        use                             generic-service
        notification_interval           0
}

Restart nagios and you’re done.

Update:

  • Things have changed since this was written. The Gluster port to use would no longer be 6996. Probably more like 24010 or something.

- 4 minutes read #developer , #gluster

Tuning glusterfs for apache on EC2

The gluster installation described in a previous post is being used for a webserver cluster on Amazon EC2, using two storage bricks serving a whole bunch of “client” webservers. I tuned the system with “end-to-end” performance testing using a website load tester rather than worrying about contrived disk-access tests. That, and helpful comments from various devs on the user list, led to the following conclusions.

There’s a large collection of “performance translators” in gluster used for improving speed. Let’s have a look at the ones I didn’t use and why:

  • performance/read-ahead – Probably useful if your server has physical disks as it will minimise disk seeks. But Amazon EBS storage is no doubt a layered storage system with its own caching, so this translator doesn’t offer any speed increase and just gets in the way.
  • performance/write-behind – Same issues as read-ahead. Plus this translator seems to have problems if you try to read a file quickly after writing it.
  • performance/stat-prefetch – Pre-fetches and caches file stat information when a directory is read. Speeds up operations like ls -l but apache never needs that so it just gets in the way.
  • performance/quick-read – Uses a feature of the gluster protocol so the whole of a (small) file can be fetched during the lookup phase so opens and reads are not needed. Also caches the file data. Unfortunately it has a memory-leak bug that may be fixed in v3.0.5. Until then it can’t really be used.

These are the performance translators I did use:

  • performance/io-cache – Caches read file data in 128K pages for 1-60 seconds. The page size and maximum cache timeout can be changed in the source. Should only be used in volumes where files are read much more often than they are written because the translator just invalidates a whole 128K page when any part of it is written. This is perfect for website pages though.
  • performance/io-threads – Doesn’t fork extra processes, but does configure a thread pool that allows faster operations to leap-frog blocked ones.

The translator stack I came up with has this layout:

APACHE
   |
performance/io-cache
   |
performance/io-threads
   |
cluster/replicate
   |
protocol/client
  | |
AMAZON NETWORK
  | |
protocol/server
   |
performance/io-threads
   |
features/locks
   |
storage/posix
   |
ext3/xfs/whatever
  | |
AMAZON EBS STORAGE

The philosophy is:

  1. Only use the translators that you can prove actually provide a benefit. Translators are cheap but still get in the way. The gluster volgen command provides a good start for a general server but the volume config can be tweaked more for webservers.
  2. Caching first. It’s quick and should be serving most of the files.
  3. Lots of threads on the client side. Apache is multi-threaded and Amazon EC2 servers are multi-core. Anything we can do to help concurrency to the bricks is a good thing.
  4. Threads on the server side too. I’ve read some articles that say this is a waste. But, in my experience, a large rsync on one client for example can really hold up accesses made from other clients unless io-threads is configured on the server side too. Also, EBSs never “fail” but occasionally they do exhibit huge iowait spikes of 100s of ms. In these circumstances io-threads on the server side mean that a minimum of the clients are kept waiting.
  5. Don’t bother caching on the server side. The kernel will already be caching the filesystem underneath gluster.

The best tip though is to understand the whole architecture of your system and concentrate your optimisation efforts where they will have the most benefit. Seems obvious once it’s said, but it takes some out-of-the-box / holistic / whatever thinking to actually do it.

In the case of an Apache web service, moving from single-server nfs to replicated gluster initially caused pages to take an extra 500ms or much more to load! This was almost a disaster – glusterfs is tuned for big files rather than small… In this case the solution was simple: migrate all .htaccess files into directives in the Apache config, and specify AllowOverride None. This prevented Apache from checking every directory for .htaccess files, and the overhead of gluster was greatly reduced: enough that the sites feel just as responsive as before. When the gluster devs fix the quick-read bug in v3.0.5 then the sites will be even quicker.
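For illustration (the path here is hypothetical), the contents of each .htaccess move into the main Apache config and overrides are switched off:

<Directory /web/sites/example.com>
    # Apache no longer stats every directory looking for .htaccess files
    AllowOverride None
    # ...directives that previously lived in .htaccess go here instead...
</Directory>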

Updates:

Lee Simpson Says:
March 26th, 2010 at 1:07 am

Thanks for the interesting articles.
“performance/stat-prefetch” may be useful depending on what your apache is serving. If you are serving PHP pages which read the contents of folders (e.g wordpress/phpbb) then it could make a big difference. However, I’ve had big problems with “performance/stat-prefetch” and apache and have filed a bug here;
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=762
Once those problems are cleared up in a future release it may be worth taking another look at performance/stat-prefetch

- 4 minutes read #developer , #gluster

Glusterfs Distributed File System on Amazon EC2

A lot of people are using Amazon EC2 to build web site clusters. The EBS storage provided is quite reliable, but you still really need a clustered file-server to reliably present files to the servers.

Unfortunately AWS doesn’t support floating virtual IPs, so the normal solution of NFS servers on a virtual IP managed by heartbeat or similar is just not available. There is a cookbook for a Heath Robinson approach using vtunnel etc., but it has several problems, not least its complexity.

Fortunately there’s glusterfs. Gluster is mainly built for very large scale, peta-byte, storage problems – but it has features that make glusterfs perfect as a distributed file system on amazon EC2:

  • No extra meta-data server that would also need clustering
  • Highly configurable, with a “stacked filter” architecture
  • Not tied to any OS or kernel modules (except fuse)
  • Open Source

I use ubuntu on EC2 so the rest of this article will focus on that, but gluster can be used with any OS that has a reliable fuse module.

I’ll show how to create a system with 2 file servers (known as “bricks”) in a mirrored cluster with lots of clients. All gluster config will be kept centrally on the bricks.

At the time of writing the ubuntu packages are still in the 2.* branch (though v3.0.2 of gluster will be packaged into Ubuntu 10.4 “Lucid Lynx”) so I’ll show how to compile from source (other installation docs can be found on the gluster wiki but they tend to be a bit out of date).

To compile version 3.0.3 from the source at http://ftp.gluster.com/pub/gluster/glusterfs

apt-get update
apt-get -y install gcc flex bison
mkdir /mnt/kits
cd /mnt/kits

wget http://ftp.gluster.com/pub/gluster/glusterfs/3.0/3.0.3/glusterfs-3.0.3.tar.gz
tar fxz glusterfs-3.0.3.tar.gz
cd glusterfs-3.0.3
./configure && make && make install
ldconfig

Clean up the compilers:

apt-get -y remove gcc flex bison
apt-get autoremove

This is done on both the servers and clients as the codebase is the same for both, but on the client we should prevent the server from starting by removing the init scripts:

# only on the clients
rm /etc/init.d/glusterfsd
rm /etc/rc?.d/*glusterfsd

It’s also useful to put the logs in the “right” place by default on all boxes:

[ -d /usr/local/var/log/glusterfs ] && mv /usr/local/var/log/glusterfs /var/log || mkdir /var/log/glusterfs
ln -s /var/log/glusterfs /usr/local/var/log/glusterfs

And clear all config:

rm /etc/glusterfs/*

Ok, that’s all the software installed, now to make it work.

As I said above, gluster is configured by creating a set of “volumes” out of a stack of “translators”.

For the server side (the bricks) we’ll use the translators:

  • storage/posix
  • features/locks
  • performance/io-threads
  • protocol/server

and for the clients:

  • protocol/client
  • cluster/replicate
  • performance/io-threads
  • performance/io-cache

(in gluster trees the root is at the bottom).

I’ll assume you’ve configured an EBS partition of the same size on both bricks and mounted them as /gfs/web/sites/export.

To export the storage directory, create a file /etc/glusterfs/glusterfsd.vol on both bricks containing:

volume dir_web_sites
  type storage/posix
  option directory /gfs/web/sites/export
end-volume

volume lock_web_sites
    type features/locks
    subvolumes dir_web_sites
end-volume

volume export_web_sites
  type performance/io-threads
  option thread-count 64  # default is 1
  subvolumes lock_web_sites
end-volume

volume server-tcp
    type protocol/server
    option transport-type tcp
    option transport.socket.nodelay on

    option auth.addr.export_web_sites.allow *
    option volume-filename.web_sites /etc/glusterfs/web_sites.vol

    subvolumes export_web_sites
end-volume

NB. the IP authentication line option auth.addr.export_web_sites.allow * is safe on EC2 as you’ll be using EC2 security groups to prevent others from accessing your bricks.

Create another file /etc/glusterfs/web_sites.vol on both bricks containing the following (replace brick1.my.domain and brick2.my.domain with the hostnames of your bricks):

volume brick1_com_web_sites
    type protocol/client
    option transport-type tcp
    option transport.socket.nodelay on
    option remote-host brick1.my.domain
    option remote-subvolume export_web_sites
end-volume

volume brick2_com_web_sites
    type protocol/client
    option transport-type tcp
    option transport.socket.nodelay on
    option remote-host brick2.my.domain
    option remote-subvolume export_web_sites
end-volume

volume mirror_web_sites
    type cluster/replicate
    subvolumes brick1_com_web_sites brick2_com_web_sites
end-volume

volume iothreads_web_sites
  type performance/io-threads
  option thread-count 64  # default is 1
  subvolumes mirror_web_sites
end-volume

volume iocache_web_sites
  type performance/io-cache
  option cache-size 512MB               # default is 32MB
  option cache-timeout 60               # default is 1 second
  subvolumes iothreads_web_sites
end-volume

and restart glusterfs on both bricks:

/etc/init.d/glusterfsd restart

Check /var/log/glusterfs/etc-glusterfs-glusterfsd.vol.log for errors.

On the clients edit /etc/fstab to mount the gluster volume:

echo "brick1.my.domain:web_sites /web/sites glusterfs backupvolfile-server=brick2.my.domain,direct-io-mode=disable,noatime 0 0" >> /etc/fstab

Then create the mount point and mount the partition:

mkdir -p /web/sites
mount /web/sites

Check /var/log/glusterfs/web-sites.log for errors.

And you’re done!

The output of df -h should be something like this (though your sizes will be different).

bash# df -h
Filesystem Size Used Avail Use% Mounted on
...
brick1.my.domain 40G 39G 20M 0% /web/sites

In another post I’ll pontificate on tuning gluster performance, why I chose this particular set of filters and what the options mean.
