Why test in Haskell?
Every so often the question comes up: should you test in Haskell, and if so, how?
Most people agree that you should test pure, especially complicated, algorithmic code. Quickcheck1 is a great way to do this, and most Haskellers have internalized this (Quickcheck was invented here, so it must provide value!). What’s less clear (or at least, more debated!) is whether you should be testing monadic code, glue code, and code that just isn’t all that complicated.
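To make the pure case concrete, here is the shape of a property you would hand to Quickcheck. To keep this sketch dependent only on base, it just evaluates the property on a few hand-picked inputs; with the QuickCheck package installed you would instead run `quickCheck prop_reverseTwice` and let it sample randomly (the property and names here are illustrative, not from the original post).

```haskell
-- A property of pure code, in the style Quickcheck checks: reversing a
-- list twice is the identity.
prop_reverseTwice :: [Int] -> Bool
prop_reverseTwice xs = reverse (reverse xs) == xs

-- Base-only stand-in for `quickCheck prop_reverseTwice`: check a few
-- hand-picked inputs, including the empty-list edge case.
checkAll :: Bool
checkAll = all prop_reverseTwice [[], [1], [1, 2, 3], [3, 3, 1]]
```

The point is how little machinery this needs: the function's type fully describes its inputs, so generating them (randomly or by hand) is trivial.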
Quickcheck?
A lot of the Haskell I’m writing these days is with the web framework Snap, and web handlers often have the type Handler App App () - where Handler is a web monad (giving access to request information, and the ability to write response data), and App indicates access to application-specific state (like database connections, templates, etc).
So the inputs (i.e., how to run this action) include any HTTP request and any application state, and the only outputs are side effects (as all it returns is unit). Using Quickcheck here is… challenging. You could restrict the generated requests to have the right URL, and even have the right query parameters, but since the query parameters are just text, if they were supposed to be more structured (like an integer), the chance of actually generating text that was just a number is pretty low… And then if the number were supposed to be the id of an element in the database….
But assume that we restrict it so that it’s only generating ids for elements in the database: what are the properties we are asserting? Let’s say that the handler looked up the element, and rendered it on the page. So then we want to assert something about the content of the response (which is wrapped up in the Handler monad). But maybe it should also increment a view count in the database. And assuming that we wrote all these into properties, what are the elements in the database that it is choosing among? And in some senses we’ve now restricted too much, because we may want to see what the behavior is like for slightly invalid inputs - say, integer ids that don’t correspond to elements in the database. This is all certainly possible, and may be worth doing, but it seems pretty difficult. Which is totally different from the experience of testing nice pure functions!
Let’s try to tease out a little bit of why testing this kind of code with Quickcheck is hard. One problem is that the input space, as determined by the type, is massive. And for most of the possible inputs, the result should be some version of a no-op. Another problem is the dependence on state, as the possible inputs are contingent on external state, and the outputs are primarily changes to state, each of which, again, is a massive space.
But having massive input and output spaces is not necessarily a reason not to be using randomized testing. Indeed, this is exactly the kind of thing that fuzz-testing of web browsers, for example, has done with great effect2. The problem in this case is that the size of the input and output space is not at all in proportion to the complexity of the code. If we were writing an HTTP server, we may indeed want to be generating random requests, throwing them at the web server, and making sure it was generating well-formed responses (404s being perfectly fine).
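The fuzzing idea can be sketched in a few lines. Everything here is a stand-in (assumption, not the post's code): a deterministic linear congruential generator plays the role of a random request generator, `server` plays the role of an HTTP server mapping a request to a status code, and the property is deliberately weak - any well-formed response, 404 included, passes.

```haskell
-- Toy fuzz loop, base only. `lcg` is a deterministic pseudo-random
-- stream standing in for a request generator.
lcg :: Int -> Int
lcg n = (1103515245 * n + 12345) `mod` 2147483648

-- Hypothetical "server": request id in, status code out.
server :: Int -> Int
server i = if even i then 200 else 404

-- The fuzzing property: every response is a well-formed status code.
wellFormed :: Int -> Bool
wellFormed c = c `elem` [200, 301, 404, 500]

fuzzOk :: Bool
fuzzOk = all (wellFormed . server) (take 1000 (iterate lcg 42))
```

Note how the property says nothing about *which* response comes back, only that it is well-formed - which is exactly why fuzzing pays off for an HTTP server but not for a three-line handler.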
Not that complicated…
But we’re just writing a little bit of glue code. Which isn’t that complicated. And can be tested manually pretty easily. And may change rapidly.
Which means spending a lot of time setting up property-based tests (which in these sorts of cases are necessarily going to be quite a bit more complicated than quintessential Quickcheck examples like showing that reverse . reverse = id) is hard to justify.
But you’re still writing code that has types that massively underspecify its behavior. Which should make you nervous, at least a little. Now granted, you should keep that underspecified code as thin as possible - validate the query parameters, the URL, etc, and then call a function with a type that much more clearly specifies what it is supposed to do. For example (this is coming from Snap code, with some details elided, but should be reasonably easy to understand):
f :: Handler App App ()
f = route [("/foo/:id", do i <- read <$> getParam "id"
                           res <- lookupAndRenderFoo (FooId i)
                           writeText res)]

lookupAndRenderFoo :: FooId -> Handler App App Text
lookupAndRenderFoo = undefined
And certainly, this is a good pattern to use. We went from a function that had as input space any HTTP request (and any application specific state), and as output any HTTP response (as well as any side effects in the Handler monad), and split it into two functions. One still has the same input and output as before, but is very short, and the other is a function with input the id of a specific element, and as output Text, but still can perform any side effects in and read any data from within the Handler monad.
Increasing complexity?
We could split that further, and write a function with type Foo -> Text, but we would start getting in our own way, as, if we wanted to render with a template, the templates exist within the context of the Handler monad, so we would have to look up a template first, and we would have ended up creating many new functions, as well as a bit of extra complexity, all for the sake of splitting our code up into layers, where the last one is pure and easy to test (the rest still have all the same problems).
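For comparison, that last, pure layer might look something like this. The `Foo` record and `renderFoo` are made-up names, and String stands in for Text so the sketch needs only base - a sketch of the pattern, not the post's actual code.

```haskell
-- A hypothetical pure rendering layer: no database, no Handler monad,
-- just a value in and markup out.
data Foo = Foo { fooName :: String, fooViews :: Int }

renderFoo :: Foo -> String
renderFoo f =
  "<h1>" ++ fooName f ++ "</h1><p>" ++ show (fooViews f) ++ " views</p>"
```

Testing this function is trivial - which is exactly the appeal of pushing logic down into such layers, and exactly the cost, since every template lookup or database read has to be threaded in from above.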
Depending on how complex that last layer is, this may totally be worth it. If your code is dealing with human lives or livelihoods, by all means, isolate that code into as small a portion as possible and test the hell out of it. But it makes coding harder, and makes you move slower. And if you want to change the logic, you may now have to change many different functions, instead of just one.
Which is where we come to the argument that testing slows things down, and that for rapidly changing code, it just doesn’t matter.
What about just not sampling?
But if we step back a bit, we realize that what Quickcheck is trying to do is to sample representatively (well, with a bias towards edge cases) over the type of the input. And it’s easy to see why that’s appealing, as it gives you reasonable confidence that any use of the function behaves as desired. But if we forget about that, as we already know that our types completely underspecify the behavior, we realize all that we really care about is that the code does what we think it should do on a few example cases. That’s what we were going to manually verify after writing the code anyway.
Which is easy to test. With Snap, I’d write some tests for the above snippet like3:
do f <- create ()
   let i = unFooId (fooId f)
   get ("/foo/" ++ show i) >>= should200
   get ("/foo/" ++ show i) >>= shouldHaveText (fooDescription f)
   get ("/foo/" ++ show (i + 1)) >>= should404
And call it a day. This misses vast swaths of inputs, and asserts very little about the outputs, but it also tells you a huge amount more about the correctness of the code than the fact that it typechecked did. And as you iterate and refactor your application, you get the assurance that this handler:
- still exists.
- still looks up the element from the database.
- still puts the description somewhere on the page.
- doesn’t work for ids that don’t correspond to elements in the database.
Which seems like a lot of assurance for a very small amount of work. And if your application is fast moving, this benefits you even more, as the faster you move, the more likely you are to break things (at least, that’s always been my experience!). If you do decide to rewrite this handler, fixing these tests is going to take a tiny amount of time (probably less time than you spend manually confirming that the change worked).
Why this should be expected to work.
To take it a little further, and perhaps justify from a somewhat theoretical point of view why these sorts of tests are so valuable, consider all possible implementations of any function (or monadic action). The possible implementations with the given type are a subset of all the possible implementations, but still potentially a pretty large one (our example of a web handler certainly has this property).
This perspective gives us some intuition on why it is much easier to test simple, pure functions. There are only four possible implementations of a Bool -> Bool function, so testing not via sampling seems pretty tractable. To go even further, we get into the territory of “Theorems for Free”4, where there is only one implementation for an (a,b) -> a function, so testing fst is pointless.
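The Bool -> Bool case is small enough to write out in full. Here are all four implementations, and a check - by exhaustively "sampling" the two-element input space - of which of them behaves like not (the names are mine, for illustration):

```haskell
-- The four total functions Bool -> Bool, enumerated directly.
implementations :: [Bool -> Bool]
implementations = [const True, const False, id, not]

-- Exhaustive sampling: the whole input space is just True and False.
agreesWithNot :: (Bool -> Bool) -> Bool
agreesWithNot f = f True == not True && f False == not False

matches :: Int
matches = length (filter agreesWithNot implementations)  -- exactly 1
```

Two test cases pin down the implementation completely - the best possible ratio of tests to behavior, and the opposite extreme from a web handler.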
But returning to our case of massive spaces of well-typed implementations: A single test, like one of the above, corresponds to another subset of all the possible implementations. For example, the first test corresponds to the subset that return success when passed the given url via GET request. Since we’re in Haskell, we also get a guarantee that the set of implementations that fulfill the test is a (not necessarily strict) subset of the set of implementations that fulfill the type, as if this were not the case, our test case wouldn’t type check. The problem with the first test, of course, is that there are all sorts of bogus implementations that fulfill it. For example, the handler that always returns success would match that test.
But even still, it is a strict subset of the implementations that fulfill the type (for example, the handler that always returns 404 is not in this set), so we’re guaranteed to have improved the chance that our code is correct, even with such a weak test (granted, it actually may not be that weak of a test - in one project, I have a menu generated from a data structure in code, and a test that iterates through all elements of the menu, checking that hitting each url results in a 200. And this has caught many refactoring problems!).
Where we really start to benefit is as we add a few more tests. The second test shows that the handler must somehow get an element out of the database (provided our create () test function is creating relatively unique field names), which is another (strict) subset of the set of implementations that fulfill the type. And we now know that our implementation must be somewhere in the intersection of these two subsets.
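The intersection argument can be modeled in miniature. Candidate "handlers" below are just functions from an id to a status code, and each test is a predicate over implementations - all names are hypothetical, chosen to mirror the handler tests above under the assumption that id 7 exists and id 8 does not:

```haskell
-- Toy model: a space of candidate implementations.
candidates :: [(String, Int -> Int)]
candidates =
  [ ("always200", \_ -> 200)                          -- bogus: passes test1 only
  , ("always404", \_ -> 404)                          -- bogus: fails test1
  , ("looksUp",   \i -> if i == 7 then 200 else 404)  -- the intended behavior
  ]

-- Each test carves out a subset of the implementation space.
test1, test2 :: (Int -> Int) -> Bool
test1 impl = impl 7 == 200   -- a known id succeeds
test2 impl = impl 8 == 404   -- an unknown id 404s

-- Only implementations in the intersection of both subsets survive.
survivors :: [String]
survivors = [name | (name, impl) <- candidates, test1 impl, test2 impl]
```

Each predicate alone admits a bogus implementation; together they leave only the one that actually looks the element up - which is the whole argument in three candidates.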
It shouldn’t be hard to convince yourself that through the process of just writing a few (well chosen) tests you can vastly reduce the possibility of writing incorrect implementations. Which, when we are writing relatively straightforward code, will probably be good enough to ensure that the code is actually correct. And will continue to verify that as the code evolves. Pretty good for a couple lines of code.