Recently I got the chance to work with Clojure in a production setting. I had played around with it before but never used it ‘in anger’. It’s a language a few of us are evaluating as a possible general-purpose choice, particularly for high-performance tasks and problems where concurrency is required.
Prelude: Clojure: The language and ecosystem
By the time I write this, Clojure has begun to lose its cutting-edge buzz, much as Scala has in many circles, and has simply become an accepted, standard choice for JVM development. Its ‘weird’ syntax still puts most of the uninitiated off, presumably more because of its unfamiliarity than anything else.
It’s a JVM-based Lisp, unashamedly so; however, it has also updated a number of older Lisps’ assumptions, most obviously for me in the data structures, which look syntactically far closer to JSON than to anything else, though in implementation they are high-performance immutable hash maps and trees implemented in Java.
Its interoperability with Java is perhaps one of its strongest claims to real-world practicality. Calling methods on Java objects is as simple as (.toString foo). Consequently, it leans heavily on the Java ecosystem for library support. The de facto AWS SDK is nothing more than an auto-generated, reflection-based, normalised wrapper around the Java one, which, despite sounding somewhat horrifying, appears to ‘just work’, with nary an issue for me (unlike Node.js… urgh).
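As a concrete illustration of the interop, here is a minimal sketch; the open-gzip helper and the file path are made up for the example, but the constructor and method-call forms are standard Clojure:

```clojure
;; Java classes are imported and constructed directly; no wrapper layer needed.
(import '(java.util.zip GZIPInputStream)
        '(java.io FileInputStream))

(defn open-gzip
  "Open a GZipped file as a decompressing InputStream using plain Java constructors."
  [path]
  (GZIPInputStream. (FileInputStream. path)))

;; Instance methods use the (.method object) reader form:
(.toString (java.util.Date.))
```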
The problem at hand:
There is a multi-gigabyte file in S3 which contains GZipped JSON lines. Each day I need to download it, parse it and write the lines to DynamoDB in a timely fashion.
The solution:
Using the Java GZip stream reader to decompress the file, we exposed the decompressed lines as a lazy sequence. This allowed the data to be read on demand rather than loading it all into memory or worrying about the mechanics of pagination. The first cut reads the lines and pushes them onto a central channel, which offers the same channel semantics as in Go: a queue that blocks the producing thread once it reaches capacity. This abstraction of interactions between threads via channels is one of the major features of the language: communication between threads is safe, loosely coupled and easily testable. Consumer threads then pull items off the channel in batches and batch-write them to the database.
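A minimal sketch of that pipeline, assuming clojure.data.json for parsing and a caller-supplied write-batch! function standing in for the DynamoDB batch write (the S3 download itself is omitted; any InputStream will do):

```clojure
(ns pipeline.core
  (:require [clojure.java.io :as io]
            [clojure.data.json :as json]
            [clojure.core.async :as async :refer [chan >!! <!! thread close!]])
  (:import (java.util.zip GZIPInputStream)))

(defn gzip-lines
  "Lazily read decompressed lines from a GZipped InputStream."
  [input-stream]
  (-> input-stream GZIPInputStream. io/reader line-seq))

(defn produce!
  "Parse each JSON line and push it onto the channel; >!! blocks when the
   channel's buffer is full, which is what gives us backpressure."
  [lines ch]
  (doseq [line lines]
    (>!! ch (json/read-str line :key-fn keyword)))
  (close! ch))

(defn consume!
  "Pull items off the channel and hand them to write-batch! in groups of
   batch-size, flushing any final partial batch once the channel closes."
  [ch batch-size write-batch!]
  (thread
    (loop [batch []]
      (if-let [item (<!! ch)]
        (let [batch (conj batch item)]
          (if (>= (count batch) batch-size)
            (do (write-batch! batch) (recur []))
            (recur batch)))
        (when (seq batch)
          (write-batch! batch))))))

;; Wiring it together (s3-input-stream and write-to-dynamo! stand in for the
;; real S3 download and DynamoDB batch write):
;;   (let [ch (chan 1024)]
;;     (thread (produce! (gzip-lines s3-input-stream) ch))
;;     (<!! (consume! ch 25 write-to-dynamo!)))
```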
What worked:
- Streaming the data as a lazy sequence keeps the memory footprint slim and is easy to reason about. The app can run with as little as 100 MB of memory given to the JVM.
- core.async is amazing. The channel-based approach to working with async data is vastly superior to callbacks/promises: there is no leak of abstraction between processes, and backpressure comes for free.
- Clojure’s dynamic typing allowed for easy handling of JSON with very little friction.
- REPL-driven development is effective and fast.
- Testing is easy and similar to Node.js in many ways.
- Concurrency is easy and significantly safer than my experiences in Java.
What didn’t work:
- Clojure is very well tooled. This is a hangover from Lisp, where Emacs and the like have vast arrays of tools for interacting with s-expressions and exploiting the homoiconicity of the language to the full. However, this is also an assault of complexity on the uninitiated (me), and so I didn’t really begin exploring the tooling until the end of the project.
- Leiningen’s startup time is slow. Restarting the app repeatedly, as you would in Node.js, isn’t really feasible on anything but the most modern of machines. Instead, the idiomatic way to develop is in the REPL, though it took me a while to work this out.
- Clojure’s stack traces are still quite verbose and hard to read initially.
- The JVM was apparently unable to ascertain how much memory was available and would, under various circumstances, exceed what Docker had allocated to it, at which point the process was killed by the kernel. Inspired by the Mono development going on elsewhere, hard-coding the amount of memory available to the JVM worked around the issue (see the sketch after this list).
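For reference, a minimal sketch of that kind of workaround in a Leiningen project.clj; the heap values here are illustrative, not the ones we actually shipped:

```clojure
;; project.clj (excerpt): cap the heap explicitly so the JVM stays within
;; the memory granted to the Docker container.
(defproject pipeline "0.1.0"
  :dependencies [[org.clojure/clojure "1.10.1"]]
  :jvm-opts ["-Xms256m" "-Xmx256m"])
```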
What would I do differently next time:
- With better understanding, work more with Emacs and its associated tooling (Paredit, the CIDER REPL and clj-refactor) rather than the vanilla Lein REPL.
- Look at using transducers (a rough sketch follows below).
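For instance, the JSON-parsing step could be attached to the channel itself as a transducer rather than being done in the producer. A minimal sketch, assuming the same requires as the pipeline sketch above:

```clojure
;; A transducer that parses each raw line as it passes through the channel.
(def parse-json (map #(json/read-str % :key-fn keyword)))

;; chan accepts a buffer size and a transducer, so the producer can put raw
;; lines onto the channel and consumers receive parsed maps.
(def lines-ch (async/chan 1024 parse-json))
```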