Archive for the ‘computers’ Category.

First impressions of Scala

I don’t want to write raw Java again, but I need to write ImageJ plugins. I’m on the hunt for a new language.

Almost a year ago, I wrote an interface between Kawa and ImageJ, but ImageJ’s design precludes giving Kawa the type annotations it needs to make the interface reasonably fast.

I am no longer appalled by languages like Java. They are muck heaps. The last good language in this family was ALGOL 68. I am appalled by Scala. It tries to be a modern language while still poking its tendrils into the muck heap, but managing the muck has contaminated the language.

Certain things tell me I have been in Haskell too long. I am mildly disturbed by assigning types as values. The case classes, which try to be algebraic data types, miss the point, since you can’t check if functions over them are total.

The primary horrors:

Covariant/contravariant types: Java isn’t strongly typed. You can cast a String to an Object, assign it an integer, and the compiler cannot tell. To escape this, Scala introduces covariant/contravariant types.

Consider the function cons :: a -> [a] -> [a]. The first argument can be any subclass of a. We call it "covariant." The second argument, any superclass of a works in the list. It is "contravariant."

The best solution is to remove inheritance entirely, but the muck monster won’t talk to anyone without inheritance.

Implicit parameters: A library call returns a list. You want to use it as a set, which you have set up as a subclass of list. Write a function which converts lists to sets, annotate it with the implicit keyword, and then you can happily treat the library’s return value as a set.

There are two uses for this. One lets you create typeclasses by defining "traits" (Java interfaces with arbitrary restrictions removed) and then implicitly coercing everything you want into this trait. The other lets you add fields and methods to existing classes and objects.

In the first use, the only claim the creators advance for this approach instead of typeclasses is that implicit parameters are scoped. I learned my lesson about implicit things when I had to maintain a large Perl code base, but this makes the implicit conversions in Perl look like child’s play.

The second use is ludicrous. In Smalltalk, if I want to extend a predefined class C with a method f, I just assign the function body to the field f of C. I can only use the public interface of C, so it isn’t dangerous.

Physical intuition in biology

To start, a question: why are the expositions of Landau and Lifshitz compelling? Here are ten volumes, without data. Compare them to any number of disastrous texts with an equal amount of data. Why does Landau & Lifshitz leave us credulous and so many others make us peery?

I think the answer is physical intuition. Chapters one and two of Mechanics set forth are a reference on physical intuition. A physicist believe the world has this shape, or one near to it, no matter what object we embed. For the next four volumes we get physics justified only from this intuition, and then the first chapter of volume 5 - Statistical Physics - hammers in the next piece of intuition: here is how many particles behave. The road forward is probabilistic.

These are the preconceptions a physicist carries with him into biology, where at first they seem to fail him. Some physicists become cynical and jettison this baggage (often at the behest of some biologists). Others refuse and build models that fail, fail, and fail again.

Step back for a moment. Imagine a physicist who has internalized the intuition in Mechanics but not in Statistical Physics. What is his fate? Feynman tells us (Feynman Lectures on Physics, vol.1, p.39-2):

“Anyone who wants to analyze the properties of matter in a real problem might want to start by writing down the fundamental equations and then try to solve them mathematically. Although there are people who try to use such an approach, these people are the failures of the field; the real successes come to those who start from a physical point of view, people who have a rough idea where they are going and then begin by making the right kind of approximations, knowing what is big and what is small in a given complicated situation.”

In other words, don’t neglect the first chapter of Landau and Lifshitz, vol.5.

What must we bolt onto our intuition to handle biology? The key is in Dobzhansky’s statement “Nothing in biology makes sense except in the light of evolution.”

A biologist tells a physicist, “When you have enough of X, this happens.” The naive physicist takes this literally: there exists a concentration of X at which something physically happens. But the mechanism in question probably exists in several related species, from different environments, all of which rely on similar versions of X. X probably is transcribed at different levels, into a different environment, and yet the system functions roughly the same.

More importantly, the system had to evolve, and the system had to function in all the environments leading up to the present. This is what we must add to our intuition. Exact fixed points and thresholds don’t happen. Now when we hear, “When you have enough of X, this happens,” from a biologist we filter it to, “In a bunch of organisms and places, some level of increase of X causes this to happen.” We do this with (as-is) absurd statements about mechanics and ensembles of particles every day.

This can be made rigorous in the same way as the pieces of Landau and Lifshitz. I suspect that evolutionary game theory of some form may be the proper language to express this.

What I call myself

What follows is abject navel gazing.

I originally studied physics. I did a lot of mathematics. I dabbled in computer science (and programmed a lot, which is something else). Now I’m immersed in biology. What do I call myself when someone asks?

These days, a mathematician.

This strikes me as odd: I don’t define my occupation by what I work on. But I think I finally understand why.

Here is the best definition I know of a physicist: someone who has been through Jackson’s E&M. If you survived it, you cemented a mindset shared by a community. In physics, everything is about time series. Anything not about time series is a statistical property of a time series.

In biology — not biophysics — trees are the natural unit. The logic of genetics deals in trees. Time series are awkward and dangerous. If we go to behavior and psychology, even the tree is inadequate.

After I settled into biology, I found myself with two deep mental frameworks, one next to the other. I spend a lot of thought trying to mesh them, and I have not yet managed it.

But wait! This is half of mathematics. Gian-Carlo Rota pointed out that axiomatic frameworks in mathematics are a sort of “hardware” for a “software” of mathematical facts. Lisp wizards who think of programs as abstract, nigh Platonic objects will interpret this the way I mean it; I hesitate to throw this analogy to anyone used to a FORTRAN descended language. A new mathematical fact is desirable; a new framework that brings known facts closer to trivial is just as novel. It itself is a mathematical fact. This way lies category theory.

The scientific equivalent of a stateless expatriate is a mathematician.

How to teach Haskell

The authors of Real World Haskell just posted the first drafts for reviewers, including me. Much of the book is spent trying to explain simple concepts, while addressing the subtleties that result from Haskell’s shorthands and abstracctions. Something occurred to me as I was reading: the easiest way to learn Haskell is to learn some Lisp.

But not traditional Lisp. This Lisp is lazy, has nothing but symbols, lists, and tuples, and enforces the type differences among them. There is a let form but no define, certainly no setq. It’s lambdas are curryable. The whole realm can be explored, its lessons learned, in a couple of days.

It’s a useless language, except for its purpose. It is Limbo to Haskell’s Inferno — not Haskell itself, but sitting on the edge, where the virtuous of the past go who did not survive to the monadic revolution…

After developing the patterns of lists and lambda calculus, and the basic habits of strong typing, in this realm, the first descent into Haskell is fast. In ten minutes, you discuss the slightly changed notation, point out that we have entirely hidden the underlying cons structure of the list, and arrive at the type system.

Now the type system is an obvious extension of a habit to handle this new, more complicated world. The first unexpected structure is the algebraic data type, a transformation of a piece of BNF. I think this is the only place in computer science where the BNF is the clearest way to teach something.

Then there’s a bumpy half an hour about the syntax for defining functions, a period of clarity in general but confusion in detail about type classes, and then you arrive at

MONADS

And I don’t know how to make this one easy. sigfpe has made the beginning clear, but the rest is still obscure.

On the other hand, if a newcomer picked up a book on Haskell and was told that they were going to spend two chapters learning a dialect of Lisp they would never use again, it might not sell well.

Programmer Virtue in Molecular Biology

Larry Wall taught us the three great virtues: laziness, hubris, and ego. I have long cultivated this in my life, and continue to do so now that I am a biologist.

My latest act of virtue (or perhaps virtù) involves selecting restriction enzymes for some cloning. Starting with a plasmid and a sequence I want to replicate in it, I need enzymes A, B, C, D, E, and X with the following properties:

  • A and B have compatible ends, but when those two ends are ligated, none of the enzymes can recut them.
  • None of the others have compatible ends.
  • A must appear one time in the plasmid without the insert.
  • B and C must not appear in the plasmid.
  • D, E, and X must appear one time in the plasmid.

Don’t worry about where these rules come from for now. I’ll explain another time. Instead, let us focus on my tool for laziness: Prolog. In Prolog, you write down a set of logical propositions and rules, and then query them. This sounds fairly trivial, but it’s in fact really, really useful.

I typed in NEB’s compatible cohesive ends table for their restriction enzymes, which consisted of many lines that looked like this:

compatible(spei, [avrii,nhei,styi,xbai], [bfai]).

This is an example of a fact in prolog. It consists of a label for the fact (compatible), and a series of arguments. The first argument is the enzyme, the second is the list of compatible enzymes (lists are enclosed in [], and the entries are separated by commas), and the third is the list of enzymes that recleave such a ligation. I can create as many facts as I like with the same label and different arguments. In this case, it’s one for every line of that NEB table. Then I can define rules which let Prolog derive more facts from the simple facts I have given it:

is_compatible(X,X) :- true.
is_compatible(X,Y) :- compatible(X,F,_), member(Y,F).


The first line says that any sites cleaved by the same enzyme are compatible. The second says two enzymes are compatible is the second is a member of the list of compatible enzymes of the first.

is_incompatible(X,Y) :- \+ is_compatible(X,Y).

And to be incompatible, they must not be compatible (\+ is ‘not’ in Prolog). Here’s another predicate which is fairly straightforwards:

is_recleaved_by(A,A,A) :- true.
is_recleaved_by(X,Y,Z) :- compatible(X,G,F),member(Y,G),member(Z,F).

Then I put in the list of the restriction enzymes I have on hand: ncutter(aatii). and so forth. I also declare two lists for sites in the plasmid: restriction sites which appear once in the plasmid, and sites that do not appear in the plasmid (among those enzymes I have on hand):

m361_1([aatii, ecori, hindiii, hpai, kpni, mlui, ndei, nhei, noti, pvuii, spei, sphi, xhoi, xmni]).
m361_0([asci, ecorv, ncoi, paci, sbfi, scai, sfoi, snabi]).
And some convenience predicates so I don’t have to dig into the lists all the time:

singleM361(A) :- m361_1(B), member(A,B).
zeroM361(A) :- m361_0(B), member(A,B).

As an exercise, try to figure out what these do:

unrecleavable_pair(A,B) :-
is_compatible(A,B),
\+ is_recleaved_by(A,B,A),
\+ is_recleaved_by(A,B,B).

incompatible_with(_,[]) :- true.
incompatible_with(X,[Y|Ys]) :-
is_incompatible(X,Y),
incompatible_with(X,Ys).

incompatible_set([_]) :- true.
incompatible_set([X,Y|Xs]) :-
incompatible_with(X,[Y|Xs]),
incompatible_set([Y|Xs]).

Then I encode most of my rules into one query:

enzyme_set(A,B,C,X) :-
ncutter(A), ncutter(B),
unrecleavable_pair(A,B),
ncutter(C), ncutter(X),
incompatible_set([A,C,X]),
incompatible_set([B,C,X]).

Now I ask for sets of A, B, C, and X which work:

?- enzyme_set(A,B,C,X), singleM361(A), singleM361(X), zeroM361(B), zeroM361(C).

Prolog gives me every possible set. I look at my plasmid map and pick one that seems reasonable (A=MluI, B=AscI, C=EcoRV, X=AatII). Then I cheat a bit, look for a pair of enzymes in the polylinker that only appear once in the plasmid, and which are incompatible with everything else (which I check in Prolog as well).

It’s not as automated as it could be, nor is it elegant, but compare this to searching through enzymes descriptions by hand! Nor should you be intimidated by the language. I began this with a bare minimum of Prolog, which I picked up in about two hours by reading the first three chapters of Clocksin and Mellish’s Programming in Prolog.

How to learn to program

Peter Norvig has addressed this question. Eric Raymond has his nice essay on the subject. But these don’t provide much guidance for the beginner who wants to know “which books do I sit down and churn through?”

The short answer: The Haskell Road to Maths, Logic, and Programming by Doets and van Eijck (what appears to be an earlier draft is available in PDF here), and then Abelson and Sussman’s Structure and Interpretation of Computer Programs (also available online).

The long answer:

Until you understand what Abelson and Sussman are about in their book, you’re missing the whole point of computer science. Peter Norvig wrote about it,

“…if you’re like me, you’re not looking for one more trick, rather you’re looking for a way of synthesizing what you already know and building a rich framework onto which you can add new learning over a career. That’s what SICP has done for me. I read a draft version of the book around 1982 and it changed the way I think about my profession.”

Unfortunately, SICP requires a fair degree of intellectual, indeed mathematical, maturity. That’s where Doets and van Eijck comes in. Hopefully it will guide you far enough up the slope where SICP makes sense.

After this, there remains only one development in programming languages which you won’t be able to take completely in stride: monads and all the apparatus around them. But that’s a story for another day.

Crawling amoebas

I just finished implementing the Nelder-Mead algorithm for finding the minima of functions. This is a fairly robust algorithm that operates in n dimensions. It requires no information on derivatives, and can efficiently crawl through valleys and the eyes of needles. How? Simulate an idealized amoeba!

Say you have a function f : \mathbb{R}^n \rightarrow \mathbb{R} which you want to minimize. Construct an n-dimensional simplex (the simplex is the higher dimensional analog of the triangle or tetrahedron). Now let the simplex, consisting of points S = \{ x_i, i = 1, \dots, n+1 \} crawl around \mathbb{R}^n.

For each iteration, let x_w be the worst of the \{ x_i \} (in the sense that it has the largest value of f), x_s be the second worst, and x_b be the best. Also, let S' = S - \{ x_w \}.

We will also need the vector which describes reflecting one point through the hyperplane formed by the rest of the points of the simplex. This is given by r = \frac{1}{n-1} \sum_{x_i \in S'} (x_i - x_w), the vector to the center of mass of that hyperplane.

We make the simplex crawl according to the following algorithm:

  1. If f(x_w + 2r) < f(x_b), try going twice as far beyond the simplex: x_w + 3r, then replace x_w with whichever gives of a lower value of f: x_w + 2r or x_w + 3r.
  2. Otherwise, if f(x_w + 2r) < f(x_s), then we just replace x_w with x_w + 2r (the reflection isn’t promising enough to try the extension).
  3. At this point life is looking bad. Try just pulling x_w in towards the rest of the simplex: if f(x_w + \frac{1}{2}r) < f(x_w), replace x_w with x_w + \frac{1}{2}r.
  4. If even that isn’t an improvement, we try to go to a smaller length scale (it may be that we’re hovering over a narrow valley, or even across several valleys, and are being confused by this). To do this, we contract the simplex around x_b. Take S' - \{ x_b \}, add to it whichever is better of x_w and x_w + 2r (remember that we only tested up to the point where x_w + 2r was better than x_s). Then we start again with the new simplex S” = \{ x_b + \frac{1}{2} ( x_i - x_b ) \} where the \{ x_i \} are drawn from (S' - \{ x_b \} ) \cup the better of x_w + 2r and x_w.

Wikipedia has a very nice animated gif of a simplex crawling according to these rules.

I have one serious problem with the simplex algorithm: I have no idea to prove when or if it will work. There are modifications of it to try to improve its reliability, particularly in high dimensions, but I know of no theoretical examination of these properties.

Controlling ImageJ from Scheme

I just spent ten days with the Bioimaging Group at EPFL, Lausanne. They’re great folks, and I at long last am on the way to having a segmenter for our microscope images. There’s just one catch: they do everything as plugins for ImageJ.

I never really learned to write Java. It’s basically C with classes and memory management bolted on top, so it’s not exactly hard to learn, but I don’t appreciate managing memory anymore. I’m not willing to track syntax errors through hundreds of lines of code. And most damningly, I can’t program in languages where functions are first-class objects anymore. I just can’t seem to do it. At every turn, I say, “Ah, I’ll just pass this function as an argument to…oh wait.”

Thankfully, I was able to escape. Kawa is a Scheme REPL and compiler for the Java virtual machine that let’s you call native Java objects seamlessly. You can define new classes which can be instantiated from Java. It has multidimensional arrays that you can reshape at runtime, and which are compiled to unboxed, one dimensional Java arrays.

I spent a couple hours while I was there trying to write ImageJ plugins in Kawa. I failed. Horribly. I still don’t know why. However, I realized I was going at the problem backwards. Instead I opened up Kawa, and typed (<ij.ImageJ>). Up pops ImageJ! Another couple of hours learning the details of Kawa and coding gave me full control of the program from native Scheme, and access to the image pixels in place as a native two dimensional array. Further, the view of the array in Scheme is dynamically typed, so code you write to manipulate images automatically works on all four pixel types (Byte, Short, Color, and Float).

Then I started coding in earnest. I ported Todd Eigenschink’s matrix05.scm linear algebra library to Kawa using the aforesaid native arrays. Then I started coding my own cubic B-spline library. This is reinventing the wheel, but I would have had to in Java as well, and matrix05.scm is far superior to either of the Java linear algebra libraries.

When I get it cleaned up and documented a bit more, I’ll probably release the interface so everyone else can stop beating their heads against Java without sacrificing the sizeable codebase already in ImageJ.