A Separation, Description vs. Interpretation

2012-12-12

TL;DR

Whenever I create a new module or component, I will first try to separate it into two parts:

A descriptive, declarative part, in the form of a graph, or internal DSL.
An executive part, in the form of a generator or an interpreter.

And I think you should try that, too.

Why? Here is The Long Story

Programming, in the sense of telling a CPU what to do, independent of the methods involved, be it functional, logical, object-oriented, or actor-based, usually tells us not so much about our program, it tells us something about the CPU, the operating system, or the libraries involved.

So if programs tell our CPUs and operating systems what to do, who tells our programs what to do?

We do it, but that information is usually gone as soon as we type the first line of code.

So for any reasonably complex program, programming will lead to the original intentions getting lost, because a program is focused on the hardware and frameworks that execute it.

So a program always does two things:

It implements an idea or a requirement that is bound to a domain.
It talks to the operating system and the CPU to implement that requirement.

No matter how we decorate a program with meaningful variable names, it will always be obscured by its very own structure, by system calls, additional operators, symbols, lists, lambdas, which are unrelated to the original requirement. These are tools that are optimized to talk to the CPU, to the compiler, to APIs, but barely to humans.

So now that we know there are two separate things, we could give the two parts names: The specification and the implementation.

That is nothing new. Classic programming is about taking a specification, implementing it, and then forgetting it.

But do modern (agile?) programmers use specifications anymore? From what I know, they talk to their customers at regular intervals and then repeatedly prune code into a shape, so that it iterates towards their imagination of what they think the client needs. And often it is the programmer's imagination that needs these iterations, not the code. This "process" is the most effective way to develop software as far as I know. So frankly, here, the specification never existed. It can not get lost in translation. The problem seems to be solved.

But to make changes to an existing code base requires a very deep understanding of what is going on in the implementation, and also, more importantly, about what was originally defined by the specification or was imagined in the head of the client.

So one might wonder why we don't write specifications anymore. Or why, at the times we did, it always felt pretty useless as soon as the first lines of code were written.

The most basic problem with the specification is that it is never as detailed as the implementation, because a proper implementation also needs to take a lot of additional variables into account that usually can not be foreseen by a person that is not the programmer.

And because programmers don't like to write or change specifications, all these important decisions and switches just appear in the code and never make their way back into the specification.

... more precisely, programmers don't like to do anything. Programmers are - by their very nature - very lazy people, because if they weren't, they would not be good at programming, which requires a basic motivation to avoid and automate boring and repetitive labor, which then may lead to a world where only programmers and robots are required anymore. The mad realization here is that lazy programmers create a society in which everyone but programmers can be lazy. And although power and wealth are probably a good compensation for that, I doubt that we can survive being run by programmers who just want to be lazy but do all the manual work that's left. The only solution to that problem is to replace programmers with artificial intelligence. Fortunately, we need only lazy programmers to do that.

So we need to accept that a written or imagined specification can - by definition - never be as detailed as the code that runs it. Accepting that, we could either throw the idea of a "living" specification away and rebuild the specification (or intentional map) in our heads by reading a lot of code right before we want to make small changes, or we could finally accept that the code is the specification.

And this is not news either, but compared to all the other progress we have made, we have been struggling heavily with that challenge for a long time now.

For example, TDD or BDD are excellent examples of rudimentary attempts to bring back specifications into our programs by writing code that observes and verifies the behavior of programs. But even though these practices reduce bugs by a fair amount, they introduce yet another liability by adding a lot more code.

What we want is less code, not more. And should testing really be complected with the specification?

For one thing, we should not forget that a specification is pure in the sense that it defines what should happen. So whatever we test, it can never be the specification that is under test. That's one reason you never need to build a test case for a test, because the test ultimately defines what should happen, and so does a specification.

Consequently, we need to be aware that it is important to separate specification and implementation right in our code.

One way to separate the specification from the implementation is to think about the specification as a simple data graph that is static and fixed once it has been built. A specification of a program should be an immutable graph that completely defines the dynamic behavior of a program.

Compare that to markup or the source code that we compile. It has the same properties. It is a complete, immutable blueprint that specifies to some interpreter or the CPU how our program is to be executed.

Most of the software projects I see today mix the specification together with the implementation, so that everything looks like a complected mashup of domain-specific terms and executable code.

We need the discipline to separate the specification from the program that runs it. We need to create languages (preferably internal DSLs) together with the appropriate domain-specific data types that allow us to create a domain-specific specification, which can then be run by an interpreter.

The language builds the data types that form the specification, which is then interpreted.
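To make this concrete, here is a minimal sketch in Python, under the assumption of a toy form-validation domain. All names (Field, Form, field, form, SIGNUP, validate) are hypothetical and only serve the illustration. The helper functions are the internal DSL, the frozen dataclasses are the specification graph, and validate is one possible interpreter.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


# --- Descriptive part: immutable data types that form the specification ---

@dataclass(frozen=True)
class Field:
    name: str
    required: bool
    max_length: Optional[int]


@dataclass(frozen=True)
class Form:
    name: str
    fields: Tuple[Field, ...]


# --- Internal DSL: small functions that build the specification graph ---

def field(name: str, required: bool = False,
          max_length: Optional[int] = None) -> Field:
    return Field(name, required, max_length)


def form(name: str, *fields: Field) -> Form:
    return Form(name, tuple(fields))


# The specification itself: it only says *what* a valid signup looks like.
SIGNUP = form(
    "signup",
    field("email", required=True, max_length=64),
    field("nickname", max_length=16),
)


# --- Executive part: an interpreter that walks the specification ---

def validate(spec: Form, data: Dict[str, str]) -> List[str]:
    """Return a list of error messages; an empty list means valid."""
    errors = []
    for f in spec.fields:
        value = data.get(f.name, "")
        if f.required and not value:
            errors.append(f"{f.name}: required")
        if f.max_length is not None and len(value) > f.max_length:
            errors.append(f"{f.name}: longer than {f.max_length}")
    return errors
```

The interpreter only sees Field and Form values. A second interpreter, for example one that renders the same SIGNUP value as an HTML form, could be added without touching the specification.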

Now, that concept is not new either. Compilers and browsers all work this way. They take a specification in, and interpret or translate it.

But if we know that this concept leads to the most sophisticated programs (namely the compiler or the browser), and probably the most complex and stable software besides the operating system, why don't we use this model to create our programs?

One explanation could be that we are not smart enough. Abstractions like markup or programming languages take a long time to develop, and even then it may not be guaranteed that they foster change and can be extended easily.

There is also another scary element that eventually comes up in any fairly complex system: executable parts, like Turing-complete languages, that compensate for abstractions we are not able to see yet. JavaScript, originally built to extend HTML, is a prominent example that is taking over the whole web right now.

So we need to be aware that sometimes a specification needs complex executable parts, but these should be small and separate from the time and context the interpreter runs in.

Instead of creating more powerful computer languages, we may need to craft libraries that enable us to create specifications and interpreters for the programs we want to build.

This separation would have some positive consequences:

We could understand our code again, which would result in much faster development and modification times.
Everything we build would be portable. We already introduced a natural porting boundary. Only the interpreter part needs to be ported to another platform.
This separation is scalable in the sense that once the interpreter has been implemented and new abstractions are found, the interpreter could be separated again into a specification part and an implementation part.

Admittedly, and so far, this is a rather linear view of the relation between a specification and the interpreter. In reality it would be more like a number of specifications and interpreters working together. But as long as the boundaries are clear and we are aware of them, I can imagine that such a basic separation principle could lead to better, more maintainable programs: programs that don't hide their business logic between layers of functions or classes.

So how to start? My best guess is just to think first how a specific domain can be modeled, and if the problem can be clearly separated into a specification and an interpreter. If it can't, the domain needs to be untangled first or new abstractions need to be found.

This idea is growing on me now, and I am thinking a lot about the declarative nature of specifications, and how they can stay separate from their execution.

To summarize, I want to share my current ideas about program code that is suited to serve as a specification:

Except when a particular order is part of the specification, individual elements in the same set are commutative; and whenever duplicated elements do not make sense, idempotent. Ordering and duplication should be explicitly specified.

It is very important to stress here that the order in which the specification is built should have a negligible effect on the resulting specification graph.

Programmers that are used to functional programming languages have a clear advantage here.
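One way to get these properties almost for free is to pick the container type deliberately. The following Python sketch is only an illustration with hypothetical names (Role, Pipeline, role, pipeline): a frozenset is used where order and duplicates carry no meaning, and a tuple where they are part of the specification.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Role:
    name: str
    # Unordered: permissions are commutative, and duplicates collapse.
    permissions: FrozenSet[str]


@dataclass(frozen=True)
class Pipeline:
    # Ordered: here the sequence (and any repetition) is explicitly specified.
    steps: Tuple[str, ...]


def role(name: str, *permissions: str) -> Role:
    return Role(name, frozenset(permissions))


def pipeline(*steps: str) -> Pipeline:
    return Pipeline(tuple(steps))


# Built in a different order and with a duplicate, yet the same specification.
editor_a = role("editor", "read", "write")
editor_b = role("editor", "write", "read", "read")

# For a pipeline, order is part of the meaning, so these two differ.
build_then_test = pipeline("build", "test")
test_then_build = pipeline("test", "build")
```

With this choice, editor_a and editor_b compare equal, while the two pipelines do not.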

The resulting specification graph is immutable and can not be changed (it may be extended inside the interpreter, though).

There might even be translators, acting like interpreters that translate one specification into another.
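A translator in this sense could be as small as a pure function from one specification type to another. The sketch below assumes a hypothetical REST-like domain; Resource, Route, and to_routes are made-up names, and the mapping to HTTP methods is just one plausible convention.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Resource:
    """High-level specification: 'there is a collection called name'."""
    name: str
    readonly: bool = False


@dataclass(frozen=True)
class Route:
    """Lower-level specification that a routing interpreter could consume."""
    method: str
    path: str


def to_routes(resource: Resource) -> FrozenSet[Route]:
    """Translate a Resource specification into a set of Route specifications."""
    base = f"/{resource.name}"
    item = f"/{resource.name}/{{id}}"   # yields e.g. "/users/{id}"
    routes = {Route("GET", base), Route("GET", item)}
    if not resource.readonly:
        routes |= {Route("POST", base), Route("PUT", item), Route("DELETE", item)}
    return frozenset(routes)
```

The translator does not execute anything; its output is again an immutable specification that some other interpreter has to run.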

The specification graph is completely built before the actual program runs and is separated from the interpreter.

A specification never contains parts that adapt as the program runs. This is an intended limitation that strengthens the boundary between the specification and the data that changes while the program runs. In other words: A specification defines behavior, it does not behave.

Of course the specification graph is known to the interpreter, but the DSL that is used to build the graph is not.
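A small state machine shows where that boundary could run. In this Python sketch (Machine, machine, DOOR, and Runner are hypothetical names), the transition table is built once and then frozen, while the only thing that changes at run time, the current state, lives in the interpreter.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Machine:
    """Specification: fixed once built, never touched while the program runs."""
    initial: str
    transitions: Mapping[Tuple[str, str], str]  # (state, event) -> next state


def machine(initial: str, *transitions: Tuple[str, str, str]) -> Machine:
    table = {(src, event): dst for src, event, dst in transitions}
    # MappingProxyType gives a read-only view, so the table cannot be modified.
    return Machine(initial, MappingProxyType(table))


DOOR = machine(
    "closed",
    ("closed", "open", "opened"),
    ("opened", "close", "closed"),
)


class Runner:
    """Interpreter: owns all mutable state, reads the specification only."""

    def __init__(self, spec: Machine) -> None:
        self._spec = spec
        self.state = spec.initial

    def fire(self, event: str) -> str:
        # Events without a matching transition leave the state unchanged.
        key = (self.state, event)
        self.state = self._spec.transitions.get(key, self.state)
        return self.state
```

Any number of Runner instances can share the same DOOR value, because none of them is able to change it.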

One specification is for one domain only, but may refer to other specifications of other domains.

This is clearly a bottom up process in which abstractions may appear that belong to another or a new domain.

The actual fun starts when two or more interpreters need to run in parallel to interpret specifications that define systems that affect each other.

For such scenarios we may need to consider unification options to find a specification that is more generic and can be generated out of several other ones. This is obviously a hard problem, if not the hardest.

Links to other specifications, data, or files are "by name" and not by a language construct.

This is open for discussion, because interfaces (preferably memberless, generic interfaces that use types as tags) may be a good enough abstraction mechanism for referring to other specifications.

But this is an implementation detail, and may differ from language to language.
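As a rough sketch of the "by name" variant, assume a hypothetical site description in Python where pages refer to each other by string names only (Page, page, SITE, index, dangling, and titles_linked_from are invented for this example). Resolving the names is left to the interpreter side.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Page:
    name: str
    title: str
    # Links are plain names, not object references to other Page values.
    links: FrozenSet[str] = frozenset()


def page(name: str, title: str, *links: str) -> Page:
    return Page(name, title, frozenset(links))


# "home" may mention "about" before "about" is defined, and the other way round.
SITE: Tuple[Page, ...] = (
    page("home", "Welcome", "about", "imprint"),
    page("about", "About us", "home"),
)


def index(pages: Tuple[Page, ...]) -> Dict[str, Page]:
    """Interpreter side: build the name -> specification lookup."""
    return {p.name: p for p in pages}


def dangling(pages: Tuple[Page, ...]) -> Set[str]:
    """Names that are linked to but never defined."""
    known = index(pages)
    return {link for p in pages for link in p.links if link not in known}


def titles_linked_from(pages: Tuple[Page, ...], name: str) -> List[str]:
    """Resolve the links of one page to the titles of the pages they name."""
    known = index(pages)
    return sorted(known[l].title for l in known[name].links if l in known)
```

Because the links are only names, circular references need no special handling, and missing targets (here "imprint") can be reported by a separate check instead of breaking the construction of the graph.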

A specification should completely cover all aspects of the program behavior.

Of course this is the ultimate goal, but we should never block development if we can't find another abstraction yet. For example, reasonable defaults are fine at the beginning and may later find a way into the specification so that they can be overridden.

Additionally, we surely need to inject code into certain parts of the specification as long as we are not yet able to specify what that code does in an abstract way. This can be considered as a last resort to fill abstraction gaps.
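Both ideas, defaults that can later be overridden and injected code as a gap filler, might look like this in Python. Retry, delays, and the delay_for hook are hypothetical names, and the constant one-second default is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Retry:
    # A reasonable default that later became part of the specification.
    attempts: int = 3
    # Last resort: injected code for a policy we can not describe abstractly yet.
    delay_for: Optional[Callable[[int], float]] = None


def delays(spec: Retry) -> List[float]:
    """Interpreter: compute the waiting time before each retry (in seconds)."""
    delay_for = spec.delay_for or (lambda attempt: 1.0)  # assumed default
    # With n attempts there are n - 1 pauses between them.
    return [delay_for(attempt) for attempt in range(1, spec.attempts)]


plain = Retry()
backoff = Retry(attempts=4, delay_for=lambda attempt: 2.0 ** attempt)
```

Once a pattern such as exponential backoff shows up often enough, the lambda could be replaced by a declarative field, which closes the abstraction gap again.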

So while I can not really grasp how a complex program could be specified instead of programmed, I can try to summarize what it would be like:

Programs could be much more portable, most of the specification could be ported without changes.

Of course, every platform is different, so parts of the low-level specification need to be extended to cover these differences.

Specifications usually don't need to be tested anymore.

The proper implementation of the interpreter needs to be tested. So interpreters do have their test-cases.

Once an interpreter is fully tested, it can be guaranteed that the specification exactly results in what was specified. If it does not, the error is in the interpreter.
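The following Python sketch illustrates where the tests would go under this view. It assumes a tiny, hypothetical expression domain (Const, Add, Mul, evaluate, PRICE): the test cases exercise the interpreter with minimal specifications, one per node type, while the real specification is only data and has no test of its own.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Add, Mul]


def evaluate(expr: Expr) -> int:
    """The interpreter: this is the part that needs test cases."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Add):
        return evaluate(expr.left) + evaluate(expr.right)
    if isinstance(expr, Mul):
        return evaluate(expr.left) * evaluate(expr.right)
    raise TypeError(f"unknown node: {expr!r}")


def test_interpreter() -> None:
    """One minimal specification per node type, plus one nested case."""
    assert evaluate(Const(2)) == 2
    assert evaluate(Add(Const(2), Const(3))) == 5
    assert evaluate(Mul(Const(2), Const(3))) == 6
    assert evaluate(Add(Const(1), Mul(Const(2), Const(3)))) == 7


# A "real" specification: three items at 10 each plus 5 for shipping.
# It states what the price is; there is nothing left in it to unit test.
PRICE = Add(Mul(Const(3), Const(10)), Const(5))
```

Whether the numbers in PRICE are the ones the client wanted is a different question, and that is where acceptance tests may still be needed.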

So higher abstractions rely on the specification to fully cover all the details. If assertions in the sense of invariants are required, they need to be set up in the specification.

But I am not sure if acceptance tests can be avoided. I do hope that they are a dual that can be extracted from the specification or that acceptance tests are part of the specification. And if this gets really messy, the specification may need to be extended to include another perspective.

The program may be completely decoupled from the target language.

With a bit of luck, there are only a limited number of abstractions and interpreters we need to cover. So these interpreters can then be all ported to multiple languages.

We may get a problem with functions or expressions that are embedded in a specification. They may need to be convertible to a widely available language, like JavaScript.

The program could be decoupled from the type of the implementation, be it a functional, class-based, or actor language.

Depending on the nature of the specification, it might be more suited to be run in an interpreter that makes use of the actor model, for example. In an optimistic scenario, the type of implementation language could be chosen by the interpreter depending on the requirements of the specification.

The program could run on multiple computers.

Decoupling the implementation from the specification could enable single programs to run on multiple computers in a massively parallel setup while strongly adhering to the original specification.

This is what scares me about actor-based implementations when there is no central control of change. I don't mean central control in the sense of a god actor, but more in the sense of a controlled distribution and setup of the individual components. When multiple individual components run together, and are changed independently, indeterminism happens. But when they are fabricated from the same specification, they can run with the same properties and think independently, but are at the same time able to deterministically comply with the semantics in the original specification.

There is probably a strong evolutionary reason for having the same DNA in each human cell. But right now, actors and software components in general are programmed with the DNA only existing in the mind of the humans who created them.

That said, I think we should start small by setting up a seed constraint:

We should try to create or favor code that can be split up into a descriptive part and an executive part.

And I will try to set up a page with some of the C#/.NET libraries that are great candidates to build software that does not forget its specification. But I need your help for that.

If you really made it down to here and you are a .NET developer, please send me all the libraries and frameworks you would like to see on that list. Comments or Twitter preferred.