Reference inversion and why collections must die!

Databases don’t store collections, they store tables. If some data needs to be stored as a list, it needs to be normalized. Usually meaning that a new table is required. For example: to get the collection of a 1:n relationship, an SQL query is processed to select a subset of that table.

Projected to object orientation, a table is a collection of all instances of the same class.

Why this matters? Because a collection is a concept that is not easily implementable in self-adjusting computation.

An optimal program is one that incrementally computes only the minimum set of changes required to produce the new output. Humans are in general not capable to write such programs.

An example: Imagine that you need to create a simple graph model that contains nodes each having any number of bidirectional connections.

There are several ways to model this relationship in OO systems, one of them looks like the following:

class Node
{
  string Name;
  List<Connection> In;
  List<Connection> Out;
}
 
class Connection
{
  string Name;
}

Eventually you may end up storing the node pointers in the connection, so that you can follow the graph from any point in any direction:

class Node
{
  string Name;
  List<Connection> In;
  List<Connection> Out;
}
 
class Connection
{
  string Name;
  Node From;
  Node To;
}

But now integrity gets an issue. The list of connections in the Node must be updated when connections change.

Clearly, this violates the DRY principle.

In the relational model (the database) the collections are not explicit, and – because all instances of a type are accessible, they can be derived by a reasonable fast query (assuming indices are set appropriately) over all connections by matching the Node’s id(pointer) with the From column(Node pointer) or To column(Node pointer), for example:
select * from connections where from=mynode

For applications where performance is a number one priority, a database engine can not be used. So developers often end up implementing a huge pile of code that maintains integrity of references and lists thereof.

One strategy to challenge this obvious problem is first to reduce the model classes so that they reflect the structure of relational tables, and then to optimize the performance issues. Laziness strikes again.

Implementing the data structures required to create the list of all In and Out connections of all Nodes by tracking the From an To properties of all Connections seems – at first – a much more complex task compared to the seemingly simple code required to maintain the integrity in the model.

But effectively factoring out this mechanism from the code that touches the model and implementing it so that it can be able to incrementally maintain any 1:n relationship brings a number of new opportunities.

For example, such views may be combined with generators, filters, custom sorting rules or dependency tracking that may be used to build a ecosystem of live abstractions lying on top of the model. Views then get capable to optimize themselves and adjust in real time to changes. From a database perspective, these are triggers on speed!

One implementation path would be to put all objects in a central repository and to notify that repository about modifications. A view manager then receives change sets that contain references to all modified objects and a number of type-based views can then take this information to update their relationship caches.

Because there is no more maintenance code to keep the model’s integrity, multi core support may become an option. Enforcing immutability for the model may be required, but then preserving identity gets pretty hard.

yours
armin