The next step for BrainSharper is to extend its capability to handle very large documents. It should be possible to edit documents that contain thousands of nodes and connectors without noticing a drop in rendering performance. While this “optimization” may look like a premature one, I fear that continuing without it would have dramatic consequences on the internal design, which can be very hard to fix later on. Exceptionally, and only when an optimization directly concerns scalability, my personal guidelines allow me to optimize prematurely, so here it goes:
In its document editor, BrainSharper uses event sourcing to build up the domain model. There are two derived models, or as I call them now “views”, these are a) the domain model and b) the compensating changesets. When a new event is created, the event and both views are written to an SQLite database in one single transaction.
SQLite has this great feature, the R-Tree, which implements a data structure that is suitable for queries over multi-dimensional ranges. So the idea is to maintain one separate R-Tree table to optimize queries for the presentation layer. Maintaining an R-Tree table has the following consequences:
- Besides the R-Tree table, one more view table is required: In the domain model, different types of visuals are stored in separate tables (nodes, connectors, etc.), but when only one R-Tree is used to cover all presentation queries, the R-Tree also needs to refer to different visual types. All the visuals are identified by a GUID, but the R-Tree can only refer to 64 bit integer identifiers. So my first idea was to link the R-Tree to the domain model by using a pivot table, but then I thought that CQRS is about optimizing read performance and so it shouldn’t be a problem to introduce another, simpler, “view” that just contains a list of serialized domain visuals referable by a single integer primary key. The additional table may nearly double the storage that is used, but I guess it’s worth it, because then a query of all visible objects is just single SQL statement.
- The location of the visuals have to be determined up front: Right now, only the minimum information is stored in the domain model. For example, there is no location information directly available for connectors or labels in the database. Their positions are derived from the positions of the nodes in the presentation layer. Now, at the time the R-Tree is generated, the location information needs to be derived directly from the domain model, which is basically simple, but may require some additional lookups.
- The size of the visuals have to be determined up front: To compute the final rectangle that a visual covers on the document, the size of each visual needs to be computed. A problem for which I can see no exact solution. For example, the effective height of a node or the width of a label is dependent on the size of the text that is displayed, which is a measurement that only the presentation layer can provide (or the text subsystem). So to avoid overcomplicating things I’ll need to impose some limits on the maximum size of the visuals and consider these as margins for the final range queries. I guess it should be tolerable to have a number of additional, invisible objects on the screen without compromising performance that much. But even if so, invisible visuals could be thrown away immediately after their final position has been measured.
- The location and the size of some visuals may depend on others: While a node’s center is directly encoded in its domain model object, the start and ending positions of the connectors are not. Therefore, whenever a node is changed, the rectangle information in the R-Tree of all its related connectors need to be updated. For now, this behavior needs to be hardcoded in the component that maintains the R-Tree, however, the chain of dependencies can be derived implicitly from the relations of the visuals. But implementing a change notification system on top of the domain model is not what I want to work on right now, but I am afraid that it might be required for future features.
Well, and that is only half of the complexity. A lot more is required to implement the caching behavior for the visuals that are shown on the screen and specifically the logic that decides what rectangles are being queried from the R-Tree. I may cover that in another post.