Saturday, November 3, 2012

Working with data 3

Collections again

It is very hard to get rid of old habits and choose a solution that looks more complex (it only looks complex until we peek under the hood of the programming language, where the real problems are hidden).

I thought (and have programmed this way many times) that collections are a generic part of data management. Again: they are not. Collections are aspects themselves; they are either part of the owner entity (like the subpanels of a window, or the scheduled commands of the scheduler) or referred internal entity members. Yes, last time I said that I did not need a collection directly, only the iteration/search function, but I forgot to add the word "temporary". It is true that I don't need temporary collections, but the objects themselves do contain many collections. If those were hidden under the "variant" layer, I would be back at the same problem of temporary collections, just under the hood, inside expressions for example.

So: variants do not have multi-value content. If multiple values are needed, the variant holds a reference to a Collection aspect or entity. When expression evaluation encounters such a reference, that is again an accumulate / broadcast / search action (which reintroduces parallel execution at this very low level).
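A minimal Java sketch of this rule (Variant, CollectionRef, and the parallelStream shortcut are my illustrative assumptions, not a final design): the variant itself holds exactly one value, and a multi-value need is always a reference that dispatches the search to the collection aspect, where the parallel execution can happen.

    import java.util.List;
    import java.util.function.Predicate;

    // A variant holds exactly one value; "many values" is always a
    // reference to a Collection aspect, never content inside the variant.
    final class Variant {
        private final Object value; // a single scalar OR a CollectionRef

        Variant(Object value) { this.value = value; }

        boolean isCollectionRef() { return value instanceof CollectionRef; }

        // Expression evaluation never iterates "inside" a variant; it asks
        // the referred collection aspect to search, which is free to run
        // in parallel under the hood.
        Object search(Predicate<Object> condition) {
            if (isCollectionRef()) {
                return ((CollectionRef) value).search(condition);
            }
            return condition.test(value) ? value : null;
        }

        static final class CollectionRef {
            private final List<Object> items; // owned by the entity, not the variant

            CollectionRef(List<Object> items) { this.items = items; }

            Object search(Predicate<Object> condition) {
                // parallelStream stands in for the low-level parallel execution
                return items.parallelStream().filter(condition).findAny().orElse(null);
            }
        }
    }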

Serialization

When different nodes communicate, they need to serialize their content. This component is responsible for handling references (independently of the actual implementation or the syntax of the stream). The generic solution is that each serialization action contains an ID map that connects the unique id of the entity instance with a local, action-unique id. The local id is used all the time, while the content is pushed into the stream only on the first call, when the item is registered into the map. This gives both sides the simplest way to write and read self-referring structures properly.
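A minimal sketch of the write side (the Serializer, writeRef, and writeContent names are illustrative): the instance is registered into the map before its content is written, so any back reference met while writing the content resolves to the already-assigned local id instead of starting a recursion.

    import java.io.IOException;
    import java.util.IdentityHashMap;
    import java.util.Map;

    // Each serialization action has its own ID map: entity instances are
    // mapped to small local ids; content is written only on the first
    // encounter, bare references after that.
    abstract class Serializer {
        private final Map<Object, Integer> localIds = new IdentityHashMap<>();
        private int nextLocalId = 1;

        final void writeReference(Object entity) throws IOException {
            Integer localId = localIds.get(entity);
            if (localId != null) {
                writeRef(localId);         // already registered: local id only
                return;
            }
            localId = nextLocalId++;
            localIds.put(entity, localId); // register BEFORE writing content,
            writeRef(localId);             // so back references stay local ids
            writeContent(entity);          // content follows the first mention
        }

        // Syntax-specific primitives (JSON, binary, ...):
        protected abstract void writeRef(int localId) throws IOException;
        protected abstract void writeContent(Object entity) throws IOException;
    }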

Very important: the mapping must be registered when the serialization of an instance starts, because the referred entities may contain back references to the entity being serialized. They must receive the local id only, not trigger a recursive serialization process. On the other hand: whenever a local-id-based reference is resolved, the identity of the instance should be considered final, but its content may still be incomplete and should not be used at this point. The read process should therefore end with an independent initialization step, after all referred components have been read.
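The read side could look like this sketch (again, illustrative names): resolving a local id creates an identity-final but content-incomplete shell, and initialization runs as a separate pass only after the whole stream has been read.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Local ids resolve to instances whose identity is final but whose
    // content may still be incomplete while reading is in progress.
    final class Deserializer {
        private final Map<Integer, Entity> byLocalId = new HashMap<>();
        private final List<Entity> readOrder = new ArrayList<>();

        // Called whenever a local-id reference appears in the stream.
        Entity resolve(int localId) {
            // Create an empty shell on first sight; its identity is final,
            // but its content must not be used yet.
            return byLocalId.computeIfAbsent(localId, id -> {
                Entity shell = new Entity();
                readOrder.add(shell);
                return shell;
            });
        }

        // Called once, after all referred components have been read.
        void finishRead() {
            for (Entity e : readOrder) {
                e.initialize(); // independent initialization action
            }
        }

        static final class Entity {
            void initialize() { /* build internal state from the read content */ }
        }
    }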

Content in serialization

From the content point of view, we have static and dynamic serialization.

Static serialization is when I want to store and "remember" something as it is, like a configuration file: I want to keep some settings... NO! The config files are actually the primary source of the stored instances! Hmm... This means that for persistent entities, I have to serialize the content only if that serialization is actually the primary storage of those instances. All the others must be stored by their unique identifiers only (global type id plus the identifier within that type). Of course, static serialization cannot contain temporal instances? NO: in some cases, like logging, anything, including temporal instances, must be stored persistently.
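A sketch of this rule (PersistentEntity and its methods are assumed names, and the text output is just a stand-in for a real stream): full content is written only when this stream is the primary storage of the instance; every other instance becomes a (global type id, instance id) pair.

    // Content only if we ARE the primary storage; reference otherwise.
    interface PersistentEntity {
        String typeId();          // globally unique type identifier
        String id();              // identifier within that type
        String primaryStorage();  // which store owns this instance
    }

    final class StaticSerializer {
        private final String storageId; // identity of the stream being written
        private final StringBuilder out = new StringBuilder();

        StaticSerializer(String storageId) { this.storageId = storageId; }

        void write(PersistentEntity e) {
            if (storageId.equals(e.primaryStorage())) {
                out.append("content ").append(e.typeId()).append('/').append(e.id()).append('\n');
            } else {
                out.append("ref ").append(e.typeId()).append('/').append(e.id()).append('\n');
            }
        }
    }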

Dynamic serialization occurs when I send some content to another active node, which can ask back for additional information as well. In this case, the content of the stream should be optimized for the number of additional requests the target has to make when processing the stream. Some of the referred entities it may already have; others it may miss and ask back for. In short: the sender may add the content of any entity instance to a dynamic serialization, even though the target is not the primary source of that information. This happens if the aim of the communication is in fact to get that instance (like requesting a Person record from the node that owns it), or if it is required to understand the response (sending the Address or Medical records along with the Person).
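The decision rule condensed into a sketch (all names are illustrative): inline the content when the instance is the subject of the exchange or needed to understand it; otherwise send a reference only, since the target can ask back for anything it misses.

    final class DynamicSerializer {
        interface Stream {
            void content(Object entity);
            void reference(Object entity);
        }

        void write(Object entity, boolean subjectOfExchange,
                   boolean neededToUnderstand, Stream out) {
            if (subjectOfExchange || neededToUnderstand) {
                out.content(entity);   // full content, even if we are not its primary source
            } else {
                out.reference(entity); // the target may have it already, or can ask back
            }
        }
    }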

It is possible that the sender keeps a sort of log about which objects it has sent to the requester, because it is responsible for keeping them in sync with the actual state, perhaps even for sending change events to them (later, this becomes screen sharing and active teamwork), or at least to optimize the serialization content: send only what is not yet available on the requester side. Of course, this is just an optimization: the requester is allowed to reload all content again (the client may be restarted or lose any content independently of what the server knows about it).
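A minimal sketch of such a per-requester log (SentLog and its methods are assumed names): the sender remembers what each peer has received and inlines content only on the first send; since the peer may restart and lose everything, the log can always be discarded and the content resent.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    final class SentLog {
        private final Map<String, Set<String>> sentByPeer = new HashMap<>();

        // True if the entity content should be inlined for this peer,
        // i.e. this is the first time we send it there.
        boolean shouldInline(String peerId, String entityId) {
            return sentByPeer
                    .computeIfAbsent(peerId, p -> new HashSet<>())
                    .add(entityId); // add() returns true only on first send
        }

        // Called when the peer reports a restart / content loss.
        void forgetPeer(String peerId) {
            sentByPeer.remove(peerId);
        }
    }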

Format and type negotiation

When thinking about connecting to other nodes, we have to keep in mind that the endpoints may have different feature sets. The most important case is a limited client, one with static type management for example, which is simply unable to load a new type in order to understand the content of the serialization stream.

Level zero is being able to build a stream connection and transfer bytes through that line reliably. This is a lower level requirement that has to be solved by implementing the proper connector components; as long as I work with Java or other high-level languages, this should be a non-issue. However, it also means that I must pack the stream operations into units, otherwise I might surprise myself when moving to a more limited environment.
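One simple way to pack operations into units is length-prefixed framing; a sketch using plain Java streams (writeInt produces a 4-byte big-endian header, so even a very limited peer can read complete units):

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    final class Frames {
        static void writeFrame(DataOutputStream out, byte[] unit) throws IOException {
            out.writeInt(unit.length); // frame header: payload size
            out.write(unit);           // frame body: one complete unit
            out.flush();
        }

        static byte[] readFrame(DataInputStream in) throws IOException {
            int length = in.readInt();
            byte[] unit = new byte[length];
            in.readFully(unit);        // blocks until the whole unit arrives
            return unit;
        }
    }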

The first level is that both ends must be able to parse and write the stream, that is: be able to handle the actual syntax. This means having a Serializer component implementation and the language elements for that syntax. The latter is a reference to an existing declarative information package, like the EBNF-like declaration of the JSON language, which is actually used to parse and generate JSON streams (and, in the hands of the Serializer, a valid JSON serialization). These declarations work just like the type declarations themselves: they are persistent entity instances, can be referred to, etc.

So, the limited client has the JSON language definition compiled into its codebase, and the client introduction contains the reference to that instance along with the statement that this is a static client that cannot handle additional stream formats. When a stream connection is made, this information is used, and the more dynamic node can adapt to this requirement, optionally downloading the required language definition.
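A hypothetical shape for this introduction and the negotiation step (Introduction, FormatNegotiation, and the field names are my assumptions): the limited client declares the stream languages compiled into it and that this set is final, and the dynamic node adapts.

    import java.util.List;
    import java.util.Set;

    final class Introduction {
        final boolean staticFormats;   // cannot load new stream syntaxes
        final Set<String> languageIds; // e.g. the id of the JSON definition

        Introduction(boolean staticFormats, Set<String> languageIds) {
            this.staticFormats = staticFormats;
            this.languageIds = languageIds;
        }
    }

    final class FormatNegotiation {
        // Pick a language the static peer already has; a dynamic peer can
        // instead download the definition of our first preference.
        static String choose(Introduction peer, List<String> ourPreference) {
            for (String lang : ourPreference) {
                if (peer.languageIds.contains(lang)) return lang;
            }
            if (peer.staticFormats) {
                throw new IllegalStateException("no common stream language");
            }
            return ourPreference.get(0);
        }
    }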

The second level is the content type set. Again, a static client may have a final set of types, perhaps hard-coded. This is a more generic limitation, affecting any environment that is unable to load code and objects dynamically. For example, the GWT environment can adapt to an additional stream syntax, because if it has the generic EBNF code set, a new syntax is just another structure built from those elements. However, its GUI declaration types are final: a "Table" type is mapped to one specific piece of code, so it cannot understand new, derived Table types. On a really static client this is not even a question: it cannot download and integrate new Type definitions at all.

So, a limited client introduction can also state that the client has a final type set, listing the identifiers of the accepted types. The dynamic sender must "downcast" its content to the required types and skip aspects whose types are unknown to the client. Naturally, this may result in errors because of missing required references; the sender must handle these situations. But at least they are visible...
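A sketch of that "downcasting" against a final type set (TypeDowncaster and TypeNode are assumed names): walk up the type hierarchy to the nearest type the client accepts, and skip the aspect entirely when the whole hierarchy is unknown.

    import java.util.Set;

    final class TypeDowncaster {
        private final Set<String> acceptedTypeIds;

        TypeDowncaster(Set<String> acceptedTypeIds) { this.acceptedTypeIds = acceptedTypeIds; }

        // Returns the nearest accepted ancestor type id, or null to skip.
        String downcast(TypeNode type) {
            for (TypeNode t = type; t != null; t = t.parent) {
                if (acceptedTypeIds.contains(t.id)) return t.id;
            }
            return null; // unknown to the client: skip, and handle the fallout
        }

        static final class TypeNode {
            final String id;
            final TypeNode parent; // null for a root type
            TypeNode(String id, TypeNode parent) { this.id = id; this.parent = parent; }
        }
    }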