My talk at QCon London 2013 is about how this highly complex domain is designed.
It is time for a little retrospective.
The Continual Aggregate Hub (CAH) contains a repository, or data store, consisting of immutable documents that hold aggregates. The repository is aggregate-agnostic: it imposes no schema on the documents, so it is up to the producers and consumers to understand the content (which for them must be verbose). The only thing the repository mandates is a common header for all documents, containing the key(s), type, legitimate period and a protocol.

Also part of the CAH is a processing layer. This is where the business logic lives; all components here reside in memory and are transient. Valid state exists only in the repository, all processing is eventually consistent, and everything must be handled idempotently. Components are fed by queues of documents, the aggregates in the documents are composed into a business model (things are very verbose here), and new documents are produced and put back into the repository. Furthermore, all use of information is retrieved from the repository.
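To make the document shape concrete, here is a minimal sketch in Java. The type and field names (DocumentHeader, aggregateXml, the split of the legitimate period into two dates) are my assumptions, not part of the design; only the header fields themselves come from the description above.

```java
import java.time.LocalDate;
import java.util.List;

// The common header is the only structure the repository mandates.
public record DocumentHeader(
        List<String> keys,   // business key(s) the document is filed under
        String type,         // aggregate type; meaningful only to producers and consumers
        LocalDate validFrom, // legitimate period: start (assumed representation)
        LocalDate validTo,   // legitimate period: end (assumed representation)
        String protocol) {   // the contract agreed between producer and consumer
}

// The aggregate itself is an opaque, verbose payload; the repository never
// parses it, which is what keeps it aggregate-agnostic.
record Document(DocumentHeader header, String aggregateXml) {
}
```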
Realization
With this way of structuring our systems, we can utilize in-memory or BIG-data architectures. The key to utilizing these lies in understanding how your business domain may be implemented (Module and Aggregate design). The IT landscape within NoSQL is expanding quickly, in-memory products are pretty mature, PaaS looks more promising than ever, and BIG-data certainly has a few well-proven candidates. I will not go into detail on these, but use some as examples of how we can utilize them.
This is in no way an exhaustive list. The products mentioned in this blog are used as examples of different implementations or deployments of the CAH. This is not a product recommendation, nor does it represent what Skatteetaten might acquire.

NoSQL: It is all about storing data the way they are best structured. Data is our most valuable asset. It brings out the best of algorithms and data structures (as you were taught in school). For us a document store is feasible, also because legislation defines formal information sets that must last for a long time. Example candidates in this domain are: CouchDB because of its document handling, Datomic because of its immutability and timeline, or maybe MarkLogic because of its XML support.
Scalable processing, where many candidates are possible; it depends on what is important.
In-memory: I would like to divide these into “Processing Grid” and “Data Grid”. Either you have the data inside the processing Java VM, or you have the data outside the Java VM (but on the same machine); a small sketch of the difference follows after this list.
PaaS: An example is Heroku, because I think the maturity of the container is important. The container is where you put your business logic (our second most valuable asset), and we want it to run for a long time (10+ years). Maybe development and test should run at Heroku, while we run the production environment at our own site. Developers could BYOD. Anyway, Heroku is important because it tells us a lot about how we should build systems that have the properties we are discussing here. And the CAH could be implemented there (I will talk about that at SW2013).
BIG-data: We will handle large amounts of data live from “society”. Our current data storage can’t cope with the amounts of data that will be pouring in. This may be solved with Hadoop and its “flock” of supporting systems.
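As a rough sketch of the Processing Grid versus Data Grid placement, here is how the two might look, using Hazelcast purely as an illustration; the map name "repository" and the member/client split are my assumptions, not a statement of what we actually run.

```java
import java.util.Map;

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class GridPlacement {

    // "Processing Grid": this JVM joins the grid as a member, so a share of
    // the documents lives on its own heap, right next to the business logic.
    static Map<String, String> processingGrid() {
        HazelcastInstance member = Hazelcast.newHazelcastInstance();
        return member.getMap("repository");
    }

    // "Data Grid": this JVM is only a client; the documents live in a separate
    // grid process (possibly on the same machine), and every access pays a
    // serialisation hop over a local connection.
    static Map<String, String> dataGrid() {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        return client.getMap("repository");
    }
}
```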
Deployment models
The deployment models reflect our dilemma of balancing “Total Cost of Ownership”: ease of maintenance, robustness, ability to scale, and cost.
In-memory – Processing Grid (~GemFire)
- Pro. Very low latency. Elastic (scale and re-balance)
- Con. Cost (Open Source not stable enough). Heap limitation leads to many VMs. Business code and data are close, which leads to deployment issues.
In-memory – Data Grid
- Pro. Elastic (scale and re-balance). Number of VMs determined solely by the processing modules. Business code and data are separate, which gives a better deployment situation. Low latency (serialisation, but on the same machine).
- Con. Cost.
BIG-data (~Hadoop)
- Pro. Super simple VM (Jetty) that only handles local data. Cost (Open Source is stable). Business code and data are separate, which gives a better deployment situation. Number of VMs determined solely by the processing modules.
- Con. Slow elasticity (scale and re-balance). Disk-to-disk. Latency (map-reduce).
Conclusion
Our application and systems overhaul seems to fit many scalable deployment models, and that is good. Lifetime requirements are strict, and we need flexible sourcing.
We are doing Processing Grid now (we are using Hazelcast), but will acquire some "in-memory" product during 2013 (either Processing Grid or Data Grid). Oracle is the document database, and it is extremely simple: just a table with the header fields as relational columns and the aggregate as a CLOB, as sketched below. The database is “aggregate-agnostic”.
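A minimal sketch of such a table and a write to it; the table name DOCUMENT and the column names are assumptions for illustration, not the actual schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.time.LocalDate;

public class DocumentTable {

    // Assumed DDL, not the real schema:
    //   CREATE TABLE DOCUMENT (
    //     DOC_KEY    VARCHAR2(100),
    //     DOC_TYPE   VARCHAR2(100),
    //     VALID_FROM DATE,
    //     VALID_TO   DATE,
    //     PROTOCOL   VARCHAR2(20),
    //     AGGREGATE  CLOB            -- opaque to the database
    //   );

    private static final String INSERT_SQL =
            "INSERT INTO DOCUMENT (DOC_KEY, DOC_TYPE, VALID_FROM, VALID_TO, PROTOCOL, AGGREGATE) "
            + "VALUES (?, ?, ?, ?, ?, ?)";

    // Documents are immutable, so the repository only ever inserts new rows;
    // reads go by the header columns, never by the content of the CLOB.
    static void store(Connection con, String key, String type,
                      LocalDate validFrom, LocalDate validTo,
                      String protocol, String aggregate) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            ps.setString(1, key);
            ps.setString(2, type);
            ps.setDate(3, java.sql.Date.valueOf(validFrom));
            ps.setDate(4, java.sql.Date.valueOf(validTo));
            ps.setString(5, protocol);
            ps.setString(6, aggregate); // CLOB column; setString is fine for moderate sizes
            ps.executeUpdate();
        }
    }
}
```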
Somewhere around 2015, when the large volumes really show up, you will probably see us with something like Hadoop, maybe in addition to the ones mentioned above. Since we do not require sub-second latency, and we will have a long tail of historic data, maybe just Hadoop? Who knows?
We are navigating into a system landscape where we may choose deployment models in a more tactical cost/benefit evaluation. We are not tied to a database or a single machine anymore.
Target architecture looking good by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.