Saturday, March 5, 2011

Workshop with John Davies

We had John here for two days; it was intense and very encouraging. First and foremost we got support for our efforts: the technology stack to use, the development process to follow and how to construct future applications. Second, we got tips on directions and prioritization that seem so obvious now that the session is over, but that we did not see before we started. Third, there is feedback and experience that we do not yet know is relevant or achievable for us.

We started off with some hours of introduction to our domain. It is a steep hill and I am sure we could have used weeks to get further into it. Still, I think we managed to convey the rough picture, so that we got input targeted at our challenges. Compared to what John has previously been working on, I think we have far fewer transactions, but the diversity in the data we capture and in the regulations put on our systems seems harder.

So to shortly summarize:
  • Get confidence and commitment from top management (by demoing prototypes).
  • Build prototypes, have a sandbox, prove concepts. Don't use valuable staff on reports.
  • Run agile, have much shorter projects.
  • Any plan over 2 years is a risk or a waste.
  • Understand your problem; use what you need, not what a salesperson thinks you need.
  • Implement software to run on more than one platform. Put this into the build process; be able to change database, app server or cache at any time.
  • Don't use too much time on storage architecture. Build for the cache, concentrate on the business logic.
  • Integration architecture is much simpler than what's in an ESB product: make protocol-neutral services, use a broker and a naming service (JNDI). See the sketch at the end of this post.
  • Standardize and virtualize IT platform.
  • Use UML in design.
  • Ruby on Rails is great for GUI.
  • We have the chance to make a revolutionary tax-system.
Are you up to it? Can you help us implement this?
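As a minimal sketch of the broker-and-naming point from the list above (the logical name "services/TaxCalculation" is my assumption, not an agreed convention), a consumer asks the naming service for a logical name and never hard-codes protocol, host or implementation:

import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class ServiceLookup {
    // The naming service (JNDI) decides which concrete endpoint is bound to the logical name,
    // so the consumer stays protocol-neutral and the binding can change at any time.
    public static Object lookup(String logicalName) throws NamingException {
        Context ctx = new InitialContext(); // provider settings come from jndi.properties
        return ctx.lookup(logicalName);     // e.g. "services/TaxCalculation"
    }
}

The caller casts the result to its own service interface; swapping the broker or the implementation behind the name requires no change in the consuming code.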

Tuesday, February 8, 2011

Comment on: RESTful SOA or Domain-Driven Design - A Compromise?

I find this talk by Vaughn Vernon very encouraging and it fits nicely into what we are trying to achieve.
It is very clear and educational and should be made basic training at our site.
http://www.infoq.com/presentations/RESTful-SOA-DDD


It actually fills a gap very nicely as to how our target architecture may be implemented, and more importantly, why it should behave as described. The Continual Aggregate Hub and restaurant is really a macro description of how such domain systems may cooperate in a pipeline, and how they can share a document storage (documents are how we store Aggregates).

In the migration-to-the-cloud post I am talking about 'Cooperating federated systems', and this is exactly what I observe Vernon talking about in his presentation. I especially like the Domain Event (pages 19 and 34); it is the implementation of what I have called the 'Lifecycle'. Maybe lifecycle is not a good word, but it was chosen to illustrate what drives behavior on the domain objects, between and within these loosely coupled systems (they have separate bounded contexts).

The Domain Event Publisher (page 37) is the process-state handling of the CAH, and it is relaxed so that subscribing Modules may consume at whatever pace they wish. The Event Logger is also an important aspect, so that re-running events is possible (illustrated nicely from page 39 on).
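A minimal sketch of how such a Domain Event, a publisher and an event log could fit together in Java (AggregateAccepted, DomainEventPublisher and the replay method are my names, not Vernon's or our actual design):

import java.time.Instant;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// A Domain Event records something that happened in one bounded context.
record AggregateAccepted(String aggregateId, Instant occurredOn) {}

class DomainEventPublisher {
    private final List<Consumer<AggregateAccepted>> subscribers = new CopyOnWriteArrayList<>();
    private final List<AggregateAccepted> eventLog = new CopyOnWriteArrayList<>(); // journal, enables re-runs

    void subscribe(Consumer<AggregateAccepted> subscriber) { subscribers.add(subscriber); }

    void publish(AggregateAccepted event) {
        eventLog.add(event);                       // log first, so the history can be replayed
        subscribers.forEach(s -> s.accept(event)); // a real broker would let each Module consume at its own pace
    }

    void replay(Consumer<AggregateAccepted> subscriber) {
        eventLog.forEach(subscriber::accept);      // re-run past events against a fixed Module
    }
}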

We have very interesting performance results (over 50,000 aggregates per second) from a Proof of Concept of the Continual Aggregate Hub running in a grid.

(BTW: the real essence is not whether it is RESTful or not; other service implementations and protocols will also do.)

Saturday, February 5, 2011

Aggregate persistence for the Enterprise Application

Designing Aggregates (Domain-Driven Design) is key to a successful application. This post explains what the Aggregate and the Root Object look like in our Continual Aggregate Hub.

As I have written in earlier posts, we are trying to handle our challenge - a large-volume, highly flexible, pipeline sort of processing, where we take in quite different sets of information, fix it and calculate some fee - by mixing Domain-Driven Design, SOA, Tuple Space, BASE (and others) with a coarse-grained document store that contains Aggregates (see previous discussions, where we store Aggregates as xml documents).

Known good areas of usage
We know that a document approach is applicable for certain types of challenges: Content Management Systems, Search Engines, Cookie Crunchers, Trading, etc. We also know that documents handle transactions (messages) very nicely. But how applicable is it to an Enterprise Application type of system? We want loose coupling between sets of data because it can scale out, gives functional loose coupling, and for other reasons discussed in earlier posts here.

Why two data structures?
We want systems that are easier to develop and maintain. Today most Java systems have one structure in the business layer, where we successfully develop code at a good pace, using unit tests and mock data to enable fast development. Everything seems fine, until we get to the object-relational mapping (ORM). Here we must model all the same data again, but now in a different structure. At the storage level we add tables, constraints and indexes so that we are sure the data is consistent. But that has already been done in the business layer. Why do we continue to do this twice?
The relational model is highly flexible, sound and robust. A good reason to use it is that we want to store data in a bigger context than the one the business logic handled. But wouldn't it be great to relax this layer and trust the business logic instead?

Relational vs. Document
We know that the document approach scales linearly very well, and that the relational database does not have the same properties because of ACID and other mechanisms, but why is that so?
Structure Comparison


The main reason for not being able to scale out is that data is spread over many tables (and that is the main structure of most object databases too). Data for all contexts is spread across all tables. Data belonging to Party S is all over the place, mixed with data for Party T. During an insert (or update), concurrency challenges arise in tables A, B and C. The concurrency mechanism must handle continuous resource usage on all tables. No wonder referential integrity is important.
In the document model the objects A, B and C are stored within the document. This means that all data for Party S is in one document and T is in another. No common resource and no concurrency problem.
The document model is not as optimal if there are many usage scenarios that handle all objects C, regardless of which entity they belong to.
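To make the contrast concrete, here is a minimal sketch (PartyAggregate and DocumentStore are hypothetical names, and the xml serialization is assumed to happen elsewhere) of keeping everything about one party in one document under one key, instead of as rows spread over tables A, B and C:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// All objects (A, B and C) for one party are stored together as a single document.
record PartyAggregate(String partyId, String xmlDocument) {}

class DocumentStore {
    private final Map<String, PartyAggregate> documents = new ConcurrentHashMap<>();

    // A store touches one entry only; Party S and Party T never contend for the same resource.
    void store(PartyAggregate aggregate) { documents.put(aggregate.partyId(), aggregate); }

    PartyAggregate load(String partyId) { return documents.get(partyId); }
}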

The Enterprise Application challenge
So how do we solve the typical Enterprise Application challenge with a document store approach? (Shouldn't we be twice as agile and productive if we do not need to maintain a separate storage model?) Finding the granularity is important, and it should most probably follow the main usage scenarios. To be able to compose aggregates there should be some strong keys on which the business logic must ensure referential integrity. Even though we may not have integrity checks in the storage layer, I am not sure that is so bad. We do validate the documents (XSD and business logic) before we store them. And I have lost count of how much bad data I have debugged in databases, even though they had a lot of schema enforcement.
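A minimal sketch of the "validate before you store" step, using the standard javax.xml.validation API (which schema file is used, and where this sits in our pipeline, are assumptions):

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;
import java.io.StringReader;

class AggregateValidator {
    private final Schema schema;

    AggregateValidator(File xsdFile) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        schema = factory.newSchema(xsdFile);
    }

    // Throws an exception if the document breaks the schema; business-rule validation runs afterwards.
    void validate(String xmlDocument) throws Exception {
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new StringReader(xmlDocument)));
    }
}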

The super-document
A lot of the information that we handle in our systems is not part of the domain. There is also intermediate information, historical states and audit, for instance. Remember that in a document approach we reverse the concept and store everything about Party S by itself, in its own document. To cope with a document approach, the document itself must be placed within a structure (a super-document) that holds metadata about common concerns such as: keys, process information (something simple like: new, under construction, accepted), rationale (what decisions the system made in order to produce this result), anomalies (what errors there are in the aggregate), and audit (who did what, when). The <head> is the Root Object, and it is generic, so that all documents are referenced in a uniform way. The super-document is structured like this:

<head>
  <keys/>
  <process/>
  <aggregate/>
  <rationale/>
  <anomalies/>
  <audit/>
</head>
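On the application layer the same envelope could be a plain type. A sketch with hypothetical names, only to show that the domain Aggregate is kept apart from the common concerns wrapped around it:

import java.util.List;

// Generic envelope: every stored document is referenced in the same uniform way.
record SuperDocument(
        Keys keys,                // strong identifiers used to compose aggregates
        String processState,      // e.g. "new", "under construction", "accepted"
        String aggregateXml,      // the domain Aggregate itself, stored as xml
        List<String> rationale,   // decisions the system made to produce this result
        List<String> anomalies,   // known errors in the aggregate
        List<String> auditTrail   // who did what, when
) {}

record Keys(String partyId, String documentId, String version) {}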

The IO challenge
We succeed in this only if we also manage to make it perform. The main pitfall for any software project is the time and space dimensions. Your model may look great and your code super, but if it does not perform and does not scale, you lose. The document storage model is only successful if you manage to reduce IO, both calls and size. If you end up transporting too much information, or if you have too many calls (compared to ORM), then the document model may not be optimal. An Enterprise Application may have hundreds of tables, where probably 30% are many-to-many relations. I have seen applications with more than 4000 tables... Only a genius or demigod can manage that. Most probably it will just be unstable for the rest of its lifetime (see the comment on the Silo). My structure example above is way too simple compared to the real world. But surely, for many of these applications there is a granularity that fits the usage scenarios better. I have seen documents with 100,000 nodes serialized in less than a second. Don't 20 document types, small and large, seem more manageable than 200 tables?

In our upcoming Proof-of-Concepts we will be investigating these ideas.
Creative Commons License
Aggregate persistence for the Enterprise Application by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Tuesday, January 25, 2011

Implementing the CAH

So how do we do this? How do we implement the Continual Aggregate Hub (see also the comment on RESTful SOA or Domain-Driven Design)?

I recall how fast we produced business logic using Smalltalk and object-oriented programming with good data structures and data operators. Since then I have not seen much of "Algorithms and data structures". What we all learned in basic IT courses at university does not seem to be in much use. Did SQL or DTOs ruin it all? Did the relational, normalized data structure put so many constraints on us that programming was reduced to plain looping, some calculations and moving data to the GUI? Where is the good business model that we can run code on? We need good software craftsmanship.

The basis for a good implementation is a good architecture. We need to make some decisions on the architecture (and its implementations) for our processing and storage. I would really like to have the same structure in the storage architecture as in the processing layer (no mapping!). It means less maintenance and a less verbose code base. There are so many interesting cases, patterns and alternative implementations that we are not sure where to start. So maybe you could help us out?

Our strategy for our target environment is about parallel programming and being able to scale out. I find this talk interesting, at least slides 69/35 and the focus on basic algebraic properties: http://www.infoq.com/presentations/Thinking-Parallel-Programming I find support here for what we are thinking of: the wriggle room, that sequence does not matter. The waiters in the CAH restaurant are associative and commutative in how they handle orders. I also agree that programmers should not think about parallel programming; they should think "one at a time". But when designing the system, parallel processing should be modeled in and be part of the architecture.
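A small sketch of what those algebraic properties buy us (Order and the fee figures are made up): because adding fees is associative and commutative, the orders can be split, reordered and merged freely by the runtime, while the programmer still writes "one at a time" logic:

import java.math.BigDecimal;
import java.util.List;

record Order(String partyId, BigDecimal fee) {}

class FeeTotals {
    // The combine function (addition) is associative and commutative,
    // so a parallel reduction gives the same result as a sequential one.
    static BigDecimal totalFees(List<Order> orders) {
        return orders.parallelStream()
                     .map(Order::fee)
                     .reduce(BigDecimal.ZERO, BigDecimal::add);
    }
}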


It seems like Tuple Space is the right direction, but here too there are different implementations. What other implementations are there that would be sound and solid enough for us? Several implementations are referenced at http://blogs.sun.com/arango/entry/coordination_in_parallel_event_based, but which one?
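Whatever the product, the coordination model we are after is small. A minimal sketch (template matching and persistence left out; the interface is mine, not JavaSpaces or any vendor API):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// The essence of a tuple space: producers write work items, workers take one
// when they have capacity; writer and taker never know about each other.
interface TupleSpace<T> {
    void write(T tuple);
    T take() throws InterruptedException; // blocks until a tuple is available
}

class InMemoryTupleSpace<T> implements TupleSpace<T> {
    private final BlockingQueue<T> tuples = new LinkedBlockingQueue<>();
    public void write(T tuple) { tuples.add(tuple); }
    public T take() throws InterruptedException { return tuples.take(); }
}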

For the storage architecture there are also many alternatives: Hadoop with HBase, or MarkLogic, for instance. Or is Hadoop much more? If we can have all storage and processing on every node, how do we manage it? How much logic can be put into the Map-Reduce? What is practical to process before you merge the result?
I just can't let go of the feeling that structured storage belongs in a well-known and solid relational database. The real challenge is to think totally differently about how we should handle and process our data (see document store for enterprise applications). Is it really necessary to have a different data structure in the storage architecture? Maybe I am waking up from a bad dream.
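To get a feel for how much logic can live in the "map" step before merging, here is a plain-Java sketch in the Map-Reduce spirit (not the Hadoop API; Aggregate and the flat 25% rate are made up): each aggregate is turned into a fee entirely on its own, and only the per-party merge needs the partial results brought together:

import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

record Aggregate(String partyId, BigDecimal taxableAmount) {}

class FeePerParty {
    // "Map": compute the fee for one aggregate in isolation - this part can run on any node.
    static BigDecimal fee(Aggregate a) {
        return a.taxableAmount().multiply(new BigDecimal("0.25")); // hypothetical flat rate
    }

    // "Reduce": merge the mapped values per party key.
    static Map<String, BigDecimal> feesByParty(List<Aggregate> aggregates) {
        return aggregates.parallelStream()
                .collect(Collectors.groupingBy(
                        Aggregate::partyId,
                        Collectors.reducing(BigDecimal.ZERO, FeePerParty::fee, BigDecimal::add)));
    }
}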

In the CAH post we want to store the Aggregates as they are. I think we do not need different data structures in the processing architecture (layer) and the storage architecture.

(2013.10.30): It is now implemented:  big data advisory definite content

Creative Commons License
Implementing the CAH by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, January 13, 2011

Migration strategy and the "cloud"

Our background
We have a classic IT landscape with different systems built over the last 30 years, large and small "silos" - mostly each serving its own business unit - and file-based integration between them. Mainly we have Cobol/DB2 and Oracle/PL-SQL/Forms.

We are challenged by a quite classic situation. How do we serve the public, how do we get a holistic view of the entities we are handling, how do we handle events in more "real time", how do we make maintenance less expensive, how do we become much more responsive to change, and how do we ensure that our systems and architecture are maintainable in 2040? Every business that is not greenfield is in this situation.

We are working with Enterprise Architecture and using TOGAF. Within this we have defined a target architecture (I have written about core elements of this in Concept for a aggregate store and processing architecture), and are about to describe a road map for the next 5-10 years.

Content:
  • What's this "silo"?
  • What do we expect of the cloud?
  • Loose coupling wanted
  • Cooperating federated systems
  • Complexity vs uptime
  • Target architecture main processes 
What's this "Silo"?
(C) Monty Python - The Meaning of Life
Don't feed the silo! The main classifier for the silo is that it is too big to manage and that it is not very cooperative. It is very self-centered and wants to keep all data and functionality for itself. By not feeding it, I mean that it is so tempting to put just one more piece of functionality onto it, because so much is already there. But because so much is intermingled, the consequences of change are hard to foresee. What you have is a system that really no one understands; it takes ages to add new functionality, you probably never get it stable, and testing is very costly (compared to the change). In many ways the silo is this guy from Monty Python exploding from one "last chocolate mint".
Typically these silos each have a subset of information and functionality that affects the persons and companies that we handle, but getting the overall view is really hard. Putting a classic SOA strategy on top of this is a catastrophe.
Size is not the main classifier though. Some problems are hard and large. We will have large systems just because we have large amounts of state and complex functionality.

What do we expect of the cloud?
Cloud container
First and foremost it is an architectural paradigm describing a platform for massively parallel processing. We need to start building functionality within a container that lets us have freedom in deployment and separates business functionality from technical concerns and constraints. We need an elastic computing platform, and to start constructing applications that scale "out of the box" (horizontal scaling). We will focus on IaaS and PaaS.
But not all problems or systems gain from running in the cloud. Most systems built to this day simply do not run in the cloud. Also, our data should not "leave our walls", but we can always set up our own cloud or take a more national, government-wide approach.

Divide and conquer
Modules and independent state
No one understands the silo itself, and it does not get any easier when the total processing consists of more than one silo. The problem must be decomposed into modules, and these modules must be loosely coupled and have separate concerns. We find Domain-Driven Design helpful here. But the challenge is more than just a functional one; there are also technical and organizational requirements that constrain which modules are actually needed. Remember that the goal is a systems environment which is cheaper to maintain and easier to change as requirements change. The classical SOA approach overlooks the importance of state. No process can function without it. So putting a SOA strategy (implementing a new Web-Service/BPEL-like integration layer) on top of silos that already had a difficult maintenance situation in no way makes things simpler. The total problem must be understood. Divide and conquer! Gartner calls this application overhaul.
The organization that maintains and runs a large system must understand how their system is being used. They must understand the services others depend upon and what SLAs are put on them. Watch out for a strategy where some project "integrates" with these systems without involving the system's organization or the system itself. Release handling and stability will not gain from this "make minor" approach. The service is just the tip of the iceberg.
A silo is often the result of a unilateral focus on the organization. The business unit deploys a solution for its own business processes, overlooking reuse and the greater business process itself. Dividing such a silo is a lot about governance.
Also, you will see that different modules have different technical requirements. Therefore there may be different IT architectures for the different systems.
When a module is master in its domain, you must understand the algorithms and the data. If the data elements have independent behavior (can be sharded), processing can be parallelized and run in the cloud. In the example the blue module has independent information elements. It will probably gain from running in the cloud, but must still cooperate with the yellow and green modules.
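A minimal sketch of what "can be sharded" means in practice (the shard count and keys are arbitrary): each party is routed to a shard by its key, and shards never need to coordinate with each other:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class Sharding {
    // Route a party to one of N independent shards; parties in different shards
    // share no state and can be processed in parallel, on separate nodes.
    static int shardFor(String partyId, int shardCount) {
        return Math.floorMod(partyId.hashCode(), shardCount);
    }

    static Map<Integer, List<String>> partition(List<String> partyIds, int shardCount) {
        return partyIds.stream()
                .collect(Collectors.groupingBy(id -> shardFor(id, shardCount)));
    }
}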

Wednesday, December 22, 2010

Continual Aggregate Hub and the Data Warehouse

Development in this area makes this article obsolete.

The CAH will contain aggregates represented as xml documents. These aggregates are tuned for the usage patterns relevant to the process being executed, i.e. the primary products. Many other, less relevant but not unimportant, usage patterns exist (for more secondary products). These are often not so time-critical for the main process and can wait. We see that the data warehouse (DWH), or a more dedicated operational data store (ODS), has the role of fulfilling the purpose of the secondary products. (The CAH is an ODS to some extent.) The CAH emits all new aggregates produced and will enable the DWH to be more operational. The DWH can query and collect data at whatever interval it is capable of. The Aggregates are also clearly defined and identified, which makes the ETL process simpler. Furthermore, all details remain available for querying in the CAH, so the DWH does not need to keep them. These capabilities will lessen the burden on the DWH.
Although the CAH stores data as xml, the DWH may store it in whatever form suits it best.
Creative Commons License
Continual Aggregate Hub and the Data Warehouse by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Continual data hub architecture and restaurant

Continual Aggregate Hub architecture and restaurant

The Continual Aggregate Hub (CAH) has these core (albeit high-level) architectural elements:
  • Component based and layered
  • Process oriented
  • Open standards
  • Object oriented
  • Test, re-run and simulate
  • and last: it should behave like a restaurant
CAH processing architecture


Component based and layered
The main concern is to protect the business logic from the technical domain. Freedom to deploy! This is more or less obvious, but the solution should consist of small, compact Modules with a clear functional purpose, embedded in a container that abstracts technical constraints away from the functional ones. This is roughly dependency injection and the Java container.
The components and the layers should be independent so that they can scale out.

The TaxInfo data store and cache is preferably some in-memory architecture (e.g. GigaSpaces), with relevant services around it. A Module deploys service components into this architecture. These services peek into (consume) or poke at (produce) the aggregates that the Module owns. The services are used by humans or other systems. The Bounded Context of DDD spans vertically across the User Interface Architecture, TaxInfo and the Processing Architecture.
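A minimal sketch of the peek/poke idea (AggregateService is a hypothetical name; the Module behind it owns the aggregates in the in-memory store):

// Protocol-neutral service a Module deploys around the aggregates it owns.
// Consumers peek, producers poke; nobody reaches into the store directly.
interface AggregateService {
    String peek(String aggregateId);            // read a copy of an aggregate (as xml)
    void poke(String aggregateId, String xml);  // hand a new or changed aggregate to the owning Module
}

Humans and other systems only see this interface; the cache product behind it can change without touching the consumers.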

The User Interface Architecture is a browser-based GUI platform where the user can go from one page to another without knowing which system he is actually working with. We do not need a portal product, just a set of common components, standards and a common security architecture.

Process oriented
The Modules should cooperate by producing Aggregates. The architecture must continually produce Aggregates to facilitate simple consumption. This lays the base for parallel processing, sharding and linear scalability. It will also make the process more robust.
Manual and automated processes cooperate through process state and common Case understanding.
A functional domain consists of Modules for automated processing and Modules for human interaction. Human interaction is mainly querying or inspecting a set of data, or making manual changes to data.
In between processing modules there is either a queue or some sort of buffer, or the Module may process at will.

Open standards
The solution we implement will live for many years (20-30?), so we must choose open standards that we assume will live for some time, and also architectural paradigms that will last longer than the standards themselves. The stack will never consist of just one technology.
We must protect our investment in business logic (see components based), and we must protect the data we collect and store.
As of today we say Java, HTML, and XML (for the permanent data store).

Object oriented
Business logic within our domain is best represented in an object-oriented manner. Every aggregate has an object-oriented model and logic with basic behavior (a basic Module). Business logic can then operate on these aggregates and perform higher-level logic.
By having a 1:1 relationship between stored data and basic behavior (on the application layer), locking and IO will benefit.

Test, re-run and simulate
By having discrete processing steps with defined modules and aggregates, testing is enhanced.
All systems have errors, and these often affect data that has been produced over some time. By having defined aggregates and a journal archive of all events (the Process state component), it is possible to simply truncate some aggregates and restart the fixed Modules.
Simulation is always on the business's wish-list. Simulation Modules can easily sit side by side with the real ones, producing specific aggregates for the simulated result. Simulated aggregates are only valid in a limited context and do not affect the real process flow.

The Continual Aggregate Hub restaurant
Think of the CAH as a restaurant. Within the restaurant there are tables with customers who place orders, waiters who serve them, and a kitchen with cooks. A table in the CAH is a collection of Aggregates, in our domain the “Tax Family”. (All tax families are independent and can be processed in parallel.) The waiters watch for tables with changes that need to be served. A change results in an order with the relevant aggregates, which the waiter brings to the kitchen. All orders are placed in a queue (or an order buffer of some sort), and the cooks process orders as capacity allows. It is important that the cook does not run to the shop for every ingredient; the kitchen must have the resources necessary for processing the order. So the kitchen is autonomous (then it can scale by having more cooks, or specialized cooks), but of course it has a defined protocol (types of orders and known aggregates). When the dish (aggregate) is finished, it is brought back to the table (it is stored).
So the main message is that the cook must not go to the shop while cooking. I think this is where many systems take the wrong approach: somewhere in the code the program does some lookup here or there. Don’t collect information as you calculate. Separate the concerns: collect the data, process it, and store the result!
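A minimal sketch of that separation, with made-up names: the waiter assembles the complete order before it reaches the kitchen, the cook only calculates, and storing the dish is a separate step:

import java.util.List;
import java.util.function.Consumer;

// The order already carries every aggregate (ingredient) the cook will need.
record KitchenOrder(String tableId, List<String> aggregates) {}

class Cook {
    // Pure calculation over what is on the order - no trips to the shop mid-recipe.
    String process(KitchenOrder order) {
        return "dish-for-" + order.tableId(); // stands in for the finished aggregate
    }
}

class Kitchen {
    private final Cook cook = new Cook();

    // Collect, process, store: three separate concerns.
    void serve(KitchenOrder order, Consumer<String> store) {
        String dish = cook.process(order);
        store.accept(dish); // bring the finished dish back to the table (persist it)
    }
}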

Creative Commons License
Continual data hub architecture and restaurant by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.