Tuesday, January 25, 2011

Implementing the CAH

So how do we do this? How do we implement the Continual Aggregate Hub (see also comment on restful soa or domain driven design).

I recall how fast we produced business logic using Smalltalk and object oriented programming with good data structures and data operators. Since then I have not seen much to "Algorithms and data structures". What we all learned in basic IT courses at University, does not seem to be in much use. Did SQL or DTO's ruin it all? Did the Relational normalized data structure put so many constraints that programming was reduced to plain looping, some calculations and moving data to the GUI? Where is the good business model that we can run code on. We need good software craftsmanship.

The basis for good implementation is a good architecture. We need to make some decisions on architecture (and their implementations) for our processing and storage. I would really like to have the same structure in the storage architecture as in the processing layer (no mapping!). It means less maintenance and a more verbose code base. There are so many interesting cases, patterns, and alternative implementations that we are not sure where to start. So maybe you could help us out?

Our strategy for our target environment is about parallel programming and being able to scale out. I find this talk interesting; at least slides 69/35 and the focus on basic algebra properties. http://www.infoq.com/presentations/Thinking-Parallel-Programming I find support here for what we are thinking of; the wriggle room, sequence does not matter. The waiters in the CAH restaurant are Associative and Commutative handling orders. I also agree that programmers should not think about parallel programming, they should think "one-at-a-time". But in designing the system parallel processing should be modeled in and it should be part of the architecture.

It seems like Tuple Space is a right direction, but also here there are different implementations. But what other implementations is there that will be sound and solid enough for us? Several implementations are referenced at (http://blogs.sun.com/arango/entry/coordination_in_parallel_event_based), but which?

For the storage architecture there are also many alternatives. Hadoop with HBASE, or MarkLogic for instance. Or is Hadoop much more. If we can have all storage and processing at every node. How do we manage it? How much logic can be put into the Map-Reduce. What is practical to process before you merge the result?
I just cant let go of feeling that storage structured is within a well known and solid relational database. The real challenge is to think totally different as to how we should handle and process our data. (see document store for enterprise applications) Is it really needed to have a different data structure in the storage architecture? Maybe I am feeling like waking up from a bad dream.

In the CAH blogg we want to store the Aggregates as they are. I think we not need different data structure in processing architecture (layer) and the storage architecture.

(2013.10.30): It is now implemented:  big data advisory definite content

Creative Commons License
Implementing the CAH by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, January 13, 2011

Migration strategy and the "cloud"

Our background
We have a classic IT-landscape with different systems built over the last 30 years, large and small "silos" - mostly serving is own business unit - and file based integration between these. Mainly we have Cobol/DB2 and Oracle/PL-SQL/Forms.

We are challenged by a quite classic situation. How do we serve the public, how do we get a holistic view on the entities we are handling, how do we handle events in more "real time", how do we make maintenance less expensive, how do we get much more responsive to change, how do we ensure that our systems and architecture are maintainable in 2040? Every business which is not greenfield, is in this situation.

We are working with Enterprise Architecture and using TOGAF. Within this we have defined a target architecture (I have written about core elements of this in Concept for a aggregate store and processing architecture), and are about to describe a road map for the next 5-10 years.

  • What's this "silo"?
  • What do we expect of the cloud?
  • Loose coupling wanted
  • Cooperating federated systems
  • Complexity vs uptime
  • Target architecture main processes 
What's this "Silo"?
(C) Monthy Pyton - Meaning of Life
Don't feed the silo! The main classifier for the silo is that it is to big to manage and that it is not very cooperative. It is very self-centered and wants to have every data and functionality for itself. By not feeding it, I mean that it is so tempting to put just another functionality onto it, because so much is already there. But because so much is intermingled, consequences of change is tough to fore-see. What you have is a system really no-one understand, it takes ages to add new functionality, probably you never get it stable and testing very costly (compared to the change). In many ways the silo is this guy from Monthy Pyton exploding from a "last chockolate mint".
Typically these silos each have a subset of information and functionality that affects the persons and companies that we handle, but getting the overall view is really hard. Putting a classic SOA strategy on top of this is a catastrophe.
Size is not the main classifier though. Some problems are hard and large. We will have large systems just because we have large amounts of state and complex functionality.

What do we expect of the cloud?
Cloud container
First and foremost it is an architectural paradigm describing a platform for massive parallel processing. We need to start building functionality within a container that lest us have freedom in deploying, separate business functionality for technical concerns or constraints. We need an elastic computing platform, and start to construct applications that scale "out of the box" (horizontal scaling). We will focus on IaaS and PaaS.
But not all problems or systems gain from running in the cloud. Most of systems built to this day, simply does not run in the cloud. Also  data should not "leave our walls", but we can always set up our own or have a more national government approach.

Divide and conquer
Modules and independent state
No-one understand the silo itself, and it's not any easier to understand it when the total processing consist of more than one silo. The problem must be decomposed into modules and these modules must have loose coupling, and have separate concerns. We find Domain Driven Design to be helpful. But the challenge is more than just a functional one, there are also both technical and organizational requirements that put constraints on what modules are actually need.  Remember that the goal is to have an systems environment which is cheaper to maintain and is easier to change as requirements change. The classical SOA approach oversees the importance of state. No process can function without it. So putting a SOA strategy (implementing a new integration like Web-Service / BPEL like system) on top of silos that already had a difficult maintenance situation, is in no way makings things simpler. The total problem must be understood. Divide and conquer! Gartner calls this application overhaul.
The organization that maintains and runs a large system must understand how their system is being used. They must understand the services other depend upon and what SLA's are put on them. Watch out for a strategy where some project "integrates" to these systems, without embedding the system's organization or the system itself. Release handing and stability will not gain from this "make minor" approach. The service is just the tip of the iceberg.
A silo is often a result of unilateral focus on organization. The business unit deploys a solution for its own business processes, and overseeing reuse and the greater business process itself. Dividing such a silo is a lot about governance.
Also you will see that different modules will have different technical requirements. Therefore there may be different IT-architecture for the different systems.
When a module is master in its domain, you must understand the algorithms and the data. If they have independent behavior (can be sharded), it can be paralleled and run in the cloud. In the example the blue module has independent information element. It will probably gain from running in the cloud, but must still cooperate with the yellow and green module.