Friday, September 21, 2012

Business Intelligence and Predictive Analytics in the CAH

Use of Complex Event Processing (CEP) in the Continual Aggregate Hub (CAH)
With all these Aggregates being stored, it is necessary to understand what events (or combination of them) that should trigger some business activity. Some of these are already discussed in CAH, "The Restaurant" and in "Module and Aggregate design". It is the hard job of the waiter in "Restaurant" to execute CEP. CEP is already in use for the primary processing job of the CAH - to process the correct tax. This blog is about processing secondary concerns - correlating events that should trigger actions to follow up on the primay task: Business Intelligence and Predictive Analysis. (This blog-post is not about the analysis algorithms, but on the architecture and tool-box)

Just to summarize some constraints of the CAH (definitions from Domain Driven Design):
  • All Aggregates have a commonly defined Root-object.
  • Aggregates that are "Private" may change at the will of the Module producing it.
  • Aggregates that are "Private" are not known by other Modules.
  • When an Aggregate goes "Public" that is a business event and the Aggregate contains the data. 
  • Aggregates that are "Public" will never change.
  • If the Module needs to change a "Public" Aggregate, a new new version is published.
  • All Aggregates are two-sided; the creditor and the debitor (eg. bank and account owner)
Maintenace and robustness
The design presented below provides excellent isolation from the business logic, linear scaling, and the other features of the CAH. The design gives these "Business Intelligence" concerns a dedicated place in the architecture. The idempotent feature of the CAH is important because we can calculate on historic events when we understand new patterns for fraud detection. (Any tax case up to 10 years old can be re-instantiated).

Chain of events
Catching the business event is a valuable thing. In database systems many events occur, but since the database has a very fragmented view of the data, understanding what actually are important events drown in the flood. It is in the business logic (mainly the Application Layer in DDD) that business events are known, and it is here that Aggregates are made Public.

The aggregates stored in the CAH represents a chain of events that can be reasoned upon. These events may be divided into these groups, or lanes of events.
  • Everything. Typically used for a stream to the data warehouse or others that have a holistic approach
  • By document type (or schema). This is a lane for subscribers that are interested in a particular set of data.
  • By Party. This is a lane for subscribers that reason about data concerning a Party.
  • A combination of document type and Party

Identfying a complex event
This part of the processing is about having an inferred state of the events over a certain period of time. In the CAH this would be the "waiters" job and a specialized set of such; Event Monitors. The unique keys for event monitoring a Party could be: Event Monitor Type (Payroll vs VAT balance), Concerns (the party Id) and Legitimate period (2012) (see definition of the super-document). These event monitors only access the Header of the document, this for performance and clear separation of concern. 

Reasoning about a complex event
If the Event Monitor fins a prospect, it triggers an Event Reasoning Module. This module is capable of retrieving a subset of data from the CAH - this is data not part of the event itself - and the Module does some logic on these data. There will certainly be re-use of Services present in other Modules (which contain business logic necessary to understand the content in an Aggregate). The ERMs will also use services on Party for segmentation parameters such as "Scoring". These parameters are often set by analytics in the data warehouse, that focus on Party behavior. 
There is a many to one relationship here; many Event Monitors may trigger the same Event Reasoning Module. Either the Module has a negative or positive case. If there is a positive case, then the Module produce an Aggregate and makes it Public in the CAH.  These Aggregates are a special set of Aggregates containing data for the BI processing. They in turn can be used for trend analytics and form basis for Predictive Analytics.

Responding to a complex event
Responding to a complex event would be to listen to Aggregates that come from ERM´s. These Aggregates will be stored for a long period of time, so that trend analysis can be done on them. The Aggregate can be sent to any participating system real-time or they are used by other Event Monitors to form chains of CEP. These Aggregates are secondary products of your pipeline of tasks. The first product is about calculating tax correctly.


In the illustration the blue Aggregates (the primary ones) emit events (below the CAH) that form an event stream. EM modules monitor events and trigger ERM for further investigation. ERM 1 has a negative case and does not produce an Aggregate. ERM 2 produce an Aggregate (a green one, representing secondary products in the CAH), which also triggers an event. The Party Registry is also in action supporting segment or scoring information, for the ERMs.


Predictive analytics
Business Intelligence, complex validation, fraud detection and monitoring done in the "live" environment and not in the Data warehouse. (As I have discussed in other blog-posts, the Data warehouse has an important role, but not for these "real-time" tasks). Fraud detection could be analyzing typical patterns and triggering action if they are out of bounds. (For example: a carpenter with certain characteristics, in Oslo, having revenue 25% less than the average, for the third month in a row, then do "something"). Patterns could also be a balance between data-sets or chains of real-world events that are linked. Predictive analytics will contribute to tackling things up front, or enable the Party (tax payer) to act in the right way. Also note that our Aggregates have two-sides; the debitor and the creditor, they help us chain events.

Involving the Party as events occur in the real world
By catching events as they occur in the real would - marriage, death, bankruptcy, trades, liability, payroll etc. - the system could also respond to the Party (the physical or legal entity) and make them acknowledge the event. The Party will then have more insight into its own tax-case, it will validate the data we have, and we are treating our citizens in the correct way. Or the Party may understand that he is doing something wrong and act in the right way. It is better to tackle this up-front.

Implementation
The Aggregates are stored as XML, and are Java object structures in-memory.
Modules are plain Java deployed in our linear scalable processing architecture, and the Modules have services in RMI, REST or WS.
Event streams are Atom feeds, or JMS in-memory.


Creative Commons License
Business Intelligence and Predictive Analytics in the CAH by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.