Friday, 14 November 2008

Data Aggregation via JMX and the Grid

Following my post on JMX and the Grid, which got picked up by The Server Side and by Nati Shalom's blog here, I thought I'd add some brief thoughts on another complementary JMX pattern we've used in conjunction with grid applications.

The original post talks about collating client-side access to a distributed population of JMX MBeans that comprise the application. In essence, the technique described in that post is to use a JavaSpace (or other rendezvous technology) as a point of registration and lookup. This gives the client side access to (say) a list of MBeans for each instance of a given component type, wherever it's running in the grid, and the ability for the agent to communicate with any MBean to get/set attributes, invoke management operations or receive notification events.

The client side (or "agent") of JMX is by nature pretty dumb. Generally the agent uses metadata about the MBean to generate a UI on the fly. Although it's possible to write custom JMX agents for your application (and we do), to make sure your management MBeans will work with any JMX agent you really have to design for the lowest-common-denominator agent.

So let's consider the use-case where our MBeans are collecting stats about (say) our application's performance: average task execution time, latency etc. Stats can be produced for each individual component and made available via the MBean, but we also want to be able to see an aggregated view of the statistics for the application as a whole.
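As a concrete sketch of the per-component side, a simple standard MBean for task stats might look like the following. The class and attribute names here are illustrative only, not taken from our actual application:

```java
// Standard MBean interface: a JMX agent discovers these getters via the
// MBean's metadata and renders them as read-only attributes.
interface TaskStatsMBean {
    long getTaskCount();
    double getAverageExecutionMillis();
}

// Per-component implementation; the owning component records each task
// as it completes.
class TaskStats implements TaskStatsMBean {
    private long count;
    private double totalMillis;

    synchronized void recordTask(double millis) {
        count++;
        totalMillis += millis;
    }

    @Override public synchronized long getTaskCount() {
        return count;
    }

    @Override public synchronized double getAverageExecutionMillis() {
        return count == 0 ? 0.0 : totalMillis / count;
    }
}
```

In the real application each component instance would register its stats object with an MBean server, e.g. `ManagementFactory.getPlatformMBeanServer().registerMBean(new TaskStats(), new ObjectName("com.example:type=TaskStats,id=worker-1"))` (the domain and key properties here are made up for illustration).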

Aggregation

To deal with the dumb JMX agent we really need to collate and aggregate the data server-side. I'm not going to dwell too much on the approach to this, other than to say aggregation might be done in one of three ways:
  1. Writing a server-side component that collects stats from individual MBeans and aggregates them. In this case the approach outlined in my previous JMX piece might be handy.
  2. Tapping into the underlying components using some application-specific API and aggregating from there.
  3. Having the components publish their stats into a JavaSpace and having an aggregating component attached to the space to perform the aggregation.
Focussing on the last of these approaches for a moment, using the space as a rendezvous point for collation and aggregation has some merits: publication of stats as POJOs to the space is easy and listening to those publications to trigger aggregation is also simple to implement.
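To sketch that write-and-aggregate flow in code, the example below uses a plain `BlockingQueue` as a stand-in for the space, since the JavaSpaces API isn't in the JDK; in a real deployment the stat POJO would be a JavaSpaces `Entry` (public fields, no-arg constructor) written with `space.write(...)`, and the aggregator would `take` matching entries or react to a notify event. All names here are illustrative:

```java
// Stat published by each component; in JavaSpaces this would be an Entry.
class StatSample {
    final String componentId;
    final double executionMillis;

    StatSample(String componentId, double executionMillis) {
        this.componentId = componentId;
        this.executionMillis = executionMillis;
    }
}

// Aggregating component attached to the rendezvous point: drains published
// samples and keeps a running application-wide average.
class StatsAggregator {
    private long count;
    private double total;

    void consume(java.util.concurrent.BlockingQueue<StatSample> space) {
        StatSample s;
        while ((s = space.poll()) != null) {
            count++;
            total += s.executionMillis;
        }
    }

    double applicationAverage() {
        return count == 0 ? 0.0 : total / count;
    }
}
```

The shape is the same whatever the transport: components publish small immutable POJOs, and one attached consumer folds them into the aggregate.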

Publication to JMX

Regardless of the approach to aggregation, we also need a technique for making the aggregated stats available to the dumb JMX agent. The aggregating component needs to expose an MBean that provides access to the aggregated data values. In a simple application these can be held as in-memory values within the aggregating component. However, to deal with large data volumes and to provide fault-tolerance we prefer the following approach:
  1. Aggregating components write the results back to the JavaSpace
  2. A stateless component provides an MBean that acts as a facade to the aggregated data, which is actually fetched on demand from the space
Using the GigaSpaces product we can rely on the space itself to manage live reliable backup of our aggregated data and the Service Grid to host and maintain our stateless aggregated MBean facade.
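A sketch of the stateless facade follows, again with a plain `Map` standing in for the space (in a real GigaSpaces deployment the getter would do a `read` against the space). Because the facade holds no state of its own, the Service Grid can restart or relocate it freely; the key name below is made up for illustration:

```java
// Standard MBean interface for the aggregated view.
interface AggregatedStatsMBean {
    double getApplicationAverageMillis();
}

// Stateless facade: holds no stats itself and fetches the aggregated
// value on demand from the backing store (the space, in a real deployment).
class AggregatedStatsFacade implements AggregatedStatsMBean {
    private final java.util.Map<String, Double> space;

    AggregatedStatsFacade(java.util.Map<String, Double> space) {
        this.space = space;
    }

    @Override public double getApplicationAverageMillis() {
        // Fetched per call; if a backup space takes over, this same read
        // works unchanged because the facade keeps no local copy.
        Double v = space.get("application.average.millis");
        return v == null ? 0.0 : v;
    }
}
```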

Summary

Although in our simple stats-aggregation use-case we might not care about dropped data or fault-tolerance, there are many real-world examples where we would care far more about these issues. The bare-bones architecture of using the space as both a rendezvous point and a safe holding repository, with access via stateless service components, applies well.

One of the reasons I'm a fan of GigaSpaces and space-based architectures is that a number of architectural choices that are traditionally hard-wired (transactional vs. non-transactional, sync vs. async replication) can be changed through configuration alone. This enables common design patterns (and therefore components) to be applied to a wide range of application problems, by allowing the data integrity/performance trade-off to be tweaked at a late stage of application assembly.

I know this last paragraph is a bit of a leap from the initial topic, but I'll return to this theme in later postings, which will discuss other use-cases where data integrity and fault-tolerance are significant issues, in an attempt to make it stand up.

2 comments:

William Louth said...

Hi Steve,

We are able to aggregate statistics across threads, processes, and hosts, and we do not use JMX or JavaSpaces or a data grid.

Most importantly we can aggregate (merge|collate|combine) from multiple contexts which do not necessarily have to represent the complete population of nodes in the grid/cluster/cloud.

This does not need to be done in the middleware itself (certainly not from a trigger other than a click) but from a user's management model, because each of our monitoring agents maintains a database that inherently supports merging (aggregation) with other agent databases pulled into the management console monitoring the model.

I find this approach much better because it also affords us the ability to merge models in offline mode, which is very useful when dealing with disconnected agents (no central management server required).

By the way, we do also support the publication of our metering data to JMX, even to JMX frameworks that are grid-enabled.

Kind regards,

William

Steve Colwill said...

William

Not quite sure what point you are making. Are there other valid approaches/architectures to solve the problem? Sure.

If you replace the word "database" with "space" in your description, I'm not sure you are saying anything very different to what I'm proposing. Aggregation of data surely requires some consolidation somewhere. I'm suggesting using a space as a point of consolidation for higher-level aggregated reporting into JMX-enabled consoles.

Of course, depending on the resolution/latency of your basic measurements you should also consider some level of aggregation at the original source. The space plays a role in defining a rendezvous point within the grid for higher-level aggregations.

You'll note I say "a space" rather than "the space". I didn't mean to imply in the original post that you'd necessarily use the main application space as the destination for monitoring data.

cheers

Steve