Case Study 15: Clustered Metrics

Implementation Guide › Case Study Examples › Writing Effective Business Logic Examples › Case Study 15: Clustered Metrics

Case Study 15: Clustered Metrics

Usually, when describing a certain piece of software, the description can be broken into two pieces, the WHAT and the HOW. By WHAT we mean the description of what this piece of code does. The HOW is how does it do it. There is a tendency to concentrate on the WHAT part, and to ignore the HOW part. The reason for that is simple and in many cases justified. By doing so, you reduce the coupling between your components and do not bother your mind with information which is in many cases irrelevant. In many cases tough, the cost of ignoring the HOW part is bad performance.

This case study discusses the way the engine is calculating clustered metrics (answer the HOW part) and describes the performance cost it implies on certain implementations. It also discusses several ways of reducing this cost by changing the implementation.

What are Clustered Metrics

Clustered metrics are metrics that embed in their definition a certain group of resources. This group is referred to as the Cluster of the metric, and each of the resources in that group is referred to as a Cluster Item. When calculating a clustered metric a separate calculation is performed for each of those cluster items. The calculations for each of those cluster items are similar to one another except for:

Context.ClusterItem – The value of the Context.ClusterItem property which is equal to the cluster item that is being calculated.
Registrations by Cluster item – When a metric is registered to events by cluster item, each cluster item receives raw data and reusability events that are sent to this cluster item.

How Clustered Metrics are Calculated

The important thing to understand about the calculation of a clustered metric is that all the cluster items are calculated in parallel. By parallel we do not mean that they are calculated by different threads, but that while processing the various events that should be handled by the various cluster items, the events are being processed sequentially and for each event the relevant cluster items are called and they process this event. For example, there are many events that should be handled by many cluster items. There are two ways to do this:

Example: Option 1

For each cluster item C
     For each event E that should be handled by C
          Let C handle E

Example: Option 2

For each event E
     For each cluster item C that should handle E
          Let C handle E

The engine handles clustered metrics using the Option 2.

Another important point to understand is that the execution of the VBScript inside the PslWriter is performed by a component called Script Control. There is only one single instance of this component per each metric and this instance is reused for the calculation of the various cluster items. Since the cluster items are calculated in parallel as mentioned before, and since the Script Control component can contain only the data of a single cluster item at each moment, we have to frequently switch the data inside the Script Control component.

To explain this, a more detailed pseudo code that performs the calculation is presented below.

1-     For each metric M
2-          Let X be the earliest event not yet handled by M
3-          Let T be the timestamp of the latest state before X
4-          Let L be the list of all events registered by M (all cluster items) starting from timestamp T until the current time
5-          For each event E in L
6-               For each cluster item C that should handle E
7-                    Let C’ be the cluster item that is currently loaded into the script control
8-                    Take the values of the global variables from the script control and store them aside for C’
9-                    Take the values of the global variables stored aside for C and load them into the script control
10-                    Handle event E

This whole process of finding some time of an event that was not taken into account yet and then performing calculation from this point onward is called Recalculation. The process of replacing the values of the global variables (steps 8 and 9 in the code above) is called Context Switching.

The two main problems that can be easily seen in the code above are:

Recalculation is done for all the cluster items together. Since the point in time T (step 3in the code above) is found once and then all the cluster items perform a recalculation based on that point. This means that whenever a single cluster item has some new event, all the cluster items are recalculated.
Context switching is done very often. This can be easily seen since the context switching (steps 8 and 9 in the code above) are located inside the inner loop.

Recalculation of Clustered Metrics

Problem Description
As already explained, all cluster items in a clustered metric are recalculated as a whole. This means that if we have a metric which is clustered over 1000 cluster items and one of them needs a recalculation of the last year due to some new event that arrived, then all the 1000 cluster items are recalculated for the last year.
Possible Solutions
The following solutions suggestions can reduce the pain of this problem, but they are not always applicable and each has its own disadvantages. The important thing is to understand the problem and its estimated cost and compare this cost to the cost of the proposed solutions.
- When the number of the cluster items is small, we can consider the option of defining each of them as a separate metric. The downside of this approach is of course the maintenance cost of maintaining several metrics as well as the fact that we cannot perform a report for the whole metric and then drill into a specific cluster item
- When the number of cluster items is large, but only one (or few) of them are frequently recalculated, we can consider putting this cluster item in a separate metric and leave all other cluster items in the other metric
- Frequently use a calculation freeze for the relevant contract/metric so that this metric never has long recalculations
- Perform some change in the adapters and the data sources so that there is no long recalculations (i.e. do not send events whose timestamp is older than one month)

Context Switching

Problem Description
As already explained, context switching is done in the most inner loop. In other words, for each event that should be handled by each cluster item. When a metric receives many events and when each event is handled by many cluster items, this amount can be very high. Add to this that the context switching operation is relatively expensive (relative to the handling of the event itself in the business logic) and you have a problem.

The cost of the context switching operation is proportional to the size of the data that is “switched”. The data that we switch during context switch is the values of all the global variables in our business logic (also called “the state”). Thus, the more global variables you have and the larger the size of those global variables is, the more expensive the context switching operation is.

In particular, it is not recommended to use business logic maps in clustered metrics, especially if the size of those maps can be large.
Possible Solutions
- Reduce the time of each of the context switches
  The idea is to reduce the size of the state (global variables). This approach can be done by rewriting the business logic so that it does not contain maps. Of course this is not always possible, but when it is, it is recommended.
- Reduce the amount of the context switches
  When the cluster is small it is possible to create a separate metric for each cluster item.
  
  Avoid clustered metrics with many clustered items which register to the same events. The idea here is the following:
  
  If each event is handled by a single cluster item then the amount of the context switching is proportional to the number of the events
  
  If each event is handled by all cluster items then the amount of context switching is proportional to the number of the events times the number of the cluster items
  
  Create a non clustered metric that calculates the results for all of the original cluster items (which are now simple resources and not cluster items). Make this metric send the result of each of the cluster items as an event. Create another metric which is clustered and which receives the events from the first metric and report the value received in those events as the result. The idea here is that the large amount of the raw data events will be handled by a non clustered metric and the clustered metric will handle a single event per period per cluster item.