Speed up ASO SQL data loads by using multiple rules files

Just another quick post today about possibly speeding up data loads to an ASO database when loading from SQL. I got on a quick call with a former colleague that was looking to gain a little more performance on their load process to a massive ASO database, and the first thing that jumped out at me was that I recall you can do parallel loads with some native MaxL syntax.

Here’s a quick example of the syntax:

import database $APPLICATION.$DATABASE data
connect as $SQL_USER identified by $SQL_PW
using multiple rules_file $RULE1, $RULE2, $RULE3, $RULE4, $RULE5
to load_buffer_block starting with buffer_id 100 on error write to "errors.txt";

Basically, you provide multiple rules files (configured for your SQL datasource of course). The rules files are likely to be the same as each other but I suppose it’s possible you might want to partition the data in some logical way to try and speed things up even more.

For example, let’s say that in the code above, we are loading five years of data from a relational database. We might then make it so that each rule is set for this particular year by doing the following things:

  • Set the year in the data header
  • Remove that column from the list of SELECT columns
  • Put a filter/predicate in the WHERE clause on the query
  • Bonus points for using substitution variables in both the header definition and the where clause

Performance in this particular use case went up substantially. It’s my understanding that data loads that were taking an hour are now cut down to 17 minutes. Your mileage may vary, of course.

Let’s Not Forget About Hybrid BSO

That said, I think this can be an effective strategy for trying to squeeze performance out of some ASO cubes that need a smaller load window and you don’t want to go changing a lot of the internals in play. If you’re doing new development, then I strongly, strongly recommend using hybrid BSO (or rather, BSO and making sure the cube is configured properly so as to get the hybrid BSO performance benefits). I have been seeing hybrid BSO cubes absolutely killing it in performance, what with their ability to leverage ASO technology for aggregates, and massive calculation improvements owing to the smaller block sizes and indexes you get from having so many dynamic calc members in dimensions. Plus, you of course get all of the classic/rich/awesome BSO functionality out of the box, like dynamic time series, expense tagging, time balance, and more. These were never very strong areas for ASO and often required a lot of non-optimal workarounds to make users happy.

Some performance observations changing an ASO hierarchy from Dynamic to Stored

There are numerous ASO cubes among my flock.  Usually the choice to use ASO was not arrived at lightly — it was/is for very specific technical reasons.  Typically, the main reason I have for using ASO is to get the fast data loads and the ability to load oodles of noodles… I mean data.  Yes, oodles of data.  The downsides (with version 7.x) is that I’m giving up calc scripts, incremental loading (although this has been somewhat addressed in later versions), native Dynamic Time Series, some flexiblity with my hierarchies, and I have to have just one database per application (you… uhh… were doing that already, right?).  Also, due to the sparsity of much of the data, trying to use BSO would result in a very unwieldy cube in this particular instance.

I have a set of four cubes that are all very similar, except for different Measures dimensions.  They range from 10,000 to 40,000 members.  This isn’t huge, but in conjunction with the sizes of the other dimensions, there is an incredible “maximum possible blocks” potential (sidenote for EAS: one of the most worthless pieces of information to know about your cube.  Really, why?).  The performance of these cubes is generally pretty acceptable (considering the amount of data), but occasionally user Excel queries (especially with attribute dimensions) really pound the server and take awhile to come back.  So I started looking into ways to squeeze out a little more performance.

Due to the nature of the aggregations in the cubes, they all have Dynamic hierarchies in the Accounts/Measures dimension.  This is due to using the minus (-) operator, some label only stuff, and shared members, all of which ASO is very particular with, especially in this version.  All of the other dimensions have Stored hiearchies or are set to Multiple Hierarchies (such as the MDX calcs in the Time dimension to get me some Year-To-Date members).

Actually, it turns out that all of the these cubes have Measures dimensions that make it prohibitively difficult to set Measures to Stored instead of Dynamic, except for one.  So, even though I would need to spin off a separate EIS metaoutline in order to build the dimension differently (these cubes are all generated from the same metaoutline but with different filters), it might be worth it if I can get some better performance on retrieves to this cube — particularly when the queries start to put some of the attribute dimensions or other MDX calcs into play.

What I need is some sort of method to test the performance of some typical retrieves against the two variants of the cube.  I setup one cube as normal, loaded it up with data, and materialized 1 gig worth of aggregations.  Prior to this I had also copied the cube within EAS, made the tweak to Measures to change it from Dynamic to Stored, loaded the data, did a gig of aggregations.  At this point I had two cubes with identical data but one with a Dynamic hierarchy (Measures with 10,000 or so members) and one with stored.  Time to compare.

I cooked up some report scripts, MaxL scripts, and some batch files.  The batch file loads a configuration file which specifies which database to hit and which report to run.  It then runs the report against the database, sets a timestamp before and after it runs, and dumps it all to a text file.  It’s not an exact science, but in theory it’ll give me somewhat of an idea as to whether making the hierarchy Stored is going to help my users’ retrieval operations.  And without further ado, here are the results:

Starting new process at Tue 01/20/2009 10:11:08.35 Time Duration Winner
start-report_01-DB (Dynamic) 11:10.0
finish-report_01-DB (Dynamic) 11:13.4 00:03.4
start-report_01-DB (Stored) 11:14.6
finish-report_01-DB (Stored) 11:21.4 00:06.8 Dynamic
start-report_02-DB (Dynamic) 11:22.6
finish-report_02-DB (Dynamic) 11:51.9 00:29.3
start-report_02-DB (Stored) 11:53.0
finish-report_02-DB (Stored) 12:00.0 00:07.0 Stored
start-report_03-DB (Dynamic) 12:01.3
finish-report_03-DB (Dynamic) 12:02.2 00:00.9
start-report_03-DB (Stored) 12:03.9
finish-report_03-DB (Stored) 12:42.1 00:38.2 Dynamic
start-report_04-DB (Dynamic) 12:43.6
finish-report_04-DB (Dynamic) 12:50.2 00:06.6
start-report_04-DB (Stored) 12:51.3
finish-report_04-DB (Stored) 14:26.4 01:35.1 Dynamic
start-report_05-DB (Dynamic) 14:36.3
finish-report_05-DB (Dynamic) 15:18.3 00:42.0
start-report_05-DB (Stored) 15:19.6
finish-report_05-DB (Stored) 17:32.0 02:12.4 Dynamic
Starting new process at Tue 01/20/2009 10:30:55.65
start-report_01-DB (Dynamic) 30:57.5
finish-report_01-DB (Dynamic) 30:59.9 00:02.4
start-report_01-DB (Stored) 31:01.0
finish-report_01-DB (Stored) 31:05.8 00:04.7 Dynamic
start-report_02-DB (Dynamic) 31:07.7
finish-report_02-DB (Dynamic) 31:40.8 00:33.1
start-report_02-DB (Stored) 31:42.5
finish-report_02-DB (Stored) 31:46.1 00:03.5 Stored
start-report_03-DB (Dynamic) 31:50.4
finish-report_03-DB (Dynamic) 31:51.0 00:00.6
start-report_03-DB (Stored) 31:52.4
finish-report_03-DB (Stored) 31:52.8 00:00.3 Tie
start-report_04-DB (Dynamic) 31:54.0
finish-report_04-DB (Dynamic) 32:06.3 00:12.3
start-report_04-DB (Stored) 32:12.1
finish-report_04-DB (Stored) 32:51.4 00:39.3 Dynamic
start-report_05-DB (Dynamic) 32:55.5
finish-report_05-DB (Dynamic) 33:38.1 00:42.6
start-report_05-DB (Stored) 33:39.7
finish-report_05-DB (Stored) 36:42.5 03:02.8 Dynamic

So, interestingly enough, the Dynamic dimension comes out on top, at least for most of the tests I wrote. There is one of the tests though (report_02) that seems to completely smoke the Dynamic hierarchy.  I wrote these report scripts kind of randomly, so I definitely need to do some more testing, but in the mean time I think I feel better about using a Dynamic hierarchy.  Since the ASO aggregation method for these cubes is simply to process aggregations until the database size is a certain multiple of it’s original size, one of the next steps I could look at for query optimization would be to enable query tracking, stuff the query statistics by running some reports, and then using those stats to design the aggregations.  In any case, I’m glad I am looking at some actual data rather than just blindly implementing a change and hoping for the best.

This isn’t to say that Dynamic is necessarily better than Stored or vice versa, however, I ran this very limited number of tests numerous times and got essentially the same results.  For the least part, this goes to show that there isn’t really a silver bullet for optimization and that experimentation is always a good way to go (except on your production servers, of course).  I am curious, however to go back and look at report_02 and see what it is about that particular report that is apparently so conducive to Stored hierarchies.