Making order out of chaos with Essbase Integration Server – part 2

Last time I gave a sort of high-level look at pounding the rigid financial page definitions from our upstream corporate overlords into shape, and generating the outlines based on this information. So now I have some nice fresh outlines, but I don’t have any processes for data just yet. I have a couple of issues I need to address:

  1. History is going to be restated each period, so I need to go out and fetch all new data each period.
  2. I need to load all of the data as fast as I can.
  3. I want the data load process that runs each period to be as robust as possible.

Well, let’s start with the basics. The data I want is upstream in a DB2 server. One way to go would be to talk to the ETL group and see if they can drop off files for me each period. This isn’t really a bad option because they already deliver some pretty similar information in a similar fashion. But I don’t really feel like doing all the paperwork and cutting through the red tape right now. Since I already have a working ODBC connection, I’m just going to use this. I know, I know, ODBC is gross. I can’t really use JDBC at the moment because this particular DB2 server runs on z/OS, and you apparently need a special .jar file to get JDBC connections to work (some kind of licensing issue). My emails to the DB2 folks about this have apparently been routed to /dev/null, and I can’t just go to the IBM download site and grab it. So that’s out for now, and ODBC it is.

I determined earlier that the raw data for each period is just south of 200 megabytes. That adds up to a few gigs when you take into account all the periods I want to look at. During the night, when network activity is relatively quiet, I can pull this down in about 20 minutes for a single period. For some reason I think this should be faster (a lot faster), but the speed seems constant regardless of whether I use JDBC or ODBC, so at the moment it seems like more of a network thing which I can’t change (I’m pretty sure that there is fairly extensive packet shaping going on… it’d be nice if I could have this port loosened up a bit). Maybe that’s just as fast as the DB (or my user ID on the DB) is allowed to go. In any case, it’s tolerable, because I basically have the entire period (or one week, at the least) to pull down the history. The server has a tendency to kill my connections, though, so I can’t get it all at once. I wouldn’t really want it all in one shot anyway, since I’m just throwing this into text files, and I’ve decided that I would like one period per text file.

I wrote a small Python script that will read a configuration file, take some command-line parameters, and go grab all the data I want and put it into the folder I want with a particular naming convention (one period of data per file). It’s pretty nice. Source is available if anyone wants it (it’s a few pages long, nothing fancy). I used pyodbc to connect through the DB2 ODBC connection that is already set up. The script is also flexible enough that I can use it to pull down a single period of data, which will come in handy for scripting the period load process.
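For reference, here’s a minimal sketch of what the per-period pull boils down to. The DSN, schema, table, and column names are placeholders (the real script also handles the configuration file and the file-naming convention):

    # Minimal sketch of the per-period pull; the DSN, schema, table, and column
    # names here are placeholders, not the real ones.
    import csv
    import sys

    import pyodbc

    def pull_period(year, period, out_path, dsn="DB2PROD"):
        """Pull one period of raw facts into a tab-delimited text file."""
        conn = pyodbc.connect("DSN=" + dsn)
        cursor = conn.cursor()
        cursor.execute(
            "SELECT DIVISION, LOCATION, YEAR, PERIOD, ACCOUNT, AMOUNT "
            "FROM FINANCE.FACTS WHERE YEAR = ? AND PERIOD = ?",
            year, period)
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            for row in cursor:
                writer.writerow(row)
        conn.close()

    if __name__ == "__main__":
        # e.g. python pull_period.py 2009 3 P2009_03.txt
        pull_period(int(sys.argv[1]), int(sys.argv[2]), sys.argv[3])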

Each row of data consists of a division, location number, year, period, account, and amount. Recall that there are four different cubes in this instance that are modeled after different financial pages. The catch with this approach is that an account that is on one page may also be on another page, and based on just the account number, I don’t have a way to determine which pages the account belongs to. So loading this data straight to the cubes with a load rule is out. And yes, I know that you can load data with EIS, but in this case I prefer to script the process in MaxL. The databases also split out different departments, so I will need a way to associate each account with a department. This means that I will be loading the data files to my SQL Server database and massaging them a little bit. The massaging will be fairly minimal, because I have set up a view for each database that does a LEFT JOIN of the ‘raw’ data to a table that associates accounts to departments to pages. I could do the JOIN in the load rule itself, but that seems a bit messy, and since there are only four databases, this seems sensible.
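The views themselves are nothing exotic. Here’s a rough sketch of the idea for one cube, with made-up table and column names; in practice each cube gets its own view restricted to the pages it’s modeled after:

    # Rough sketch of one cube's view: LEFT JOIN the raw facts to the table that
    # maps accounts to departments and pages. All object names are placeholders.
    import pyodbc

    CREATE_VIEW = """
    CREATE VIEW dbo.vw_CUBE1_FACTS AS
    SELECT f.DIVISION, f.LOCATION, f.YEAR, f.PERIOD,
           m.DEPARTMENT, m.PAGE, f.ACCOUNT, f.AMOUNT
    FROM dbo.RAW_FACTS AS f
    LEFT JOIN dbo.ACCOUNT_PAGE_MAP AS m
           ON m.ACCOUNT = f.ACCOUNT
    -- the real views also restrict to the pages that belong to this cube
    """

    conn = pyodbc.connect("DSN=SQLSTAGE")   # placeholder DSN for the SQL box
    conn.cursor().execute(CREATE_VIEW)
    conn.commit()
    conn.close()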

Keeping in mind that my goal here is to be able to load all four cubes as fast as I possibly can, my process that runs at the close of each period looks something like this (a rough scripted sketch follows the list):

  1. Pull down current period of data
  2. Load data to SQL server
  3. Process Cube 1
  4. Process Cube 2
  5. Process Cube 3
  6. Process Cube 4
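Strung together, the close-of-period run is just a little glue script. Here’s a hypothetical sketch; the module, table, server, and app.db names are all placeholders, and the per-cube process_cube step is sketched a bit further down:

    # Hypothetical glue for the period-close run; the module, table, server, and
    # app.db names are placeholders.
    import subprocess

    from pull_period import pull_period      # the pyodbc pull sketched earlier
    from process_cube import process_cube    # clear/load/agg, sketched below

    CUBES = ["Fin1.Fin1", "Fin2.Fin2", "Fin3.Fin3", "Fin4.Fin4"]

    def close_period(year, period):
        data_file = "P%d_%02d.txt" % (year, period)
        # 1. Pull down the current period of data
        pull_period(year, period, data_file)
        # 2. Load it to SQL Server (here via bcp in character mode)
        subprocess.check_call(["bcp", "STAGE.dbo.RAW_FACTS", "in", data_file,
                               "-S", "SQLSTAGE", "-T", "-c"])
        # 3-6. Process each cube: clear, load from its SQL view, aggregate
        for cube in CUBES:
            process_cube(cube)

    if __name__ == "__main__":
        close_period(2009, 3)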

I’m using the word Process here rather than Load because these are ASO cubes, and part of the process will be to design the aggregations on them that will make queries nice and snappy. The process that needs to run sometime during closing week, and which can be done ahead of time, looks something like this (the history pull in step 1 is sketched after the list):

  1. Pull down all of the historical data
  2. Load all of this data to SQL server
  3. Pull down the latest cross-reference/locations data
  4. Update outlines in staging cubes
  5. Load current data to staging cubes for validation (so we don’t get caught by any surprises if the cross-reference changed in some kind of wacky way).
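Step 1 is just the per-period pull from the earlier sketch running in a loop; the year range and the number of fiscal periods here are placeholders:

    # Hypothetical history pull: loop the per-period pull over every year and
    # period we care about (the year range and 13-period calendar are placeholders).
    from pull_period import pull_period

    for year in range(2005, 2010):
        for period in range(1, 14):
            pull_period(year, period, "P%d_%02d.txt" % (year, period))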

What are the staging cubes? They are duplicates of the four databases but suffixed with an underscore. My original strategy with regard to these cubes was to load them, agg them, and then copy them on top of the other cubes so that the period load process would not result in any downtime for the ‘real’ cubes, but along the way I have discovered that I can load these babies up fast and in a scripted fashion, and that I am better off just loading direct. Also, the MaxL ‘create or replace…’ approach seems to bring the server to its knees and doesn’t have the best performance. Right now I keep the “_” cubes around for testing purposes. This hasn’t resulted in duplicated scripting, though: I just have a text file with a list of app/db names that I pass to a batch file, so I can choose which cubes to work with just by feeding it a different configuration file.

The load process for each cube, then, simply clears the data, loads it (directly from SQL Server), and aggs it. I have tweaked the essbase.cfg to use 16 threads (the servers are quad-proc Xeons), and the speed seems much improved even over specifying 8, and definitely faster than not specifying anything at all. The connection between the SQL server and the Essbase server is all gigabit, with a theoretical 125 MB per second of throughput, although in actuality the throughput is much less than this. Altogether, just under a billion facts are loaded up and calculated (aggregated) in about 40 minutes.
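For the curious, the per-cube step boils down to a short MaxL script fired off by a wrapper that reads the text file of app/db names mentioned above. This is only a sketch; the server name, credentials, and rules file are placeholders, and the MaxL statements are worth double-checking against the reference for your version:

    # Sketch of the per-cube clear/load/agg via essmsh. Server name, credentials,
    # and the rules file are placeholders; verify the exact MaxL syntax against
    # the reference for your version.
    import subprocess
    import tempfile

    MAXL_TEMPLATE = """
    login admin identified by password on essbaseserver;
    alter database {cube} reset data;
    import database {cube} data
        connect as 'sqluser' identified by 'sqlpass'
        using server rules_file 'CubeLoad'
        on error write to 'errors.txt';
    execute aggregate process on database {cube};
    logout;
    exit;
    """

    def process_cube(cube):
        """Clear, load (straight from the SQL Server view), and aggregate one cube."""
        with tempfile.NamedTemporaryFile("w", suffix=".msh", delete=False) as f:
            f.write(MAXL_TEMPLATE.format(cube=cube))
            script = f.name
        subprocess.check_call(["essmsh", script])

    if __name__ == "__main__":
        with open("cubes.txt") as f:          # the list of app.db names to process
            for line in f:
                if line.strip():
                    process_cube(line.strip())

Pointing the wrapper at a different list of app/db names is what lets the same script drive either the real cubes or the “_” staging cubes.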

It’s not necessary right now, but a few areas where I could shave off time would be to pre-stage the historical data that is in SQL to a text file on the Essbase server itself, then load this to a buffer, then load the new data. I could also use a different load rule that doesn’t need to prefix the various members with our naming convention (Lo. for a location, etc.), so that the load rule has to do less ‘thinking,’ but again, this hasn’t been necessary. If it were, I’d be very curious what kind of performance we could squeeze out by attaching three more drives to the storage array and setting up a partition on them that is separate from the drives with the Essbase data, so that Essbase wouldn’t have to read from and write to the same disks at the same time.

Making order out of chaos with Essbase Integration Server – part 1

Essbase has been a great addition to the decision-support toolbox in our enterprise because the financial pages are produced via a “home-cooked” solution. It’s actually quite impressive, although in some ways I would say that it is limited by its rather rigid structure. Our world consists of seven-digit accounts for everything. You can glean a little bit about what an account is based on which digits are where, although don’t count on it. Accounts are then associated with lines (NET SALES, GROSS PROFIT, RENT, EBITDA, etc.), and the lines make up the “physical pages.” The lines are numbered 00 through 99, and they quite literally correspond to a position on a piece of paper: 00 is the first line, followed by 01, and so on, running onto the next page until 99. It’s not necessary to use all the lines; in fact, many of them are blank. Accounts are then directed to lines. They are not “nested” in a true hierarchy; an account is simply directed multiple times. For example, sales for a particular department are directed to the NET SALES line, but since that account also affects GROSS PROFIT, it is directed to the GROSS PROFIT line as well. This is all maintained in the Cross Reference, which is the master list of all report types and account groups (basically just different accounting systems for different business segments), the account, the line it rolls to, a description of the account, and a description of the line.
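To make that a little more concrete, here’s a tiny and entirely made-up slice of what the cross-reference amounts to; note the same account showing up on two different lines:

    # Entirely made-up cross-reference rows, just to show the shape: the same
    # account is directed to more than one line.
    cross_reference = [
        # (report type, account group, account, line, account desc, line desc)
        ("R1", "G1", "1234567", "00", "Dept 12 sales", "NET SALES"),
        ("R1", "G1", "1234567", "05", "Dept 12 sales", "GROSS PROFIT"),
        ("R1", "G1", "2345678", "10", "Store rent",    "RENT"),
    ]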

The trick, then, is how do we make this useful in Essbase? Our goal is to mirror the financials and provide a window into them that previously did not exist, but essentially our Measures dimension is a fairly fast-moving target. It can be (and is) changed at any time, and we have anywhere from 5,000 to 80,000 accounts that affect a single financial page (the decision to model separate cubes after particular groups of financial pages was made long ago to make the cubes more manageable and to make more sense to the end-users). We also need to be able to load the data for a new financial period right after it closes. So altogether, here’s what we know:

  1. The Measures dimension has tens of thousands of members, and they change every period (additions and deletions).
  2. The Measures dimension is a ragged hierarchy based on the way the accounts roll up on the financial page itself.
  3. The Location dimension changes each period, as well as the Attribute dimensions that are associated with the Location dimension.
  4. The raw data for a single period is about 200 megabytes (a straight, tab-delimited pull from DB2).
  5. In this particular case, there will be four cubes modeled after four different groups of financial pages. This means that data will be loaded four times.
  6. Data will be final (the books closed for a particular period) at 4AM Eastern time. All databases must be loaded and ready by 7AM.

Although using BSO isn’t automatically ruled out in this scenario, it was definitely not a favored option. The databases in this scenario would easily be tens of gigabytes each. Databases like these essentially already exist for just one division, and there a single period, start to finish, takes many hours. ASO was the obvious choice. ASO has a few quirks, probably the most notable being that when you touch the outline, you lose all the data. So this means that we are going to load all of the data, for all periods and all years, every period. This isn’t really a bad thing, though: since history restates can occur for any location and any measure, it is basically necessary to reload all historical data each period anyway.

Well, let’s tackle the issue of the outlines first and worry about the data second. Essbase Integration Server (EIS) was an obvious choice for this. The cross-reference and the list of locations are both stored in a DB2 system across the WAN. The Essbase and SQL Server boxes are in a different state, but on the same LAN as each other. The format of the cross-reference data (that is, the way that everyone else besides the Essbase folks looks at it) isn’t really usable for us in its present state; we’re going to need to massage it a little bit. In this case, because I need the data on my own SQL box, and I want it on the same LAN as Essbase (so I can test faster, among other things), I will set up a quick DTS job to pull the data down for me.

The Merant DB2 drivers are garbage, so I found some IBM DB2 ODBC drivers to set up on the SQL box, then hooked up the DTS job. This is SQL Server 2000 (ugh), otherwise I would have used whatever they call it now (SSIS?). It’s about 3 million rows altogether and takes a little while to pull down. I can automate this in the future, but for now I am content to just refresh it by hand once a period. Now I have my very own cross-reference, yay! I set up some views on it for convenience (to split up some fields for me), and indexed on the physical page to make some of my queries a bit snappier. I did the same thing for the list of locations (all five thousand or so) that I’m also pulling down in a DTS job.

The Measures dimension is quite ragged, so I chose the Parent/Child method of building things in EIS. My final table, which I will use as the source data for EIS, is generally of the form Parent, Child, Alias, Agg, UDA, Formula, SolveOrder, OrderWeight, and DataStorage. This works out pretty well for me and gives me everything I need for EIS. I don’t generally need more alias tables; if I did, I would probably just bust out another table and do a physical join on some ID. With some fairly simple Transact-SQL, I have written a script that generates the data I want (all the children and their parents). I used some temporary tables, some REPLACEing, and a few other tricks. Of special note, due to the sign on accounts that roll to GROSS PROFIT, I have tagged these accounts with a UDA (‘GP’) so that I can do sign flips when I go to load that particular cube.
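The real thing is T-SQL, but the shape of the output is easy to show. Here’s the same idea rendered in Python purely for illustration, building a simplified set of Parent/Child rows from cross-reference rows like the made-up ones above (member names are invented, and only a few of the columns are shown):

    # The actual build is T-SQL; this is the same idea in Python just to show the
    # shape of the Parent/Child rows. Member names and columns are simplified.
    def build_measures_rows(cross_reference, page):
        """Return (Parent, Child, Alias, Agg, UDA) rows for one financial page."""
        rows = []
        seen_lines = set()
        for report, group, account, line, acct_desc, line_desc in cross_reference:
            if line not in seen_lines:
                # each line rolls up to the page itself
                rows.append((page, "Line " + line, line_desc, "+", ""))
                seen_lines.add(line)
            # each account rolls up to the line(s) it is directed to; accounts on
            # the GROSS PROFIT line get the 'GP' UDA so the load can flip signs
            uda = "GP" if line_desc == "GROSS PROFIT" else ""
            rows.append(("Line " + line, account, acct_desc, "+", uda))
        return rows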

Speaking of ragged hierarchies, Location is ragged too, so I set up some T-SQL to translate that from a flat structure (store to zone to division) into Parent/Child as well. I took care to set up the model in EIS carefully so I could use it effectively for all four of the cubes. Whip up some build scripts so that the outline updates can be scripted, and we’re golden. The outline update process only takes about a minute for each cube; keep in mind that the data is dumped from the get-go, which makes things substantially faster.
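For what it’s worth, one common way to get from the flat store/zone/division columns to Parent/Child edges is a pair of SELECT DISTINCTs glued together with a UNION. A sketch, with placeholder table and column names and a single zone level assumed:

    # One common way to turn a flat location table into Parent/Child edges:
    # UNION the distinct (division, zone) pairs with the (zone, store) pairs.
    # Table and column names are placeholders; a single zone level is assumed.
    import pyodbc

    FLAT_TO_PARENT_CHILD = """
    SELECT DISTINCT DIVISION AS Parent, ZONE AS Child FROM dbo.LOCATIONS
    UNION
    SELECT DISTINCT ZONE AS Parent, STORE AS Child FROM dbo.LOCATIONS
    """

    def location_edges(dsn="SQLSTAGE"):
        """Return (Parent, Child) rows for the Location dimension build."""
        conn = pyodbc.connect("DSN=" + dsn)
        rows = conn.cursor().execute(FLAT_TO_PARENT_CHILD).fetchall()
        conn.close()
        return rows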

Next time: data load strategies with an emphasis on loading as fast as we can.