Fun with Unix join (on Windows too!)

Last week I mentioned that I was going to take a look at some interesting Unix utilities that you can also run on Windows by way of GnuWin (or some similar GNU on Windows library).

The first item that I want to look at is the “join” program. Join can perform operations very similar to a SQL join, except that it works with text files. Let’s say I have two text files, one named people, and one named colors. The contents of the people file is a unique ID followed by a name, and the colors file is everyone’s favorite color – where the columns are the person’s ID and then their favorite color.

Here’s the people file:

1 Jason 
2 Cameron 
3 Celvin 
4 Michael 
5 Opal 
6 Amy

Here’s the colors file:

1 Orange
2 Pink
3 Purple
4 Red
5 Blue
6 Hot Pink

As a very simple case, we can run join with just the names of these two files:

join people colors

And we’ll get this output:

1 Jason Orange
2 Cameron Pink
3 Celvin Purple
4 Michael Red
5 Opal Blue
6 Amy Hot Pink

In this case, the join program just assumed that the the first columns of each file were the same thing, and join accordingly. Again by default, the output was all of the columns from each file.

Let’s change things up and use a different people file, this time named people-small and containing fewer entries, like this:

1 Jason 
4 Michael 
5 Opal

Now if we run join with this file but the same colors file, like this:

join people-small colors

This will be our output:

1 Jason Orange
4 Michael Red
5 Opal Blue

If you come from the SQL world at all, you will notice that by default, join performed an inner join with respect to the data – that is, the three rows in the “left” data store (people-small) were paired up with the matching rows from the colors file, but the colors that didn’t match anything were just discarded. Similarly, if we run the full people file with a smaller colors file (named colors-small), we’ll see something similar. Consider this colors-small file:

2 Pink
3 Purple
6 Hot Pink

And running this command:

join people color-small

Then we’ll get this:

2 Cameron Pink
3 Celvin Purple
6 Amy Hot Pink

Again, an inner join was performed, in that only matched up rows (from one side or another) were paired up and put into the output.

Note that this behavior (assuming the first column is the ID to match, also, performing an inner join) is just the default behavior for join. As with most Unix tools, join has a plethora of command-line switches that define its behavior. You can choose different join methods, fields, and more.

The Hyperion Connection

I wanted to mention this tool specifically because I think for many of us working with ostensibly relational data (in text files), it can come in handy every now and then. I actually had a perfect use case for this the other day were I just wanted to quickly load some data but I didn’t want to mess with ODI to load it up or deal with importing it to a database. I had to sort the records in the input file first, but after that the join program worked like a charm and generated a file that I could easily use with a load rule to load up to a cube.

Drillbridge drill-to-bottom token parentheses note

Just a quick note on the syntax for Drillbridge tokens that use drillToBottom. One of the more compelling Drillbridge features is the ability to “drill to bottom”. What this means is that a user can drill on an upper-level member such as “Quarter 1”, then Drillbridge will open up the outline, get the level-0 descendants, transform them according to the token expression, and then replace the whole token with the list of members combined into a string.

The string that Drillbridge generates in this case is almost always used in a SQL “IN” clause, like this:

SELECT * FROM Transactions WHERE Month IN ('01', '02', '03')

In this case, the tokens ’01’, ’02’, and ’03’ were perhaps generated because they were transformed with the #monthToTwoDigits function and processed the member names that are the children of Quarter 1 (“Jan”, “Feb”, “Mar”).

Earlier versions of Drillbridge required you to supply the starting an ending parentheses, such that your query text looked something like this:

SELECT * FROM Transactions WHERE Month IN ({{token definition}})

This changed back around version 1.5.0, however. Now Drillbridge by default will supply the starting and ending parentheses. This change was made to accommodate a few use cases where for performance reasons you might need or want to generate the query a little bit differently (such as using a sub-select or something fancy).

Given that current and future versions of Drillbridge supply these parentheses, you would instead write the query before more like this:

SELECT * FROM Transactions WHERE Month IN {{token definition}}

If you really want to supply the parentheses yourself for some reason, you can set the “suppressParentheses” parameter on your token to “true” and Drillbridge won’t supply them – meaning that you will write the query more like the first style. While I’m on the subject, also note that by default, Drillbridge surrounds your member values with single quotes (see the first example in this post where 01, 02, and 03 are all surrounded by single quotes). You can also turn this off if you need to (for example, if the items in the IN clause must be numeric) by setting the “quoteMembers” token parameter to false.

Cool tool for generating data

I came across a really useful tool the other day for generating sample relational data. It’s a web service called genereatedata.com (inventively enough). It lets you define the columns, data types, how to generate the data (such as random names, states, numbers, numbers in a range, and more), then generate the data. You can get data in a delimited form or even directly as INSERT statements for your database of choice. It looks like it generates 100 rows of your data for you for free, but that you need to pay for a premium service that is able to generate more rows, if you need it. Fortunately in my case I was good with 100 rows.

Of course, if you need to generate data for your cubes, you might want to use a tool designed specifically for doing so, such as the Cubedata tool I built some time ago (version 2 is in the works!).

GNU CoreUtils for Windows

Did you know that many Unix core programs are available as Windows executables? Because the code for these utilities is open source, there are a couple of different organizations that have taken the liberty of compiling them for the Windows platform and making them available. One such project is GnuWin32. If you want to, say, use common Unix text processing programs on Windows as part of your Hyperion automation, you definitely can. You would just install the programs on your server, probably add their folder to your system PATH, and then be able to call them like any other program.

Why would you want to do this? There are some really great programs available on Unix platforms that might be useful for some one-off processing you do as part of your automation. In the coming days I’m going to post a few examples of some interesting text processing examples you might do with these utilities that are inspired by Essbase/Hyperion. Of course, for those of you already running Linux/AIX, you should already be set (and of course, as a smug Mac user myself, I am good to go at least with my development machine…).

Another such project that otherwise makes Unix tools available on Windows is the Cygwin project, which you might want to check out too if you need to get this functionality on a Windows machine.

Code This, Not That: Python date string formatting

Continuing on in the Code This, Not That series, I’m going to point out some convenient functions for formatting dates. Since Jython runs in the JVM, we get easy access to all of the libraries that the rich Java programming language provides.

Working with FDMEE and custom scripting, it’s likely that you’re going to need to do some work with dates. This will often involve the Java “Date” object. You might find that you need to do something with the date – get the year, get the name of the month, create a string based on the day of the month, and so on. First off, there are a few things you definitely don’t want to do.

One, you don’t want to convert the date object to a string and then start doing string manipulation on them to get the data you want. If you are converting some VBScript code you might be tempted to do something like Year = RIGHT(someDate, 4). This can work but it’s a bit of a hack. What you want to do is work with the date object directly and get some other tried and true code to do the heavy lifting for you.

There are other ways to achieve this, but a pretty simple and robust one is the SimpleDateFormat class. The SimpleDateFormat class allows us to specify a formatting code, then supply it a Date object, and it’ll generate a string for us.

For example, the formatting code for the abbreviation for the month of a date object is “MMM”. Therefore, we can create a SimpleDateFormat object with this code, pass it a date, and the resulting string is Mar or Apr or whatever the current month is:

import java.text.SimpleDateFormat as Sdf
import java.lang.System as System
import java.sql.Date as Date

date = Date(System.currentTimeMillis())

monthShortName = Sdf('MMM')
print monthShortName.format(date)
# prints "Mar" (or whatever the current month is)

There are plenty of other codes that take care of formatting the minutes, hours, seconds, year, and more for us. We can even combine codes in a single string, like the following:

dateFolder = Sdf('yyyyMMdd')
print dateFolder.format(date)

In this case, the format is the four digit year, followed immediately by the two digit month (including a leading zero if it’s less than 10), and the two digit day. You can refer to Oracle’s documentation for more codes over here. You can also put in other characters that might be ignored in the formatting string, such as hyphens between codes, colons, spaces, and more.

Also keep in mind that you create the formatting object and use it as many times as you want to format dates into strings. In other words, if for some reason you need to format many dates into strings using the same format, you just have to create one SimpleDateFormat object and then just use it over and over again.

As I mentioned, there are plenty of ways to go about formatting dates in Java (Jython) but this is a particularly flexible and useful approach. With just a few lines of code you can generate all sorts of formats that you might need. Consider this an FDMEE/Jython pattern – and anti-patterns to this might include the following code situations you can cleanup and make more readable:

  • Treating dates like strings and extracting data via substrings
  • Prefixing “20” on a two-digit year to get a four digit year
  • Keeping a table of month names in an array and using the month number to get the proper month name
  • Keeping a table of month names and then taking the first three characters to get the “short” month name

Lastly, note that the fdmContext["PERIODKEY"] object (which is where you are likely to be doing some date processing) comes in as a java Date object (in other words, if you are treating it as a string object, you are actually relying on the Jython interpreter silently casting it to a string for you, when you could just be working with it directly). For example, to get the numeric four digit year from the PERIODKEY object, you could do this:

# Note the explicit conversion to an integer
year = int(Sdf('yyyy').format(fdmContext["PERIODKEY"]))

Happy coding!

Code This, Not That: Python string interpolation/concatentation

That title is a bit of a mouthful, ain’t it?

I often see VB code in FDM jobs that is creating SQL statements – concatenating things together, adding in parameters from variables, lots of quotes, single quotes, plus signs, underscores, and more all over the place. It’s hideous.

For example, check some of this garbage out:

strSQL = "INSERT INTO " & TEMP & " (ENTITY,ACCOUNT,COSTCENTER,AMOUNT) " & _
" SELECT Entity.label, ACCOUNT.label, Entity.label+'_CC'", & Period & _
" from " & DATABASE & " as FACT"

For the record, I’m not saying if you see this or write this that you’re a bad developer. What I will say, however, is that you and/or your predecessors were hobbled by a clunky programming language and now you can atone for your syntactic sins with Python (Jython).

Python offers some nice facilities for handling strings. If you are looking for a straight translation of the normal VBScript string concatenation, you could do this:

account = "170100"
select = "SELECT Amount FROM Transactions WHERE Account = " + account

Python uses the plus operator (+) instead of the ampersand (&) to concatenate strings. This is straightforward and works well for simple situations. Of course, things get a little harder to read when we need to surround our value with single quotes (since it’s a SQL query after all):

select = "SELECT Amount FROM Transactions WHERE Account = '" + account + "'"

This isn’t terribly hard to read but we’re on a slippery slope. One thing we can try is simple “printf” style string interpolation:

select = "SELECT Amount FROM Transactions WHERE Account = '%s'" % account

In this case we have a %s placeholder meaning that we will place a string in it. Then we are using the modulo (%) operator to place the value of account into the string. This probably looks funny if you’re not familiar. What’s going on with this syntax?

Remember how you used the ampersand in VB to concatenate strings but in Python you use a normal plus? Well, the fact that we can use the plus in Python is due to operator overloading. This is programming term that essentially means we can actually program in our own behavior for math operations. You’re really used to this with integers (think 2 + 3 = 5), but maybe not so used to it for strings. It’s a clever trick and a nice facility that let’s us work with strings (or indeed, other types of objects) without having to resort to a lot of boilerplate code like newStr = StringUtilities.concatenate(query1, query2) or something similarly awkward.

Along those lines, whereas the modulo operator in math means “give me the remainder when dividing these two numbers” (such as 5 % 3 = 2), in Python this operator has been overloaded to provide convenient operations with strings (after all, how on earth would you ever perform modulo arithmetic on strings?). So when we write string1 % string2 (which is what we had above with account), Python calls the appropriate handler for the string object, which in this case is a fancy string replacement functionality.

This string replacement facility, as it turns out, is pretty handy – and this leads me to the final example of this post.

Python supports objects known as dictionaries. For those from the Java world, this is essentially the same as a Map (or perhaps more like a Map). VBScript has these objects too but they aren’t part of the language itself – they have a clunky syntax. In Python we can make a dictionary simply like this:

params = {
    "ACCOUNT" : "0170200",
    "PERIOD"  : 2
}

Next, Python also lets us create multi-line strings (and we don’t even need the line continuation operator from VB where the lines end with & _). We tell Python we have a multi-line string simply by using three quotes:

select = """
    SELECT 
        Amount
    FROM
        Transactions
    WHERE
        Account = '%(ACCOUNT)s'
        AND Period = %(PERIOD)d
""" 

Note that this string has been parameterized with ACCOUNT and PERIOD placeholders. The syntax is %(KEYNAME)s for strings and %(KEYNAME)d for integers (there are other codes as well but these are the most common).

Now we have a multi-line string with placeholders that we can replace with values from a dictionary. We could print this out easily:

print select % params

And we get this code:

SELECT 
    Amount
FROM
    Transactions
WHERE
    Account = '0170200'
    AND Period = 2   

This works exceptionally well for crafting SQL queries (also, yes, I know about placeholders with ? that are specific to SQL query parameters, let’s save that for another post). The wins here are that you get to format the query how you want, you don’t have quotes and ampersands and plus symbols all over the place, it’s easy to read, easy to maintain, and even faster to write.

Remember, code for readability. Also keep in mind that if you are converting VB to Python/Jython, you are not supposed to slavishly translate line by line – you are translating the solution as whole.

Code This, Not That: From VBScript to Jython

I recently completed a fairly complex VBScript to Jython import integration script conversion. For those that are unfamiliar, Jython is the preferred scripting language of FDMEE (the successor to FDM). Jython offers numerous advantages over VBScript. Chief among those is that the language is more expressive and offers a lot of syntactic sugar, particularly compared to VBScript.

People that know me know that I have a passion for programming. I love writing code in many languages – Java is by far my strongest language, but I also am quite fond of Objective-C (with Swift on my list of things to play with), Python, and a few others. That said I don’t just like to write good, working code – I challenge myself to write clean, elegant, high-performance, easily maintainable, easy to understand, deceptively simple, and fluid code that reads like prose.

One of the greatest compliments I ever got from someone was on an Essbase automation system I had designed. The compliment was, “You made it look simple. And I know it’s not simple – it’s just that good.” I was beaming for days.

But what does this all have with FDMEE and Jython?

Think about this: code is read many more times than it is written. Therefore, it behooves us to write it with as much expressiveness and conciseness as we can, to add useful comments, to be idiomatic with respect to the language we are writing in. Along these lines, in my recent conversion project I was reminded of all the opportunities for writing better code – more expressive code – that Jython (Python) affords us over VBScript.  So to point out some of those examples, over the coming weeks I thought I would write about come particular code examples between VBScript and Jython that I think are worth pointing out, along with some general code recommendations I might have.

So let’s kick things off.

Checking if a value is one of a list of values

It’s common in an import integration script to check if some value, say, the current member from the Accounts dimension, is a particular account. The most straightforward way to handle this condition is with the following code:

if account == '0170100' or account == '0170200' or account == '0170300':
    print "Matched account!"

Thinking back to the notion of reading code more times than we read it, the English language interpretation of this code might be “If the account is 0170100, or if the account is 0170200, or if the account is 0170300, then do this thing.” You might call this the “brute force” approach – to just literally test each condition we might handle.

Can we come up with something a little more… refined? Python offers a convenient syntax that can make this code a little more readable. In Python we can use tuples. A tuple is like a variable that is a bag of other values or variables. Think of it as a simple collection. Using the “in” operator, we can test for membership in a tuple. The following code works the exact same way as the above code:

if account in ('0170100', '0170200', '0170300'):
    print "Matched account using tuple!"

Now the English interpretation of the code might be “Is the account one of these accounts?” The code is shorter, simpler, and reads a little more easily. While this example is somewhat trivial, it’s a good example of leveraging this great tool we now have to use. Let’s take it one step further. Let’s declare a variable with these values ahead of time:

gross_profit_accounts = ('0170100', '0170200', '0170300')

And then we’ll check if the account is one of these:

if account in gross_profit_accounts:
print "Matched account against account list!"

Again, this code works the exact same way, but we’ve introduced a variable name. Think of the variable as a bit of built-in documentation. We easily could have done this too:

# This is for example purposes, the previous example is the best one (in my opinion)
# Check if the account is one of the gross profit accounts
if account in ('0170100', '0170200', '0170300'):
    print "Matched account using tuple!"

But now without even using a comment, the variable name serves as an expressive explanation of what the code is doing. Again, code is read more than it is written, so I think this is a better solution than using the variable name “gp_accounts” or “gpa” or something that requires a decoder ring.

Think about the English translation of this code now: “If the account is a gross profit account, do this thing.” Also, what if we wanted to test the opposite case (not being a gross profit account)? We could do this:

if account not in gross_profit_accounts:
    print "Account not a gross profit account!"

In English: “If the account is not a gross profit account, do this thing.” It’s like the code documents itself – and that is the best kind of code. Think of code as prose!

Advanced Drillbridge tokens: Protecting against unexpected members

Here’s another quick tip for setting up Drillbridge tokens. I recently created a Drillbridge report that needs to map a member from the Cost Centers dimension over to multiple columns in a relational database table. For example, a cost center may have the format 111-222-333-44, and the code 222 corresponds to a column in a table, 333 corresponds to a column in a table, and 44 corresponds to a column in a table (these are all in the same table, by the way).

Because the incoming member on the POV comes across as a Java string, we can perform normal Java string operations on it. In this case we can easily extract the “222” by doing the following (assuming the cost center dimension is named “Cost Centers”:

#Cost_Centers.substring(4, 7)

Recall that in Java, string character offsets start at 0, and that the substring we extract does not include the character at the ending offset (this is all spelled out nicely in the Java String documentation). So effectively, in the above function we are saying “give me the substring of the #Cost_Centers variable from characters 4 through 7 (but not including 7).

A problem with this approach is that members with fewer characters could potentially cause a StringIndexOutOfBounds Java exception. For example, a level one member or parent to our cost center might be “Other”. Therefore when the substring method is called on “Other”, we are asking Java to give us characters that do not exist, and an exception is thrown.

I thought for awhile on what the best way to handle this is. There are a lot of ways I could go in terms of the code, but I decided that the best approach is one that won’t really affect the Drillbridge code much at all right now.

The decision to use a full-fledged expression language as the basis of Drillbridge token expressions has turned out to be quite fortuitous, and one of the things it affords us inside of Drillbridge expressions is the ability to write complex scripts that can handle this situation for us. One such way to guard against member names that are too short is to use the ternary operator.

Many programming languages support the ternary operator. It’s a compact way of representing an “if-else” construct. For example, consider this simple code:

if (stopped) {
    return "Red";
} else {
    return "Green";
}

In this case, the “stopped” variable is a boolean variable that is always one of either true or false. If the variable is true, our function returns the text “Red” and if the variable is false, our function returns the value “Green” (assume that the if block is contained within a function that returns a String value.

This construct is so prevalent in computer programming that many languages support a way to compactly write this, using the ternary operator:

String colorText = stopped ? "Red" : "Green";

In the above example, the variable stopped gets evaluated and if true, the whole expression will evaluate to “Red” and if false, it’s “Green”. Then the String variable named colorText will get the value.

We can use this exact same construct inside of a Drillbridge token expression, thanks to the powerful expression language being used. Now consider this enhanced query example:

SELECT * FROM TRANSACTIONS
WHERE
    SEGMENT2 = '{{
        "name":"Cost Centers", "expression" : "#Cost_Centers.length() >= 13 ? #Cost_Centers.substring(4, 7) : '~~ Cost Center name not long enough to parse ~~'", "sampleValue":"111-222-333-44"
    }}'

Now we are checking the value of an expression first – checking the length of the incoming cost center. If it is at least 13 characters, then the return value of of the expression is safe to be figured out using the substring method. If the member fails the length test, then we return an arbitrary string, which in this case we have determined cannot match anything at all in the database, and therefore return nothing.

This little trick can be used in situations where we need to parse out sub-strings from member names but might have some members that don’t fit the format we need. Also, thanks to Drillbridge’s inventive handling processing of multiple members with the drill-to-bottom feature, we can also use this to effectively filter out members from a list of members, such that we keep the ones we want and discard any others that we don’t need (for example, it’s not uncommon to have “non-conforming” member names for input members).

Drillbridge White Paper now available

I am excited to say that the official Drillbridge white paper is now generally available. It is titled “Drillbridge: Easy Hyperion Drill-through with No Redevelopment”. In my first official white paper, I have hoped to concisely capture the essence of this useful software tool that is now amazingly  approaching 20 production deployments.

In other news, the next version of Drillbridge is still in progress, slowly but surely. The focus continues to be on polish, performance, and incremental feature additions. I have a few small but exciting tidbits that I will hopefully have a chance to post later this week.