I do seem to love me some ODI journalization. It has been a really great feature that I have implemented countless times to great effect. For those not familiar, journalization in ODI is where you tell ODI that you want to record what rows get updated and inserted in a particular table, such that when you load data out of that table, you can choose to work with just those updated rows.
This is useful for many obvious reasons: only processing data that’s new or changed is usually faster than processing a whole table. When you are dealing with millions and billions of rows, it’s a necessity. Journalization is also nice for instances where it replaces processing data based on time.
For example, I have reworked many systems where the data to be processed was
sysdate - 1 or some other time period, and the automation runs daily. This can work. But what if something goes wrong and you don’t catch it for a day? Well, you go in and tweak things and run a one-off job to fix it. Or you have other coping mechanisms. What if data from a month ago is updated? Many times just processing anything that changed is the effective strategy.
It’s how journalization works under the hood, though, that is the focus of this post (for performance reasons). When you journalize a table (at least for the simple Oracle implementation), ODI installs a trigger on that table so that updates and inserts are recorded somewhere.
In this case, that somewhere is your work schema. If you were to journalize a table called CUSTOMERS and the primary key of customers was CUSTOMER_ID, then you would end up with a journal table called J$CUSTOMERS that has the following columns:
The structure will always be the four JRN_ columns and the primary key (which can be multiple columns but is often just one) of the table being recorded. This simple table records everything you need to know about the new and updated rows.
Here’s a quick crash course in these columns:
JRN_SUBSCRIBER: You supply a text name indicating what the consumer of the data will be. For example, you may have two systems that independently want to process new and updated customers, so you might have two subscribers: MARKETING and CRM, for example. Rather than creating two separate journal tables, we just have a column in here to differentiate the columns.
JRN_CONSUMED: This defaults to 0 and will be 0 until processing data out of the journal occurs. This column provides a really elegant solution to making sure you don’t lose data when processing journalized data. When an ODI interface runs to process data out of a journal, it sets the JRN_CONSUMED column for rows to process to 1. This does a couple of things. One, any rows that are added to the journal after a job starts do not get lost or otherwise suffer from some timing issue. They will just get picked up on the next processing job. Two, if the ODI interface fails for some reason, before it completes successfully, and you need to re-run the job, you won’t have lost the journalized rows: they are only dropped when the job successfully completes (it’s one of the last cleanup steps in the job).
JRN_FLAG: Indicates whether the row was inserted or updated (I or U), although in practice you will notice that the value is always ‘I’, indicating an inserted row (even if it was actually just an update). This is because later on in the job ODI will mark rows that are actually updates with a U and then handle the processing accordingly
JRN_DATE: The date (time) that the row was recorded into the journal.
With all of this infrastructure in place (plus a couple of views that ODI puts in for you), journalization is ready to go. You design your interface as normal, and if you want to process just the new/updated rows, all you have to do is check a single checkbox on the Mapping tab (Use Journalized Data). ODI handles the rest for you.
This is a really killer feature of ODI and can cleanup countless home-brew solutions that are fragile, complex, and code-intensive.
That being said, the original purpose of this post was to actually talk about a performance issue with ODI journals, but it was necessary to provide some context with an overview of journalization first. With that out of the way, check back tomorrow for a tweak you can make to your journal tables that might speed things up if you are processing large amounts of data!