Andrew Drinkwater
I always enjoy talking with IR professionals making the jump to daily data warehouse data, which presents its own challenges and opportunities. But what about shifting the other way, from daily data to infrequent snapshots?
This week, I found myself wrestling with a college system reporting requirement. In this setup, the college sends in periodic snapshots of enrolment: after the fall term, part way through the spring, and near the end of the fiscal year. There’s also a recalculation period, if required. In the end, they need a snapshot that sometimes includes a very basic forecast of what might happen over the rest of the fiscal year. These requirements, in my view, were designed for a snapshot era, not one where data is available at any time.
I earned my data and analytics stripes in a domain where we had daily data. It was my first job as an actual data analyst, and I worked for a forward-thinking registrar (and Pat, my co-founder). The techniques weren’t as advanced as the kinds of data warehouses Plaid builds now, but they helped us stay informed about what was going on in admissions and enrolment in ways that were vastly superior to waiting until the end of the week, or even three weeks into the term, for information.
Imagine my surprise when I moved to institutional research and learned that our data arrived three times per year: November 1, March 1, and August 1. After the data arrived, the team would spend time reviewing it and checking that everything was correct. It was rare for the data to be available for analytics by the 2nd of the month.
It was a bumpy transition for me. I’d never actually worked with snapshots before and found the concept strange. Basically, the snapshot would take the current results of the various database queries and save them in a static format. In our case, it was a series of SPSS files that were then mirrored into a database to make it easier for Tableau to read from.
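For anyone who hasn’t worked this way, here’s a minimal sketch of the pattern in Python. The database name, view names, and folder layout are assumptions for illustration; the actual process produced SPSS files rather than CSVs.

```python
import sqlite3
from datetime import date
from pathlib import Path

import pandas as pd

# Hypothetical snapshot job. The database, view names, and folder layout
# are made up for illustration; the real pipeline used SPSS files.
snapshot_dir = Path("snapshots") / date.today().isoformat()
snapshot_dir.mkdir(parents=True, exist_ok=True)

queries = {
    "enrolment": "SELECT * FROM enrolment_view",
    "admissions": "SELECT * FROM admissions_view",
}

conn = sqlite3.connect("student_records.db")
for name, sql in queries.items():
    # Run the live query, then freeze the result as a static, dated file.
    # Downstream tools read these frozen copies, so reports stay stable
    # until the next snapshot is taken.
    frame = pd.read_sql_query(sql, conn)
    frame.to_csv(snapshot_dir / f"{name}.csv", index=False)
conn.close()
```

The defining property is that once the files are written, the reports stop moving: whatever changes in the live system afterward, the snapshot stays put.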
When I started building the admissions data warehouse with this team, the reaction from one of my colleagues really summed up the difference:
You know, we work from opposite ends of the same spectrum. I get the data, massage it, fix it, and clean it up. Then, when I’m good and ready, I publish it. You seem to do the opposite: you spend a bunch of time before the data even exists validating rules, making sure that it will work, and ensuring that it publishes automatically. I think we achieve the same thing, but in a very different way.
I can’t say I ever fully got onto the snapshot train. But I do see use cases where this concept can be helpful:
- Government reporting almost always requires data as of a particular date.
- Working with accountants or auditors.
- Enrolment forecasting.
In these cases, there is an advantage to having a snapshot-like technique. In a data warehouse, this may well take the form of point-in-time milestones, so that when you filter to a milestone, you get the data as it existed at that milestone. You can also keep rows that provide restatements where needed: the restatements help you figure out what was truly accurate and give you an opportunity to explain the changes, rather than forcing you to overwrite your data when a mistake is found. Being able to trace changes back and explain them keeps your partners in government, audit, and accounting and finance happy.
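Here’s a minimal sketch of what that could look like, using pandas to keep it self-contained. Everything here is hypothetical: the table, the column names (milestone, version, note), and the numbers. In a real warehouse this would be a fact table queried with SQL rather than a DataFrame.

```python
import pandas as pd

# Hypothetical point-in-time fact table. Every row is tagged with the
# milestone it belongs to, and a correction is added as a new row with a
# higher version number instead of overwriting the original.
enrolment = pd.DataFrame(
    [
        ("2024-11-01", "Arts",    1200, 1, "original submission"),
        ("2024-11-01", "Science",  950, 1, "original submission"),
        ("2024-11-01", "Science",  965, 2, "restated: late registrants added"),
        ("2025-03-01", "Arts",    1150, 1, "original submission"),
        ("2025-03-01", "Science",  940, 1, "original submission"),
    ],
    columns=["milestone", "program", "headcount", "version", "note"],
)

def as_at(df: pd.DataFrame, milestone: str, restated: bool = True) -> pd.DataFrame:
    """Return the data as it existed at a milestone.

    restated=True keeps the latest version of each row (what we now believe
    was true); restated=False returns the original submission, i.e. what
    was actually reported at the time.
    """
    snap = df[df["milestone"] == milestone]
    if restated:
        # Highest version per program wins; earlier versions stay in the
        # table so the change can still be traced and explained.
        return snap.loc[snap.groupby("program")["version"].idxmax()]
    return snap[snap["version"] == 1]

print(as_at(enrolment, "2024-11-01"))                  # restated view
print(as_at(enrolment, "2024-11-01", restated=False))  # as first reported
```

The key design choice is that nothing is ever overwritten: a correction is a new row, so both the original submission and the restated figure survive, and the difference between them is the explanation you hand to your auditor.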
Snapshots can also be helpful with enrolment forecasting. In particular, many users I speak to don’t want their forecast to change daily; they’d rather see it refresh something like every second week, or every month.
Have you had to learn how to use less frequent data after getting accustomed to daily or more frequent data?