Monday, December 20, 2004

Quick Distraction

Now for the past week or so I've been plagued with a problem at work with some analysis I've been doing.

I've been looking at how people's giving has varied over 12 months (e.g. Has person x giving more or less in the past 12 months than in the previous 12 months?, that sort of thing.) In theory it's quite simple, make up a couple of views, one or two functions here and there. Job done. There was only one problem.

I had to use Access.

Normally I wouldn't be concerned about this. I use Access all the time. I grew up on it. But unfortunately, Access doesn't do things very quickly (if it does it at all) and neither does it enjoy long-running processes. It just locks your machine up and won't let it go until it's damned well good and ready.

This data I was working with was just using a single table, of some 500 000 rows. This caused one or two problems from the outset...

The number-crunching involved a function, 'Status', which was based on several factors (How much the supporter had given this year, how much last year, when the first started donating, stuff like that). Now I had a choice for this, neither options particularly appealing:

  1. Try to assemble this function in Access SQL using a whole bunch of nested IIF statements
  2. Do it with a VBA function
Well at first I opted for writing a VBA function. Boy was I misguided! What I hadn't taken into account was that actual cost of running a VBA function. It's just never been high enough to worry me before.

Draft 1 of this function was very bad - The function even needed to read data from the tables, so for every row, it would have to:
  • Compile the function
  • Start running the code
  • Open a connection to the database
  • read some data
  • Close the connection
  • Do some maths on the values pulled in
  • Finally spit out a result

Whoops. I didn't even try to time it. It was taking forever. I figured one of the big costs (at least one of the costs I could eliminate) was these calls back to the database, so I jigged abut with it, and added all the values it was reading as parameters. Fine. That made it a lot faster than it was before. Cool.

But not fast enough. It still took bloody ages to work through the table.

So I had an idea. 'Why not', I thought 'create a table from the resultset using an in-memory recordset?' So I did. I rewrote the code, so that it loaded up 2 or 3 recordsets and then (because as far as I know I can't use ADO.NET from VBA) go through iteratively and for each record do the maths and add the result to a table. 'Whoah!' I though. 'Just look at that performance boost. Lightning quick, that!' I manged to get the processing time down to a very nippy 30 mins. Woo Hoo!

30 minutes. That's just silly. I can't wait for 30 minutes for something to happen.

So I skived off work one afternoon (while everybody was at the office christmas party. What a geek!) and redid the whole thing on SQL Server. It rocked. Muli-statement Stored Procedures rule. User Defined Functions also rule.

And not just from a performance point of view, either. One of the big problems was trying to validate my queries. Making sure what was being produced was what I thought was being produced. Not that difficult, just a pain in the behind when you've got to wait 30 minutes to find out every time (and this was 30 minutes if I wasn't trying to use my machine at the same time. Longer otherwise!). Sprocs? No problem - I can go through my sproc and pull out the bit I'm trying to validate. Run that, and see if all is well. Takes seconds. Sweet.

And I've noticed something very fishy about Access' query designer. If you've got a nested subquery you're doing a join on, then it runs fine. Open it up in Design View, that works fine too. Your subquery appears in there as a table, same as usual. HOWEVER open the bugger up in SQL view, and you'll find Access has done something very very strange to it...

Well, it's not strange at all, just irritating because it stops it from working.

I started with this:
SELECT DISTINCT ThisYear.[Constituent ID], ThisYear.YearValue
FROM (SELECT Data.[Constituent ID], sum(Data.[Gift Amount]) as YearValue FROM Data where Data.[gift date] > #09/30/2003#
GROUP BY Data.[Constituent ID]
) AS ThisYear INNER JOIN (Select data.[Constituent ID] from data where data.[gift amount]>=100 AND Data.[Gift Date] > #10/01/1998#
) AS BigGivers ON ThisYear.[Constituent ID] = BigGivers.[Constituent ID]) LEFT JOIN [September 04] ON BigGivers.[Constituent ID] = [September 04].ConsID
WHERE ((([September 04].ConsID) Is Null));


Ugly, but the only way I could figure of doing it in a single SQL statement (rather than a series of separate queries). Open it up in the query designer, though, and this is what you get:
SELECT DISTINCT ThisYear.[Constituent ID], ThisYear.YearValue
FROM ([SELECT Data.[Constituent ID], sum(Data.[Gift Amount]) as YearValue FROM Data where Data.[gift date] > #09/30/2003#
GrOUP BY Data.[Constituent ID]
]. AS ThisYear INNER JOIN [Select data.[Constituent ID] from data where data.[gift amount]>=100 AND Data.[Gift Date] > #10/01/1998#
]. AS BigGivers ON ThisYear.[Constituent ID] = BigGivers.[Constituent ID]) LEFT JOIN [September 04] ON BigGivers.[Constituent ID] = [September 04].ConsID
WHERE ((([September 04].ConsID) Is Null));

Spot the difference. It took me a while to notice it, but the subqueries are losing their brackets. The surrounding '()' is being replace by '[].'. However Access doesn't notice this change until you edit the text. If you just open it in this view, and re-run the query, it works fine.

Edit the text, however, and you have to remember to change the brackets back to how they were initially, otherwise it throws an error.

Grrr. Like I said, it's more of an annoyance than anything else.

So that's what I've been doing lately. Sorry about the very long, very boring post.

I can get back to making my blogger client now...

Just wanted to share.

No comments: