Thursday, 24 April 2014

Quick Fan-Out Queries to SQL Database

SQL Azure, sorry I mean Windows Azure … sorry, Microsoft Azure SQL Database is an interesting database platform.  Connections to it have to be made per database meaning the more traditional connections made to a SQL Instance don’t work quite the same.  Add to that a sharded dataset spanning multiple SQL Databases and you’re into the realm of “tricky queries”.  Even a simple query can be quite challenging if when you don’t have the right tools available.  Take for example a request for the number of users by country – it may look something like this:

select countrycode, count(*)
from dbo.myusers
group by countrycode;

Oooooh, but hang on.  You have your users sharded across 100 databases, each one requiring a discrete connection.  We have to run a fan-out query hitting all 100 databases, gather the results, then apply a second aggregation on the result set to get the answer we want.

Some time ago in my early days of SQL Database, it became clear very quickly that a tool was needed to run fan-out queries and return a single merged result set.  I had started learning Powershell so played around with it for a while, but in truth an application was needed with a nice(ish) interface to use.  A developer colleague of mine knocked together such a tool allowing me to import a configuration file, i.e. a list of database connections, and run TSQL against those I selected from a treeview, using the SqlCommand C# library.  It was great!  I could specify the number of threads to use to maximise performance, and get a consolidated set of data returned.  Even better, I could copy and paste the data into Excel to perform further analysis or aggregation when required.  It was something I used for a very long time until just recently…

Another colleague of mine, Antonio Vlachopoulos (Linkedin), mentioned he was using Registered Servers and it was much quicker.  It intrigued me.  I’ve been aware of the feature for years but never tried it for SQL Database.  It stemmed back to when I first needed to run those fan-out queries, and the thought of setting up each individual database connection put me off as we had hundreds.

I’ve happily used my fan-out query app for years, without ever having the need to change to something else.  But in the interest of finding “the best way” I thought I’d try it out.  And today things are different.  Today I know Powershell.  With Powershell I don’t have to manually create every Registered Server connection to my databases.  Instead I can use my configuration database to generate the connections and create them automatically.

Below is a portion of the Powershell I used.  You need to write a process to import your database connections somehow, whether it’s from a config file, or a database etc.  After that the code will go off and create your Registered Server entries.  It does assume it’s running on the local instance and that you have the appropriate permissions, so you may need to do a little experimentation for your environment.

cls
$ParentGroupName = "MyAzureDatabases"

# import the sqlps module to enable us to use the relevant features
import-module sqlps

# navigate to the Server Group folder where the Registered Servers are stored
CD 'SQLSERVER:sqlregistration\Database Engine Server Group'

# add the parent group
new-item $ParentGroupName

# navigate to the new group
CD $ParentGroupName

################################################# 
#    Get your connection details in this section.
#     This script expects an array of nested hash
#     tables. 
#
#     Server
#      `-Database
#         `-Username (secure)
#         `-Password (secure)
#
#     It is easily adapted to add more nesting 
#     for greater flexibility with groupings.
#################################################

# NB - this is not how I specified my connections and credentials, I keep them all in a configuration database
#      but it is useful as a demonstration.
$username = (ConvertTo-SecureString "myUsername" -AsPlainText -Force)
$password = (ConvertTo-SecureString "myPassword" -AsPlainText -Force)
$connections = @{ "MASD_server1" = @(@{"ShardedDb1" = @($username,$password)},@{"ShardedDb2" = @($username,$password)}) }
$connections += @{ "MASD_server2" = @(@{"ShardedDb3" = @($username,$password)},@{"ShardedDb4" = @($username,$password)}) }


#################################################
#    - END
#################################################

# specify the standard SQL Database server name extension for connections
$ext = ".database.windows.net"

# loop through each key and pull out values to generate Registered Server entry
foreach ($server in $connections.Keys) {
   
# Add each server as a group
   
New-Item $server
   
# Navigate to server path
   
CD $server
   
foreach ($databaseList in $connections.get_Item($server)) {
       
foreach ($databaseHash in $databaseList) {
           
foreach ($database in $databaseHash.Keys) {
               
$bstr = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR((($databaseHash.get_Item($database))[0]))
               
$uid = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto($bstr)
               
$bstr = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR((($databaseHash.get_Item($database))[1]))
               
$pswd = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto($bstr)
               
New-Item "${server}_${database}" -itemtype registration -Value "server=tcp:${server}${ext};integrated security=false;uid=$uid;password=$pswd;database=$database;Encrypt=True"
           
}
       
}
   
}
   
# go back up a level to parent group
   
CD ".."
}

So, onto the testing.  I created my Registered Servers and decided to run something simple, but with enough records returned to make it a good test.  In steps my original query at the start of this post:

select countrycode, count(*)
from dbo.myusers
group by countrycode;

For me, I was pulling back 100+ records each from about 400 databases.  My fan-out query app went off and pulled the data back in around 17 seconds.  Not bad for hitting that many databases in the cloud and getting the results collated, I thought.  I ran it again, to allow for cached results.  The time was about the same.  Now for the turn of my Registered Servers – 4 seconds!  If it was going to be quicker, I expected it to be marginal - but this was 4 times quicker.  Remarkable.  It was very important for me at that point to test out the speeds over both large and small datasets.

Large(r) Dataset

I started by taking 800kb of data from each of my 400 databases.  This was done by generating 8000 bytes of nonsense data over 100 rows:

select top 100
   
replicate(cast(0x0 as varbinary(max)),8000) as mydata
  
,cast(sysutcdatetime() as varchar) as runtime
from dbo.MyTable;

Surprisingly, the Registered Servers query came in slower.  29s compared to 13s for the fan-out query app.  Tweaking the size of the binary data, I plotted out a number of timings for both the Registered Servers (RS) queries and fan-out query app (FQA).

image

You can quickly see that around 50MB the Registered Servers query hits a saturation point and slows badly.  The fan-out query app stays fairly steady regardless of the size of the dataset being returned.

Explanation

The reason my test query contained a ‘runtime’ field was to allow me to see when the data was returned by each database.  By comparing the earliest time against the latest across the complete set, we can get a feel for how long it took each method to execute the queries across all of the databases.  The chart below shows the comparison.

image

It seems that with Registered Servers the queries are all executed as quickly as possible, which suggests a high number of threads in use (perhaps even one thread per target).  With the fan-out query app, I control the number of threads in the config – for these tests I was using 32 threads.  Having a staggered approach to returning the data, rather than everything coming back at once, can be beneficial for larger datasets.  Any SSIS developers out there know all too well that trying to do too much at once ends up with much slower results.  That certainly seems to be the case here – Registered Servers try to do too much.  As soon as the result set grows too large for the local instance to handle efficiently, the process slows down considerably.

Conclusion

Registered Servers provide a brilliant mechanism for running lightweight queries against SQL Databases (and on-premises ones too for that matter).  It is super fast as long as the result sets aren’t too large, but you’ll have to test out the thresholds for yourself as they’ll differ for every environment.  The best part about it?  It’s provided out-of-the-box with SQL Server.

The downside is the lack of granular control, in terms of threads etc.  Also, when it comes to SQL Database my fan-out query app has a distinct advantage that it includes retry logic, a beneficial aspect when running queries in the cloud.

But for general ad hoc querying, Registered Servers will be my first choice from now on.  Only when I need larger datasets will I switch back to the fan-out query app, but I suspect that will be a rarity.