Thursday 24 April 2014

Quick Fan-Out Queries to SQL Database

SQL Azure, sorry I mean Windows Azure … sorry, Microsoft Azure SQL Database is an interesting database platform.  Connections to it have to be made per database meaning the more traditional connections made to a SQL Instance don’t work quite the same.  Add to that a sharded dataset spanning multiple SQL Databases and you’re into the realm of “tricky queries”.  Even a simple query can be quite challenging if when you don’t have the right tools available.  Take for example a request for the number of users by country – it may look something like this:

select countrycode, count(*)
from dbo.myusers
group by countrycode;

Oooooh, but hang on.  You have your users sharded across 100 databases, each one requiring a discrete connection.  We have to run a fan-out query hitting all 100 databases, gather the results, then apply a second aggregation on the result set to get the answer we want.

Some time ago in my early days of SQL Database, it became clear very quickly that a tool was needed to run fan-out queries and return a single merged result set.  I had started learning Powershell so played around with it for a while, but in truth an application was needed with a nice(ish) interface to use.  A developer colleague of mine knocked together such a tool allowing me to import a configuration file, i.e. a list of database connections, and run TSQL against those I selected from a treeview, using the SqlCommand C# library.  It was great!  I could specify the number of threads to use to maximise performance, and get a consolidated set of data returned.  Even better, I could copy and paste the data into Excel to perform further analysis or aggregation when required.  It was something I used for a very long time until just recently…

Another colleague of mine, Antonio Vlachopoulos (Linkedin), mentioned he was using Registered Servers and it was much quicker.  It intrigued me.  I’ve been aware of the feature for years but never tried it for SQL Database.  It stemmed back to when I first needed to run those fan-out queries, and the thought of setting up each individual database connection put me off as we had hundreds.

I’ve happily used my fan-out query app for years, without ever having the need to change to something else.  But in the interest of finding “the best way” I thought I’d try it out.  And today things are different.  Today I know Powershell.  With Powershell I don’t have to manually create every Registered Server connection to my databases.  Instead I can use my configuration database to generate the connections and create them automatically.

Below is a portion of the Powershell I used.  You need to write a process to import your database connections somehow, whether it’s from a config file, or a database etc.  After that the code will go off and create your Registered Server entries.  It does assume it’s running on the local instance and that you have the appropriate permissions, so you may need to do a little experimentation for your environment.

cls
$ParentGroupName = "MyAzureDatabases"

# import the sqlps module to enable us to use the relevant features
import-module sqlps

# navigate to the Server Group folder where the Registered Servers are stored
CD 'SQLSERVER:sqlregistration\Database Engine Server Group'

# add the parent group
new-item $ParentGroupName

# navigate to the new group
CD $ParentGroupName

################################################# 
#    Get your connection details in this section.
#     This script expects an array of nested hash
#     tables. 
#
#     Server
#      `-Database
#         `-Username (secure)
#         `-Password (secure)
#
#     It is easily adapted to add more nesting 
#     for greater flexibility with groupings.
#################################################

# NB - this is not how I specified my connections and credentials, I keep them all in a configuration database
#      but it is useful as a demonstration.
$username = (ConvertTo-SecureString "myUsername" -AsPlainText -Force)
$password = (ConvertTo-SecureString "myPassword" -AsPlainText -Force)
$connections = @{ "MASD_server1" = @(@{"ShardedDb1" = @($username,$password)},@{"ShardedDb2" = @($username,$password)}) }
$connections += @{ "MASD_server2" = @(@{"ShardedDb3" = @($username,$password)},@{"ShardedDb4" = @($username,$password)}) }


#################################################
#    - END
#################################################

# specify the standard SQL Database server name extension for connections
$ext = ".database.windows.net"

# loop through each key and pull out values to generate Registered Server entry
foreach ($server in $connections.Keys) {
   
# Add each server as a group
   
New-Item $server
   
# Navigate to server path
   
CD $server
   
foreach ($databaseList in $connections.get_Item($server)) {
       
foreach ($databaseHash in $databaseList) {
           
foreach ($database in $databaseHash.Keys) {
               
$bstr = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR((($databaseHash.get_Item($database))[0]))
               
$uid = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto($bstr)
               
$bstr = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR((($databaseHash.get_Item($database))[1]))
               
$pswd = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto($bstr)
               
New-Item "${server}_${database}" -itemtype registration -Value "server=tcp:${server}${ext};integrated security=false;uid=$uid;password=$pswd;database=$database;Encrypt=True"
           
}
       
}
   
}
   
# go back up a level to parent group
   
CD ".."
}

So, onto the testing.  I created my Registered Servers and decided to run something simple, but with enough records returned to make it a good test.  In steps my original query at the start of this post:

select countrycode, count(*)
from dbo.myusers
group by countrycode;

For me, I was pulling back 100+ records each from about 400 databases.  My fan-out query app went off and pulled the data back in around 17 seconds.  Not bad for hitting that many databases in the cloud and getting the results collated, I thought.  I ran it again, to allow for cached results.  The time was about the same.  Now for the turn of my Registered Servers – 4 seconds!  If it was going to be quicker, I expected it to be marginal - but this was 4 times quicker.  Remarkable.  It was very important for me at that point to test out the speeds over both large and small datasets.

Large(r) Dataset

I started by taking 800kb of data from each of my 400 databases.  This was done by generating 8000 bytes of nonsense data over 100 rows:

select top 100
   
replicate(cast(0x0 as varbinary(max)),8000) as mydata
  
,cast(sysutcdatetime() as varchar) as runtime
from dbo.MyTable;

Surprisingly, the Registered Servers query came in slower.  29s compared to 13s for the fan-out query app.  Tweaking the size of the binary data, I plotted out a number of timings for both the Registered Servers (RS) queries and fan-out query app (FQA).

image

You can quickly see that around 50MB the Registered Servers query hits a saturation point and slows badly.  The fan-out query app stays fairly steady regardless of the size of the dataset being returned.

Explanation

The reason my test query contained a ‘runtime’ field was to allow me to see when the data was returned by each database.  By comparing the earliest time against the latest across the complete set, we can get a feel for how long it took each method to execute the queries across all of the databases.  The chart below shows the comparison.

image

It seems that with Registered Servers the queries are all executed as quickly as possible, which suggests a high number of threads in use (perhaps even one thread per target).  With the fan-out query app, I control the number of threads in the config – for these tests I was using 32 threads.  Having a staggered approach to returning the data, rather than everything coming back at once, can be beneficial for larger datasets.  Any SSIS developers out there know all too well that trying to do too much at once ends up with much slower results.  That certainly seems to be the case here – Registered Servers try to do too much.  As soon as the result set grows too large for the local instance to handle efficiently, the process slows down considerably.

Conclusion

Registered Servers provide a brilliant mechanism for running lightweight queries against SQL Databases (and on-premises ones too for that matter).  It is super fast as long as the result sets aren’t too large, but you’ll have to test out the thresholds for yourself as they’ll differ for every environment.  The best part about it?  It’s provided out-of-the-box with SQL Server.

The downside is the lack of granular control, in terms of threads etc.  Also, when it comes to SQL Database my fan-out query app has a distinct advantage that it includes retry logic, a beneficial aspect when running queries in the cloud.

But for general ad hoc querying, Registered Servers will be my first choice from now on.  Only when I need larger datasets will I switch back to the fan-out query app, but I suspect that will be a rarity.

Wednesday 26 February 2014

Slide Deck for SQL NE Talk

The slide deck for my talk "Introduction to WASD for DBAs & Developers" is now available for download at the link below.  I have added some notes to the slides, so hopefully you'll find them useful.

Introduction to WASD for DBAs and Developers

Tuesday 4 February 2014

Understanding Text Encoding – What Data Professionals Should Know

It surprises me somewhat just how little is known about text encoding and code pages, by those professionals working with data every day.  The fundamentals should be known by every data professional, so this post is intended to cover just that.

Firstly, I know that as an intelligent reader and data professional you know that characters exist in the presentation layer only.  Kind of – the glyphs at least.  They still exist as a sequence of bits, but depending on the encoding used they can be interpreted in many different ways.  It’s up to the client reading the data to know what text encoding is being used so the correct characters can be displayed to the user.  Why reiterate this?  Well, it’s fundamental to understanding text encoding.

For example, let’s say I encode my alphabet as A=0, B=1, C=2 and so on.  I send you the message “SQLTUNA” which would be passed over the wire as 18 16 11 19 20 13 0.  Now let’s say you’re using the coding system of A=1, B=2, C=3 and so forth.  When you display the message it would come out as “RPKSTM?” where the final character (?) is unknown.  I don’t want to be known as RPKSTM?, so for that reason alone it’s important to have a standard encoding system that both parties are aware of.

Good news!  We have standards.  Lots of them.  Without going into too much history, some time in the 1960s the American Standard Code for Information Interchange shortened their name and the ASCII standard was borne.  OK, so there may have been a bit of intelligence and thought applied in between, but that’s just semantics.

ASCII maps the main Latin characters to the numbers 0 to 127.  Lower and upper case letters, numbers and common punctuation, as well as some non-printable control characters such as the carriage return and tab (numbers 0-31 are reserved for control characters).  It all fits neatly into 7 bits.  And that’s where the problems started.  Once it became commonplace to transfer data in 8-bit sequences, a.k.a. bytes, there was suddenly a “spare” bit opening up another 128 possible characters.

To cut a long story short, the “spare” or “extended” characters were used differently by different entities, mainly desktop providers such as IBM.  The printable character set of 32-127 were kept as standard by pretty much everybody, however the extended set for 127-255 were used in different ways, and in some cases the control character range of 0-31 were changed.

Thus code pages came into existence.  Allocating a code page to a particular set of characters allowed these encodings to be exchanged and mapped correctly by the recipient.  So for example, if you received codes from someone and they told you it was encoded using code page 437 (IBM’s original code page) you would decode it to produce the output below:

Received:

201 205 187 13 200 205 188

Displayed:

╔═╗
╚═╝

“A Pretty Box”

However, if you mistakenly used the wrong code page to decode it, such as the Windows 1252 code page, you would end up with the following instead:

ÉÍ»
Èͼ

“A fraction better than Old Macdonald…”

You get the point.  Many code pages have now been made official to describe the various different mappings of characters to codes.  They all share the consistency of codes 32-127 from the ASCII standard, but beyond that you need to know the code page you’re dealing with in order to display the text correctly.

Unicode

What happens if you need more than 256 characters in a set?  Some languages such as Japanese have over a thousand characters.  That simply ain’t gonna fit into 1 byte.  So in stepped Unicode to rescue the situation.  Devised around the late 1980s/early 90s, in conjunction with the Universal Character Set (an ISO standard), the intention was to create a mapping for every character in every language.  They pretty much succeeded, and currently have mappings for over 110,000 characters!

An important point to note here – Unicode does not tell you how to encode your characters for storage or transmission.  It simply maps the characters to code points.  They are generally in the form “U+” followed by a hexadecimal number.  For example:

U+0053 – S
U+0051 – Q
U+004C – L
U+0054 – T
U+0075 – u
U+006E – n
U+0061 – a

Those of you paying attention will notice that the code points above are using 4-digit hex numbers, equivalent to 2 bytes or 16 bits.  That’s 64k possible mappings, which is way below the 110k they have standardised.  Well, Unicode provisioned the use of 5 and 6-digit hex numbers (up to 0x10FFFF), meaning they can currently map over 1 million characters should they wish.

Onto the implementations.  You need to store and transmit text data, and you want to use Unicode.  Multiple different encodings exist for this.  UTF-8 has become a bit of a favourite, and UTF-16 (which extends the older UCS-2) is still popular.  UTF is the Unicode Transformation Format, with UCS being Universal Character Set.  Many different methods exist, but these are probably the most well known.

I’ll start with UCS-2, as it is the simplest to explain.  The system basically used a fixed length 2-byte encoding of the Unicode code point.  That’s it.  Up to 64k characters allowed, which at the time of its creation was sufficient.  Taking my code point example above, we would see the following storage in UCS-2:

Code Point

Character

UCS-2 (hex)

U+0053

S

0x0053

U+0051

Q

0x0051

U+004C

L

0x004C

U+0054

T

0x0054

U+0075

u

0x0075

U+006E

n

0x006E

U+0061

a

0x0061

UTF-16 superseded UCS-2 and extended the implementation to allow code points beyond 0xFFFF.  If we go back to where I mentioned the upper range of the current Unicode standard, we have 0x10FFFF.  The first byte (0x10) can be thought of as the “plane”, and the 2 bytes following it as the character.  Up to 0x10 is allowed, meaning 17 planes.  Each 2-byte plane has 0x0000 to 0xFFFF (64k) possible combinations.  Plane 0x00 is the first plane, which contains the most common characters and is often referred to as the Basic Multilingual Plane (BMP).  Stay with me.  UTF-16 uses this in pretty much the same way as UCS-2, and they are almost identical.  However UTF-16 makes use of a reserved range, 0xD800 to 0xDFFF, to allow mappings into the other planes.  When mapping to other planes, an additional 2 bytes are required for each character, meaning UTF-16 varies between 2-byte and 4-byte encodings depending on whether the character exists within the BMP or not.  I’m certainly not going to regurgitate the full implementation when you can read it for yourself online, such as here.  Partly because it’s not relevant to this post, but also because I’d have to click on that link to get those details myself…

UTF-8?  That’s variable width also.  For the ASCII range it uses only one byte.  In fact, the first 7 bits are identical to ASCII, as UTF-8 was designed for backwards compatibility, meaning the first 128 characters will appear the same.  The 8th bit is used as a flag to signify more bytes are required to store the character, and in that circumstance other flags exist in the encoding to allow 2, 3 or 4 bytes for a character.

UTF-8’s popularity comes from its flexibility.  When passing characters in the ASCII range, UTF-8 needs no conversion, it is identical to ASCII and uses a single byte per character.  It only requires extra bytes for characters outside this range such as for non-Latin characters and those outside the BMP.  The prevalence of English across the web for example, means UTF-8 is mostly 1 byte per character but provides the flexibility to use other character sets.  Compare this to UTF-16 which requires at least 2 bytes per character – typically double the size.  Across the internet this can make a big difference.

SQL Server

SQL Server has two types of text storage, [var]char and n[var]char.  It’s pretty simple really.  The first is a 1-byte ASCII encoding, and the second is the 2-byte UCS-2 encoding of Unicode.

The upper 128 characters in the [var]char data type are determined by the Collation of the database, so this becomes very important when moving data in and out.  Ultimately the data is stored as a 1-byte number between 0 and 255, but the Collation determines the character and therefore the glyph you see when extracting the data.  It can also have an effect on clients/software used to import and export the data.  Some software will attempt to convert the text encoding depending on the supplied settings, therefore knowing the encoding is vital to avoid unwanted results.

An Example

A great example in the current world, is migrating a database from a local instance to WASD using the SQL Azure Migration Wizard, available on Codeplex here: http://sqlazuremw.codeplex.com/

It’s a great tool that essentially provides a wrapper for the bcp utility, automatically generating bcp commands to extract data from the chosen databases and objects, then creates the objects in Azure and bcp’s the data in.  There are also various checks performed in between for compatibility.

Let’s say both our local database and our Windows Azure SQL Database are using the SQL collation SQL_Latin1_General_CP1_CI_AS.  Often the local database will use the Windows collation Latin1_General_CI_AS, which is almost identical, however for this example I wanted them the same.

I create a table in the local database for migration, and insert a value:

if object_id('dbo.migrationtest_local') is not null
   
drop table dbo.migrationtest_local;
go

create table dbo.migrationtest_local
(
  
value varchar(64) not null
);
go

create clustered index ix_migrationtest_value on dbo.migrationtest_local (value);
go

insert into dbo.migrationtest_local
values ('he cc’d his friend''s wife.');
go

It looks like a normal string being inserted, but look closer.  The first apostrophe doesn’t need escaping as it’s a curly apostrophe and actually sits outside the normal character range of 32-127 for this collation.  It is actually character 146, or in hex 0x92.  I run a quick query to look at the data:

select value, cast(value as varbinary(64)) value_BIN
from dbo.migrationtest_local;
 
image

I’ve highlighted the difference above.  The 6th character (in red) is our curly apostrophe with a value of 0x92 (146).  The 19th character (in blue) is a standard apostrophe with a value of 0x27 (39).

As a comparison I run the same SQL on my WASD, the only difference being the suffix on the table name to identify the platform:

if object_id('dbo.migrationtest_wasd') is not null
   
drop table dbo.migrationtest_wasd;
go

create table dbo.migrationtest_wasd
(
  
value varchar(64) not null
);
go

create clustered index ix_migrationtest_value on dbo.migrationtest_wasd (value);
go

insert into dbo.migrationtest_wasd
values ('he cc’d his friend''s wife.');
go

I run the SELECT to see how the results compare:

select value value_wasd, cast(value as varbinary(64)) value_BIN_wasd
from dbo.migrationtest_wasd;

image

Identical, we see the same results.

Now to run the Azure Migration Wizard.  Leaving all the defaults, I select the option to analyze/migrate a database, then connect to my local instance and select the relevant database, in my case the database was called DBA.

I select a specific object, namely my new table dbo.migrationtest_local, and generate the SQL.  It generates the CREATE TABLE statement, and also displays the bcp commands it will use.  Here is my export bcp command:

bcp.exe "[DBA].[dbo].[migrationtest_local]"
  out "c:\SQLAzureMW\BCPData\dbo.migrationtest_local.dat" -E -n -C RAW -S localhostT
 
The key flags here are –n and –C RAW.  The default settings for export are to use native format and the RAW code page option.  This essentially means the data is extracted “as is”.  So for text the bytes are extracted without any conversion.  I opened the export file using a hex editor to check the data:
 
image

The first 2 bytes in red are the data length, in reverse order, i.e. we have 0x001A, or 26, bytes.  Also highlighted are the two apostrophes, and you can see that they have the same binary value as we saw in the database.  All good so far.

Next we connect to WASD via the wizard, select our target database, and execute the SQL and bcp command to create the table and import the data.

I run a SELECT against the new table in my SQL Database (note the table name – it is the same as our local database as the SQL was generated by the wizard):

select value value_wasd, cast(value as varbinary(64)) value_BIN_wasd
from dbo.migrationtest_local;

image

Uh-oh!  That’s not right is it?  What has happened?  We know the collation is the same, and the data looked correct when we ran the same SQL statements above.

Going back to the SQL Azure Migration Wizard, I grab the bcp command for the import:

bcp.exe "MyDBAStuff.dbo.migrationtest_local"
 in "c:\SQLAzureMW\BCPData\dbo.migrationtest_local.dat" -E -n -b 10000 -a 16384
 -qS <removed>.database.windows.net -U "<removed>@<removed>" -P "<removed>"

The import statement adds a few parameters, such as the credentials for starters which I’ve removed, but also batch size and packet size.  However, notice the one key omission?  Yup, the –C RAW option is not there.  By default, the tool isn’t configured to use it for import.  When omitted, bcp defaults to the –C OEM option, or in other words Original Equipment ManufacturerThis means conversion.  Not only that, but conversion based on the client, which in our case involves the host nodes in Azure, NOT just our SQL Database.

The converted character value of 0xC6 (198) appears to come from code page 850.  Both the curly apostrophe in our database collation AND the diphthong of AE from code page 850, map to the same Unicode code point of U+0092.  It seems the host node is using code page 850, most likely in the physical host database, and has converted the binary data before insertion.

The fix?  In this case using the –C RAW option for both export and import would work, or even omitting it for both.  This can be done in the SQL Azure Migration Wizard config file, or of course you can perform the tasks manually.

Conclusion

Knowing how text is stored and encoded is vital when you work with data, as is recognising the signs that something is wrong, or hasn’t converted correctly.  Different software and different machines can interpret text in different ways, so always try to be aware of any non-standard characters in your data and account for them when moving data around.

I have only touched on the subject of text encoding and collations, and will probably write further posts going into more depth, but as a starting point I think every data professional should know the details covered in this post.  It may save a lot of head scratching in the future when your text doesn’t look quite how you expected.

Friday 6 December 2013

When is a compressed page full?

I saw a question on twitter not too long ago regarding page compression.  It was posted by Klaus Aschenbrenner (blog|twitter) and read as follows:

What is SQL Server doing with free space on a page (e. g. low Fill Factor) when applying Page Compression? #sqlhelp

Andreas Wolter (blog|twitter) provided an answer Klaus seemed happy with, relating to this page in BOL which states that page compression is only applied once the page is full.

For me, there are still too many unanswered questions here.  What constitutes “full”?  Does FILLFACTOR influence this?  If not, does that mean setting a FILLFACTOR on an index render page compression effectively pointless?

The only way to show what happens is to run through a compression scenario.  To start with we need a table, uncompressed of course.  On the basis that page level compression always applies row compression until it is “full” we want to make the rows fairly difficult to compress.  We also want only a few rows to fill a page to help test this.  So we use a 4-byte varbinary ID and a 790-byte varbinary filler.  Both columns carry a 2-byte overhead as they are variable length, as well as an extra 2 bytes holding the count of variable length columns.  We also have the various row overheads, in our case 7 bytes, so in total our record is 4+2+790+2+2+7 = 807 bytes (see a great description of the row structure here: http://www.practicalsqldba.com/2012/08/sql-serverunderstanding-data-record.html).  Varbinary is unaffected by row compression, so in theory this should allow us 10 records per page.

Tip: a data page is 8192 bytes.  96 bytes are used for the header, plus a 2-byte overhead for each row offset at the foot of the page.  Therefore for 10 rows we are left with 8192-96-(10*2) = 8076 bytes to fill.  That’s why I chose a record length of 807 bytes…

Let’s create the table and add 10 records:

-- drop if exists
if object_id('dbo.CompressionTest') is not null
   
drop table dbo.CompressionTest;
go

-- create table uncompressed
create table dbo.CompressionTest
(
    
ID varbinary(4) not null
   
,Filler varbinary(790) not null
   
,constraint pk_CompressionTest_ID primary key clustered (ID)
);

go

-- add 10 rows
insert into dbo.CompressionTest
(ID, Filler)
select 0x00000001, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000002, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000003, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000004, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000005, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000006, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000007, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000008, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000009, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x0000000A, cast(replicate(0x01,790) as varbinary(790));

go

Let’s make sure we only have 1 data page:

declare @db sysname;
set @db = db_name();

dbcc ind(@db,'dbo.CompressionTest',1);

go

Results:

image

You can see that the highlighted row has a PageType of 1, which is a data page.  We only have 1 of them, ID 1297.  Let’s have a look at the page header:

declare @db sysname;
set @db = db_name();

dbcc traceon(3604);

dbcc page (@db,1,1297,0);

go

PAGE: (1:1297)


BUFFER:


BUF @0x000000037C2A2100

bpage = 0x000000036F0D4000          bhash = 0x0000000000000000          bpageno = (1:1297)
bdbid = 5                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 15638                       bstat = 0x10b
blog = 0x215acccc                   bnext = 0x0000000000000000         

PAGE HEADER:


Page @0x000000036F0D4000

m_pageId = (1:1297)                 m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 276   m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594056015872                               
Metadata: PartitionId = 72057594048675840                                Metadata: IndexId = 1
Metadata: ObjectId = 622625261      m_prevPage = (0:0)                  m_nextPage = (0:0)
pminlen = 4                         m_slotCnt = 10                      m_freeCnt = 6
m_freeData = 8166                   m_reservedCnt = 0                   m_lsn = (544:212:31)
m_xactReserved = 0                  m_xdesId = (0:0)                    m_ghostRecCnt = 0
m_tornBits = 0                      DB Frag ID = 1                     

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED             
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED 

I have highlighted a couple of key header attributes above.  Firstly, m_slotCnt confirms we have 10 records on the page.  Secondly, m_freeCnt shows we have 6 bytes remaining – this fits perfectly with our calculation above.

So if we add one more record then check again, we should see another page added:

insert into dbo.CompressionTest
(ID, Filler)
select 0x0000000B, cast(replicate(0x01,790) as varbinary(790));

go

declare @db sysname;
set @db = db_name();

dbcc ind(@db,'dbo.CompressionTest',1);

go
image

And that’s it confirmed.  Page 1300 has been added, and also an index page (PageType 2) which is now needed due to there being more than one data page.

So, to the nitty gritty.  Let’s re-run the steps above but with page compression turned on:

-- drop if exists
if object_id('dbo.CompressionTest') is not null
   
drop table dbo.CompressionTest;
go

-- create table uncompressed
create table dbo.CompressionTest
(
    
ID varbinary(4) not null
   
,Filler varbinary(790) not null
   
,constraint pk_CompressionTest_ID primary key clustered (ID)
) with (data_compression = page);

go

-- add 10 rows
insert into dbo.CompressionTest
(ID, Filler)
select 0x00000001, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000002, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000003, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000004, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000005, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000006, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000007, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000008, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x00000009, cast(replicate(0x01,790) as varbinary(790)) union all
select 0x0000000A, cast(replicate(0x01,790) as varbinary(790));

go

declare @db sysname;
set @db = db_name();

dbcc ind(@db,'dbo.CompressionTest',1);

go

I’ve highlighted the only difference, and also included the DBCC IND step.  The output is as we expect with only one data page created:

image

So what about the page header?

PAGE: (1:1297)


BUFFER:


BUF @0x000000037C2A2100

bpage = 0x000000036F0D4000          bhash = 0x0000000000000000          bpageno = (1:1297)
bdbid = 5                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 17272                       bstat = 0xb
blog = 0xcccc7acc                   bnext = 0x0000000000000000         

PAGE HEADER:


Page @0x000000036F0D4000

m_pageId = (1:1297)                 m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 278   m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594056146944                               
Metadata: PartitionId = 72057594048806912                                Metadata: IndexId = 1
Metadata: ObjectId = 686625489      m_prevPage = (0:0)                  m_nextPage = (0:0)
pminlen = 3                        m_slotCnt = 10                      m_freeCnt = 56
m_freeData = 8116                   m_reservedCnt = 0                   m_lsn = (544:379:31)
m_xactReserved = 0                  m_xdesId = (0:0)                    m_ghostRecCnt = 0
m_tornBits = 0                      DB Frag ID = 1                     

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED             
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

I’ve highlighted the same attributes.  Notice that for the same number of records, we have 50 extra bytes due to row compression.

Now, as per BOL if we add one more record the page compression should kick in:

insert into dbo.CompressionTest
(ID, Filler)
select 0x0000000B, cast(replicate(0x01,790) as varbinary(790));

go

declare @db sysname;
set @db = db_name();

dbcc ind(@db,'dbo.CompressionTest',1);

go

image

Success, we still only have one data page!  It’s worth noting at this point why I used a consistent value in my Filler column – if the values were unique we wouldn’t get any page compression, but by repeating the value across rows we see the compression benefits.

Again, let’s have a look at the page header:

PAGE: (1:1297)


BUFFER:


BUF @0x000000037C2A2100

bpage = 0x000000036F0D4000          bhash = 0x0000000000000000          bpageno = (1:1297)
bdbid = 5                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 11930                       bstat = 0xb
blog = 0xcc7acccc                   bnext = 0x0000000000000000         

PAGE HEADER:


Page @0x000000036F0D4000

m_pageId = (1:1297)                 m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x80               m_level = 0                         m_flagBits = 0x8000
m_objId (AllocUnitId.idObj) = 279   m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594056212480                               
Metadata: PartitionId = 72057594048872448                                Metadata: IndexId = 1
Metadata: ObjectId = 718625603      m_prevPage = (0:0)                  m_nextPage = (0:0)
pminlen = 3                         m_slotCnt = 11                      m_freeCnt = 7168
m_freeData = 1002                   m_reservedCnt = 0                   m_lsn = (544:459:39)
m_xactReserved = 0                  m_xdesId = (0:0)                    m_ghostRecCnt = 0
m_tornBits = 0                      DB Frag ID = 1                     

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED             
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

So we have 11 records, as we know, and page compression has left us with 7168 free bytes on the page.  Pretty impressive, although it’s obviously data dependent, and remember that I purposely chose row values that would compress well.

How does FILLFACTOR affect this?  To test it out I ran the same steps as above but with the FILLFACTOR set to 50, that is, when inserting records through a DDL operation* a new page will be used once the previous page is 50% full.

* with DML operations SQL Server fills every page to 100% regardless of the FILLFACTOR setting.  Only on creating a new index from existing data, or REBUILDing an index, does the FILLFACTOR apply.  We therefore need to REBUILD an index after inserting data in order to see the effects.

In our case, we insert 6 records without page compression, then REBUILD our clustered index.  I’ll not re-post all of the SQL above, merely highlight the changes from here on:

create table dbo.CompressionTest
(
....
   
,constraint pk_CompressionTest_ID primary key clustered (ID)
       
with (fillfactor = 50)
) with (data_compression = none);
 
....
 
alter index pk_CompressionTest_ID on dbo.CompressionTest
rebuild

go

Looking at DBCC IND we still have only one data page.  If you add one more record, then REBUILD the index once again, you’ll see the FILLFACTOR come into play:

image

PAGE: (1:1297)


BUFFER:


BUF @0x000000037C2A2100

bpage = 0x000000036F0D4000          bhash = 0x0000000000000000          bpageno = (1:1297)
bdbid = 5                           breferences = 0                     bcputicks = 103
bsampleCount = 1                    bUse1 = 16349                       bstat = 0x109
blog = 0xccccdc9a                   bnext = 0x0000000000000000         

PAGE HEADER:


Page @0x000000036F0D4000

m_pageId = (1:1297)                 m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x220
m_objId (AllocUnitId.idObj) = 286   m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594056671232                               
Metadata: PartitionId = 72057594049331200                                Metadata: IndexId = 1
Metadata: ObjectId = 846626059      m_prevPage = (0:0)                  m_nextPage = (1:1299)
pminlen = 4                         m_slotCnt = 6                       m_freeCnt = 3242
m_freeData = 4938                   m_reservedCnt = 0                   m_lsn = (545:329:28)
m_xactReserved = 0                  m_xdesId = (0:5180845)              m_ghostRecCnt = 0
m_tornBits = 196210610              DB Frag ID = 1                     

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED             
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

The initial 6 records have remained on the first page after the REBUILD, and the 7th record has been moved to the next page.

Let’s try this with page compression turned on.  Repeating the steps, here is the final output of DBCC IND and PAGE:

image

PAGE: (1:1299)


BUFFER:


BUF @0x000000037C22D200

bpage = 0x000000036F822000          bhash = 0x0000000000000000          bpageno = (1:1299)
bdbid = 5                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 16634                       bstat = 0x109
blog = 0x5ab21c9a                   bnext = 0x0000000000000000         

PAGE HEADER:


Page @0x000000036F822000

m_pageId = (1:1299)                 m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x80               m_level = 0                         m_flagBits = 0x220
m_objId (AllocUnitId.idObj) = 289   m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594056867840                               
Metadata: PartitionId = 72057594049527808                                Metadata: IndexId = 1
Metadata: ObjectId = 910626287      m_prevPage = (0:0)                  m_nextPage = (0:0)
pminlen = 3                         m_slotCnt = 7                       m_freeCnt = 7212
m_freeData = 966                    m_reservedCnt = 0                   m_lsn = (545:422:25)
m_xactReserved = 0                  m_xdesId = (0:5180858)              m_ghostRecCnt = 0
m_tornBits = 312998929              DB Frag ID = 1                     

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED             
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

It seems page compression has come into play.  SQL Server has used the FILLFACTOR setting on *uncompressed* data to work out what constitutes a “full” page, at which point page compression has been applied and the extra row added to the same page.

But can we have a “double” FILLFACTOR?  That is, when the compressed page reaches 50% full will a new page be added?  If we insert 5 more records, but this time use *unique* values in the Filler column to stop them compressing:

insert into dbo.CompressionTest
(ID, Filler)
select 0x00000011, cast(replicate(0x11,790) as varbinary(790)) union all
select 0x00000012, cast(replicate(0x12,790) as varbinary(790)) union all
select 0x00000013, cast(replicate(0x13,790) as varbinary(790)) union all
select 0x00000014, cast(replicate(0x14,790) as varbinary(790)) union all
select 0x00000015, cast(replicate(0x15,790) as varbinary(790));

go

Looking at the page header output, we see 12 records inserted and more than half the page is used.  Now running a REBUILD of the index to force the FILLFACTOR setting gives us the following:

image

PAGE: (1:1299)


BUFFER:


BUF @0x000000037C22D200

bpage = 0x000000036F822000          bhash = 0x0000000000000000          bpageno = (1:1299)
bdbid = 5                           breferences = 0                     bcputicks = 0
bsampleCount = 0                    bUse1 = 18109                       bstat = 0x109
blog = 0x1c9adc9a                   bnext = 0x0000000000000000         

PAGE HEADER:


Page @0x000000036F822000

m_pageId = (1:1299)                 m_headerVersion = 1                 m_type = 1
m_typeFlagBits = 0x80               m_level = 0                         m_flagBits = 0x220
m_objId (AllocUnitId.idObj) = 301   m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594057654272                               
Metadata: PartitionId = 72057594050314240                                Metadata: IndexId = 1
Metadata: ObjectId = 1006626629     m_prevPage = (0:0)                  m_nextPage = (1:1325)
pminlen = 3                         m_slotCnt = 11                      m_freeCnt = 4000
m_freeData = 4170                   m_reservedCnt = 0                   m_lsn = (546:350:28)
m_xactReserved = 0                  m_xdesId = (0:5180912)              m_ghostRecCnt = 0
m_tornBits = 1076126332             DB Frag ID = 1                     

Allocation Status

GAM (1:2) = ALLOCATED               SGAM (1:3) = ALLOCATED             
PFS (1:1) = 0x60 MIXED_EXT ALLOCATED   0_PCT_FULL                        DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED

The twelfth record has been moved to a new page, so there it is – proof of what SQL Server considers a “full” page.  It uses the FILLFACTOR based on uncompressed data as a trigger for page compression, then applies a FILLFACTOR once page compression is active.

Predicting storage overhead when compression is involved is always tricky, as it varies so much depending on the data types used and the data stored.  Using sp_estimate_data_compression_savings is still the best option – from experimentation it seems to account for the FILLFACTOR setting.

As a final note – be aware of your FILLFACTOR setting before REBUILDing an index, compressed or not.  You could end up with a much larger data footprint than you expected.

Oh yeah, I ran all of this on SQL 2012 EE SP1.  I’d love to know if anyone sees different results on a different version.