BLOB SQL Server: Integrating document storage
Mar 3, 2015 • 3 Minute Read
NoSQL platforms can support highly scalable databases with BLOB attachments (images, documents, and other files), but if you think you need to embrace a NoSQL solution in lieu of SQL Server just because you have a high volume of BLOBs in your database, think again.
If you have good reasons to go with NoSQL, and you can accept the compromises of eventual consistency, it can fit the bill nicely. But critical line-of-business applications can't always afford the relaxed constraints of NoSQL databases, and usually require strongly typed schemas with full transactional integrity; that is, a full-fledged relational database management system (RDBMS). However, relational database platforms like SQL Server were originally designed and optimized to work primarily with structured data, not BLOBs. So historically, it was never feasible to store large amounts of BLOB data directly in the database. That is, until FILESTREAM.
With FILESTREAM, Microsoft addresses the dilemma of storing BLOBs within the relational database. My new Pluralsight course, SQL Server 2012-2014 Native File Streaming, explains this innovative feature in detail. In this post, we'll cover how FILESTREAM (and its related technologies) can be used to implement a highly scalable BLOB storage solution that's fully integrated with a relational SQL Server database. You'll also find live demos on everything covered by this post in the course.
Introducing FILESTREAM
Although SQL Server was never originally intended to handle BLOBs at scale, that's no longer the case with FILESTREAM (introduced in SQL Server 2008). Before FILESTREAM, SQL Server was forced to shove BLOBs into the standard database filegroups, which are optimized for storing structured row data in 8 KB pages. Because BLOBs don't fit naturally within this structure, they get pushed into off-row storage, which bloats the structured filegroups and ultimately kills performance.
FILESTREAM changes all that. First, to be clear, FILESTREAM is not actually a data type. Rather, it's an attribute that you apply to the varbinary(max) data type, the same data type you would use to store BLOBs directly inside the row. But by merely appending the FILESTREAM attribute to the varbinary(max) data type, SQL Server takes a radically different approach to physical BLOB storage.
Rather than inundating the standard database filegroups with BLOBs, SQL Server stores BLOB content as files in the file system where they belong. At the same time, it establishes and maintains reference points between the rows in the standard filegroups and the files in the file system that are tied to varbinary(max) columns in those rows. All this magic occurs behind the scenes, and is totally transparent to any existing code that works with ordinary varbinary(max) columns.
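To make this concrete, here's a sketch of how the PhotoAlbum table used later in this post might be set up. The database, filegroup, and directory names are assumptions for illustration, and FILESTREAM must already be enabled at the instance level; note that any table with a FILESTREAM column also requires a uniqueidentifier ROWGUIDCOL column:

```sql
-- Add a FILESTREAM filegroup to the database (names and path are illustrative)
ALTER DATABASE PhotoLibrary
 ADD FILEGROUP PhotoLibrary_BLOBs CONTAINS FILESTREAM;

ALTER DATABASE PhotoLibrary
 ADD FILE (NAME = 'PhotoLibrary_BLOBs', FILENAME = 'C:\Demo\Photos')
 TO FILEGROUP PhotoLibrary_BLOBs;

-- FILESTREAM tables require a uniqueidentifier ROWGUIDCOL column
CREATE TABLE PhotoAlbum(
 PhotoId int PRIMARY KEY,
 RowId uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
 PhotoDescription varchar(max),
 Photo varbinary(max) FILESTREAM DEFAULT(0x));
```

With this in place, the Photo column reads and writes like any other varbinary(max) column, while SQL Server stores its content as files under the FILESTREAM filegroup's directory.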
In this manner, the BLOB data is physically stored separately from structured row data, but it's an integral part of the database. So for example, backing up the database includes the BLOB data, with the option of performing a partial backup that excludes the FILESTREAM filegroup when you want to create smaller backups that don't include BLOB data.
Furthermore, this solution provides full transactional consistency, because FILESTREAM integrates with the NTFS file system, and NTFS is a transactional file system. So, when you start a database transaction and insert a row, and that row includes BLOB data stored in a varbinary(max) FILESTREAM column, then SQL Server automatically initiates an NTFS file system transaction over that data. Then, the fate of the file system transaction hinges on that of the database transaction. If and when the database transaction commits, SQL Server will also commit the NTFS file system transaction; similarly, rolling back the database transaction automatically rolls back the NTFS transaction.
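A minimal sketch of that behavior, using the PhotoAlbum table from the examples below: rolling back the database transaction also rolls back the NTFS file system activity, so no orphaned file is left behind.

```sql
BEGIN TRANSACTION;

INSERT INTO PhotoAlbum(PhotoId, PhotoDescription, Photo)
 VALUES(1, 'Test photo', 0x);  -- SQL Server creates a file in the NTFS store

ROLLBACK TRANSACTION;  -- both the row and its backing NTFS file are discarded
```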
Accessing BLOBs with T-SQL
With FILESTREAM, you can treat BLOBs as ordinary varbinary(max) columns in T-SQL. For example, you can use the OPENROWSET function with the BULK provider to import an external file into a varbinary(max) column. And if that column is decorated with the FILESTREAM attribute, SQL Server will automatically store a copy of it as a BLOB in the NTFS file system behind the scenes, rather than force-fitting it into the standard database filegroups.
For example:
INSERT INTO PhotoAlbum(PhotoId, PhotoDescription, Photo)
VALUES(3, 'Mountains',
(SELECT BulkColumn FROM OPENROWSET(BULK 'C:\Demo\Ascent.jpg', SINGLE_BLOB) AS x))
To retrieve BLOBs, it's a simple SELECT:
SELECT * FROM PhotoAlbum
Using SqlFileStream and the Streaming API
Although FILESTREAM delivers scalable storage by leveraging the NTFS file system behind the scenes, BLOB access needs to scale as well. It's great that you can enjoy total transparency by just using T-SQL, but stop for a moment and consider what SQL Server must do to retrieve BLOBs with T-SQL. To serve up the Photo column in the result set of the SELECT statement above, for example, SQL Server needs to read the entire contents of each BLOB from the NTFS file system it manages internally, and this can easily place a great deal of memory pressure on the server.
To address this concern, FILESTREAM exposes the streaming API. When you use this API, SQL Server still manages the file system behind the scenes, only it shifts the burden and memory requirements of actually reading and writing BLOBs in the file system off itself and onto the client application. This keeps the memory requirements on SQL Server very lean, regardless of how large your BLOBs may be.
The SqlFileStream class is a managed code wrapper around the streaming API, which makes it extremely easy to use from .NET. In C# or VB .NET, you start a database transaction and issue an INSERT statement, but you don't actually include the BLOB content with the INSERT statement. Instead, SQL Server passes you back the information you need to create a SqlFileStream object.
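The information SQL Server passes back consists of two things you retrieve with T-SQL inside the transaction: the logical path name of the BLOB (from the PathName() method) and a token binding file system access to the current database transaction (from GET_FILESTREAM_TRANSACTION_CONTEXT()). A sketch, using the table from the earlier example:

```sql
BEGIN TRANSACTION;

-- PathName() returns the logical path that SqlFileStream needs;
-- GET_FILESTREAM_TRANSACTION_CONTEXT() returns a token tying file
-- system access to the current database transaction
SELECT Photo.PathName() AS FilePath,
       GET_FILESTREAM_TRANSACTION_CONTEXT() AS TxContext
  FROM PhotoAlbum
 WHERE PhotoId = 3;

-- ...client code opens a SqlFileStream over FilePath/TxContext
--    and streams the BLOB in or out...

COMMIT TRANSACTION;
```

The client constructs a SqlFileStream from those two values and streams directly against NTFS, while SQL Server merely brokers the handshake.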
This object inherits from the base System.IO.Stream class, meaning that it supports all the standard read/write methods of standard .NET stream classes, including memory streams, HTTP request/response streams and local file streams. So it's easy to then stream your BLOBs in and out, using buffers in memory allocated to your application, not SQL Server. Then, you just commit the database transaction, and SQL Server automatically commits the NTFS file system transaction at the same time.
In my course, I show you SqlFileStream up close, and demonstrate how to program against the streaming API from a thick client application, a client/server (Web) application, and in an n-tier (WCF) scenario.
Introducing FileTable
The FILESTREAM story only gets better with FileTable, added in SQL Server 2012. While FILESTREAM revolutionizes BLOB storage in SQL Server, it's only accessible to developers and administrators. What about ordinary users? They're certainly not going to write T-SQL or streaming API code to access BLOBs. And there's also no way for ordinary client applications to access FILESTREAM data.
The solution is FileTable, which combines FILESTREAM with the hierarchyid data type to furnish an "emulated" file system (a file system that users and applications can work with, but which is really a FileTable in a SQL Server database). A FileTable is just a regular table, except that it has a fixed schema; specifically, it has pre-determined columns for the metadata of the emulated file system: path_locator, name, is_directory, file_stream, and a set of file attribute columns (creation time, last write time, and so on).
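Because the schema is fixed, creating a FileTable takes no column list; you name only the table and, optionally, the root directory it exposes (the names here are illustrative):

```sql
-- Requires FILESTREAM and non-transactional access enabled on the database
CREATE TABLE PhotoFiles AS FILETABLE
 WITH (FILETABLE_DIRECTORY = 'PhotoFiles');
```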
Every row in a FileTable represents either a file or a folder (depending on the is_directory column), and the hierarchyid value in the path_locator column is what implements the folder structure of the emulated file system. The hierarchyid data type has methods that you can use to query and manipulate the structure; for example, you can programmatically move entire subtrees from one parent to another.
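For example, assuming a FileTable named PhotoFiles and a folder row named 'Vacations' (both names hypothetical), queries like these use the hierarchyid methods GetLevel() and IsDescendantOf() to explore the emulated folder structure:

```sql
-- Depth of each item in the emulated file system
SELECT name, is_directory, path_locator.GetLevel() AS Depth
  FROM PhotoFiles;

-- All files beneath a given folder row
DECLARE @folder hierarchyid =
 (SELECT path_locator FROM PhotoFiles
   WHERE name = 'Vacations' AND is_directory = 1);

SELECT name
  FROM PhotoFiles
 WHERE path_locator.IsDescendantOf(@folder) = 1
   AND is_directory = 0;
```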
For rows that represent files, the file_stream column holds the actual BLOB, and this is a varbinary(max) FILESTREAM column. So behind the scenes, it is stored in the NTFS file system just like a varbinary(max) FILESTREAM column of an ordinary table (a non-FileTable) would be.
And so, in addition to being able to use T-SQL or the streaming API with a FileTable, the emulated file system that a FileTable represents also gets exposed to users and client applications via a Windows file share. As a result, changes made to the table in the database are reflected in the emulated file system and, conversely, changes made to the emulated file system by users or client applications are reflected automatically in the database, which ultimately pushes down into the physical NTFS file system being used for BLOB storage behind the scenes.
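You can discover where the emulated file system surfaces on the Windows file share with the FileTableRootPath() function and the GetFileNamespacePath() method of the file_stream column (FileTable name assumed, as above):

```sql
-- UNC path of the FileTable's root directory on the Windows file share
SELECT FileTableRootPath('dbo.PhotoFiles');

-- Path of each file relative to the FileTable root
SELECT name, file_stream.GetFileNamespacePath() AS RelativePath
  FROM PhotoFiles;
```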
Summary
With FILESTREAM, line-of-business applications can embrace scalable BLOB integration without being forced to consider a NoSQL alternative to SQL Server. If you're interested in taking a more in-depth look at how it all works, be sure to check out my SQL Server 2012-2014 Native File Streaming course.