Using FOR XML PATH to Combine Multiple Record Values into One Column

Scenario

I have been asked on numerous occasions to provide a data feed that provides data for a customer, or plant or some other entity with a catch, can you take a one to many relationship and make one column of data out of it. That answer, of course, is “Sure, I can do anything!”  but then I get into the how and I start to build a stored procedure that uses temp tables, involves a WHILE loop of some kind and I have to manually code the correct answer. The process is nothing complicated but it definitely involved some extra steps before providing back the results.

My thinking though is that it would be great if I could do that in a SQL statement and let SQL Server do all of the work. Low and behold, I ran across a method that turned my original process into a single query statement, which obviously is much simpler!

Example Data Set

To illustrate, I have the following scenario:

  • Please provide a Customer list by State and include all of the Customer Regions in one column on each state’s row.

Simple request but the issue is that within a state, I have multiple regions depending on a postal code of the address of the customer office.

In my sample database, my data is broken out into three tables: (for the sake of being able to create this locally as an example, I have done everything in tempdb)

create table #tmpCustomer (CustomerId int identity, CustomerName varchar(max))

create table #tmpCustomerAddress (AddressId int identity, CustomerId int, StateAbbreviation varchar(5), PostalCode varchar(10))

create table #tmpPostalCode (PostalCodeId int identity, PostalCode varchar(20), PostalRegion varchar(30))

The relationships are:

  • A customer can have multiple addresses
  • For a state, you can have multiple regions based upon the postal code
  • The postal code belongs to a postal region

These relationships imply that a customer can have multiple regions per state as well, where the postal code of the customer address will map to the region.

For sample data, the following inserts are executed:

insert into #tmpCustomer (CustomerName)

select 'Customer A'

 

 

insert into #tmpCustomerAddress(CustomerId, StateAbbreviation, PostalCode)

select 1, 'PA', '19111'

union

select 1, 'PA', '19103'

union

select 1, 'NJ', '08034'

 

insert into #tmpPostalCode (PostalCode, PostalRegion)

select '19111', 'NE Philly'

union

select '19103', 'Center City'

union

select '08034', 'South Jersey'

Query Solution

Now, if we queried the data using straight joins we would get the following:

select

            c.CustomerName,

            ca.StateAbbreviation,

            pc.PostalRegion

from

            #tmpCustomer c

            inner join #tmpCustomerAddress ca

            on c.CustomerId = ca.CustomerId

           

            inner join #tmpPostalCode pc

            on ca.PostalCode = pc.PostalCode

CustomerName StateAbbreviation PostalRegion
Customer A NJ South Jersey
Customer A PA Center City
Customer A PA NE Philly

However, the above dataset does not provide the solution, the request is to roll up the State Regions for the customer into one row per state.

The following query gives us the answer:

select distinct

            c.CustomerName,

            ca.StateAbbreviation,

            substring(

            (select '; ' + ipc.PostalRegion

            from #tmpPostalCode ipc

                        inner join #tmpCustomerAddress ica

                        on ipc.PostalCode = ica.PostalCode

            where

                        ica.CustomerId = ca.CustomerId

                        and ica.StateAbbreviation = ca.StateAbbreviation

            for xml path('')), 3, 1000) as Region

from

            #tmpCustomer c

            inner join #tmpCustomerAddress ca

            on c.CustomerId = ca.CustomerId

           

            inner join #tmpPostalCode pc

            on ca.PostalCode = pc.PostalCode

Here are the results:

CustomerName StateAbbreviation Region
Customer A NJ South Jersey
Customer A PA Center City; NE Philly


The key of the query is using the for xml path('') syntax as a subquery in the select portion of the field list, as well as using the distinct keyword to remove duplicate entries.

To note, the for xml path('') statement would put in xml tags if the column was named in the subquery but since we left the column name empty when we concatenated the semicolon and the postal region, the xml nodes are empty. The extra substring is just to parse off the preceding delimiter that we start all records with.

The above example could then be used as an inner query tied to other tables within the database for more fields that are also at a single Customer State level.

Conclusion

As the above example showed, the ability to roll up a one to many relationship into a single row set can be be very useful and it would have saved me time in the past if I had come across this method earlier. In addition, having the ability to do this using one T-SQL statement as opposed to having to generate a loop and a procedure to build your dataset is definitely a simpler and in my opinion a better answer.

Database Partitioning with the Entity Framework

 

Problem

Requirements on a recent project called for a cloud-based solution and support for a lot of data. Microsoft Windows Azure was selected as the platform and Microsoft SQL Azure was selected for the database. How much is a lot of data? It was expected to be much greater than 50GB. Now we had a problem, since the maximum size of a SQL Azure database was 50GB and SQL Azure Federation has not yet been released.

 

Solution

An additional technical requirement for the solution was to use the latest Microsoft Entity Framework. A ready-made solution for database partitioning on SQL Azure with the Entity Framework was not to be found, so we developed our own implementation. The key features of the solution included:

1. Minimal impact to the development team, who were already familiar with EF

2. Ability to add additional databases on-the-fly as one or more database approached capacity

3. Self-leveling of the databases: they should effectively grow at the same rate.

 

Partitioning Strategy

Regardless of the implementation, selecting the appropriate partitioning strategy is critical. We chose a horizontal partition. The Atomic Unit for the partition (or shard) was a Customer. All transactions (Orders) and other data related to a Customer are in the same shard. The benefits of horizontal partitioning in this manner include:

1. Performance: Once a customer was selected, all database activity for that customer was in the same database. There was no spanning of databases to store or retrieve data related to the current customer. Only Customer searches needed to span databases.

2. Self-leveling: The logic for which database to insert the next new customer was simple – always add to the smallest database.

3. Positioned for Growth: Additional databases could be added without having to re-partition or divide existing data.

 

Implementation Details

The two main areas of the solution that required enhancements to an otherwise ‘standard’ EF implementation were searches and management of connection strings. Searches had to span databases and return which database a customer was found in as part of the result. Likewise, knowing which database a specific customer belonged to had to be easily passed into EF methods and translated to full connection strings when necessary.

Searches

We chose parallel LINQ queries for searches. See below for an example. An array of connections (how many shard databases we currently have) is maintained in a static ConnectionFactory. So, c.Item1 in the final select below contains the shard where the result was found. c.Item2 contains the full connection string passed into the query.Invoke().

public static IEnumerable<TResult> ExecuteAgainstShards<TArg0, TArg1, TResult, TPreShape>(

this Func<TArg0, TArg1, IEnumerable<TPreShape>> query, TArg1 arg1, Func<string, TPreShape, TResult> shapeResult) where TArg0 : ObjectContext, new()

{

var result = from c in ConnectionFactory.Instance.CreateConnections<TArg0>().AsParallel()

from x in ExecuteWithRetry(() => query.Invoke(c.Item2, arg1))

select shapeResult(c.Item1, x);

return result.ToArray();

}

The web.config contains only the connection string for the first shard and there are some helper methods to parse and create connection strings as needed to access the remaining shards. The web.config (or ServiceConfiguration.cscfg in Windows Azure) also contains the number of active shards used by the ConnectionFactory. This provides the mechanism for adding databases without changing code or adding additional connection strings.

Below is an example compiled query and its calling method that leverage the ExecuteAgainstShards parallel execution.

private readonly Func<MyEntities, CustomerSearchTerms, IQueryable<Data.Person>> customerSearch = CompiledQuery.Compile<MyEntities, CustomerSearchTerms, IQueryable<Data.Person>>

(

(es, search) =>

from p in es.People

where

p.Customers.Count() >= 1 && (search.CustomersId == null ||

p.Customers.Count(e => e. CustomerID == search. CustomerId) > 0) &&

(string.IsNullOrEmpty(search.FirstName) || p.FirstName.StartsWith(search.FirstName)) &&

(string.IsNullOrEmpty(search.LastName) || p.LastName.StartsWith(search.LastName)) &&

select p

);

public IEnumerable<CustomerPersonReference> FindCustomer(CustomerSearchTerms terms)

{

var items = customerSearch.ExecuteAgainstShards

(

terms,

(shardName, person) =>

{

var patient = person.Customers.First();

return new CustomerPersonReference

( CommonRoutines.MapHeader(person, shardName),

});

return items.OrderBy(p => p.Item1.LastName).ThenBy(p => p.Item1.FirstName). ToArray();

}

Connection Strings

As mentioned above, once a Customer has been found in a specific shard, all data related to that Customer will be and must be in that shard. In addition to the shard helper methods for generating and parsing connection string, the development team needed to be disciplined to always use the EF constructor that included a connection string. For example:

using (var entities = new MyEntities(ShardUtilities.GetConnectionString(customerShard))

{

// real work goes here

}

where customerShard is likely the shard returned from the search query above.

 

Summary

Implementation of a database partitioning solution with the Microsoft Entity Framework was necessary to overcome the current SQL Azure database size limitation. But, the solution is certainly not limited to SQL Azure. If you are using a version of Microsoft SQL Server that does not support partitioning or if you want to maintain a database solution that will work with either SQL Azure or on-premise SQL Server, this fairly straightforward implementation might come in handy.