In one of my current projects, Tempng, I needed a way to find the users closest to a customer's location on a per-request basis. A customer makes a request, which kicks off an Azure WebJob that does the actual work of finding the closest users and processing the results. The best way I've found to do this is the SQL Server `geography` data type.
The Background
Both the customers and users have a set location (their address) that is stored in our database, so that makes it a little easier. Since all of the locations in my project are static, I considered calculating the distance between every customer and every user and storing that in a table, but that would get out of hand very quickly. So I went with the `geography` data type. The main reason I chose it was the geospatial proximity calculations it provides. Using LINQ, I can easily query based on the distances between locations.
For instance, say I wanted to find the users less than 10 miles from my customer Bob. The query is as simple as:
var users = from g in context.GeoTests
            let distance = g.Location.Distance(bobsLocation)
            where distance < 10 * 1609.344 // 10 miles, in meters (1609.344 meters per mile)
            orderby distance
            select new { UserId = g.UserId, DistanceInMiles = (distance / 1609.344) };
Note that the `distance` value is calculated in meters, so we divide by the number of meters in a mile (1609.344) to convert to miles.
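Here `bobsLocation` is just another `DbGeography` point built from the customer's stored coordinates. A minimal sketch (the coordinates below are made up for illustration):

// Bob's stored longitude/latitude as well-known text: POINT(lon lat), SRID 4326
var bobsLocation = DbGeography.FromText("POINT(-96.8 32.8)", 4326);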
But Does it Scale?
Testing at a small scale worked very well, so I was curious about the scaling performance of such queries. I decided to do a few tests. Here's the table structure I used:
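In EF terms, each test record boils down to an integer key, the raw coordinates, and a `DbGeography` column. A minimal sketch of the entity (the names and column types are inferred from the insert code below, so treat them as assumptions):

public class GeoTest
{
    public int Id { get; set; }                // primary key
    public decimal Lat { get; set; }           // raw latitude
    public decimal Lon { get; set; }           // raw longitude
    public DbGeography Location { get; set; }  // SQL Server geography column
}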
Creation Performance
First I wanted to test the performance of creating and inserting the `geography` values into my database. I tested this by populating the tables with random US locations. I found the longitude and latitude bounds by very scientifically clicking what I thought looked like the northernmost, southernmost, westernmost, and easternmost points in Google Maps:
- Northernmost: 49.259766, -95.616505
- Southernmost: 26.280313, -97.530646
- Westernmost: 40.497140, -124.307168
- Easternmost: 44.774586, -67.034416
I used the following code, run from my laptop against an Azure SQL Database, to populate the table.
using (var context = new GeoTestEntities())
{
    for (var i = 0; i < RECORDS_TO_CREATE; i++)
    {
        // Random point inside the rough US bounding box found above
        var lon = RandomNumberBetween(-124, -67);
        var lat = RandomNumberBetween(26, 50);

        var geo = new DAL.GeoTest()
        {
            Lon = lon,
            Lat = lat,
            // Well-known text is "POINT(longitude latitude)"; 4326 is the SRID
            Location = DbGeography.FromText("POINT(" + lon + " " + lat + ")", 4326)
        };

        context.GeoTests.Add(geo);
        context.SaveChanges();
    }
}
You'll need to add the `System.Data.Entity.Spatial` namespace to get the `DbGeography` class. The `RandomNumberBetween()` method returns a decimal. The value `4326` is the spatial reference system identifier (SRID) for WGS 84, the coordinate system most commonly used in mapping and geolocation.
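`RandomNumberBetween()` is just a small helper. A minimal sketch of what it might look like (this exact implementation is an assumption; any uniform random decimal in the range will do):

private static readonly Random _random = new Random();

// Hypothetical helper: returns a uniformly random decimal between min and max.
private static decimal RandomNumberBetween(decimal min, decimal max)
{
    return min + ((decimal)_random.NextDouble() * (max - min));
}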
I ran each test 3 times. Here are the averages and a graph showing their relationship:
| Number of Records | Time in Seconds |
|---|---|
| 1 | 1.57 |
| 10 | 2.41 |
| 100 | 10.38 |
| 500 | 46.21 |
| 1,000 | 93.31 |
| 10,000 | 1,214 |
Proximity Query Performance
The real test is querying all this data. I wanted to see how long it would take to perform a proximity query when dealing with more than the 10 locations I originally tested with. If the query doesn't scale well, we'll run into a huge problem down the road.
I created 5 different tables and populated them with the same script from above. Each one has a different number of records. Then I ran a query to find all locations (our users) within 100 miles of Dallas (our customer). I ran it 3 different times over each of the 5 tables and then took the average:
var dallas = DbGeography.FromText("POINT(-96.796988 32.776664)", 4326);

var locations = from g in context.GeoTests
                let distance = g.Location.Distance(dallas)
                where distance < 100 * 1609.344 // within 100 miles of Dallas
                orderby distance
                select new { Id = g.Id, DistanceInMiles = (distance / 1609.344) };
| Number of Records | Time in Seconds | # Results |
|---|---|---|
| 10,000 | 5.13 | 68 |
| 20,000 | 8.30 | 129 |
| 30,000 | 11.78 | 179 |
| 40,000 | 16.02 | 226 |
| 100,000 | 35.37 | 588 |
I also wanted to see if the distance parameter I was using made a difference, so I ran the tests again checking for locations within 15 miles instead of 100:
| Number of Records | Time in Seconds | # Results |
|---|---|---|
| 10,000 | 4.32 | 0 |
| 20,000 | 7.50 | 1 |
| 30,000 | 10.69 | 1 |
| 40,000 | 14.27 | 7 |
| 100,000 | 34.03 | 10 |
The Takeaway
Searching through 20,000 users' locations takes less than 10 seconds, and considering that these queries will all be run out of band in WebJobs, I am perfectly happy with the scaling abilities of querying the `geography` data.
I could make the query even quicker by restricting it further, only looking at users in a specific zip code, city, or even state, but with the performance I'm seeing that won't be necessary until we get to many tens of thousands of users.
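As a sketch of what that narrowing could look like, assuming the user records also carried a State column (they don't in the test tables above), the extra filter just goes in front of the distance check:

var locations = from g in context.GeoTests
                where g.State == "TX" // hypothetical column; shrinks the set before the distance math
                let distance = g.Location.Distance(dallas)
                where distance < 100 * 1609.344
                orderby distance
                select new { Id = g.Id, DistanceInMiles = (distance / 1609.344) };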