C#: LINQ Proximity Query Performance with the SQL Geography Data Type

In one of my current projects, Tempng, I needed to be able to find the users closest to a customer's location on a per-request basis. A customer makes a request, which kicks off an Azure WebJob that does the actual work of finding the closest users and processing the results. The best way I've found to do this is to use the SQL Server geography data type.

The Background

Both the customers and users have a set location (their address) that is stored in our database, so that makes it a little easier. Since all of the locations in my project are static, I considered calculating the distance between every customer and every user and storing that in a table, but that would get out of hand very quickly. So I went with the geography data type. The main reason I chose it was the geospatial proximity calculations it provides. Using LINQ, I can easily query based on the distances between locations.
For instance, say I want to find the users less than 10 miles from my customer Bob. The query is as simple as:

var users = from g in context.GeoTests
    let distance = g.Location.Distance(bobsLocation)
    where distance < 10 * 1609.344    // 10 miles converted to meters (1609.344 meters in a mile)
    orderby distance
    select new { UserId = g.UserId, DistanceInMiles = (distance / 1609.344) };

Note that the distance value is calculated in meters, so we divide by the number of meters in a mile (1609.344) to calculate miles.
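
If that conversion factor ends up in more than one query, a small tweak (my own addition, not something from the original query) is to pull it into a named constant. A C# const is baked in at compile time, so Entity Framework still sees a plain number when it translates the query:

// Hypothetical named constant for the meters-per-mile conversion factor.
private const double MetersPerMile = 1609.344;

// The query above could then use:
//     where distance < 10 * MetersPerMile
//     ...
//     select new { UserId = g.UserId, DistanceInMiles = distance / MetersPerMile };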

But Does it Scale?

Testing at a small scale worked very well, so I was curious about the scaling performance of such queries. I decided to do a few tests. Here's the table structure I used:
GeoTest Table
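
The schema image isn't reproduced here, but judging from the code later in the post, the EF entity behind the table presumably looks roughly like this (a sketch inferred from the queries, not the actual schema):

// Inferred shape only — the real columns are in the GeoTest table image above.
public class GeoTest
{
    public int Id { get; set; }
    public int UserId { get; set; }     // referenced in the first query
    public decimal Lat { get; set; }
    public decimal Lon { get; set; }
    public System.Data.Entity.Spatial.DbGeography Location { get; set; }
}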

Creation Performance

First I wanted to test the performance of creating and inserting the geography values into my database. I tested this by populating the tables with random US locations. I found the longitude and latitude bounds by very scientifically clicking what I thought looked like the northernmost, southernmost, westernmost, and easternmost points in Google Maps.

  • Northernmost: 49.259766, -95.616505
  • Southernmost: 26.280313, -97.530646
  • Westernmost: 40.497140, -124.307168
  • Easternmost: 44.774586, -67.034416

I used the following code to populate the table from my laptop into an Azure SQL Database.

using (var context = new GeoTestEntities())
{
    for (var i = 0; i < RECORDS_TO_CREATE; i++)
    {
        // Pick a random point roughly inside the bounding box above.
        var lon = RandomNumberBetween(-124, -67);
        var lat = RandomNumberBetween(26, 50);
        var geo = new DAL.GeoTest()
        {
            Lon = lon,
            Lat = lat,
            // Well-known text is "POINT(longitude latitude)"; 4326 is the SRID.
            Location = DbGeography.FromText("POINT(" + lon + " " + lat + ")", 4326)
        };
        context.GeoTests.Add(geo);
        // Save each record individually, so every insert is its own round trip.
        context.SaveChanges();
    }
}

You'll need to add the System.Data.Entity.Spatial namespace to get the DbGeography class. The RandomNumberBetween() method is a helper that returns a decimal. The value 4326 is the spatial reference system ID (SRID) for WGS 84, the coordinate system most commonly used in mapping and geolocation.
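
The RandomNumberBetween() helper isn't shown in the post; a minimal sketch, assuming it just scales Random.NextDouble() into the requested range and returns a decimal, could be:

private static readonly Random _random = new Random();

// Assumed implementation: scale Random.NextDouble() (0 to 1) into [minValue, maxValue).
private static decimal RandomNumberBetween(decimal minValue, decimal maxValue)
{
    var next = (decimal)_random.NextDouble();
    return minValue + (next * (maxValue - minValue));
}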

I ran each test 3 times. Here are the averages and a graph showing their relationship:

| Number of Records | Time in Seconds |
| --- | --- |
| 1 | 1.57 |
| 10 | 2.41 |
| 100 | 10.38 |
| 500 | 46.21 |
| 1,000 | 93.31 |
| 10,000 | 1,214 |

![Populate Table Results](/content/images/2016/06/create-results-5.PNG)

So creating the `geography` records is more or less linear. It's pretty slow as far as computing goes, but we'll only be creating one of these at a time, when a user signs up. On top of that, it runs through a WebJob, so the user won't notice any slowdown at all when entering their address.

Proximity Query Performance

The real test is querying all this data. I wanted to see how long it would take to perform a proximity query when dealing with more than the 10 locations I originally tested with. If the query doesn't scale well, then we'll run into a huge problem later on down the road.

I created 5 different tables and populated them with the same script from above. Each one has a different number of records. Then I ran a query to find all locations (our users) within 100 miles of Dallas (our customer). I ran it 3 different times over each of the 5 tables and then took the average:

var dallas = DbGeography.FromText("POINT(-96.796988 32.776664)", 4326);
var locations = from g in context.GeoTests
    let distance = g.Location.Distance(dallas)
    where distance < 100 * 1609.344 // Within 100 miles of Dallas
    orderby distance
    select new { Id = g.Id, DistanceInMiles = (distance / 1609.344) };

| Number of Records | Time in Seconds | # Results |
| --- | --- | --- |
| 10,000 | 5.13 | 68 |
| 20,000 | 8.30 | 129 |
| 30,000 | 11.78 | 179 |
| 40,000 | 16.02 | 226 |
| 100,000 | 35.37 | 588 |

![100 Mile Query Results](/content/images/2016/06/query-results1-1.PNG)

You can see that querying the `geography` type is also linear. In Big O notation it's *O(n)*. Nice! It averages out to about 2,500 records per second.
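
The post doesn't show the timing harness itself; a minimal sketch, assuming a Stopwatch wrapped around the query above and a ToList() call to force execution, might look like this:

var stopwatch = System.Diagnostics.Stopwatch.StartNew();

// ToList() executes the query, so the elapsed time includes the round trip to Azure SQL.
var results = (from g in context.GeoTests
               let distance = g.Location.Distance(dallas)
               where distance < 100 * 1609.344
               orderby distance
               select new { Id = g.Id, DistanceInMiles = distance / 1609.344 }).ToList();

stopwatch.Stop();
Console.WriteLine(results.Count + " results in " + stopwatch.Elapsed.TotalSeconds + " seconds");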

I also wanted to see if the distance parameter I was using made a difference, so I ran the tests again checking for locations within 15 miles instead of 100:

| Number of Records | Time in Seconds | # Results |
| --- | --- | --- |
| 10,000 | 4.32 | 0 |
| 20,000 | 7.50 | 1 |
| 30,000 | 10.69 | 1 |
| 40,000 | 14.27 | 7 |
| 100,000 | 34.03 | 10 |

![15 Mile Query Results](/content/images/2016/06/query-results2-2.PNG)

This took a little less time, averaging about 2,700 records per second. So while the distance parameter does have an effect, it's not really that noticeable for what I'm doing in my project. Most, if not all, of my proximity searches will be less than 100 miles.

The Takeaway

Searching through 20,000 users' locations takes less than 10 seconds, and considering that these queries will all run out of band in WebJobs, I am perfectly happy with how querying the geography data scales.
I could make the query even quicker by restricting it further, looking only at users in a specific zip code, city, or even state, but with the performance I'm seeing that won't be necessary until we get to many tens of thousands of users.
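
For instance, if the table also carried a zip code column (hypothetical here; the test table above doesn't have one), the narrower query could add a cheap equality filter ahead of the distance check, something like:

// Hypothetical: assumes a ZipCode column on the entity and a customerZipCode variable.
var nearbyUsers = from g in context.GeoTests
                  where g.ZipCode == customerZipCode
                  let distance = g.Location.Distance(bobsLocation)
                  where distance < 10 * 1609.344
                  orderby distance
                  select new { UserId = g.UserId, DistanceInMiles = distance / 1609.344 };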