Selecting Random Samples From Large SQL Server Tables

sql-server
Author

Martyn Jones

Published

November 6, 2022

Recently, we needed to select 10,000 random rows from a table containing a billion rows.

I couldn’t think of a better solution than ordering by NEWID() and selecting the TOP (10000) rows, but given the size of the table we’re working with, it was very slow and inefficient, so there had to be a better solution.

Thankfully, Brent Ozar had a blog post to get us moving in the right direction: How to Get a Random Row from a Large Table.

As Brent’s post explains, the ordering by NEWID() is so slow because it calculates the NEWID value for every row in the table and then filters the return to the first 10,000 rows.

Ultimately, the best solution from Brent’s post involves:

This does require a table with an integer ID column and also, is not guaranteed to return a record, if the randomly selected ID value doesn’t exist in the table.

For our purposes though, we had a BIGINT identity column and the very vast majority of ID values between 1 and the max value did exist, so this was perfect.

There is also a nod to our required use in Brent’s post:

If you wanted 10 rows, you’d have to call code like this 10 times (or generate 10 random numbers and use an IN clause.)

As we wanted 10,000 rows, we decided to: - Use a Tally table, (called dbo.Numbers in this example) to get the 10,000 rows. - Generate a NEWID() for each of those rows. - Store those 10,000 random numbers in a temporary table.

/* Get the highest ID in the table to use in the modulo calculation. */
DECLARE @MaxID BIGINT;
SELECT @MaxID = MAX(ID)
FROM dbo.TableToSample;

/* Create a table variable to store the randomly selected IDs. */
DECLARE @RandomIDs TABLE (
    SampleID INT NOT NULL,
    RandomID BIGINT NOT NULL,
    PRIMARY KEY CLUSTERED (RandomID, SampleID)
);

/* To allow for any IDs no longer present in the table, select 10% more than needed at this point. */
INSERT INTO @RandomIDs (RandomID)
SELECT  n.N AS SampleID,
        ABS(CHECKSUM(NEWID())) % MaxID AS RandomID
FROM dbo.Numbers AS n
WHERE n.N BETWEEN 1 AND 11000;

/* Select from the table to be sampled inner joined with the temporary table to only return rows 
 * with a randomly selected ID value, and only keep the first 10,000 matches. */
SELECT TOP (10000)
    tts.ID, 
    tts.Col1, 
    tts.Col2
FROM dbo.TableToSample AS tts 
INNER JOIN @RandomIDs AS rnd
    ON tts.ID = rnd.RandomID
ORDER BY rnd.SampleID;

Ultimately, the performance of this approach is significantly faster than the alternative and requires far less IO and CPU time. It solved the problem and is something that has proved useful on a few occasions since.