Discrepancies in large datasets

Why totals in Excel downloads for Turtl Docs with large numbers of reads can sometimes vary slightly

Written by Dominic Adams
Updated over a week ago

A word about large datasets

When a dataset is very large, it can be impractical to load it all into memory at once. This presents a challenge, particularly where we need to count the number of items matching certain criteria, e.g. the number of times a Turtl Doc was accessed from a particular country.

In such cases, it is common practice to use estimation techniques that trade a small degree of accuracy for faster results. For example, with a very large dataset, it might take 20 seconds to determine the number of times a Turtl Doc was accessed from Ireland with 100% accuracy, but only 2 seconds to do so with 99% accuracy. Because Turtl needs to run many queries to arrive at the results we provide, using these techniques becomes necessary to avoid making users wait a long time for their data.
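As a rough illustration of this trade-off, the Python sketch below estimates a count from a random sample instead of checking every record. It is not Turtl's actual implementation: the sampling approach, the function names (`exact_count`, `sampled_count`) and the synthetic data are all assumptions made purely for illustration.

```python
import random


def exact_count(records, country):
    """Check every record: fully accurate, but work grows with dataset size."""
    return sum(1 for r in records if r == country)


def sampled_count(records, country, sample_size, seed=0):
    """Estimate the count from a random sample (hypothetical helper).

    Only `sample_size` records are inspected; the hits are then scaled up
    to the size of the full dataset.
    """
    rng = random.Random(seed)
    indices = rng.sample(range(len(records)), sample_size)
    hits = sum(1 for i in indices if records[i] == country)
    return round(hits * len(records) / sample_size)


# Synthetic data (an assumption): every 10th read comes from Ireland ("IE").
reads = ["IE" if i % 10 == 0 else "GB" for i in range(200_000)]

exact = exact_count(reads, "IE")
estimate = sampled_count(reads, "IE", sample_size=20_000)
print(f"exact: {exact}, estimate: {estimate}")
```

The estimate inspects only a tenth of the records, so it will usually land close to the exact figure without matching it precisely.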

As a result, you may find some small discrepancies between the totals we report when your Turtl Doc has received a lot of traffic. For example, you may find that the total number of reads by location adds up to 26,998 whereas the total number of reads as reported on the dashboard is 27,032. This difference of 34 reads represents an inaccuracy of roughly 0.1%, which we consider to be preferable to significantly longer wait times for our users.
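If you would like to check the size of a discrepancy in your own export, the arithmetic can be sketched as follows. The two totals are the ones from the example above; the function name `relative_discrepancy` is just an illustrative choice.

```python
def relative_discrepancy(reported_total, summed_total):
    """Difference between two totals as a fraction of the reported total."""
    return abs(reported_total - summed_total) / reported_total


dashboard_total = 27_032  # total reads reported on the dashboard
location_total = 26_998   # sum of reads by location

diff = dashboard_total - location_total
pct = relative_discrepancy(dashboard_total, location_total) * 100
print(f"{diff} reads, {pct:.2f}%")  # prints "34 reads, 0.13%"
```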

These techniques are standard and are used frequently by major analytics services such as Google Analytics.

If you have any queries on this subject, please contact your Customer Success Manager.
