Randomize tracking#1253

Closed
LVerneyEC wants to merge 2 commits into
OpenTermsArchive:mainfrom
LVerneyEC:randomized-tracking
@LVerneyEC

Contributor

Hey,

Here is a small proposal for an addition to shuffle the tasks to be processed for scraping/tracking terms in order to increase stealth and not always hit the same service's pages in the exact same order.

Otherwise, the exact same pattern will happen again and again when scraping a service's terms. This means:
1/ Hitting rate limits, because we send a batch of up to MAX_CONCURRENCY calls to the service at more or less the same time.
2/ Easier detection of automated scraping, especially when coupled with a cron-based scheduler, as the order of scraping is almost entirely predictable.

Randomizing the order in which terms are tracked is a cheap solution for this. The only downside might be a slight decrease in readability when run interactively, but with concurrency enabled this was already the case. If the feature is worth merging, I am happy to integrate it better, for example behind a config option.
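For illustration only, shuffling the task list could look like the sketch below. This is not the code from this PR's commits: the `TrackingTask` shape and the `shuffleTasks` name are hypothetical, and the injectable `random` parameter is an assumption added so the behaviour can be checked in a test.

```typescript
// Hypothetical task shape; the engine's real task objects may differ.
interface TrackingTask {
  serviceId: string;
  termsType: string;
}

// Fisher-Yates shuffle that returns a new array and leaves the input untouched.
// `random` must return a number in [0, 1); it defaults to Math.random.
function shuffleTasks<T>(tasks: readonly T[], random: () => number = Math.random): T[] {
  const shuffled = [...tasks];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1)); // index in [0, i]
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled;
}

// Usage with hypothetical service and terms names.
const tasks: TrackingTask[] = [
  { serviceId: "ServiceA", termsType: "Terms of Service" },
  { serviceId: "ServiceA", termsType: "Privacy Policy" },
  { serviceId: "ServiceB", termsType: "Terms of Service" },
];
const queue = shuffleTasks(tasks);
```

Passing a fixed `random` function makes a given run repeatable, which is one possible way to keep a shuffled order debuggable.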

Best,

Shuffle the tasks to process for scraping/tracking terms to increase
stealth and not always hit the same service's pages in the exact same order.
@MattiSG

MattiSG commented Jun 23, 2026

Member

Thank you @LVerneyEC for your ongoing work on #166!

We have considered randomising the tracking order, however this raises very strong concerns with regard to debugging. Indeed, true randomisation means that there is no way to reproduce a failed tracking run. As such, we will not merge such a proposal without a reproducible benchmark demonstrating tangible benefits.

We would instead prefer to introduce an algorithmic, deterministic sorting system that manipulates the queue to maximise the space between fetches to the same server. In this way, the same set of declarations would always yield the same fetching order. It would also guarantee that the space is maximised, while randomisation will also yield groupings with high rates of requests to the same server.
In the same vein, you might also want to look into implementing #1105 as a way to decrease bot detection.
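As a rough sketch of the deterministic ordering idea, one simple heuristic is to group tasks by hostname and always pick next the host with the most remaining tasks that differs from the one just fetched. This is only an assumption about what such a system could look like: the `FetchTask` shape and the `spreadByHost` name are hypothetical, and the greedy rule avoids back-to-back requests to the same host where possible without being proven to maximise spacing.

```typescript
// Hypothetical task shape, for illustration only.
interface FetchTask {
  serviceId: string;
  url: string;
}

interface HostQueue {
  host: string;
  queue: FetchTask[];
}

// Plain code-unit comparison, so the ordering does not depend on locale.
const compare = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

function spreadByHost(tasks: readonly FetchTask[]): FetchTask[] {
  // Group tasks by hostname.
  const byHost = new Map<string, FetchTask[]>();
  for (const task of tasks) {
    const host = new URL(task.url).hostname;
    const queue = byHost.get(host);
    if (queue) queue.push(task);
    else byHost.set(host, [task]);
  }

  // Sort each queue and the list of hosts so the result ignores input order.
  const hosts: HostQueue[] = [...byHost.entries()]
    .map(([host, queue]) => ({
      host,
      queue: queue.sort((a, b) => compare(a.url, b.url) || compare(a.serviceId, b.serviceId)),
    }))
    .sort((a, b) => compare(a.host, b.host));

  const ordered: FetchTask[] = [];
  let last: HostQueue | undefined;
  while (ordered.length < tasks.length) {
    // Pick the host with the most remaining tasks, skipping the previous host.
    let pick: HostQueue | undefined;
    for (const candidate of hosts) {
      if (candidate === last || candidate.queue.length === 0) continue;
      if (!pick || candidate.queue.length > pick.queue.length) pick = candidate;
    }
    // If only the previous host has tasks left, repeating it is unavoidable.
    const chosen = pick ?? last;
    const next = chosen?.queue.shift();
    if (!chosen || !next) break;
    ordered.push(next);
    last = chosen;
  }
  return ordered;
}
```

Because every sort uses a fixed comparison and ties go to the alphabetically first host, the same set of tasks yields the same order regardless of how the input was arranged.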

Admittedly, this would only solve your point 1 (rate limit). While detection based on a deterministic order (your point 2) is theoretically possible, we are skeptical of its actual role in triggering bot detection.

One way for us to reconsider the cost-to-benefit ratio (as in, “cost to reproducibility and maintainability” versus “benefit to tracking”) would be for you to provide reproducible data on tracking success, comparing the current naive implementation, a space-maximising algorithm, and randomisation.

I will close this PR now, as it will not be considered without such a benchmark. We will happily reopen it if such data is provided and is conclusive, as we would love to increase the tracking success rate!

@MattiSG MattiSG closed this Jun 23, 2026