Randomize tracking by LVerneyEC · Pull Request #1253 · OpenTermsArchive/engine

LVerneyEC · 2026-06-23T08:49:30Z

Hey,

Here is a small proposal: shuffle the tasks queued for scraping/tracking terms, in order to increase stealth and avoid always hitting the same service's pages in the exact same order.

Otherwise, the exact same pattern repeats every time a service's terms are scraped. This means:
1/ Hitting rate limits, because we send a batch of calls (up to MAX_CONCURRENCY) to the service at more or less the same time.
2/ Easier detection of automated scraping, even more so when coupled with a cron-based scheduler, as the scraping order is almost entirely predictable.

Randomizing the order in which terms are tracked is a cheap solution for this. The only downside might be a slight decrease in readability when run interactively, but with concurrency enabled this was already the case. If the feature is worth merging, I am happy to integrate it better, for example behind a config option.
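For illustration, the kind of shuffle proposed here could look like the sketch below. It is a plain Fisher-Yates shuffle and not the code from this PR's commits; the `TrackingTask` shape and the `shuffleTasks` name are hypothetical, and the `random` parameter is an assumption added so the function can be exercised with a fixed source.

```typescript
// Hypothetical task shape; the engine's real task objects are not shown in this thread.
interface TrackingTask {
  serviceId: string;
  termsType: string;
}

// Fisher-Yates shuffle. Returns a new array and leaves the input untouched.
// `random` must return a float in [0, 1); it defaults to Math.random.
function shuffleTasks<T>(tasks: readonly T[], random: () => number = Math.random): T[] {
  const shuffled = tasks.slice();
  for (let i = shuffled.length - 1; i > 0; i--) {
    // Pick an index in [0, i] and swap it with position i.
    const j = Math.floor(random() * (i + 1));
    const tmp = shuffled[i] as T;
    shuffled[i] = shuffled[j] as T;
    shuffled[j] = tmp;
  }
  return shuffled;
}

// Example usage with made-up service names.
const exampleTasks: TrackingTask[] = [
  { serviceId: "ServiceA", termsType: "Terms of Service" },
  { serviceId: "ServiceA", termsType: "Privacy Policy" },
  { serviceId: "ServiceB", termsType: "Terms of Service" },
];
const shuffledExample = shuffleTasks(exampleTasks);
```

Note that a uniform shuffle only changes the order; it does nothing to prevent two pages of the same service from ending up next to each other.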

Best,

MattiSG · 2026-06-23T09:35:42Z

Thank you @LVerneyEC for your ongoing work on #166!

We have considered randomising the tracking order, but this raises very strong concerns with regard to debugging: true randomisation means that there is no way to reproduce a failed tracking run. As such, we will not merge such a proposal without a reproducible benchmark demonstrating tangible benefits.

We would instead prefer to introduce an algorithmic, deterministic sorting system that manipulates the queue to maximise the space between fetches to the same server. In this way, the same set of declarations would always yield the same fetching order. It would also guarantee that the space is maximised, whereas randomisation will sometimes yield clusters of requests to the same server in quick succession.
In the same vein, you might also want to look into implementing #1105 as a way to decrease bot detection.
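To make the idea of a deterministic, spacing-oriented order more concrete, here is a minimal sketch. It is only one possible heuristic (round-robin across hosts, busiest hosts first) and is not claimed to be optimal or to be what the maintainers have in mind. The `FetchTask` shape and the `interleaveByHost` name are hypothetical.

```typescript
// Hypothetical task shape: `id` identifies the terms to fetch, `host` the server they live on.
interface FetchTask {
  id: string;
  host: string;
}

// Locale-independent string comparison, so the result is the same on every machine.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Deterministic round-robin: one task per host per round, hosts with the most tasks first.
// The output depends only on the set of tasks, not on their input order
// (assuming ids are unique within a host).
function interleaveByHost(tasks: readonly FetchTask[]): FetchTask[] {
  const byHost = new Map<string, FetchTask[]>();
  for (const task of tasks) {
    const group = byHost.get(task.host) ?? [];
    group.push(task);
    byHost.set(task.host, group);
  }

  // Canonical ordering: groups by size (descending) then host name; tasks by id.
  const groups = Array.from(byHost.entries())
    .sort(([hostA, a], [hostB, b]) => b.length - a.length || compareStrings(hostA, hostB))
    .map(([, group]) => group.slice().sort((a, b) => compareStrings(a.id, b.id)));

  const ordered: FetchTask[] = [];
  for (let round = 0; ordered.length < tasks.length; round++) {
    for (const group of groups) {
      const task = group[round];
      if (task !== undefined) ordered.push(task);
    }
  }
  return ordered;
}
```

With three tasks on one host, two on a second and one on a third, this yields the first host's tasks at positions 1, 4 and 6. When one host dominates the queue, its last tasks still end up adjacent, so a real implementation would probably need something smarter or an explicit delay.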

Admittedly, this would only solve your point 1 (rate limiting). As for point 2, while it is theoretically possible that a deterministic order makes bot detection easier, we are skeptical of its actual role in triggering it.

One way for us to reconsider the cost-to-benefit ratio (as in, “cost to reproducibility and maintainability to benefit to tracking”) would be to provide reproducible data on tracking success, comparing three approaches: the current naive implementation, a space-maximising algorithm, and randomisation.
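For such a comparison to be repeatable, the randomised arm would itself need to produce the same order on every run. One possible way to get that, which is an assumption and not something proposed in this thread, is to draw from a seeded pseudo-random generator and log the seed. The sketch below uses a small linear congruential generator; the `seededRandom` name is hypothetical.

```typescript
// Linear congruential generator returning floats in [0, 1).
// Not cryptographically secure; only meant to make a run repeatable from a logged seed.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0; // keep the state as an unsigned 32-bit integer
  return () => {
    // state = (state * 1664525 + 1013904223) mod 2^32
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Two generators built from the same seed produce the same sequence.
const firstRun = seededRandom(42);
const secondRun = seededRandom(42);
const sameFirstDraw = firstRun() === secondRun();
```

The returned function has the same shape as `Math.random`, so it can be handed to any shuffle that accepts a `() => number` source.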

I will close this PR now, as it will not be considered without such a benchmark. We will happily reopen it if such data is provided and is conclusive, as we would love to increase the tracking success rate!

LVerneyEC added 2 commits June 23, 2026 10:45

Randomize tracking

32e7a88

Shuffle the tasks to process for scraping/tracking terms to increase stealth and not always hit the same service's pages in the exact same order.

lint

e9a5525

MattiSG closed this Jun 23, 2026

LVerneyEC wants to merge 2 commits into OpenTermsArchive:main from LVerneyEC:randomized-tracking
