Randomize tracking #1253
Shuffle the tasks to process for scraping/tracking terms to increase stealth and not always hit the same service's pages in the exact same order.
Thank you @LVerneyEC for your ongoing work on #166!
We have considered randomising the tracking order. However, this raises very strong concerns with regard to debugging: true randomisation means that there is no way to reproduce a failed tracking run. As such, we will not merge such a proposal without a reproducible benchmark demonstrating tangible benefits.
We would instead prefer to introduce an algorithmic, deterministic sorting system that manipulates the queue to maximise the space between fetches to the same server. In this way, the same set of declarations would always yield the same fetching order. It would also guarantee that the space is maximised, whereas randomisation will sometimes still yield groupings with high rates of requests to the same server.
Admittedly, this would only solve your point 1 (rate limit). While point 2 (detection through a deterministic order) is theoretically possible, we are skeptical of its actual role in triggering bot detection.
One way for us to reconsider the cost-to-benefit ratio (as in, “cost to reproducibility and maintainability” versus “benefit to tracking”) would be to provide reproducible data on tracking success comparing the current naive implementation, a space-maximising algorithm, and randomisation.
I will close this PR now, as it will not be considered without such a benchmark. We will happily reopen it if such data is provided and is conclusive, as we would love to increase the tracking success rate!
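To make the deterministic alternative concrete, here is a minimal Python sketch of one possible approach: group URLs by hostname and interleave the groups round-robin. This is only an illustration under stated assumptions, not the project's implementation. The function name `spread_by_host` and the example URLs are hypothetical, and round-robin interleaving is a simple heuristic that spreads requests out rather than a proven maximisation of the spacing.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def spread_by_host(urls):
    """Return the URLs in a deterministic order that interleaves hosts.

    Hypothetical helper: the same set of URLs always yields the same order,
    whatever order they were passed in.
    """
    by_host = defaultdict(list)
    # Sorting first makes the result independent of the input order.
    for url in sorted(urls):
        by_host[urlsplit(url).hostname or ""].append(url)

    # Largest groups first, ties broken by hostname, so the ordering is stable.
    groups = sorted(by_host.items(), key=lambda item: (-len(item[1]), item[0]))

    ordered = []
    # Each round takes at most one URL per host.
    for round_ in zip_longest(*(group for _, group in groups)):
        ordered.extend(url for url in round_ if url is not None)
    return ordered


urls = [
    "https://a.example/terms",
    "https://a.example/privacy",
    "https://b.example/terms",
    "https://c.example/terms",
]
print(spread_by_host(urls))
```

When one host has many more documents than the others, the final rounds of this sketch still contain consecutive requests to that host, so a real implementation would likely need an additional delay or a smarter spacing strategy.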
Hey,
Here is a small proposal for an addition to shuffle the tasks to be processed for scraping/tracking terms in order to increase stealth and not always hit the same service's pages in the exact same order.
Otherwise, the exact same pattern will happen again and again when scraping a service's terms. This means:
1/ Hitting rate limiting, because we are sending a batch of up to MAX_CONCURRENCY calls to the service at more or less the same time.
2/ Easier detection of automated scraping, even more so when coupled with a cron-based scheduler, as the order of scraping is almost entirely predictable.
Randomizing the order in which the terms are tracked is a cheap solution for this. The only downside might be a slight decrease in readability when run interactively, but with concurrency enabled this was already the case. If the feature is worth merging, I am happy to integrate it better, for instance behind a config option.
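For illustration, the idea of shuffling the task list before processing can be sketched in Python as follows. This is not the diff from this PR. The name `shuffled_tasks` and the task labels are hypothetical, and the optional `seed` parameter is an assumption added here to show how a shuffled order could still be replayed.

```python
import random


def shuffled_tasks(tasks, seed=None):
    """Return a shuffled copy of tasks, leaving the input untouched.

    Hypothetical helper: with seed=None the order differs between runs;
    passing the same seed again reproduces the same order.
    """
    rng = random.Random(seed)  # private generator, does not touch global state
    result = list(tasks)
    rng.shuffle(result)
    return result


tasks = ["service-a/terms", "service-a/privacy", "service-b/terms", "service-c/terms"]
print(shuffled_tasks(tasks))           # different order on each run
print(shuffled_tasks(tasks, seed=42))  # same order every time for seed 42
```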
Best,