SQLite Forum

Distributed robot attack against the SQLite website

(1) By drh on 2025-08-13 11:27:04 [link] [source]

There is an on-going distributed attack against the SQLite website, and specifically against URLs of the form:

https://sqlite.org/src/file?....

where the "...." is various query parameters which are always different and also always valid. I clicked the box to require anonymous login to access that URL, but the attacks keep on coming, which suggests that the robots are also sending valid anonymous-login cookies. (The web server does not log cookies so I don't know this for certain.) The user-agent strings masquerade as legitimate web browsers, of course.

Each request comes from a different IP address; the average number of requests per IP address is 1.2. I know that a robot is involved, though, since the CSS is not being requested. The IP addresses are from all over the world. So I cannot mitigate the attack by null-routing selected IP addresses.

The SQLite.org server is keeping up for now, but the attack rate is growing. If this continues, I'll probably need to turn off anonymous access to the affected URL. And in the meantime, you might notice that https://sqlite.org/ might be less responsive than normal.

The malefactor behind this attack could just clone the whole SQLite source repository and search all the content on his own machine, at his leisure. But no: Being evil, the culprit feels compelled to ruin it for everyone else. This is why you don't get to keep nice things....

(2) By _blgl_ on 2025-08-13 12:25:02 in reply to 1 [link] [source]

That sounds like the current MO of so-called "AI" web scrapers. If you can think of a low-resource method of serving poison text in response to these requests, please implement it.

(3) By drh on 2025-08-13 12:27:38 in reply to 2 [link] [source]

I don't (yet) have a way to distinguish these robot requests from legitimate requests.

(4) By ddevienne on 2025-08-13 12:35:38 in reply to 1 [link] [source]

That's the CON of being fully in charge of your stack, Web Server and SCM included. But I know you wouldn't have it any other way. Still, that's time not spent on SQLite itself, which is a shame for the rest of us.

It would be nice to be able to depend on a giant like Cloudflare for DDoS protection, and as a CDN too. Although with something as dynamic as Fossil, I'm not sure how that would work.

(5) By spindrift on 2025-08-13 13:00:16 in reply to 1 [link] [source]

Each request comes from a different IP address; the average number of requests per IP address is 1.2. I know that a robot is involved, though, since the CSS is not being requested. The IP addresses are from all over the world. So I cannot mitigate the attack by null-routing selected IP addresses.

That's particularly challenging. Someone with a lot of bots or a massive ability to spoof the source of the connections is clearly toying with you.

Are these HTTPS connections, or just HTTP? I seem to remember sqlite.org redirects all HTTP to HTTPS, but if not, and this traffic is only on port 80, you could restrict anonymous access solely on that port.

If it's HTTPS traffic then they must actually have control of the endpoint, not just spoofing it.

That's going to be a tough crowd to weed out.

Althttpd isn't going to let you directly block access to ..../src/file only, is it? It's only routing based on the .../src level, as I understand it.

(6) By drh on 2025-08-13 13:12:18 in reply to 5 [link] [source]

Years ago, anonymous login cookies were tied to a particular IP address. But some users complained that they were behind a firewall that caused their IP address to constantly shift, meaning they couldn't stay logged in as anonymous. So I disabled that feature and now the same anonymous login cookie will work from any IP address. Maybe I need to bring back the feature of tying anonymous to a specific IP address....

(7) By slavin on 2025-08-13 13:21:19 in reply to 1 [link] [source]

Just to reassure you, I've been reading about this happening to 40 other sites in the last 24 hours. According to the admins who have reported it, they range all over the map, from business-to-business webstores to hobby sites to … erm, how do I put it … the most-used sites on the internet. Cloudflare has also reported a peak in attacks on many of the sites it serves.

So whatever it is, it's not somebody attacking SQLite specifically.

Nobody has come up with a good specific defence yet. If I see one, I'll post. The best I've seen is a treacle defence: introduce a 3 second delay in all responses to the affected URLs. Humans accept it, but bots won't send more requests until old requests have been serviced.

(8) By drh on 2025-08-13 13:27:07 in reply to 7 [link] [source]

bots won't send more requests until old requests have been serviced.

Remember, we are getting 1.2 requests per IP address, so the individual slave robots are each sending only about one request on average anyhow. Convincing them to give up after one delay does not improve the situation.

(9) By stephan on 2025-08-13 13:33:47 in reply to 6 [link] [source]

Maybe I need to bring back the feature of tying anonymous to a specific IP address....

The bigger concern in that specific case was that mobile usage was on the increase with the introduction of the forum, and my IP was changing several times on each train commute to/from work. If we go back to IP-based login cookies, mobile use will suffer.

(10) By drh on 2025-08-13 13:36:49 in reply to 9 [link] [source]

IP-based anonymous cookies only - not real-user cookies.

(11) By stephan on 2025-08-13 13:39:23 in reply to 10 [link] [source]

IP-based anonymous cookies only - not real-user cookies.

Oh, of course those are different. The only cases which would be broken by active roaming that way are anonymous posters, but that's not a limitation i'd lose any sleep over.

(12) By olivluca on 2025-08-13 14:14:00 in reply to 8 [link] [source]

Would something like anubis work? https://anubis.techaro.lol/

(13) By stephan on 2025-08-13 14:17:06 in reply to 12 [link] [source]

Would something like anubis work?

As ddevienne alluded to earlier, this project has a long history of "rolling its own" and not depending on third-party services. It will be a sad day in history if it's eventually pressured into hiding behind Anubis, Cloudflare, or the like.

(14) By olivluca on 2025-08-13 14:24:36 in reply to 13 [link] [source]

I could be wrong, but I think that anubis is not a third-party service; you have to host your own instance.

(15) By drh on 2025-08-13 14:27:26 in reply to 13 [link] [source]

Fossil (which is the system that provides the services being attacked) already does use heuristics to detect and deflect robots. Been doing that for many years. The problem here is that this new robot is impersonating a human with sufficient accuracy to defeat all the heuristics.

(16) By stephan on 2025-08-13 18:13:04 in reply to 14 [link] [source]

I could be wrong but I think that anubis is not a third party service, you have to host your own instance.

Even so, it's a third-party dependency. This project's web presence currently has no third-party dependencies beyond the OS upon which the web server¹ sits, libz, libssl, and common CLI-based software development tools.


  1. ^ Written by drh

(17) By drh on 2025-08-14 13:07:29 in reply to 1 [link] [source]

The distributed robot attack continues to increase. Therefore, access to https://sqlite.org/src/.. is now mostly restricted unless you first log in as anonymous. Sorry for the inconvenience.

(18) By anonymous on 2025-08-14 15:50:56 in reply to 17 [link] [source]

Great. A captcha that is unreadable in Brave. At least I can hear and type it.

And now a fossil pull no longer works: "Error: not authorized to read"

Might as well pack up for the day - or the week.

(19) By drh on 2025-08-14 15:55:06 in reply to 18 [link] [source]

A captcha that is unreadable in Brave.

Works for me in Brave. Send me a screenshot to private email: drh at sqlite dot org.

fossil pull no longer works

I'm working on something better. Please try again in 24 hours, and repeatedly thereafter until it starts working again.

(20) By anonymous on 2025-08-14 16:31:36 in reply to 19 [link] [source]

It appears to be an Android thing...

Brave on Android = unreadable

Chrome on Android = unreadable

Firefox ESR on Windows = ok

Firefox on Linux = ok

Firefox ESR on Linux = ok

Brave on Linux = ok

I don't think that is worth bothering with - having the captcha read out works.

As for Fossil - obviously nothing to do but wait.

Have you tried matching the offending IPs with those used by Perplexity?

According to this week's "Security Now" podcast Perplexity has been structurally naughty lately.

(21) By RandomCoder on 2025-08-14 18:06:26 in reply to 20 [link] [source]

Have you tried matching the offending IPs with those used by Perplexity?

That likely won't help; they appear to be using different ASNs to avoid such blocks.

(22) By anonymous on 2025-08-14 20:43:09 in reply to 21 [link] [source]

I didn't mean as a way to block them, but as a possible way to identify them if there is an overlap.

(23) By drh on 2025-08-14 21:01:02 in reply to 22 [link] [source]

No. There is no way (yet known) to block the attack.

One theory (which is unproven but makes sense) is that the rogue HTTP requests are coming from Perplexity's Chrome plugin. So normal users on the internet install Perplexity's plugin and go about their business. Meanwhile, in the background, the plugin is surreptitiously harvesting training data, under the guise of being the innocent user who installed the plugin, and forwarding the data on to Perplexity. So the rogue requests are coming from a normal user's browser, and hence there is no good way for us to distinguish them from legitimate requests.

(24) By droleary on 2025-08-15 02:59:53 in reply to 23 [link] [source]

Well, I think you did a fine job distinguishing them in your first post: they are all essentially one-off deep links. Your reporting the incident made me look closer at my own logs, and I'm finding similar requests (only to a less punishing extent, partly because I don't host anything as major as SQLite and partly because I already have extensive firewall listings of networks that have been abusing my servers). Even if Fossil has an easy way to detect that sort of traffic, what an effective "block" would look like is the real question, given the one-and-done nature of the DDoS attack.

What about a low-resource "prove you're human" page: for a GET deep link, present something like "Your page is now ready:" followed by a "Load Page" button that does a POST submit instead, which is then accepted and served.

(25) By florian.balmer on 2025-08-15 05:26:31 in reply to 1 [link] [source]

The current anti-bot defenses also seem to break the SQLite source repository RSS feed (when loaded from scripts). Is there any chance this can be enabled again?

(26) By spindrift on 2025-08-15 05:44:13 in reply to 25 [link] [source]

Works fine for me when an anonymous login cookie is presented.

Presumably this is exactly the page type that is vulnerable to the current issue, so it must need either protecting behind some sort of cookie-style gate, or hiving off as a static page to minimise load.

Incidentally, all my login cookies seem to have recently been invalidated - possibly a fluke, or perhaps the user session table has recently been flushed.

So if you were previously presenting an anonymous login cookie for the RSS feed, you may need to regenerate it.

(27) By spindrift on 2025-08-15 05:48:29 in reply to 1 [link] [source]

I see the server load has now dropped significantly from recent numbers.

Is there any sign of abatement?

(28) By florian.balmer on 2025-08-15 05:53:17 in reply to 26 [link] [source]

I'm loading the RSS feed from a script.

(29) By spindrift on 2025-08-15 08:55:02 in reply to 28 [link] [source]

Noted, but depending on exactly what you're willing to do, you can capture an anonymous login cookie and provide that with eg. curl -c cookiefile

I'm not suggesting that's an optimal solution, but it's a workaround that seems fine here.

You're obviously at the mercy of cookie invalidation, and there is some manual input required.

But while SQLite.org is trying to stay afloat, it might be easier to just temporise rather than waiting expectantly, if this really is an issue for your workflow.
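
For anyone who wants to script this, a rough sketch of the idea (the cookie name below is only illustrative, since Fossil's login cookie name varies per repository; the anonymous login itself still has to be done by hand because of the captcha, and /timeline.rss is the usual Fossil feed path):

# After logging in as anonymous in a browser, copy the login cookie's name and value
# out of the browser and replay it on each scripted request, e.g. for the RSS feed:
curl -b 'fossil-0123456789abcdef=COOKIE-VALUE' 'https://sqlite.org/src/timeline.rss'

# Or keep the cookie in a curl cookie file (the -c option writes one) and use:
curl -b cookies.txt 'https://sqlite.org/src/timeline.rss'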

(30) By drh on 2025-08-15 10:23:37 in reply to 27 [link] [source]

The bot-net is still sending in about 1 million "deep" requests per day. By "deep request", I mean some HTTP request that requires following long delta chains, generating complex diffs, and otherwise doing a lot of computation. A "deep" request can take 100 milliseconds or more to answer.

I currently have https://sqlite.org/src configured so that all deep requests are behind a captcha. Deep requests redirect to the captcha, and then the robot gives up. This configuration actually doubles the number of requests coming from the bot-net—there is the initial request itself which now redirects to the captcha page plus the request to the captcha page. But none of the requests are "deep". Each is satisfied with just a few milliseconds of CPU time. And so the server is able to keep up easily.

So while the immediate problem has been temporarily resolved, the rise of bot-nets is a worrying sign. It means that sites like https://sqlite.org/src which provide a huge amount of information have difficulty existing on the open internet. They now have to be hidden behind captchas or similar to ward off predatory robots. This makes it more difficult for people to write scripts to extract the information they need; now they have to interact with the captcha to get through. It complicates the management of websites like https://sqlite.org/src since they now require constant security monitoring. To my mind, this incident is a harbinger of a dystopian future for the internet.

(31) By spindrift on 2025-08-15 11:03:57 in reply to 30 [link] [source]

I'm more hopeful - I suspect it will herald some sort of reputation based client certificate, equivalent to server-provided HTTPS certificates, signed by some central trusted authority on the basis of some initial arduous criterion.

But it's a tech solution to a tech problem.

In the same way that plain HTTP is no longer tenable, truly anonymous internet communication, in the sense of uncorrelated communication, is probably becoming untenable.

A basic-auth option for scripts is pretty easy to implement, though.

(32) By drh on 2025-08-15 11:24:58 in reply to 31 [link] [source]

some sort of reputation based client certificate

The bot-net is recruiting zombie clients to work on its behalf, and the zombies are the ones that own the (presumably good) client certificate.

I don't think you understand what is happening here. The flood of expensive HTTP requests is not coming directly from the bad actor. The requests are being laundered through innocent third parties. That's what makes this so diabolical. The troublesome HTTP requests are coming from ordinary people who are just trying to read their email or order some groceries on-line. Their computer/tablet/phone just happens to have been hijacked by a virus or chrome-extension that exploits their good reputation for nefarious purposes.

(33) By ThanksRyan on 2025-08-15 12:00:44 in reply to 1 [link] [source]

What's the solution to update the sqlite repo now?

$ fossil up trunk
Pull from https://sqlite.org/src
Round-trips: 1   Artifacts sent: 0  received: 0
Error: not authorized to read
Round-trips: 1   Artifacts sent: 0  received: 0
Pull done, wire bytes sent: 304  received: 212  remote: 2600:3c02::f03c:95ff:fe07:695
Autosync failed.

(34) By drh on 2025-08-15 12:19:26 in reply to 33 [link] [source]

Right this moment, you can still sync against the back-up repositories at https://www2.sqlite.org/src and https://www3.sqlite.org/src. But that could change if the bot-net discovers them.

We are actively working on enhancements to Fossil that will permanently resolve your problem, but those enhancements are not ready yet.
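
For example, from inside an existing checkout, something like this (a sketch) pulls from a mirror:

fossil pull https://www2.sqlite.org/src   # add --once if you don't want the mirror remembered as the default remote
fossil update trunk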

(35.1) By slavin on 2025-08-15 12:29:59 edited from 35.0 in reply to 30 [link] [source]

You may have already done this.

At the level of a million requests a day, I would contact my hosting service and ask them whether they can help, or want to help, or even just want to take logs for later analysis. It may be that, although you can't see how traffic is carried from your servers, their network infrastructure shows things that this bot traffic has in common. Blocking the unwanted traffic themselves, if that is possible, may save them significant traffic, power, or money in cooling.

You have the advantage that the pages being requested are public. No harm will be done if you let the host's admins access everything about them.

(36) By stephan on 2025-08-15 12:37:46 in reply to 33 [link] [source]

Error: not authorized to read

We've put a temporary measure in place to keep the sync protocol working (the bot doesn't use it). Thank you for the report!

(37) By spindrift on 2025-08-15 13:14:09 in reply to 32 [link] [source]

hijacked by a virus or chrome-extension that exploits their good reputation

It's a bit off topic, and not the best time to have tangential conversations of course. However, I do understand the nature of this current attack, and I agree about how and why this is especially pernicious and challenging.

The idea of reputation is that the innocent party would lose it, but be advised as to why, and therefore be motivated to control the browser extension / virus / whatever it is that is using network bandwidth on their behalf.

Where malware is presenting itself as an innocent user, the only way to differentiate between the malware and the user ultimately is for the user themselves to do so, and somehow be motivated and facilitated to correct the situation.

A new type of malware that impacts not your resources or security but your reputation, such that non-experts are incentivised to keep a clean ship. System checking tools that highlight "reputational risks" to computer users.

But that's the bigger philosophical issue, not the solution to this particular current manifestation of the problem.

(38) By droleary on 2025-08-15 14:27:15 in reply to 30 [link] [source]

To my mind, this incident is a harbinger of a dystopian future for the internet.

Some of us would say that "future" started 20+ years ago, when the Internet became primarily about the web, and the web became primarily about ads. As you said earlier, there's no need to spider your sites at all, or any site that is essentially a rendered repo that could be cloned and used locally. I would fully support you not enabling a web interface by default for any of the high-cost-but-normally-low-usage features that Fossil provides.

(39) By ThanksRyan on 2025-08-15 15:14:51 in reply to 1 [link] [source]

Maybe a stupid suggestion, but can the user nobody not have access to any links available on the timeline, and have it load only a few commits?
Maybe even have it load to a preexisting branch with a single commit when you detect it's from a bot farm.

(40) By stephan on 2025-08-15 15:29:43 in reply to 39 [link] [source]

Maybe a stupid suggestion, but can the user nobody not have access to any links available on the timeline,

Fossil has the ability to disable many links for non-logged-in users, but blanketly disallowing links for such users means that when we post links in the forum, there's a good chance that the folks following the links can't do much with them because link generation on the resulting page will be off for them. The usability (for those not logged in) really suffers.

To quote fossil's docs on the topic:

But requiring a login, even an anonymous login, can be annoying.

(41) By anonymous on 2025-08-16 15:06:46 in reply to 7 [link] [source]

> So whatever it is, it's not somebody attacking SQLite specifically.

How do you reconcile this statement with Richard's:

> > where the "...." is various query parameters which are always different and also always valid.

Did "AI" somehow figure out exactly which query parameters are legitimate and valid for /src/file? and which are not?

That the query parameters are valid suggests to me more "I" than "AI" or just some random crawler.

There was a recent discussion about dealing with abusive bots on NANOG:

https://marc.info/?t=175268464400001&r=1&w=2

While there were a few creative methods suggested, I think the most sensible was:

https://marc.info/?l=nanog&m=175272950930062&w=2

(42) By ncruces on 2025-08-16 18:50:44 in reply to 17 [link] [source]

Unfortunately this "breaks" build scripts that (e.g.) use curl to download a tarball to build a given branch (in my case, bedrock).

Not complaining, completely understandable.

Since I use GitHub to host my project, I now use your mirror there and their source code archive URLs instead.

Mostly commenting so if anyone else needs this they know they have this option.

(43) By drh on 2025-08-16 19:01:13 in reply to 42 [link] [source]

There is a work-around for this. If you have an account on the Fossil repository, you can visit the /tokens page (ex: https://fossil-scm.org/forum/tokens) and there you can create an "access token" associated with your account. Suppose your token is "0123456789abcdef". Then on your curl URL, you add a query parameter "token=0123456789abcdef" and that will disable the robot screening.
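
For example, a build script could then do something like this (a sketch; the token value is the hypothetical one from above, and the tarball URL follows the pattern seen elsewhere in this thread, so adapt it to whatever URL your script already fetches):

curl -o sqlite-trunk.tar.gz 'https://sqlite.org/src/tarball/trunk/sqlite-trunk.tar.gz?token=0123456789abcdef'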

(44) By spindrift on 2025-08-16 20:01:05 in reply to 43 [link] [source]

Cool, that's a good option to have, thank you.

(45) By ncruces on 2025-08-16 20:43:45 in reply to 43 [link] [source]

Thanks, that's a good option to have.

Since I'll be posting the URL to an open repository, I risk either exposing the token, or having to use secrets (which makes things harder for downstream users).

But noted for future use!

(46) By sqweek on 2025-08-18 02:08:57 in reply to 30 [link] [source]

I notice that /src is not excluded in https://sqlite.org/robots.txt -- the paths mentioned there are mostly /cvstrac/, which I think is outdated?

Of course robots.txt is not going to make a difference for the kind of nefarious distributed bot-net that appears to be active at the moment, which seems designed to circumvent the usual restrictions against automated web requests. However, in the long term it might be worth keeping it up to date, as it represents a "standard" internet convention and, as noted elsewhere in this thread¹, there is pushback against agents which do not honour it.

¹ https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/

(47) By drh on 2025-08-19 10:43:14 in reply to 1 [source]

Status update: The distributed robot attack appears to have subsided, for now.

On the other hand, I'm guessing it will reappear at some point in the future. I have implemented enhanced robot defenses on the SQLite website (specifically in the Fossil version-control system used by SQLite) to make it more resistant to these kinds of attacks in the future. What doesn't kill you makes you stronger...

(48) By anonymous on 2025-08-19 17:36:55 in reply to 47 [link] [source]

> The distributed robot attack appears to have subsided, for now.

Do you have a rough idea (or even an exact count) of how many unique IPs were involved?

(49) By drh on 2025-08-19 18:14:51 in reply to 48 [link] [source]

It is difficult to tell whether a particular HTTP request was coming from the bot-net versus some guy who just copy/pasted a link to SQLite from Reddit. But as best I can discern, it was about 1 million distinct IPs per day.

It wasn't that big of an "attack". In fact, I would guess that the bot-net took steps to help ensure that it didn't overwhelm the target and hence call too much attention to itself. The problem is that SQLite.org is a small target, with just a single 4-CPU machine. Even that is plenty of power to handle millions of bot requests per day for static pages. It is just when they start hitting dynamic pages, which compute big diffs and source-code annotations and can take multiple hundreds of milliseconds each, that it begins to cause problems. The server at sqlite.org has about 345,600,000 milliseconds of compute available per day. So a million requests that take about 250 milliseconds each to process get us uncomfortably close to saturation.
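
Spelling out the arithmetic: 4 CPUs × 86,400,000 ms per day is where the 345,600,000 figure comes from, and 1,000,000 requests × ~250 ms is about 250,000,000 ms, or roughly 72% of that daily budget, before any legitimate traffic is counted.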

The defensive measures that were put in place strive to prevent the bot-net from accessing the slow-to-compute pages. So each request ends up using just 3 or 4 milliseconds of CPU time. The existing server can handle that level of traffic without any trouble.

Aside: In the light grey text at the very bottom of this page, you can see how many milliseconds of CPU time were needed to render this forum thread.

(50) By CompuRoot on 2025-08-19 19:56:18 in reply to 49 [link] [source]

Why not put the actual web server behind Cloudflare, which will handle such abuse efficiently, free of charge?

(51) By stephan on 2025-08-19 20:04:15 in reply to 50 [link] [source]

Why not put the actual web server behind Cloudflare, which will handle such abuse efficiently, free of charge?

That's answered up-thread in /forumpost/add443de2a933a3e and its follow-ups.

(52) By drh on 2025-08-19 20:17:09 in reply to 50 [link] [source]

We were NOT getting 1 million queries to the same URL. We are getting 1 million queries to 1 million distinct URLs. Every request is unique (well, mostly, certainly to a good approximation). I don't see how a cache helps in that scenario.

There are currently more than 34,000 distinct check-ins in the SQLite source tree. We are getting requests for differences between every distinct pair of check-ins. That's already more than a billion combinations (34,000 × 33,999 ordered from/to pairs comes to about 1.16 billion). But there's more. Each request might look something like this:

https://sqlite.org/src/vdiff?from=0fe77341a0f1e869&to=172f4e4772d90f47

Where the from= and to= query parameters specify the specific check-ins to diff between. The values of those query parameters can be any unique prefix of the 64-character SHA3-256 hash that identifies the check-in, or they can be any tag associated with the check-in. So even for the same diff, there are many different ways to specify it. And there are other optional parameters that can alter the diff output in various ways (ex: side-by-side versus unified).

Add all this together and we are easily into billions and billions of distinct pages. And that's just for the /vdiff page. There are other pages on the SQLite.org/src website with just as many combinations. In the bot-net problem we were having a few days ago, most of the requests were for a different one of those billions and billions of distinct pages.

How does Cloudflare help with that?

(53) By CompuRoot on 2025-08-20 12:24:13 in reply to 52 [link] [source]

How does Cloudflare help with that?

There is still benefit in using a CDN.

  • Cloudflare supports Edge Cache rules keyed on query strings, which can be adjusted for specific patterns that shouldn't be aggregated by spiders. Specific queries, like vdiff?from=/to=, should probably be exposed to authenticated users only. Those links don't bring SQLite any popularity from an SEO point of view, and those who really need such queries will most likely download the repository and use fossil-scm to run them locally.
  • Bot mitigation: this absorbs most of the bad bot traffic on the CF side, since they have statistics on good versus bad clients. Also, "Bot Fight Mode"/"Super Bot Fight Mode" can challenge or block scrapers on the CDN side.
  • Rate limiting: CF already has dynamic statistics on offending IPs, and even for a distributed attack from millions of IPs this might help slow down traffic to the target.
  • JavaScript/captcha challenge: stops simple scrapers that don't handle JS on the CF side, without touching the target.

So, Cloudflare's WAF/bot protection still does work on dynamic pages.

(54) By anonymous on 2025-08-20 16:13:40 in reply to 53 [link] [source]

I'm not a fan of the trend to dump everything into Cloudflare. It's a nuisance as bad as the "problem" it's purposed to solve.

I'm constantly told that "we have detected that your device has engaged in suspicious activity" despite the fact that this is blatantly false. Every website that I go to these days seems to waste more of my time, as I have to prove that my device isn't engaged in malicious activity, either by waiting for something to occur or by completing menial tasks.

And then there's the SSL issue.

Dystopian future is now.

(55) By anonymous on 2025-08-20 16:16:57 in reply to 52 [link] [source]

> We are getting requests for differences between every distinct pair of check-ins. That's already more than a billion combinations

Does this mean that the bot has been taught how to recognize Fossil repositories, and which query parameters can be used to generate content to crawl? I don't see a typical crawler being able to accomplish this. What is feeding the bot the billion combinations?

(56) By stephan on 2025-08-20 17:18:26 in reply to 55 [link] [source]

What is feeding the bot the billion combinations?

Fossil is. Visit the timeline page of any repo and you will find an inexhaustible supply of links.

(57) By stephan on 2025-08-20 17:23:47 in reply to 54 [link] [source]

I'm constantly told that "we have detected that your device has engaged in suspicious activity"

It's a clear case of "hate the game, not the player." Bot swarms like the recent one are an existential threat for small-scale web hosting. The people behind these bots are the proverbial reason we can't have nice things (like captcha-free access to websites).

(58) By spindrift on 2025-08-20 17:47:14 in reply to 57 [link] [source]

Absolutely agree!

Having said that, it still might be sensible to add such high-resource-use pages to the robots.txt exclusion file, which currently contains none of them...

(59) By anonymous on 2025-08-20 18:29:11 in reply to 56 [link] [source]

Which pages accessible from /timeline provide generated links for /vdiff?from=&to= that are exposed to crawlers?

I could only find /vpatch?from=&to= which could be a potential problem. But that is a pretty finite (and small) set of URLs to crawl in my opinion.

I guess I'm still confused how /timeline can provide such a huge number of links to crawl without some kind of extra effort on the part of the crawler and knowledge of how to interpret commit hashes as inputs to from and to parameters for /vdiff to generate billions of combinations.

(60) By anonymous on 2025-08-20 18:41:06 in reply to 56 [link] [source]

It was stated earlier in this thread:

> There are currently more than 34,000 distinct check-ins in the SQLite source tree. We are getting requests for differences between every distinct pair of check-ins.

Which part of Fossil will generate crawlable links for every distinct pair of check-ins?

(61) By spindrift on 2025-08-20 18:46:33 in reply to 59 [link] [source]

I guess I'm still confused how /timeline can provide such a huge number of links to crawl without some kind of extra effort on the part of the crawler and knowledge of how to interpret commit hashes as inputs to from and to parameters for /vdiff to generate billions of combinations.

While an interesting question, this would appear to be rather beside the point, would it not?

This isn't a botnet dynamic website scraping how-to discussion forum, after all.

(62) By stephan on 2025-08-20 18:50:02 in reply to 59 [link] [source]

Which pages accessible from /timeline provide generated links for /vdiff?from=&to= that are exposed to crawlers?

Not only do the direct links lead to incalculable numbers of other links, but tapping on any two of the small dots in the timeline will diff against those specific two versions. With a 50-entry timeline (the default on SQLite's site), that's another huge number of combinations of links, each of which tends to get more expensive to calculate as the distance (in check-ins) between each pair of versions grows.

I guess I'm still confused how /timeline can provide ... for /vdiff to generate billions of combinations

Billions is likely a conservative estimate but, frankly, my brain's not big enough to do even a rough estimate of the math.

Try manually browsing from page to page in any moderately-sized fossil repository and in your natural lifetime you won't be able to follow all of the distinct links.

(63) By stephan on 2025-08-20 18:54:48 in reply to 60 [link] [source]

Which part of Fossil will generate crawlable links for every distinct pair of check-ins?

The timeline does. Clicking any two of the versioned dots on the timeline will diff between those versions, and the display length of the timeline can be controlled via a documented URL parameter (not linked to here to avoid throwing oil on the fire).

(64) By anonymous on 2025-08-20 19:09:50 in reply to 63 [link] [source]

> The timeline does.

Only via JavaScript and triggering onclick on 2 distinct nodes will it generate a crawlable /vdiff URL. Is that what crawlers are doing?

I suppose that's plausible.

(65) By anonymous on 2025-08-20 19:21:10 in reply to 61 [link] [source]

> While an interesting question, this would appear to be rather beside the point, would it not?

Exploring alternative approaches to the captcha hammer is beside the point?

I was just trying to figure out how Fossil could be adjusted in a way that doesn't require captchas. But to do that one must first understand the problem.

It would seem to me that if Fossil instead limited the ability to generate /vdiff URLs dynamically from its own website, then it wouldn't matter what crawlers did (unless we think crawlers are being adapted to crawl Fossil-hosted repositories). For example, maybe only allow selecting 2 nodes in the timeline to logged-in users (anonymous included).

I have a hard time believing that crawlers know enough about Fossil internals to know that they can obtain a list of commit hashes from /timeline and then to call /vdiff?from=&to= using 2 of those commit hashes. I think it's more likely that Fossil's own JavaScript is responsible for generating billions of URLs to crawl, and that by simply hiding that ability we could do away with the captcha (at least until someone really does invent a crawler with a Fossil plugin that enables it to generate its own URLs).

(66) By drh on 2025-08-20 20:08:12 in reply to 60 [link] [source]

Not all /vdiff links are directly reachable by a crawler. But many are.

Here is an experiment you should run:

  • Write a crawler (or maybe one already exists that does this—I dunno) that does not save page text, but just remembers all hyperlinks it finds on the page and saves them in a database, crawling each hyperlink just once. Take care to remove duplicate hyperlinks. Ignore all links that go "off site".
  • Clone a copy of the SQLite Fossil repository to your local machine.
  • Run "fossil ui" on the clone so that it presents a full website on http://localhost:8080/
  • Set your crawler loose on http://localhost:8080/timeline
  • Let us know how many links it finds and how much real time and CPU time it took to find them all.

(67) By spindrift on 2025-08-20 20:16:36 in reply to 66 [link] [source]

Something like

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://localhost:8080/timeline

should work just fine.

(68) By anonymous on 2025-08-20 20:50:54 in reply to 67 [link] [source]

Could you not identify a crawler by adding to each page some link to an internal URL that is not really visible to a normal user, but that, when followed, tells you it is a crawler? Then you can deny further access. A fair crawler would not follow it anyway, because of the robots.txt.

(69) By anonymous on 2025-08-20 20:53:46 in reply to 66 [link] [source]

> Take care to remove duplicate hyperlinks.

I was planning on inserting the hyperlinks into a SQLite DB, which will automatically eliminate duplicates for me using something like "INSERT OR REPLACE INTO urls".
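
Roughly like this, for example (a sketch using the sqlite3 shell; INSERT OR IGNORE might actually be the better fit, so that re-discovering a URL doesn't clobber any bookkeeping columns such as a visited flag):

sqlite3 crawl.db "CREATE TABLE IF NOT EXISTS urls(url TEXT PRIMARY KEY, visited INT DEFAULT 0)"
sqlite3 crawl.db "INSERT OR IGNORE INTO urls(url) VALUES('http://localhost:8080/timeline')"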

(70) By anonymous on 2025-08-20 21:02:22 in reply to 68 [link] [source]

> Could you not identify a crawler by adding to each page some link to an internal URL that is not really visible to a normal user, but that, when followed, tells you it is a crawler?

This wouldn't work because, as Richard mentioned above, each IP made only 1.2 requests on average. I suppose you could have the hidden URL issue a cookie and, if that cookie is present in any successive requests from other IPs, treat that as an indicator that one could act upon; but otherwise the hidden-URL-that-only-a-crawler-finds approach won't work well with such a distributed crawler.

(71) By drh on 2025-08-20 21:03:03 in reply to 68 [link] [source]

In a bot-net, each HTTP request comes from a different IP address and a different browser. Millions of machines, all over the world, each send a single HTTP request and get back a single reply, and we never hear from them again.

How do you deny further access when no further access is ever even attempted?

(72) By anonymous on 2025-08-20 21:25:56 in reply to 71 [link] [source]

Links on generated pages should then only be allowed to be followed by the original requestor of the generated page, for example by encoding the requestor's IP in the link, or by other technical means. Then botnets cannot access those links.

(73) By drh on 2025-08-20 22:03:40 in reply to 72 [link] [source]

Links on generated pages should then only be allowed to be followed by the original requestor

So you are no longer allowed to copy/paste a link to Reddit or HN or Twitter/X? or to an email? Seems kinda harsh.

(74) By spindrift on 2025-08-21 05:55:32 in reply to 72 [link] [source]

Don't forget that whatever generates the links is open source software.

Security through obscurity merely delays the inevitable. In the case of Fossil, they would be links to its own source code containing the procedure for generating the links.

And if you are suggesting that, rather than some algorithmic, deterministic link creation, there is instead a single-use random coding system, then you have to save all of that state in a database somewhere to determine whether a link is valid.

Your solution is too clunky and expensive.

And it also renders the utility of sharing links worthless.

(75) By anonymous on 2025-08-21 20:02:52 in reply to 66 [link] [source]

> Write a crawler (or maybe one already exists that does this—I dunno)

One would think that such a tool already exists, but it's hard to find anything with today's search engines, probably because they scour so much that they don't know how to evaluate what's of value and what is not. I cobbled together one with Tcl and it's progressing slowly (to be expected from a single-threaded script).

The results are far from scientific (and I don't really intend to go that far), but so far the /zip and /tarball endpoints take the longest to fetch from the repository. For example, at the top is:

/tarball/78b543f85a/SQLite-78b543f85a.tar.gz linked from /info/78b543f85ac6643f

Even more interesting is the following:

/timeline?uf=42b7bf0d02e08b9e77734a47798d1a55a9e0716b linked from /file?name=sqlite.pc.in&ci=trunk

I don't know how much of the elapsed time is Fossil generating the response and how much was just downloading the content. I suppose I could parse the HTML footer for the page-generation stat that Fossil publishes, at least for text/html documents, and record that.

So far it has discovered 376,489 unique URLs, of which it has visited only 12,781; the OS shows CPU TIME as 14:18.21, with elapsed wall-clock time of 95 minutes.

(76) By stephan on 2025-08-21 20:31:59 in reply to 75 [link] [source]

I cobbled together one with Tcl ... So far it has discovered 376,489 unique URLs...

Keep in mind that yours is not processing JavaScript, but industrial-grade bots do (and have been since at least 2010 or 2012). That is: they can interact with content which a non-JS-aware browser can't, like the clickable dots on the timeline. JS code can detect the presence of those and programmatically click them. They're also not hindered by UI elements which are invisible because they don't fully lay out the page, they just crawl the resulting DOM tree using either JS or C or C++ or Rust. (At least, that's how a sophisticated crawler would work, as opposed to one which regexes links out of the HTML (the humble beginnings of all crawlers, of course).)

As you say:

One would think that such a tool already exists, but it's hard to find...

My strong suspicion is that they can be purchased in "gray markets" if one knows where to look.

(77) By anonymous on 2025-08-21 21:03:56 in reply to 76 [link] [source]

> Keep in mind that yours is not processing JavaScript, but industrial-grade bots do

This is true. I was mostly interested in what was available to naive crawlers and didn't have that much time to invest in this mini-project.

Still, I think it would be good to have a proper /robots.txt at the root of sqlite.org. Crawlers will never see the /src/robots.txt because that's not where bots look for it.

(78) By TripeHound on 2025-08-22 06:27:42 in reply to 1 [link] [source]

(79) By spindrift on 2025-08-22 07:18:05 in reply to 77 [link] [source]

It's worth pointing out that I was getting enormous amounts of AI crawling over the brotli test website I set up a while ago, until I added a robots.txt exclusion file to the root.

Following which it all stopped.

I'm sure some crawlers are evil and clever and malign.

But if there is no indication that they should avoid crawling /src (which there currently isn't) then ongoing attention is probably to be expected.

It would appear quite straightforward to combine the current /src/robots.txt file (which is not used by crawlers) with the /robots.txt file (which is used, but doesn't contain any relevant instructions).
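
Something along these lines at the root would be a start (a sketch only; exactly which paths to disallow is a judgment call, these being the expensive endpoints mentioned in this thread):

User-agent: *
Disallow: /src/vdiff
Disallow: /src/file
Disallow: /src/timeline
Disallow: /src/tarball
Disallow: /src/zip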

(80.6) By midijohnny on 2025-08-22 16:06:14 edited from 80.5 in reply to 1 [link] [source]

Could you generate all links dynamically, such that the IP-address of the client is encoded in the URL itself, and then decode all URLs and check whether the encoded URL and the incoming IP match up? If they don't, redirect to an 'entry point' URL, which would then allocate a URL with the IP encoding in it.

The idea being: if 'browser-A' learns of 'link-x-ip-n.n.n.n', it can follow it - and so can 'bot-A' - but 'bot-B' (presumably at a different IP) wouldn't be able to?

If 'bot-A' is still a nuisance, you could stop it or rate-limit it based on the IP.

Edit: I see this is probably the same idea as one of the anonymous ones above - aside from adding the redirect - to make the shared original link work.

Edit 2: The botnet would just have to drop part of the URL to get a freshly valid link though... easily defeated...

Dunno: add a learning model to see if it can learn the shape of the botnet and the path it is following, maybe... then automatically rate-limit (with incrementally escalating times, like sshguard does, albeit for a single IP) any IPs it thinks are participating in the botnet?

(81) By rogerbinns on 2025-08-23 13:37:10 in reply to 80.6 [link] [source]

... such that the IP-address of the client ...

That doesn't work. Many users are behind large ISPs using NAT, so requests from the same end-user system will appear to come from different IP addresses.

In any event, what you are describing can be implemented using signed cookies, which is roughly what the anonymous login does.
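
(The signing part is cheap. Conceptually it is just an HMAC over whatever you want to bind to the cookie, computed with a server-side secret; a sketch with openssl, where $SERVER_SECRET stands in for that secret:

printf '%s' 'anonymous|2025-08-23' | openssl dgst -sha256 -hmac "$SERVER_SECRET"

The server only needs to recompute and compare the digest on each request; no per-cookie state has to be stored.)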

The approaches taken so far have been to fingerprint the client (e.g. user agent, language, crypto suites in TLS, etc.) and combine that with proof of work: make the client computer spend time and energy calculating something pointless, or make the human spend time and energy calculating something pointless (captcha). This makes for a miserable experience for humans, and keeps being overcome by the crawlers, leading to a constant arms race.

Many folks have given up and outsourced the problem to Cloudflare, who are doing that work, but they have false positives, which is really annoying to the affected people. There are no easy solutions.

(82.1) By stephan on 2025-09-02 11:38:15 edited from 82.0 in reply to 65 [link] [source]

It would seem to me that if Fossil instead limited the ability to generate /vdiff URLs dynamically from its own website, then it wouldn't matter what crawlers did (unless we think crawlers are being adapted to crawl Fossil-hosted repositories).

That's precisely what's happened. The first cases of bots logging in to fossil go back to 2010 or so, when they were seen polluting wikis and tickets on repos which were set up to allow anonymous users to append comments to those¹. Nowadays bots run full headless browsers and can do anything with the site a human can do (even more, because they're not limited by the visibility of DOM elements, and catching them interacting with such elements is a strong hint that they're a bot).

I have a hard time believing that crawlers know enough about Fossil internals to know that they can obtain a list of commit hashes from /timeline and then to call /vdiff?from=&to= using 2 of those commit hashes.

They don't need to know about internals for that. They can either use trial-and-error (which is what machine learning is all about) or they can read it in the docs which accompany every single fossil repo. When one bot learns it, all of its compatriots learn it.

Bots running their own JS engine (which need not play by the same rules as the one in an end-user's browser) can either programmatically detect which elements are clickable or they can just click all elements and see what happens. The next time around, if they're sophisticated enough, they will have learned what can be clicked and what can't.


  1. ^ Recall that "anonymous" requires a login in fossil, as distinct from "nobody", which is the not-logged-in user.

(83) By anonymous on 2025-09-02 12:52:21 in reply to 3 [link] [source]

I wonder what the intent behind such an attack/misbehavior could possibly be? Is it just to mine for whatever, to DDoS out of malice, or to try to crash and "take over"?

What's the expected end-game for the perpetrators? Trying to guess this could help in setting up some honey-pot approach. For example, some links may be always different, yet set up to be circular in nature, thus tying up an attacker infinitely.

(84) By jicman on 2025-09-02 13:58:51 in reply to 83 [link] [source]

It's probably the same reason you get robo-calls trying to sell you stuff. I seldom get crank calls, but two days ago I received an AI call. I think it was my first time.

(85) By anonymous on 2025-09-03 21:48:06 in reply to 83 [link] [source]

> For example, some links may be always different, yet set up to be circular in nature, thus tying up an attacker infinitely.

The problem with the "attack" was that the crawler was only making 1.2 requests on average from the same IP. How does one identify such an "attacker" to send it down an infinite honey-pot?

I would only classify it as an attack if robots.txt doesn't permit the kinds of requests that it was making.

On that note, it looks like https://sqlite.org/robots.txt still does not deny access to the URIs that could be potentially expensive if a crawler were to get fixated on them.