Crawler policy

ContextaSiteCheck — bot policy and verification

This page documents the polite crawler that powers Site Check's free website scans. If you're a site owner whose WAF, bot manager, or analytics flagged us, you're in the right place.

What the crawler does

When a visitor submits their URL on site-check.contexta.uk, our backend fetches up to 20 pages from the entry origin to produce a performance, security, and SEO report. The visitor sees the report; we keep the report data so we can show it again if they come back.

How to identify our requests

User-Agent
Mozilla/5.0 (compatible; ContextaSiteCheck/1.0; +https://site-check.contexta.uk/bots)
Method
GET only. Never POST/PUT/DELETE.
Frequency
1 request per second per scan, up to 20 pages.
Depth
BFS depth ≤ 2 from the entry URL.
Same-origin
We only fetch pages on the same host the visitor submitted.
robots.txt
Honored. We parse Disallow: directives and skip those paths.
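
To make the limits above concrete, here is a minimal sketch of a breadth-first crawl plan with a 20-page cap, a depth limit of 2, and a same-host filter. It is an illustration, not our production code: plan_crawl and get_links are hypothetical names, get_links stands in for whatever extracts links from a fetched page, and the sketch leaves out the 1-request-per-second pacing and the robots.txt check.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit

MAX_PAGES = 20  # page cap per scan, as stated in the policy
MAX_DEPTH = 2   # BFS depth limit from the entry URL


def plan_crawl(entry_url, get_links, max_pages=MAX_PAGES, max_depth=MAX_DEPTH):
    """Return the URLs a scan would visit, in breadth-first order.

    get_links(url) is a hypothetical callable that returns the hrefs
    found on the page at url.
    """
    entry_host = urlsplit(entry_url).netloc.lower()
    visited = []
    seen = {entry_url}
    queue = deque([(entry_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # pages at the depth limit are fetched but not expanded
        for href in get_links(url):
            target = urljoin(url, href).split("#", 1)[0]  # resolve, drop fragment
            if urlsplit(target).netloc.lower() != entry_host:
                continue  # same-host only
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With an in-memory link map in place of real fetches, an off-host link and anything deeper than two hops from the entry URL never appear in the returned list.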

Paths we never visit

The crawler hard-skips path segments that look transactional, destructive, or auth-related — independent of robots.txt:

/checkout, /buy, /pay, /order, /cart, /delete, /remove, /logout, /signout, /admin, /billing, /subscribe, /unsubscribe, /invoice, /purchase, /confirm-order, /account/delete
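
One way to picture the hard-skip rule is the sketch below. The exact matching semantics are an assumption: here a URL is skipped when one of the listed entries appears as a whole path segment (or, for /account/delete, as a whole two-segment run), compared case-insensitively. The function name is_hard_skipped is hypothetical.

```python
from urllib.parse import urlsplit

# Entries copied from the list above.
BLOCKED_PATHS = (
    "/checkout", "/buy", "/pay", "/order", "/cart", "/delete", "/remove",
    "/logout", "/signout", "/admin", "/billing", "/subscribe",
    "/unsubscribe", "/invoice", "/purchase", "/confirm-order",
    "/account/delete",
)


def is_hard_skipped(url):
    """Return True if the URL path contains a blocked entry as whole segments.

    Assumption: matching is on complete segments, so /cart matches
    /shop/cart/items but not /cartoons.
    """
    path = urlsplit(url).path.lower()
    normalized = path.rstrip("/") + "/"  # ensure every segment ends with "/"
    return any(blocked + "/" in normalized for blocked in BLOCKED_PATHS)
```

Under this assumption, https://example.com/shop/Checkout/step-1 is skipped, while https://example.com/cartoons and https://example.com/blog/how-to-buy-a-domain are not.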

Network identity

IPv4 (egress)
20.162.34.229
Reverse DNS (PTR)
crawler.contexta.uk
Forward DNS (A)
crawler.contexta.uk → 20.162.34.229 (DNS-only, not Cloudflare-proxied)
Verify ownership
dig -x 20.162.34.229 → crawler.contexta.uk, then dig crawler.contexta.uk → 20.162.34.229. Both directions must match; this is the same forward-confirmed reverse DNS check used to verify Google and Bing crawlers.
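
If you would rather script the two-way DNS check than run dig by hand, a minimal sketch using Python's standard socket module could look like this. The function name is_contexta_crawler and the injectable resolver parameters are hypothetical and exist only so the logic can be exercised without network access; the IP and hostname are the ones listed above.

```python
import socket

CRAWLER_HOST = "crawler.contexta.uk"  # expected PTR name, from this page


def is_contexta_crawler(ip,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a remote IPv4 address."""
    try:
        # Step 1: PTR lookup, IP -> hostname.
        hostname, _aliases, _addrs = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record or resolver failure
    if hostname.rstrip(".").lower() != CRAWLER_HOST:
        return False
    try:
        # Step 2: A lookup, hostname -> list of IPs.
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        return False
    # Step 3: the original IP must be among the forward results.
    return ip in addresses
```

Calling is_contexta_crawler("20.162.34.229") with the default resolvers performs live DNS lookups; a request that merely copies our User-Agent from another address fails at step 1 or step 3.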

How to allowlist us

If our crawler is being blocked by your edge (Cloudflare, Akamai, CloudFront, Imperva, Sucuri, DataDome, etc.), add a rule that allowlists either the User-Agent above OR the IPv4 + reverse-DNS pair. Bot certification (Cloudflare Verified Bots and the Google equivalent) is in progress; once granted, allowlisting will be automatic for participating edges.
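
Each edge product has its own rule syntax, so the following is only a vendor-neutral sketch of the matching logic, written as an origin-side check. The name matches_allowlist is hypothetical, the User-Agent test is a simple substring match on the ContextaSiteCheck/ token, and the IP branch compares the address only; confirming the reverse-DNS pair would be a separate step.

```python
import ipaddress

CRAWLER_UA_TOKEN = "ContextaSiteCheck/"               # from the User-Agent above
CRAWLER_IP = ipaddress.ip_address("20.162.34.229")    # egress IPv4 from this page


def matches_allowlist(user_agent, remote_ip):
    """Return True if a request matches the User-Agent token OR the egress IP."""
    ua_match = CRAWLER_UA_TOKEN.lower() in (user_agent or "").lower()
    try:
        ip_match = ipaddress.ip_address(remote_ip) == CRAWLER_IP
    except ValueError:
        ip_match = False  # malformed address never matches
    return ua_match or ip_match
```

A User-Agent string can be copied by anyone, so a rule keyed on the IP address is the stricter of the two options.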

Data we keep

We do not store page bodies long-term once the report is built. We never re-publish, re-sell, or use the captured data for training models.

If we got it wrong

If our crawler caused load, missed a robots.txt rule, or behaved unexpectedly, email [email protected]. We'll investigate within one working day and either fix the bug or back off your origin.
