Crawler policy
ContextaSiteCheck — bot policy and verification
This page documents the polite crawler that powers Site Check's free website scans. If you're a site owner whose WAF, bot manager, or analytics flagged us, you're in the right place.
What the crawler does
When a visitor submits their URL on site-check.contexta.uk, our backend fetches up to 20 pages from the entry origin to produce a performance + security + SEO report. The visitor sees the report; we keep the data to deliver it again if they come back.
How to identify our requests
- User-Agent: Mozilla/5.0 (compatible; ContextaSiteCheck/1.0; +https://site-check.contexta.uk/bots)
- Method: GET only. Never POST/PUT/DELETE.
- Frequency: 1 request per second per scan, up to 20 pages.
- Depth: BFS depth ≤ 2 from the entry URL.
- Same-origin: We only fetch pages on the same host the visitor submitted.
- robots.txt: Honored. We parse Disallow: directives and skip those paths.
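To make the limits above concrete, here is a minimal Python sketch of how a scan bounded by the page cap, depth limit, and same-host rule could choose which URLs to visit. It is an illustration, not our production code: `plan_scan` and the `links` mapping are hypothetical names, `links` stands in for fetching and parsing pages, and comparing hostnames only (ignoring scheme and port) is an assumption about what "same host" means.

```python
from collections import deque
from urllib.parse import urlsplit

MAX_PAGES = 20  # per-scan page cap stated above
MAX_DEPTH = 2   # BFS depth limit stated above


def plan_scan(entry_url, links):
    """Return the URLs a scan would visit, in breadth-first order.

    `links` is a hypothetical stand-in for fetching and parsing:
    it maps a URL to the list of URLs found on that page.
    """
    host = urlsplit(entry_url).hostname
    seen = {entry_url}
    queue = deque([(entry_url, 0)])
    visited = []
    while queue and len(visited) < MAX_PAGES:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == MAX_DEPTH:
            continue  # do not follow links beyond depth 2
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            # Assumption: "same host" means an identical hostname.
            if urlsplit(nxt).hostname != host:
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return visited
```

A real crawler would also wait one second between requests and apply robots.txt rules before each fetch; both are left out here to keep the sketch short.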
Paths we never visit
The crawler hard-skips path segments that look transactional, destructive, or auth-related — independent of robots.txt:
/checkout, /buy, /pay, /order, /cart, /delete, /remove, /logout, /signout, /admin, /billing, /subscribe, /unsubscribe, /invoice, /purchase, /confirm-order, /account/delete
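The following Python sketch shows one way a segment-based skip check like this could work. The matching rule is an assumption: this page does not say whether matching is by exact segment, prefix, or substring, so the sketch uses exact, case-insensitive segment matching. `SKIP_SEGMENTS` and `is_skipped` are hypothetical names. Note that /account/delete needs no separate entry under this rule, because its `delete` segment already matches.

```python
from urllib.parse import urlsplit

# Segment names taken from the list above.
SKIP_SEGMENTS = {
    "checkout", "buy", "pay", "order", "cart", "delete", "remove",
    "logout", "signout", "admin", "billing", "subscribe", "unsubscribe",
    "invoice", "purchase", "confirm-order",
}


def is_skipped(url):
    """Return True if any path segment of `url` is in the skip list.

    Assumption: exact, case-insensitive match on whole segments.
    """
    segments = [s.lower() for s in urlsplit(url).path.split("/") if s]
    return any(s in SKIP_SEGMENTS for s in segments)
```

Under this assumed rule, `https://example.com/account/delete` is skipped, while `https://example.com/blog/ordering-tips` is not, because `ordering-tips` is not an exact match for `order`.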
Network identity
- IPv4 (egress): 20.162.34.229
- Reverse DNS (PTR): crawler.contexta.uk
- Forward DNS (A): crawler.contexta.uk → 20.162.34.229 (DNS-only, not Cloudflare-proxied)
- Verify ownership: dig -x 20.162.34.229 → crawler.contexta.uk, then dig crawler.contexta.uk → 20.162.34.229. Both directions must match, which is how Google/Bing crawler verification works.
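If you prefer to automate the two-way DNS check, the sketch below does the same thing in Python with the standard `socket` module. The function names are hypothetical, and the resolver functions are passed in as parameters so the logic can be exercised without network access. Treat it as a starting point rather than a drop-in rule for any particular WAF.

```python
import socket

CRAWLER_HOST = "crawler.contexta.uk"  # PTR name published above


def reverse_lookup(ip):
    """PTR lookup: return the primary hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host):
    """A lookup: return the list of IPv4 addresses for a hostname."""
    return socket.gethostbyname_ex(host)[2]


def is_contexta_crawler(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if the PTR and A records agree for `ip`."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if host != CRAWLER_HOST:
            return False
        # The forward lookup must lead back to the same address;
        # otherwise anyone controlling their own PTR could spoof us.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses.
        return False
```

Calling `is_contexta_crawler("20.162.34.229")` with the default resolvers performs live DNS queries, so the result depends on your resolver and network.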
How to allowlist us
If your edge (Cloudflare, Akamai, CloudFront, Imperva, Sucuri, DataDome, etc.) is blocking our crawler, add a rule that allowlists either the User-Agent above or the IPv4 + reverse-DNS pair. Bot certification (Cloudflare Verified Bots and the Google equivalent) is in progress; once granted, allowlisting will be automatic for participating edges.
Data we keep
- The URL the visitor submitted
- HTTP response headers + status codes from the pages we fetched
- The first 100 KB of each page body (used to score titles, security headers, etc.; never republished)
- Aggregate score + findings (the scan report)
- The visitor's email if they submit it for the full report
We do not store page bodies long-term once the report is built. We never republish or resell the captured data, and we never use it to train models.
If we got it wrong
If our crawler caused load, missed a robots.txt rule, or behaved unexpectedly, email [email protected]. We'll investigate within one working day and either fix the bug or back off your origin.