Crawler policy

ContextaSiteCheck — bot policy and verification

This page documents the polite crawler that powers Site Check's free website scans. If you're a site owner whose WAF, bot manager, or analytics flagged us, you're in the right place.

What the crawler does

When a visitor submits their URL on site-check.contexta.uk, our backend fetches up to 20 pages from the entry origin to produce a performance, security, and SEO report. The visitor sees the report, and we keep the data so we can show it again if they come back.

How to identify our requests

User-Agent: Mozilla/5.0 (compatible; ContextaSiteCheck/1.0; +https://site-check.contexta.uk/bots)
Method: GET only. Never POST/PUT/DELETE.
Frequency: 1 request per second per scan, up to 20 pages.
Depth: BFS depth ≤ 2 from the entry URL.
Same-origin: We only fetch pages on the same host the visitor submitted.
robots.txt: Honored. We parse Disallow: directives and skip those paths.
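
The limits above can be read as a bounded breadth-first crawl. The following is only an illustrative sketch, not the crawler's actual source: plan_crawl, get_links, robots_lines, and delay_s are hypothetical names, and the robots.txt handling simply uses Python's standard urllib.robotparser.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.robotparser import RobotFileParser

MAX_PAGES = 20                  # "up to 20 pages"
MAX_DEPTH = 2                   # "BFS depth <= 2 from the entry URL"
UA_TOKEN = "ContextaSiteCheck"  # product token from the User-Agent string


def plan_crawl(entry_url, get_links, robots_lines=(), delay_s=1.0):
    """Return the URLs fetched, in BFS order, within the stated limits.

    get_links(url) is a hypothetical callback that performs the GET and
    returns the hrefs found on the page. robots_lines is the site's
    robots.txt, already split into lines.
    """
    robots = RobotFileParser()
    robots.parse(list(robots_lines))
    host = urlsplit(entry_url).netloc
    queue = deque([(entry_url, 0)])
    seen = {entry_url}
    visited = []
    while queue and len(visited) < MAX_PAGES:
        url, depth = queue.popleft()
        if not robots.can_fetch(UA_TOKEN, url):
            continue  # path is disallowed by robots.txt
        if visited:
            time.sleep(delay_s)  # pace requests: about one per second
        links = get_links(url)
        visited.append(url)
        if depth >= MAX_DEPTH:
            continue  # do not follow links beyond depth 2
        for href in links:
            link = urldefrag(urljoin(url, href)).url
            # Same-origin rule: stay on the host the visitor submitted.
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing the link extractor in as a callback keeps the traversal logic separate from networking, so the depth, page-count, and same-host limits can be checked without sending any requests.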

Paths we never visit

The crawler hard-skips path segments that look transactional, destructive, or auth-related — independent of robots.txt:

/checkout, /buy, /pay, /order, /cart, /delete, /remove, /logout, /signout, /admin, /billing, /subscribe, /unsubscribe, /invoice, /purchase, /confirm-order, /account/delete
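
One way to picture the hard-skip rule is a check on whole path segments. This is a sketch under assumptions: the page does not say whether the crawler matches whole segments, prefixes, or ignores case, so the matching below (case-insensitive, whole segments, anywhere in the path) and the name is_hard_skipped are illustrative only.

```python
from urllib.parse import urlsplit

# The list published above, copied verbatim.
BLOCKED_PATHS = (
    "/checkout", "/buy", "/pay", "/order", "/cart", "/delete", "/remove",
    "/logout", "/signout", "/admin", "/billing", "/subscribe",
    "/unsubscribe", "/invoice", "/purchase", "/confirm-order",
    "/account/delete",
)


def is_hard_skipped(url):
    """True if any blocked entry appears as complete path segment(s)."""
    # Normalise to lower case with exactly one trailing slash so that
    # "/cart" and "/cart/" are treated the same way.
    path = urlsplit(url).path.lower().rstrip("/") + "/"
    # Appending "/" to each entry makes "/buy" match "/shop/buy/now"
    # but not "/buyers-guide" (assumed behaviour, not confirmed).
    return any(blocked + "/" in path for blocked in BLOCKED_PATHS)
```

For example, is_hard_skipped("https://example.com/shop/checkout") is true under these assumptions, while is_hard_skipped("https://example.com/buyers-guide") is false.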

Network identity

IPv4 (egress): 20.162.34.229
Reverse DNS (PTR): crawler.contexta.uk
Forward DNS (A): crawler.contexta.uk → 20.162.34.229 (DNS-only, not Cloudflare-proxied)
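Verify ownership: run dig -x 20.162.34.229, which should return crawler.contexta.uk, then run dig crawler.contexta.uk, which should return 20.162.34.229. Both directions must match. This is the same forward-confirmed reverse DNS check that Google and Bing document for verifying their crawlers.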
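
The same two-way DNS check can be scripted on the site owner's side. The sketch below uses only Python's standard socket module; the function name is_contexta_crawler and the injectable reverse/forward parameters are hypothetical, added so the logic can be exercised without live DNS.

```python
import socket

CRAWLER_HOST = "crawler.contexta.uk"  # PTR name published above


def is_contexta_crawler(ip,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a client IPv4 address.

    Step 1: the PTR record for `ip` must name crawler.contexta.uk.
    Step 2: the A record(s) for that name must include `ip` again.
    """
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
        if hostname != CRAWLER_HOST:
            return False
        return ip in forward(hostname)[2]
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses;
        # treat any lookup failure as "not verified".
        return False
```

A request from 20.162.34.229 should pass both steps. An address whose PTR record merely claims crawler.contexta.uk fails the second step, because the forward lookup will not return that address.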

How to allowlist us

If our crawler is being blocked by your edge (Cloudflare, Akamai, CloudFront, Imperva, Sucuri, DataDome, etc.), add a rule that allowlists either the User-Agent above or the IPv4 + reverse-DNS pair. Bot certification (Cloudflare Verified Bots and the Google equivalent) is in progress; once granted, allowlisting will be automatic for participating edges.

Data we keep

The URL the visitor submitted
HTTP response headers + status codes from the pages we fetched
The first 100 KB of each page body (used to score titles, security headers, etc.; never republished)
Aggregate score + findings (the scan report)
The visitor's email if they submit it for the full report

We do not store page bodies long-term once the report is built. We never republish or resell the captured data, and we never use it to train models.

If we got it wrong

If our crawler caused load, missed a robots.txt rule, or behaved unexpectedly, email [email protected]. We'll investigate within one working day and either fix the bug or back off your origin.
