Level 0 · Step 4 of 5

Level 0 · Reachable

Don't block the agent path

CORS support, and making sure CAPTCHAs and JS challenges don't categorically lock verified agents out.

12 min · Beginner · Auditing + light config

In this lesson

You’ve spent three lessons making the site secure and solid. This one makes sure none of that — or your host’s defaults — accidentally walls off the very agents you want to reach you.

See where your own defenses might be locking legitimate agents out
Tell the difference between blocking abuse and blocking automation
Keep challenges on sensitive actions, not on your public content
Allow cross-origin reads of public resources with CORS
Confirm a non-browser client can actually reach your pages

The Level 0 trap: protecting yourself into invisibility

Security and reachability pull in opposite directions if you’re not careful. The instinct to “keep bots out” is reasonable — but most blanket anti-bot measures don’t distinguish a scraper from a helpful AI assistant. They block all non-browser clients, and that includes the agents you want.

The Level 0 rule is simple: a legitimate agent must have a non-interactive path to your public content. If the only way in is to solve a puzzle or run a browser challenge, an agent can’t follow it — and to that agent, you don’t exist.

Note

This is the one place where “more security” can score you lower. The hardening headers from the last lesson never stand between a client and your content. A CAPTCHA wall does: it’s a locked door. The skill here is keeping the locks on the right doors.

Step 1 — Find the walls you didn’t mean to build

Most blocking isn’t something you deliberately set — it’s a default, a plugin, or a checkbox in your CDN. Go looking for these:

A CAPTCHA in front of content. Fine on a login or contact form; a problem when it gates whole pages.
“Checking your browser…” challenges. Cloudflare’s “Under Attack” mode and an over-eager Bot Fight Mode throw a JavaScript interstitial that non-browser clients can’t pass.
Blanket user-agent or IP blocking. Rules that drop anything without a browser-like user-agent, or that ban whole cloud-provider IP ranges — which is exactly where agents run.
Over-tight rate limits. Limits so aggressive that normal reads come back as 429 Too Many Requests.

The fastest way to find them is to stop browsing like a human. Fetch your key pages as a plain, non-browser client and see what comes back:

terminal · bash

# Do public pages return a 200 and real content, not a challenge?
# -L follows redirects (for example to www) so you see the final status.
curl -sL -o /dev/null -w "Status: %{http_code}\n" https://yourdomain.com
curl -sL https://yourdomain.com | grep -i "a phrase from your page"

A 200 with your real content means the path is clear. A 403, a 503, or a page full of “enable JavaScript to continue” means something is standing in the way.
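To run that check across several pages, the reading of status and body described above can be sketched as a small bash helper. This is a minimal sketch, not a definitive detector: the function names are made up for illustration, and the marker phrases are assumptions about what a challenge page typically says, so a real page that merely mentions a CAPTCHA would be flagged too.

```bash
#!/usr/bin/env bash
# Sketch: classify a fetched page as clear, challenge, rate-limited, or blocked.
# The marker phrases are assumptions; adjust them to what your host serves.

classify_response() {
  local status="$1" body="$2"
  # An interstitial can arrive with any status, so check the body first.
  if grep -qiE 'checking your browser|enable javascript|captcha' <<<"$body"; then
    echo "challenge"
    return
  fi
  case "$status" in
    200) echo "clear" ;;
    429) echo "rate-limited" ;;
    401|403|503) echo "blocked" ;;
    *) echo "unexpected ($status)" ;;
  esac
}

# Fetch a URL as a plain non-browser client and classify what came back.
check_url() {
  local url="$1" tmp status
  tmp="$(mktemp)"
  status="$(curl -s -L -o "$tmp" -w '%{http_code}' "$url")"
  echo "$url: $(classify_response "$status" "$(cat "$tmp")")"
  rm -f "$tmp"
}
```

After sourcing the file, `check_url https://yourdomain.com` prints one line per page. Anything other than "clear" is worth a closer look with the raw curl commands.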

Step 2 — Protect the actions, not the whole house

The fix is never “turn off security.” It’s to aim it. Keep your challenges exactly where abuse actually happens — login, signup, checkout, the contact form — and leave your public content freely reachable.

Rate limiting is the same idea: be generous with ordinary reads, and save the tight limits for expensive or sensitive endpoints. And if you do challenge traffic, let known-good agents through rather than fighting everything.
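One way to see whether ordinary reads trip your limits is to send a short, polite burst at your own site and count the status codes. The sketch below assumes a burst of 20 requests with a one-second pause. Those numbers are illustrative only, not a recommendation, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: send a short burst of ordinary reads and summarize the status codes.
# Burst size and pause are illustrative assumptions. Probe only your own site.

# Summarize status codes read one per line from stdin, e.g. "200 x18, 429 x2".
summarize_statuses() {
  sort | uniq -c |
    awk '{ printf "%s%s x%s", sep, $2, $1; sep = ", " } END { print "" }'
}

# Fetch the same public URL `count` times, pausing between requests.
probe_reads() {
  local url="$1" count="${2:-20}" pause="${3:-1}" i
  for ((i = 0; i < count; i++)); do
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
    sleep "$pause"
  done | summarize_statuses
}
```

Running `probe_reads https://yourdomain.com` and seeing any 429 in the summary suggests the limit on plain reads is tighter than a normal reader needs.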

You don’t have to guess who the good agents are. The major crawler operators document their user-agents, and in many cases their IP ranges, precisely so you can allow and verify them: OpenAI documents GPTBot, OAI-SearchBot, and ChatGPT-User; Anthropic documents ClaudeBot; Google documents Googlebot. The catch is that a user-agent string is trivial to fake, so never trust it alone. Verify a request with a reverse-DNS lookup (confirmed by a forward lookup that resolves back to the same IP) or by matching it against the operator’s published IP ranges before you treat it as the real thing.
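Both verification routes can be sketched in bash. This is a rough sketch under stated assumptions: the helper names are made up, the example range 192.0.2.0/24 is a reserved documentation block rather than any operator's real range, only IPv4 is handled, and `verify_rdns` assumes the `host` utility is installed and prints its usual "domain name pointer" and "has address" lines.

```bash
#!/usr/bin/env bash
# Sketch: two checks for a claimed crawler IP (IPv4 only).
# Ranges and names in examples are placeholders, not real operator data.

# Convert a dotted IPv4 address to a 32-bit integer (10# forces base 10).
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (10#$a << 24) | (10#$b << 16) | (10#$c << 8) | 10#$d ))
}

# Succeed if IPv4 address $1 falls inside CIDR block $2.
ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}" mask
  mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}

# Succeed if hostname $1 equals, or is a subdomain of, domain $2.
host_in_domain() {
  local name="${1%.}" domain="$2"
  [[ "$name" == "$domain" || "$name" == *".$domain" ]]
}

# Forward-confirmed reverse DNS: the PTR name must sit in the operator's
# domain and resolve back to the same IP. Needs `host` and network access.
verify_rdns() {
  local ip="$1" domain="$2" name
  name="$(host "$ip" | awk '/domain name pointer/ { print $NF; exit }')"
  [ -n "$name" ] && host_in_domain "$name" "$domain" &&
    host "$name" | awk -v ip="$ip" '$NF == ip { ok = 1 } END { exit !ok }'
}
```

For a request claiming to be Googlebot, `verify_rdns "$ip" googlebot.com` covers the reverse-DNS route. For operators that publish ranges, loop `ip_in_cidr "$ip" "$range"` over the list you downloaded from their documentation.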

Pro / agency note

On Cloudflare, the usual culprit is Bot Fight Mode — it challenges legitimate agents by default. Turn it off (or move to the more precise Super Bot Fight Mode) and lean on the Verified Bots allowlist, which lets recognized crawlers and agents pass while still stopping the junk.

Step 3 — Let public content be read cross-origin (CORS)

One more door, and it’s an invisible one. CORS — Cross-Origin Resource Sharing — is the rule that decides whether one website’s JavaScript is allowed to read a response from your site. By default, browsers block those cross-site reads unless your server opts in with a header.

This matters for the growing class of agents and tools that run inside a browser context. When they try to fetch your public page or feed from a different origin and you haven’t allowed it, the read just fails — quietly.

For genuinely public, anonymous resources, opt in. On nginx:

server block · nginx

# nginx — allow cross-origin reads of PUBLIC resources
add_header Access-Control-Allow-Origin "*" always;

On Apache, in the same .htaccess:

.htaccess · apache

# .htaccess — allow cross-origin reads of PUBLIC resources
Header always set Access-Control-Allow-Origin "*"

One important boundary, though — this is the place CORS goes wrong.

Common mistake

Use * only for public, non-credentialed content. The moment an endpoint relies on cookies or auth, you must not pair a wide-open origin with credentials — name specific trusted origins instead. A reflected origin plus Access-Control-Allow-Credentials: true is a real data leak, not a convenience.
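The dangerous combination can be spotted from the outside by sending a made-up Origin and seeing whether it comes back alongside the credentials header. The sketch below is one way to do that. The function names and the probe origin `https://probe.example` are hypothetical, and it only reads headers you pipe into it.

```bash
#!/usr/bin/env bash
# Sketch: flag the risky CORS combination in a set of response headers.
# The probe origin is a made-up value used only to test for reflection.

# Print the value of header $1 (case-insensitive) from headers on stdin.
header_value() {
  tr -d '\r' | awk -v want="$1" '
    BEGIN { want = tolower(want) ":" }
    tolower($1) == want { $1 = ""; sub(/^ +/, ""); print; exit }'
}

# Classify CORS headers on stdin, given the Origin that was sent as $1.
cors_risk() {
  local probe="$1" headers acao acac
  headers="$(cat)"
  acao="$(header_value access-control-allow-origin <<<"$headers")"
  acac="$(header_value access-control-allow-credentials <<<"$headers")"
  if [ -z "$acao" ]; then
    echo "no-cors"
  elif [ "$acao" = "$probe" ] && [ "$acac" = "true" ]; then
    echo "unsafe: reflects origin with credentials"
  elif [ "$acao" = "*" ]; then
    echo "public-wildcard"
  else
    echo "restricted: $acao"
  fi
}
```

A typical call pipes curl into it: `curl -s -I -H "Origin: https://probe.example" https://yourdomain.com | cors_risk https://probe.example`. On public pages "public-wildcard" is the expected answer. On any endpoint that uses cookies or auth, an "unsafe" result is the leak described above.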

With the public path open and the sensitive bits still guarded, do the final check.

Verify it worked

Two confirmations. First, a non-browser client gets a clean 200 and your real content (the curl from Step 1). Second, the CORS header is actually present for a cross-origin reader:

bash

# Is the CORS header coming back for an outside origin?
curl -s -I -H "Origin: https://example.com" https://yourdomain.com | grep -i access-control-allow-origin

Then let the scanner confirm reachability from the outside, the way an agent would see it.

Run the free OARS scanner

In short, four things keep your public content reachable to legitimate agents while the sensitive bits stay guarded.

Recap

A legitimate agent needs a non-interactive path to your public content.
Audit for CAPTCHAs, JS challenges, UA/IP blocks, and harsh rate limits.
Challenge sensitive actions; leave public content reachable. Allow verified bots.
Open CORS for public resources only — never wide-open plus credentials.

Resources

knov.ai — OARS Standard · Level 0 (Reachable) — the official spec section for this level, with the exact, testable requirements this lesson maps to
MDN — Cross-Origin Resource Sharing (CORS) — the authoritative reference on CORS headers and when to name origins versus open them up
Cloudflare — Verified bots — let recognized agents through while blocking junk
OpenAI — crawlers & bots (GPTBot, OAI-SearchBot, ChatGPT-User) — the user agents OpenAI crawlers send, so you can identify and allow them deliberately
Anthropic — ClaudeBot & crawler control — how Anthropic crawls the web and how to allow or block ClaudeBot at your site
Google — verifying Googlebot & crawlers — how to confirm a request is really Googlebot before trusting it, by reverse DNS

You can now tell, with two quick checks, whether a legitimate agent can actually reach your public content — and you have closed the gaps that quietly turn agents away, from CAPTCHAs and JS challenges to UA blocks and credentialed CORS misconfigurations. That completes every build step in Level 0: secure, up, hardened, and open to agents. Next is the Recap, where we tie it together and turn it into something you can sell.

Knowledge check

1. What is the core Level 0 rule about agent access?

Serve agents a stripped-down simplified version of each page
Block all non-browser clients from public pages by default
Require agents to register an account before reading pages
Give legitimate agents a non-interactive path to content

2. Why are blanket anti-bot measures a problem for agents?

They block all non-browser clients, not just scrapers
They expose sensitive endpoints to the public
They slow the server during peak traffic
They strip security headers from responses

3. Where should you keep challenges like CAPTCHAs, according to the lesson?

Only on pages that load slowly
On sensitive actions like login and checkout
On every page to maximize protection
On public content but not on forms

4. What does a CORS header decide?

Which user-agent strings are permitted to connect
Which TLS versions a connecting client may negotiate
Whether another site’s JavaScript may read your response
How long a browser is allowed to cache your pages

5. When is using Access-Control-Allow-Origin: * a dangerous mistake?

When the response is cached by a CDN
When the resource is genuinely public
When the site already sends HSTS
When the endpoint relies on cookies or auth
