OARS Practitioner Course
Level 0 · Step 4 of 5

Level 0 · Reachable

Don't block the agent path

CORS support, and making sure CAPTCHAs and JS challenges don't categorically lock verified agents out.


12 min · Beginner · Auditing + light config

In this lesson

You’ve spent three lessons making the site secure and solid. This one makes sure none of that — or your host’s defaults — accidentally walls off the very agents you want to reach you.

  • See where your own defenses might be locking legitimate agents out
  • Tell the difference between blocking abuse and blocking automation
  • Keep challenges on sensitive actions, not on your public content
  • Allow cross-origin reads of public resources with CORS
  • Confirm a non-browser client can actually reach your pages

The Level 0 trap: protecting yourself into invisibility

Security and reachability pull in opposite directions if you’re not careful. The instinct to “keep bots out” is reasonable — but most blanket anti-bot measures don’t distinguish a scraper from a helpful AI assistant. They block all non-browser clients, and that includes the agents you want.

The Level 0 rule is simple: a legitimate agent must have a non-interactive path to your public content. If the only way in is to solve a puzzle or run a browser challenge, an agent can’t follow it — and to that agent, you don’t exist.

Note

This is the one place where “more security” can score you lower. The hardening headers from the last lesson never stop a client from reading a page. A CAPTCHA wall does: it’s a locked door. The skill here is keeping the locks on the right doors.

Step 1 — Find the walls you didn’t mean to build

Most blocking isn’t something you deliberately set — it’s a default, a plugin, or a checkbox in your CDN. Go looking for these:

  • A CAPTCHA in front of content. Fine on a login or contact form; a problem when it gates whole pages.
  • “Checking your browser…” challenges. Cloudflare’s “I’m Under Attack” mode and an over-eager Bot Fight Mode serve a JavaScript interstitial that non-browser clients can’t pass.
  • Blanket user-agent or IP blocking. Rules that drop anything without a browser-like user-agent, or that ban whole cloud-provider IP ranges — which is exactly where agents run.
  • Over-tight rate limits. Limits so aggressive that normal reads come back as 429 Too Many Requests.

The fastest way to find them is to stop browsing like a human. Fetch your key pages as a plain, non-browser client and see what comes back:

terminal · bash
# Do public pages return a 200 and real content — not a challenge?
curl -s -o /dev/null -w "Status: %{http_code}\n" https://yourdomain.com
curl -s https://yourdomain.com | grep -i "a phrase from your page"

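A 200 with your real content means the path is clear. A 301 or 302 is only a redirect: rerun the check against the final URL, or add -L so curl follows it. A 403, a 503, or a page full of “enable JavaScript to continue” means something is standing in the way.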
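If you have more than a couple of pages to check, the same test can be wrapped in a small helper. This is a minimal sketch, not a definitive audit tool: the function names `classify_response` and `check_url` are made up for illustration, the two challenge phrases are simply the ones quoted in this lesson, and the URLs in the usage comment are placeholders. Phrase matching is a heuristic, so adjust it to whatever your own CDN's interstitial actually says.

```bash
#!/usr/bin/env bash
# Sketch: label a response as clear, challenged, blocked, or rate-limited.
# The phrase list below is an assumption; tune it to your own CDN's interstitial.

classify_response() {
  local status="$1" body="$2"
  case "$status" in
    429) echo "rate-limited"; return ;;
    403|503) echo "blocked"; return ;;
  esac
  # A 200 can still be a challenge page, so inspect the body as well.
  if printf '%s' "$body" | grep -qiE 'checking your browser|enable javascript to continue'; then
    echo "challenged"
  elif [ "$status" = "200" ]; then
    echo "clear"
  else
    echo "unexpected ($status)"
  fi
}

check_url() {
  local url="$1" body status
  # -L follows redirects; -w appends the final status code on its own last line.
  body="$(curl -sL -w '\n%{http_code}' "$url")"
  status="${body##*$'\n'}"   # everything after the last newline
  body="${body%$'\n'*}"      # everything before it
  printf '%s -> %s\n' "$url" "$(classify_response "$status" "$body")"
}

# Usage (placeholder URLs):
#   check_url https://yourdomain.com/
#   check_url https://yourdomain.com/pricing
```

Anything other than "clear" is worth a closer look by hand before you change a setting.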

Step 2 — Protect the actions, not the whole house

The fix is never “turn off security.” It’s to aim it. Keep your challenges exactly where abuse actually happens — login, signup, checkout, the contact form — and leave your public content freely reachable.

Rate limiting is the same idea: be generous with ordinary reads, and save the tight limits for expensive or sensitive endpoints. And if you do challenge traffic, let known-good agents through rather than fighting everything.
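To see whether ordinary reads are being throttled, you can send a small burst at one of your own pages and count the status codes that come back. The sketch below assumes two hypothetical helpers, `tally_statuses` and `burst_check`, and an arbitrary default of 20 requests; none of these come from a standard tool. Run it only against a site you operate.

```bash
#!/usr/bin/env bash
# Sketch: count how many responses of each status code a short burst produces.

tally_statuses() {
  # Reads one status code per line on stdin, prints "<code> <count>" sorted by code.
  awk '{ count[$1]++ } END { for (code in count) print code, count[code] }' | sort
}

burst_check() {
  local url="$1" n="${2:-20}" i
  for (( i = 0; i < n; i++ )); do
    # Print only the status code for each request.
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
  done | tally_statuses
}

# Usage (placeholder URL, your own site only):
#   burst_check https://yourdomain.com/ 20
```

If a modest burst like this already produces 429 lines for a public page, the limit is probably tighter than a normal reader needs.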

You don’t have to guess who the good agents are. The major operators document their crawlers so you can allow and verify them: OpenAI documents GPTBot, OAI-SearchBot, and ChatGPT-User; Anthropic documents ClaudeBot; Google documents Googlebot. The catch is that a user-agent string is trivial to fake, so never trust it alone. Before you treat a request as the real thing, verify it with whatever method the operator documents: a reverse-DNS lookup confirmed by a forward lookup back to the same IP, or a match against the operator’s published IP ranges.
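Matching a request's IP against a published range is simple arithmetic. Here is a minimal IPv4-only sketch in plain bash. It assumes you have already saved the operator's ranges as one CIDR per line in a file of your own (called `ranges.txt` here, a hypothetical name); the function names are illustrative, and the addresses in the usage comment are reserved documentation ranges, not real crawler addresses.

```bash
#!/usr/bin/env bash
# Sketch: check whether an IPv4 address falls inside any CIDR range in a file.
# IPv4 only; operators that publish IPv6 ranges need a different tool.

ip_to_int() {
  # Convert dotted-quad "a.b.c.d" to a single integer.
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

ip_in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}"
  # Build the netmask, e.g. /24 -> 0xFFFFFF00.
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

ip_in_any_range() {
  local ip="$1" ranges_file="$2" cidr
  while read -r cidr; do
    [ -n "$cidr" ] && ip_in_cidr "$ip" "$cidr" && return 0
  done < "$ranges_file"
  return 1
}

# Usage (documentation addresses as stand-ins):
#   ip_in_any_range 192.0.2.77 ranges.txt && echo "inside a published range"
```

A match only tells you the address belongs to the operator; keep the ranges file fresh, because operators update their lists.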

Pro / agency note

On Cloudflare, the usual culprit is Bot Fight Mode — it challenges legitimate agents by default. Turn it off (or move to the more precise Super Bot Fight Mode) and lean on the Verified Bots allowlist, which lets recognized crawlers and agents pass while still stopping the junk.

Step 3 — Let public content be read cross-origin (CORS)

One more door, and it’s an invisible one. CORS — Cross-Origin Resource Sharing — is the rule that decides whether one website’s JavaScript is allowed to read a response from your site. By default, browsers block those cross-site reads unless your server opts in with a header.

This matters for the growing class of agents and tools that run inside a browser context. When they try to fetch your public page or feed from a different origin and you haven’t allowed it, the read just fails — quietly.

For genuinely public, anonymous resources, opt in. On nginx:

server block · nginx
# nginx — allow cross-origin reads of PUBLIC resources
add_header Access-Control-Allow-Origin "*" always;

On Apache, in the same .htaccess:

.htaccess · apache
# .htaccess: allow cross-origin reads of PUBLIC resources
# (the Header directive requires mod_headers to be enabled)
Header always set Access-Control-Allow-Origin "*"

One important boundary, though — this is the place CORS goes wrong.

Common mistake

Use * only for public, non-credentialed content. The moment an endpoint relies on cookies or auth, you must not pair a wide-open origin with credentials — name specific trusted origins instead. A reflected origin plus Access-Control-Allow-Credentials: true is a real data leak, not a convenience.
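You can test for that dangerous pairing from the outside. The sketch below reads response headers on stdin and reports what they imply; the helper names `header_value` and `cors_verdict`, the probe origin `https://untrusted.example`, and the endpoint in the usage comment are all assumptions for illustration. The idea is to send an origin you would never allow: if the server echoes it back alongside credentials, it is reflecting arbitrary origins.

```bash
#!/usr/bin/env bash
# Sketch: inspect response headers for the "reflected origin + credentials" pattern.

header_value() {
  # Print the value of the first header named $1 (case-insensitive) from stdin.
  awk -v name="$1" '
    {
      sub(/\r$/, "")                      # strip the CR that HTTP lines end with
      i = index($0, ":")
      if (i > 0 && tolower(substr($0, 1, i - 1)) == tolower(name)) {
        value = substr($0, i + 1)
        sub(/^[ \t]+/, "", value)
        print value
        exit
      }
    }'
}

cors_verdict() {
  # $1 is the Origin value that was sent with the request.
  local sent_origin="$1" headers origin creds
  headers="$(cat)"
  origin="$(printf '%s\n' "$headers" | header_value access-control-allow-origin)"
  creds="$(printf '%s\n' "$headers" | header_value access-control-allow-credentials)"
  if [ -z "$origin" ]; then
    echo "no CORS header"
  elif [ "$creds" = "true" ] && [ "$origin" = "$sent_origin" ]; then
    echo "RISK: origin reflected with credentials"
  elif [ "$origin" = "*" ]; then
    echo "public wildcard"
  else
    echo "restricted to $origin"
  fi
}

# Usage (hypothetical endpoint and probe origin):
#   curl -s -I -H "Origin: https://untrusted.example" https://yourdomain.com/account \
#     | cors_verdict https://untrusted.example
```

On public pages you want "public wildcard"; on anything behind a login you want "no CORS header" or "restricted to" an origin you recognise.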

With the public path open and the sensitive bits still guarded, do the final check.

Verify it worked

Two confirmations. First, a non-browser client gets a clean 200 and your real content (the curl from Step 1). Second, the CORS header is actually present for a cross-origin reader:

bash
# Is the CORS header coming back for an outside origin?
curl -s -I -H "Origin: https://example.com" https://yourdomain.com | grep -i access-control-allow-origin

Then let the scanner confirm reachability from the outside, the way an agent would see it.

That’s the agent path clear. Here’s the takeaway.

Recap

  • A legitimate agent needs a non-interactive path to your public content.
  • Audit for CAPTCHAs, JS challenges, UA/IP blocks, and harsh rate limits.
  • Challenge sensitive actions; leave public content reachable. Allow verified bots.
  • Open CORS for public resources only — never wide-open plus credentials.

Worth bookmarking as you go:

§Level 0 · Reachable — read the exact requirements

That’s every build step in Level 0 — secure, up, hardened, and open to agents. Next is the Recap, where we tie it together and turn it into something you can sell.

Knowledge check

1. What is the core Level 0 rule about agent access?
2. Why are blanket anti-bot measures a problem for agents?
3. Where should you keep challenges like CAPTCHAs, according to the lesson?
4. What does a CORS header decide?
5. When is using Access-Control-Allow-Origin: * a dangerous mistake?