
We graded 7 major APIs for AI-agent readiness. Most failed.

Stripe's OpenAPI spec contains zero examples. Not "few." Zero, across 587 operations and 1,422 schemas. Error responses are documented on none of those operations.

Stripe's hand-written documentation, the one engineers cite as the gold standard, contains both examples and error documentation. The machine-readable spec does not.

This matters because the documentation an AI agent sees when it calls Stripe isn't the docs site. It's the spec.

Spec, not docs

The major LLMs were trained on the public internet. Ask a fresh chat session "how do I create a Stripe Charge" and the answer will be roughly correct, because Stripe's docs are baked into the model weights.

Wire that same model up as an agent and the picture changes. Agent frameworks (LangChain, OpenAI function calling, MCP, anything else) typically build the toolset at runtime from the OpenAPI spec. The function names, descriptions, parameters, and examples the model sees in its context window come from there, not from the docs.
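To make that concrete, here is a minimal sketch of how a tool definition might be derived from a single OpenAPI operation. The function `operation_to_tool`, the output shape, and the sample operation are all hypothetical; real frameworks differ in the details, but the point carries: whatever the spec omits, the model never sees.

```python
# Hypothetical sketch: turn one OpenAPI operation into a generic tool definition.
# The output shape is illustrative and not any specific framework's format.

def operation_to_tool(path: str, method: str, operation: dict) -> dict:
    """Build a tool definition using only what the spec provides."""
    properties = {}
    required = []
    for param in operation.get("parameters", []):
        schema = dict(param.get("schema", {}))
        # Copied verbatim: if the spec has no description, the model gets none.
        schema["description"] = param.get("description", "")
        properties[param["name"]] = schema
        if param.get("required"):
            required.append(param["name"])
    return {
        "name": operation.get("operationId", f"{method}_{path}"),
        "description": operation.get("description") or operation.get("summary", ""),
        "parameters": {"type": "object", "properties": properties, "required": required},
    }

# A made-up operation, written the way a thin spec would describe it.
thin_operation = {
    "operationId": "v1_emails_post_handler",
    "parameters": [{"name": "q", "in": "query", "schema": {"type": "string"}}],
}

tool = operation_to_tool("/v1/emails", "post", thin_operation)
print(tool)
```

The resulting tool has an opaque name, an empty description, and an undescribed parameter, because that is all the operation contained.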

So when an agent decides between POST /v1/charges and POST /v1/payment_intents, what to put in the body, and how to interpret the response, it's leaning on the spec. The richer the spec, the more reliable the call. The thinner the spec, the more the agent guesses.

One thing worth being precise about: schemas alone are not enough. OpenAPI schemas are often loose by design. A field typed as object with no further constraint passes spec validation while telling the agent nothing about what to send. An example pins down what the schema leaves vague. This is a recurring pattern in what follows.
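A small illustration of that gap, using an invented schema and a deliberately tiny checker (`conforms` handles only `type: object` and `properties`; it is not a full validator). Very different payloads all satisfy the loose schema, while the attached example shows which shape is actually intended.

```python
# Hypothetical request-body schema: valid OpenAPI, but it constrains almost nothing.
loose_schema = {"type": "object", "properties": {"metadata": {"type": "object"}}}

# The same schema with an example attached (values are invented for illustration).
schema_with_example = {
    **loose_schema,
    "example": {"metadata": {"order_id": "6735", "channel": "web"}},
}

def conforms(value, schema) -> bool:
    """Tiny checker covering only 'type: object' and 'properties'."""
    if schema.get("type") == "object":
        if not isinstance(value, dict):
            return False
        return all(
            conforms(value[key], sub_schema)
            for key, sub_schema in schema.get("properties", {}).items()
            if key in value
        )
    return True  # anything this sketch does not understand is accepted

# All three payloads pass, so the schema alone cannot say which one is intended.
candidates = [
    {"metadata": {}},
    {"metadata": {"anything": [1, 2, 3]}},
    schema_with_example["example"],
]
for payload in candidates:
    print(conforms(payload, loose_schema), payload)
```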

We ran the public OpenAPI specs of seven well-known APIs through AgenticScore, our six-dimension scoring engine that evaluates examples, semantic clarity, error handling, intent signals, parameter documentation, and pagination. Each rule is weighted by how unrecoverable its absence is at the moment an agent constructs a call. The full methodology is here, including every weight and why.

The leaderboard

Sorted by overall score, out of 100:

#  API     Score  Grade
1  Plaid   63     C
2  Twilio  58     D
3  Vercel  58     D
4  GitHub  55     D
5  Stripe  37     F
6  OpenAI  32     F
7  Resend  19     F

No A's. No B's. One C, three D's, three F's. Both Stripe and OpenAI fail outright.

The Stripe paradox

Stripe's docs are the reference founders cite when they describe what good developer documentation looks like. Stripe's spec scores 37, an F.

The disconnect lives entirely in the metadata. Stripe's spec contains zero examples and zero documented error responses. On top of that, 1,149 of its 1,422 schemas lack a description, and no RFC 9457 Problem Details format is detected anywhere in the spec.

This isn't Stripe being careless. It's symptomatic. Most API teams treat the OpenAPI spec as a downstream artifact, generated from code so SDK generators can do their job. The richness lives in the hand-written docs. Those docs are for humans reading at design time. The spec is for agents calling at runtime. The same content needs to live in both places, and almost nowhere does it.

The OpenAI inversion

OpenAI scored 32, an F. Examples: 1 out of 100. Error handling: 5 out of 100.

241 of 242 operations lack request and response examples. Most omit error documentation. 223 of 242 have missing or very short descriptions. Its scores for parameter documentation (97/100) and pagination (98/100) are excellent, but that helps little when the agent can't figure out what the payload should look like in the first place.

The company building the LLMs ships an OpenAPI spec that LLMs struggle with. That's a mirror, not a gotcha. The same teams building AI products inherit the same docs-first culture as everyone else. The spec lags.

If you're building an agent

1. Assume the spec is incomplete

Your framework is almost certainly loading the OpenAPI spec as the source of truth for what's callable. Missing examples means the agent is missing the context an LLM relies on most to understand structure. Missing errors means the agent has no recovery vocabulary when something fails.

Patch it in your prompt. Hand-write canonical examples for the operations you actually use and inject them into context. Don't trust the spec to do it for you.
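One way to do that is to append a hand-written example to each tool description before the tools reach the model. The sketch below assumes tools are plain dicts with `name` and `description` keys; `CANONICAL_EXAMPLES`, `patch_tool`, and the sample payload are hypothetical names and values chosen for illustration.

```python
import json

# Hand-written examples keyed by operationId. The IDs and payloads are placeholders.
CANONICAL_EXAMPLES = {
    "sendEmail": {"to": "user@example.com", "subject": "Welcome", "text": "Hello"},
}

def patch_tool(tool: dict, examples: dict) -> dict:
    """Return a copy of the tool with an example request appended to its description."""
    example = examples.get(tool["name"])
    if example is None:
        return tool  # nothing hand-written for this operation; leave it untouched
    patched = dict(tool)
    patched["description"] = (
        f"{tool.get('description', '')}\nExample request body: {json.dumps(example)}"
    ).strip()
    return patched

tools = [{"name": "sendEmail", "description": "Send an email."}, {"name": "listEmails"}]
patched_tools = [patch_tool(t, CANONICAL_EXAMPLES) for t in tools]
print(patched_tools[0]["description"])
```

Keeping the examples in a separate mapping, rather than editing the spec file itself, means they survive when the vendor ships a new spec.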

2. Score the specs you depend on

If your agent's success rate is below where you want it, the cause might not be the model or the prompt. It might be the spec. Count how many operations your agent calls that lack documented errors. Each one is a place the agent will improvise. AgenticScore gives you a number to track over time.
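A rough version of that count can be done by hand. The sketch below is not how AgenticScore computes its score; it assumes the spec is already parsed into a dict, and it treats any 4xx, 5xx, or `default` response key as documented error handling, which is a simplification. The sample spec and operation names are made up.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}

def has_documented_error(operation: dict) -> bool:
    """True if any response key is 4xx, 5xx, or 'default' (assumed to cover errors)."""
    for code in operation.get("responses", {}):
        code = str(code).lower()
        if code == "default" or code.startswith(("4", "5")):
            return True
    return False

def operations_missing_errors(spec: dict, used_ids: set) -> list:
    """List the operationIds the agent uses that document no error response."""
    missing = []
    for path_item in spec.get("paths", {}).values():
        for method, operation in path_item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            op_id = operation.get("operationId")
            if op_id in used_ids and not has_documented_error(operation):
                missing.append(op_id)
    return missing

# A made-up two-operation spec, already parsed into a dict.
sample_spec = {
    "paths": {
        "/emails": {
            "post": {"operationId": "sendEmail", "responses": {"200": {}}},
            "get": {"operationId": "listEmails", "responses": {"200": {}, "4XX": {}}},
        }
    }
}

missing = operations_missing_errors(sample_spec, {"sendEmail", "listEmails"})
print(f"{len(missing)} of 2 operations lack documented errors: {missing}")
```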

3. Pin your spec version

The Stripe spec we scored (2026-05-27.dahlia) is not the spec your agent will see next month. As vendors invest in OpenAPI quality, scores will move. Pin the version your agent uses so you control when behavior changes.
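Pinning can be as simple as vendoring the spec file and refusing to start if its contents change. A minimal sketch, assuming the spec is available as raw bytes; `spec_fingerprint`, `assert_pinned`, and the stand-in spec are hypothetical.

```python
import hashlib

def spec_fingerprint(spec_bytes: bytes) -> str:
    """SHA-256 of the raw spec file, used as a pin independent of version labels."""
    return hashlib.sha256(spec_bytes).hexdigest()

def assert_pinned(spec_bytes: bytes, pinned_fingerprint: str) -> None:
    """Raise if the spec is not byte-for-byte the one the agent was tested against."""
    actual = spec_fingerprint(spec_bytes)
    if actual != pinned_fingerprint:
        raise RuntimeError(
            f"Spec changed: expected {pinned_fingerprint[:12]}, got {actual[:12]}"
        )

# Stand-in bytes; in practice these would be read from the vendored spec file.
reviewed_spec = b'{"openapi": "3.0.0", "info": {"version": "example"}}'
PINNED = spec_fingerprint(reviewed_spec)  # recorded once, when the spec is reviewed

assert_pinned(reviewed_spec, PINNED)  # passes silently when nothing has changed
```

Hashing the bytes rather than trusting the version string catches silent edits that ship under an unchanged version label.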

If you maintain an API

Your spec is in the same shape as everyone else's. The gap is cheaper to close than you'd think.

In priority order:

  1. Add examples to your top 10 to 20 operations. Not every operation. The ones agents will actually call. Your hand-written docs probably already contain the examples. Mirror them into the spec.
  2. Document error responses on every operation. Adopt RFC 9457 Problem Details while you're there. Agents need to recover from failure gracefully or they burn user trust on every glitch.
  3. Replace machine-generated operationIds with verb-based names. sendEmail beats v1_emails_post_handler or create_87a2b. Agents treat these IDs as semantic map keys, so cryptic or overly rigid names force the model to guess.
  4. Describe your parameters. "limit" is obvious. "q", "after", "expand" are not. Each undescribed parameter is a coin flip.
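For item 2, an RFC 9457 error body is a small JSON object served as `application/problem+json`, with the standard members `type`, `title`, `status`, `detail`, and optionally `instance`, plus any extension members. The sketch below builds one and shows how an agent might use it to decide on a retry. The helper names, the `example.com` type URI, the `retry_after` extension member, and the retry policy are assumptions for illustration.

```python
import json

PROBLEM_JSON = "application/problem+json"  # media type defined by RFC 9457

def problem_detail(status: int, type_uri: str, title: str, detail: str, **extensions) -> dict:
    """Build a problem body from the standard members plus optional extensions."""
    body = {"type": type_uri, "title": title, "status": status, "detail": detail}
    body.update(extensions)  # RFC 9457 allows extension members
    return body

# Hypothetical rate-limit error; the type URI and retry_after member are invented.
error_body = problem_detail(
    429,
    "https://example.com/problems/rate-limited",
    "Too many requests",
    "Retry after the number of seconds given in retry_after.",
    retry_after=30,
)

def should_retry(problem: dict) -> bool:
    """Illustrative agent-side policy: retry only on 429 and 503."""
    return problem.get("status") in (429, 503)

print(PROBLEM_JSON)
print(json.dumps(error_body, indent=2))
print("retry:", should_retry(error_body))
```

A stable `type` URI gives the agent something to branch on that does not depend on parsing free-text error messages.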

None of this requires rewriting your API. It's metadata. A focused week on a 200-operation spec can move you from a D to a B.

The bar is under the bar

The interesting result here isn't that scores are low. It's that they're uniformly low, at the companies the rest of the industry models itself on. The bar everyone compares themselves to is itself below the bar.

The volume of API calls being made by AI agents will dwarf the volume made by humans within a few years. The APIs that are easy for agents to call will get called. The ones that aren't will get wrapped, routed around, or replaced.

There's room to be the first API in your category that treats its spec like agents matter.


Full per-API breakdowns: Stripe, GitHub, OpenAI, Plaid, Twilio, Vercel, Resend. The leaderboard updates as specs are re-scored. Methodology: how scoring works.


How does your API score?

Run AgenticScore on your own OpenAPI spec.

npx agenticscore score ./openapi.yaml