Methodology

How we score OpenAPI specs for AI agent readiness

The principle, the dimensions, the weights, and what the score deliberately does not measure.


What the score measures

The AgenticScore is a single number, 0 to 100, that estimates how reliably an AI agent can call an API given only its OpenAPI specification. It does not measure the API's design, performance, security, or business value. It measures one thing: how much context the spec gives an agent at the moment it constructs a call.

We chose this scope deliberately. Agent tooling such as LangChain, OpenAI function calling, and MCP consumes the OpenAPI spec at runtime to build its tool schemas. The spec is what the model actually sees. Everything else (the docs site, the SDK, the support team) exists outside that context window.


The five-stage failure chain

An agent's API call passes through five stages. Each is an opportunity to fail.

  1. Discovery. Does the agent know this endpoint exists?
  2. Selection. Does it pick this endpoint over the alternatives?
  3. Construction. Does it build a valid request body and parameters?
  4. Interpretation. Does it understand the response?
  5. Recovery. When the call fails, can it recover or retry intelligently?

Every dimension we measure addresses one or more of these stages. Every dimension we don't measure either falls outside this chain or is too noisy to score reliably at scale. We document both.


The weighting principle

Each rule is weighted by how unrecoverable its absence is.

If an agent can compensate for a missing signal by reading sibling signals (other parameters, schema names, path structure), the weight is lower. If the missing signal causes immediate failure with no fallback, the weight is higher.

Concretely: missing examples force the agent to guess request shape from schema alone, and OpenAPI schemas are often loose. A field typed as object with no further constraint passes spec validation while telling the agent almost nothing about what to send. An example pins down what the schema leaves vague. There is no sibling signal that fills that gap. Missing pagination parameters, by contrast, affect only list endpoints, and the agent can often retry with common patterns (limit, cursor, page). The first is unrecoverable. The second is recoverable. The weights reflect that.
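
To make the loose-schema point concrete, here is a minimal Python sketch. The schema fragment, the field names (customer, metadata), and the helper unconstrained_fields are hypothetical illustrations, not part of the scoring library.

```python
# Hypothetical request-body schema, written as a plain dict
# in the shape of an OpenAPI Schema Object.
schema = {
    "type": "object",
    "properties": {
        "customer": {"type": "string"},
        "metadata": {"type": "object"},  # valid, but says nothing about its shape
    },
}

# Hypothetical example of the kind an operation could carry alongside the schema.
example = {"customer": "cus_123", "metadata": {"order_id": "A-1001"}}


def unconstrained_fields(schema: dict) -> list[str]:
    """Return top-level properties typed as object with no declared structure."""
    loose = []
    for name, prop in schema.get("properties", {}).items():
        if prop.get("type") == "object" and not prop.get("properties"):
            loose.append(name)
    return loose


print(unconstrained_fields(schema))  # ['metadata']
```

The schema alone leaves metadata open-ended. The example is the only place an agent learns that it is expected to hold something like an order_id.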

This is a stated, falsifiable principle. If you disagree with a specific weight, the right argument is "this signal is more (or less) recoverable than the framework claims" and we'll listen.


The rules and their weights

Eleven rules across six categories. Weights sum to 120 points. The overall score is normalized to 0–100.

| Rule | Category | Stage | Weight | Why this weight |
|---|---|---|---|---|
| Operation examples | Examples | Construction | 25 | Highest weight. Without an example, the agent guesses the request body from schema definitions, and schemas often underspecify on purpose. Fields typed as object or any pass linting while telling the agent nothing about expected structure. An example pins down what the schema leaves loose. |
| Standard error responses | Errors | Recovery | 20 | Second-highest. Without documented errors, the agent has no recovery vocabulary. Multi-step agents cascade silently when a single step fails unexpectedly. |
| Operation descriptions | Semantics | Selection | 18 | Critical for endpoint selection. Partial fallback exists via operationId and parameter names, but only if those are also good. |
| Parameter descriptions | Parameters | Construction | 12 | "limit" is obvious. "q", "after", "expand" are not. Some recovery from common naming conventions, so weight is moderate. |
| Schema examples | Examples | Construction | 10 | Schemas can be reconstructed from their definitions, imperfectly. Examples speed up that reconstruction. Lower weight than operation examples because schemas are typically referenced from multiple operations. |
| Schema descriptions | Semantics | Selection / Interpretation | 8 | Field semantics can usually be inferred from names. Descriptions help, especially for ambiguous fields (status, type, kind). |
| Operation IDs | Intent | Selection | 8 | Helps endpoint selection. Recoverable from path, method, and description if any of those are present. |
| Operation summaries | Semantics | Discovery | 5 | Useful for fast classification. Fully redundant with description if one exists. |
| Pagination documentation | Pagination | Construction | 5 | Narrow scope: applies only to list operations. Recoverable via standard parameter heuristics (limit, offset, cursor). |
| RFC 9457 Problem Detail | Errors | Recovery | 5 | Standardized error format. Nice-to-have. Custom error schemas work fine if documented (which the standard-codes rule already measures). |
| Tags | Intent | Discovery | 4 | Helpful for clustering related operations. Recoverable from path namespacing. |

The full rule implementations are in the scoring library. Open an issue if a weight feels wrong.
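
As a rough illustration of how the weights could combine into a 0 to 100 score, here is a minimal Python sketch. It is not the scoring library's implementation. It assumes each rule earns its weight multiplied by a coverage fraction between 0.0 and 1.0, and the names RULE_WEIGHTS, TOTAL_WEIGHT, and overall_score are hypothetical.

```python
# Rule weights copied from the table above.
RULE_WEIGHTS = {
    "Operation examples": 25,
    "Standard error responses": 20,
    "Operation descriptions": 18,
    "Parameter descriptions": 12,
    "Schema examples": 10,
    "Schema descriptions": 8,
    "Operation IDs": 8,
    "Operation summaries": 5,
    "Pagination documentation": 5,
    "RFC 9457 Problem Detail": 5,
    "Tags": 4,
}

TOTAL_WEIGHT = sum(RULE_WEIGHTS.values())  # 120


def overall_score(coverage: dict[str, float]) -> float:
    """Weighted sum of per-rule coverage, normalized to 0-100.

    Assumption: a rule earns weight * coverage, where coverage is the
    fraction (0.0 to 1.0) of applicable items that satisfy the rule.
    Rules missing from `coverage` count as 0.
    """
    earned = sum(w * coverage.get(rule, 0.0) for rule, w in RULE_WEIGHTS.items())
    return 100 * earned / TOTAL_WEIGHT


# Hypothetical spec: everything documented except operation examples.
coverage = {rule: 1.0 for rule in RULE_WEIGHTS}
coverage["Operation examples"] = 0.0
print(round(overall_score(coverage), 1))  # 79.2
```

Under this assumed formula, dropping the single highest-weighted rule costs 25 of 120 points, which is why the sample spec lands at about 79 rather than 75.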


Category totals

Aggregating the rules into the six categories you see on the leaderboard and detail pages:

| Category | Sum of rule weights | Share of total |
|---|---|---|
| Examples | 35 | 29% |
| Semantics | 31 | 26% |
| Errors | 25 | 21% |
| Parameters | 12 | 10% |
| Intent | 12 | 10% |
| Pagination | 5 | 4% |
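
These totals can be checked with a few lines of Python. The sketch below uses the rule-to-category assignments from the rules table. The names RULES and category_totals are illustrative, not taken from the scoring library.

```python
from collections import defaultdict

# (rule, category, weight) triples copied from the rules table.
RULES = [
    ("Operation examples", "Examples", 25),
    ("Standard error responses", "Errors", 20),
    ("Operation descriptions", "Semantics", 18),
    ("Parameter descriptions", "Parameters", 12),
    ("Schema examples", "Examples", 10),
    ("Schema descriptions", "Semantics", 8),
    ("Operation IDs", "Intent", 8),
    ("Operation summaries", "Semantics", 5),
    ("Pagination documentation", "Pagination", 5),
    ("RFC 9457 Problem Detail", "Errors", 5),
    ("Tags", "Intent", 4),
]


def category_totals(rules):
    """Sum rule weights per category."""
    totals = defaultdict(int)
    for _rule, category, weight in rules:
        totals[category] += weight
    return dict(totals)


totals = category_totals(RULES)
grand_total = sum(totals.values())

# Print categories from heaviest to lightest with their rounded share.
for category, points in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{category:<11}{points:>3}  {points / grand_total:.0%}")
```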

Grade boundaries

  • A: 90 to 100
  • B: 75 to 89
  • C: 60 to 74
  • D: 40 to 59
  • F: below 40

Strict on purpose. A passing grade should mean an agent can rely on the spec, not that the spec exists.
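
The boundaries translate directly into a lookup. This is a minimal Python sketch with a hypothetical function name. The boundaries above are stated in whole numbers, so the handling of fractional scores is an assumption: here a 89.5 stays a B because it has not reached 90.

```python
def grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the published boundaries."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    if score >= 40:
        return "D"
    return "F"


for s in (100, 90, 89, 75, 60, 40, 39):
    print(s, grade(s))
```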


What this score does not measure

We are loudly explicit about the limits.

The spec is not the runtime

An API can have a perfect OpenAPI spec and still behave differently in production: undocumented rate limits, regional latency, soft deprecation, idempotency edge cases. The score measures the spec as written, not the API as run.

Authentication and security are not graded

How an API handles OAuth flows, API key rotation, or scope granularity is critical for real agents but extremely hard to evaluate from spec alone. We will likely add an authentication dimension in a future version. It is not in v1.

Breaking change history is not graded

An API with a frozen spec scores the same as one with weekly breaking changes. Version stability matters to agents but it is a property of the API's history, not its current spec.

Spec freshness is not weighted

A six-month-old spec scores the same as one updated yesterday. We capture specVersion and the date we scored it. Readers can judge for themselves whether the score is current.

Operation count is not penalized

A 5-operation API with 100% example coverage scores the same as a 5,000-operation API with 100%. We report absolute counts in findings so readers can judge severity, but the score itself is percentage-based. We considered penalizing scale and decided against it for v1. The argument for is real (larger established APIs deserve more scrutiny). The argument against is also real (small APIs would always look better in relative terms even when the experience is identical). If we change this, it will be loud.
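
A short Python sketch of what "percentage-based" means in practice. The coverage helper is hypothetical, and treating a spec with zero applicable operations as an error is an assumption.

```python
def coverage(documented: int, total: int) -> float:
    """Fraction of operations that satisfy a rule."""
    if total <= 0:
        raise ValueError("total must be positive")
    return documented / total


# Same percentage, same contribution to the score, regardless of API size.
print(coverage(5, 5) == coverage(5_000, 5_000))  # True

# Same 90% coverage, very different absolute gap. The gap is what
# findings report so readers can judge severity.
small_gap = 10 - 9          # 1 operation without an example
large_gap = 5_000 - 4_500   # 500 operations without an example
print(coverage(9, 10) == coverage(4_500, 5_000), small_gap, large_gap)  # True 1 500
```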

Quality of examples is not measured

A schema with a placeholder example ({"foo": "bar"}) scores the same as one with a realistic, complete example. We measure presence, not quality, because quality is subjective and gameable.
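
A presence check of this kind might look like the following Python sketch. The function name is hypothetical, and the actual rule may look in more places than the example and examples keys shown here.

```python
def has_example(node: dict) -> bool:
    """True if an OpenAPI object carries `example` or a non-empty `examples`."""
    return "example" in node or bool(node.get("examples"))


placeholder = {"type": "object", "example": {"foo": "bar"}}
realistic = {
    "type": "object",
    "example": {"customer": "cus_123", "amount": 1999, "currency": "usd"},
}
missing = {"type": "object"}

# Presence is all that counts: the placeholder passes exactly like the realistic one.
print(has_example(placeholder), has_example(realistic), has_example(missing))
# True True False
```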


How to challenge a score

If a score looks wrong, there are three possibilities:

  1. The spec we scored may not be your current spec. Specs change. We show the specVersion and date on every detail page so this is verifiable.
  2. A specific rule may be misfiring. The rule implementations are open source. File an issue with a reproduction case.
  3. The weights may be wrong. We can be argued with. The right argument is structural: "this signal is more (or less) recoverable than the framework claims, here's why." We will update and credit.

Why publish a methodology at all

Most scoring tools that grade other people's work either hide their methodology or hand-wave it. We publish ours because the alternative is asking readers to trust a number with no audit trail. A defensible score is one you can attack in writing.

If the post that brought you here is provocative, this page is the receipts.


Score your own spec

The CLI is free and runs locally. The API is paid.

npx agenticscore score ./openapi.yaml