Allocation & Usage Enforcement

GhostPour uses dollar-based allocation instead of request counts. Each tier has a monthly cost limit, and every LLM request's cost is tracked against it.

Per-Request Enforcement

Every POST /v1/chat request passes through this pipeline:

  1. Resolve user from JWT (tier is always read from DB, never from JWT)
  2. Auto model routing — provider: "auto", model: "auto" resolves to the tier's default model
  3. Model access check — is the provider/model allowed for this tier? Are images within limit?
  4. Rate limit — in-memory token bucket per user, tier.requests_per_minute
  5. Quota check — monthly_used_usd vs monthly_cost_limit_usd
  6. Feature gating — Context Quilt and other feature checks
  7. Route to LLM — send to upstream provider
  8. Calculate cost — LiteLLM pricing with cached token discount
  9. Record cost — increment monthly_used_usd
  10. Return response with allocation headers
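The quota and cost-recording steps can be sketched as follows. This is a minimal illustration, not the actual GhostPour implementation; the Tier/User classes and function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    requests_per_minute: int
    monthly_cost_limit_usd: float


@dataclass
class User:
    tier: Tier
    monthly_used_usd: float = 0.0


def check_quota(user: User) -> None:
    """Step 5: reject before routing if the monthly allocation is spent."""
    if user.monthly_used_usd >= user.tier.monthly_cost_limit_usd:
        # Surfaced to the client as a structured HTTP 429 (see below).
        raise RuntimeError("allocation_exhausted")


def record_cost(user: User, cost_usd: float) -> None:
    """Step 9: increment monthly_used_usd after the upstream call returns."""
    user.monthly_used_usd += cost_usd
```

Note that the quota check happens before routing to the LLM, so an exhausted user never incurs upstream cost, while cost recording happens after, using the actual token counts.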

Response Headers

Every chat response includes allocation state:

X-Allocation-Percent: "45.3"   // % of monthly limit used
X-Allocation-Warning: "true"   // present when >= 80%
X-Monthly-Used: "1.1325"       // dollars used this period
X-Monthly-Limit: "2.50"        // dollars allowed this period

Your iOS app can use these to show a usage bar or warning indicator without making a separate API call.
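A client can turn those headers into UI state with a few lines. The header names are from the docs above; the helper itself is a hypothetical sketch (shown in Python for brevity, though the same logic applies in Swift).

```python
def allocation_state(headers: dict) -> dict:
    """Derive usage-bar state from the allocation response headers."""
    return {
        "percent": float(headers["X-Allocation-Percent"]),
        # X-Allocation-Warning is only present at >= 80% usage.
        "show_warning": headers.get("X-Allocation-Warning") == "true",
        "used_usd": float(headers["X-Monthly-Used"]),
        "limit_usd": float(headers["X-Monthly-Limit"]),
    }
```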

Exhausted Response

When allocation runs out, the response is a structured 429:

{
  "detail": {
    "code": "allocation_exhausted",
    "message": "Monthly allocation exhausted ($2.50/$2.50). Upgrade your plan for more hours.",
    "details": {
      "monthly_used": 2.5,
      "monthly_limit": 2.5,
      "fallback": "on_device"
    }
  }
}

Usage Reporting

GET /v1/usage/me returns everything your iOS app needs:

{
  "tier": "pro",
  "tier_display_name": "Pro",
  "allocation": {
    "monthly_limit_usd": 2.50,
    "monthly_used_usd": 0.4567,
    "monthly_remaining_usd": 2.0433,
    "percent_used": 18.3,
    "resets_at": "2026-04-23T15:30:00+00:00"
  },
  "hours": {
    "used": 9.1,
    "limit": 50.0,
    "remaining": 40.9
  },
  "this_month": {
    "requests": 42,
    "input_tokens": 15234,
    "output_tokens": 8901,
    "cached_tokens": 1200,
    "cost_usd": 0.4567
  },
  "summary_mode": "delta",
  "summary_interval_minutes": 10,
  "max_images_per_request": 2,
  "features": { "context_quilt": "enabled" },
  "is_trial": false,
  "trial_end": null
}

Hours conversion: hours = cost / model_cost_per_hour (Haiku = $0.05/hr, Sonnet = $0.19/hr). This gives users an intuitive unit instead of raw dollars.
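The conversion is a single division. Using the rates stated above, the figures in the example payload check out: $0.4567 on Haiku is 9.1 hours used, and a $2.50 limit is 50 hours.

```python
# Per-hour rates from the docs: Haiku = $0.05/hr, Sonnet = $0.19/hr.
MODEL_COST_PER_HOUR = {"haiku": 0.05, "sonnet": 0.19}


def cost_to_hours(cost_usd: float, model: str) -> float:
    """hours = cost / model_cost_per_hour"""
    return cost_usd / MODEL_COST_PER_HOUR[model]
```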

Tier-Locked Settings

When your iOS app uses GhostPour as the provider, these settings are server-controlled:

| Setting | Bring-Your-Own-Key | GhostPour Managed |
| --- | --- | --- |
| Model selection | User choice | Locked: "auto" (server picks) |
| Auto-summary interval | User choice (2-15 min) | Locked per tier: 10 or 15 min |
| Summary mode | User choice | Locked per tier: "delta" or "choice" |
| Max images per query | 5 | Locked per tier: 1-5 |
| Context Quilt | User choice | Tier-gated: enabled/teaser/off |

The iOS app reads these from the /v1/usage/me response and locks the corresponding UI controls.

Why summary intervals vary by tier

Auto-summaries are full chat requests that consume allocation. Sonnet is ~4x more expensive per token than Haiku, so Sonnet tiers default to 15-minute intervals instead of 10 to prevent users from burning through hours unexpectedly.
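The back-of-envelope arithmetic behind that choice: stretching the interval from 10 to 15 minutes cuts summaries per hour from 6 to 4, offsetting a third of Sonnet's higher per-token cost. The 4x multiplier below is the approximate ratio stated above, not an exact price.

```python
def summaries_per_hour(interval_min: int) -> int:
    """Auto-summaries fire once per interval, so 60 / interval per hour."""
    return 60 // interval_min


def relative_summary_cost(interval_min: int, token_price_multiplier: float) -> float:
    """Hourly summary cost relative to Haiku at a 10-minute interval."""
    baseline = summaries_per_hour(10) * 1.0
    return (summaries_per_hour(interval_min) * token_price_multiplier) / baseline
```

At a 15-minute interval, Sonnet summaries still cost roughly 2.7x the Haiku baseline per hour; a 10-minute interval would make that 4x.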

On-Device Fallback

When allocation is exhausted, offer a graceful degradation path: detect the allocation_exhausted 429 and honor its fallback: "on_device" hint by routing new requests to a local model.

This keeps your app functional even when the user has used all their cloud hours.
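A client-side sketch of that path, keyed off the structured 429 body shown earlier. The run_on_device callback is a placeholder for whatever local inference your app ships, and the success-path response shape is an assumption (OpenAI-style choices), not documented here.

```python
def handle_chat_response(status: int, body: dict, run_on_device) -> str:
    """Return a reply, falling back to on-device when allocation is spent."""
    detail = body.get("detail", {})
    if status == 429 and detail.get("code") == "allocation_exhausted":
        if detail.get("details", {}).get("fallback") == "on_device":
            # Cloud allocation is gone; degrade to the local model.
            return run_on_device()
    if status != 200:
        raise RuntimeError(f"chat request failed: {status}")
    # Assumed success shape; adjust to the actual /v1/chat response.
    return body["choices"][0]["message"]["content"]
```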

Future: Overage Credits

The database infrastructure for overage credit purchases exists but is not yet active:

monthly_allocation → overage_balance → 429 fallback to on-device

The overage_balance_usd column and response fields are in place. When ready, add a StoreKit consumable product for credit packs and wire up the deduction logic.