Show HN: CLI to score AI prompts after a prod failure (opens in new tab)

(costguardai.io)

1 pointstechcam2mo ago1 comments

About six months ago I shipped a customer-facing feature where the system prompt had a subtle ambiguity in the instruction hierarchy. Within two days, users found a natural-language path that caused the model to ignore the safety constraint entirely.

It wasn’t a jailbreak — just phrasing I hadn’t anticipated. The prompt looked fine. It passed code review. It failed in production.

That made me realize how little tooling exists between “write a prompt” and “ship it.”

We have linters for code. We have type checkers. We have static analysis.

For prompts, we mostly have vibes.

So I built CostGuardAI.

npm install -g @camj78/costguardai costguardai analyze my-prompt.txt

It analyzes prompts across a few structural risk dimensions: - jailbreak / prompt injection surface - instruction hierarchy ambiguity - under-constrained outputs (hallucination risk) - conflicting directives - token cost + context usage

It outputs a CostGuardAI Safety Score (0–100, higher = safer) and shows what’s driving the risk.

Example:

CostGuardAI Safety Score: 58 (Warning)

Top Risk Drivers: - instruction ambiguity - missing output constraints - unconstrained role scope

The scoring isn’t trying to predict every failure — it’s closer to static analysis: catching structural patterns that correlate with prompts breaking in production.

If you want to see output before installing: https://costguardai.io/report/demo https://costguardai.io/benchmarks

I’m a solo founder and this is still early, but it’s already caught real issues in my own prompts.

Curious what HN thinks — especially from people working on prompt evals or LLM safety tooling.

1 comments

techcamOP2mo ago

Happy to explain how the scoring works since that’s the obvious first question.

The core idea is:

Safety Score = 100 − riskScore

The risk score is based on structural prompt properties that tend to correlate with failures in production systems:

- instruction hierarchy ambiguity - conflicting directives (system vs user) - missing output constraints - unconstrained response scope - token cost / context pressure

Each factor contributes a weighted amount to the total risk score.

It’s not trying to predict exact model behavior — that’s not possible statically.

The goal is closer to a linter: flagging prompt structures that are more likely to break (injection, hallucination drift, ignored constraints, etc).

There’s also a lightweight pattern registry. If a prompt matches structural patterns seen in real jailbreak/injection cases (e.g. authority ambiguity), the score increases.

One thing that surprised me while building it: instruction hierarchy ambiguity caused more real-world failures than obvious injection patterns.

The CLI runs locally — no prompts are sent anywhere.

If you want to try it:

npm install -g @camj78/costguardai costguardai analyze your-prompt.txt

Curious what failure modes others here have seen in production prompts.

j / k navigate · click thread line to collapse