Input Pre-Filter and Guardrail System Prompt — Prompt Security, Injection & Jailbreak Mitigation Intermediate Task

The Scenario

An enterprise finance company wants to shield its core customer-facing LLM from malicious prompts, toxic inputs, and off-topic requests. Rather than making the core prompt longer, more complex, and slower, the company adopts a dual-LLM architecture: a lightweight pre-filter LLM acts as an application firewall that analyzes each user input, assigns a risk score, and either passes the input through to the core model or blocks it with a safe refusal response.
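The routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Verdict`, `handle`, `fake_prefilter`, and `fake_core_llm` are hypothetical, the 0.0 to 1.0 risk scale and the 0.5 threshold are placeholders rather than anything the scenario specifies, and the two model calls are replaced by local stand-ins so the sketch runs without an API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    risk_score: float  # assumed scale: 0.0 (benign) to 1.0 (clearly malicious)
    refusal: str       # safe refusal text returned when the input is blocked


BLOCK_THRESHOLD = 0.5  # hypothetical cut-off; would need tuning on real traffic


def handle(user_input: str,
           prefilter: Callable[[str], Verdict],
           core_llm: Callable[[str], str]) -> str:
    """Route the input through the pre-filter before it can reach the core model."""
    verdict = prefilter(user_input)
    if verdict.risk_score >= BLOCK_THRESHOLD:
        return verdict.refusal  # the core model never sees a blocked input
    return core_llm(user_input)


# Stand-ins for the two model calls (assumptions, not real model behavior).
def fake_prefilter(text: str) -> Verdict:
    risky = "ignore previous instructions" in text.lower()
    return Verdict(0.9 if risky else 0.1, "Sorry, I can't help with that request.")


def fake_core_llm(text: str) -> str:
    return f"[core model answer to: {text}]"
```

Keeping the pre-filter and the core model behind plain callables makes it easy to swap in real clients later and to test the gate logic in isolation.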

The Brief

Write a system instruction for an input pre-filter LLM. It must read the user query, evaluate it for security risk (prompt injection, jailbreak attempts, toxic or sensitive topics), output a JSON score block, and provide an alert message when the score crosses the blocking threshold you define.

Deliverables

The pre-filter system prompt, including safety policies, risk level scales, and formatting constraints.
A strict JSON schema for the pre-filter's output, with fields for `safetyScore`, a `riskCategories` array, `action` (allow/block), and `responseFallback`.
Ten test scenarios (five benign and five adversarial inputs), each with its expected pre-filter evaluation JSON payload.
A mitigation plan for false positives (when a legitimate customer query is incorrectly flagged as a security risk).
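Because the core application will act on the pre-filter's JSON, it helps to validate that payload before trusting it. The sketch below checks the four fields named in the deliverables using only the standard library. Several details are assumptions, not part of the brief: the function name `parse_verdict` is hypothetical, `safetyScore` is assumed to be a number from 0 to 100 (the brief fixes neither the range nor whether higher means safer), `responseFallback` is assumed to be a string or null, and the category label in the usage example is illustrative.

```python
import json

ALLOWED_ACTIONS = {"allow", "block"}
REQUIRED_FIELDS = {"safetyScore", "riskCategories", "action", "responseFallback"}


def parse_verdict(raw: str) -> dict:
    """Parse the pre-filter's output; raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    # Strict: exactly the four expected keys, nothing extra and nothing missing.
    if not isinstance(data, dict) or set(data) != REQUIRED_FIELDS:
        raise ValueError("unexpected or missing fields")

    score = data["safetyScore"]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("safetyScore must be a number")
    if not 0 <= score <= 100:  # assumed range
        raise ValueError("safetyScore out of range")

    categories = data["riskCategories"]
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        raise ValueError("riskCategories must be an array of strings")

    action = data["action"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("action must be 'allow' or 'block'")

    if not isinstance(data["responseFallback"], (str, type(None))):
        raise ValueError("responseFallback must be a string or null")

    return data
```

A usage example: `parse_verdict('{"safetyScore": 12, "riskCategories": ["prompt_injection"], "action": "block", "responseFallback": "I can\'t help with that."}')` returns the parsed dictionary, while a payload with a missing field or an unknown `action` raises `ValueError`. Treating any validation failure as a block (or as a case for human review) is one possible design choice, and it interacts directly with the false-positive plan.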

Submission Guidance

Deliver your solution in a single Markdown file. Ensure the pre-filter prompt uses precise criteria, and format the JSON outputs as code blocks.
