Guard Brasil: 16 Brazilian PII patterns in 4ms
TL;DR: Guard Brasil is an open-source API that detects 16 Brazilian personal data patterns (CPF, CNPJ, RG, MASP, REDS, SUS card...) in about 4ms. CPF, CNPJ, and Titulo de Eleitor get real check digit validation, not just regex. MIT license, self-hostable, free tier of 500 calls per month. Try it now.
The problem: Brazilian data in generic APIs
When you use Microsoft Presidio or AWS Macie to detect personal data in text, they find emails, phone numbers, and credit cards. But they do not find MASP (functional ID for state employees in Minas Gerais), REDS (police incident report number in MG), SUS card (national health system ID), NIS and PIS (worker registration), or Titulo de Eleitor (voter ID with specific check digits). If you work with Brazilian data in chatbots, ERPs, health systems, or police investigations, these patterns matter as much as CPF and CNPJ. And no global library covers them.
The 16 patterns
| Category | Patterns | Validation |
|---|---|---|
| Identity | CPF, CNPJ, RG, CNH, Titulo de Eleitor | Real check digit (CPF, CNPJ, Titulo) |
| Health and government | NIS/PIS, SUS Card, MASP | Format and length |
| Investigation | REDS (MG), judicial process number (CNJ) | CNJ/REDS standard format |
| Contact | Email, Phone (landline and mobile BR), CEP | Regex plus BR format |
| Vehicles | Mercosul plate, legacy plate | Format ABC1D23 and ABC-1234 |
| Financial | Credit card | Luhn algorithm |
Each pattern has an associated LGPD (Brazil's data protection law, similar to GDPR) risk level. CPF, CNH, and health data are CRITICAL under Art. 5 and Art. 11. Email and CEP are MEDIUM. The classification follows the ANPD (Brazil's data protection authority) interpretation of sensitive personal data.
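One way to picture the classification above is a lookup from pattern type to risk level, with the risk of a whole text taken as the highest level among its matches. The sketch below is an assumption, not Guard Brasil's classifier: `RISK_BY_TYPE`, `overallRisk`, the `NONE` level, and the type names other than `CPF` and `SUS_CARD` are hypothetical, and only the levels stated in this article are listed.

```typescript
// Levels named in the article, plus a hypothetical NONE for "no match".
type RiskLevel = "NONE" | "MEDIUM" | "CRITICAL";

// Ordered from lowest to highest so levels can be compared by index.
const RISK_ORDER: RiskLevel[] = ["NONE", "MEDIUM", "CRITICAL"];

// Hypothetical lookup table; only levels stated in the article appear here.
const RISK_BY_TYPE: Record<string, RiskLevel> = {
  CPF: "CRITICAL",
  CNH: "CRITICAL",
  SUS_CARD: "CRITICAL", // treated as health data (assumption)
  EMAIL: "MEDIUM",
  CEP: "MEDIUM",
};

// Returns the highest risk level among the detected pattern types.
// Types missing from the table fall back to NONE in this sketch.
function overallRisk(types: string[]): RiskLevel {
  let highest: RiskLevel = "NONE";
  for (const t of types) {
    const level = RISK_BY_TYPE[t] ?? "NONE";
    if (RISK_ORDER.indexOf(level) > RISK_ORDER.indexOf(highest)) {
      highest = level;
    }
  }
  return highest;
}
```

Taking the maximum means a single CPF is enough to mark a text as CRITICAL even when every other match is MEDIUM.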
Live test
The API is public. No signup, no credit card:
curl -X POST https://guard.egos.ia.br/v1/inspect -H "Content-Type: application/json" -d '{"text": "Patient CPF: 123.456.789-09, SUS card 898 0016 0045 0004"}'
Response in about 4ms:
{
  "patterns": [
    {"type": "CPF", "value": "123.456.789-09", "valid": true},
    {"type": "SUS_CARD", "value": "898 0016 0045 0004"}
  ],
  "lgpd_risk": "CRITICAL",
  "has_sensitive_data": true,
  "latency_ms": 4
}
The field valid: true on the CPF means the check digits pass. This is what separates Guard from pure regex: 000.000.000-00 matches the CPF format but would return valid: false. (Repeated-digit sequences like this one happen to satisfy the modulo-11 arithmetic, so validators conventionally reject them with an explicit rule.) Pattern matching alone creates false positives; check digit validation filters them out.
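To make the check digit idea concrete, here is a minimal sketch of the standard modulo-11 CPF algorithm. It is not taken from the Guard Brasil source; `cpfCheckDigit` and `isValidCpf` are hypothetical names. Repeated-digit sequences such as 000.000.000-00 satisfy the arithmetic, so the sketch rejects them with an explicit rule.

```typescript
// Computes one CPF check digit from the digits that precede it.
// Weights run from digits.length + 1 down to 2.
function cpfCheckDigit(digits: number[]): number {
  const startWeight = digits.length + 1;
  const sum = digits.reduce((acc, d, i) => acc + d * (startWeight - i), 0);
  const remainder = sum % 11;
  return remainder < 2 ? 0 : 11 - remainder;
}

// Returns true when the input has 11 digits and both check digits match.
function isValidCpf(input: string): boolean {
  const raw = input.replace(/\D/g, ""); // drop dots, dashes, spaces
  if (raw.length !== 11) return false;
  // Sequences like 000.000.000-00 pass the arithmetic, so reject them here.
  if (/^(\d)\1{10}$/.test(raw)) return false;
  const d = raw.split("").map(Number);
  const first = cpfCheckDigit(d.slice(0, 9));   // 10th digit
  const second = cpfCheckDigit(d.slice(0, 10)); // 11th digit
  return d[9] === first && d[10] === second;
}
```

With this sketch, `isValidCpf("123.456.789-09")` returns true, while `isValidCpf("123.456.789-00")` returns false because the second check digit does not match.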
Real use cases
| Use case | How Guard helps | Risk without it |
|---|---|---|
| LLM chatbot | Inspect user input before sending to model | CPF or CNH leaks to third-party API |
| ETL pipeline | Classify PII fields before writing to data lake | Sensitive data in table with no access control |
| Police investigation (our case) | Audit trail of who accessed investigation data | No LGPD Art. 37 compliance |
| Healthtech | Detect health data (Art. 11) in free-text fields | ANPD fine for irregular sensitive data treatment |
| Log sanitization | Find PII in application logs | Personal data in Elasticsearch without protection |
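As an illustration of the LLM chatbot row, the sketch below inspects user input before it would be forwarded to a model. The endpoint and the response fields come from the live test in this article; everything else is an assumption: `InspectResult`, `Inspector`, `inspectViaHttp`, `decideForwarding`, and `guardedPrompt` are hypothetical names, and refusing to forward on `has_sensitive_data` is just one possible policy.

```typescript
// Response shape copied from the example response in this article;
// the real API may return additional fields.
interface InspectResult {
  patterns: { type: string; value: string; valid?: boolean }[];
  lgpd_risk: string;
  has_sensitive_data: boolean;
  latency_ms: number;
}

// Any function that turns text into an inspection result.
// Injecting it keeps the policy testable without network access.
type Inspector = (text: string) => Promise<InspectResult>;

// Calls the public endpoint shown in the curl example.
const inspectViaHttp: Inspector = async (text) => {
  const res = await fetch("https://guard.egos.ia.br/v1/inspect", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`inspect failed with status ${res.status}`);
  return (await res.json()) as InspectResult;
};

// Hypothetical policy: do not forward input that contains sensitive data.
function decideForwarding(result: InspectResult): {
  forward: boolean;
  detected: string[];
} {
  return {
    forward: !result.has_sensitive_data,
    detected: result.patterns.map((p) => p.type),
  };
}

// Inspects the input first, then applies the policy above.
async function guardedPrompt(
  userInput: string,
  inspect: Inspector = inspectViaHttp,
): Promise<{ forward: boolean; detected: string[] }> {
  return decideForwarding(await inspect(userInput));
}
```

A caller would send the prompt to the model only when `forward` is true, and could use `detected` to tell the user which kind of data was found.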
Compliance, not masking
Guard Brasil does not mask data from operators by default. This is intentional. In a police precinct, the investigator needs to see the suspect's CPF. In a hospital, the doctor needs to see the patient's SUS number. Masking that data would break their work. What Guard does is generate the audit trail: who accessed the data, when, what type of data, and at what risk level. That is what LGPD Art. 37 requires: a record of processing operations, not a block on legitimate access. Each call to the API internally generates a SHA-256 hash of the evidence as a provenance receipt, usable as auditable proof if ANPD requests it.
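The audit trail described above could be recorded roughly as follows. This is a sketch under stated assumptions, not Guard Brasil's schema: `AuditEntry`, `sha256Hex`, and `buildAuditEntry` are hypothetical names, and the choice to store only the hash of the evidence is an assumption. It uses Node's built-in `crypto` module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical audit record; field names are illustrative only.
interface AuditEntry {
  operator: string;     // who accessed
  accessedAt: string;   // when, as an ISO 8601 timestamp
  dataTypes: string[];  // what type of data
  riskLevel: string;    // what risk level
  evidenceHash: string; // SHA-256 of the inspected evidence, hex encoded
}

// Hex-encoded SHA-256 digest of a UTF-8 string.
function sha256Hex(data: string): string {
  return createHash("sha256").update(data, "utf8").digest("hex");
}

// Builds one audit entry. The raw evidence is hashed, not stored.
function buildAuditEntry(
  operator: string,
  evidence: string,
  dataTypes: string[],
  riskLevel: string,
  now: Date = new Date(),
): AuditEntry {
  return {
    operator,
    accessedAt: now.toISOString(),
    dataTypes,
    riskLevel,
    evidenceHash: sha256Hex(evidence),
  };
}
```

Because the same evidence always produces the same hash, an auditor holding the original text can recompute it and compare against the stored receipt.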
Guard Brasil versus alternatives
| Metric | Guard Brasil | Presidio | AWS Macie |
|---|---|---|---|
| Latency p95 | 4ms | ~50ms (Python NLP) | Batch (minutes) |
| Native BR patterns | 16 | 2-3 if configured | Generic |
| Check digit validation | CPF, CNPJ, Titulo | Regex only | N/A |
| Self-hostable | Yes (MIT) | Yes (MIT) | No (AWS only) |
| LGPD classification | Native (Art. 5, 11) | Generic (GDPR) | Generic |
| Cost | Free tier 500/month | Free (self-host) | Pay-per-GB |
Guard does not replace Presidio or Macie. For global patterns (SSN, passport), use Presidio. For Brazilian structured data with real validation, use Guard Brasil. Running both in sequence is a valid architecture.
What did not work
- Free-text name detection: Guard detects structured patterns, not names or addresses in free text. For unstructured PII, combine with an NLP approach.
- Partial masking heuristics: partially redacted data like 123.XXX.XXX-00 is not detected as CPF. Structural PII without digits is outside scope.
- Volume at free tier: 500 calls per month is tight for high-traffic apps. Self-hosting is the intended path for production scale.
Open questions
- How to audit a Drive with thousands of files for PII retroactively at reasonable cost?
- What is the right granularity for LGPD risk levels in a multi-tenant system where tenants have different compliance needs?
- When does self-hosting Guard Brasil make more sense than the hosted API?
Files referenced in this article
- packages/guard-brasil/ — Guard Brasil source (16 pattern modules, validators, classifier)
- packages/guard-brasil/src/index.ts — entry point, exports guard.inspect()
Related in EGOS
- Wrong Altitude — why Guard Brasil is a tool, not the central product
- Documentation lies — the manifest that monitors Guard Brasil endpoints automatically
Open source. Everything here is available at github.com/enioxt/egos. If you are building something similar or want to apply this in your context, reach out on X: @eniorocha_. Building in public.