Using the CREPDL Validator
This guide explains what CREPDL is, how to use CREPDL·CHECK to validate text, and how to write your own CREPDL schema to describe any character repertoire.
What is CREPDL?
CREPDL (Character Repertoire Description Language) is an ISO/IEC standard (19757-7:2020) for describing exactly which Unicode characters are permitted in a given context — a language, a document type, a form field, or any other constrained text domain.
A CREPDL schema is a small XML file. It lists the permitted code points using
<char> elements for individual characters or ranges, combined with set
operations like <union>, <intersection>, and
<difference>. CREPDL·CHECK reads this schema and checks whether every
character in your text belongs to the declared repertoire.
Common uses include: checking that user-submitted text in a web form only contains characters from a specific script; validating that a localised document does not accidentally contain characters from the wrong language; or auditing legacy data for encoding problems.
The current CREPDL namespace is http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0.
This must appear as the default namespace on the root element of every schema you write.
How to validate text
Load a CREPDL schema
Paste a CREPDL XML schema directly into the CREPDL Schema panel on the left, or click Upload .crepdl file to load one from disk. The schema panel accepts any well-formed XML in either the current or legacy CREPDL namespace.
Parse the schema
Click Parse Schema. The application checks the XML for syntax errors and counts the declared character definitions. A green ✓ Schema loaded badge confirms success. Any structural warnings appear beneath the schema panel.
Enter or upload your text
Type or paste the text you want to check into the Text to Validate panel, or click Upload text file. The character counter updates live. There is no length limit enforced by the interface.
Run validation
Click ▶ Validate Against Repertoire. The text is sent to the server together with the schema. Results appear immediately below — a green banner if the text is fully compliant, or a red banner listing every out-of-repertoire character found.
Reading the results
After validation, three tabs present the findings:
Violations — a table of every distinct code point found in the text that
is not in the permitted repertoire. Each row shows the character glyph, its Unicode code
point (e.g. U+00E9), a description, and how many times it appears.
Preview — the full text with every violating character highlighted with a red wavy underline. Hovering over a highlighted character shows a tooltip with the code point and description.
Statistics — a summary grid showing total characters checked, violation count, unique violating code points, violation rate, permitted code point count, and compliant character count.
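The figures in the Violations and Statistics tabs can be reproduced offline with a few lines of Python. This is only a sketch under stated assumptions, not the validator's actual implementation: the function name `report` and the dictionary keys are hypothetical, the permitted repertoire is assumed to be already available as a set of integer code points, and characters are counted as Unicode code points (the validator may count differently).

```python
from collections import Counter
import unicodedata


def report(text: str, permitted: set) -> dict:
    """Summarise out-of-repertoire characters (hypothetical helper)."""
    # Count each character whose code point is not in the permitted set.
    bad = Counter(ch for ch in text if ord(ch) not in permitted)
    total = len(text)                # Python strings are sequences of code points
    violations = sum(bad.values())
    # One row per distinct violating code point: notation, name, occurrences.
    rows = [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), count)
        for ch, count in sorted(bad.items())
    ]
    return {
        "total": total,
        "violations": violations,
        "unique": len(bad),
        "rate": violations / total if total else 0.0,
        "permitted": len(permitted),
        "compliant": total - violations,
        "rows": rows,
    }


# Printable Basic Latin only, so U+00E9 is out of repertoire.
result = report("caf\u00e9 \u00e9", set(range(0x20, 0x7F)))
print(result["rows"])
```

Here the text contains six characters, two of which are the same violating code point, so the sketch reports two violations but only one unique violating code point.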
Schema structure
Every CREPDL schema is a well-formed XML document. The root element is a collection
expression — most commonly <union> — in the CREPDL namespace.
There is no outer wrapper element; the collection is the document.
<?xml version="1.0" encoding="UTF-8"?>
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<!-- A single character -->
<char>U+0041</char>
<!-- An inclusive range of characters -->
<char>U+0061-U+007A</char>
</union>
The root element must carry the namespace declaration
xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0".
Without it, the schema will be rejected.
The root can be any collection element: <union>,
<intersection>, <difference>, or even a bare
<char> if the entire repertoire is a single character or range.
In practice almost all schemas use <union> as the root because
most repertoires are additive.
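Before pasting a schema into the validator, the root-element rules above can be checked locally. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `check_root` is hypothetical, and the checks only mirror the rules stated in this guide rather than the validator's own code.

```python
import xml.etree.ElementTree as ET

CREPDL_NS = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"
ROOT_NAMES = {"union", "intersection", "difference", "char"}


def check_root(schema_xml: str) -> str:
    """Return the root's local name, or raise ValueError (hypothetical helper)."""
    root = ET.fromstring(schema_xml)
    # ElementTree reports namespaced tags as "{namespace-uri}localname".
    if not root.tag.startswith("{"):
        raise ValueError("root element has no namespace declaration")
    namespace, _, local = root.tag[1:].partition("}")
    if namespace != CREPDL_NS:
        raise ValueError(f"unexpected namespace: {namespace}")
    if local not in ROOT_NAMES:
        raise ValueError(f"root is not a collection element: {local}")
    return local


schema = (
    '<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">'
    "<char>U+0041</char></union>"
)
print(check_root(schema))
```

A document that is not well-formed XML raises `ET.ParseError` from `ET.fromstring` before any of these checks run.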
The <char> element
<char> is the atomic element of CREPDL. Its text content specifies
one or more Unicode code points using standard U+XXXX notation.
| Format | Example | Meaning |
|---|---|---|
| <char>U+XXXX</char> | <char>U+0041</char> | Permits exactly the one character at code point U+0041 (LATIN CAPITAL LETTER A) |
| <char>U+XXXX-U+YYYY</char> | <char>U+0061-U+007A</char> | Permits every code point from U+0061 to U+007A inclusive (a–z, 26 characters) |
Code point values are written in uppercase hexadecimal with the U+ prefix.
They must be zero-padded to at least four digits: U+0041, not
U+41. Supplementary-plane characters use five or six digits:
U+1F600.
In a range, the start code point must be less than or equal to the end code point. The range is always inclusive at both ends.
<!-- Individual characters -->
<char>U+0009</char> <!-- TAB -->
<char>U+000A</char> <!-- LINE FEED -->
<char>U+00A0</char> <!-- NO-BREAK SPACE -->
<char>U+20AC</char> <!-- EURO SIGN -->
<!-- Ranges -->
<char>U+0020-U+007E</char> <!-- Printable Basic Latin (space through tilde) -->
<char>U+00C0-U+00FF</char> <!-- Latin-1 Supplement letters (also contains × U+00D7 and ÷ U+00F7) -->
<char>U+4E00-U+9FFF</char> <!-- CJK Unified Ideographs -->
<char>U+1F600-U+1F64F</char> <!-- Emoticons (supplementary plane) -->
Do not write <char/> as a self-closing empty element; it will be ignored
with a warning. The code point value must always be the text content of the element.
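The notation rules above (uppercase hexadecimal, at least four digits, inclusive ranges, start not greater than end) can be expressed as a small parser. This is a sketch only: the name `parse_char` is hypothetical, and it is deliberately strict, so it rejects lowercase hex and short forms such as U+41, which the real validator may or may not tolerate.

```python
import re

# "U+" followed by 4 to 6 uppercase hex digits, with an optional "-U+YYYY" end.
_CHAR_RE = re.compile(r"U\+([0-9A-F]{4,6})(?:-U\+([0-9A-F]{4,6}))?")


def parse_char(text: str) -> range:
    """Turn the text content of a <char> element into a range of code points."""
    match = _CHAR_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"expected U+XXXX or U+XXXX-U+YYYY, got {text!r}")
    start = int(match.group(1), 16)
    # A single code point is treated as a range of length one.
    end = int(match.group(2), 16) if match.group(2) else start
    if start > end:
        raise ValueError("range start is greater than range end")
    if end > 0x10FFFF:
        raise ValueError("code point beyond U+10FFFF")
    return range(start, end + 1)   # +1 because CREPDL ranges are inclusive


print(len(parse_char("U+0061-U+007A")))   # 26 lowercase letters a-z
```

An empty string, which is what a self-closing <char/> would supply, fails the pattern and raises `ValueError` in this sketch instead of being skipped with a warning.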
Set operations
CREPDL provides three set-operation elements. Each takes two or more child collection elements as operands. You can nest them to build complex repertoires.
| Element | Operation | Result |
|---|---|---|
| <union> | A ∪ B ∪ C … | All code points that appear in any child — the most commonly used element |
| <intersection> | A ∩ B ∩ C … | Only code points that appear in every child simultaneously |
| <difference> | A − B − C … | Code points in the first child minus those in all subsequent children |
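<!-- DIFFERENCE: Basic Latin letters only (no space, no digits, no punctuation) -->
<difference xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<char>U+0020-U+007E</char> <!-- start: all printable Basic Latin -->
<char>U+0020-U+002F</char> <!-- subtract: space and punctuation ! through / -->
<char>U+0030-U+0039</char> <!-- subtract: digits 0–9 -->
<char>U+003A-U+0040</char> <!-- subtract: punctuation : through @ -->
<char>U+005B-U+0060</char> <!-- subtract: punctuation [ through ` -->
<char>U+007B-U+007E</char> <!-- subtract: punctuation { through ~ -->
</difference>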
<!-- INTERSECTION: Cyrillic characters that are also in a reference set -->
<intersection xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<char>U+0400-U+04FF</char> <!-- Cyrillic block -->
<ref href="russian.crepdl"/> <!-- intersect with Russian repertoire -->
</intersection>
<!-- NESTING: Combining operations -->
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<char>U+0020-U+007E</char>
<difference>
<char>U+00C0-U+00FF</char>
<char>U+00D7</char> <!-- exclude × MULTIPLICATION SIGN -->
<char>U+00F7</char> <!-- exclude ÷ DIVISION SIGN -->
</difference>
</union>
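To see how nested set operations resolve to a final repertoire, the following sketch evaluates a schema recursively into a Python set of code points. It is an illustration under stated assumptions, not the validator's implementation: the function name `evaluate` is hypothetical, <ref> elements and empty <char/> elements are not handled, and the <char> content is assumed to be already well-formed.

```python
import xml.etree.ElementTree as ET

SCHEMA = """<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>
  <difference>
    <char>U+00C0-U+00FF</char>
    <char>U+00D7</char> <!-- exclude MULTIPLICATION SIGN -->
    <char>U+00F7</char> <!-- exclude DIVISION SIGN -->
  </difference>
</union>"""


def evaluate(element) -> set:
    """Resolve a CREPDL collection element to a set of code points."""
    name = element.tag.split("}")[-1]          # drop the "{namespace}" prefix
    if name == "char":
        first, _, last = (element.text or "").strip().partition("-")
        start = int(first[2:], 16)             # skip the "U+" prefix
        end = int(last[2:], 16) if last else start
        return set(range(start, end + 1))
    # XML comments are dropped by the default parser, so only elements remain.
    operands = [evaluate(child) for child in element]
    if not operands:
        return set()
    if name == "union":
        return set().union(*operands)
    if name == "intersection":
        return set.intersection(*operands)
    if name == "difference":
        return operands[0].difference(*operands[1:])
    raise ValueError(f"unsupported element: {name}")


repertoire = evaluate(ET.fromstring(SCHEMA))
print(len(repertoire))   # 95 printable ASCII + 64 Latin-1 letters - 2 = 157
```

The order of children only matters for <difference>, where the first child is the starting set and every later child is subtracted from it.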
External references
The <ref> element references another CREPDL file by URI, allowing
repertoires to be composed from smaller reusable pieces.
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<ref href="en-english.crepdl"/>
<ref href="fr-french.crepdl"/>
<ref href="de-german.crepdl"/>
</union>
<ref> elements are valid CREPDL syntax but the validator
currently only processes inline <char> definitions. A referenced file
will be noted as a warning and its characters will not be included in the repertoire.
To work around this, copy the contents of referenced files inline using
<union>.
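Because referenced files are skipped, it can help to list every <ref> in a schema before uploading it, so you know which files still need to be copied inline. A minimal sketch with Python's standard library follows; the function name `find_refs` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0}"


def find_refs(schema_xml: str) -> list:
    """Return the href of every <ref> element, in document order."""
    root = ET.fromstring(schema_xml)
    # iter() walks the whole tree, so nested references are found as well.
    return [ref.get("href") for ref in root.iter(NS + "ref")]


schema = """<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <ref href="en-english.crepdl"/>
  <ref href="fr-french.crepdl"/>
  <ref href="de-german.crepdl"/>
</union>"""
print(find_refs(schema))
```

An empty list means the schema relies only on inline <char> definitions and should load without the external-reference warning.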
Complete examples
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<char>U+0009</char>
<char>U+000A</char>
<char>U+000D</char>
<char>U+0020-U+007E</char>
<char>U+00A0</char>
<char>U+00C0-U+00FF</char>
<char>U+2013-U+2014</char>
<char>U+2018-U+201E</char>
<char>U+2026</char>
</union>
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<!-- Shared whitespace/controls -->
<char>U+0009</char>
<char>U+000A</char>
<char>U+0020-U+007E</char>
<!-- Arabic blocks -->
<char>U+0600-U+06FF</char>
<char>U+0750-U+077F</char>
<char>U+FB50-U+FDFF</char>
<char>U+FE70-U+FEFF</char>
<char>U+200F</char>
</union>
<difference xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<!-- Start: Basic Latin -->
<union>
<char>U+0020</char>
<char>U+0041-U+005A</char>
<char>U+0061-U+007A</char>
<char>U+0030-U+0039</char>
</union>
<!-- Subtract: digits -->
<char>U+0030-U+0039</char>
</difference>
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
<char>U+0020-U+007E</char>
<!-- Hiragana -->
<char>U+3040-U+309F</char>
<!-- Katakana -->
<char>U+30A0-U+30FF</char>
<!-- CJK Unified Ideographs -->
<char>U+4E00-U+9FFF</char>
<char>U+3400-U+4DBF</char>
<!-- CJK Symbols and Punctuation -->
<char>U+3000-U+303F</char>
</union>
Common code-point ranges
| Range | Block / Description |
|---|---|
| U+0009, U+000A, U+000D | Tab, Line Feed, Carriage Return — include in almost every schema |
| U+0020-U+007E | Printable Basic Latin — space through tilde; all ASCII printable characters |
| U+00A0 | No-Break Space — often needed alongside U+0020 |
| U+00C0-U+00FF | Latin-1 Supplement — accented Latin letters (à, é, ñ, ü …) |
| U+0100-U+017F | Latin Extended-A — additional accented letters for Central/Eastern European languages |
| U+0370-U+03FF | Greek and Coptic |
| U+0400-U+04FF | Cyrillic |
| U+0600-U+06FF | Arabic |
| U+0900-U+097F | Devanagari (Hindi, Marathi, Sanskrit …) |
| U+0E00-U+0E7F | Thai |
| U+1E00-U+1EFF | Latin Extended Additional — including Vietnamese precomposed characters |
| U+2013-U+2014 | En dash, Em dash — recommended for most Western-language schemas |
| U+2018-U+201F | Typographic quotation marks (‘ ’ ‚ ‛ “ ” „ ‟) |
| U+2026 | Horizontal Ellipsis (…) |
| U+20AC | Euro Sign (€) |
| U+3040-U+309F | Hiragana |
| U+30A0-U+30FF | Katakana |
| U+4E00-U+9FFF | CJK Unified Ideographs (main block) |
| U+AC00-U+D7A3 | Hangul Syllables (Korean) |
| U+FB50-U+FDFF | Arabic Presentation Forms-A |
| U+FE70-U+FEFF | Arabic Presentation Forms-B |
| U+1F300-U+1F9FF | Miscellaneous emoji and symbols (supplementary plane) |
Troubleshooting
| Symptom | Likely cause and fix |
|---|---|
| Parse Schema returns "XML syntax error" | The XML is not well-formed. Check for unclosed tags, unescaped & or < characters in text, or mismatched quotes around attribute values. |
| "No <char> elements found" | The schema parsed successfully but contains no character definitions. Ensure your <char> elements contain text content (<char>U+0041</char>) rather than being self-closing (<char/>). |
| All characters reported as violations | The namespace on the root element is missing or incorrect. It must be exactly http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0. |
| Obvious valid characters flagged as violations | The range may not cover the expected code points. For example, U+0020-U+007E covers printable ASCII but not tab (U+0009) or newline (U+000A) — add those separately. |
| Warning: "<ref> external reference not supported" | The schema uses <ref href="…"/> to reference another file. This is valid CREPDL but not yet supported by this validator. Copy the referenced file's contents inline. |
| Schema parses but shows 0 permitted code points | Check for an empty <union></union> with no children, or a <difference> whose subtracted sets together cover every code point in the starting set. |
| Supplementary-plane characters always flagged | Use five- or six-digit code points: <char>U+1F600-U+1F64F</char>, not four-digit notation. |