Tutorial & Reference Guide

Using the CREPDL Validator

This guide explains what CREPDL is, how to use CREPDL·CHECK to validate text, and how to write your own CREPDL schema to describe any character repertoire.

What is CREPDL?

CREPDL (Character Repertoire Description Language) is an ISO/IEC standard (19757-7:2020) for describing exactly which Unicode characters are permitted in a given context — a language, a document type, a form field, or any other constrained text domain.

A CREPDL schema is a small XML file. It lists the permitted code points using <char> elements for individual characters or ranges, combined with set operations like <union>, <intersection>, and <difference>. CREPDL·CHECK reads this schema and checks whether every character in your text belongs to the declared repertoire.

Common uses include: checking that user-submitted text in a web form only contains characters from a specific script; validating that a localised document does not accidentally contain characters from the wrong language; or auditing legacy data for encoding problems.

Namespace: CREPDL·CHECK uses the current 2020 schema namespace, http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0. This must appear as the default namespace on the root element of every schema you write.

How to validate text

Load a CREPDL schema

Paste a CREPDL XML schema directly into the CREPDL Schema panel on the left, or click Upload .crepdl file to load one from disk. The schema panel accepts any well-formed XML in either the current or legacy CREPDL namespace.

Parse the schema

Click Parse Schema. The application checks the XML for syntax errors and counts the declared character definitions. A green ✓ Schema loaded badge confirms success. Any structural warnings appear beneath the schema panel.

Enter or upload your text

Type or paste the text you want to check into the Text to Validate panel, or click Upload text file. The character counter updates live. There is no length limit enforced by the interface.

Run validation

Click ▶ Validate Against Repertoire. The text is sent to the server together with the schema. Results appear immediately below — a green banner if the text is fully compliant, or a red banner listing every out-of-repertoire character found.

Reading the results

After validation, three tabs present the findings:

Violations — a table of every distinct code point found in the text that is not in the permitted repertoire. Each row shows the character glyph, its Unicode code point (e.g. U+00E9), a description, and how many times it appears.

Preview — the full text with every violating character highlighted with a red wavy underline. Hovering over a highlighted character shows a tooltip with the code point and description.

Statistics — a summary grid showing total characters checked, violation count, unique violating code points, violation rate, permitted code point count, and compliant character count.
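
The rows of the Violations tab can be reproduced offline with a few lines of Python. This is a minimal sketch, not the validator's own code: the function name `violation_rows` and the representation of the repertoire as a set of integer code points are assumptions made for illustration.

```python
from collections import Counter
import unicodedata


def violation_rows(text, permitted):
    """Return one row per distinct out-of-repertoire code point.

    `permitted` is a set of integer code points (an assumed in-memory
    form of the repertoire, not the validator's actual data structure).
    """
    counts = Counter(ch for ch in text if ord(ch) not in permitted)
    rows = []
    # Sort by code point so the table order is stable.
    for ch, count in sorted(counts.items(), key=lambda item: ord(item[0])):
        rows.append({
            "glyph": ch,
            "code_point": f"U+{ord(ch):04X}",
            "description": unicodedata.name(ch, "<unnamed>"),
            "count": count,
        })
    return rows


# Repertoire: printable Basic Latin only, so accented letters are violations.
permitted = set(range(0x20, 0x7F))
rows = violation_rows("caf\u00e9 au lait, caf\u00e9 cr\u00e8me", permitted)
for row in rows:
    print(row["glyph"], row["code_point"], row["description"], row["count"])
```

Each dictionary corresponds to one table row: glyph, code point, description, and occurrence count.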

Example response values
Compliant | ✓ Yes (all characters permitted)
Total characters | 312
Permitted code points | 173
Violations | 4 occurrences · 3 distinct code points
Violation rate | 1.28%

Tip: The violation rate is calculated as a percentage of the total Unicode code points in the text, not of bytes or UTF-16 code units. Supplementary-plane characters (emoji, some CJK extensions) count as one code point each.
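
The difference between code points, UTF-16 code units, and bytes is easy to see in Python, where `len()` of a string already counts code points. The sketch below is illustrative only: `violation_stats` and its dictionary keys are hypothetical names, and the repertoire is assumed to be a plain set of integers.

```python
def violation_stats(text, permitted):
    """Summarise a validation run, counting Unicode code points.

    `permitted` is a set of integer code points; this in-memory form is
    an assumption made for the sketch.
    """
    total = len(text)  # len() of a Python str counts code points
    violating = [ch for ch in text if ord(ch) not in permitted]
    rate = 100 * len(violating) / total if total else 0.0
    return {
        "total": total,
        "violations": len(violating),
        "distinct": len(set(violating)),
        "rate_percent": round(rate, 2),
    }


sample = "ok \U0001F600"  # three ASCII characters plus one emoji
code_points = len(sample)
utf16_units = len(sample.encode("utf-16-le")) // 2  # emoji needs a surrogate pair
utf8_bytes = len(sample.encode("utf-8"))            # emoji needs four bytes

stats = violation_stats(sample, set(range(0x20, 0x7F)))
print(code_points, utf16_units, utf8_bytes, stats)
```

The same formula applied to the example figures above gives 100 × 4 / 312 ≈ 1.28%.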

Schema structure

Every CREPDL schema is a well-formed XML document. The root element is a collection expression — most commonly <union> — in the CREPDL namespace. There is no outer wrapper element; the collection is the document.

<?xml version="1.0" encoding="UTF-8"?>
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">

  <!-- A single character -->
  <char>U+0041</char>

  <!-- An inclusive range of characters -->
  <char>U+0061-U+007A</char>

</union>

The root element must carry the namespace declaration xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0". Without it, the schema will be rejected.

The root can be any collection element: <union>, <intersection>, <difference>, or even a bare <char> if the entire repertoire is a single character or range. In practice almost all schemas use <union> as the root because most repertoires are additive.
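
A schema's root element can be checked before it is pasted into the validator. The following sketch uses Python's standard `xml.etree.ElementTree`. The function name `check_root` and the wording of its messages are assumptions for illustration and do not reproduce CREPDL·CHECK's actual diagnostics.

```python
import xml.etree.ElementTree as ET

CREPDL_NS = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"
# Root elements named in this guide; treated here as the allowed set.
COLLECTION_ELEMENTS = {"union", "intersection", "difference", "char"}


def check_root(schema_xml):
    """Return a list of problems found on the root element (empty if none)."""
    root = ET.fromstring(schema_xml)
    # ElementTree writes namespaced tags as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, _, local_name = root.tag[1:].partition("}")
    else:
        namespace, local_name = "", root.tag
    problems = []
    if namespace != CREPDL_NS:
        problems.append(f"root namespace is {namespace!r}, expected {CREPDL_NS!r}")
    if local_name not in COLLECTION_ELEMENTS:
        problems.append(f"root element <{local_name}> is not a collection element")
    return problems


good = f'<union xmlns="{CREPDL_NS}"><char>U+0041</char></union>'
missing_ns = "<union><char>U+0041</char></union>"
print(check_root(good))
print(check_root(missing_ns))
```

`ET.fromstring` raises `xml.etree.ElementTree.ParseError` on XML that is not well-formed, so a syntax error surfaces before the namespace check runs.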

The <char> element

<char> is the atomic element of CREPDL. Its text content specifies one or more Unicode code points using standard U+XXXX notation.

Format | Example | Meaning
<char>U+XXXX</char> | <char>U+0041</char> | Permits exactly the one character at code point U+0041 (LATIN CAPITAL LETTER A)
<char>U+XXXX-U+YYYY</char> | <char>U+0061-U+007A</char> | Permits every code point from U+0061 to U+007A inclusive (a–z, 26 characters)

Code point values are written in uppercase hexadecimal with the U+ prefix. They must be zero-padded to at least four digits: U+0041, not U+41. Supplementary-plane characters use five or six digits: U+1F600.

In a range, the start code point must be less than or equal to the end code point. The range is always inclusive at both ends.
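
The notation rules above translate directly into a small parser. This is a sketch under stated assumptions: `parse_char` is a hypothetical helper, and whether CREPDL·CHECK itself tolerates surrounding whitespace or rejects lowercase hex digits is not confirmed by this guide.

```python
import re

# U+ prefix, 4 to 6 uppercase hex digits, optionally "-U+..." for a range.
CHAR_PATTERN = re.compile(r"U\+([0-9A-F]{4,6})(?:-U\+([0-9A-F]{4,6}))?")


def parse_char(content):
    """Return (start, end) as integers for the text content of one <char>."""
    match = CHAR_PATTERN.fullmatch(content.strip())
    if match is None:
        raise ValueError(f"not in U+XXXX or U+XXXX-U+YYYY form: {content!r}")
    start = int(match.group(1), 16)
    end = int(match.group(2), 16) if match.group(2) else start
    if end > 0x10FFFF:
        raise ValueError(f"beyond the Unicode code space: {content!r}")
    if start > end:
        raise ValueError(f"range start exceeds range end: {content!r}")
    return start, end


single = parse_char("U+0041")
lowercase = parse_char("U+0061-U+007A")
print(single, lowercase, lowercase[1] - lowercase[0] + 1)
```

A value such as `U+41` fails the pattern because it has fewer than four digits, and a reversed range such as `U+007A-U+0061` is rejected by the final check.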

<!-- Individual characters -->
<char>U+0009</char>   <!-- TAB -->
<char>U+000A</char>   <!-- LINE FEED -->
<char>U+00A0</char>   <!-- NO-BREAK SPACE -->
<char>U+20AC</char>   <!-- EURO SIGN -->

<!-- Ranges -->
<char>U+0020-U+007E</char>   <!-- Printable Basic Latin (space through tilde) -->
<char>U+00C0-U+00FF</char>   <!-- Latin-1 Supplement letters -->
<char>U+4E00-U+9FFF</char>   <!-- CJK Unified Ideographs -->
<char>U+1F600-U+1F64F</char> <!-- Emoticons (supplementary plane) -->

Common mistake: Do not write <char/> as a self-closing empty element; it will be ignored with a warning. The code point value must always be the text content of the element.

Set operations

CREPDL provides three set-operation elements. Each takes two or more child collection elements as operands. You can nest them to build complex repertoires.

Element | Operation | Result
<union> | A ∪ B ∪ C … | All code points that appear in any child (the most commonly used element)
<intersection> | A ∩ B ∩ C … | Only code points that appear in every child simultaneously
<difference> | A − B − C … | Code points in the first child minus those in all subsequent children

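<!-- DIFFERENCE: Basic Latin letters only (no space, no digits, no punctuation) -->
<difference xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>   <!-- start: all printable Basic Latin -->
  <char>U+0030-U+0039</char>   <!-- subtract: digits 0–9 -->
  <char>U+0020-U+002F</char>   <!-- subtract: space and punctuation ! through / -->
  <char>U+003A-U+0040</char>   <!-- subtract: punctuation : through @ -->
  <char>U+005B-U+0060</char>   <!-- subtract: punctuation [ through ` -->
  <char>U+007B-U+007E</char>   <!-- subtract: punctuation { through ~ -->
</difference>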

<!-- INTERSECTION: Cyrillic characters that are also in a reference set -->
<intersection xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0400-U+04FF</char>   <!-- Cyrillic block -->
  <ref href="russian.crepdl"/>   <!-- intersect with Russian repertoire (<ref> is not yet processed by CREPDL·CHECK; see External references) -->
</intersection>

<!-- NESTING: Combining operations -->
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>
  <difference>
    <char>U+00C0-U+00FF</char>
    <char>U+00D7</char>   <!-- exclude × MULTIPLICATION SIGN -->
    <char>U+00F7</char>   <!-- exclude ÷ DIVISION SIGN -->
  </difference>
</union>
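
The three operations map onto ordinary set arithmetic. The sketch below evaluates a schema into a Python set of code points using the standard `xml.etree.ElementTree` module. It is a simplified illustration, not the validator's implementation: `char_set` and `evaluate` are hypothetical names, <ref> is not handled, and the input is assumed to be in valid U+XXXX notation.

```python
import xml.etree.ElementTree as ET

NS = "{http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0}"


def char_set(content):
    """Expand 'U+XXXX' or 'U+XXXX-U+YYYY' into a set of integer code points."""
    text = content.strip()
    if not text:
        return set()  # an empty <char/> contributes nothing
    start, _, end = text.partition("-")
    first = int(start[2:], 16)
    last = int(end[2:], 16) if end else first
    return set(range(first, last + 1))


def evaluate(element):
    """Recursively evaluate a collection element to a set of code points."""
    if element.tag == NS + "char":
        return char_set(element.text or "")
    operands = [evaluate(child) for child in element]
    if element.tag == NS + "union":
        return set().union(*operands)
    if element.tag == NS + "intersection":
        return set.intersection(*operands) if operands else set()
    if element.tag == NS + "difference":
        # First operand minus every later operand.
        return operands[0].difference(*operands[1:]) if operands else set()
    raise ValueError(f"unsupported element: {element.tag}")


schema = """
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>
  <difference>
    <char>U+00C0-U+00FF</char>
    <char>U+00D7</char>
    <char>U+00F7</char>
  </difference>
</union>
"""
repertoire = evaluate(ET.fromstring(schema))
print(len(repertoire), 0x00E9 in repertoire, 0x00D7 in repertoire)
```

The schema is the nesting example from above: 95 printable Basic Latin code points plus 62 Latin-1 letters (64 minus × and ÷).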

External references

The <ref> element references another CREPDL file by URI, allowing repertoires to be composed from smaller reusable pieces.

<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <ref href="en-english.crepdl"/>
  <ref href="fr-french.crepdl"/>
  <ref href="de-german.crepdl"/>
</union>

Not yet supported in CREPDL·CHECK: External <ref> elements are valid CREPDL syntax, but the validator currently only processes inline <char> definitions. A referenced file will be noted as a warning and its characters will not be included in the repertoire. To work around this, copy the contents of referenced files inline using <union>.
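
The inlining workaround can be automated. The sketch below replaces each <ref> with the root element of the referenced schema, again using Python's standard `xml.etree.ElementTree`. Everything specific here is an assumption for illustration: the helper `inline_refs`, the `load` callback, and the contents of the two referenced files, which are stand-ins rather than real repertoires. Circular references are not detected.

```python
import xml.etree.ElementTree as ET

NS = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"
REF = f"{{{NS}}}ref"
CHAR = f"{{{NS}}}char"


def inline_refs(root, load):
    """Replace every <ref href="..."/> below `root` with the root element
    of the schema text returned by load(href). Modifies `root` in place."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == REF:
                replacement = ET.fromstring(load(child.get("href")))
                inline_refs(replacement, load)  # resolve nested references
                replacement.tail = child.tail   # keep surrounding whitespace
                parent[index] = replacement
    return root


# Stand-in file contents (hypothetical); a real loader would read from disk.
FILES = {
    "en-english.crepdl": f'<union xmlns="{NS}"><char>U+0020-U+007E</char></union>',
    "fr-french.crepdl": f'<union xmlns="{NS}"><char>U+00C0-U+00FF</char></union>',
}

schema = ET.fromstring(
    f'<union xmlns="{NS}">'
    '<ref href="en-english.crepdl"/>'
    '<ref href="fr-french.crepdl"/>'
    "</union>"
)
inline_refs(schema, FILES.__getitem__)

ET.register_namespace("", NS)  # serialise with a default namespace, no prefix
print(ET.tostring(schema, encoding="unicode"))
```

The printed schema contains only inline <char> definitions nested in <union> elements, which is the form the validator currently processes.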

Complete examples

Basic: Irish (Gaeilge)
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0009</char>
  <char>U+000A</char>
  <char>U+000D</char>
  <char>U+0020-U+007E</char>
  <char>U+00A0</char>
  <char>U+00C0-U+00FF</char>
  <char>U+2013-U+2014</char>
  <char>U+2018-U+201E</char>
  <char>U+2026</char>
</union>
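
Multi-script: Arabic + Latin
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <!-- Shared whitespace/controls -->
  <char>U+0009</char>
  <char>U+000A</char>
  <!-- Printable Basic Latin -->
  <char>U+0020-U+007E</char>
  <!-- Arabic blocks -->
  <char>U+0600-U+06FF</char>   <!-- Arabic -->
  <char>U+0750-U+077F</char>   <!-- Arabic Supplement -->
  <char>U+FB50-U+FDFF</char>   <!-- Arabic Presentation Forms-A -->
  <char>U+FE70-U+FEFF</char>   <!-- Arabic Presentation Forms-B -->
  <!-- Bidirectional control -->
  <char>U+200F</char>          <!-- RIGHT-TO-LEFT MARK -->
</union>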

Set operation: Letters only (no digits)
<difference xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <!-- Start: space, Basic Latin letters, and digits -->
  <union>
    <char>U+0020</char>
    <char>U+0041-U+005A</char>
    <char>U+0061-U+007A</char>
    <char>U+0030-U+0039</char>
  </union>
  <!-- Subtract: digits, leaving space and letters -->
  <char>U+0030-U+0039</char>
</difference>

CJK: Japanese (Nihongo)
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>
  <!-- Hiragana -->
  <char>U+3040-U+309F</char>
  <!-- Katakana -->
  <char>U+30A0-U+30FF</char>
  <!-- CJK Unified Ideographs (main block) -->
  <char>U+4E00-U+9FFF</char>
  <!-- CJK Unified Ideographs Extension A -->
  <char>U+3400-U+4DBF</char>
  <!-- CJK Symbols and Punctuation -->
  <char>U+3000-U+303F</char>
</union>

Common code-point ranges

Range | Block / Description
U+0009, U+000A, U+000D | Tab, Line Feed, Carriage Return; include in almost every schema
U+0020-U+007E | Printable Basic Latin: space through tilde, all ASCII printable characters
U+00A0 | No-Break Space; often needed alongside U+0020
U+00C0-U+00FF | Latin-1 Supplement letters: accented Latin letters (à, é, ñ, ü …), plus × (U+00D7) and ÷ (U+00F7)
U+0100-U+017F | Latin Extended-A: additional accented letters for Central/Eastern European languages
U+0370-U+03FF | Greek and Coptic
U+0400-U+04FF | Cyrillic
U+0600-U+06FF | Arabic
U+0900-U+097F | Devanagari (Hindi, Marathi, Sanskrit …)
U+0E00-U+0E7F | Thai
U+1E00-U+1EFF | Latin Extended Additional, including Vietnamese precomposed characters
U+2013-U+2014 | En dash, Em dash; recommended for most Western-language schemas
U+2018-U+201F | Typographic quotation marks (‘ ’ ‚ ‛ “ ” „ ‟)
U+2026 | Horizontal Ellipsis (…)
U+20AC | Euro Sign (€)
U+3040-U+309F | Hiragana
U+30A0-U+30FF | Katakana
U+4E00-U+9FFF | CJK Unified Ideographs (main block)
U+AC00-U+D7A3 | Hangul Syllables (Korean)
U+FB50-U+FDFF | Arabic Presentation Forms-A
U+FE70-U+FEFF | Arabic Presentation Forms-B
U+1F300-U+1F9FF | Miscellaneous emoji and symbols (supplementary plane)

Finding code points: The Unicode character search at unicode.org/charts is the authoritative reference for all block ranges. Each block chart PDF lists every assigned and reserved code point with its official name.

Troubleshooting

Symptom | Likely cause and fix
Parse Schema returns "XML syntax error" | The XML is not well-formed. Check for unclosed tags, unescaped & or < characters in text, or mismatched quotes around attribute values.
"No <char> elements found" | The schema parsed successfully but contains no character definitions. Ensure your <char> elements contain text content (<char>U+0041</char>) rather than being self-closing (<char/>).
All characters reported as violations | The namespace on the root element is missing or incorrect. It must be exactly http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0.
Obviously valid characters flagged as violations | The range may not cover the expected code points. For example, U+0020-U+007E covers printable ASCII but not tab (U+0009) or newline (U+000A); add those separately.
Warning: "<ref> external reference not supported" | The schema uses <ref href="…"/> to reference another file. This is valid CREPDL but not yet supported by this validator. Copy the referenced file's contents inline.
Schema parses but shows 0 permitted code points | Check for an empty <union></union> with no children, a <difference> whose subtracted sets together cover the entire starting set, or an <intersection> whose children have no code points in common.
Supplementary-plane characters always flagged | Use five- or six-digit code points: <char>U+1F600-U+1F64F</char>, not four-digit notation.