CREPDL·CHECK — Tutorial & Guide

What is CREPDL?

CREPDL (Character Repertoire Description Language) is an ISO/IEC standard (19757-7:2020) for describing exactly which Unicode characters are permitted in a given context — a language, a document type, a form field, or any other constrained text domain.

A CREPDL schema is a small XML file. It lists the permitted code points using <char> elements for individual characters or ranges, combined with set operations like <union>, <intersection>, and <difference>. CREPDL·CHECK reads this schema and checks whether every character in your text belongs to the declared repertoire.

Common uses include: checking that user-submitted text in a web form only contains characters from a specific script; validating that a localised document does not accidentally contain characters from the wrong language; or auditing legacy data for encoding problems.

ℹ Namespace: CREPDL·CHECK uses the current (2020) schema namespace, http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0. It must appear as the default namespace on the root element of every schema you write.

How to validate text

Load a CREPDL schema

Paste a CREPDL XML schema directly into the CREPDL Schema panel on the left, or click Upload .crepdl file to load one from disk. The schema panel accepts any well-formed XML in either the current or legacy CREPDL namespace.

Parse the schema

Click Parse Schema. The application checks the XML for syntax errors and counts the declared character definitions. A green ✓ Schema loaded badge confirms success. Any structural warnings appear beneath the schema panel.
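
The parse step can be approximated offline. The sketch below is not CREPDL·CHECK's own implementation. It assumes Python's standard `xml.etree.ElementTree` module and a hypothetical helper named `parse_schema`, and it only checks well-formedness and counts non-empty `<char>` elements in the current namespace.

```python
import xml.etree.ElementTree as ET

# Current (2020) CREPDL namespace, as given in this guide.
CREPDL_NS = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"


def parse_schema(xml_text):
    """Hypothetical helper: return (char_count, warnings) for a schema string.

    Raises ValueError if the XML is not well-formed.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"XML syntax error: {exc}") from exc

    warnings = []
    char_tag = f"{{{CREPDL_NS}}}char"
    char_count = 0
    # root.iter() visits the root element itself and all of its descendants.
    for element in root.iter(char_tag):
        if (element.text or "").strip():
            char_count += 1
        else:
            warnings.append("empty <char> element ignored")
    if char_count == 0:
        warnings.append("no <char> elements found")
    return char_count, warnings


SCHEMA = """<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0041</char>
  <char>U+0061-U+007A</char>
</union>"""

print(parse_schema(SCHEMA))  # (2, [])
```

The warning strings are illustrative only; the messages shown by the application may be worded differently.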

Enter or upload your text

Type or paste the text you want to check into the Text to Validate panel, or click Upload text file. The character counter updates live. There is no length limit enforced by the interface.

Run validation

Click ▶ Validate Against Repertoire. The text is sent to the server together with the schema. Results appear immediately below — a green banner if the text is fully compliant, or a red banner listing every out-of-repertoire character found.

Reading the results

After validation, three tabs present the findings:

Violations — a table of every distinct code point found in the text that is not in the permitted repertoire. Each row shows the character glyph, its Unicode code point (e.g. U+00E9), a description, and how many times it appears.

Preview — the full text with every violating character highlighted with a red wavy underline. Hovering over a highlighted character shows a tooltip with the code point and description.

Statistics — a summary grid showing total characters checked, violation count, unique violating code points, violation rate, permitted code point count, and compliant character count.
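
The rows of the Violations tab can be reproduced roughly as follows. This is a minimal sketch, assuming the repertoire is already available as a Python set of integer code points; the function name `violation_table` and the row keys are hypothetical, and descriptions come from Python's bundled `unicodedata` tables rather than from the validator.

```python
import unicodedata
from collections import Counter


def violation_table(text, permitted):
    """Return one row per distinct out-of-repertoire code point.

    `permitted` is a set of integer code points (an assumed representation).
    """
    counts = Counter(ch for ch in text if ord(ch) not in permitted)
    rows = []
    for ch, count in sorted(counts.items()):
        rows.append({
            "glyph": ch,
            "code_point": f"U+{ord(ch):04X}",
            "description": unicodedata.name(ch, "<unnamed>"),
            "count": count,
        })
    return rows


# Assumed repertoire for the demo: printable Basic Latin only (U+0020-U+007E).
permitted = set(range(0x20, 0x7F))
for row in violation_table("caf\u00e9 d\u00e9j\u00e0 vu", permitted):
    print(row)
```

For the sample text, the sketch yields two rows: U+00E0 with a count of 1 and U+00E9 with a count of 2.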

Example response values

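Field	Example value
Compliant	✓ Yes — all characters permitted (applies only when the violation count is zero; the remaining rows illustrate a text with violations)
Total characters	312
Permitted code points	173
Violations	4 occurrences · 3 distinct code points
Violation rate	1.28%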

💡 Tip: The violation rate is calculated as a percentage of the total number of Unicode code points in the text, not bytes or UTF-16 code units. Supplementary-plane characters (emoji, some CJK extensions) count as one code point each.
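
The difference between code points and UTF-16 code units, and the arithmetic behind the 1.28% figure above, can be sketched in a few lines. The function names are hypothetical, and rounding to two decimal places is an assumption based on the example value, not a documented behaviour of the validator.

```python
def count_code_points(text):
    # A Python str is a sequence of Unicode code points, so len() counts them.
    return len(text)


def count_utf16_units(text):
    # Each UTF-16 code unit is two bytes; supplementary-plane characters need two units.
    return len(text.encode("utf-16-le")) // 2


def violation_rate(violations, total):
    """Percentage of code points that are out of repertoire, rounded to 2 places."""
    return round(100 * violations / total, 2) if total else 0.0


sample = "ok \U0001F600"           # three BMP characters plus one emoji
print(count_code_points(sample))   # 4
print(count_utf16_units(sample))   # 5
print(violation_rate(4, 312))      # 1.28
```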

Schema structure

Every CREPDL schema is a well-formed XML document. The root element is a collection expression — most commonly <union> — in the CREPDL namespace. There is no outer wrapper element; the collection is the document.

<?xml version="1.0" encoding="UTF-8"?>
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">

  <!-- A single character -->
  <char>U+0041</char>

  <!-- An inclusive range of characters -->
  <char>U+0061-U+007A</char>

</union>

The root element must carry the namespace declaration xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0". Without it, the validator does not recognise the elements as CREPDL and the schema cannot be used.

The root can be any collection element: <union>, <intersection>, <difference>, or even a bare <char> if the entire repertoire is a single character or range. In practice almost all schemas use <union> as the root because most repertoires are additive.

The <char> element

<char> is the atomic element of CREPDL. Its text content specifies one or more Unicode code points using standard U+XXXX notation.

Format	Example	Meaning
<char>U+XXXX</char>	`<char>U+0041</char>`	Permits exactly the one character at code point U+0041 (LATIN CAPITAL LETTER A)
<char>U+XXXX-U+YYYY</char>	`<char>U+0061-U+007A</char>`	Permits every code point from U+0061 to U+007A inclusive (a–z, 26 characters)

Code point values are written in uppercase hexadecimal with the U+ prefix. They must be zero-padded to at least four digits: U+0041, not U+41. Supplementary-plane characters use five or six digits: U+1F600.

In a range, the start code point must be less than or equal to the end code point. The range is always inclusive at both ends.

<!-- Individual characters -->
<char>U+0009</char>   <!-- TAB -->
<char>U+000A</char>   <!-- LINE FEED -->
<char>U+00A0</char>   <!-- NO-BREAK SPACE -->
<char>U+20AC</char>   <!-- EURO SIGN -->

<!-- Ranges -->
<char>U+0020-U+007E</char>   <!-- Printable Basic Latin (space through tilde) -->
<char>U+00C0-U+00FF</char>   <!-- Latin-1 Supplement letters (plus × U+00D7 and ÷ U+00F7) -->
<char>U+4E00-U+9FFF</char>   <!-- CJK Unified Ideographs -->
<char>U+1F600-U+1F64F</char> <!-- Emoticons (supplementary plane) -->
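
The notation rules above (uppercase hexadecimal, at least four digits, start not greater than end) can be checked mechanically. The following is a minimal sketch, not the validator's parser; `parse_char` is a hypothetical helper, and rejecting lowercase digits is an assumption drawn from the wording of this guide.

```python
import re

# "U+" followed by 4 to 6 uppercase hex digits, optionally "-U+..." for a range.
CHAR_PATTERN = re.compile(r"U\+([0-9A-F]{4,6})(?:-U\+([0-9A-F]{4,6}))?")
MAX_CODE_POINT = 0x10FFFF


def parse_char(content):
    """Hypothetical helper: turn <char> text content into an inclusive (start, end) pair."""
    match = CHAR_PATTERN.fullmatch(content.strip())
    if match is None:
        raise ValueError(f"not in U+XXXX or U+XXXX-U+YYYY form: {content!r}")
    start = int(match.group(1), 16)
    end = int(match.group(2), 16) if match.group(2) else start
    if end > MAX_CODE_POINT:
        raise ValueError(f"beyond U+10FFFF: {content!r}")
    if start > end:
        raise ValueError(f"range start exceeds range end: {content!r}")
    return start, end


print(parse_char("U+0041"))           # (65, 65)
print(parse_char("U+0061-U+007A"))    # (97, 122)
print(parse_char("U+1F600-U+1F64F"))  # (128512, 128591)
```

Inputs such as `U+41` (too few digits) or `U+007A-U+0061` (reversed range) raise `ValueError` in this sketch.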

⚠ Common mistake: Do not write <char/> as a self-closing empty element; it will be ignored with a warning. The code point value must always be the text content of the element.

Set operations

CREPDL provides three set-operation elements. Each takes two or more child collection elements as operands. You can nest them to build complex repertoires.

Element	Operation	Result
<union>	A ∪ B ∪ C …	All code points that appear in any child — the most commonly used element
<intersection>	A ∩ B ∩ C …	Only code points that appear in every child simultaneously
<difference>	A − B − C …	Code points in the first child minus those in all subsequent children

<!-- DIFFERENCE: Basic Latin letters only (no digits, no punctuation) -->
<difference xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>   <!-- start: all printable Basic Latin -->
  <char>U+0020-U+0040</char>   <!-- subtract: space, punctuation, digits 0–9, and symbols through @ -->
  <char>U+005B-U+0060</char>   <!-- subtract: punctuation between Z and a -->
  <char>U+007B-U+007E</char>   <!-- subtract: punctuation after z -->
</difference>

<!-- INTERSECTION: Cyrillic characters that are also in a reference set -->
<intersection xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0400-U+04FF</char>   <!-- Cyrillic block -->
  <ref href="russian.crepdl"/>   <!-- intersect with Russian repertoire -->
</intersection>

<!-- NESTING: Combining operations -->
<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>
  <difference>
    <char>U+00C0-U+00FF</char>
    <char>U+00D7</char>   <!-- exclude × MULTIPLICATION SIGN -->
    <char>U+00F7</char>   <!-- exclude ÷ DIVISION SIGN -->
  </difference>
</union>
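
The semantics in the table above map directly onto ordinary set arithmetic. The sketch below evaluates a schema recursively with Python sets. It is an illustration under stated assumptions, not the validator's code: `char_set` and `evaluate` are hypothetical names, input is assumed to be well-formed `U+XXXX` notation, and `<ref>` is deliberately left unsupported.

```python
import xml.etree.ElementTree as ET

NS = "{http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0}"


def char_set(content):
    """Expand 'U+XXXX' or 'U+XXXX-U+YYYY' into a set of integer code points."""
    start, _, end = content.strip().partition("-")
    first = int(start[2:], 16)
    last = int(end[2:], 16) if end else first
    return set(range(first, last + 1))


def evaluate(element):
    """Recursively evaluate a collection element to a set of code points."""
    name = element.tag.replace(NS, "")
    if name == "char":
        return char_set(element.text or "")
    operands = [evaluate(child) for child in element]
    if not operands:
        return set()
    if name == "union":
        return set.union(*operands)
    if name == "intersection":
        return set.intersection(*operands)
    if name == "difference":
        # First operand minus every subsequent operand.
        return operands[0].difference(*operands[1:])
    raise ValueError(f"unsupported element: {name}")


SCHEMA = """<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>
  <difference>
    <char>U+00C0-U+00FF</char>
    <char>U+00D7</char>
    <char>U+00F7</char>
  </difference>
</union>"""

repertoire = evaluate(ET.fromstring(SCHEMA))
print(len(repertoire))       # 157 (95 printable ASCII + 64 - 2)
print(0xD7 in repertoire)    # False
```

Representing each operand as a full set is simple but memory-hungry for very large ranges; an interval-based representation would scale better.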

External references

The <ref> element references another CREPDL file by URI, allowing repertoires to be composed from smaller reusable pieces.

<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <ref href="en-english.crepdl"/>
  <ref href="fr-french.crepdl"/>
  <ref href="de-german.crepdl"/>
</union>

⚠ Not yet supported in CREPDL·CHECK: External <ref> elements are valid CREPDL syntax, but the validator currently processes only inline <char> definitions. A referenced file is reported as a warning and its characters are not included in the repertoire. To work around this, copy the contents of the referenced files inline, for example as children of a <union>.
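
The inlining workaround can be scripted. The following sketch is one possible approach and rests on several assumptions: the referenced schemas have already been loaded into a dictionary keyed by `href`, their contents here are made up for the demo, and `inline_refs` is a hypothetical helper. It does not fetch files or URLs and does not guard against circular references.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0"
REF_TAG = f"{{{NS_URI}}}ref"
# Serialise with CREPDL as the default namespace rather than an ns0: prefix.
ET.register_namespace("", NS_URI)


def inline_refs(element, sources):
    """Replace each <ref href="..."/> child with the root of the referenced schema.

    `sources` maps href strings to schema XML text loaded beforehand
    (an assumption of this sketch).
    """
    for index, child in enumerate(list(element)):
        if child.tag == REF_TAG:
            replacement = ET.fromstring(sources[child.get("href")])
            inline_refs(replacement, sources)  # referenced schemas may contain refs too
            replacement.tail = child.tail      # keep the original indentation
            element[index] = replacement
        else:
            inline_refs(child, sources)
    return element


# Hypothetical file contents, for illustration only.
sources = {
    "en-english.crepdl": f'<char xmlns="{NS_URI}">U+0020-U+007E</char>',
    "fr-french.crepdl": (
        f'<union xmlns="{NS_URI}">'
        "<char>U+00C0-U+00FF</char><char>U+0152-U+0153</char></union>"
    ),
}
schema = ET.fromstring(
    f'<union xmlns="{NS_URI}">'
    '<ref href="en-english.crepdl"/><ref href="fr-french.crepdl"/></union>'
)
inline_refs(schema, sources)
print(ET.tostring(schema, encoding="unicode"))
```

The printed result contains only `<union>` and `<char>` elements and can be pasted into the schema panel.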

Complete examples

Basic: Irish (Gaeilge)

<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0009</char>
  <char>U+000A</char>
  <char>U+000D</char>
  <char>U+0020-U+007E</char>
  <char>U+00A0</char>
  <char>U+00C0-U+00FF</char>
  <char>U+2013-U+2014</char>
  <char>U+2018-U+201E</char>
  <char>U+2026</char>
</union>

Multi-script: Arabic + Latin

<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <!-- Shared whitespace/controls -->
  <char>U+0009</char>
  <char>U+000A</char>
  <char>U+0020-U+007E</char>
  <!-- Arabic blocks -->
  <char>U+0600-U+06FF</char>
  <char>U+0750-U+077F</char>
  <char>U+FB50-U+FDFF</char>
  <char>U+FE70-U+FEFF</char>
  <char>U+200F</char>
</union>

Set operation: Letters and space (no digits)

<difference xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <!-- Start: space, A–Z, a–z, and digits 0–9 -->
  <union>
    <char>U+0020</char>
    <char>U+0041-U+005A</char>
    <char>U+0061-U+007A</char>
    <char>U+0030-U+0039</char>
  </union>
  <!-- Subtract: digits -->
  <char>U+0030-U+0039</char>
</difference>

CJK: Japanese (Nihongo)

<union xmlns="http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0">
  <char>U+0020-U+007E</char>
  <!-- Hiragana -->
  <char>U+3040-U+309F</char>
  <!-- Katakana -->
  <char>U+30A0-U+30FF</char>
  <!-- CJK Unified Ideographs (main block, then Extension A) -->
  <char>U+4E00-U+9FFF</char>
  <char>U+3400-U+4DBF</char>
  <!-- CJK Symbols and Punctuation -->
  <char>U+3000-U+303F</char>
</union>

Common code-point ranges

Range	Block / Description
U+0009, U+000A, U+000D	Tab, Line Feed, Carriage Return — include in almost every schema
U+0020-U+007E	Printable Basic Latin — space through tilde; all ASCII printable characters
U+00A0	No-Break Space — often needed alongside U+0020
U+00C0-U+00FF	Latin-1 Supplement letters (à, é, ñ, ü …); the range also contains × (U+00D7) and ÷ (U+00F7)
U+0100-U+017F	Latin Extended-A — additional accented letters for Central/Eastern European languages
U+0370-U+03FF	Greek and Coptic
U+0400-U+04FF	Cyrillic
U+0600-U+06FF	Arabic
U+0900-U+097F	Devanagari (Hindi, Marathi, Sanskrit …)
U+0E00-U+0E7F	Thai
U+1E00-U+1EFF	Latin Extended Additional — including Vietnamese precomposed characters
U+2013-U+2014	En dash, Em dash — recommended for most Western-language schemas
U+2018-U+201F	Typographic quotation marks (‘ ’ ‚ ‛ “ ” „ ‟)
U+2026	Horizontal Ellipsis (…)
U+20AC	Euro Sign (€)
U+3040-U+309F	Hiragana
U+30A0-U+30FF	Katakana
U+4E00-U+9FFF	CJK Unified Ideographs (main block)
U+AC00-U+D7A3	Hangul Syllables (Korean)
U+FB50-U+FDFF	Arabic Presentation Forms-A
U+FE70-U+FEFF	Arabic Presentation Forms-B
U+1F300-U+1F9FF	Miscellaneous emoji and symbols (supplementary plane)

📖 Finding code points: The Unicode code charts at unicode.org/charts are the authoritative reference for all block ranges. Each block chart PDF lists every assigned and reserved code point with its official name.
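
When a representative sample of acceptable text is available, a first draft of the `<char>` lines can also be derived from it and then checked against the charts. This is a rough sketch with a hypothetical function name, `to_char_elements`; it simply collapses consecutive code points into ranges and makes no judgement about which characters ought to be permitted.

```python
def to_char_elements(text):
    """Collapse the distinct code points in `text` into <char> lines."""
    points = sorted({ord(ch) for ch in text})
    lines = []
    i = 0
    while i < len(points):
        j = i
        # Extend the run while the next code point is exactly one higher.
        while j + 1 < len(points) and points[j + 1] == points[j] + 1:
            j += 1
        start, end = points[i], points[j]
        if start == end:
            lines.append(f"<char>U+{start:04X}</char>")
        else:
            lines.append(f"<char>U+{start:04X}-U+{end:04X}</char>")
        i = j + 1
    return lines


for line in to_char_elements("abcx\u20ac"):
    print(line)
```

For the sample string, the sketch prints one range for a to c and single-character lines for x and the euro sign.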

Troubleshooting

Symptom	Likely cause and fix
Parse Schema returns "XML syntax error"	The XML is not well-formed. Check for unclosed tags, unescaped `&` or `<` characters in text, or mismatched quotes around attribute values.
"No <char> elements found"	The schema parsed successfully but contains no character definitions. Ensure your `<char>` elements contain text content (`<char>U+0041</char>`) rather than being self-closing (`<char/>`).
All characters reported as violations	The namespace on the root element is missing or incorrect. It must be exactly `http://purl.oclc.org/dsdl/crepdl/ns/structure/2.0`.
Obvious valid characters flagged as violations	The range may not cover the expected code points. For example, U+0020-U+007E covers printable ASCII but not tab (U+0009) or newline (U+000A) — add those separately.
Warning: "<ref> external reference not supported"	The schema uses `<ref href="…"/>` to reference another file. This is valid CREPDL but not yet supported by this validator. Copy the referenced file's contents inline.
Schema parses but shows 0 permitted code points	Check for an empty `<union></union>` with no children, a `<difference>` whose subtracted sets together cover every code point in the starting set, or an `<intersection>` whose operands have no code points in common.
Supplementary-plane characters always flagged	Use five- or six-digit code points: `<char>U+1F600-U+1F64F</char>`, not four-digit notation.
