Interesting Character Classes


My goal with this page is to assemble a collection of interesting (and potentially useful) regex character classes. I will try to organize the collection into themes.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

How do these Character Classes Work?
Useful ASCII Ranges
Obnoxious Ranges
Strange or Beautiful Ranges
Line-Break-Related
Language-Related


(direct link)

How do these Character Classes Work?

Before we start, I want to make sure you don't feel confused when you stumble on something like [!-~]. Remember that the hyphen defines a range between two characters in the ASCII table (or between two Unicode code points, depending on the engine). But a range does not have to look like [a-z]

If you consult the ASCII table, you will see that [!-~] is a valid range—and a useful one too.

Sometimes, instead of a straight character class, you'll see something like (?![aeiou])[a-z]. The first part is a negative lookahead that asserts that the following character is not one of those in a given range. This is a way to perform character class subtraction in regex engines that don't support that operation—and that's most of them. In this example, the resulting character class is that of English lower-case consonants, since we have removed the vowels [aeiou] from the range of letters [a-z]. You may, by the way, notice that the letter a appears in both classes: we could have written this (?![eiou])[b-z]


(direct link)

Useful ASCII Ranges

All Printable Characters in the ASCII Table
[ -~]
All Printable Characters in the ASCII Table—Except the Space Character
[!-~]
All "Special Characters" in the ASCII Table
(?![a-zA-Z0-9])[!-~]
All "Special Characters" in the ASCII Table—Without Using Lookahead
[!-/:-@\[-`{-~]
All Latin and Accented Characters
(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])
All English Consonants
[b-df-hj-np-tv-z]


(direct link)

Obnoxious Ranges

Alphanumeric Characters
[^\W_] This is an interesting class for engines that don't support the POSIX [[:alnum:]]. It makes use of the fact that \w is very close to what we want. [^\W] is a double negation that matches the same as \w. By adding _ to the negated class, we are left with ASCII digits and numbers. Watch out, though: in Python and .NET, \w matches any unicode letter. But frankly... Just use [a-zA-Z0-9]. See also Any White-Space Except Newline.

Binary Number
[^\D2-9]+ This is the same idea as the regex above to match alphanumeric characters. In most engines, the character class only matches digits 0 or 1. The + quantifier makes this an obnoxious regex to match a binary number—if you want to do that, [01]+ is all you need. Note that in .NET and Python 3 some engines \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".


(direct link)

Strange or Beautiful Ranges

Square Brackets
This will work in .NET, Perl, PCRE and Python.
[][]
The crazy thing is that there is a lot of variation among engines as to which brackets need to be escaped. While [\]\[] will work everywhere, in JavaScript you can use [[\]], and in Java you can use []\[].

Words you can Type with your Left Hand
(But you'll need a QWERTY keyboard.)
(?i)\b[a-fq-tv-xz]+\b
Words you can Type with your Right Hand (QWERTY keyboard)
(?i)\b[ug-py]+\b
Words that only use Letters from the Top Row (QWERTY keyboard)
(?i)\b[eio-rtuwy]+\b


(direct link)

Line-Break-Related

Any Character Including Line Breaks
These are ways to replicate the behavior of the dot in DOTALL mode (by default, the dot does not match line breaks): [\S\s] or [\D\d] or [\w\W]. Note that in each of these classes, I have tried to place in first position the token that has the greatest chance of matching first (which of course would depend on the target text).

Any White-Space Character Except the Newline Character
You may not have a use for this, but it's an interesting class making use of double negation. We're negating \S, so that's the same as all white-space characters \s. But the \n removes itself from the set.
[^\S\n]
Alternative to [\r\n] for Java and Ruby 2+
(?![ \t\cK\f])\s
This rather pointless regex (except as a learning device) relies on the fact that in these three engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab or a form feed: the negative lookahead removes all of those characters except the newline and carriage return.


(direct link)

Language-Related

French Letters
[a-zA-ZàâäôéèëêïîçùûüÿæœÀÂÄÔÉÈËÊÏΟÇÙÛÜÆŒ]
German Letters
The controversial capital letter for ß, now included in unicode, is missing in many fonts, so it might show on your screen as a question mark.
[a-zA-ZäöüßÄÖÜẞ]
Polish Letters
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ] Note that there is no Q, V and X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]

Italian Letters
[a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]
Spanish Letters
[a-zA-ZáéíñóúüÁÉÍÑÓÚÜ]





Smiles,

Rex



Be the First to Leave a Comment






All comments are moderated.
Link spammers, this won't work for you.

To prevent automatic spam, we require that you type the two words below before you submit your comment.