Interesting Character Classes
My goal with this page is to assemble a collection of interesting (and potentially useful) regex character classes. I will try to organize the collection into themes.
For easy navigation, here are some jumping points to various sections of the page:
✽ How do these Character Classes Work?
✽ Useful ASCII Ranges
✽ Obnoxious Ranges
✽ Strange or Beautiful Ranges
How do these Character Classes Work?
Before we start, I want to make sure you don't feel confused when you stumble on something like [!-~]. Remember that the hyphen defines a range between two characters in the ASCII table (or between two Unicode code points, depending on the engine). But a range does not have to look like [a-z]…
If you consult the ASCII table, you will see that [!-~] is a valid range—and a useful one too.
Sometimes, instead of a straight character class, you'll see something like (?![aeiou])[a-z]. The first part is a negative lookahead that asserts that the following character is not one of those in a given class. This is a way to perform character class subtraction in regex engines that don't support that operation (and that's most of them). In this example, the resulting character class is that of English lower-case consonants, since we have removed the vowels [aeiou] from the range of letters [a-z]. You may, by the way, notice that the letter a appears in both classes: we could have written this as (?![eiou])[b-z].
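To make both ideas concrete, here is a minimal sketch in Python (my choice of language for illustration; the page itself is engine-neutral). It counts the ASCII characters that [!-~] matches, then uses the lookahead form to pull the consonants out of a sample string. The variable names and the sample text are made up for the example.

```python
import re

# [!-~] spans every code point from '!' (0x21) to '~' (0x7E):
# the printable ASCII characters other than the space.
printable_no_space = re.compile(r"[!-~]")
matched = [chr(c) for c in range(128) if printable_no_space.fullmatch(chr(c))]
print(len(matched))  # 94

# Class subtraction with a negative lookahead: lower-case letters minus vowels.
consonant = re.compile(r"(?![aeiou])[a-z]")
print("".join(consonant.findall("character class")))  # chrctrclss

# The shorter variant drops 'a' from both sides and matches the same letters.
shorter = re.compile(r"(?![eiou])[b-z]")
print(consonant.findall("character class") == shorter.findall("character class"))  # True
```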
Useful ASCII Ranges
All Printable Characters in the ASCII Table
All Printable Characters in the ASCII Table—Except the Space Character
All "Special Characters" in the ASCII Table
All "Special Characters" in the ASCII Table—Without Using Lookahead
All Latin and Accented Characters
All English Consonants
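One way to check ranges like the ones named in this section against the ASCII table is to enumerate what each pattern matches. The sketch below is in Python, and the four patterns in it are my own reconstructions based on the section titles, not necessarily the exact expressions this page intends. In particular, it assumes that "special characters" means printable, non-space, non-alphanumeric characters. The helper name members is hypothetical.

```python
import re
import string

ascii_chars = [chr(c) for c in range(128)]

def members(pattern):
    """Return the set of ASCII characters that a one-character pattern matches."""
    rx = re.compile(pattern)
    return {ch for ch in ascii_chars if rx.fullmatch(ch)}

# Assumed patterns (reconstructions for illustration only).
printable = members(r"[ -~]")                          # space through tilde
visible = members(r"[!-~]")                            # same, minus the space
special_lookahead = members(r"(?![a-zA-Z0-9])[!-~]")   # subtraction via lookahead
special_plain = members(r"[!-/:-@\[-`{-~]")            # four explicit ranges

print(len(printable), len(visible))                    # 95 94
print(special_lookahead == special_plain)              # True
print(special_plain == set(string.punctuation))        # True
```

The last line shows that, under this assumption, the "special characters" are exactly the 32 characters in Python's string.punctuation.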
Obnoxious Ranges
Alphanumeric Characters
[^\W_]
This is an interesting class for engines that don't support the POSIX [[:alnum:]]. It makes use of the fact that \w is very close to what we want. [^\W] is a double negation that matches the same as \w. By adding _ to the negated class, we are left with ASCII letters and digits. Watch out, though: in Python 3 and .NET, \w by default matches letters and digits in any script, not just ASCII ones. But frankly... Just use [a-zA-Z0-9]. See also Any White-Space Character Except the Newline Character.
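The difference between the Unicode and ASCII readings of \w is easy to see in Python 3, where the re.ASCII flag switches between them. This is only a sketch; the sample string is made up, and the accented letters are written as escapes so the code points are unambiguous.

```python
import re

text = "h\u00e9llo_w\u00f6rld 42"  # "héllo_wörld 42"

# Default (Unicode) mode: \w covers letters and digits of any script,
# so the accented letters stay inside the matches.
unicode_runs = re.findall(r"[^\W_]+", text)
print(unicode_runs)  # ['héllo', 'wörld', '42']

# With re.ASCII, \w is [a-zA-Z0-9_], so the class reduces to [a-zA-Z0-9]
# and the accented letters split the words.
ascii_runs = re.findall(r"[^\W_]+", text, re.ASCII)
print(ascii_runs)    # ['h', 'llo', 'w', 'rld', '42']
```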
[^\D2-9]+
This is the same idea as the regex above to match alphanumeric characters. In most engines, the character class only matches the digits 0 and 1. The + quantifier makes this an obnoxious regex to match a binary number. If you want to do that, [01]+ is all you need. Note that in .NET and Python 3, \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".
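Here is a small Python 3 sketch of both behaviors. With re.ASCII the class behaves like [01]. Without the flag, a digit from another script slips through. The sample strings are made up, and the test character is ARABIC-INDIC DIGIT THREE (U+0663).

```python
import re

# ASCII mode: \d is [0-9], so [^\D2-9] is just [01].
binary_ascii = re.compile(r"[^\D2-9]+", re.ASCII)
print(binary_ascii.findall("1010 and 2019"))  # ['1010', '01']

# Default (Unicode) mode in Python 3: \d also matches digits of other scripts.
binary_unicode = re.compile(r"[^\D2-9]+")
arabic_three = "\u0663"
print(bool(binary_unicode.fullmatch(arabic_three)))   # True
print(bool(binary_ascii.fullmatch(arabic_three)))     # False
print(bool(re.fullmatch(r"[01]+", arabic_three)))     # False
```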
Strange or Beautiful Ranges
Square Brackets
This will work in .NET, Perl, PCRE and Python.
Words you can Type with your Left Hand
(But you'll need a QWERTY keyboard.)
Words you can Type with your Right Hand (QWERTY keyboard)
Words that only use Letters from the Top Row (QWERTY keyboard)
Line-Break-Related
Any Character Including Line Breaks
These are ways to replicate the behavior of the dot in DOTALL mode (by default, the dot does not match line breaks): [\S\s] or [\D\d] or [\w\W]. Note that in each of these classes, I have tried to place in first position the token that has the greatest chance of matching first (which of course would depend on the target text).
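As a quick illustration in Python, where DOTALL mode is the re.DOTALL flag, the sketch below compares the plain dot, the three classes, and the dot in DOTALL mode on a two-line string. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# By default the dot stops at the line break.
dot_default = re.match(r".+", text).group()
print(dot_default)  # first line

# Each of the three classes matches any character, line breaks included.
class_matches = [re.match(p, text).group() for p in (r"[\S\s]+", r"[\D\d]+", r"[\w\W]+")]
print(all(m == text for m in class_matches))  # True

# Same result as the dot in DOTALL mode.
dot_all = re.match(r".+", text, re.DOTALL).group()
print(dot_all == text)  # True
```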
Any White-Space Character Except the Newline Character
You may not have a use for this, but it's an interesting class making use of double negation. We're negating \S, so that's the same as all white-space characters \s. But the \n removes itself from the set.
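From the description above, the class in question would be [^\S\n]. That is an inference on my part, since the expression itself is not shown here. Under that assumption, a Python sketch that collapses runs of horizontal white space while leaving line breaks alone might look like this (the sample string is made up):

```python
import re

# Assumed pattern: "not (non-white-space or newline)", i.e. white space minus \n.
ws_no_newline = re.compile(r"[^\S\n]+")

messy = "a \t b\n\tc"
cleaned = ws_no_newline.sub(" ", messy)
print(repr(cleaned))  # 'a b\n c'

# The newline itself is never matched.
print(ws_no_newline.search("\n") is None)  # True
```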
Alternative to [\r\n] for Java and Ruby 2+
This rather pointless regex (except as a learning device) relies on the fact that in these engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab or a form feed: the negative lookahead removes all of those characters except the newline and carriage return.
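The idea can be sketched in Python, with two loud assumptions: the pattern below is my reconstruction from the description (a lookahead that excludes space, tab, form feed and vertical tab, followed by \s), and Python's re.ASCII flag stands in for the engines named above, since it restricts \s to the same six characters.

```python
import re

# Assumed reconstruction: \s minus space, tab, form feed and vertical tab.
line_break = re.compile(r"(?![ \t\f\v])\s", re.ASCII)

six_whitespace = " \t\n\r\x0b\x0c"  # space, tab, LF, CR, VT, FF
survivors = line_break.findall(six_whitespace)
print(survivors)  # ['\n', '\r']

# Same result as the plain class it imitates.
print(survivors == re.findall(r"[\r\n]", six_whitespace))  # True
```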
The controversial capital letter for ß, now included in Unicode, is missing in many fonts, so it might show on your screen as a question mark.
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Note that there is no Q, V or X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ].
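A short Python sketch of the two Polish classes in use. The sample words are made up, and the check assumes the text uses precomposed characters (for instance ż as a single code point rather than z plus a combining dot).

```python
import re

polish_only = re.compile(r"[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+")
polish_plus_english = re.compile(r"[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+")

# A word made only of Polish letters matches both classes.
print(bool(polish_only.fullmatch("żółć")))          # True

# Q, V and X are rejected by the strict class but accepted by the wider one.
print(bool(polish_only.fullmatch("quiz")))          # False
print(bool(polish_plus_english.fullmatch("quiz")))  # True
```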