Interesting Regex Character Classes (2024)



My goal with this page is to assemble a collection of interesting (and potentially useful) regex character classes. I will try to organize the collection into themes.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

How do these Character Classes Work?
Useful ASCII Ranges
Obnoxious Ranges
Strange or Beautiful Ranges
Line-Break-Related
Language-Related

How do these Character Classes Work?

Before we start, I want to make sure you don't feel confused when you stumble on something like [!-~]. Remember that the hyphen defines a range between two characters in the ASCII table (or between two Unicode code points, depending on the engine). A range does not have to look like [a-z]: any two characters work, as long as the first comes before the second in the table.

If you consult the ASCII table, you will see that [!-~] is a valid range—and a useful one too.

Sometimes, instead of a straight character class, you'll see something like (?![aeiou])[a-z]. The first part is a negative lookahead that asserts that the following character is not one of those in a given class. This is a way to perform character class subtraction in regex engines that don't support that operation (and that's most of them). In this example, the resulting character class is that of English lower-case consonants, since we have removed the vowels [aeiou] from the range of letters [a-z]. You may, by the way, notice that the letter a appears in both classes: we could have written this as (?![eiou])[b-z].
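
To make the two ideas above concrete, here is a minimal sketch using Python's re module (my choice of engine for illustration; the variable names are made up for this example). It lists what [!-~] matches in the ASCII table and applies the lookahead-based subtraction to a short string.

```python
import re

# [!-~] is a range from code point 0x21 ('!') to 0x7E ('~').
printable_no_space = re.compile(r"[!-~]")
matched = [chr(c) for c in range(128) if printable_no_space.fullmatch(chr(c))]
print(len(matched), matched[0], matched[-1])

# Lookahead-based "subtraction": lower-case letters minus the vowels.
consonant = re.compile(r"(?![aeiou])[a-z]")
consonants_found = "".join(consonant.findall("character class"))
print(consonants_found)
```

The lookahead consumes no characters: it only vetoes a position before [a-z] gets a chance to match there.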

Useful ASCII Ranges

All Printable Characters in the ASCII Table
[ -~]
All Printable Characters in the ASCII Table—Except the Space Character
[!-~]
All "Special Characters" in the ASCII Table
(?![a-zA-Z0-9])[!-~]
All "Special Characters" in the ASCII Table—Without Using Lookahead
[!-/:-@\[-`{-~]
All Latin and Accented Characters
(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])
All English Consonants
[b-df-hj-np-tv-z]
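
As a quick sanity check, the sketch below (Python's re module, chosen here only for illustration) confirms that the two "special characters" classes above select the same ASCII characters. The comparison with string.punctuation is my own addition, not something the classes depend on.

```python
import re
import string

ascii_chars = [chr(c) for c in range(128)]

with_lookahead = re.compile(r"(?![a-zA-Z0-9])[!-~]")
without_lookahead = re.compile(r"[!-/:-@\[-`{-~]")

special_a = [ch for ch in ascii_chars if with_lookahead.fullmatch(ch)]
special_b = [ch for ch in ascii_chars if without_lookahead.fullmatch(ch)]

# Both approaches should pick out the same 32 characters.
print(special_a == special_b, len(special_b))
# Python's own list of ASCII punctuation covers the same set.
print("".join(special_b) == string.punctuation)
```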

Obnoxious Ranges

Alphanumeric Characters
[^\W_]
This is an interesting class for engines that don't support the POSIX [[:alnum:]]. It makes use of the fact that \w is very close to what we want. [^\W] is a double negation that matches the same as \w. By adding _ to the negated class, we are left with ASCII letters and digits. Watch out, though: in Python 3 and .NET, \w matches letters and digits from any Unicode script by default. But frankly... Just use [a-zA-Z0-9]. See also Any White-Space Character Except the Newline Character.

Binary Number
[^\D2-9]+
This is the same idea as the regex above to match alphanumeric characters. In most engines, the character class only matches the digits 0 and 1. The + quantifier makes this an obnoxious regex to match a binary number. If you want to do that, [01]+ is all you need. Note that in some engines, such as .NET and Python 3, \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".
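
The Unicode caveat is easy to see in Python 3, where \w and \d are Unicode-aware for str patterns unless the re.ASCII flag is set. The following is a small sketch under that assumption; the sample strings are invented for the demonstration.

```python
import re

sample = "snake_case_42 é"

# With re.ASCII, [^\W_] is limited to ASCII letters and digits.
alnum_ascii = re.compile(r"[^\W_]+", re.ASCII).findall(sample)
# Without the flag, accented letters qualify as well.
alnum_unicode = re.compile(r"[^\W_]+").findall(sample)
print(alnum_ascii, alnum_unicode)

binary_ascii = re.compile(r"[^\D2-9]+", re.ASCII)
binary_unicode = re.compile(r"[^\D2-9]+")
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

print(bool(binary_ascii.fullmatch("1011")))
# In Unicode mode, a non-ASCII digit slips through the "binary" class.
print(bool(binary_unicode.fullmatch(arabic_three)))
print(bool(binary_ascii.fullmatch(arabic_three)))
```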

Strange or Beautiful Ranges

Square Brackets
This will work in .NET, Perl, PCRE and Python.
[][]
The crazy thing is that there is a lot of variation among engines as to which brackets need to be escaped. While [\]\[] will work everywhere, in JavaScript you can use [[\]], and in Java you can use []\[].
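
Here is a minimal check in Python, one of the engines listed above, showing that the compact form and the fully escaped form find the same brackets. The sample text is made up for the example.

```python
import re

text = "a[0] = b[1]"

# A ] placed first in the class is taken literally, so no escape is needed.
compact = re.compile(r"[][]").findall(text)
# The fully escaped form, which should be portable across engines.
escaped = re.compile(r"[\]\[]").findall(text)

print(compact)
print(compact == escaped)
```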

Words you can Type with your Left Hand
(But you'll need a QWERTY keyboard.)
(?i)\b[a-gq-tv-xz]+\b
Words you can Type with your Right Hand (QWERTY keyboard)
(?i)\b[h-puy]+\b
Words that only use Letters from the Top Row (QWERTY keyboard)
(?i)\b[eio-rtuwy]+\b
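
A short sketch of the top-row regex in action, using Python's re module (the sample sentence is invented). The word boundaries on both sides are what prevent partial matches inside words such as "on" or "keyboard".

```python
import re

top_row = re.compile(r"(?i)\b[eio-rtuwy]+\b")

sentence = "Try your typewriter on a keyboard, Peter"
top_row_words = top_row.findall(sentence)
print(top_row_words)
```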

Line-Break-Related

Any Character Including Line Breaks
These are ways to replicate the behavior of the dot in DOTALL mode (by default, the dot does not match line breaks): [\S\s] or [\D\d] or [\w\W]. Note that in each of these classes, I have tried to place in first position the token that has the greatest chance of matching first (which of course would depend on the target text).
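
For illustration, here is a small Python comparison of the default dot, the [\S\s] class, and the dot in DOTALL mode. The test string is an assumption made for the example.

```python
import re

text = "first line\nsecond line"

# By default the dot stops at the line break, so this search fails.
dot_default = re.search(r"first.*second", text)
# The class matches any character, line breaks included.
any_char_class = re.search(r"first[\S\s]*second", text)
# Same result as turning on DOTALL mode.
dot_dotall = re.search(r"first.*second", text, re.DOTALL)

print(dot_default)
print(any_char_class.group() == dot_dotall.group())
```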

Any White-Space Character Except the Newline Character
You may not have a use for this, but it's an interesting class making use of double negation. We're negating \S, so that's the same as all white-space characters \s. Adding \n to the negated class then removes the newline from that set.
[^\S\n]
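
One possible use, sketched in Python with an invented input string: collapsing runs of horizontal white-space while leaving the line structure alone. Note that the carriage return is still part of the class, so it gets replaced too.

```python
import re

horizontal_ws = re.compile(r"[^\S\n]+")

text = "a \t b\nc\r\nd"
# Each run of white-space other than \n becomes a single space.
collapsed = horizontal_ws.sub(" ", text)
print(repr(collapsed))
```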
Alternative to [\r\n] for Java and Ruby 2+
(?![ \t\cK\f])\s
This rather pointless regex (except as a learning device) relies on the fact that in these engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab or a form feed: the negative lookahead removes all of those characters except the newline and carriage return.

Language-Related

French Letters
[a-zA-ZàâäôéèëêïîçùûüÿæœÀÂÄÔÉÈËÊÏÎÇÙÛÜÆŒ]
German Letters
The controversial capital letter for ß, now included in Unicode, is missing in many fonts, so it might show on your screen as a question mark.
[a-zA-ZäöüßÄÖÜẞ]
Polish Letters
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Note that there is no Q, V and X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ].

Italian Letters
[a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]
Spanish Letters
[a-zA-ZáéíñóúüÁÉÍÑÓÚÜ]
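
As one way to put these classes to work, the sketch below uses the Spanish class in Python to pull words out of a noisy string. The sample text is invented, and the NFC normalization step is my own precaution: the class lists precomposed letters, so input that arrives with combining accents would otherwise be split at each accent.

```python
import re
import unicodedata

spanish_word = re.compile(r"[a-zA-ZáéíñóúüÁÉÍÑÓÚÜ]+")

raw = "¡Qué día tan bueno en España! #mañana 100%"
# Normalize to precomposed characters so accented letters match the class.
normalized = unicodedata.normalize("NFC", raw)
spanish_words = spanish_word.findall(normalized)
print(spanish_words)
```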


Smiles,

Rex

cesar – ali_Escobar2003@yahoo.com

August 26, 2019 - 12:15

Subject: Spanish


Thx!! I couldnt figure out how to keep my spanish characters while cleaning up some tweets.

Rex

August 26, 2019 - 12:18

Subject: RE: Spanish


Hello Cesar, I'm delighted to hear that you were able to solve your problem. Wishing you a wonderful day, -Rex
