Interesting Regex Character Classes (2024)



My goal with this page is to assemble a collection of interesting (and potentially useful) regex character classes. I will try to organize the collection into themes.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

How do these Character Classes Work?
Useful ASCII Ranges
Obnoxious Ranges
Strange or Beautiful Ranges
Line-Break-Related
Language-Related

How do these Character Classes Work?

Before we start, I want to make sure you don't feel confused when you stumble on something like [!-~]. Remember that the hyphen defines a range between two characters in the ASCII table (or between two Unicode code points, depending on the engine). But a range does not have to look like [a-z]: any two characters can bound a range, as long as the first one comes earlier in the table.

If you consult the ASCII table, you will see that [!-~] is a valid range—and a useful one too.

Sometimes, instead of a straight character class, you'll see something like (?![aeiou])[a-z]. The first part is a negative lookahead that asserts that the following character is not one of those in a given class. This is a way to perform character class subtraction in regex engines that don't support that operation (and that's most of them). In this example, the resulting character class is that of English lower-case consonants, since we have removed the vowels [aeiou] from the range of letters [a-z]. You may, by the way, notice that the letter a appears in both classes: we could just as well have written (?![eiou])[b-z].
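The subtraction idiom is easy to check for yourself. Here is a minimal sketch in Python using the standard re module (the variable names are mine, purely for illustration). It compares the lookahead version with the explicit consonant class listed further down the page.

```python
import re
import string

# Lookahead-based "subtraction": any lowercase letter that is not a vowel.
subtracted = re.compile(r"(?![aeiou])[a-z]")
# The same set written out as explicit ranges.
explicit = re.compile(r"[b-df-hj-np-tv-z]")

consonants = [c for c in string.ascii_lowercase if subtracted.fullmatch(c)]
same_set = all(
    bool(subtracted.fullmatch(c)) == bool(explicit.fullmatch(c))
    for c in string.ascii_lowercase
)

print("".join(consonants))  # bcdfghjklmnpqrstvwxyz
print(same_set)             # True
```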

Useful ASCII Ranges

All Printable Characters in the ASCII Table
[ -~]
All Printable Characters in the ASCII Table—Except the Space Character
[!-~]
All "Special Characters" in the ASCII Table
(?![a-zA-Z0-9])[!-~]
All "Special Characters" in the ASCII Table—Without Using Lookahead
[!-/:-@\[-`{-~]
All Latin and Accented Characters
(?i)(?:(?![×Þß÷þø])[a-zÀ-ÿ])
All English Consonants
[b-df-hj-np-tv-z]
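
If you'd rather not take the ASCII table on faith, the ranges above can be tested against all 128 ASCII code points. This is only a sketch in Python (names are illustrative). It counts the printable characters and checks that the two "special characters" classes select the same set.

```python
import re

ascii_chars = [chr(i) for i in range(128)]

printable = re.compile(r"[ -~]")
with_lookahead = re.compile(r"(?![a-zA-Z0-9])[!-~]")
without_lookahead = re.compile(r"[!-/:-@\[-`{-~]")

printable_count = sum(1 for c in ascii_chars if printable.fullmatch(c))
specials_a = [c for c in ascii_chars if with_lookahead.fullmatch(c)]
specials_b = [c for c in ascii_chars if without_lookahead.fullmatch(c)]

print(printable_count)           # 95
print(specials_a == specials_b)  # True
print("".join(specials_a))       # the 32 ASCII punctuation characters
```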

Obnoxious Ranges

Alphanumeric Characters
[^\W_]
This is an interesting class for engines that don't support the POSIX [[:alnum:]]. It makes use of the fact that \w is very close to what we want. [^\W] is a double negation that matches the same as \w. By adding _ to the negated class, we are left with ASCII letters and digits. Watch out, though: in Python 3 and .NET, \w also matches any Unicode letter or digit by default. But frankly... Just use [a-zA-Z0-9]. See also Any White-Space Character Except the Newline Character.

Binary Number
[^\D2-9]+
This is the same idea as the regex above to match alphanumeric characters. In most engines, the character class only matches the digits 0 and 1. The + quantifier makes this an obnoxious regex to match a binary number; if you want to do that, [01]+ is all you need. Note that in some engines, such as .NET and Python 3, \d matches any digit in any script, so the meaning in those engines would be "any digit in any script, except ASCII digits 2 through 9".
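
The Unicode caveat is easy to see in Python 3. The sketch below (sample strings and names are my own) runs both obnoxious classes in the default Unicode mode and again with re.ASCII, which restricts \w and \d to their ASCII meanings.

```python
import re

# "\u00e9" is e-acute; "\u0663" is the Arabic-Indic digit three.
sample = "a_9\u00e9\u0663"

unicode_alnum = re.findall(r"[^\W_]", sample)
ascii_alnum = re.findall(r"[^\W_]", sample, flags=re.ASCII)

digits = "1012 \u0663"
unicode_binary = re.findall(r"[^\D2-9]+", digits)
ascii_binary = re.findall(r"[^\D2-9]+", digits, flags=re.ASCII)

print(unicode_alnum)   # ['a', '9', 'é', '٣']
print(ascii_alnum)     # ['a', '9']
print(unicode_binary)  # ['101', '٣']
print(ascii_binary)    # ['101']
```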

Strange or Beautiful Ranges

Square Brackets
This will work in .NET, Perl, PCRE and Python.
[][]
The crazy thing is that there is a lot of variation among engines as to which brackets need to be escaped. While [\]\[] will work everywhere, in JavaScript you can use [[\]], and in Java you can use []\[].
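
Here is a quick Python check of the unescaped form next to the fully escaped one. It is a sketch only, and the sample text is made up.

```python
import re

text = "items[0] = values[i]"

compact = re.findall(r"[][]", text)    # unescaped form
escaped = re.findall(r"[\]\[]", text)  # fully escaped, portable form

print(compact)             # ['[', ']', '[', ']']
print(compact == escaped)  # True
```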

Words you can Type with your Left Hand
(But you'll need a QWERTY keyboard and standard touch-typing fingering.)
(?i)\b[a-gq-tv-xz]+\b
Words you can Type with your Right Hand (QWERTY keyboard)
(?i)\b[uh-py]+\b
Words that only use Letters from the Top Row (QWERTY keyboard)
(?i)\b[eio-rtuwy]+\b
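
To try the top-row class on real text, a short Python sketch such as this one will do (the sentence is invented for the occasion).

```python
import re

top_row = re.compile(r"(?i)\b[eio-rtuwy]+\b")

text = "The typewriter proprietor wrote poetry, not regex."
words = top_row.findall(text)

print(words)  # ['typewriter', 'proprietor', 'wrote', 'poetry']
```

The word boundaries matter: without them, the class would happily match fragments such as the "re" in "regex".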

Line-Break-Related

Any Character Including Line Breaks
These are ways to replicate the behavior of the dot in DOTALL mode (by default, the dot does not match line breaks): [\S\s] or [\D\d] or [\w\W]. Note that in each of these classes, I have tried to place in first position the token that has the greatest chance of matching first (which of course would depend on the target text).
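
As a small illustration in Python (a sketch, with a made-up two-line string), compare the plain dot, the dot in DOTALL mode, and the class-based workaround.

```python
import re

text = "first line\nsecond line"

dot_default = re.match(r".+", text).group()            # stops at the line break
dot_dotall = re.match(r".+", text, re.DOTALL).group()  # crosses the line break
class_based = re.match(r"[\S\s]+", text).group()       # crosses it without the flag

print(repr(dot_default))          # 'first line'
print(dot_dotall == text)         # True
print(class_based == dot_dotall)  # True
```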

Any White-Space Character Except the Newline Character
You may not have a use for this, but it's an interesting class making use of double negation. We're negating \S, so that's the same as all white-space characters \s. But because \n also sits inside the negated class, the newline is removed from the set.
[^\S\n]
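
One possible use, sketched in Python: collapsing runs of spaces and tabs while leaving the line structure alone. The sample text and variable names are hypothetical.

```python
import re

text = "name:\t  Rex \n  role:\t\tauthor"

# Replace each run of non-newline whitespace with a single space.
collapsed = re.sub(r"[^\S\n]+", " ", text)

print(repr(collapsed))  # 'name: Rex \n role: author'
```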
Alternative to [\r\n] for Java and Ruby 2+
(?![ \t\cK\f])\s
This rather pointless regex (except as a learning device) relies on the fact that in these engines \s matches an ASCII space, a tab, a line feed, a carriage return, a vertical tab or a form feed. The negative lookahead removes all of those characters except the line feed and the carriage return.
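
The same idea can be tried in Python, with two adjustments that are my assumptions and not part of the recipe above. Python's re module does not accept \cK, so the vertical tab is written \x0b, and the re.ASCII flag is needed so that \s covers only the six ASCII white-space characters.

```python
import re

# Adapted for Python: \x0b stands in for \cK (vertical tab), and re.ASCII
# limits \s to space, tab, line feed, carriage return, form feed, vertical tab.
line_breaks = re.compile(r"(?![ \t\x0b\f])\s", re.ASCII)

matched = [c for c in map(chr, range(128)) if line_breaks.fullmatch(c)]

print(matched)  # ['\n', '\r']
```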

Language-Related

French Letters
[a-zA-ZàâäôéèëêïîçùûüÿæœÀÂÄÔÉÈËÊÏÎÇÙÛÜÆŒ]
German Letters
The controversial capital letter for ß, now included in Unicode, is missing in many fonts, so it might show on your screen as a question mark.
[a-zA-ZäöüßÄÖÜẞ]
Polish Letters
[a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]
Note that there is no Q, V and X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ].

Italian Letters
[a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]
Spanish Letters
[a-zA-ZáéíñóúüÁÉÍÑÓÚÜ]
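
These classes are handy for cleaning text while keeping accented letters. Below is a sketch in Python built on the Spanish class. The function name and sample sentence are hypothetical, and it assumes the input uses precomposed characters (NFC form), since a letter followed by a separate combining accent would lose the accent.

```python
import re

SPANISH_LETTERS = "a-zA-ZáéíñóúüÁÉÍÑÓÚÜ"
# Negated class: anything that is neither a Spanish letter nor a space.
not_spanish_or_space = re.compile(f"[^{SPANISH_LETTERS} ]+")

def keep_spanish_letters(text: str) -> str:
    """Drop every character that is not a Spanish letter or a space."""
    return not_spanish_or_space.sub("", text)

cleaned = keep_spanish_letters("¡Qué día! Mañana, pingüino #regex")
print(cleaned)  # Qué día Mañana pingüino regex
```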


Smiles,

Rex

cesar – ali_Escobar2003@yahoo.com

August 26, 2019 - 12:15

Subject: Spanish


Thx!! I couldnt figure out how to keep my spanish characters while cleaning up some tweets.

Rex

August 26, 2019 - 12:18

Subject: RE: Spanish


Hi Cesar, I'm delighted to hear that you were able to solve your problem. Wishing you a wonderful day, -Rex

FAQs

What are the special character classes in regex? ›

In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket ], the backslash \, the caret ^, and the hyphen -. The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.

Why is regex so complicated? ›

Overloaded syntax and context

Regular expressions use a small set of symbols, and so some of these symbols do double duty. For example, symbols take on different meanings inside and outside of character classes. Extensions to the basic syntax are worse.

Is regex a valuable skill? ›

It provides a versatile way to find and manipulate patterns within strings, making it an essential tool for tasks like data validation, text processing, and data extraction. Knowing how to use regex effectively can greatly enhance your programming skills and streamline your development process.

What does the plus character [+] do in regex? ›

The plus ( + ) is a quantifier that matches one or more occurrences of the preceding element. The plus is similar to the asterisk ( * ) in that many occurrences are acceptable, but unlike the asterisk in that at least one occurrence is required.
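
The difference shows up on the empty string. This is a minimal Python sketch with illustrative names.

```python
import re

plus_on_empty = re.fullmatch(r"a+", "")  # None: + needs at least one "a"
star_on_empty = re.fullmatch(r"a*", "")  # a match: * accepts zero occurrences
plus_run = re.fullmatch(r"a+", "aaa")    # both quantifiers accept repeated "a"

print(plus_on_empty)     # None
print(star_on_empty is not None)  # True
print(plus_run.group())  # aaa
```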

How to find special characters in regex? ›

^(\w+) matches the beginning of a line and then one or more word characters. Square brackets [ ] enclose an expression that matches a set of characters. [\s\d] matches a whitespace character or a digit. [a-z0-9] matches “a” through “z” and the digits “0” through “9”.

How do you allow all special characters in regex? ›

Special characters allowed in regular expressions
*  Matches zero or more occurrences of the preceding element
[acd]  Matches character a, c, or d (case-sensitive)
[^acd]  Matches any character except a, c, or d (case-sensitive)
[a-z]  Matches any character between a and z (lower case letter)
[^0-9]  Matches any character not between 0 and 9 (not a digit)

Is regex outdated? ›

No, it's a tool that has its applications. Most of the time, using a regex is shorter than writing the matching code explicitly. Some people consider regexes difficult to understand, but it is just a matter of getting used to them. In Perl, for example, regexes are used quite often.

When should you not use regex? ›

When Not to Use Regular Expressions?
  1. When Working With HTML or XML.
  2. When a Simple Search Works Well.
  3. When in an Adversarial Context.
  4. When Boolean RegExes Evaluate to False.

Is regex lazy or greedy? ›

Regex quantifiers are greedy by default. A greedy search tries to match the longest possible string. For instance, a greedy pattern meant to match a single HTML tag will match the whole string <h1>Hello World</h1>, because the engine looks for the longest match. Adding ? after a quantifier makes it lazy, so it stops at the shortest possible match.
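
To make the greedy behavior concrete, here is a sketch in Python. The pattern <.+> is my own example, chosen to illustrate the point, and the lazy variant simply adds ? after the quantifier.

```python
import re

html = "<h1>Hello World</h1>"

greedy = re.search(r"<.+>", html).group()  # longest possible match
lazy = re.search(r"<.+?>", html).group()   # shortest possible match

print(greedy)  # <h1>Hello World</h1>
print(lazy)    # <h1>
```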

What's better than regex? ›

For a plain substring check, a method such as contains(String) is much faster than explicitly using a regex.

How long does it take to learn regex? ›

There's a lot to regex if you want to learn it deeply. But you can learn enough to be useful in just a few minutes: most characters are literal.

What are the disadvantages of regex? ›

Regexes can be hard to understand and maintain, especially for complex or long patterns. They can also be prone to errors and bugs, such as typos, syntax errors, or unintended matches. Another disadvantage is that some patterns are inefficient and scale poorly on large inputs.

What is the regex for 1000 to 9999? ›

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
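
The four-digit pattern can be checked against a few boundary cases. The sketch below uses Python, and the sample numbers are arbitrary.

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")

text = "999 1000 4567 9999 10000 0123"
found = four_digits.findall(text)

print(found)  # ['1000', '4567', '9999']
```

The word boundaries keep the pattern from matching inside 10000, and the leading [1-9] rejects 0123.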

What does pipe mean in regex? ›

A pipe character ( | ) is used in regular expressions to specify an OR condition. For example, A or B is expressed as A | B.

What does star mean in regex? ›

The asterisk ( * ) is a quantifier that applies to the preceding regular expression element. It specifies that the preceding element may occur zero or more times.

What is the list of special characters? ›

Alphanumeric, national, and special characters
  • ampersand &
  • asterisk *
  • blank.
  • braces { }
  • brackets [ ]
  • comma ,
  • equal sign =
  • hyphen -

What is \q and \e in regex? ›

\Q and \E are respectively the start and end of a literal string in a regex; they instruct the regex engine not to interpret the text in between those two "markers" as regex syntax. For instance, in order to match two stars, you could have this in your regex: \Q**\E.
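
Python's re module does not understand \Q and \E. The closest equivalent there is re.escape, which backslash-escapes every metacharacter in a string. This is a sketch of matching two literal stars that way.

```python
import re

literal = "**"
# re.escape returns the string with regex metacharacters escaped.
pattern = re.compile(re.escape(literal))

match = pattern.search("2 ** 3")
print(match.group())  # **
print(match.start())  # 2
```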

What are the special characters in regex square brackets? ›

In regular expressions, square brackets ( [ ] ) are characters that have a special meaning of their own. Since they have a special meaning, you can call them "metacharacter". Square brackets help you to define a character set, which is a set of characters you want to match.

What are the special characters in UTF encoding? ›

UTF-8 represents ASCII invariant characters a-z, A-Z, 0-9, and certain special characters such as ' @ , . + - = / * ( ) the same way that they are represented in ASCII. UTF-16 represents these characters as NX'00nn', where X'nn' is the representation of the character in ASCII.

Article information

Author: Jonah Leffler
