Python Regex Special Sequences and Character classes (2024)

In this article, we will see how to use regex special sequences and character classes in Python. Python regex special sequence represents some special characters to enhance the capability of a regulars expression.

Table of contents

  • Special sequence
  • Character classes
  • Special Sequence \A and \Z
  • Special sequence \dand \D
  • Special Sequence \wand \W
  • Special Sequence \sand \S
  • Special Sequence \band \B
  • Create custom character classes
    • Simple character classes
    • Use negation to construct character classes
    • Use ranges to construct character classes

Special sequence

The special sequence represents the basic predefined character classes, which have a unique meaning. Each special sequence makes specific common patterns more comfortable to use.

For example, you can use \d sequence as a simplified definition for character class [0-9], which means match any digit from 0 to 9.

Let’s see the list of regex special sequences and their meaning. The special sequences consist of '\' (backlash) and a character from the table below.

Special SequenceMeaning
\AMatches pattern only at the start of the string
\ZMatches pattern only at the end of the string
\dMatches to any digit.
Short for character classes[0-9]
\DMatches to any non-digit.
short for[^0-9]
\sMatches any whitespace character.
short for character class [ \t\n\x0b\r\f]
\SMatches any non-whitespace character.
short for [^ \t\n\x0b\r\f]
\wMatches any alphanumeric character.
short for character class [a-zA-Z_0-9]
\WMatches any non-alphanumeric character.
short for [^a-zA-Z_0-9]
\bMatches the empty string, but only at the beginning or end of a word. Matches a word boundary where a word character is [a-zA-Z0-9_].
For example, ‘\bJessa\b' matches ‘Jessa’, ‘Jessa.’, ‘(Jessa)’, ‘Jessa Emma Kelly’ but not ‘JessaKelly’ or ‘Jessa5’.
\BOpposite of a \b. Matches the empty string, but only when it is not at the beginning or end of a word

Character classes

In Python, regex character classes are sets of characters or ranges of characters enclosed by square brackets [].

For example, [a-z] it means match any lowercase letter from a to z.

Let’s see some of the most common character classes used inside regular expression patterns.

Character ClassDescription
[abc]Match the letter a or b or c
[abc][pq]Match letter a or b or c followed by either p or q.
[^abc]Match any letter except a, b, or c (negation)
[0-9]Match any digit from 0 to 9. inclusive (range)
[a-z]Match any lowercase letters from a to z. inclusive (range)
[A-Z]Match any UPPERCASE letters from A to Z. inclusive (range)
[a-zA-z]Match any lowercase or UPPERCASE letter. inclusive (range)
[m-p2-8]Ranges: matches a letter between m and p and digits from 2 to 8, but not p2
[a-zA-Z0-9_]Match any alphanumeric character

Now Let’s see how to use each special sequence and character classes in Python regular expression.

Special Sequence \A and \Z

Backslash A ( \A )

The \A sequences only match the beginning of the string. It works the same as the caret (^) metacharacter.

On the other hand, if we do have a multi-line string, then \A will still match only at the beginning of the string, while the caret will match at the beginning of each new line of the string.

Backslash Z ( \Z ) sequences only match the end of the string. It works the same as the dollar ($) metacharacter.

Example

import retarget_str = "Jessa is a Python developer, and her salary is 8000"# \A to match at the start of a string# match word starts with capital letterresult = re.findall(r"\A([A-Z].*?)\s", target_str)print("Matching value", result)# Output ['Jessa']# \Z to match at the end of a string# match number at the end of the stringresult = re.findall(r"\d.*?\Z", target_str)print("Matching value", result)# Output ['8000']

Special sequence \dand \D

Backslash d ( \d )

  • The \d matches any digits from 0 to 9 inside the target string.
  • This special sequence is equivalent to character class [0-9] .
  • Use either \d or [0-9].

Backslash capital D ( \D )

  • This sequence is the exact opposite of \d, and it matches any non-digit character.
  • Any character in the target string that is not a digit would be the equivalent of the \D.
  • Also, you can write \D using character class [^0-9] (caret ^ at the beginning of the character class denotes negation).

Example

Now let’s do the followings

  1. Use a special sequence \d inside a regex pattern to find a 4-digit number in our target string.
  2. Use a special sequence \D inside a regex pattern to find all the non-digit characters.
import retarget_str = "8000 dollar"# \d to match all digitsresult = re.findall(r"\d", target_str)print(result)# Output ['8', '0', '0', '0']# \d to match all numbersresult = re.findall(r"\d+", target_str)print(result)# Output ['8000']# \D to match non-digitsresult = re.findall(r"\D", target_str)print(result)# Output [' ', 'd', 'o', 'l', 'l', 'a', 'r']

Special Sequence \wand \W

Backslash w ( \w )

  • The \w matches any alphanumeric character, also called a word character.
  • This includes lowercase and uppercase letters, the digits 0 to 9, and the underscore character.
  • Equivalent to character class [a-zA-z0-9_].
  • You can use either \wor [a-zA-z0-9_].

Backslash capital W ( \W )

  • This sequence is the exact opposite of \w, i.e., It matches any NON-alphanumeric character.
  • Any character in the target string that is not alphanumeric would be the equivalent of the \W.
  • You can write \W using character class [^a-zA-z0-9_] .

Example

Now let’s do the followings

  1. Use a special sequence \w inside a regex pattern to find all alphanumeric character in the string
  2. Use a special sequence \W inside a regex pattern to find all the non-alphanumeric characters.
import retarget_str = "Jessa and Kelly!!"# \w to match all alphanumeric charactersresult = re.findall(r"\w", target_str)print(result)# Output ['J', 'e', 's', 's', 'a', 'a', 'n', 'd', 'K', 'e', 'l', 'l', 'y']# \w{5} to 5-letter wordresult = re.findall(r"\w{5}", target_str)print(result)# Output ['Jessa', 'Kelly']# \W to match NON-alphanumericresult = re.findall(r"\W", target_str)print(result)# Output [' ', ' ', '!', '!']

Special Sequence \sand \S

Backslash lowercase s ( \s )

The \s matches any whitespace character inside the target string. Whitespace characters covered by this sequence are as follows

  • common space generated by the space key from the keyboard. (" ")
  • Tab character (\t)
  • Newline character (\n)
  • Carriage return (\r)
  • form feed (\f)
  • Vertical tab (\v)

Also, this special sequence is equivalent to character class [ \t\n\x0b\r\f] . So you can use either \s or [ \t\n\x0b\r\f].

Backslash capital S ( \S )

This sequence is the exact opposite of \s, and it matches any NON-whitespace characters. Any character in the target string that is not whitespace would be the equivalent of the \S.

Also, you can write \S using character class [^ \t\n\x0b\r\f] .

Example

Now let’s do the followings

  1. Use a special sequence \s inside a regex pattern to find all whitespace character in our target string
  2. Use a special sequence \S inside a regex pattern to find all the NON-whitespace character
import retarget_str = "Jessa \t \n "# \s to match any whitespaceresult = re.findall(r"\s", target_str)print(result)# Output [' ', ' ', '\t', ' ', '\n', ' ', ' ']# \S to match non-whitespaceresult = re.findall(r"\S", target_str)print(result)# Output ['J', 'e', 's', 's', 'a']# split on white-spacesresult = re.split(r"\s+", "Jessa and Kelly")print(result)# Output ['Jessa', 'and', 'Kelly']# remove all multiple white-spaces with single spaceresult = re.sub(r"\s+", " ", "Jessa and \t \t Kelly ")print(result)# Output 'Jessa and Kelly '

Special Sequence \band \B

Backslash lowercase b ( \b )

The \b special sequence matches the empty strings bordering the word. The backslash \b is used in regular expression patterns to signal word boundaries, or in other words, the borders or edges of a word.

Note: A word is a set of alphanumeric characters surrounded by non-alphanumeric characters (such as space).

Example

Let’s try to match all 6-letter word using a special sequence \w and \b

import retarget_str = " Jessa salary is 8000$ She is Python developer"# \b to word boundary# \w{6} to match six-letter wordresult = re.findall(r"\b\w{6}\b", target_str)print(result)# Output ['salary', 'Python']# \b need separate word not part of a wordresult = re.findall(r"\bthon\b", target_str)print(result)# Output []

Note:

One essential thing to keep in mind here is that the match will be made only for the complete and separate word itself. No match will be returned if the word is contained inside another word.

For instance, considering the same target string, we can search for the word “ssa” using a \b special sequence like this "\bssa\b". But we will not get a match because non-alphanumeric characters do not border it on both sides.

Moreover, the \b sequence always matches the empty string or boundary between an alphanumeric character and a non-alphanumeric character.

Therefore keep in mind that the word you’re trying to match with the help of the \b special sequence should be separate, not part of a word.

Backslash capital B ( \B )

This sequence is the exact opposite of \b.

On the other hand, the special sequence \B matches the empty string or the border between two alphanumeric characters or two non-alphanumeric characters only when it is not at the beginning or at the end of a word.

So this sequence can be useful for matching and locating some strings in a specific word.

For example, let’s use \B to check whether the string ‘thon‘ is inside the target string but not at the beginning of a word. So ‘thon‘ should be part of a larger word in our string, but not at the beginning of the word.

Example

import retarget_str = "Jessa salary is 8000$ She is Python developer"# \Bresult = re.findall(r"\Bthon", target_str)print(result)# Output ['thon']

And indeed, we have a match of "thon" inside the word “Python” not being at the beginning of the word. What if we want to check that "thon" is part of a word in the target string but not at the end of that word.

Well, we have to move the \B sequence at the end of the pattern. Let’s try this also.

result = re.findall(r"thon\B", target_str)

Create custom character classes

We can construct the character classes using the following ways

  1. Simple Classes
  2. Negation
  3. ranges

Simple character classes

The most basic form of a character class is to place a set of characters side-by-side within square brackets.

For example, the regular expression [phl]ot will match the words “pot”, “hot”, or “lot” because it defines a character class accepting either ‘p’, ‘h’, or ‘l’ as its first character followed by ‘ot’.

Let’s see the Python example of how to use simple character classes in the regular expression pattern.

import retarget_string = "Jessa loves Python. and her salary is 8000$"# simple character Class [jds]# Match the letter J or d or eresult = re.findall(r"[Jde]", target_string)print(result)# Output ['J', 'e', 'e', 'd', 'e']# simple character Class [0-9]# Match any digitresult = re.findall(r"[0-9]", target_string)print(result)# Output ['8', '0', '0', '0']# character Class [abc][pq]# Match Match p or y or t followed by either h or s.result = re.findall(r"[Pyt][hs]", target_string)print(result)# Output ['th']

Use negation to construct character classes

To match all characters except those listed inside a square bracket, insert the "^" metacharacter at the character class’s beginning. This technique is known asnegation.

  1. [^abc] matches any character except a, b, or c
  2. [^0-9] matches any character except digits

Example:

import retarget_string = "abcde25"result = re.findall(r"[^abc]", target_string)print(result)# Output ['d', 'e', '2', '5']# match any character except digitsresult = re.findall(r"[^0-9]", target_string)print(result)# Output ['a', 'b', 'c', 'd', 'e']

Use ranges to construct character classes

Sometimes you’ll want to define a character class that includes a range of values, such as the letters “m through p” or the numbers “2 through 6“. To specify a range, simply insert the "-" metacharacter between the first and last character to be matched, such as [m-p] or [2-6].

Let’s see how to use ranges to construct regex character classes.

  • [a-z] matches any lowercase letters from a to z
  • [A-Z] matches any UPPERCASE letters from A to Z
  • [2-6] matches any digit from 2 to 6

You can also place different ranges beside each other within the class to further increase the match possibilities.

For example, [a-zA-Z] will match any letter of the alphabet: a to z (lowercase) or A to Z (uppercase).

Example:

import retarget_string = "ABCDefg29"print(re.findall(r"[a-z]", target_string))# Output ['e', 'f', 'g']print(re.findall(r"[A-Z]", target_string))# Output ['A', 'B', 'C', 'D']print(re.findall(r"[a-zA-Z]", target_string))# Output ['A', 'B', 'C', 'D', 'e', 'f', 'g']print(re.findall(r"[2-6]", target_string))# Output ['2']print(re.findall(r"[A-C2-8]", target_string))# Output ['A', 'B', 'C', '2']

Previous:

Python Regex Metacharacters

Next:

Python Regex Flags

Python Regex Special Sequences and Character classes (2024)
Top Articles
Latest Posts
Article information

Author: Aracelis Kilback

Last Updated:

Views: 5858

Rating: 4.3 / 5 (44 voted)

Reviews: 83% of readers found this page helpful

Author information

Name: Aracelis Kilback

Birthday: 1994-11-22

Address: Apt. 895 30151 Green Plain, Lake Mariela, RI 98141

Phone: +5992291857476

Job: Legal Officer

Hobby: LARPing, role-playing games, Slacklining, Reading, Inline skating, Brazilian jiu-jitsu, Dance

Introduction: My name is Aracelis Kilback, I am a nice, gentle, agreeable, joyous, attractive, combative, gifted person who loves writing and wants to share my knowledge and understanding with you.