Python Regular Expressions

A regular expression is a special sequence of characters that lets you easily check whether a string matches a pattern.

The re module has been part of Python since version 1.5 and provides Perl-style regular expression patterns.

The re module gives the Python language full regular expression functionality.

The compile function generates a regular expression object from a pattern string and optional flags. This object has a set of methods for regular expression matching and substitution.

The re module also provides module-level functions with the same functionality as these methods; they take a pattern string as their first argument.
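
The two styles described above can be compared side by side. The following is a minimal sketch; the pattern and the variable names are made up for illustration and are not part of the tutorial's own examples.

```python
import re

# Compile once, then reuse the pattern object's methods.
word_are = re.compile(r'(\w+) are (\w+)', re.I)
compiled_groups = word_are.search("Cats are smarter than dogs").groups()

# The module-level function takes the pattern string as its first argument.
module_groups = re.search(r'(\w+) are (\w+)',
                          "Cats are smarter than dogs", re.I).groups()

# Substitution is available on the compiled object as well.
replaced = word_are.sub(r'\2 are \1', "Cats are smarter than dogs")

print(compiled_groups)   # ('Cats', 'smarter')
print(module_groups)     # ('Cats', 'smarter')
print(replaced)          # smarter are Cats than dogs
```

Compiling is mainly a convenience when the same pattern is used many times; both styles give the same results.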

This section introduces the most commonly used Python regular expression functions.

re.match function

re.match tries to match a pattern at the start of the string. If the pattern does not match at the starting position, match() returns None.

Function syntax:

re.match(pattern, string, flags=0)

Function parameters:

Parameter	Description
pattern	The regular expression to match.
string	The string to match against.
flags	Flags that control how the regular expression is matched, such as case sensitivity or multi-line matching.

On a successful match, re.match returns a match object; otherwise it returns None.

We can use the match object's group(num) or groups() methods to get the matched text.

Match object method	Description
group(num=0)	Returns the string matched by the entire expression. group() also accepts several group numbers at once, in which case it returns a tuple with the values of those groups.
groups()	Returns a tuple containing all groups, from 1 up to however many groups the pattern contains.
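
As a small illustration of the two methods in the table, the sketch below uses a made-up three-group pattern; the variable names are only for demonstration.

```python
import re

m = re.match(r'(\w+) (\w+) (\w+)', "Cats are smarter than dogs")

whole = m.group()         # same as m.group(0): the entire matched text
first = m.group(1)        # text of the first group
pair = m.group(1, 3)      # several group numbers give a tuple
everything = m.groups()   # all groups, from 1 up to the last one

print(whole)        # Cats are smarter
print(first)        # Cats
print(pair)         # ('Cats', 'smarter')
print(everything)   # ('Cats', 'are', 'smarter')
```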

Example 1:

#!/usr/bin/python3
# -*- coding: UTF-8 -*-

import re
print(re.match('www', 'www.w3big.com').span())  # matches at the starting position
print(re.match('com', 'www.w3big.com'))         # does not match at the starting position

Running the above example produces:

(0, 3)
None

Example 2:

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

The results of the above example are as follows:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

re.search method

re.search scans the whole string and returns the first successful match.

Function syntax:

re.search(pattern, string, flags=0)

Function parameters:

Parameter	Description
pattern	The regular expression to match.
string	The string to search.
flags	Flags that control how the regular expression is matched, such as case sensitivity or multi-line matching.

On a successful match, re.search returns a match object; otherwise it returns None.

We can use the match object's group(num) or groups() methods to get the matched text.

Match object method	Description
group(num=0)	Returns the string matched by the entire expression. group() also accepts several group numbers at once, in which case it returns a tuple with the values of those groups.
groups()	Returns a tuple containing all groups, from 1 up to however many groups the pattern contains.

Example 1:

#!/usr/bin/python3
# -*- coding: UTF-8 -*-

import re
print(re.search('www', 'www.w3big.com').span())  # matches at the starting position
print(re.search('com', 'www.w3big.com').span())  # matches away from the starting position

Running the above example produces:

(0, 3)
(10, 13)

Example 2:

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M|re.I)

if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")

The results of the above example are as follows:

searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter

The difference between re.match and re.search

re.match matches only at the beginning of the string: if the beginning of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds a match.

Example:

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

matchObj = re.search(r'dogs', line, re.M|re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

The results of the above example are as follows:

No match!!
search --> matchObj.group() :  dogs
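
One detail worth sketching: the re.M flag does not change where re.match starts. In the hypothetical two-line string below (made up for illustration), re.match still only looks at the very beginning of the string, while re.search with ^ and re.M can find the start of a later line.

```python
import re

text = "Cats are smarter\ndogs are loyal"

# re.match only tries position 0, even in multi-line mode.
at_start = re.match(r'dogs', text, re.M)

# re.search scans forward; with re.M, ^ also matches after each newline.
at_line_start = re.search(r'^dogs', text, re.M)

print(at_start)               # None
print(at_line_start.span())   # (17, 21)
```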

Search and replace

Python's re module provides re.sub for replacing matches in a string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Returns the string obtained by replacing the leftmost non-overlapping matches of pattern in string with repl. If the pattern is not found, the string is returned unchanged.

The optional count parameter is the maximum number of pattern matches to replace; count must be a non-negative integer. The default value of 0 means that all occurrences are replaced.

Example:

#!/usr/bin/python3
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

The results of the above example are as follows:

Phone Num :  2004-959-559
Phone Num :  2004959559
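
The effect of the count parameter can be sketched in the same way. The shortened phone string and the variable names below are assumptions made for this illustration.

```python
import re

phone = "2004-959-559"

all_removed = re.sub(r'-', "", phone)           # count defaults to 0: replace every match
first_only = re.sub(r'-', "", phone, count=1)   # replace only the leftmost match
unchanged = re.sub(r'x', "", phone)             # no match: the string comes back unchanged

print(all_removed)   # 2004959559
print(first_only)    # 2004959-559
print(unchanged)     # 2004-959-559
```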

Regular expression modifiers - optional flags

Regular expressions can take optional flag modifiers that control the matching mode. Modifiers are passed as the optional flags argument. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:

Modifier	Description
re.I	Makes matching case-insensitive.
re.L	Performs locale-aware matching.
re.M	Multi-line matching; affects ^ and $.
re.S	Makes . match any character, including a newline.
re.U	Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, and \B.
re.X	Allows a more flexible format so that regular expressions are easier to write and understand.
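
The flags in the table can be tried out as follows. This is a rough sketch on a made-up two-line string; re.findall is used here only because it makes the results easy to compare.

```python
import re

text = "first line\nSECOND line"

# re.I: case-insensitive matching
case_insensitive = re.search(r'second', text, re.I).group()

# re.M: ^ also matches at the start of each line
without_m = re.findall(r'^\w+', text)
with_m = re.findall(r'^\w+', text, re.M)

# re.S: . also matches a newline
without_s = re.search(r'line.SECOND', text)
with_s = re.search(r'line.SECOND', text, re.S).group()

# Combining flags with bitwise OR
combined = re.findall(r'^second', text, re.I | re.M)

# re.X: whitespace and comments inside the pattern are ignored
verbose = re.compile(r'''
    (\d{4})   # first block of digits
    -
    (\d{3})   # second block of digits
''', re.X)
verbose_groups = verbose.match("2004-959-559").groups()

print(case_insensitive)   # SECOND
print(without_m, with_m)  # ['first'] ['first', 'SECOND']
print(without_s)          # None
print(combined)           # ['SECOND']
print(verbose_groups)     # ('2004', '959')
```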

Regular expression pattern

A pattern string uses a special syntax to denote a regular expression:

Letters and digits match themselves. A regular expression pattern made of letters and digits matches the same string.

Most letters and digits take on a different meaning when preceded by a backslash.

Punctuation characters match themselves only when escaped; otherwise they have a special meaning.

The backslash itself must be escaped with another backslash.

Since regular expressions usually contain backslashes, it is best to write them as raw strings. Pattern elements (such as r'\t', equivalent to '\\t') match the corresponding special characters.
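
The point about raw strings can be checked directly. The snippet below is a minimal sketch; the path string and variable names are invented for the example.

```python
import re

# A raw string and an ordinary string with a doubled backslash are the same text.
same_tab = (r'\t' == '\\t')
same_digits = (r'\d+' == '\\d+')

# Passed to re, the two-character pattern \t matches a real tab character.
tab_match = re.search(r'\t', "a\tb")

# To match a literal backslash, the pattern itself needs two backslashes.
path = r'C:\temp'
backslash_match = re.search(r'\\temp', path)

print(same_tab, same_digits)       # True True
print(tab_match.span())            # (1, 2)
print(backslash_match.group())     # \temp
```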

The following table lists the special elements of regular expression pattern syntax. If you supply the optional flags argument along with a pattern, the meaning of some pattern elements changes.

Pattern	Description
^	Matches the beginning of the string.
$	Matches the end of the string.
.	Matches any character except a newline; when the re.DOTALL flag is specified, it matches any character including a newline.
[...]	Represents a set of characters, listed individually: [amk] matches 'a', 'm', or 'k'.
[^...]	Matches any character not in the brackets: [^abc] matches any character other than a, b, or c.
re*	Matches 0 or more occurrences of the preceding expression.
re+	Matches 1 or more occurrences of the preceding expression.
re?	Matches 0 or 1 occurrence of the preceding expression. Placed after another quantifier (as in re*? or re+?), it makes that quantifier non-greedy.
re{n}	Matches exactly n occurrences of the preceding expression.
re{n,}	Matches n or more occurrences of the preceding expression.
re{n,m}	Matches from n to m occurrences of the preceding expression, greedily.
a|b	Matches a or b.
(re)	Matches the expression inside the parentheses and also marks it as a group.
(?imx)	Turns on the optional i, m, or x flags. In Python's re module these inline flags apply to the whole pattern and belong at its start.
(?-imx)	Turns off the optional i, m, or x flags. Python's re module accepts this only in the scoped form (?-imx:re).
(?:re)	Like (...), but does not create a group.
(?imx:re)	Uses the optional i, m, or x flags inside the parentheses.
(?-imx:re)	Does not use the optional i, m, or x flags inside the parentheses.
(?#...)	Comment.
(?=re)	Positive lookahead assertion. Succeeds if the contained expression matches at the current position and fails otherwise. The matching engine does not advance once the contained expression has been tried; the rest of the pattern is tried from the same position.
(?!re)	Negative lookahead assertion. The opposite of the positive assertion: succeeds when the contained expression does not match at the current position in the string.
(?>re)	Independent (atomic) pattern that eliminates backtracking. Only newer Python versions accept this syntax.
\w	Matches letters, digits, and the underscore.
\W	Matches anything other than letters, digits, and the underscore.
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v].
\S	Matches any non-whitespace character.
\d	Matches any digit; equivalent to [0-9].
\D	Matches any non-digit.
\A	Matches the start of the string.
\Z	Matches the end of the string. In Perl-style engines, if the string ends with a newline, \Z matches just before that newline; in Python it matches only at the very end.
\z	Matches the end of the string. This is Perl-style syntax; Python's re module has traditionally offered only \Z.
\G	Matches the position where the last match finished. This is Perl-style syntax that Python's re module does not support.
\b	Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B	Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.	Match a newline, a tab, and so on.
\1...\9	Matches the content of the n-th group.
\10	Matches the content of the 10th group if it has already matched. In Perl-style engines it otherwise refers to an octal character code.
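
A few rows of the table are easier to follow with concrete input. The following sketch uses invented sample strings; re.findall appears only to collect all matches at once.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy vs non-greedy: * takes as much as possible, *? as little as possible.
greedy = re.search(r'<.*>', html).group()
lazy = re.search(r'<.*?>', html).group()

# re{n,m}: between 2 and 3 digits, preferring the longer match.
digits = re.search(r'\d{2,3}', "12345").group()

# (?=re): match a word only if ".com" follows, without consuming ".com".
before_com = re.findall(r'\w+(?=\.com)', "w3big.com example.org")

# (?!re): match "w3big" only if ".org" does not follow.
not_org = re.search(r'w3big(?!\.org)', "w3big.com").group()

# \1: refer back to whatever the first group matched (a repeated word here).
repeated = re.search(r'\b(\w+) \1\b', "this is is a test").group()

# (?:re): group without capturing, so only one group is reported.
captured = re.match(r'(?:www\.)?(\w+)\.com', "www.w3big.com").groups()

print(greedy)       # <b>bold</b> and <i>italic</i>
print(lazy)         # <b>
print(digits)       # 123
print(before_com)   # ['w3big']
print(not_org)      # w3big
print(repeated)     # is is
print(captured)     # ('w3big',)
```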

Examples of regular expressions

Character matches

Example	Description
python	Matches "python".

Character Classes

Example	Description
[Pp]ython	Matches "Python" or "python".
rub[ye]	Matches "ruby" or "rube".
[aeiou]	Matches any one of the letters in the brackets.
[0-9]	Matches any digit. Same as [0123456789].
[a-z]	Matches any lowercase letter.
[A-Z]	Matches any uppercase letter.
[a-zA-Z0-9]	Matches any letter or digit.
[^aeiou]	Matches any character other than the letters a, e, i, o, u.
[^0-9]	Matches any character other than a digit.
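
The character classes above can be exercised with a few short calls. This is an illustrative sketch with made-up input strings; re.findall is used to list every match.

```python
import re

python_names = re.findall(r'[Pp]ython', "Python and python")
ruby_words = re.findall(r'rub[ye]', "ruby rube rubs")
vowels = re.findall(r'[aeiou]', "regex")
non_vowels = re.findall(r'[^aeiou]', "regex")
single_digits = re.findall(r'[0-9]', "w3big 2004")

print(python_names)    # ['Python', 'python']
print(ruby_words)      # ['ruby', 'rube']
print(vowels)          # ['e', 'e']
print(non_vowels)      # ['r', 'g', 'x']
print(single_digits)   # ['3', '2', '0', '0', '4']
```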

Special character classes

Example	Description
.	Matches any single character except "\n". To match any character including "\n", use the re.S (re.DOTALL) flag or a pattern such as '[\s\S]'.
\d	Matches a digit character. Equivalent to [0-9].
\D	Matches a non-digit character. Equivalent to [^0-9].
\s	Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w	Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W	Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
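
The shorthand classes can be tried in the same way. The strings below are invented for this sketch, and re.findall is again used only to show all matches.

```python
import re

s = "Tel: 2004-959-559\n"

digit_runs = re.findall(r'\d+', s)                # runs of digits
digits_only = re.sub(r'\D', '', s)                # drop every non-digit
tokens = re.findall(r'\S+', "a b\tc\nd")          # runs of non-whitespace
words = re.findall(r'\w+', "snake_case, CamelCase!")
non_word = re.findall(r'\W', "a-b c")

# . skips the newline unless re.S is given
dot_default = re.findall(r'.', "a\nb")
dot_all = re.findall(r'.', "a\nb", re.S)

print(digit_runs)    # ['2004', '959', '559']
print(digits_only)   # 2004959559
print(tokens)        # ['a', 'b', 'c', 'd']
print(words)         # ['snake_case', 'CamelCase']
print(non_word)      # ['-', ' ']
```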
