Python Regular Expressions
A regular expression is a special sequence of characters that lets you easily check whether a string matches a pattern.
Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.
The re module gives the Python language full regular expression functionality.
The compile function builds a regular expression object from a pattern string and an optional flags argument. This object has a set of methods for regular expression matching and substitution.
The re module also provides module-level functions with the same functionality as these methods; they take a pattern string as their first argument.
This section introduces the commonly used regular expression functions in Python.
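As a minimal sketch of the two styles described above, the snippet below matches the same pattern once through a compiled pattern object and once through the module-level function. The names word_re and text are illustrative only.

```python
import re

# Compile the pattern once; the resulting object can be reused.
word_re = re.compile(r'\w+')

text = "Cats are smarter than dogs"

# Method on the compiled pattern object.
first_compiled = word_re.match(text).group()

# Equivalent module-level function: the pattern string is the first argument.
first_module = re.match(r'\w+', text).group()

print(first_compiled)  # Cats
print(first_module)    # Cats
```

Compiling is mostly useful when the same pattern is applied many times; for a one-off match the module-level function reads just as well.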
re.match function
re.match tries to match a pattern at the start of the string; if the pattern does not match at the starting position, match() returns None.
Function syntax:
re.match(pattern, string, flags=0)
Function parameters:
parameter | description |
---|---|
pattern | The regular expression to match. |
string | The string to match against. |
flags | Optional flags that control how the expression is matched, for example case-insensitive or multi-line matching. |
On a successful match, re.match returns a match object; otherwise it returns None.
We can use the group(num) or groups() method of the match object to retrieve the matched text.
Match object methods | description |
---|---|
group(num=0) | Returns the string matched by the entire expression. group() also accepts several group numbers at once, in which case it returns a tuple of the values of those groups. |
groups() | Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern. |
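The two methods in the table can be sketched as follows. The pattern and variable names are made up for illustration; span() is the match object method that reports the start and end positions of the match.

```python
import re

m = re.match(r'(\w+) (\w+)', "Cats are smarter than dogs")

whole = m.group()         # entire match
first = m.group(1)        # text captured by group 1
pair = m.group(1, 2)      # several group numbers return a tuple
all_groups = m.groups()   # every group, starting from group 1

print(whole)       # Cats are
print(first)       # Cats
print(pair)        # ('Cats', 'are')
print(all_groups)  # ('Cats', 'are')
print(m.span())    # (0, 8)
```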
Example 1:

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import re

print(re.match('www', 'www.w3big.com').span())  # matches at the starting position
print(re.match('com', 'www.w3big.com'))         # not at the starting position, so no match

Running the above example prints:

(0, 3)
None
Example 2:

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() :", matchObj.group())
    print("matchObj.group(1) :", matchObj.group(1))
    print("matchObj.group(2) :", matchObj.group(2))
else:
    print("No match!!")

The output of the above example is:

matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
re.search function
re.search scans the whole string and returns the first successful match.
Function syntax:
re.search(pattern, string, flags=0)
Function parameters:
parameter | description |
---|---|
pattern | The regular expression to match. |
string | The string to search. |
flags | Optional flags that control how the expression is matched, for example case-insensitive or multi-line matching. |
On a successful match, re.search returns a match object; otherwise it returns None.
We can use the group(num) or groups() method of the match object to retrieve the matched text.
Match object methods | description |
---|---|
group(num=0) | Returns the string matched by the entire expression. group() also accepts several group numbers at once, in which case it returns a tuple of the values of those groups. |
groups() | Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern. |
Example 1:

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import re

print(re.search('www', 'www.w3big.com').span())  # match at the starting position
print(re.search('com', 'www.w3big.com').span())  # match away from the starting position

Running the above example prints:

(0, 3)
(10, 13)
Example 2:

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() :", searchObj.group())
    print("searchObj.group(1) :", searchObj.group(1))
    print("searchObj.group(2) :", searchObj.group(2))
else:
    print("Nothing found!!")

The output of the above example is:

searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter
The difference between re.match and re.search
re.match matches only at the beginning of the string: if the beginning of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds a match.
Example:

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

The output of the above example is:

No match!!
search --> matchObj.group() : dogs
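A further sketch of the same distinction, under the assumption that you want re.search to behave like re.match: anchoring the pattern with ^ restricts it to the start of the string. The multi-line case shows where the two still differ. Variable names are illustrative.

```python
import re

line = "Cats are smarter than dogs"

# re.search scans the string, so it finds 'dogs' at the end.
found = re.search(r'dogs', line)

# With ^, re.search only succeeds at the start of the string.
anchored = re.search(r'^dogs', line)

# In multi-line mode ^ also matches right after a newline,
# but re.match still only tries position 0.
text = "Cats\ndogs"
search_multiline = re.search(r'^dogs', text, re.M)
match_multiline = re.match(r'dogs', text, re.M)

print(found.span())             # (22, 26)
print(anchored)                 # None
print(search_multiline.span())  # (5, 9)
print(match_multiline)          # None
```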
Search and replace
Python's re module provides re.sub to replace the matches of a pattern in a string.
Syntax:
re.sub(pattern, repl, string, count=0)
The returned string is obtained by replacing the leftmost non-overlapping matches of the pattern in string with repl. If the pattern is not found, the string is returned unchanged.
The optional count parameter is the maximum number of matches to replace; it must be a non-negative integer. The default value of 0 means that all occurrences are replaced.
Example:

#!/usr/bin/python3
import re

phone = "2004-959-559 # This is Phone Number"

# Delete the Python-style comment
num = re.sub(r'#.*$', "", phone)
print("Phone Num :", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num :", num)

The output of the above example is:

Phone Num : 2004-959-559
Phone Num : 2004959559
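The effect of the count parameter described earlier can be sketched like this. The phone string is a shortened version of the one in the example, and the variable names are illustrative.

```python
import re

phone = "2004-959-559"

# count=0 (the default) replaces every match.
all_removed = re.sub(r'-', "", phone)

# count=1 replaces only the leftmost match.
first_removed = re.sub(r'-', "", phone, count=1)

# When nothing matches, the string comes back unchanged.
unchanged = re.sub(r'x', "", phone)

print(all_removed)    # 2004959559
print(first_removed)  # 2004959-559
print(unchanged)      # 2004-959-559
```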
Regex modifiers - optional flags
Regular expressions accept optional flag modifiers that control the matching mode. Modifiers are passed as the optional flags argument. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:
Modifiers | description |
---|---|
re.I | Makes matching case-insensitive. |
re.L | Performs locale-aware matching. |
re.M | Multi-line matching; affects ^ and $. |
re.S | Makes . match any character, including a newline. |
re.U | Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, and \B. In Python 3 it is already the default for str patterns. |
re.X | Verbose mode: allows a more flexible layout so that regular expressions are easier to read. |
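The flags in the table might be exercised as in the sketch below. It uses re.findall, which is a real re function but is not covered in this section, simply to list all matches; the sample text and variable names are illustrative.

```python
import re

text = "first line\nSecond Line"

# re.I: ignore case.
ci = re.search(r'second', text, re.I)

# re.M: ^ also matches at the start of each line.
line_starts = re.findall(r'^\w+', text, re.M)

# Flags combined with bitwise OR.
combined = re.findall(r'^[a-z]+', text, re.I | re.M)

# re.S: . also matches the newline.
dot_default = re.match(r'.*', text).group()
dot_all = re.match(r'.*', text, re.S).group()

# re.X: whitespace and comments inside the pattern are ignored.
phone_re = re.compile(r"""
    (\d{4})   # first block of digits
    -
    (\d{3})   # second block of digits
""", re.X)
parts = phone_re.match("2004-959").groups()

print(ci.group())    # Second
print(line_starts)   # ['first', 'Second']
print(combined)      # ['first', 'Second']
print(dot_default)   # first line
print(parts)         # ('2004', '959')
```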
Regular expression pattern
A pattern string uses a special syntax to describe a regular expression:
Letters and digits match themselves. A pattern made up of letters and digits matches the identical string.
Most letters and digits take on a different meaning when preceded by a backslash.
Punctuation characters match themselves only when escaped; otherwise they carry a special meaning.
A literal backslash must itself be escaped with a backslash.
Because regular expressions usually contain backslashes, it is best to write them as raw strings. A pattern element such as r'\t' (equivalent to '\\t') matches the corresponding special character.
The following table lists the elements of regular expression pattern syntax. If you pass the optional flags argument along with a pattern, the meaning of some pattern elements changes.
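A small sketch of why raw strings help: the raw and the doubled-backslash spellings produce the same pattern, and a literal backslash needs escaping inside the pattern. The sample strings are illustrative.

```python
import re

# Both spellings are the same two-character pattern: a backslash and 't'.
raw = r'\t'
escaped = '\\t'
same = raw == escaped

# The regex engine reads that pattern as "a tab character".
tab_match = re.search(raw, "a\tb")

# Matching one literal backslash needs an escaped backslash in the pattern.
backslash_match = re.search(r'\\', 'C:\\temp')

print(same)                    # True
print(tab_match.span())        # (1, 2)
print(backslash_match.span())  # (2, 3)
```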
Pattern | description |
---|---|
^ | Matches the beginning of the string. |
$ | Matches the end of the string. |
. | Matches any character except a newline; when the re.DOTALL flag is specified, it matches any character including a newline. |
[...] | Matches any one character from the listed set: [amk] matches 'a', 'm', or 'k'. |
[^...] | Matches any character not in the set: [^abc] matches any character other than a, b, or c. |
re* | Matches 0 or more occurrences of the preceding expression. |
re+ | Matches 1 or more occurrences of the preceding expression. |
re? | Matches 0 or 1 occurrence of the preceding expression. Placed after another quantifier (as in re*? or re+?), it makes that quantifier non-greedy. |
re{n} | Matches exactly n occurrences of the preceding expression. |
re{n,} | Matches n or more occurrences of the preceding expression. |
re{n,m} | Matches between n and m occurrences of the preceding expression, greedily. |
a | b | Matches either a or b. |
(re) | Matches the expression inside the parentheses and captures it as a group. |
(?imx) | Turns on the optional i, m, or x flag from inside the pattern. In Python, flags written this way apply to the entire pattern and should appear at its start. |
(?-imx) | Turns off the optional i, m, or x flag. Python's re module accepts this only in the scoped form (?-imx:re). |
(?:re) | Like (...), but does not capture a group. |
(?imx:re) | Turns on the i, m, or x flag only within the parentheses. |
(?-imx:re) | Turns off the i, m, or x flag only within the parentheses. |
(?#...) | Comment. |
(?=re) | Positive lookahead assertion. Succeeds if the contained expression matches at the current position and fails otherwise. The matching engine does not advance past the lookahead; the rest of the pattern is tried from the same position. |
(?!re) | Negative lookahead assertion, the opposite of the positive one: succeeds when the contained expression does not match at the current position in the string. |
(?>re) | Independent (atomic) group that matches without backtracking. Python's re module supports it from version 3.11. |
\w | Matches alphanumeric characters and the underscore. |
\W | Matches any character that is not alphanumeric or an underscore. |
\s | Matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text. |
\S | Matches any non-whitespace character. |
\d | Matches any digit; equivalent to [0-9] for ASCII text. |
\D | Matches any non-digit character. |
\A | Matches the start of the string. |
\Z | Matches the end of the string. In Perl-style engines \Z also matches just before a trailing newline; in Python's re module it matches only at the very end. |
\z | Matches the end of the string in Perl-style engines. Python's re module has traditionally used \Z for this. |
\G | Matches the position where the previous match ended. This is a Perl-style anchor that Python's re module does not support. |
\b | Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb". |
\B | Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never". |
\n, \t, etc. | Match a newline, a tab, and so on. |
\1 ... \9 | Matches the content of the n-th captured group. |
\10 | Matches the content of the 10th captured group, if the pattern defines one. Some engines otherwise treat it as an octal character code. |
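A few rows of the table can be tried out as in the sketch below, which covers greedy and non-greedy quantifiers, counted repetition, a non-capturing group, lookahead, and a backreference. The sample strings are made up, and re.findall is used only to list all matches.

```python
import re

html = "<b>bold</b>"

# Greedy .+ grabs as much as possible; .+? stops at the first '>'.
greedy = re.match(r'<.+>', html).group()
lazy = re.match(r'<.+?>', html).group()

# Counted repetition: exactly 3 digits, then 2 to 3 digits.
exact = re.findall(r'\b\d{3}\b', "12 123 1234")
ranged = re.findall(r'\b\d{2,3}\b', "12 123 1234")

# (?:...) groups the alternation without creating a captured group.
noncap = re.match(r'(?:Cats|Dogs) are (\w+)', "Cats are smarter").groups()

# Lookahead checks what follows without consuming it.
sites = "w3big.com w3big.org"
before_com = re.findall(r'\w+(?=\.com)', sites)
not_com = re.search(r'w3big(?!\.com)', sites).span()

# \1 matches the same text that group 1 captured.
repeated = re.search(r'\b(\w+) \1\b', "this is is a test").group()

print(greedy)      # <b>bold</b>
print(lazy)        # <b>
print(exact)       # ['123']
print(ranged)      # ['12', '123']
print(noncap)      # ('smarter',)
print(before_com)  # ['w3big']
print(not_com)     # (10, 15)
print(repeated)    # is is
```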
Examples of regular expressions
Character matches
Examples | description |
---|---|
python | Matches "python". |
Character Classes
Examples | description |
---|---|
[Pp]ython | Matches "Python" or "python". |
rub[ye] | Matches "ruby" or "rube". |
[aeiou] | Matches any one of the letters inside the brackets. |
[0-9] | Matches any digit. Same as [0123456789]. |
[a-z] | Matches any lowercase letter. |
[A-Z] | Matches any uppercase letter. |
[a-zA-Z0-9] | Matches any letter or digit. |
[^aeiou] | Matches any character other than the letters a, e, i, o, u. |
[^0-9] | Matches any character other than a digit. |
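The character classes above can be checked with a short sketch such as the following. The input strings are illustrative, and re.findall is used to collect every match.

```python
import re

names = re.findall(r'[Pp]ython', "Python and python")
rub = re.findall(r'rub[ye]', "ruby rube rubs")
vowels = re.findall(r'[aeiou]', "regex")
alnum = re.findall(r'[a-zA-Z0-9]+', "w3big.com")

# A negated class: remove everything that is not a digit.
digits_only = re.sub(r'[^0-9]', "", "2004-959-559")

print(names)        # ['Python', 'python']
print(rub)          # ['ruby', 'rube']
print(vowels)       # ['e', 'e']
print(alnum)        # ['w3big', 'com']
print(digits_only)  # 2004959559
```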
Special character classes
Examples | description |
---|---|
. | Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]' or pass the re.DOTALL flag. |
\d | Matches a digit character. Equivalent to [0-9]. |
\D | Matches a non-digit character. Equivalent to [^0-9]. |
\s | Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v]. |
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v]. |
\w | Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'. |
\W | Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'. |
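As a quick check of the special character classes, the sketch below runs each one over a small sample string. The string and variable names are illustrative; '[\s\S]' is a common idiom for "any character, newline included" and behaves like . with re.S.

```python
import re

text = "id_7\tok\n"

digits = re.findall(r'\d', text)
words = re.findall(r'\w+', text)
spaces = re.findall(r'\s', text)
non_word = re.findall(r'\W', text)

# . skips the newline unless re.S (re.DOTALL) is given.
dot_default = re.findall(r'.', "a\nb")
dot_all = re.findall(r'.', "a\nb", re.S)
any_char = re.findall(r'[\s\S]', "a\nb")

print(digits)       # ['7']
print(words)        # ['id_7', 'ok']
print(spaces)       # ['\t', '\n']
print(non_word)     # ['\t', '\n']
print(dot_default)  # ['a', 'b']
print(dot_all)      # ['a', '\n', 'b']
print(any_char)     # ['a', '\n', 'b']
```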