Python Basics

Published: 2017-07-22

The re module

Common regular expression symbols

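'.'          Matches any single character except \n by default; with the DOTALL flag it matches any character, including a newline

'^'          Matches the start of the string

'$'          Matches the end of the string; with flags=re.MULTILINE it also matches at the end of each line, e.g. re.search("foo$","bfoo\nsdfsf",flags=re.MULTILINE).group() returns 'foo'

'*'          Matches the character before the * 0 or more times; re.findall("ab*","cabb3abcbbac") returns ['abb','ab','a']

'+'          Matches the preceding character 1 or more times; re.findall("ab+","ab+cd+abb+bba") returns ['ab','abb']

'?'          Matches the preceding character 1 or 0 times

'{m}'        Matches the preceding character exactly m times

'{n,m}'      Matches the preceding character n to m times; re.findall("ab{1,3}","abb abc abbcbbb") returns ['abb','ab','abb']

'|'          Matches the pattern to the left or to the right of the |; re.search("abc|ABC","ABCBabcCD").group() returns 'ABC'

'(...)'      Grouping; re.search("(abc){2}a(123|456)c","abcabca456c").group() returns 'abcabca456c'

'\A'         Matches only at the start of the string; re.search(r"\Aabc","alexabc") finds no match

'\Z'         Matches only at the end of the string; unlike $, it does not match before a trailing newline

'\d'         Matches a digit 0-9

'\D'         Matches a non-digit

'\w'         Matches [A-Za-z0-9_] (for str patterns in Python 3, other Unicode word characters as well unless re.ASCII is set)

'\W'         Matches any character that \w does not match

'\s'         Matches whitespace characters such as space, \t, \n, \r; re.search(r"\s+","ab\tc1\n3").group() returns '\t'

'(?P<name>...)'  Named group; re.search("(?P<province>[0-9]{4})(?P<city>[0-9]{2})(?P<birthday>[0-9]{4})","371481199306143242").groupdict() returns {'province': '3714', 'city': '81', 'birthday': '1993'}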
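A few of the symbols above are easier to judge when run side by side. The following is a minimal sketch; the sample text and variable names are made up for illustration and are not part of the original notes.

```python
import re

text = "first line\nsecond foo\nthird"

# '.' stops at a newline unless the DOTALL flag is given.
dot_default = re.search(r"first.*", text).group()
dot_all = re.search(r"first.*", text, flags=re.DOTALL).group()
print(repr(dot_default))   # 'first line'
print(dot_all == text)     # True

# '$' does not match at the end of an inner line unless MULTILINE is given.
end_default = re.search(r"foo$", text)
end_multiline = re.search(r"foo$", text, flags=re.MULTILINE)
print(end_default)              # None
print(end_multiline.group())    # foo

# '\w' includes the underscore, so snake_case stays in one piece.
words = re.findall(r"\w+", "snake_case-name")
print(words)   # ['snake_case', 'name']

# Named groups can be read one at a time or all at once.
m = re.search(r"(?P<province>[0-9]{4})(?P<city>[0-9]{2})(?P<birthday>[0-9]{4})",
              "371481199306143242")
print(m.group("city"))   # 81
print(m.groupdict())     # {'province': '3714', 'city': '81', 'birthday': '1993'}
```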

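The backslash (\) confusion in Python

Regular expressions use "\" as the escape character, which can cause trouble. For example, to match a literal "\" in a file, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the first two and the last two are each turned into one backslash by the programming language, and the resulting two backslashes are then turned into a single literal backslash by the regular expression engine. Python's raw strings solve this neatly: the regular expression in this example can be written as r"\\". Likewise, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer need to worry about a missing backslash, and the expressions are easier to read.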
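The equivalence between the four-backslash form and the raw-string form can be checked directly. This is a small sketch; the Windows-style path is an invented example value.

```python
import re

# An invented path that contains exactly one literal backslash.
path = "C:\\temp"

# Both literals produce the same two-character regex: backslash, backslash.
normal_pattern = "\\\\"   # four backslashes in an ordinary string literal
raw_pattern = r"\\"       # two backslashes in a raw string literal

print(normal_pattern == raw_pattern)   # True
print(len(raw_pattern))                # 2

backslashes = re.findall(raw_pattern, path)
print(len(backslashes))                # 1

# The same idea for a digit: r"\d" instead of "\\d".
digits = re.findall(r"\d", "a1b22")
print(digits)                          # ['1', '2', '2']
```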

Common functions in the re module

re.search

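re.search scans the string for a pattern match and returns as soon as it finds the first one; if nothing in the string matches, it returns None.

>>> help(re.search)
search(pattern, string, flags=0)

Scan through the string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.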

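First argument: the regular expression pattern

Second argument: the string to search

Third argument: flags that control how the regular expression is matched

Example:

>>> import re
>>> name = "Hello world, I am coming!"
>>> s = re.search(r'(\w+) (\w+)', name)
>>> if s:
...     print(s.group(0), '\n', s.group(1))
... else:
...     print("not search")
...
Hello world 
 Hello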

re.match

re.match(pattern,string,flags=0)

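If zero or more characters at the beginning of the string match the regular expression pattern, return a corresponding match object. Return None if the beginning of the string does not match the pattern; note that this is different from a zero-length match.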

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead

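First argument: the regular expression pattern

Second argument: the string to match against

Third argument: flags that control how the regular expression is matched

>>> name = "Hello world, I am coming!"
>>> s = re.match(r'(H..)', name)
>>> if s:
...     print(s.group(0), '\n', s.group(1))
... else:
...     print("not match!")
...
Hel 
 Hel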

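How re.match differs from re.search: re.match tries the pattern only at the beginning of the string, returning a match object on success and None otherwise, whereas re.search scans forward through the whole string from the beginning until it finds the first substring that matches the pattern.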
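The difference shows up as soon as the pattern does not sit at position 0. A minimal sketch, reusing the sample sentence from the examples above plus an invented two-line string:

```python
import re

text = "Hello world, I am coming!"

# "world" is not at the start, so match fails while search succeeds.
at_start = re.match(r"world", text)
anywhere = re.search(r"world", text)
print(at_start)          # None
print(anywhere.span())   # (6, 11)

# Even with MULTILINE, match is still tied to the start of the whole string.
lines = "one\ntwo"
match_multiline = re.match(r"^two", lines, flags=re.MULTILINE)
search_multiline = re.search(r"^two", lines, flags=re.MULTILINE)
print(match_multiline)            # None
print(search_multiline.group())   # two
```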

re.findall

re.findall(pattern,string,flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match (Python 3.7 and later include all empty matches).

>>> mail = '<user01@gmail.com <user02@gmail.com> user03@gmail.com'  # note the input string
>>> re.findall(r'(\w+@gm....[a-z]{3})', mail)
['user01@gmail.com', 'user02@gmail.com', 'user03@gmail.com']
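How the number of groups changes the shape of the result is easiest to see with the same input. The patterns below are illustrative variations, not the one used above:

```python
import re

mail = '<user01@gmail.com <user02@gmail.com> user03@gmail.com'

# No group: a list of the full matches.
no_group = re.findall(r'\w+@\w+\.[a-z]{3}', mail)
print(no_group)     # ['user01@gmail.com', 'user02@gmail.com', 'user03@gmail.com']

# One group: a list of what that group captured.
one_group = re.findall(r'(\w+)@\w+\.[a-z]{3}', mail)
print(one_group)    # ['user01', 'user02', 'user03']

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)@(\w+\.[a-z]{3})', mail)
print(two_groups)   # [('user01', 'gmail.com'), ('user02', 'gmail.com'), ('user03', 'gmail.com')]
```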

re.sub

re.sub(pattern,repl,string,count=0,flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone in the Python versions current when this was written; since Python 3.7, an unknown escape made of a backslash and an ASCII letter raises an error instead. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.

For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'

The pattern may be a string or an RE object.

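First argument: the regular expression pattern

Second argument: the replacement, either a string or a function

Third argument: the string to process

Fourth argument: the maximum number of replacements; the default 0 means every match is replaced

Fifth argument: flags

>>> test = "Hi, nice to meet you where are you from?"
>>> re.sub(r'\s', '-', test)
'Hi,-nice-to-meet-you-where-are-you-from?'
>>> re.sub(r'\s', '-', test, count=5)                      # replace only the first 5 matches
'Hi,-nice-to-meet-you-where are you from?'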
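Backreferences and a function as repl can be combined with count in the same way. A short sketch; the helper double and the sample strings are hypothetical:

```python
import re

# Backreferences: reorder the captured groups of a date.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2017-07-22")
print(swapped)     # 22/07/2017

# A function as repl receives the match object and returns the replacement text.
def double(matchobj):
    return str(int(matchobj.group(0)) * 2)

text = "3 apples and 10 pears"
all_doubled = re.sub(r"\d+", double, text)
first_doubled = re.sub(r"\d+", double, text, count=1)
print(all_doubled)     # 6 apples and 20 pears
print(first_doubled)   # 6 apples and 10 pears
```

Passing count by keyword keeps the call readable and avoids confusing it with flags.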

re.split

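re.split(pattern, string, maxsplit=0, flags=0)

First argument: the regular expression pattern

Second argument: the string to split

Third argument: the maximum number of splits; the default 0 means split at every match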

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

>>> test = "Hi, nice to meet you where are you from?"
>>> re.split(r"\s+", test)
['Hi,', 'nice', 'to', 'meet', 'you', 'where', 'are', 'you', 'from?']
>>> re.split(r"\s+", test, maxsplit=5)                  # split at the first 5 matches only
['Hi,', 'nice', 'to', 'meet', 'you', 'where are you from?']

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

For example:

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

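Note that in Python versions before 3.7, split never splits a string on an empty pattern match. Since Python 3.7, empty matches also split the string, so re.split('x*', 'foo') returns ['', 'f', 'o', 'o', ''] there. The following output is from the older versions: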

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']

re.compile

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

The sequence

>>> prog = re.compile(pattern)
>>> result = prog.match(string)

is equivalent to

>>> result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.


Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.
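Reusing a compiled pattern and combining flags with | might look like the sketch below. The patterns, variable names, and input lines are hypothetical and chosen only for illustration.

```python
import re

# Compile once, then reuse the pattern object inside the loop.
word = re.compile(r"[a-z]+", flags=re.IGNORECASE)
lines = ["Hello world", "12345", "re module"]

first_words = []
for line in lines:
    m = word.search(line)   # same result as re.search(r"[a-z]+", line, flags=re.IGNORECASE)
    first_words.append(m.group() if m else None)
print(first_words)   # ['Hello', None, 're']

# Flags are combined with bitwise OR.
greeting = re.compile(r"^hello$", re.IGNORECASE | re.MULTILINE)
found = greeting.findall("Hello\nHELLO\nbye")
print(found)   # ['Hello', 'HELLO']
```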

阿培