Python Cookbook reading notes.
Chapter 2: Strings and Text
2.1. Splitting Strings on Any of Multiple Delimiters
Use re.split() with a regexp such as r'[,;\s]\s*'
difference between str.split and re.split
str.split only accepts a simple separator string; re.split accepts a regular expression.
return value of re.split
- if the pattern has no capture group, the result has the same shape as that of str.split
- if the pattern has a capture group, the matched delimiters are returned as well;
  the values are then rst[::2] and the separators are rst[1::2]
s = "I, you; a seperater. haha"
import re
a = re.split(r'[,;.\s]\s*', s)
print(a)
a = re.split(r'([,;.\s]\s*)', s)
print(a, a[::2], a[1::2])
iterate over two lists in parallel by first zipping them into one
looks nice!
# Reform the line using the same delimiters
''.join(v+d for v,d in zip(values, delimiters))
'asdf fjdk;afed,fjek,asdf,foo'
a regexp non-capture group is written r'(?:…)'; it groups without returning the matched delimiters
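A minimal runnable sketch of the two ideas above, assuming a sample line similar to the book's (the names fields, rebuilt and plain are only illustrative): a capture group keeps the delimiters so the line can be rebuilt with zip(), while a non-capture group discards them.

```python
import re

line = "asdf fjdk; afed, fjek,asdf, foo"

# Capture group: the matched delimiters land at the odd indices.
fields = re.split(r"(;|,|\s)\s*", line)
values = fields[::2]
delimiters = fields[1::2] + [""]  # pad so zip() pairs the last value too

# Rebuild the line from the values and their original delimiters.
rebuilt = "".join(v + d for v, d in zip(values, delimiters))
print(rebuilt)

# Non-capture group: same grouping, but only the values come back.
plain = re.split(r"(?:;|,|\s)\s*", line)
print(plain)
```

The trailing whitespace after each delimiter is matched outside the group, so it is dropped when the line is rebuilt.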
2.2. Matching Text at the Start or End of a String, by str.startswith() or str.endswith() method
filename = "aaaa.txt"
filename.endswith(".txt")
# pass a tuple to check against multiple choices
filename.endswith((".c", ".h"))
from urllib.request import urlopen
def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):
        return urlopen(name).read()
    else:
        with open(name) as f:
            return f.read()
The argument is a plain string or a tuple of strings, not a regexp.
Compared to re.match, str.startswith looks nicer.
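A small sketch comparing the two approaches, with a made-up URL and choice list. It also shows one detail that is easy to trip over: the choices have to be a tuple, and a list is rejected.

```python
import re

url = "https://www.python.org"
choices = ["http:", "https:", "ftp:"]

# startswith() accepts a str or a tuple of str; a list raises TypeError.
try:
    url.startswith(choices)
    list_accepted = True
except TypeError:
    list_accepted = False

by_tuple = url.startswith(tuple(choices))
by_regexp = re.match(r"http:|https:|ftp:", url) is not None
print(list_accepted, by_tuple, by_regexp)
```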
2.3. Matching Strings Using Shell Wildcard Patterns, with fnmatch.fnmatch(), fnmatch.fnmatchcase()
Shell wildcard:
- [] : a charset
- * : match any length of chars
- ? : match only one char
from fnmatch import fnmatch
print(fnmatch("data 1.txt", "*[0-9]*"))
- the pattern must match the whole string
- compared to startswith(), fnmatch can match at any position
- compared to regexps, fnmatch looks nicer
- fnmatch uses the same case-sensitivity rule as the OS; fnmatchcase always respects case
- it sits between simple string methods and the full power of regexps
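The points in the list above can be checked with a short sketch. The file names are invented, and only fnmatchcase() is used so that the result does not depend on the operating system.

```python
from fnmatch import fnmatchcase

names = ["Dat1.csv", "Dat2.csv", "config.ini", "foo.py"]

# Filter a list of names with a shell-style pattern.
csv_files = [n for n in names if fnmatchcase(n, "Dat*.csv")]

# The pattern has to cover the whole string, so "*" is needed on both sides.
digit_only = fnmatchcase("data 1.txt", "[0-9]")
digit_anywhere = fnmatchcase("data 1.txt", "*[0-9]*")

# fnmatchcase() compares case exactly on every platform.
upper_ext = fnmatchcase("foo.txt", "*.TXT")

print(csv_files, digit_only, digit_anywhere, upper_ext)
```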
2.4. Matching and Searching for Text Patterns
What's the difference between matching and searching?
the str.find() function: find the start index of a substring
s = "Hello xxx bbbb"
print(s.find("xx"))
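re.compile() function: compiles a regexp string into a regexp object, for performance
If you use the same regexp many times, compiling it first is a good idea. If you use it only once, there is no need to call re.compile().
difference between r'\d' and '\d'
If a string literal is prefixed with 'r', the string parser does not interpret the '\' in it. '\d' is not a recognized escape sequence, so Python happens to leave the backslash in place and both literals give the same two-character string (recent Python versions warn about the invalid escape). Recognized escapes do differ: '\b' is a single backspace character, while r'\b' is the regexp word boundary. So always write regexps as raw strings.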
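A short sketch of why the 'r' prefix matters for regexps, using '\b' because it is an escape sequence the string parser does recognize. The sample text is made up.

```python
import re

# A recognized escape is translated by the string parser ...
assert len("\b") == 1       # one backspace character
# ... while a raw string keeps the backslash for the regexp engine.
assert len(r"\b") == 2      # backslash + 'b', the regexp word boundary

text = "cat concat cat"
raw_hits = re.findall(r"\bcat\b", text)    # whole-word matches
plain_hits = re.findall("\bcat\b", text)   # searches for backspace characters
print(raw_hits, plain_hits)
```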
re.findall() function, find all matched data as a list
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
import re
rg = r'\d+/(?:\d+)/(?:\d+)'
a = re.match('Today', text)
print(a.group(0))
a = re.findall(rg, text)
print(a)
print(type(a[0]))
The return value depends on the capture groups. With no capture group, findall returns a list of the whole matches as strings. With exactly one capture group, it returns a list of the captured strings. With more than one capture group, it returns a list of tuples of strings.
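The three cases can be seen side by side in a small sketch that reuses the sample text from these notes (the variable names are illustrative):

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# No capture group: a list of whole matches.
no_group = re.findall(r'\d+/\d+/\d+', text)
# One capture group: a list of the captured strings only.
one_group = re.findall(r'\d+/\d+/(\d+)', text)
# Several capture groups: a list of tuples.
three_groups = re.findall(r'(\d+)/(\d+)/(\d+)', text)

print(no_group)
print(one_group)
print(three_groups)
```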
re.finditer(): finds all matches as an iterator
Its return value differs from that of re.findall(): it yields match objects, the same kind that re.match() returns. This seems strange and highly inconsistent.
re.match() function: always matches at the start of a string
re.match() function: return value (a match object, or None when there is no match)
- rst.group(0): the matched data
- rst.group(1): the first captured data
- rst.groups(): all captured data as a tuple
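As a rough illustration of matching versus searching, the sketch below compiles one pattern and uses it with match(), search() and finditer(). The name datepat is just a label chosen here.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

at_start = datepat.match(text)    # None: the text does not start with a date
first = datepat.search(text)      # first date anywhere in the text

print(first.group(0))   # whole match
print(first.group(1))   # first captured group
print(first.groups())   # all captured groups as a tuple

# finditer() yields one match object per date.
dates = [(m.group(0), m.groups()) for m in datepat.finditer(text)]
print(dates)
```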
2.5. Searching and Replacing Text
the str.replace method: replaces all occurrences in a string
str.replace(old, new)
text = 'yeah, but no, but yeah, but no, but yeah'
print(text.replace('yeah', 'yep'))
# 'yep, but no, but yep, but no, but yep'
the re.sub(pattern, replacement, text) function also replaces all occurrences in a string
use r'\1' in the replacement to refer to the first captured group
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
import re
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))
# 'Today is 2012-11-27. PyCon starts 2013-3-13.'
the re.sub(pattern, callback, text) function also replaces all occurrences in a string
The second argument can also be a function. It receives a match object (the same kind returned by re.match) and returns the replacement string.
The same example as the one above:
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
import re
def foo(match):
    # unpack month, day, year without shadowing the match object
    (m, d, y) = match.groups()
    return '-'.join([y, m, d])
print(re.sub(r'(\d+)/(\d+)/(\d+)', foo, text))
the re.subn(…) function: same as re.sub, but also returns the substitution count
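A minimal sketch of re.subn() on the same sample text; it returns a tuple of the new string and the number of substitutions made.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# subn() returns (new_string, number_of_substitutions).
new_text, count = re.subn(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
print(new_text)
print(count)
```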
2.6. Searching and Replacing Case-Insensitive Text
To do case-insensitive operations, use a regexp and pass re.IGNORECASE through the flags keyword argument
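A small sketch of the flag in use, with a sample sentence chosen for illustration. Note that a plain replacement string is inserted as written and does not follow the case of the text it replaces.

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# The flag makes the pattern match regardless of case.
found = re.findall('python', text, flags=re.IGNORECASE)
# The replacement text is used literally for every match.
replaced = re.sub('python', 'snake', text, flags=re.IGNORECASE)

print(found)
print(replaced)
```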
replace words in a string with the original case preserved
An excellent example of replacing text while following the case of the original. It is also a good example of a higher-order function.
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace
text = 'UPPER PYTHON, lower python, Mixed Python'
import re
print(re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE))
# 'UPPER SNAKE, lower snake, Mixed Snake'
2.7. Specifying a Regular Expression(regexp) for the Shortest Match, by using the '?' modifier (non-greedy match)
By default, * matches the longest possible text. If a '?' is appended, it matches the shortest.
import re
text1 = 'Computer says "no."'
r = re.findall(r'"(.*)"', text1)
print(r)  # ['no.']
text2 = 'Computer says "no." Phone says "yes."'
# greedy: runs from the first quote to the last one
r = re.findall(r'"(.*)"', text2)
print(r)  # ['no." Phone says "yes.']
# Now add a '?' after '*' for a non-greedy match
r = re.findall(r'"(.*?)"', text2)
print(r)  # ['no.', 'yes.']
2.8. Writing a Regular Expression for Multiline Patterns
By default, '.' does not match a newline character. There are two ways to let a pattern match across newlines:
- by alternation: change r'.*' to r'(?:.|\n)*'
- by using the re.DOTALL flag
s = '''/* aaaa
bbbb
cccc */'''
import re
# choice 1: the re.DOTALL flag
r = re.findall(r'/\*.*\*/', s, flags=re.DOTALL)
print(r)
# choice 2: alternation, no flag needed
r = re.findall(r'/\*(?:.|\n)*\*/', s)
print(r)
the re.DOTALL flag: lets '.' match a newline character
2.9. Normalizing Unicode Text to a Standard Representation, by unicodedata.normalize('NFC', str)
Unicode text may have more than one representation of the same character; see the example in the book
normalizing means giving something a uniform format/type
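A minimal sketch of the problem and the fix, assuming a sample string with an accented character written in two ways: once as a single precomposed code point and once as a base letter plus a combining mark.

```python
import unicodedata

s1 = 'Spicy Jalape\u00f1o'    # precomposed n-with-tilde
s2 = 'Spicy Jalapen\u0303o'   # 'n' followed by a combining tilde

# They look the same when printed but are different strings.
print(s1 == s2, len(s1), len(s2))

# NFC composes characters, NFD decomposes them.
nfc1 = unicodedata.normalize('NFC', s1)
nfc2 = unicodedata.normalize('NFC', s2)
nfd1 = unicodedata.normalize('NFD', s1)
print(nfc1 == nfc2, nfd1 == s2)
```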
2.11. Stripping Unwanted Characters from Strings
str.strip(), lstrip(), rstrip() methods: delete whitespace characters at the beginning or end
s = " a b c \n "
print(s.strip())
print(s.lstrip())
print(s.rstrip())
* delete characters in the middle of a string, by str.replace() or re.sub()
s = " hello word "
print(s.replace(" ", ""))
import re
print(re.sub(r"\s+", " ", s))
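The strip methods also accept an optional string of characters to remove instead of whitespace. A short sketch with a made-up string; note that the space in the middle is left alone.

```python
t = '-----hello world====='

# Strip only the given characters, only from the ends.
left = t.lstrip('-')
both = t.strip('-=')

print(left)
print(both)
```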
* create a generator object from an expression, by using '(' instead of '[', like lazy evaluation in other languages
s = '''
import os.path
rst = ""
if os.path.isfile(""):
    with open("", "r") as f:
        rst = f.read()
'''
lines = s.split("\n")
stripped = (line.strip() for line in lines)
print(stripped)  # a generator object; nothing has been stripped yet
for line in stripped:
    print(line)
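To make the lazy evaluation visible, here is a small sketch with a hypothetical helper, clean(), that records each line at the moment it is actually processed.

```python
seen = []

def clean(line):
    seen.append(line)      # record when the work actually happens
    return line.strip()

lines = ["  a  ", " b\n", "c "]
lazy = (clean(x) for x in lines)   # nothing has been cleaned yet
before = list(seen)

first = next(lazy)                 # cleans exactly one line
after_first = list(seen)
rest = list(lazy)                  # cleans the remaining lines

print(before, first, after_first, rest)
```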
2.12. Sanitizing and Cleaning Up Text
str.translate() method: remaps characters according to a table/dictionary; the book gives many Unicode examples
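A minimal sketch of str.translate() on an invented string: the table maps code points to replacement strings, or to None to delete the character. str.maketrans() is a convenient way to build the same table.

```python
s = 'python\tis\fawesome\r\n'

# Keys are code points; None deletes the character.
remap = {ord('\t'): ' ', ord('\f'): ' ', ord('\r'): None}
cleaned = s.translate(remap)

# The same table built with str.maketrans(from_chars, to_chars, delete_chars).
table = str.maketrans('\t\f', '  ', '\r')
same = s.translate(table)

print(repr(cleaned))
```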
2.13. Aligning Text Strings
the str.ljust(), str.rjust(), str.center() methods
accept a width and an optional fill character
print("aaa".ljust(20, "b"))
print("aaa".rjust(20, "-"))
print("aaa".center(20, "="))
print("aaa".center(20))
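One possible use of these methods is lining up columns of text. The rows below are invented data and the column widths are arbitrary.

```python
rows = [('apple', 3), ('kiwi', 12), ('banana', 7)]

# Left-align the name with dots as filler, right-align the count.
lines = [name.ljust(10, '.') + str(count).rjust(4) for name, count in rows]
for line in lines:
    print(line)
```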
the format() function and the str.format() method
print(format("aaa", ">20")) # same as rjust(20)
print(format("aaa", "=<20")) # same as ljust(20, "=")
print(format("aaa", "^20")) # same as center(20)
print("{} {:=^10}".format("abc", 123))
"%s %s" % (a, b) is the old way; new code should use format() instead.
2.14. Combining and Concatenating Strings
by str.join
by + operator
by print function's 'sep' parameter
by format function
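The four options listed above can be sketched briefly. The sample strings and data are made up, and the generator passed to join() handles the non-string items.

```python
parts = ['Is', 'Chicago', 'Not', 'Chicago?']
joined = ' '.join(parts)               # str.join

a, b = 'Is Chicago', 'Not Chicago?'
plus = a + ' ' + b                     # + operator
formatted = '{} {}'.format(a, b)       # str.format

# join() needs strings, so convert other types first.
data = ['ACME', 50, 91.1]
csv = ','.join(str(d) for d in data)

# print() can do the same conversion and joining by itself.
print('ACME', 50, 91.1, sep=',')
print(joined, csv)
```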
2.15. Interpolating Variables in Strings, by str.format() or str.format_map() method
Note: format_map doesn't exist in Python 2.7
print("{name} is {age} years old".format(name="Tom", age=16))
name = "Jim"
age = 18
# print("{name} is {age} years old".format_map(vars()))
format_map accepts a dictionary, while format accepts keyword arguments
the vars() function: the same as locals() when called without an argument
when passed one argument, it is the same as obj.__dict__
s = 'abc'
d = 123
print(vars())
print(locals())
# print(vars(s))
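A short sketch of vars() with an argument, using a hypothetical class Info. It also shows why vars() fails on a plain string: a str instance has no __dict__ attribute.

```python
class Info:
    def __init__(self, name, n):
        self.name = name
        self.n = n

a = Info('Guido', 37)

# vars(obj) returns the instance dictionary itself.
same_dict = vars(a) is a.__dict__
message = '{name} has {n} messages.'.format_map(vars(a))

# Objects without a __dict__, such as str, raise TypeError.
try:
    vars('abc')
    str_has_dict = True
except TypeError:
    str_has_dict = False

print(same_dict, message, str_has_dict)
```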
the dict.__missing__(self, key) method is called when a key does not exist.
If a dict subclass defines this method, a lookup d[key] for a missing key calls it and returns its value instead of raising KeyError. Otherwise a KeyError is raised.
class safedict(dict):
    def __missing__(self, key):
        return '{'+key+'}'
d = safedict()
print(d['name'])
d1 = dict()
# print(d1['name'])
a function that interpolates variables from the caller's environment, just like $var in Perl, by str.format_map
class safedict(dict):
    def __missing__(self, key):
        return '{'+key+'}'
import sys
def sub(text):
    return text.format_map(safedict(sys._getframe(1).f_locals))
name="Jim"
age=18
print(sub("{name} is {age} years old"))
people = {
'name': ['John', 'Peter'],
'age': [56, 64]
}
for i in range(2):
    print('My name is {{name[{0}]}}, I am {{age[{0}]}} years old.'.format(i).format_map(people))
sys._getframe([depth]): like caller in Perl, gets a stack frame
depth defaults to 0, which means the current stack frame. The f_locals attribute holds all local variables. The f_lineno attribute is the current line number.
import sys
print(sys._getframe().f_locals)
print(sys._getframe().f_globals)
print(dir(sys._getframe().f_code))
print(sys._getframe().f_code.co_filename)
print(sys._getframe().f_lineno)
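A minimal sketch of looking one frame up the stack, with two hypothetical functions. sys._getframe() is a CPython implementation detail and is not guaranteed to exist in every Python implementation.

```python
import sys

def caller_info():
    # depth 1 is the frame of whoever called this function
    frame = sys._getframe(1)
    return frame.f_code.co_name, frame.f_lineno

def outer():
    return caller_info()

name, lineno = outer()
print(name, lineno)
```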
2.16. Reformatting Text to a Fixed Number of Columns, by textwrap.fill(text, width, initial_indent='', subsequent_indent='')
import textwrap
s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."
print(s)
print(textwrap.fill(s, 60))
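The two indent options can be sketched as follows, on the same sample text and with an arbitrary width of 40. The indent counts toward the line width.

```python
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Indent only the first line.
wrapped = textwrap.fill(s, 40, initial_indent='    ')
# Indent every line except the first (a hanging indent).
hanging = textwrap.fill(s, 40, subsequent_indent='    ')

print(wrapped)
print(hanging)
```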
get the terminal column size, by os.get_terminal_size().columns
import os
print(os.get_terminal_size().columns)
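os.get_terminal_size() raises OSError when the output is not attached to a terminal (for example when it is piped). One alternative, sketched here, is shutil.get_terminal_size(), which falls back to a default size; the fallback value below is simply the library's usual default.

```python
import shutil

# Returns the fallback size when no terminal can be queried.
size = shutil.get_terminal_size(fallback=(80, 24))
print(size.columns, size.lines)
```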
2.17. Handling HTML and XML Entities in Text
the html.escape(s, quote=True) function:
escaping means converting special characters such as <, > and & to HTML entities (&lt;, &gt;, &amp;)
s = '<a>this is </a>'
import html
print(html.escape(s))
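A short sketch of the quote parameter and of the reverse operation, html.unescape(), on an invented sentence:

```python
import html

s = 'Elements are written as "<tag>text</tag>".'

escaped = html.escape(s)                 # quotes are escaped too
unquoted = html.escape(s, quote=False)   # quotes are left alone
restored = html.unescape(escaped)        # entities back to characters

print(escaped)
print(unquoted)
```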
the str.encode('ascii', errors='xmlcharrefreplace') method: encodes a string to ASCII, replacing each non-ASCII character with an XML character reference
s = 'Spicy Jalapeño'
print(s.encode('ascii', errors='xmlcharrefreplace'))
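A minimal sketch of the round trip: the non-ASCII character becomes a numeric character reference, and html.unescape() turns it back into the original character.

```python
import html

s = 'Spicy Jalape\u00f1o'

# U+00F1 is replaced with its decimal character reference.
encoded = s.encode('ascii', errors='xmlcharrefreplace')
decoded = html.unescape(encoded.decode('ascii'))

print(encoded)
print(decoded == s)
```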