Python Cookbook Reading Notes (Part 2)

Reading notes on the Python Cookbook.

chapter 2: Strings and Text

2.1. Splitting Strings on Any of Multiple Delimiters

Use re.split() with the regexp r'[,;\s]\s*'

difference between str.split and re.split

str.split() only accepts a simple separator; re.split() accepts a regular expression.

return value of re.split

  1. if there are no capture groups, the result is the same as with str.split()
  2. if there are capture groups, the matched delimiters are also returned: the split values are rst[::2] and the separators are rst[1::2]
    s = "I, you; a  seperater.   haha"
    import re
    a = re.split(r'[,;.\s]\s*', s)
    print(a)
    
    a = re.split(r'([,;.\s]\s*)',s)
    print(a, a[::2], a[1::2])
    

iterate over two lists by first zipping them into one

looks nice!

# Reform the line using the same delimiters
''.join(v+d for v,d in zip(values, delimiters))
'asdf fjdk;afed,fjek,asdf,foo'

regexp non-capture groups: r'(?:…)'
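
A quick check of what the non-capture group buys you: with (?:…), re.split() groups without returning the delimiters (the sample string is the one from above):

```python
import re

s = "I, you; a  seperater.   haha"
# (?:...) groups for repetition without capturing, so re.split()
# does not include the delimiters in its result
parts = re.split(r'(?:[,;.\s])\s*', s)
print(parts)  # ['I', 'you', 'a', 'seperater', 'haha']
```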

2.2. Matching Text at the Start or End of a String, by str.startswith() or str.endswith() method

filename = "aaaa.txt"
filename.endswith(".txt")
# pass a tuple to check against multiple choices
filename.endswith((".c", ".h"))
from urllib.request import urlopen
def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):
	return urlopen(name).read()
    else:
	with open(name) as f:
	    return f.read()

The parameters are plain strings.

Compared to re.match, str.startswith looks nice.

2.3. Matching Strings Using Shell Wildcard Patterns, with fnmatch.fnmatch(), fnmatch.fnmatchcase()

Shell wildcard:

  • [] : a charset
  • * : match any length of chars
  • ? : match only one char
from fnmatch import fnmatch
print(fnmatch("data 1.txt", "*[0-9]*"))
  1. the pattern must match the whole string
  2. compared to startswith(), fnmatch can match at any position
  3. compared to regexps, fnmatch looks nicer
  4. fnmatch uses the same case-sensitivity rule as the OS; fnmatchcase is always case-sensitive
  5. it sits between plain string methods and the full power of regexps
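
A small check of points 1 and 4 (the filenames are made up):

```python
from fnmatch import fnmatch, fnmatchcase

# fnmatchcase always respects case, regardless of OS
a = fnmatchcase('Data1.TXT', '*.txt')   # False
b = fnmatchcase('data1.txt', '*.txt')   # True
# the pattern must match the whole string, so a prefix is not enough
c = fnmatch('data1.txt', 'data')        # False
print(a, b, c)
```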

2.4. Matching and Searching for Text Patterns

What's the difference between matching and searching?

the str.find() function: find the start index of a substring

s = "Hello xxx bbbb"
print(s.find("xx"))

re.compile() function: compile a regexp string into a regexp object, for performance

If you use a regexp many times, compiling it first is worthwhile. If you only use it once, skip the compile step.

difference between r'\d' and '\d'

If the string is prefixed with 'r', backslashes are not interpreted by the string parser. (For '\d' the two happen to be equal anyway, because \d is not a recognized string escape and Python leaves it as a literal backslash plus 'd'; still, always use raw strings for regexps.)

re.findall() function, find all matched data as a list

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
import re
rg = r'\d+/(?:\d+)/(?:\d+)'
a = re.match('Today', text)
print(a.group(0))
a = re.findall(rg, text)
print(a)
print(type(a[0]))

The return value: if the pattern has capture groups, the captured data is returned, as a list of strings for one group or a list of tuples for several groups. With no capture groups, the whole matched text is returned for each match.

re.finditer(): find all matches as an iterator

Its return value differs from re.findall(): it yields match objects, the same kind returned by re.match(). That looked strange and inconsistent to me at first.

re.match() function, always match at the start of a string

re.match() function, return value

  • rst.group(0): the whole matched text
  • rst.group(1): the first captured group
  • rst.groups(): all captured groups as a tuple
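
These accessors can be seen side by side (reusing the date pattern from earlier):

```python
import re

m = re.match(r'(\d+)/(\d+)/(\d+)', '11/27/2012')
print(m.group(0))   # '11/27/2012' -- the whole match
print(m.group(1))   # '11'         -- the first captured group
print(m.groups())   # ('11', '27', '2012')
```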

2.5. Searching and Replacing Text

the str.replace method, replace all occurrences in a string

str.replace(old, new)

text = 'yeah, but no, but yeah, but no, but yeah'
print(text.replace('yeah', 'yep'))
# 'yep, but no, but yep, but no, but yep'

the re.sub(pattern, replacement, text) function also replaces all occurrences in a string

use r'\1' to refer to the first captured group

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
import re
print(re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))
# 'Today is 2012-11-27. PyCon starts 2013-3-13.'

the re.sub(pattern, callback, text) form also replaces all occurrences in a string

The second parameter can also be a function; its argument is a match object (the same kind returned by re.match()).

The same example as the above one:

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
import re
def foo(m):
    (m, d, y) = m.groups()
    return '-'.join([y,m,d])

print(re.sub(r'(\d+)/(\d+)/(\d+)', foo, text))

the re.subn(…) function: same as re.sub(), but also returns the substitution count
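
Reusing the date example, subn() returns a (new_text, count) pair:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
newtext, n = re.subn(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
print(newtext)  # 'Today is 2012-11-27. PyCon starts 2013-3-13.'
print(n)        # 2
```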

2.6. Searching and Replacing Case-Insensitive Text

To do case-insensitive operations, use a regexp with the re.IGNORECASE flag keyword parameter

replace words in a string with original case preserved

an excellent example of replacing text while preserving the original case, and a very good example of a higher-order function.

def matchcase(word):
    def replace(m):
	text = m.group()
	if text.isupper():
	    return word.upper()
	elif text.islower():
	    return word.lower()
	elif text[0].isupper():
	    return word.capitalize()
	else:
	    return word

    return replace

text = 'UPPER PYTHON, lower python, Mixed Python'
import re
print(re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE))
# 'UPPER SNAKE, lower snake, Mixed Snake'

2.7. Specifying a Regular Expression (regexp) for the Shortest Match, by using the modifier '?' for non-greedy matching

By default, '*' matches the longest text; with a '?' appended it matches the shortest.

import re
text1 = 'Computer says "no."'
r= re.findall(r'"(.*)"', text1)
print(r)

text2 = 'Computer says "no." Phone says "yes."'
r= re.findall(r'"(.*)"', text2)
print(r)

# Now add a '?' after '*', no greedy match
r= re.findall(r'"(.*?)"', text2)
print(r)

2.8. Writing a Regular Expression for Multiline Patterns

By default, '.' does not match a newline character. There are two ways to make it match one:

  1. by alternative. change r'.*' to r'(?:.|\n)*'
  2. by use the re.DOTALL flag
    s = '''/* aaaa
    bbbb
    cccc */'''
    import re
    print(re.findall(r'/\*.*\*/', s, flags=re.DOTALL))
    print(re.findall(r'/\*(?:.|\n)*\*/', s))
    

the re.DOTALL flag: let '.' match a newline character

2.9. Normalizing Unicode Text to a Standard Representation, by unicodedata.normalize('NFC', str)

unicode may have more than one representation, see example in the book

normalizing means giving text a single, uniform representation
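
A minimal sketch of the problem and the fix: the same visible string can be composed (one code point for ñ) or decomposed (n plus a combining tilde), and NFC normalization makes them compare equal:

```python
import unicodedata

s1 = 'Spicy Jalape\u00f1o'    # composed: one code point for the ñ
s2 = 'Spicy Jalapen\u0303o'   # decomposed: 'n' + combining tilde
print(s1 == s2)               # False, although they look identical

t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2)               # True after normalization
```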

2.11. Stripping Unwanted Characters from Strings

str.strip(), lstrip(), rstrip(): delete whitespace characters at the beginning or end

s = "    a b c \n ";
print(s.strip())
print(s.lstrip())
print(s.rstrip())

* delete or collapse characters in the middle of a string, by str.replace() or re.sub()

s = "   hello     word    ";
print(s.replace(" ", ""))
import re
print(re.sub(r"\s+", " ", s))

* create a generator object from an expression, by using '(' instead of '[', like lazy evaluation in other languages

s = '''
import os.path
rst = ""
if os.path.isfile(""):
    with open("", "r") as f:
	rst = f.read()
'''
ss = s.split("\n")

s1 = (s.strip() for s in ss)
print(s1)
for s in s1:
    print(s)

2.12. Sanitizing and Cleaning Up Text

the str.translate() method: remap characters given a table/dictionary; the book gives many Unicode examples
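
A small sketch of a cleanup table (the whitespace remapping is one of the book's use cases; the sample string is made up): map code points to replacement characters, or to None to delete them.

```python
remap = {
    ord('\t'): ' ',   # tab -> space
    ord('\f'): ' ',   # form feed -> space
    ord('\r'): None,  # delete carriage returns entirely
}
s = 'python\fis\tawesome\r\n'
cleaned = s.translate(remap)
print(cleaned)  # 'python is awesome\n'
```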

2.13. Aligning Text Strings

the str.ljust(), str.rjust(), str.center() functions

accept a width, and an optional fill character

print("aaa".ljust(20, "b"))
print("aaa".rjust(20, "-"))
print("aaa".center(20, "="))
print("aaa".center(20))

the format function and the str.format methods

print(format("aaa", ">20")) # same as rjust(20)
print(format("aaa", "=<20")) # same as ljust(20, "=")
print(format("aaa", "^20")) # same as center(20)
print("{} {:=^10}".format("abc", 123))

"%s %s" % (a, b) is old way, now should use the new way.

2.14. Combining and Concatenating Strings

by str.join

by + operator

by print function's 'sep' parameter

by format function

2.15. Interpolating Variables in Strings, by the str.format() or str.format_map() method

Note: format_map() doesn't exist in Python 2.7

print("{name} is {age} years old".format(name="Tom", age=16))

name = "Jim"
age = 18
# print("{name} is {age} years old".format_map(vars()))

format_map() accepts a dictionary, while format() accepts keyword arguments

the vars() function: with no argument, the same as locals()

if passed one argument, it is the same as obj.__dict__

s = 'abc'
d = 123
print(vars())
print(locals())
# print(vars(s))

the dict.__missing__(self, key) method is called when a key is missing, so no KeyError is raised.

If this method is defined, it is called for a missing key and its return value is used. Otherwise a KeyError is raised.

class safedict(dict):
    def __missing__(self, key):
	return '{'+key+'}'

d = safedict();
print(d['name'])
d1 = dict();
# print(d1['name'])

a function that interpolates variables from the calling frame, just like $var in Perl, using str.format_map()

class safedict(dict):
    def __missing__(self, key):
	return '{'+key+'}'


import sys
def sub(text):
    return text.format_map(safedict(sys._getframe(1).f_locals))

name="Jim"
age=18
print(sub("{name} is {age} years old"))
people = {
   'name': ['John', 'Peter'],
   'age': [56, 64]
}

for i in range(2):
    print('My name is {{name[{0}]}}, I am {{age[{0}]}} years old.'.format(i).format_map(people))

sys._getframe([depth]): like caller in Perl, gets a stack frame

depth defaults to 0, the current stack frame. The f_locals attribute holds all local variables; f_lineno is the line number.

import sys
print(sys._getframe().f_locals)
print(sys._getframe().f_globals)
print(dir(sys._getframe().f_code))
print(sys._getframe().f_code.co_filename)
print(sys._getframe().f_lineno)

2.16. Reformatting Text to a Fixed Number of Columns, by textwrap.fill(astr, width, initial_indent='', subsequent_indent='')

import textwrap
s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."

print(s)
print(textwrap.fill(s, 60))

get the terminal width, by os.get_terminal_size().columns

import os
print(os.get_terminal_size().columns)

2.17. Handling HTML and XML Entities in Text

the html.escape(astr, quote=True) function:

escape means converting special characters such as '<' and '>' to their HTML entities

s = '<a>this is </a>'
import html
print(html.escape(s))

the str.encode('ascii', errors='xmlcharrefreplace') method: encode a string to ASCII, replacing non-ASCII characters with XML character references

s = 'Spicy Jalapeño'
print(s.encode('ascii', errors='xmlcharrefreplace'))

Python Cookbook Reading Notes (Part 1)

Reading notes on the Python Cookbook.

chapter 1: Data Structures and Algorithms

1.6 Mapping Keys to Multiple Values in a Dictionary, by list value and defaultdict

example: create a dictionary with default value, by defaultdict

dictionary's property:

  1. you can add a new key to a dictionary
  2. but accessing a key that does not exist raises an error
  3. defaultdict is used to fix this problem.
from collections import defaultdict
d = defaultdict(list)
d['aaa'].append(2)

1.6 example

from collections import defaultdict
frequency = defaultdict(int)
frequency['colorless'] = 4
frequency['ideas'] # will be 0

frequency = defaultdict(list)
# first, frequency['colorless'] returns an empty list, then one element is appended to it.
frequency['colorless'].append(4)
frequency['ideas'] # will be []

# Or you can pass a function take no arguments 

# the idiom:
my_dictionary = defaultdict(function to create default value)
for item in sequence:
    my_dictionary[item_key] is updated with information about item

1.7 Keeping Dictionaries in Order, OrderedDict

from collections import OrderedDict
d = OrderedDict()
# the insertion order will be reserved.

An typical application is when for serilization.
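
For instance (a minimal sketch, with made-up keys), insertion order carries straight through to the JSON output:

```python
import json
from collections import OrderedDict

d = OrderedDict()
d['foo'] = 1
d['bar'] = 2
d['spam'] = 3
# keys come out in insertion order, so the JSON field order is stable
out = json.dumps(d)
print(out)  # '{"foo": 1, "bar": 2, "spam": 3}'
```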

1.8. Calculating with Dictionaries

get max key/value in a dictionary, based on the value, by inverting the dict

max(zip(`dict`.values(), `dict`.keys()))
# another solution
# max(`dict`, key=lambda k:`dict`[k])

first convert the dict to a list of (value, key) pairs; max() then compares values first and keys second.

what max() returns for different containers

For a tuple or list it is just the largest element, but for a dict it is only a key. Why? Because max() takes an iterable as its first parameter, and iterating over a dictionary yields its keys.

understanding multiple-value binding (tuple unpacking)

---> (b11, b12) = 1 TypeError: 'int' object is not iterable

The right-hand side must be an iterable; every element in it is assigned to a variable on the left-hand side, with each variable consuming exactly one element. If the numbers of elements and variables don't match, an error is raised.

To consume more than one value, use the '*varname' form; the variable 'varname' then becomes a list of elements.

Back to the error printed in the example: the 'int' object refers to the '1' on the right side.

PS: I find Python much simpler and more fun than Java.
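
The unpacking rules above can be checked directly (a small sketch; the record values are made up):

```python
# star unpacking: '*phones' consumes the extra elements as a list
record = ('Dave', 'dave@example.com', '773-555-1212', '847-555-1212')
name, email, *phones = record
print(name)    # 'Dave'
print(phones)  # ['773-555-1212', '847-555-1212']

# the failing case from the example: the right-hand side must be iterable
try:
    (b11, b12) = 1
except TypeError:
    print('cannot unpack an int')
```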

1.9. Finding Commonalities in Two Dictionaries

a dictionary's d.keys() and d.items() views support set operations. So to find the keys/items common to two dictionaries, just use the set intersection operator '&'.

get all keys as an iterable from a dictionary, by keys()

`dict`.keys()

get all values as an iterable from a dictionary, by values()

`dict`.values()

get all key, value pairs as an iterable from a dictionary, by items()

`dict`.items()

set operations

  • '|': union
  • '&': intersection
  • '-': difference
  • 's1 < s2': check if s1 is a subset of s2

Example:

In:  e.keys(), d.keys()
Out: (dict_keys([1, 4, 'a', 9]), dict_keys([1, 3, 5]))

In:  e.keys() & d.keys()
Out: {1}

In:  e.keys() | d.keys()
Out: {1, 3, 4, 5, 'a', 9}

In:  e.keys() - d.keys()
Out: {9, 4, 'a'}

1.10. Removing Duplicates from a Sequence while Maintaining Order

problem: what does hashable mean (and the link to the Python glossary)

From the python glossary: https://docs.python.org/3/glossary.html

An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.

Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.

All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal, and their hash value is their id().

how yield/generator iterator is implemented

From the glossary: it works by suspending the function, returning the value, and saving the current state. On the next call it resumes from where it was suspended. Great, I understand this now.

generator A function which returns a generator iterator. It looks like a normal function except that it contains yield expressions for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Usually refers to a generator function, but may refer to a generator iterator in some contexts. In cases where the intended meaning isn’t clear, using the full terms avoids ambiguity.

generator iterator An object created by a generator function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator iterator resumes, it picks-up where it left-off (in contrast to functions which start fresh on every invocation).

file object is also an iterable, the element is a line

with open(somefile,'r') as f:
    for line in f:
	print(line)

a function that delete all duplicates in a list, with order preserved

def dedupe(items, key=None):
    seen = set()
    for item in items:
	val = item if key is None else key(item)
	if val not in seen:
	    yield item
	    seen.add(val)

If the elements are hashable, the key function is not needed. Otherwise, provide a function that converts each element to something hashable.

examples:

>>> a = [{'x':1, 'y':2}, {'x':1, 'y':3}, {'x':1, 'y':2}, {'x':2, 'y':4}]
>>> list(dedupe(a, key=lambda d: (d['x'], d['y'])))
[{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
>>> list(dedupe(a, key=lambda d: d['x']))
[{'x': 1, 'y': 2}, {'x': 2, 'y': 4}]

delete all duplicates in a list, don't preserve order, by set

set(`list`)

Then all duplicate elements in list will be removed.

the a?b:c expression in python, if else in one line

val = b if a else c

looks good

1.11. Naming a Slice

the slice object

create a slice

a=[1,2,3,4]
s = slice(1,2)
print(a[s])
print(a[1:2])

'1:2' is just a shortcut to 'slice(1,2)'

slice attributes

s = slice(1,2,2)
print(s.start, s.stop, s.step)

1.12. Determining the Most Frequently Occurring Items in a Sequence

A method of my own

a = [1, 2, 1, 3, 2,3,3]
from collections import defaultdict
d = defaultdict(int)
# b = [d[k]+=1 for k in a]  # syntax error here
for k in a:
    d[k]+=1

r = max(zip(d.values(), d.keys()))
print(r[1])

the collections.Counter class: maps each element of a list to its count

a = [1, 2, 1, 3, 2,3,3]
from collections import Counter
b = Counter(a)
c = b.most_common(1)
print(c[0][0])

# get the count
print(b[3]) # 3 is the element in a


# update with more words
b.update([4, 2, 5])

# and a Counter object support the math operations: '+' and '-'

When you need to count things, use the Counter class. It is such a small thing, yet in practice I always used to write it from scratch.

1.13. Sorting a List of Dictionaries by a Common Key

the operator.itemgetter function

it returns a callable that can be passed as sorted()'s key parameter, for lists of tuples or dictionaries

# return value of
import operator
operator.itemgetter("name")
# is the same as this one
lambda r:r["name"]
# but the former  is a little faster

Again such a tiny feature; why make it so elaborate?

1.14. Sorting Objects Without Native Comparison Support

the operator.attrgetter function

it returns a callable that can be passed as sorted()'s key parameter, for user-defined classes

class User():
    def __init__(self, name):
	self.name = name

    def __repr__(self):
	return 'User({})'.format(self.name)

# return value of
operator.attrgetter("name")
# is the same as this one
lambda o:o.name
# but the former  is a little faster

1.15. Grouping Records(a sequence of dictionaries) Together Based on a Field

the itertools.groupby function: groups consecutive elements of a sorted list into (key, group) pairs

import itertools
from operator import itemgetter

rows = [{1: 2}, {1: 4}, {1: 3}]
# groupby only groups consecutive elements, so sort by the key first
rows.sort(key=itemgetter(1))
a = itertools.groupby(rows, key=itemgetter(1))  # a is a generator

another way is just to use a defaultdict(list) to group; then no sorting is needed.
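
A sketch of that defaultdict alternative (the rows are made up for illustration):

```python
from collections import defaultdict

rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5800 E 58TH',  'date': '07/02/2012'},
    {'address': '5148 N CLARK', 'date': '07/01/2012'},
]

# group by date without sorting: each key gets its own list
by_date = defaultdict(list)
for row in rows:
    by_date[row['date']].append(row)

for date, items in by_date.items():
    print(date, len(items))
```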

1.16. Filtering Sequence Elements

To filter, just use a list comprehension with an if condition

itertools.compress function, a filtering tool

it takes two parameters:

  1. an iterable to be compressed
  2. a Boolean sequence of the same length; when an element of this sequence is True, the element at the same position in the first iterable is included in the output

    An example:

addresses = [
    '5412 N CLARK',
    '5148 N CLARK',
    '5800 E 58TH',
    '2122 N CLARK',
    '5645 N RAVENSWOOD',
    '1060 W ADDISON',
    '4801 N BROADWAY',
    '1039 W GRANVILLE',
]
counts = [ 0, 3, 10, 4, 1, 7, 6, 1]

import itertools
b = [e > 5 for e in counts]
# materialize the iterator to see all addresses whose count is larger than 5
a = list(itertools.compress(addresses, b))
print(a)

1.17. Extracting a Subset of a Dictionary

dictionary comprehensions: just like list comprehensions, but with '{' instead of '['

prices = {
    'ACME': 45.23,
    'AAPL': 612.78,
    'IBM': 205.55,
    'HPQ': 37.20,
    'FB': 10.75
}
# Make a dictionary of all prices over 200
p1 = { key:value for key, value in prices.items() if value > 200 }
# Make a dictionary of tech stocks
tech_names = { 'AAPL', 'IBM', 'HPQ', 'MSFT' }
p2 = { key:value for key,value in prices.items() if key in tech_names }

1.18. Mapping Names to Sequence Elements

the collections.namedtuple function: maps indexes to names, so elements can be accessed by name

example:

from collections import namedtuple
People =  namedtuple('People', ['name', 'age'])
p = People(name='Jim', age=12)
print(p, p.name, p.age)

A good application: for database selection.

The ._replace method: a tuple is immutable, so to change a field, _replace returns a new tuple with that field replaced. A typical usage is to first create a prototype instance with default values for all fields, then update some fields with _replace. Why the leading '_' in the name? To avoid clashes with field names, which become attributes.

from collections import namedtuple
Stock = namedtuple('Stock', ['name', 'shares', 'price', 'date', 'time'])
# Create a prototype instance
stock_prototype = Stock('', 0, 0.0, None, None)
# Function to convert a dictionary to a Stock
def dict_to_stock(s):
    return stock_prototype._replace(**s)

a = {'name': 'ACME', 'shares': 100, 'price': 123.45}
dict_to_stock(a)
# Stock(name='ACME', shares=100, price=123.45, date=None, time=None)

1.19. Transforming and Reducing Data at the Same Time

use a generator expression as the argument

A reducing function takes a sequence and returns a single value.

the any() function: check whether any element of an iterable is True

check if any .py files exist in a directory

# Determine if any .py files exist in a directory
import os
files = os.listdir('dirname')
if any(name.endswith('.py') for name in files):
    print('There be python!')
else:
    print('Sorry, no python.')

get all files in a directory as a list

import os
files = os.listdir('dirname')

change a tuple/list/iterable to a csv line

This is much better than the string format method

# Output a tuple as CSV
s = ('ACME', 50, 123.45)
print(','.join(str(x) for x in s))

1.20. Combining Multiple Mappings into a Single Mapping

the collections.ChainMap

combining many maps/dictionaries: a lookup tries the first map, then the second, and so on

Operations that mutate the mapping always affect the first map/dictionary.

typical application: scoped variable in a programming language.

Difference from dict.update(): ChainMap keeps references to the original dictionaries, while dict.update() copies everything into a new one.
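
Both halves of that statement can be demonstrated (a small sketch with made-up keys): writes go to the first mapping, and later changes to the underlying dict show through:

```python
from collections import ChainMap

a = {'x': 1, 'z': 3}
b = {'y': 2, 'z': 4}
c = ChainMap(a, b)

# writes always go to the first mapping...
c['w'] = 40
print(a)        # {'x': 1, 'z': 3, 'w': 40}

# ...and since ChainMap keeps references, later changes to a show through
a['x'] = 100
print(c['x'])   # 100
```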

  • check whether an element exists in any of several dictionaries/maps, searched in order
    a = {'x': 1, 'z': 3 }
    b = {'y': 2, 'z': 4 }
    from collections import ChainMap
    c = ChainMap(a,b)
    print(c['x']) # Outputs 1 (from a)
    print(c['y']) # Outputs 2 (from b)
    print(c['z']) # Outputs 3 (from a)
    


Chinese Word Segmentation Based on Hidden Markov Models

This article describes hidden Markov models and their application to Chinese word segmentation: building a hidden Markov model for segmentation from a segmentation corpus, then solving it with the Viterbi algorithm.

I. Introduction to Hidden Markov Models

A hidden Markov model involves two sequences: an observation sequence and a hidden sequence. One problem the model solves is: given an observation sequence, find the corresponding hidden sequence. In speech recognition, for instance, the observations are the elements of the audio and the hidden sequence is the text they correspond to; in machine translation, the observations are the source language and the hidden sequence is the target language. The known quantities are, first, the mapping from each hidden element to each observed element, and second, the transition relation between hidden elements. Two assumptions are made: each hidden element depends only on the one immediately before it, and each hidden element alone determines the corresponding observed element. The two known quantities can be expressed as two matrices, which can be estimated statistically from a segmentation corpus.

Mathematically, given the observation sequence we seek the hidden sequence. One observation sequence may correspond to countless hidden sequences; the target is the most probable one, i.e. the problem is to maximize the conditional probability of the hidden sequence given the observations.

Although the hidden Markov model makes rather strong assumptions, strong in the sense that real data may not satisfy them at all, it applies widely and solves complex problems simply and effectively. This is a characteristic of statistical methods: given a large amount of data, simple methods can solve complex problems. This may be one power of big data: with enough data the model can be simplified, while with little data a very precise model is needed to get the desired result. Model complexity and data volume trade off against each other.

The figure below is the probabilistic graph of a hidden Markov model, where \(x\) denotes the hidden sequence and \(y\) the observation sequence. (figure: hmm.png)

The hidden sequence is also called a Markov chain. A Markov chain is a sequence in which each element depends only on the element immediately before it and is unrelated to all the others.

To use a hidden Markov model we must know two matrices. The first gives the probability of transitioning from one hidden element to another; with n hidden states this is an n-by-n matrix. The second gives the probability of a hidden element emitting an observed element; with n hidden states and m observation symbols this is an n-by-m matrix. Both can be estimated from a large amount of statistical data.

II. A Hidden Markov Model for Chinese Word Segmentation

Chinese word segmentation is the problem of dividing a passage of Chinese text into individual words or single characters; it is the basis of all subsequent Chinese NLP. Treat the running text as the observation sequence, and the segmentation state of each character as the hidden sequence. In the simplest form each character's state takes one of two values: E, meaning the character ends a word, or B, meaning it does not. Segmentation is then a standard hidden Markov model. In practice each character's state takes one of four values: word beginning, word middle, word end, and single-character word, denoted B, M, E, and S (BMES).

III. A Chinese Segmentation Corpus

Segmentation corpora from Peking University and the University of Hong Kong can be downloaded free online. Such a corpus is essentially a text document in which each word is separated by two spaces. From it we can estimate the two parameter matrices of the hidden Markov model. By the law of large numbers, a probability equals a ratio of counts, so for accuracy we need a large amount of corpus data to compute the model parameters.

IV. Solving with the Viterbi Algorithm

Given the two parameter matrices of the model and an observation sequence, finding the hidden sequence can be viewed as something like a shortest-path problem in graph theory.

The Viterbi algorithm solves it with dynamic programming. The subproblem is: for each element \(A_j\) of column \(i\), find the shortest path ending at that element. Knowing the shortest path to every element of column \(i\), the shortest paths to the elements of column \(i+1\) can be computed.

The final answer is the shortest path, among all elements of column \(n\), with the smallest value.

V. Implementation
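
As a sketch of the decoding just described, a toy BMES Viterbi decoder might look like the following. The states follow the text, but every probability below is invented for illustration (a real segmenter would estimate them from a corpus by counting), and the "characters" are a pretend alphabet x/y/z:

```python
# Toy Viterbi decoder for the BMES segmentation states described above.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: probability of the best path ending in state s at step t
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-12) for s in states}]
    path = {s: [s] for s in states}

    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for s in states:
            # pick the best predecessor state for s at step t
            prob, prev = max(
                (V[t-1][p] * trans_p[p].get(s, 1e-12)
                 * emit_p[s].get(obs[t], 1e-12), p)
                for p in states)
            V[t][s] = prob
            newpath[s] = path[prev] + [s]
        path = newpath

    prob, last = max((V[-1][s], s) for s in states)
    return prob, path[last]

states = ('B', 'M', 'E', 'S')
start_p = {'B': 0.6, 'M': 0.0, 'E': 0.0, 'S': 0.4}
trans_p = {
    'B': {'M': 0.3, 'E': 0.7},
    'M': {'M': 0.2, 'E': 0.8},
    'E': {'B': 0.5, 'S': 0.5},
    'S': {'B': 0.5, 'S': 0.5},
}
# emission probabilities for the pretend three-character alphabet
emit_p = {
    'B': {'x': 0.6, 'y': 0.3, 'z': 0.1},
    'M': {'x': 0.1, 'y': 0.5, 'z': 0.4},
    'E': {'x': 0.1, 'y': 0.3, 'z': 0.6},
    'S': {'x': 0.4, 'y': 0.3, 'z': 0.3},
}

prob, tags = viterbi('xyz', states, start_p, trans_p, emit_p)
print(tags)  # ['B', 'M', 'E']: 'xyz' read as one three-character word
```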

Generative and Discriminative Models

A statistical probabilistic model predicts, given input \(x\), the probability of output \(y\), i.e. it computes the conditional probability \(P(y|x)\). Such models come in two kinds: generative models and discriminative models.

  • If the model learns the joint distribution \(P(x, y)\), it is a generative model. Having \(P(x, y)\), Bayes' rule gives indirectly

\(P(y|x) = \frac{P(x, y)}{P(x)}\).

  • If the model learns the conditional distribution \(P(y|x)\), it is a discriminative model, and the answer is computed directly.

Common generative and discriminative models:

Generative model      Discriminative model
Naive Bayes           Logistic regression
Hidden Markov model   Conditional random field

Why "generative"?

The "generative" in generative model means: from the training data, the model learns the probability of generating \(x\) given \(y\), i.e. it learns \(P(x|y)\). Taking naive Bayes as an example, the solution is

\begin{equation} y^* = \underset{y}{\operatorname{argmax}} P(y|x) = \underset{y}{\operatorname{argmax}} P(x|y)P(y) \end{equation}

\(P(x|y)\) predicts, given the class \(y\), the likely values of \(x\); in other words, it generates values of \(x\).

Differences

Because a discriminative model learns \(P(y|x)\) directly, with the same amount of training data its predictions are usually better than a generative model's.

Generative models are more powerful in what they can do.

NLP Glossary

Term                        English                      Meaning
part-of-speech tagging      part of speech, POS tagging  given a sentence, tag each word's part of speech
N-gram                      N-gram                       a document representation model, see here
bag of words                bag of words                 a document representation model
stop word                   stop word                    very common words such as the, a, of
stemming                    stemming                     the stem of a word
named entity recognition    NER                          identify person, place, and organization names in a sentence

Language Models

The definition of a language model has two parts:

  1. a vocabulary \(V\), the set of all words of the language, each written \(w_i\)
  2. a probability function \(P(s)\), where \(s\) ranges over the set of all sentences, i.e. sequences of any length formed from words \(w_i\) of the vocabulary \(V\).

A language model models how likely sentences are. Suppose the language is the set of all possible sentences, where a sentence is an ordered sequence of arbitrary words, in arbitrary order, of arbitrary length. Writing such a sentence as \(s\), the language model is a probability distribution \(P(s)\) whose value is the probability of the sentence.

Example

Suppose the vocabulary of a language contains only three words:

[you, are, ok]

All sentences produced from this vocabulary are listed below.

Of length one there are 3:

you
are
OK

Of length two there are 9:

you you
you are
you OK
are you
are are
are OK
OK you
OK are
OK OK

Of length three there are 27:

you you you
you you are
you you OK
you are you
you are are
you are OK
...
are you OK
...
OK OK OK

Of length four there are 3×3×3×3, of length five 3×3×3×3×3, and so on: the sequence is infinite.

Then a probability distribution \(P(s)\) gives the probability of every sentence \(s\). Here are some example values this function might take:

\begin{equation} P(are\ you\ OK) = 0.008 \\ P(you\ are\ OK) = 0.002\\ P(you\ you) = 0 \\ P(OK) = 0.01 \\ P(you) = 0\\ \end{equation}

The example says sentence \("are\ you\ OK"\) has probability \(0.008\) and sentence \("you\ you"\) has probability \(0\).

Such a function \(P(s)\) is the language model of this language.

Building a Language Model

Given a sentence \(s = w_1w_2w_3...w_n\), where each \(w_i\) is a word, the language model must compute the function

\begin{equation} P(s)=P(w_1w_2w_3...w_n) \\ = P(w_1)P(w_2|w_1)P(w_3|w_1w_2)...P(w_n|w_1w_2...w_{n-1})\\ = \prod_i{P(w_i|w_1w_2...w_{i-1})} \end{equation}

where

  • \(P(w_1w_2w_3...w_n)\) is the joint probability of the word sequence \(w_1w_2w_3...w_n\)
  • \(P(w_1)P(w_2|w_1)P(w_3|w_1w_2)...P(w_n|w_1w_2...w_{n-1})\) is the result of applying the chain rule

The formula says a sentence's probability is the product of its word probabilities, each word's probability depending on all the words before it. With a vocabulary of 5000 words and sentences of length 3 there are already 125 billion possibilities, which cannot be computed in practice.

To make the model computable, an assumption is needed. Assume each word depends only on the single word before it and on nothing earlier; the formula then becomes

\begin{equation} P(s)=P(w_1w_2w_3...w_n) = \prod_i{P(w_i|w_{i-1})} \end{equation}

This simplified form is the bigram model. Similarly there is the trigram model, in which each word depends on the two words before it:

\begin{equation} P(s)=P(w_1w_2w_3...w_n) = \prod_i{P(w_i|w_{i-1}w_{i-2})} \end{equation}

The trigram model is the one usually used in practice.

The per-word probabilities are computed from a corpus, which can consist of any documents. For the bigram model, each probability is computed as

\begin{equation} P(w_i|w_{i-1})= \frac{c(w_{i-1}w_i)} {\sum_w{c(w_{i-1}w)}} \end{equation}

where

  • \(c(w_{i-1}w_i)\) is the count of bigrams beginning with \(w_{i-1}\) and ending with \(w_i\)
  • \(\sum_w{c(w_{i-1}w)}\) is the count of bigrams beginning with \(w_{i-1}\) and ending with any word.

The probability is the ratio of these two counts.

Smoothing

There is a problem with the computation above: for some sentences the computed probability will be zero.

The issue is that no corpus can contain every possible word sequence; when a word combination does not occur in the corpus, its probability becomes zero. Smoothing methods solve this; the simplest is additive smoothing.

The basic idea of additive (add-one) smoothing is to add one to every bigram count:

\begin{equation} P(w_i|w_{i-1})= \frac{c(w_{i-1}w_i) + 1} {\sum_w{(c(w_{i-1}w) + 1)}} \end{equation}

After this, every computed probability is greater than 0.

There are many other smoothing methods, such as Good-Turing smoothing and Katz smoothing.
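
The counting described above can be sketched in a few lines (a toy two-sentence corpus, invented for illustration; real models use large corpora):

```python
from collections import Counter

corpus = [
    ['<s>', 'you', 'are', 'ok', '</s>'],
    ['<s>', 'are', 'you', 'ok', '</s>'],
]

bigrams = Counter()   # c(w1 w2)
starts = Counter()    # sum over w of c(w1 w)
vocab = set()
for sent in corpus:
    vocab.update(sent)
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        starts[w1] += 1

def p_mle(w2, w1):
    # P(w2 | w1) = c(w1 w2) / c(w1 .)
    return bigrams[(w1, w2)] / starts[w1] if starts[w1] else 0.0

def p_addone(w2, w1):
    # add-one smoothing: never zero, even for unseen bigrams
    return (bigrams[(w1, w2)] + 1) / (starts[w1] + len(vocab))

print(p_mle('ok', 'you'))    # 0.5: 'you ok' once out of two 'you ...'
print(p_mle('ok', 'ok'))     # 0.0: never seen in the corpus
print(p_addone('ok', 'ok'))  # small but nonzero
```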

An Example from Speech Recognition

Speech recognition involves two steps:

  1. from the audio data, compute several candidate sentences; because homophones exist, this step can produce multiple results
  2. using the language model, compute each candidate's probability and pick the most probable sentence as the recognition result

The language model does its work in the second step.

Gradient Descent

Gradient descent is a numerical method for finding the minimum of a function. There are generally two ways to find extrema. The first is analytic: the extremum is computed directly from a formula, as for a quadratic function of one variable, whose maximum or minimum we can write down directly. But for higher-degree or nonlinear functions an analytic solution usually does not exist, and a numerical method is needed.

For a function of one variable, the gradient is just the derivative. Suppose the function is

\begin{equation} f(x) = 2x^2-4x+3 \end{equation}

Then its gradient function, i.e. its derivative, is

\begin{equation} Grad(x) = 4x-4 \end{equation}

The basic principle of gradient descent is:

For any value \(x_1\) of the variable \(x\), compute the gradient \(Grad(x_1)\) there, and subtract a fraction of it from the current value to get a new value \(x_2 = x_1 - 0.01 \cdot Grad(x_1)\). The function value at the new point is then smaller than at the old one, i.e.

\begin{equation} f(x_2) < f(x_1) \end{equation}

where \(0.01\) is the step size.

For example, take \(x_1 = 2\). The gradient there is \(4 \cdot 2-4=4\), so \(x_2 = 2 - 0.01 \cdot 4 = 1.96\), and

\begin{equation} f(x_1) = 2*2^2-4*2+3 = 3\\ f(x_2) = 2*1.96^2-4*1.96+3 = 2.843 \end{equation}

so indeed \(f(x_2) < f(x_1)\)

where

  • the \(2\) in \(x_1=2\) is the initial value, which can be chosen at random; if the function is convex, the same extremum is found regardless of the initial value
  • \(0.01\) is the step size of each iteration

Repeat these steps: after choosing an initial value, keep updating the variable by this rule. With each iteration the variable moves closer to the \(x\) of the optimum (here \(x = 1\)); with enough iterations we reach the optimum, and thus the minimum of the function.
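
The iteration just described can be written out directly for \(f(x) = 2x^2-4x+3\) (a minimal sketch; the iteration cap and tolerance are arbitrary choices):

```python
# Gradient descent on f(x) = 2x^2 - 4x + 3, using the update rule above.
def f(x):
    return 2 * x**2 - 4 * x + 3

def grad(x):
    return 4 * x - 4

x = 2.0        # initial value x_1 = 2
step = 0.01    # step size
for _ in range(2000):
    new_x = x - step * grad(x)
    # convergence criterion: stop when the function value barely changes
    if abs(f(new_x) - f(x)) < 1e-12:
        x = new_x
        break
    x = new_x

print(round(x, 4))     # close to the true minimum at x = 1
print(round(f(x), 4))  # close to f(1) = 1
```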

The Concept of the Gradient

As noted above, for a one-variable function the gradient is the derivative, a single number. The gradient generalizes the derivative to functions of more than one variable; it is then a vector whose components are the partial derivatives with respect to each variable.

Take a function of two variables as an example:

\begin{equation} f(x_1, x_2) = (x_1-2)^2 + 3x_2 \end{equation}

Its gradient function is

\begin{equation} Grad(x_1) = \frac{\partial f}{\partial x_1} = 2x_1-4\\ Grad(x_2) = \frac{\partial f}{\partial x_2}=3\\ Grad(x_1, x_2) = [Grad(x_1), Grad(x_2)] = [2x_1-4, 3] \end{equation}

The rule that the function value decreases when the variables move against the gradient carries over to functions of two or more variables, and so does the gradient descent algorithm; the only difference is that the update of the variables is now a vector operation.

Contour Lines

The gradient is always perpendicular to the function's contour lines, which make the principle of gradient descent more vivid. Moving the variables along the gradient increases the function value, so the value of the contour line increases; moving the variables against the gradient decreases the function value. Thus, by repeatedly moving the variables against the gradient, we find a local minimum of the function.

Step Size

The gradient gives the direction of change; the step size sets the amount of each update. If the step size is too large, one update may jump past the extremum, causing oscillation and preventing the extremum from being found. The step size must therefore be small enough to avoid oscillation.

Convergence Criterion

Since the extremum is approached step by step through iteration, how do we decide that it has been reached? A convergence criterion is needed to judge, after an iteration, whether the extremum has been found. A common criterion: fix a tolerance, and declare convergence when the change in the function value falls below it; the extremum has then been found.

Application in Machine Learning

If a machine learning algorithm's objective function can be written as a continuously differentiable convex function, gradient descent can find its minimum or maximum.

For example, in logistic regression the objective can be written as a continuously differentiable convex function; gradient descent then finds its minimum and thus the model's parameters.

The Bag-of-Words Model and N-gram Models

The bag-of-words model is a common language model that treats a document as the set of all the words it contains, ignoring word order. Despite losing word order, the model is still effective for several NLP tasks, such as text classification.

The N-gram model extends bag-of-words. N is a number; taking N=2, the 2-gram model treats a document as the set of all adjacent word pairs, again ignoring the order in which the pairs occur in the document. Bag-of-words is the special case N=1.

Example

For the following document:

The brown fox jump up the rog.

the corresponding bag-of-words model is:

(The, brown, fox, jump, up, the, rog)

and the corresponding 2-gram model is:

((The, brown), (brown, fox), (fox, jump), (jump, up), (up, the), (the, rog))
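
The 2-gram pairs above can be produced mechanically (a small sketch using zip()):

```python
# build the 2-gram pairs shown above by zipping the tokens with themselves
tokens = 'The brown fox jump up the rog'.split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('The', 'brown'), ('brown', 'fox'), ('fox', 'jump'),
#  ('jump', 'up'), ('up', 'the'), ('the', 'rog')]

# the general case for any n
def ngrams(tokens, n):
    return list(zip(*(tokens[i:] for i in range(n))))
```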

Document Representation

Once a document is represented as a bag of words, quantitative analysis becomes possible. A simple approach checks whether each word of the vocabulary occurs in the document, representing the document as a vector of 0s and 1s.

If the language's vocabulary is:

(a, boy, the, cute, likes, dog, little, girl)

then documents map to vectors as follows.

Document                Vector
a little girl           [1, 0, 0, 0, 0, 0, 1, 1]
the boy likes the dog   [0, 1, 1, 0, 1, 1, 0, 0]

Each element of a vector says whether the document contains a given word; in the first vector, the final 1 means the document contains the word girl.
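
The vectors in the tables can be computed with a short sketch (the vocabulary and documents are the ones above):

```python
vocab = ['a', 'boy', 'the', 'cute', 'likes', 'dog', 'little', 'girl']

def to_binary_vector(doc):
    words = doc.lower().split()
    # 1 if the vocabulary word occurs in the document, else 0
    return [1 if w in words else 0 for w in vocab]

def to_tf_vector(doc):
    words = doc.lower().split()
    # term frequency: how many times each vocabulary word occurs
    return [words.count(w) for w in vocab]

print(to_binary_vector('a little girl'))      # [1, 0, 0, 0, 0, 0, 1, 1]
print(to_tf_vector('the boy likes the dog'))  # [0, 1, 2, 0, 1, 1, 0, 0]
```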

Going further, term frequency (TF) can be taken into account, adding word-frequency information to the previous representation; the frequency can then serve as a weight in various measures.

Document                Vector
a little girl           [1, 0, 0, 0, 0, 0, 1, 1]
the boy likes the dog   [0, 1, 2, 0, 1, 1, 0, 0]

The third element of the second vector is 2, meaning the occurs twice in that document.

Some words occur so often that they may appear in every document, so document frequency (DF) is introduced: how often a word occurs across all documents. A very high DF suggests a stop word, which should be removed because it does not help document classification.

Relation to Language Models

The bag-of-words model is a simplified form of a language model. A language model takes a vocabulary and assigns probabilities to all word sequences over it; the bag-of-words model shrinks the set of sequences drastically, to sequences consisting of a single word, and each sequence's probability is then just the word's frequency.

A detailed introduction to language models is here.

Application to Information Retrieval

Information retrieval solves this problem: given a set of keywords, return from a document collection all documents relevant to the keywords, ranked by relevance.

The basic approach: first represent every document as a bag-of-words vector, then check whether the query keywords occur in the vector; this yields all documents relevant to the given keywords. For ranking the documents, term frequency-inverse document frequency (TF-IDF) may be used.

Application to Text Classification

Use the bag-of-words vector representation to put each document in mathematical form, then feed it to a concrete machine learning algorithm.

Recall, Precision, and F-score

These three numbers describe how good a model is: recall, precision, and the F-score.

Definitions

In a binary classification model, every input is classified as either positive or negative. For an input data set, suppose the model produces:

  • PT: the number of items classified positive that are truly positive
  • PF: the number classified positive that are truly negative
  • NT: the number classified negative that are truly negative
  • NF: the number classified negative that are truly positive

Recall, precision, and the F-score are then defined as:

Recall \(\frac{PT}{PT + NF}\)
Precision \(\frac{PT}{PT + PF}\)
F-score \(\frac{2 \cdot Recall \cdot Precision}{Recall + Precision}\)

Recall and precision share the same numerator but differ in the denominator. The standard F-score (F1) is their harmonic mean, not the simple average.

A Search Engine Example

Suppose for a given query a search engine returns 100 documents, of which 80 are correct (they match the query) and 20 are wrong (unrelated to the query). The precision is then \(80/100 = 80\%\).

Suppose 40 more documents also match the query but were not returned by the engine. The recall is then \(80 / (80 + 40) = 67\%\).

So recall measures the fraction of all matching documents that the engine "recalls", while precision measures the fraction of the returned documents that are correct. "All matching documents" and "returned documents" are different concepts.
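
The search-engine numbers can be plugged into a small helper (a sketch; it uses the standard harmonic-mean F1 rather than a simple average):

```python
# recall, precision, and F1 from the counts defined above,
# using the search-engine numbers: PT=80, PF=20, NF=40
def scores(pt, pf, nf):
    recall = pt / (pt + nf)
    precision = pt / (pt + pf)
    # F1 is the harmonic mean of recall and precision
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1

recall, precision, f1 = scores(80, 20, 40)
print(round(precision, 2))  # 0.8
print(round(recall, 2))     # 0.67
print(round(f1, 2))         # 0.73
```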

Further Thoughts

Recall measures whether anything is missed; precision measures whether the results are correct; the F-score combines the two.

If a model has high precision but low recall, most of its results are correct, yet it misses many correct items. If a model has high recall but low precision, it may return all the correct items, but mixed with many wrong ones.

A good model has both high recall and high precision, i.e. a high F-score.

Emacs Lisp Programming Notes

Error Handling

Signaling errors

(signal 'my-error '("This is a demo error"))
(error "This is another error")

What's the difference between the two? (error formats a message and signals the generic error condition; signal raises a specific condition symbol with associated data.)

Catching errors

(define-error 'my-error "My demo error") ; register the condition first

(progn
  (condition-case err
      (signal 'my-error '("This is a demo error"))
    (my-error (message "my error handled, %s" err))) ; handlers are not quoted
  (message "end"))

(defun aaa ()
  (interactive)
  (condition-case error
      (progn
	(op/git-change-branch op/repository-directory "source")
	(op/git-commit-changes op/repository-directory "Changes"))
    ;; the handler must not be quoted; git-error must be a defined condition
    (git-error (message "Error is %s" error))))

(aaa)

Ignoring errors

(progn
  (ignore-errors
    ;;(signal 'my-error '("This is a demo error"))
    (error "This is another error")
    )
  (message "end"))