Classify Text With NLTK
Classification is the task of choosing the correct class label for a given input.
A classifier is called supervised if it is built based on training corpora containing the correct label for each input.
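This post walks through one example to show how to train and use a classifier with NLTK.
The task is simple: given a name, decide its gender, that is, classify it as either male or female.
Training comes first, and training needs a corpus: examples of names that have already been labeled.
NLTK ships with such a corpus, names.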
from nltk.corpus import names
names.words('male.txt')    # list of male names
names.words('female.txt')  # list of female names
With a training corpus in hand, the next step is feature extraction.
The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features.
Here we make the simple assumption that the gender of a name is related to its last letter, so the last letter serves as the feature of each example.
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Shrek')
{'last_letter': 'k'}
With the feature extractor defined as above, we use it to build the training set and the test set.
import random
import nltk
from nltk.corpus import names
# Rebind names to a list of (name, label) pairs.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)  # the corpus lists are sorted alphabetically, so shuffle them to get a fair split
featuresets = [(gender_features(n), g) for (n, g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]  # hold out the first 500 for testing, train on the rest
classifier = nltk.NaiveBayesClassifier.train(train_set)  # train a naive Bayes classifier on the training set
classifier.classify(gender_features('Trinity'))  # once trained, the classifier can label new names
'female'
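To make it clearer what training and classifying involve, here is a minimal pure-Python sketch of a naive Bayes classifier over dictionary features. It is not NLTK's implementation: the toy name list, the helper names train_naive_bayes and classify, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def gender_features(word):
    return {'last_letter': word[-1]}

def train_naive_bayes(labeled_featuresets):
    """Count how often each label and each (label, feature, value) occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    values_seen = defaultdict(set)       # feature name -> all values observed
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen

def classify(model, features):
    """Return the label with the highest log prior plus log likelihoods."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        score = math.log(count / total)
        for fname, fval in features.items():
            # Add-one smoothing (an assumption of this sketch) so that
            # unseen values do not produce a zero probability.
            numerator = value_counts[(label, fname)][fval] + 1
            denominator = count + len(values_seen[fname]) + 1
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, made up for this example.
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
             ('Mark', 'male'), ('Jack', 'male'), ('Frank', 'male')]
model = train_naive_bayes([(gender_features(n), g) for n, g in toy_names])
print(classify(model, gender_features('Olivia')))
print(classify(model, gender_features('Erik')))
```

With six training names the result only reflects the toy data, but the flow is the same as above: extract features, count, then pick the most probable label.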
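Now evaluate it on the test set.
print(nltk.classify.accuracy(classifier, test_set))  # NLTK provides an accuracy helper for evaluating on a test set
0.758
With the last letter as the only feature, accuracy is about 76%, which clearly leaves a lot of room for improvement.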
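Accuracy here is just the fraction of test items whose predicted label matches the known label. A rough sketch of that computation follows; the accuracy function, the rule-based stand-in classifier guess_by_last_letter, and the toy test set are hypothetical and are not part of NLTK.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of items whose predicted label equals the gold label."""
    if not labeled_featuresets:
        return 0.0
    correct = sum(1 for features, gold in labeled_featuresets
                  if classify(features) == gold)
    return correct / len(labeled_featuresets)

# Hypothetical stand-in for a trained classifier.
def guess_by_last_letter(features):
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

# Toy test set, made up for this example.
toy_test_set = [({'last_letter': 'a'}, 'female'),
                ({'last_letter': 'k'}, 'male'),
                ({'last_letter': 'n'}, 'female'),
                ({'last_letter': 'y'}, 'male')]
print(accuracy(guess_by_last_letter, toy_test_set))  # 0.5
```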
classifier.show_most_informative_features(5)  # a handy call: it lists the features that best separate the classes
Most Informative Features
    last_letter = 'a'    female : male   =  38.3 : 1.0
    last_letter = 'k'      male : female =  31.4 : 1.0
    last_letter = 'f'      male : female =  15.3 : 1.0
    last_letter = 'p'      male : female =  10.6 : 1.0
    last_letter = 'w'      male : female =  10.6 : 1.0
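Each ratio in that listing compares how likely a feature value is under one label versus the other. As a rough sketch of the idea, the hypothetical helper below estimates P(value | label) from counts and divides the two estimates. The smoothing constant of 0.5 and the toy data are assumptions, and NLTK's own estimator may differ in detail.

```python
from collections import Counter

def likelihood_ratio(labeled_featuresets, fname, fval, label_a, label_b,
                     smoothing=0.5):
    """Estimate P(fname=fval | label_a) / P(fname=fval | label_b)."""
    counts = {label_a: Counter(), label_b: Counter()}
    totals = Counter()
    for features, label in labeled_featuresets:
        if label in counts:
            totals[label] += 1
            counts[label][features.get(fname)] += 1
    # Number of distinct values seen for this feature across both labels.
    bins = len(set(counts[label_a]) | set(counts[label_b]))

    def prob(label):
        return ((counts[label][fval] + smoothing) /
                (totals[label] + smoothing * bins))

    return prob(label_a) / prob(label_b)

# Toy data, made up for this example: last letters of labeled names.
toy = ([({'last_letter': c}, 'female') for c in 'aaae'] +
       [({'last_letter': c}, 'male') for c in 'kkna'])
ratio = likelihood_ratio(toy, 'last_letter', 'a', 'female', 'male')
print('last_letter = a  female : male = %.1f : 1.0' % ratio)
```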
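NLTK also allows for the case where memory is too small to hold every feature set at once: apply_features returns an object that computes each feature set lazily, at the moment it is used.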
from nltk.classify import apply_features
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[:500])
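The idea behind lazy feature computation can be sketched in a few lines of plain Python. The LazyFeatureList class below is a hypothetical illustration, not NLTK's actual implementation: it stores only the labeled names and runs the feature extractor when an item is accessed.

```python
class LazyFeatureList:
    """Sequence that computes (features, label) pairs on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing stays lazy: wrap the sliced tokens again.
            return LazyFeatureList(self._func, self._tokens[index])
        token, label = self._tokens[index]
        return (self._func(token), label)

def gender_features(word):
    return {'last_letter': word[-1]}

# Toy labeled names, made up for this example.
labeled = [('Shrek', 'male'), ('Trinity', 'female'), ('Neo', 'male')]
lazy = LazyFeatureList(gender_features, labeled)
print(len(lazy))
print(lazy[1])
```

Nothing is extracted until lazy[1] is evaluated, so memory use stays proportional to the raw names rather than to the feature dictionaries.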
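How well a classifier performs depends heavily on which features are extracted from the training data; a sensible choice of features gives better results.
More features are not always better, though: if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.
This is why feature extraction is an important research topic in classification.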
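For example, if we extend the features above so that the last character and the last two characters are two separate features, the classifier's test accuracy improves somewhat.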
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}
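But if the features are the first letter, the last letter, and a count of each letter in the name, the result is overfitting, and test accuracy ends up lower than with the last letter alone.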
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
gender_features2('John')
{'count(j)': 1, 'has(d)': False, 'count(b)': 0, ...}
featuresets = [(gender_features2(n), g) for (n, g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
0.748
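This simple example covers the whole process of classifying with NLTK. What changes from task to task is the choice of features, and naive Bayes is not the only classifier available, so you can pick one that suits the task.
For text classification, for instance, the features of a document can be whether it contains each of a set of feature words.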
import nltk
from nltk.corpus import movie_reviews
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]  # take the most frequent words as features, admittedly a crude way to choose them
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{'contains(waste)': False, 'contains(lot)': False, ...}
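The same recipe can be tried without downloading the corpus. The sketch below uses only the standard library; the three toy reviews and the cutoff of three feature words are made up for illustration.

```python
from collections import Counter

# Toy labeled documents, made up for this example.
toy_reviews = [
    (['a', 'great', 'fun', 'movie'], 'pos'),
    (['a', 'waste', 'of', 'time'], 'neg'),
    (['great', 'cast', 'great', 'fun'], 'pos'),
]

# Count every word across all documents and keep the most frequent ones.
word_counts = Counter(w.lower() for words, _ in toy_reviews for w in words)
word_features = [w for w, _ in word_counts.most_common(3)]

def document_features(document):
    """One boolean feature per feature word: is it in the document?"""
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

print(word_features)
print(document_features(['a', 'fun', 'film']))
```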
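POS tagging can also be treated as a classification problem.
For example, we can guess a word's part of speech from its suffix, using as features whether the word ends with each of a set of common suffixes.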
def pos_features(word):
    # common_suffixes is assumed to be a list of frequent suffixes built beforehand
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features
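The function above relies on a list named common_suffixes that is not defined in this post. One plausible way to build it, sketched here under that assumption, is to count the 1, 2, and 3 character suffixes of every word in some word list and keep the most frequent ones. The helper most_common_suffixes and the toy word list are hypothetical.

```python
from collections import Counter

def most_common_suffixes(words, n, max_len=3):
    """Return the n most frequent suffixes of length 1..max_len."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for length in range(1, max_len + 1):
            if len(word) >= length:
                counts[word[-length:]] += 1
    return [suffix for suffix, _ in counts.most_common(n)]

# Toy word list, made up for this example; a real run would use a corpus.
toy_words = ['running', 'jumping', 'walked', 'talked', 'cats', 'dogs']
common_suffixes = most_common_suffixes(toy_words, 7)

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

print(sorted(common_suffixes))
print(pos_features('Singing')['endswith(ing)'])
```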
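These features are quite simple, so here is an improvement: use the suffix together with the context, namely the previous word and its tag, which gives a fuller picture. The suffix is taken at three lengths because suffixes that signal part of speech are usually at most three characters long, such as s, er, and ing.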
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]  # history holds the tags of the words in this sentence
    return features
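To show where history comes from, here is a sketch of a tagging loop that feeds each predicted tag back in as context for the next word. The tag_sentence helper and the rule-based toy_classify are hypothetical stand-ins; a real tagger would call a trained classifier instead.

```python
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
        features["prev-tag"] = history[i - 1]
    return features

def tag_sentence(sentence, classify):
    """Tag words left to right, appending each predicted tag to history."""
    history = []
    for i in range(len(sentence)):
        features = pos_features(sentence, i, history)
        history.append(classify(features))
    return list(zip(sentence, history))

# Hypothetical stand-in for a trained classifier.
def toy_classify(features):
    if features["prev-word"] == "<START>":
        return "DT"
    if features["prev-tag"] == "DT":
        return "NN"
    if features["suffix(1)"] == "s":
        return "VBZ"
    return "NN"

print(tag_sentence(['The', 'dog', 'barks'], toy_classify))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```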
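As for classifiers, besides naive Bayes NLTK also offers decision tree and maximum entropy classifiers, which are not covered in detail here.
For large-scale data, classifiers written in pure Python are comparatively slow, so you need classifiers implemented in an efficient language such as C. NLTK supports packages of this kind as well; see the NLTK web page.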
This article is excerpted from cnblogs (博客园). Original publication date: 2011-07-04.