Classify Text With NLTK

Posted: 2023-09-11 14:16:06

Classification is the task of choosing the correct class label for a given input.

A classifier is called supervised if it is built based on training corpora containing the correct label for each input.

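The following example shows how to train and use a classifier with NLTK.

The task is simple: given a first name, decide its gender, that is, classify it into one of two classes, male or female.

Training comes first, and training needs a corpus: a set of names that have already been labeled.

NLTK provides a names corpus: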

from nltk.corpus import names  # the corpus must be installed first, e.g. nltk.download('names')

names.words('male.txt')  # list of male names

names.words('female.txt')  # list of female names

With a training corpus in hand, the next step is feature extraction.

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features.

Here we simply assume that the gender of a name is related to its last letter, so the last letter becomes the single feature of each instance.

def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('Shrek')

{'last_letter': 'k'}

With the feature extractor defined as above, we use it to build the training set and the test set.

import nltk
from nltk.corpus import names
import random
# Note: this rebinds `names` from the corpus reader to a list of (name, label) pairs.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)  # the names are originally sorted alphabetically; shuffle them so training is not skewed

featuresets = [(gender_features(n), g) for (n, g) in names]

train_set, test_set = featuresets[500:], featuresets[:500]  # hold out the first 500 items for testing, train on the rest
classifier = nltk.NaiveBayesClassifier.train(train_set)  # train a naive Bayes classifier on the training set

classifier.classify(gender_features('Trinity'))  # once trained, the classifier can label new names
'female'
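To make the training step less of a black box, here is a minimal pure-Python sketch of the idea behind a naive Bayes classifier: count how often each feature value occurs with each label, then pick the label with the highest combined probability. The function names `train_naive_bayes` and `classify`, the add-one smoothing, and the toy data are all assumptions made for illustration; NLTK's own implementation differs in its details.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets):
    """Count labels and (label, feature, value) co-occurrences."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature name) -> Counter of values
    values_seen = defaultdict(set)        # feature name -> all values observed
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[label, fname][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen


def classify(model, features):
    """Return the label maximizing log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for fname, fval in features.items():
            # Add-one smoothing (an illustrative choice) so an unseen value never gives log(0).
            numerator = value_counts[label, fname][fval] + 1
            denominator = count + len(values_seen[fname])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data, for illustration only.
toy_train = ([({'last_letter': 'a'}, 'female')] * 3 +
             [({'last_letter': 'e'}, 'female')] +
             [({'last_letter': 'k'}, 'male')] * 3)

model = train_naive_bayes(toy_train)
print(classify(model, {'last_letter': 'a'}), classify(model, {'last_letter': 'k'}))
```

The scores are summed in log space to avoid multiplying many small probabilities together.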

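Now evaluate the classifier on the test set.

print(nltk.classify.accuracy(classifier, test_set))  # nltk.classify.accuracy scores the classifier against the test set
0.758

With only the last letter as a feature, accuracy is about 76%, which clearly leaves a lot of room for improvement.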
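Accuracy here is simply the fraction of test items whose predicted label matches the correct label. The sketch below spells that out in plain Python; `accuracy`, `last_letter_rule`, and the four-item test set are hypothetical names and data invented for illustration, not NLTK's implementation.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)


def last_letter_rule(features):
    # Hypothetical stand-in for a trained classifier: a final vowel means 'female'.
    return 'female' if features['last_letter'] in 'aeiou' else 'male'


# Made-up test set, for illustration only.
toy_test_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'male'),
    ({'last_letter': 'n'}, 'female'),
]

print(accuracy(last_letter_rule, toy_test_set))  # 2 of 4 correct -> 0.5
```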

classifier.show_most_informative_features(5)  # a handy call: prints the features that best separate the classes
Most Informative Features
last_letter = a     female : male = 38.3 : 1.0
last_letter = k     male : female = 31.4 : 1.0
last_letter = f     male : female = 15.3 : 1.0
last_letter = p     male : female = 10.6 : 1.0
last_letter = w     male : female = 10.6 : 1.0
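Each ratio in the listing above says how much more likely a feature value is under one label than under the other. A rough sketch of that calculation from raw counts follows; `likelihood_ratio` and the toy data are assumptions for illustration, and because NLTK works with smoothed probability estimates its numbers will not match a raw count ratio exactly.

```python
from collections import Counter


def likelihood_ratio(labeled_featuresets, fname, fval, label_a, label_b):
    """Return P(fname=fval | label_a) / P(fname=fval | label_b) from raw counts.

    Unsmoothed: raises ZeroDivisionError if fval never occurs with label_b.
    """
    label_counts = Counter(label for _, label in labeled_featuresets)
    hits = Counter(label for features, label in labeled_featuresets
                   if features.get(fname) == fval)
    p_a = hits[label_a] / label_counts[label_a]
    p_b = hits[label_b] / label_counts[label_b]
    return p_a / p_b


# Made-up data: 10 female names (6 ending in 'a') and 10 male names (1 ending in 'a').
toy = ([({'last_letter': 'a'}, 'female')] * 6 +
       [({'last_letter': 'e'}, 'female')] * 4 +
       [({'last_letter': 'a'}, 'male')] * 1 +
       [({'last_letter': 'k'}, 'male')] * 9)

ratio = likelihood_ratio(toy, 'last_letter', 'a', 'female', 'male')
print(round(ratio, 1))  # 0.6 / 0.1, i.e. female : male = 6.0 : 1.0
```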

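NLTK also allows for the case where the full list of feature sets does not fit in memory: apply_features returns an object that computes each feature set on demand, when it is accessed.

from nltk.classify import apply_features
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[:500])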
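The idea of computing features on demand can be sketched with a small wrapper class. `LazyFeatures` below is a hypothetical illustration of the concept, not NLTK's implementation: it stores only the raw (name, label) pairs and runs the feature extractor each time an item is read. It supports integer indexes and iteration only.

```python
class LazyFeatures:
    """Sequence-like view that builds (features, label) pairs only when accessed."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        # An out-of-range index raises IndexError, which also ends iteration.
        token, label = self._tokens[index]
        return (self._func(token), label)


def gender_features(word):
    return {'last_letter': word[-1]}


# Made-up data, for illustration only.
toy_names = [('Anna', 'female'), ('Mark', 'male')]
lazy = LazyFeatures(gender_features, toy_names)

print(len(lazy))
print(lazy[1])
```

Nothing is precomputed, so memory use stays proportional to the raw data rather than to the feature dictionaries, at the cost of recomputing features on every access.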

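How well a classifier performs depends heavily on which features are extracted from the training data; well-chosen features give better classification results.

That said, more features are not always better. If you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.

This is why feature extraction is an important research topic in classification.

For example, if the features above are extended so that the last character and the last two characters are used as two separate features, the classifier's test accuracy improves somewhat.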

def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

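However, if the features are extended to the first letter, the last letter, and a count of every character, the result is overfitting: test accuracy ends up lower than when only the last letter was used.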

def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

gender_features2('John')
{'count(j)': 1, 'has(d)': False, 'count(b)': 0, ...}

featuresets = [(gender_features2(n), g) for (n, g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
0.748

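This simple example covers the whole process of classification with NLTK. What changes from task to task is the choice of features and the choice of classifier: naive Bayes is not the only one available, and a different classifier may suit a different task.

For text classification, for example, the features can be whether the document contains each of a set of feature words.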

from nltk.corpus import movie_reviews

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]  # the 2000 most frequent words; a crude way to pick feature words

def document_features(document):
    document_words = set(document)  # set membership is faster than searching a list
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{'contains(waste)': False, 'contains(lot)': False, ...}
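The same bag-of-words idea can be tried without downloading a corpus. The sketch below uses `collections.Counter` from the standard library in place of `nltk.FreqDist`, and three tiny made-up documents in place of the movie review corpus; both substitutions are assumptions made for illustration.

```python
from collections import Counter

# Made-up tokenized documents, for illustration only.
toy_docs = [
    (['great', 'film', 'great', 'cast'], 'pos'),
    (['dull', 'film'], 'neg'),
    (['film'], 'neg'),
]

# Count every word across all documents, then keep the 2 most frequent as feature words.
word_counts = Counter(w.lower() for words, _ in toy_docs for w in words)
word_features = [w for w, _ in word_counts.most_common(2)]


def document_features(document):
    """Map a tokenized document to contains(word) -> True/False for each feature word."""
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}


print(word_features)                                   # ['film', 'great']
print(document_features(['a', 'great', 'story']))
```

Labeled feature sets for training would then be built as `[(document_features(d), label) for d, label in toy_docs]`.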

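POS tagging can also be treated as a classification problem.

For example, the part of speech of a word can be guessed from its suffix, so here the features are whether the word ends with each of a list of common suffixes.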

def pos_features(word):
    # common_suffixes is assumed to be defined elsewhere as a list of frequent word endings
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features
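The list `common_suffixes` is not defined in this article. One plausible way to build it, sketched below, is to count the 1-, 2-, and 3-character endings of the words in some word list and keep the most frequent ones. The helper name `most_common_suffixes`, the toy word list, and the decision to skip endings that are as long as the whole word are all assumptions for illustration.

```python
from collections import Counter


def most_common_suffixes(words, n):
    """Return the n most frequent 1- to 3-character word endings."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for size in (1, 2, 3):
            if len(word) > size:          # skip "suffixes" that are the whole word
                counts[word[-size:]] += 1
    return [suffix for suffix, _ in counts.most_common(n)]


# Made-up word list, for illustration only.
toy_words = ['running', 'jumping', 'walked', 'talked', 'sing']
common_suffixes = most_common_suffixes(toy_words, 3)


def pos_features(word):
    return {'endswith(%s)' % s: word.lower().endswith(s) for s in common_suffixes}


print(sorted(common_suffixes))
print(pos_features('Reading'))
```

In practice the word list would come from a large tagged corpus rather than five hand-picked words.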

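These features are fairly simple. An improved version uses the suffix together with the context, namely the previous word and its tag, which gives a fuller picture. Three suffix lengths are considered because the suffixes that signal part of speech are generally at most three characters long, such as s, er, and ing.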

def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        # no previous word at the start of a sentence, so use a marker instead
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]  # history holds the tags already assigned to the earlier words of the sentence
    return features
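The `history` argument only makes sense once you see how a tagger would fill it in: the sentence is tagged left to right, and each predicted tag is appended to `history` before the next word is processed. The sketch below shows that loop. `tag_sentence` and the rule-based `toy_classify` are hypothetical stand-ins for a trained classifier, invented for illustration.

```python
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i - 1]
        features["prev-tag"] = history[i - 1]
    return features


def tag_sentence(sentence, classify):
    """Tag words left to right, feeding each predicted tag into the next prediction."""
    history = []
    for i in range(len(sentence)):
        features = pos_features(sentence, i, history)
        history.append(classify(features))
    return list(zip(sentence, history))


def toy_classify(features):
    # Hypothetical hand-written rules standing in for classifier.classify(features).
    if features["suffix(3)"] == "the":
        return "DT"
    if features["prev-tag"] == "DT":
        return "NN"
    if features["suffix(1)"] == "s":
        return "VBZ"
    return "NN"


print(tag_sentence(["the", "dog", "barks"], toy_classify))
# [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```

During training the history would come from the gold tags of the corpus instead of from predictions.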

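Besides naive Bayes, NLTK also provides decision tree and maximum entropy classifiers, which are not covered in detail here.

For large-scale data, classifiers written in pure Python are relatively slow, so a classifier implemented in a faster language such as C is needed. NLTK supports packages that provide such classifiers; see the NLTK web page for details.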

This article is excerpted from cnblogs (博客园). Original publication date: 2011-07-04
