TF-IDF Practice

Time: 2023-09-11 14:17:15

Let's go straight to the code:

"""
    Test demo
"""
import lightgbm as lgb
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


def use_lgb():
    # Training data: 500 samples, 10 features
    train_data = np.random.rand(500, 10)
    # Random binary labels
    label = np.random.randint(2, size=500)
    # Wrap the data in a LightGBM Dataset
    train = lgb.Dataset(train_data, label=label)
    print(train)


def use_tfidf():
    # Two pre-segmented sentences; tokens are separated by spaces
    sentence = ['没有 你 的 地方 都是 他乡', '没有 你 的 旅行 都是 流浪']
    # Do not remove stop words
    c = CountVectorizer(stop_words=None)

    # Fit the model and return the document-term count matrix
    count_word_tf = c.fit_transform(sentence)

    # print(count_word_tf.toarray())
    # # Show the vocabulary as a dict (term -> column index)
    # print(c.vocabulary_)
    # # Get the feature names
    # print(c.get_feature_names_out())

    ###############################
    stopword = ['都是']
    # Build a TF-IDF vectorizer that removes the stop words.
    # Note: the default token_pattern keeps only tokens of two or more
    # characters, so the single-character tokens 你 and 的 are dropped as well.
    tfidf = TfidfVectorizer(stop_words=stopword)

    # Compute the TF-IDF weights and extract them as a dense array
    weight = tfidf.fit_transform(sentence).toarray()
    # Get the feature names. get_feature_names() was removed in
    # scikit-learn 1.2; get_feature_names_out() returns an ndarray,
    # so convert it to a list.
    word = tfidf.get_feature_names_out().tolist()

    print("Terms:")
    print(word)

    # vocabulary_ is a dict, so the iteration order may differ between environments
    print("\nVocabulary and column indices:")
    for key, value in tfidf.vocabulary_.items():
        print(key, value)

    # These are TF-IDF weights, not raw term frequencies
    print("\nTF-IDF weight matrix:")
    print(weight)
    print(len(weight))

    # Print the TF-IDF weights per document: the outer loop iterates over
    # the documents, the inner loop over the terms of the vocabulary
    for i in range(len(weight)):
        print("TF-IDF weights of the terms in document {}:".format(i))
        for j in range(len(word)):
            # word[j] is the j-th vocabulary term; weight[i][j] is its
            # TF-IDF value in document i (0.0 if the term does not occur)
            print(word[j], weight[i][j])


if __name__ == '__main__':
    use_tfidf()

Output:

Terms:
['他乡', '地方', '旅行', '没有', '流浪']

Vocabulary and column indices:
他乡 0
旅行 2
流浪 4
地方 1
没有 3

TF-IDF weight matrix:
[[0.6316672  0.6316672  0.         0.44943642 0.        ]
 [0.         0.         0.6316672  0.44943642 0.6316672 ]]
2
TF-IDF weights of the terms in document 0:
他乡 0.6316672017376245
地方 0.6316672017376245
旅行 0.0
没有 0.4494364165239821
流浪 0.0
TF-IDF weights of the terms in document 1:
他乡 0.0
地方 0.0
旅行 0.6316672017376245
没有 0.4494364165239821
流浪 0.6316672017376245
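
To see where the numbers in the output above come from, the weights can be recomputed by hand. This is a minimal sketch, not scikit-learn's actual implementation. It assumes the TfidfVectorizer defaults as I understand them: a token pattern that keeps only tokens of two or more word characters, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper name `tokenize` and the variable names are made up for this illustration.

```python
import math
import re

# Same two pre-segmented sentences and stop word as in the demo.
sentences = ['没有 你 的 地方 都是 他乡', '没有 你 的 旅行 都是 流浪']
stop_words = {'都是'}

# Assumed default token pattern: tokens of two or more word characters,
# which is why the single-character tokens 你 and 的 never reach the vocabulary.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Split text with the pattern above and drop the stop words."""
    return [t for t in TOKEN_PATTERN.findall(text) if t not in stop_words]


docs = [tokenize(s) for s in sentences]
vocab = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocab:
    df = sum(term in doc for doc in docs)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

# Raw weight = term count * idf, then each row is scaled to unit L2 norm.
weights = []
for doc in docs:
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    weights.append([v / norm for v in raw])

print(vocab)
for row in weights:
    print([round(v, 8) for v in row])
```

The term 没有 occurs in both documents, so its IDF is ln(3/3) + 1 = 1, while every other term occurs in only one document and gets ln(3/2) + 1 ≈ 1.405. After normalization this gives the lower weight for 没有 and the higher, equal weights for the document-specific terms, in line with the matrix printed above.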

Reference: https://blog.csdn.net/the_lastest/article/details/79093407
