Python3: Language Detection Tools langdetect and langid

Date: 2023-09-27 14:29:08

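I. Preface
This post introduces two language detection tools, langdetect and langid, which identify the language a piece of text is written in. Besides these two, some posts online suggest that an N-gram approach also works well for this problem.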
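
To make the N-gram idea concrete, here is a minimal sketch in plain Python. It is not how langdetect or langid are implemented; it only compares character trigram counts against two tiny hand-made profiles. The names `char_ngrams`, `cosine`, `PROFILES` and `guess_language` are hypothetical, and a usable system would need profiles trained on large corpora.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tiny illustrative profiles (an assumption for this sketch only).
PROFILES = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "zh": char_ngrams("本篇博客主要介绍两款语言探测工具用于区分文本到底是什么语言"),
}


def guess_language(text):
    """Return the profile label whose n-gram counts are most similar to the text."""
    sample = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(sample, PROFILES[lang]))


print(guess_language("the dog and the fox"))
print(guess_language("语言探测工具"))
```

With profiles this small the sketch only works on text that shares trigrams with them; it illustrates the mechanism rather than serving as a practical detector.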

II. Environment
Python 3.6 (Anaconda)

III. langdetect
Website: https://code.google.com/archive/p/language-detection (the original Java library; langdetect is its Python port)

1. Installation
Install with pip from a command prompt (if this fails, look up how to install packages with pip):

pip install langdetect

2. Usage
The library is simple to use; call it directly, as in the code below:

from langdetect import detect
from langdetect import detect_langs

s1 = "本篇博客主要介绍两款语言探测工具，用于区分文本到底是什么语言，"
s2 = 'We are pleased to introduce today a new technology – Record Matching –that automatically finds relevant historical records for every family tree on MyHerit'
s3 = "Javigator：Java代码导读及分析管理工具的设计"

print(detect(s1))
print(detect(s2))
print(detect(s3))     # detect() returns the single most likely language code
print(detect_langs(s3))    # detect_langs() returns every candidate language with its probability

The output is as follows:
Note: the language codes mainly follow the ISO 639-1 standard; see the Baidu Baike entry on ISO 639-1 for details.

zh-cn    # Chinese
en      # English
et     # Estonian
[et:0.7139002269697295, lt:0.1432406269337342, no:0.142858586700596]  # the candidate languages detected for the sentence and the probability of each

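3. Summary
As this simple example shows, s3 is actually the title of a Chinese paper, yet it was detected as Estonian, so in my experience langdetect's accuracy is not very high.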
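
s3 mixes Latin-script words ("Javigator", "Java") with Chinese characters. One possible sanity check, which is not part of langdetect, is to measure how many of the letters are CJK ideographs before trusting a non-Chinese result. The helper name `cjk_ratio` is hypothetical, and any threshold applied to its result would be an assumption to tune on real data.

```python
import unicodedata


def cjk_ratio(text):
    """Share of alphabetic characters that are CJK unified ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = [
        ch for ch in letters
        if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")
    ]
    return len(cjk) / len(letters)


s2 = "We are pleased to introduce today a new technology"
s3 = "Javigator：Java代码导读及分析管理工具的设计"

print(cjk_ratio(s2))
print(cjk_ratio(s3))
```

Note that this only signals the presence of Chinese characters; Japanese text also uses these ideographs, so the check cannot replace a real detector.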

IV. langid
Website: https://github.com/saffsd/langid.py

1. Installation
Install with pip from a command prompt (if this fails, look up how to install packages with pip):

pip install langid

2. Usage
The library is simple to use; call it directly, as in the code below:

import langid

s1 = "本篇博客主要介绍两款语言探测工具，用于区分文本到底是什么语言，"
s2 = 'We are pleased to introduce today a new technology – Record Matching –that automatically finds relevant historical records for every family tree on MyHerit'
s3 = "Javigator：Java代码导读及分析管理工具的设计"

print(langid.classify(s1))
print(langid.classify(s2))
print(langid.classify(s3))   # langid.classify() returns the detected language code and its confidence score (an unnormalized log-probability, hence negative); for how the score is computed, see https://jblevins.org/log/log-sum-exp

The output is as follows:
Note: the language codes mainly follow the ISO 639-1 standard; see the Baidu Baike entry on ISO 639-1 for details.

('zh', -461.7451400756836)   # Chinese
('en', -264.470148563385)   # English
('zh', -206.2425878047943)   # Chinese

3. Summary
In my experience langid is more accurate than langdetect, but it runs more slowly.
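
The speed impression can be checked on your own data. Below is a rough timing sketch; the helper `time_classifier` is hypothetical, and the stand-in classifier is there only so the snippet runs without extra packages. To compare the two libraries, pass `langid.classify` or `langdetect.detect` in its place.

```python
import time


def time_classifier(classify, texts, repeats=3):
    """Return the best wall-clock time in seconds for classifying all texts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            classify(text)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in classifier (an assumption for this sketch); replace it with
# langid.classify or langdetect.detect to measure the real libraries.
elapsed = time_classifier(lambda text: "en", ["hello world"] * 1000)
print("{:.6f} s".format(elapsed))
```

Taking the best of several passes reduces noise from other processes; results still depend on text length and on one-time model loading, so time a warm-up call separately if that matters.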
