Mining of Massive Datasets – Data Mining
1 What is Data Mining?
The most commonly accepted definition of “data mining” is the discovery of “models” for data.
1.1 Statistical Modeling
Statisticians were the first to use the term “data mining.”
Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution (e.g., a Gaussian distribution) from which the visible data is drawn.
1.2 Machine Learning
There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.
Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.
The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.
For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them.
Machine learning fits cases like this, where we cannot articulate what rules should be mined: we simply feed the data to an ML algorithm and let it make the judgments for us, without having to understand the specifics of how it does so.
1.3 Computational Approaches to Modeling
The previous sections described two views of a model; how do we actually discover one?
There are many different approaches to modeling data; two are introduced here:
1. Summarizing the data succinctly and approximately
2. Extracting the most prominent features of the data and ignoring the rest
The following sections discuss these two approaches in more detail.
1.4 Summarization
One of the most interesting forms of summarization is the PageRank idea, which made Google successful. The entire complex structure of the Web is summarized by a single number for each page.
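To make "one number per page" concrete, here is a minimal power-iteration sketch of PageRank in Python. The four-page link graph, the damping factor of 0.85, and the iteration count are illustrative assumptions, not taken from the book.

```python
# Minimal PageRank power iteration on a toy link graph.
# The graph, damping factor, and iteration count are illustrative choices.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
n = len(pages)
damping = 0.85

# Start with a uniform rank for every page.
rank = {p: 1.0 / n for p in pages}

for _ in range(50):
    new_rank = {p: (1 - damping) / n for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

# Each page's entire role in the link structure is summarized by one number.
for page, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {r:.3f}")
```

However tangled the link structure is, the only thing kept about each page is its final rank, which is exactly the sense in which PageRank is a summary.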
Another important form of summary is clustering. The book gives the example of plotting cholera cases on a map of London: simply plotting the points by hand and clustering them revealed the pattern that people living near one particular intersection were far more likely to contract the disease.
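As a concrete illustration of the clustering idea, here is a minimal k-means style sketch in Python using only the standard library. The 2-D points and the choice of two clusters are made-up illustrative data, not the actual cholera map.

```python
import random

# Made-up 2-D points forming two loose groups (stand-ins for map locations).
points = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1), (5.0, 5.2), (5.3, 4.8), (4.9, 5.1)]
k = 2
random.seed(0)
centroids = random.sample(points, k)

def closest(p, cents):
    # Index of the centroid nearest to point p (squared Euclidean distance).
    return min(range(len(cents)),
               key=lambda i: (p[0] - cents[i][0]) ** 2 + (p[1] - cents[i][1]) ** 2)

for _ in range(10):
    # Assign each point to its nearest centroid, then recompute centroids.
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[closest(p, centroids)].append(p)
    centroids = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
        for i, c in enumerate(clusters)
    ]

print("centroids:", centroids)
```

The summary here is the pair of centroids: thousands of individual points are reduced to a couple of representative locations.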
1.5 Feature Extraction
A complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections.
2 Statistical Limits on Data Mining
A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.
However, data-mining techniques are not always effective; Bonferroni's Principle, introduced below, helps guard against misusing them.
2.1 Total Information Awareness
In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity.
The Bush plan was eventually rejected by Congress over privacy concerns, but it serves here only as an example for discussing whether data mining of this kind can work at all.
2.2 Bonferroni’s Principle
Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.
In a situation like searching for terrorists, we expect that there are few terrorists operating at any one time. If a mining technique nevertheless flags millions of "terrorist events" every day, the technique is effectively useless, even if a handful of real plots are buried among them.
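A quick back-of-the-envelope calculation shows how the principle works. All the numbers below (population size, number of hotels, visit frequency, observation period) are hypothetical, chosen only to show that a completely random world can still produce a huge expected count of "suspicious" coincidences.

```python
from math import comb

# All numbers below are hypothetical, for illustration only.
people = 10**9          # people being tracked
days = 1000             # days of observation
hotels = 10**5          # hotels available
p_hotel = 0.01          # chance a given person visits some hotel on a given day

# Probability that two specific people are in the same hotel on a specific day,
# assuming visits are completely random and independent.
p_same_hotel_same_day = (p_hotel * p_hotel) / hotels

# Probability they do so on two *different* days (the "suspicious" pattern).
p_pattern = p_same_hotel_same_day ** 2

# Expected number of (pair of people, pair of days) combinations showing the pattern.
expected_false_alarms = comb(people, 2) * comb(days, 2) * p_pattern
print(f"expected coincidences under pure randomness: {expected_false_alarms:,.0f}")
```

If the number of real terrorists is tiny compared with that expected count, nearly every "discovery" will be a false alarm, which is exactly what the principle warns about.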
3 Things Useful to Know
If you are learning data mining, the following basic concepts are important:
1. The TF.IDF measure of word importance.
2. Hash functions and their use.
3. Secondary storage (disk) and its effect on running time of algorithms.
4. The base e of natural logarithms and identities involving that constant.
5. Power laws.
3.1 Importance of Words in Documents
In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.
This is a fairly typical data-mining problem: extracting the keywords that identify a topic.
The most basic technique for it is TF.IDF (Term Frequency times Inverse Document Frequency).
Term frequency (TF) is the number of times a given word appears in a document; this count is usually normalized to prevent a bias toward long documents.
Inverse document frequency (IDF) measures how important a word is in general: the IDF of a particular word is the total number of documents divided by the number of documents containing that word, with the logarithm taken of the quotient.
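Here is a minimal Python sketch of the computation just described. The three toy documents are made up; TF is normalized by the document's most frequent word and IDF uses a base-2 logarithm, which is one common convention among several.

```python
import math
from collections import Counter

# Toy corpus; the documents are made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the web has a power law link structure",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word.
doc_freq = Counter()
for words in tokenized:
    doc_freq.update(set(words))

def tf_idf(word, words):
    counts = Counter(words)
    # TF: count of the word, normalized by the most frequent word in the document.
    tf = counts[word] / max(counts.values())
    # IDF: log of (total documents / documents containing the word).
    idf = math.log2(n_docs / doc_freq[word])
    return tf * idf

# Words that characterize a document score high; "the", which appears in
# every document, scores 0.
for words in tokenized:
    scores = {w: tf_idf(w, words) for w in set(words)}
    top = max(scores, key=scores.get)
    print(top, round(scores[top], 3))
```

Words that appear in every document get an IDF of zero and drop out, while words concentrated in a single document get the highest scores, which is exactly what we want for topic keywords.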
3.5 The Base of Natural Logarithms
The constant e = 2.7182818 · · · has a number of useful special properties. In particular, e is the limit of (1 + 1/x)^x as x goes to infinity.
The constant e is useful in many places; in particular, it lets us simplify certain calculations by approximation (a quick numerical check follows the list below):
1. Consider (1 + a)^b, where a is small. Writing it as ((1 + a)^(1/a))^(ab), and noting that (1 + a)^(1/a) is approximately e when a is small, we can approximate (1 + a)^b as e^(ab).
2. e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + · · ·
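The following is a quick numerical check of both items; the values of a, b, and x are chosen only for illustration.

```python
import math

# Item 1: (1 + a)^b is close to e^(a*b) when a is small (values are illustrative).
a, b = 0.001, 500
print((1 + a) ** b)        # about 1.648
print(math.exp(a * b))     # about 1.649

# Item 2: partial sums of 1 + x + x^2/2! + x^3/3! + ... approach e^x.
x = 1.0
partial = sum(x ** k / math.factorial(k) for k in range(10))
print(partial, math.exp(x))   # both about 2.71828
```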
3.6 Power Laws
There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.
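In formula form, a power law is y = c·x^a, so log y = log c + a·log x, a straight line on log-log axes. The sketch below uses a made-up constant and exponent and recovers the exponent from the slope of that line.

```python
import math

# Made-up power law y = c * x^a with c = 2 and exponent a = -1.5.
c, a = 2.0, -1.5
xs = [1, 2, 4, 8, 16, 32]
ys = [c * x ** a for x in xs]

# On log scales the relationship is linear, so the slope between any
# two points equals the exponent a.
slope = (math.log(ys[-1]) - math.log(ys[0])) / (math.log(xs[-1]) - math.log(xs[0]))
print(round(slope, 3))   # -1.5
```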
This chapter is a survey: it covers what data mining is, the common approaches and ways of thinking about it, the limits of mining techniques, and several basic concepts that recur throughout the book.
This article is excerpted from 博客园 (Cnblogs); original publication date: 2011-07-06.