[Learning AI Together from Zero] 0x01 Machine Learning Basics + First Hands-On Practice

Time: 2023-02-19 12:19:54

From zero to coding up a recommendation system


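Overview of artificial intelligence

Artificial intelligence contains machine learning, which in turn contains deep learning. Applications: network security, transportation networks, social networks…

Example: a small case study, the draw-and-guess game.
Fathers of artificial intelligence: McCarthy, Minsky.
The Dartmouth Conference: the starting point of artificial intelligence.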

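Schools of thought

Symbolism: reasoning
Machine learning: using statistical methods to achieve artificial intelligence
Neural networks and deep learning: learning from images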

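What it can do

Traditional prediction
Image recognition: self-driving cars, face recognition
Natural language processing: sentiment analysis, chatbots, text detection, intelligent customer service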

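What is machine learning

Machine learning automatically analyzes data to obtain a model, then uses that model to make predictions on unknown data.

Machine learning: data -> model -> prediction

Humans: problem -> patterns -> the future
In essence: summarizing patterns from what has been observed.
Examples: recognizing animals, predicting house prices.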

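Dataset composition

Structure: feature values + target values.
Some datasets have no target values. In that case the samples are grouped by similarity (birds of a feather flock together).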

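Categories of machine learning algorithms

Target is a category: classification problem. Algorithms: K-nearest neighbors, Bayesian classification, decision trees and random forests, logistic regression.
Target is continuous data: regression problem. Algorithms: linear regression, ridge regression.
The two cases above are supervised learning.
No target: unsupervised learning. Algorithm: K-means.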
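The three categories above can be sketched side by side on the iris data. This is only an illustration under assumptions of my own: the choice of estimators, the idea of treating petal width as a continuous target, and all variable names are not from the original notes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Classification: the target is a category (species 0, 1, or 2).
classifier = KNeighborsClassifier(n_neighbors=5).fit(X, y)
predicted_class = classifier.predict(X[:1])

# Regression: the target is continuous. Petal width (column 3) is predicted
# from the other three measurements, purely for illustration.
regressor = LinearRegression().fit(X[:, :3], X[:, 3])
predicted_width = regressor.predict(X[:1, :3])

# Clustering: no target is given; K-means groups similar samples on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_labels = kmeans.labels_

print(predicted_class, predicted_width, np.unique(cluster_labels))
```

Only the first two calls to fit receive a target array; K-means sees the features alone, which is what makes it unsupervised.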

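Face recognition: a classification problem

Machine learning development workflow

1. Obtain the data
2. Process the data
3. Feature engineering
4. Train a machine learning algorithm to obtain a model
5. Evaluate the model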
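A minimal sketch of the workflow steps above, using the iris data. The specific choices here (StandardScaler for feature engineering, a 3-neighbor KNeighborsClassifier, a 30% test split, seed 22) are assumptions made for illustration, not steps prescribed by the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Obtain the data
iris = load_iris()

# 2. Process the data: hold part of it out for evaluation
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=22
)

# 3. Feature engineering: standardize with statistics from the training set only
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# 4. Train an algorithm to obtain a model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train_scaled, y_train)

# 5. Evaluate the model on samples it has not seen
accuracy = model.score(x_test_scaled, y_test)
print("Test accuracy:", accuracy)
```

The scaler is fitted on the training set and only applied to the test set, so no information from the held-out samples leaks into training.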

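Learning frameworks and resources

Algorithms are the core; data and computation are the foundation. Know where you stand:

Most of the algorithm design behind complex models is done by algorithm engineers. Regular engineers typically:
analyze large amounts of data
analyze the specific business problem
apply common algorithms
do feature engineering, parameter tuning, and optimization

For now, what matters is mastering some machine learning algorithms and techniques, and approaching problems from within a specific business domain.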

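How to go about it

Introductory and hands-on books: 机器学习 (Machine Learning) by Zhou Zhihua, 统计学习方法 (Statistical Learning Methods) by Li Hang, and Deep Learning (nicknamed the "flower book").

Machine learning libraries and frameworks

PyTorch, Caffe2, Chainer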

Available datasets

Company-internal data (e.g. Baidu), paid data sources, and government-held datasets.

Commonly used while learning: sklearn, Kaggle, UCI

sklearn

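A machine learning toolkit for Python with thorough documentation. It covers classification, regression, clustering, dimensionality reduction, model selection, and feature engineering.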

pip install Scikit-learn
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting Scikit-learn

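Using the built-in datasets

sklearn.datasets
1. load_*: loads a small dataset bundled with the library
2. fetch_*: fetches a large dataset by downloading it from the internet. The first parameter is data_home, the download directory, which defaults to ~/scikit_learn_data/

sklearn.datasets.load_iris(): the iris dataset
sklearn.datasets.load_boston(): the Boston house-price dataset (removed in scikit-learn 1.2)

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='all')
subset can be 'train', 'test', or 'all' (everything)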
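As a quick check of the load_* family described above, the sketch below loads two bundled datasets without any download. load_wine and the return_X_y parameter are not mentioned in the notes; they are included here as an assumption about what is useful to know next.

```python
from sklearn.datasets import load_iris, load_wine

# load_* reads a small dataset that ships with scikit-learn (no download).
iris = load_iris()
print(iris.data.shape)    # (150, 4)

# return_X_y=True skips the Bunch and returns (features, target) directly.
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)   # (178, 13) (178,)
```

The fetch_* functions are left out of this sketch because they need network access on first use.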

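Working with sklearn datasets

Both load_* and fetch_* return a datasets.base.Bunch (sklearn.utils.Bunch in newer releases), which is a dictionary-like object with these keys:

data: the feature array, a 2-D numpy.ndarray of shape [n_samples, n_features]
target: the label array, a 1-D numpy.ndarray of length n_samples
DESCR: a description of the dataset
feature_names: the feature names (absent for the news data, the handwritten digits, and the regression datasets)
target_names: the label names (absent for the regression datasets)

Because Bunch inherits from dict, values can be read dictionary-style as bunch["key"]. It also adds its own attribute-style access: bunch.key.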
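The two access styles described above can be compared directly. This is a small sketch; the variable name same_object is hypothetical and exists only for the demonstration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Dictionary-style and attribute-style access return the same object.
same_object = iris["data"] is iris.data
print(same_object)

# The usual dict methods are available too.
print(sorted(iris.keys()))

# data is 2-D (n_samples, n_features); target is 1-D (n_samples,).
print(iris.data.shape, iris.target.shape)
```

The exact set of keys can differ between scikit-learn versions, so it is safer to inspect keys() than to assume a fixed list.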

from sklearn.datasets import load_iris


def datasets_demo():
    # Load the iris dataset (returns a dictionary-like Bunch)
    iris = load_iris()
    print("Iris dataset:\n", iris)
    print(type(iris))
    # Shape of the feature matrix: (n_samples, n_features)
    print(iris["data"].shape)
    print("Iris target/label values:\n", iris.target)
    print("Iris feature names:\n", iris.feature_names)
    print("Iris target names:\n", iris.target_names)
    print("Iris dataset description:\n", iris.DESCR)


if __name__ == "__main__":
    datasets_demo()

Iris dataset:
 {'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Iris dataset description:
 .. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

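Should all of the data be used to train the model?

No. Hold part of it back to check the model: one part is used for training, the other for testing.

A common split ratio is 7:3.

Splitting the dataset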

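sklearn.model_selection.train_test_split(*arrays, **options)

x: the feature values of the dataset
y: the label values of the dataset
test_size: the size of the test set, usually a float
random_state: the random seed. Different seeds produce different random samples; the same seed always produces the same sample.
return: training features, test features, training labels, test labels (samples are drawn at random by default)

Note: when running comparison experiments, set random_state to a fixed value so that the split stays constant and only the variable under study changes.

The return values are usually unpacked as x_train, x_test, y_train, y_test.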
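The 7:3 ratio and the note on random_state can both be checked with a short sketch. The seed value 22 and the names x_train_again and same_split are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=0.3 gives the 7:3 ratio: 105 training and 45 test samples.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=22
)
print(x_train.shape, x_test.shape)

# Repeating the call with the same seed reproduces the same split.
x_train_again, _, y_train_again, _ = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=22
)
same_split = np.array_equal(x_train, x_train_again) and np.array_equal(
    y_train, y_train_again
)
print(same_split)
```

With random_state left unset, each call would draw a different random split, which makes results hard to compare between runs.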

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def datasets_demo():
    # Load the iris dataset
    iris = load_iris()
    print("Iris feature values:\n", iris.data, iris.data.shape)
    # Hold out 20% of the samples for testing; fix the seed for reproducibility
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=22
    )
    print("Training set feature values:\n", x_train, x_train.shape)


if __name__ == "__main__":
    datasets_demo()

 [6.3 2.5 4.9 1.5]
 [6.1 2.8 4.7 1.2]
 [5.9 3.2 4.8 1.8]
 [5.4 3.9 1.7 0.4]
 [6.  2.2 4.  1. ]
 [6.4 2.8 5.6 2.1]
 [4.8 3.4 1.9 0.2]
 [6.4 3.1 5.5 1.8]
 [5.9 3.  4.2 1.5]
 [6.5 3.  5.5 1.8]
 [6.  2.9 4.5 1.5]
 [5.5 2.4 3.8 1.1]
 [6.2 2.9 4.3 1.3]
 [5.2 4.1 1.5 0.1]
 [5.2 3.4 1.4 0.2]
 [7.7 2.6 6.9 2.3]
 [5.7 2.6 3.5 1. ]
 [4.6 3.4 1.4 0.3]
 [5.8 2.7 4.1 1. ]
 [5.8 2.7 3.9 1.2]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]
 [4.6 3.1 1.5 0.2]
 [5.8 2.8 5.1 2.4]
 [5.1 3.5 1.4 0.3]
 [6.8 3.2 5.9 2.3]
 [4.9 3.1 1.5 0.1]
 [5.5 2.3 4.  1.3]
 [5.1 3.7 1.5 0.4]
 [5.8 2.7 5.1 1.9]
 [6.7 3.1 4.4 1.4]
 [6.8 3.  5.5 2.1]
 [5.2 2.7 3.9 1.4]
 [6.7 3.1 5.6 2.4]
 [5.3 3.7 1.5 0.2]
 [5.  2.  3.5 1. ]
 [6.6 2.9 4.6 1.3]
 [6.  2.7 5.1 1.6]
 [6.3 2.3 4.4 1.3]
 [7.7 3.  6.1 2.3]
 [4.9 3.  1.4 0.2]
 [4.6 3.2 1.4 0.2]
 [6.3 2.7 4.9 1.8]
 [6.6 3.  4.4 1.4]
 [6.9 3.1 4.9 1.5]
 [4.3 3.  1.1 0.1]
 [5.6 2.7 4.2 1.3]
 [4.8 3.4 1.6 0.2]
 [7.6 3.  6.6 2.1]
 [7.7 2.8 6.7 2. ]
 [4.9 2.5 4.5 1.7]
 [6.5 3.2 5.1 2. ]
 [5.1 3.3 1.7 0.5]
 [6.3 2.9 5.6 1.8]
 [6.1 2.6 5.6 1.4]
 [5.  3.4 1.5 0.2]
 [6.1 3.  4.6 1.4]
 [5.6 3.  4.5 1.5]
 [5.1 3.8 1.5 0.3]
 [5.6 2.8 4.9 2. ]
 [4.4 3.  1.3 0.2]
 [5.5 2.4 3.7 1. ]
 [4.7 3.2 1.6 0.2]
 [6.7 3.3 5.7 2.5]
 [5.2 3.5 1.5 0.2]
 [6.4 2.7 5.3 1.9]
 [6.3 2.8 5.1 1.5]
 [4.4 2.9 1.4 0.2]
 [6.1 3.  4.9 1.8]
 [4.9 3.1 1.5 0.2]
 [5.  2.3 3.3 1. ]
 [4.8 3.  1.4 0.3]
 [5.8 4.  1.2 0.2]
 [6.3 3.4 5.6 2.4]
 [5.4 3.  4.5 1.5]
 [7.1 3.  5.9 2.1]
 [6.3 3.3 6.  2.5]
 [5.1 3.8 1.9 0.4]
 [6.4 2.8 5.6 2.2]
 [7.7 3.8 6.7 2.2]] (120, 4)

Process finished with exit code 0

I could be bounded in a nutshell and count myself a king of infinite space.

Special thanks: 木芯工作室, Ivan from Russia
