A First Look at Extracting Text from Images with Python EasyOCR

Date: 2023-09-14 09:10:54

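    After a day at the museum I came home with countless photos. To sort them, I wanted to recognize the text in the pictures and attach the descriptions of each hall and each exhibit.
    Doing this by hand, one photo at a time, is slow, tiring, and frankly exhausting.
    Fortunately, this is the age of modular, low-code tools. Efficiency, performance, UI, and usability can wait; solve the problem first and save some effort and time.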

  • OCR

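    OCR (optical character recognition) is the process by which an electronic device (such as a scanner or digital camera) captures printed or displayed characters and then uses character recognition methods to translate their shapes into computer text. In other words, text material is scanned or photographed into an image file, and OCR then analyzes that image file to obtain the text and layout information.     -- Baidu Baike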
     
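  • EasyOCR

    EasyOCR is a Python module for extracting text from images. It is a general-purpose OCR that can read both natural scene text and dense text in documents. It currently supports more than 80 languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.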
     
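  • Installing the easyOCR module

    Running pip install also installs the libraries that easyocr depends on. The listing below shows the libraries present after installation, including OCR, deep learning (torch), image processing (Pillow), numerical computing (numpy), and others. Note that pip is run from the command line, not from the Python prompt.
pip install easyocr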
Requirement already satisfied: easyocr in c:\python39\lib\site-packages (1.6.2)
Requirement already satisfied: scikit-image in c:\python39\lib\site-packages (from easyocr) (0.19.3)
Requirement already satisfied: Pillow in c:\python39\lib\site-packages (from easyocr) (9.2.0)
Requirement already satisfied: PyYAML in c:\python39\lib\site-packages (from easyocr) (6.0)
Requirement already satisfied: torch in c:\python39\lib\site-packages (from easyocr) (1.13.0)
Requirement already satisfied: pyclipper in c:\python39\lib\site-packages (from easyocr) (1.3.0.post3)
Requirement already satisfied: python-bidi in c:\python39\lib\site-packages (from easyocr) (0.4.2)
Requirement already satisfied: Shapely in c:\python39\lib\site-packages (from easyocr) (1.8.5.post1)
Requirement already satisfied: numpy in c:\python39\lib\site-packages (from easyocr) (1.23.4)
Requirement already satisfied: scipy in c:\python39\lib\site-packages (from easyocr) (1.9.2)
Requirement already satisfied: opencv-python-headless<=4.5.4.60 in c:\python39\lib\site-packages (from easyocr) (4.5.4.60)
Requirement already satisfied: ninja in c:\python39\lib\site-packages (from easyocr) (1.10.2.4)
Requirement already satisfied: torchvision>=0.5 in c:\python39\lib\site-packages (from easyocr) (0.14.0)
Requirement already satisfied: typing-extensions in c:\python39\lib\site-packages (from torchvision>=0.5->easyocr) (4.4.0)
Requirement already satisfied: requests in c:\python39\lib\site-packages (from torchvision>=0.5->easyocr) (2.25.1)
Requirement already satisfied: six in c:\python39\lib\site-packages (from python-bidi->easyocr) (1.16.0)
Requirement already satisfied: networkx>=2.2 in c:\python39\lib\site-packages (from scikit-image->easyocr) (2.8.7)
Requirement already satisfied: PyWavelets>=1.1.1 in c:\python39\lib\site-packages (from scikit-image->easyocr) (1.4.1)
Requirement already satisfied: packaging>=20.0 in c:\python39\lib\site-packages (from scikit-image->easyocr) (21.3)
Requirement already satisfied: imageio>=2.4.1 in c:\python39\lib\site-packages (from scikit-image->easyocr) (2.22.1)
Requirement already satisfied: tifffile>=2019.7.26 in c:\python39\lib\site-packages (from scikit-image->easyocr) (2022.10.10)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\python39\lib\site-packages (from packaging>=20.0->scikit-image->easyocr) (2.4.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (1.26.3)
Requirement already satisfied: idna<3,>=2.5 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\python39\lib\site-packages (from requests->torchvision>=0.5->easyocr) (4.0.0)
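  • Recognizing text in an image

    * Compared with the original text (the test image appears to be a screenshot of the opening paragraph of this post, in Chinese), the recognition rate is acceptable. Typical errors are single look-alike characters (拒 for 拍, 牛 for 中, 下 for 不) and commas read as 。. At least there is no need to copy everything by hand, picture by picture and character by character.
    * Recognition could probably be improved by adjusting the image's contrast, grayscale, font, or display angle (rotation) beforehand. -- not yet tried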
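import easyocr

# Load the Simplified Chinese and English models; gpu=True falls back to the CPU when CUDA is unavailable
reader = easyocr.Reader(['ch_sim','en'], gpu=True)
result = reader.readtext('pic_file.jpg')
# readtext returns a list; print one (box, text, confidence) entry per line
for item in result:
    print(item)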

Output:
CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.

([[12, 0], [292, 0], [292, 24], [12, 24]], '博物馆一日游。拒照片无数。分类整理', 0.5019760698786572)
([[298, 0], [500, 0], [500, 24], [298, 24]], '希望图片牛的文字进行识别', 0.2667440711212794)
([[506, 0], [711, 0], [711, 24], [506, 24]], '加上各展馆。各展品的说明。', 0.48956195253399476)
([[12, 26], [280, 26], [280, 50], [12, 50]], '手工一张张的整理。慢。累。要老命。', 0.443645141397)
([[12, 52], [260, 52], [260, 76], [12, 76]], '还好。模块化。低代码时代。效率', 0.48323813949440303)
([[268, 52], [358, 52], [358, 76], [268, 76]], '性能。界面', 0.7953857046933088)
([[364, 52], [516, 52], [516, 76], [364, 76]], '易用性暂下过多考虑', 0.6913828229274245)
([[522, 52], [612, 52], [612, 76], [522, 76]], '解决问题先', 0.8767933218561421)
([[620, 52], [776, 52], [776, 76], [620, 76]], '省点力气。省点时间。', 0.563630720606001)
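
The output above shows two things worth handling in code: each line of the image was split into several boxes, and some boxes have low confidence. The following is a minimal sketch, assuming the (box, text, confidence) result format shown above. The helper names `group_into_lines` and `low_confidence` and the 0.5 threshold are illustrative choices, not part of EasyOCR, and grouping by an identical top y coordinate only works for a clean screenshot like this one; real photos would need a tolerance.

```python
from collections import defaultdict

# A few entries copied from the output above: (box corners, text, confidence).
sample = [
    ([[12, 0], [292, 0], [292, 24], [12, 24]], '博物馆一日游。拒照片无数。分类整理', 0.5019760698786572),
    ([[298, 0], [500, 0], [500, 24], [298, 24]], '希望图片牛的文字进行识别', 0.2667440711212794),
    ([[12, 26], [280, 26], [280, 50], [12, 50]], '手工一张张的整理。慢。累。要老命。', 0.443645141397),
    ([[522, 52], [612, 52], [612, 76], [522, 76]], '解决问题先', 0.8767933218561421),
]


def group_into_lines(results):
    """Join boxes that share the same top y coordinate, ordered left to right."""
    rows = defaultdict(list)
    for box, text, _conf in results:
        top_left_x, top_left_y = box[0]
        rows[top_left_y].append((top_left_x, text))
    return [''.join(text for _x, text in sorted(rows[y])) for y in sorted(rows)]


def low_confidence(results, threshold=0.5):
    """Return the text fragments whose confidence is below the threshold."""
    return [text for _box, text, conf in results if conf < threshold]


for line in group_into_lines(sample):
    print(line)
print('check by hand:', low_confidence(sample))
```

The fragments returned by `low_confidence` are the ones most worth comparing against the photo by eye.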
  • Notes

    * easyocr.Reader

    Reader(lang_list, gpu=True, model_storage_directory=None, user_network_directory=None, detect_network='craft', recog_network='standard', download_enabled=True, detector=True, recognizer=True, verbose=True, quantize=True, cudnn_benchmark=False)

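    - lang_list: list of language codes to recognize, e.g. ['ch_sim','en']
    - gpu: whether to run on the GPU; if not, the CPU is used -- this seems very resource-hungry: in a quick test on a large batch of images, my PC simply rebooted
    - model_storage_directory: where the model files are stored. The default is ~/.EasyOCR/model, which on my Windows 10 machine is C:\Users\Administrator\.EasyOCR\model
    - detect_network: the text detection model; the model file comes from Jaided AI: EasyOCR model hub
    - download_enabled: whether a missing model may be downloaded automatically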
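
Since gpu=True silently falls back to the CPU when CUDA is missing, it can be clearer to decide the flag explicitly. The following is a minimal sketch, assuming PyTorch is installed as an EasyOCR dependency; `cuda_available` and `make_reader` are hypothetical helper names, not part of EasyOCR.

```python
def cuda_available():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch  # installed alongside easyocr
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def make_reader(languages=('ch_sim', 'en'), model_dir=None):
    """Build a Reader, passing gpu=True only when CUDA is actually available."""
    import easyocr  # imported lazily: loading it pulls in torch and is slow

    return easyocr.Reader(list(languages), gpu=cuda_available(),
                          model_storage_directory=model_dir)


print('CUDA available:', cuda_available())
```

Calling `make_reader()` then behaves like the example earlier in this post, but the GPU choice is visible in the code instead of in a warning message.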

    * reader.readtext()

    readtext(self, image, decoder='greedy', beamWidth=5, batch_size=1, workers=0, allowlist=None, blocklist=None, detail=1, rotation_info=None, paragraph=False, min_size=20, contrast_ths=0.1, adjust_contrast=0.5, filter_ths=0.003, text_threshold=0.7, low_text=0.4, link_threshold=0.4, canvas_size=2560, mag_ratio=1.0, slope_ths=0.1, ycenter_ths=0.5, height_ths=0.5, width_ths=0.5, y_ths=0.5, x_ths=1.0, add_margin=0.1, threshold=0.2, bbox_min_score=0.2, bbox_min_size=3, max_candidates=0, output_format='standard')

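    - Parameter details: not studied yet, to be continued
    - Returns a list of results, each in the form (text box coordinates, text, confidence score)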
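
A few of the parameters in the signature above are easy to try. The sketch below assumes the documented meaning of `detail`, `paragraph`, and `rotation_info`; the two wrapper functions are hypothetical helpers that take an existing Reader, and how much they help on a given photo is untested here.

```python
def read_plain_text(reader, image_path):
    """Return recognized text only, merged into paragraphs.

    detail=0 drops the box coordinates and confidence values;
    paragraph=True asks EasyOCR to merge neighbouring boxes.
    """
    return reader.readtext(image_path, detail=0, paragraph=True)


def read_rotated(reader, image_path):
    """Also try each text box rotated by 90, 180 and 270 degrees."""
    return reader.readtext(image_path, rotation_info=[90, 180, 270])
```

With a reader created as `easyocr.Reader(['ch_sim','en'])`, `read_plain_text(reader, 'pic_file.jpg')` returns a list of strings rather than tuples.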
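  • Problems encountered along the way

    * pip help install shows the options and usage of pip install

    * If pip install responds slowly, try downloading from a mirror in mainland China
      - pip install -i https://pypi.tuna.tsinghua.edu.cn/simple easyocr

    * If access is blocked by HTTP/HTTPS SSL restrictions, the --trusted-host option can be used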
      - --trusted-host <hostname>   Mark this host or host:port pair as trusted, even though it does not have valid or any HTTPS.
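
The two pip options above can be combined. The following is a small sketch that only builds the command line; `pip_install_command` is a hypothetical helper, and the mirror URL is the one quoted above.

```python
import sys
from urllib.parse import urlparse


def pip_install_command(package, index_url, trust_host=False):
    """Build the argument list for `pip install` against a custom index."""
    cmd = [sys.executable, '-m', 'pip', 'install', '-i', index_url]
    if trust_host:
        # --trusted-host takes a host name, not a full URL.
        cmd += ['--trusted-host', urlparse(index_url).hostname]
    return cmd + [package]


cmd = pip_install_command('easyocr', 'https://pypi.tuna.tsinghua.edu.cn/simple', trust_host=True)
print(' '.join(cmd[1:]))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`. Using `sys.executable -m pip` installs into the same interpreter that runs the script.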

    * CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.
      Downloading detection model, please wait. This may take several minutes depending upon your network connection.
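      - This downloads the models EasyOCR needs: the CRAFT text detection model and the recognition models for the languages in the lang_list passed to easyocr.Reader
      - Download location: Jaided AI: EasyOCR model hub
      - If you are not sure which model files are missing, set download_enabled=False and read the error message. In the traceback below, the missing file is ./model\craft_mlt_25k.pth
      - Installation: the download is a *.zip file such as craft_mlt_25k.zip; unzip it and put craft_mlt_25k.pth into the folder set as model_storage_directory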
    ..........
    raise FileNotFoundError("Missing %s and downloads disabled" % detector_path)
    FileNotFoundError: Missing ./model\craft_mlt_25k.pth and downloads disabled
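
The manual steps above (check what is missing, unzip the download into the model folder) can be sketched with the standard library alone. The function names are hypothetical, and the only file name assumed is craft_mlt_25k.pth from the traceback; the recognition model file names depend on the languages used and are not listed here.

```python
import zipfile
from pathlib import Path


def missing_models(model_dir, required=('craft_mlt_25k.pth',)):
    """Return the required model files that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


def install_model_zip(zip_path, model_dir):
    """Extract every .pth file from a downloaded model zip into model_dir."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith('.pth'):
                # Write by base name so files land directly in model_dir.
                target = model_dir / Path(member).name
                target.write_bytes(archive.read(member))
                extracted.append(target.name)
    return extracted
```

For example, `install_model_zip('craft_mlt_25k.zip', './model')` followed by `missing_models('./model')` should return an empty list once the detection model is in place.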
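    * About CUDA
      - CUDA (Compute Unified Device Architecture) is the computing platform and parallel computing architecture from the GPU vendor NVIDIA that lets the GPU (graphics processing unit) solve complex computational problems.
      - Download location: CUDA Toolkit Archive | NVIDIA Developer
      - Check the CUDA version: in CMD, run nvidia-smi (the version you install must not be higher than the version it reports for the hardware)
      - Check the installed CUDA toolkit: in CMD, run nvcc -V

      - Usage details: not studied yet, to be continued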
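
The two CUDA checks above can also be run from Python. This is a minimal sketch using only the standard library; `tool_output` and `cuda_report` are hypothetical helper names, and a missing tool is reported as None rather than raising.

```python
import shutil
import subprocess


def tool_output(command, args):
    """Run a command-line tool and return its output, or None if it is not on PATH."""
    exe = shutil.which(command)
    if exe is None:
        return None
    completed = subprocess.run([exe, *args], capture_output=True, text=True)
    return completed.stdout or completed.stderr


def cuda_report():
    """Collect the output of the two commands mentioned above."""
    return {
        'nvidia-smi': tool_output('nvidia-smi', []),
        'nvcc -V': tool_output('nvcc', ['-V']),
    }
```

Printing `cuda_report()` on a machine without an NVIDIA driver or toolkit shows None for both entries, which matches the "CUDA not available" message from EasyOCR.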

    * AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeline' (most likely due to a circular import)
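      - The opencv-python-headless version does not match
      - Uninstall it with pip uninstall, then reinstall it with pip install, respecting the version constraint that easyocr reports: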
        opencv-python-headless<=4.5.4.60 in c:\python39\lib\site-packages (from easyocr) (4.5.4.60)

    * WARNING: Ignoring invalid distribution -pencv-python-headless (python_install_path\lib\site-packages)
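      - This is a temporary file left behind when installing opencv-python-headless failed (usually an entry whose name starts with ~); location: python_install_path\lib\site-packages
      - Fix: find the entry in the Python installation's lib\site-packages folder, delete it, and reinstall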
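
Finding the leftover entry by hand is easy to get wrong in a large site-packages folder. The sketch below only lists candidates and deletes nothing. It assumes the leftovers are the entries whose names start with `~`, which is how pip commonly marks a half-removed package; `find_leftovers` and `report_leftovers` are hypothetical names.

```python
import site
from pathlib import Path


def find_leftovers(site_packages_dir):
    """List entries whose name starts with '~' in the given directory."""
    return sorted(p.name for p in Path(site_packages_dir).iterdir()
                  if p.name.startswith('~'))


def report_leftovers():
    """Print leftovers for each site-packages directory of the running interpreter."""
    for directory in site.getsitepackages():
        if Path(directory).is_dir():
            print(directory, find_leftovers(directory))
```

Review the printed names before deleting anything, then reinstall the package.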

    * ERROR: Could not install packages due to an OSError: [WinError 5] 拒绝访问。: '%APPDATA%\Python\..........'
      Consider using the `--user` option or check the permissions.
      - The message 拒绝访问 means "Access is denied". Use the --user option, e.g. pip install --user *********************
      - From the pip help: --user  Install to the Python user install directory for your platform. Typically ~/.local/, or %APPDATA%\Python on Windows. (See the Python documentation for site.USER_BASE for full details.)
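
To see where --user would install on a particular machine, the `site` module referenced in the help text can be queried directly. A minimal sketch; `user_install_dirs` is a hypothetical helper name.

```python
import site


def user_install_dirs():
    """Return the per-user base directory and its site-packages directory."""
    return {
        'user_base': site.getuserbase(),
        'user_site': site.getusersitepackages(),
    }


for name, path in user_install_dirs().items():
    print(name, '->', path)
```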

    * cv.gapi.wip.GStreamerPipeline = cv.gapi_wip_gst_GStreamerPipeline
       AttributeError: partially initialized module 'cv2' has no attribute 'gapi_wip_gst_GStreamerPipeline' (most likely due to a circular import)
      - The versions of opencv-python and opencv-python-headless are inconsistent
      - Fix: confirm the installed module versions, uninstall, then reinstall the specified version
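
Confirming the installed versions can be done without importing cv2, which is useful when the import itself fails. The following is a sketch using `importlib.metadata` (Python 3.8+); the helper names are hypothetical, and "conflict" here simply means that the two packages are installed at different versions.

```python
from importlib import metadata


def installed_versions(names=('opencv-python', 'opencv-python-headless')):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def versions_conflict(versions):
    """True when more than one distinct version is installed among the given packages."""
    found = {v for v in versions.values() if v is not None}
    return len(found) > 1


current = installed_versions()
print(current, 'conflict:', versions_conflict(current))
```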

  • Appendix, sample code: recognize the text in every jpg image in the current folder and write it to GetText.txt

    import glob
    import os

    import easyocr

    # Load the models from ./model only; with download_enabled=False a missing
    # model file raises FileNotFoundError instead of being downloaded.
    reader = easyocr.Reader(['ch_sim','en'], gpu=True, model_storage_directory='./model',
                            verbose=True, download_enabled=False)

    out_path = "./GetText.txt"
    # Start from a clean output file on every run.
    if os.path.exists(out_path):
        os.remove(out_path)

    for f in sorted(glob.glob('./*.jpg')):
        # basename is portable; the original split on '\\' only worked on Windows
        name = os.path.basename(f)
        result = reader.readtext(f)

        print("################ ", name, " ################")
        temp = ""
        for i in result:
            temp = temp + i[1]  # i is (text box coordinates, text, confidence)
            print(i)

        # Append this image's text under a header line with the file name.
        with open(out_path, "a", encoding='utf-8') as fp:
            fp.write("################ " + name + " ################\n")
            fp.write(temp)
            fp.write("\n\n\n")

References:

  1. Introduction to and practice of OCR text recognition with the EasyOCR library - Tencent Cloud Developer Community
  2. Jaided AI: EasyOCR model hub
  3. Jaided AI: EasyOCR tutorial
  4. PyTorch