

2025/1/8 0:50:26  Source: https://blog.csdn.net/fareast_mzh/article/details/144886287

Tesseract Installation

  • ocr2.py
#!/usr/bin/python3
# 
# Python OCR PDF Extraction
# https://github.com/tesseract-ocr/tesseract
#
# sudo apt install tesseract-ocr
# sudo apt install libtesseract-dev
# pip install pytesseract PyPDF2 pdfplumber opencv-python pillow
# pip install pdf2image
# sudo apt-get install poppler-utils
# sudo apt-get install tesseract-ocr-chi-sim  # Simplified Chinese
# sudo apt-get install tesseract-ocr-chi-tra  # Traditional Chinese
# tesseract --list-langs

import pytesseract
from pdf2image import convert_from_path
from PyPDF2 import PdfReader
import cv2
import numpy as np
from PIL import Image

# Path to Tesseract executable (update to match your system)
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'


def preprocess_image(pil_image):
    """
    Preprocesses an image for OCR using OpenCV.
    Converts to grayscale, applies thresholding.
    """
    # Convert PIL image to OpenCV format
    open_cv_image = np.array(pil_image)
    # Convert RGB to BGR (OpenCV default format)
    open_cv_image = cv2.cvtColor(open_cv_image, cv2.COLOR_RGB2BGR)
    # Convert to grayscale
    gray_image = cv2.cvtColor(open_cv_image, cv2.COLOR_BGR2GRAY)
    # Apply binary thresholding
    _, thresh_image = cv2.threshold(gray_image, 128, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return thresh_image


def extract_text_from_pdf(pdf_path):
    # First try extracting text from the PDF directly
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    # If no text is extracted, assume it's a scanned PDF and use OCR
    if not text.strip():
        images = convert_from_path(pdf_path)
        for image in images:
            # Preprocess image for better OCR results
            preprocessed_image = preprocess_image(image)
            # Convert OpenCV image back to PIL format for Tesseract
            pil_image = Image.fromarray(preprocessed_image)
            # Perform OCR
            text += pytesseract.image_to_string(pil_image, lang='chi_sim')
    return text


# Example usage
pdf_path = "scan_2025-01-02_09.31.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)
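
The script hard-codes lang='chi_sim', so OCR fails if that language pack is not installed. The header comments suggest running `tesseract --list-langs` to check. A minimal sketch of doing the same check from Python follows; the helper names `parse_langs` and `installed_langs` are hypothetical (not part of the original script), and the header-line format is assumed from typical Tesseract output.

```python
import shutil
import subprocess


def parse_langs(output):
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # 'List of available languages in "/usr/share/.../tessdata/" (3):'
    if lines and lines[0].lower().startswith("list of available languages"):
        lines = lines[1:]
    return lines


def installed_langs(tesseract_cmd="tesseract"):
    """Return installed language codes, or [] if the binary cannot be found."""
    if shutil.which(tesseract_cmd) is None:
        return []
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds may write the list to stderr instead of stdout,
    # so fall back to stderr when stdout is empty.
    return parse_langs(result.stdout or result.stderr)
```

Calling `installed_langs('/usr/bin/tesseract')` before `extract_text_from_pdf` and checking that `'chi_sim'` is in the result gives an early, readable error instead of a failure in the middle of the OCR loop.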
