Convert PDF into TXT

olivalej · April 26, 2023, 4:29pm

Good day community,

I'm trying to write some code to convert a PDF to text, but the result is not what I expected. I have tried different libraries such as pytesseract, pdfminer, pdftotext, pdf2image, and OpenCV, but all of them extract the text incompletely or with errors. The last two pieces of code I used are these:

CODE 1
import pytesseract
from pdf2image import convert_from_path

# Configure pytesseract
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"
pytesseract.pytesseract.tessdata_dir_config = '/usr/share/tesseract-ocr/4.00/tessdata'

# Path to the PDF file
pdf_path = "/content/drive/MyDrive/PDF/file.pdf"  # Be sure to change 'file.pdf' to the real name of your file

# Convert the PDF to high-quality images
images = convert_from_path(pdf_path, dpi=300, fmt="PNG", thread_count=4)

# Extract text from the images
texts = [pytesseract.image_to_string(img, lang="eng", config="--oem 1 --psm 11") for img in images]

# Print the extracted text
for i, text in enumerate(texts):
    print(f"Text from page {i + 1}:\n{text}\n")

CODE 2
from pdfminer.high_level import extract_text

def convert_pdf_to_txt(path):
    text = extract_text(path)
    return text

# Change the file path to match the location of your PDF file
pdf_path = '/content/drive/MyDrive/PDF/file.pdf'

# Convert the PDF to text
texto = convert_pdf_to_txt(pdf_path)

# Print the text to the console
print(texto)

However, when I use online PDF to text converters, the conversion comes out very well, almost perfect, without the errors that I encounter in both codes. Here I attach the PDF that I want to convert to text and the results that I get from both codes when I try to convert my file.
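
Both snippets only print the extracted text to the console. To compare their output with what an online converter produces, it may help to save each result to a .txt file. Below is a minimal sketch of one way to do that, using only the standard library. The helper name `save_pages_as_txt`, the page-marker format, and the output file names are hypothetical and not part of either library.

```python
from pathlib import Path
from typing import Iterable


def save_pages_as_txt(pages: Iterable[str], out_path: str) -> int:
    """Write each page's text to out_path, preceded by a page marker.

    Hypothetical helper for illustration; returns the number of pages written.
    """
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for count, page_text in enumerate(pages, start=1):
            # A simple marker makes it easier to find where each page starts.
            fh.write(f"--- Page {count} ---\n")
            # Trim trailing whitespace, then leave one blank line between pages.
            fh.write(page_text.rstrip() + "\n\n")
    return count
```

With Code 1 this could be called as `save_pages_as_txt(texts, "code1_output.txt")`. With Code 2, which returns a single string, `save_pages_as_txt([texto], "code2_output.txt")` would write the whole document as one "page".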
