Convert PDF into TXT

Good day community,

I'm trying to write some code to convert a PDF to text, but the result is not what I expected. I have tried several libraries, including pytesseract, pdfminer, pdftotext, pdf2image, and OpenCV, but all of them extract the text incompletely or with errors. These are the last two scripts I tried:

import pytesseract
from pdf2image import convert_from_path

# Configure pytesseract
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"
pytesseract.pytesseract.tessdata_dir_config = "/usr/share/tesseract-ocr/4.00/tessdata"

# Path to the PDF file
pdf_path = "/content/drive/MyDrive/PDF/file.pdf"  # Change this to the actual name of your file

# Convert the PDF to high-quality images
images = convert_from_path(pdf_path, dpi=300, fmt="PNG", thread_count=4)

# Extract text from the images
texts = [pytesseract.image_to_string(img, lang="eng", config="--oem 1 --psm 11") for img in images]

# Print the extracted text
for i, text in enumerate(texts):
    print(f"Text of page {i + 1}:\n{text}\n")

from pdfminer.high_level import extract_text


def convert_pdf_to_txt(path):
    text = extract_text(path)
    return text

# Change the path to match the location of your PDF file
pdf_path = '/content/drive/MyDrive/PDF/file.pdf'

# Convert the PDF to text
texto = convert_pdf_to_txt(pdf_path)

# Print the text to the console
print(texto)


However, online PDF-to-text converters handle the same file very well, almost perfectly, without the errors I get from both scripts. I have attached the PDF I want to convert, along with the output each script produces for it.
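
Since the end goal is a .txt file rather than console output, here is a minimal sketch of how the extracted strings could be written to disk. The helper name `save_pages_to_txt`, the page-separator format, and the `output.txt` path are assumptions made for illustration and are not part of either script; it only uses the Python standard library.

```python
from pathlib import Path


def save_pages_to_txt(pages, out_path):
    """Write a list of per-page strings to a single UTF-8 text file.

    Hypothetical helper: the name and separator format are illustrative.
    """
    chunks = []
    for number, page_text in enumerate(pages, start=1):
        # Label each page so missing or garbled passages are easy to locate.
        chunks.append(f"--- Page {number} ---\n{page_text.strip()}\n")
    out_file = Path(out_path)
    out_file.write_text("\n".join(chunks), encoding="utf-8")
    return out_file


if __name__ == "__main__":
    # Stand-in strings; in practice this would be the list from the OCR script.
    sample_pages = ["First page text", "Second page text"]
    print(save_pages_to_txt(sample_pages, "output.txt"))
```

With the OCR script, the list of page strings could be passed in directly; with the pdfminer script, which returns one string for the whole document, it would presumably be wrapped in a one-element list first.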