Usage

Convert PDF to image

form_analyzer.pdf_to_image(folder: str, dpi: int = 400, poppler_path: Optional[str] = None, image_processor: Callable[[int, PIL.Image.Image], List[form_analyzer.conversion.ProcessedImage]] = <function <lambda>>)

Converts PDF files in a folder to PNG images.

PDF files are converted page by page using pdf2image. Each generated page image can optionally be passed to a function that processes it further (e.g. splits or crops it). Additionally, an extension for the resulting file name can be specified, which can be used to reorder the pages of a PDF.

Parameters
  • folder – The folder containing the PDF files.

  • dpi – Resolution in DPI to use for image generation. Higher values produce larger images. Defaults to 400.

  • poppler_path – Path to a poppler installation, required on Windows.

  • image_processor – A function that takes an image index and an image and returns a list of ProcessedImage.
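
A call that uses only the parameters documented above might look like the following minimal sketch. The helper names build_pdf_to_image_kwargs and convert_folder are illustrative assumptions and not part of form_analyzer. The sketch passes poppler_path only when it is needed or given, following the note that the parameter is required on Windows, and leaves image_processor at its default.

```python
import sys
from typing import Any, Dict, Optional


def build_pdf_to_image_kwargs(
    folder: str,
    dpi: int = 400,
    poppler_path: Optional[str] = None,
    platform: str = sys.platform,
) -> Dict[str, Any]:
    """Collect keyword arguments for form_analyzer.pdf_to_image (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"folder": folder, "dpi": dpi}
    if platform.startswith("win"):
        # The documentation states that poppler_path is required on Windows.
        if poppler_path is None:
            raise ValueError("poppler_path must be set on Windows")
        kwargs["poppler_path"] = poppler_path
    elif poppler_path is not None:
        # On other platforms the path is optional and only passed when given.
        kwargs["poppler_path"] = poppler_path
    return kwargs


def convert_folder(folder: str, **options: Any) -> None:
    """Convert every PDF in `folder` to PNG images (hypothetical wrapper)."""
    # Imported lazily so the helper above can be used without the package.
    import form_analyzer

    form_analyzer.pdf_to_image(**build_pdf_to_image_kwargs(folder, **options))
```

With these helpers, convert_folder("scans", dpi=200) would convert the PDFs in a folder named scans at a lower resolution than the default.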

Run AWS Textract

form_analyzer.run_textract(folder: str, aws_region_name: Optional[str] = None, aws_access_key_id: Optional[str] = None, aws_secret_access_key: Optional[str] = None, s3_bucket_name: Optional[str] = None, s3_folder: str = '')

Runs AWS Textract on all PNG files in a folder.

The function can either upload all files to an S3 bucket and process them from there or upload them directly to Textract. The analysis results are saved as JSON files. If a result JSON already exists for a PNG file, it will not be analyzed again.

Parameters
  • folder – Folder containing the PNG files

  • aws_region_name – Optional AWS region name

  • aws_access_key_id – Optional AWS access key ID

  • aws_secret_access_key – Optional AWS secret access key

  • s3_bucket_name – Optional S3 bucket name. If given, the function uploads the files to S3 and processes them from there.

  • s3_folder – S3 bucket folder name, defaults to ''
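
The two upload modes and the skip-if-already-analyzed behaviour described above can be sketched as follows. The function names pending_pngs and analyze_folder are hypothetical, as is the S3 folder name "forms". The sketch also assumes that the result for page.png is saved next to it as page.json, which the documentation does not state explicitly.

```python
from pathlib import Path
from typing import List, Optional


def pending_pngs(folder: str) -> List[Path]:
    """Return PNG files in `folder` that have no result JSON next to them.

    Assumption: the result for `page.png` is stored as `page.json`.
    """
    root = Path(folder)
    return sorted(
        png for png in root.glob("*.png") if not png.with_suffix(".json").exists()
    )


def analyze_folder(folder: str, s3_bucket_name: Optional[str] = None) -> None:
    """Run Textract on `folder`, directly or through S3 (hypothetical wrapper)."""
    # Imported lazily so pending_pngs can be used without the package.
    import form_analyzer

    if s3_bucket_name is None:
        # No bucket given: the files are uploaded directly to Textract.
        form_analyzer.run_textract(folder)
    else:
        # Bucket given: the files are uploaded to S3 and processed from there.
        form_analyzer.run_textract(
            folder, s3_bucket_name=s3_bucket_name, s3_folder="forms"
        )
```

Calling pending_pngs before analyze_folder gives a rough preview of how many pages a run would send to Textract. The credential parameters are left at their default of None here.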

Analyze form

form_analyzer.analyze(form_folder: str, form_description_module_name: str, excel_file_name: str = 'results')

Analyzes the AWS Textract results in a folder based on a given form description and writes the results to an Excel file.

Parameters
  • form_folder – Folder with the AWS Textract result files

  • form_description_module_name – Name of the form description Python module

  • excel_file_name – Name of the result Excel file, defaults to 'results'
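
Because the form description is passed by module name, a misspelled name would only surface inside analyze. The following sketch checks the name up front. It assumes that the module has to be importable from the current Python path, which is not confirmed by the documentation above. The names module_is_importable and analyze_forms are illustrative and not part of form_analyzer.

```python
import importlib.util


def module_is_importable(name: str) -> bool:
    """Return True if a module called `name` can be located on the Python path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for example when the parent package of a dotted name is missing.
        return False


def analyze_forms(
    form_folder: str,
    form_description_module_name: str,
    excel_file_name: str = "results",
) -> None:
    """Check the form description module, then run the analysis (hypothetical wrapper)."""
    if not module_is_importable(form_description_module_name):
        raise ModuleNotFoundError(
            f"Form description module {form_description_module_name!r} not found"
        )
    # Imported lazily so the check above can be used without the package.
    import form_analyzer

    form_analyzer.analyze(form_folder, form_description_module_name, excel_file_name)
```

A call such as analyze_forms("scans", "my_form") would then fail early with a clear message if my_form (a hypothetical module name) cannot be found.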

Form description types

form_analyzer.FormFields

alias of List[FormField]

class form_analyzer.FormField(title: str, selector: form_analyzer.selectors.base.Selector)