Usage
Convert PDF to image
- form_analyzer.pdf_to_image(folder: str, dpi: int = 400, poppler_path: Optional[str] = None, image_processor: Callable[[int, Image], List[ProcessedImage]] = <function <lambda>>)
Converts PDF files in a folder to PNG images.
PDF files are converted page by page using pdf2image. Each generated page image can optionally be passed to a function that processes it further (e.g. splits or crops it). Additionally, an extension for the resulting file name can be supplied, which can be used to reorder the pages of a PDF.
- Parameters
folder – The folder containing the PDF files.
dpi – DPI to use for image generation. Higher values produce larger images. Defaults to 400.
poppler_path – Path to a poppler installation; required on Windows.
image_processor – A function that takes an image index and an image and returns a list of ProcessedImage.
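A minimal sketch of a call that uses only the parameters documented above. Because poppler_path is required on Windows, the helper passes a path only on that platform. The names poppler_path_for and convert_forms and the path C:/poppler/bin are hypothetical and not part of form_analyzer.

```python
import sys
from typing import Optional


def poppler_path_for(platform: str, windows_poppler: str) -> Optional[str]:
    """Return the poppler path on Windows, None on every other platform."""
    # sys.platform is "win32" on Windows, regardless of bitness.
    return windows_poppler if platform.startswith("win") else None


def convert_forms(folder: str, windows_poppler: str = "C:/poppler/bin") -> None:
    """Convert all PDFs in `folder` to PNG images at a reduced DPI."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from form_analyzer import pdf_to_image

    pdf_to_image(
        folder,
        dpi=300,  # lower than the default of 400, giving smaller images
        poppler_path=poppler_path_for(sys.platform, windows_poppler),
    )
```

Calling convert_forms('questionnaires') would then convert every PDF in a (hypothetical) folder named questionnaires, using the default image_processor.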
Run AWS Textract
- form_analyzer.run_textract(folder: str, aws_region_name: Optional[str] = None, aws_access_key_id: Optional[str] = None, aws_secret_access_key: Optional[str] = None, s3_bucket_name: Optional[str] = None, s3_folder: str = '')
Run AWS Textract on all PNG files in a folder.
The function can either upload all files to an S3 bucket and process them from there or upload them directly to Textract. The analysis results are saved as JSON files. If a result JSON already exists for a PNG file, it will not be analyzed again.
- Parameters
folder – PNG folder name
aws_region_name – Optional AWS region name
aws_access_key_id – Optional AWS access key ID
aws_secret_access_key – Optional AWS secret access key
s3_bucket_name – Optional S3 bucket name. If given, the function uploads the files to S3.
s3_folder – S3 bucket folder name, defaults to ''
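The optional credentials could, for example, be read from environment variables rather than hard-coded. The sketch below assumes the conventional AWS variable names AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; the variable FORM_S3_BUCKET and the helper names textract_kwargs and analyze_pngs are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def textract_kwargs(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Map environment variables to run_textract keyword arguments.

    Missing variables become None, which matches the documented defaults.
    """
    return {
        "aws_region_name": env.get("AWS_DEFAULT_REGION"),
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
        # FORM_S3_BUCKET is a made-up name for this sketch; leave it unset
        # to skip the S3 upload path.
        "s3_bucket_name": env.get("FORM_S3_BUCKET"),
    }


def analyze_pngs(folder: str) -> None:
    """Run Textract on the PNG files in `folder`."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from form_analyzer import run_textract

    run_textract(folder, **textract_kwargs(os.environ))
```

Since existing result JSON files are skipped, calling analyze_pngs again on the same folder should only process PNG files that have no result yet.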
Analyze form
- form_analyzer.analyze(form_folder: str, form_description_module_name: str, excel_file_name: str = 'results')
Analyzes the AWS Textract results in a folder based on a given form description and writes the results to an Excel file.
- Parameters
form_folder – Folder with the AWS Textract result files
form_description_module_name – Name of the form description Python module
excel_file_name – Name of the result Excel file, defaults to 'results'
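The three functions can be chained into one pipeline. The following is only a sketch: it assumes that the generated PNG files and the Textract result files end up in the same folder as the PDF files, which this section does not state explicitly. The name process_folder, its api parameter (used to substitute a stand-in for testing) and the module name my_form are hypothetical.

```python
from typing import Any, Optional


def process_folder(
    folder: str,
    form_description_module_name: str,
    excel_file_name: str = "results",
    api: Optional[Any] = None,
) -> None:
    """Convert PDFs, run Textract and write the Excel file for one folder."""
    if api is None:
        # Imported lazily so this sketch can be loaded without the package installed.
        import form_analyzer as api

    # Assumption: every step reads from and writes to the same folder.
    api.pdf_to_image(folder)
    api.run_textract(folder)
    api.analyze(folder, form_description_module_name, excel_file_name)
```

A call such as process_folder('questionnaires', 'my_form') would run all three steps with their default settings.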