How It Works¶

IQDM-PDF uses pdfminer.six to extract text and coordinates from IMRT QA PDF files.

Step 1: Match Report Parser¶

Each report parser has an identifiers property which contain words and phrases used to uniquely pair a PDF to a report parser. If all of the identifiers are found in the PDF text, that report parser will be selected.

Step 2: Parse Data by Text Box Coordinates¶

The text data is collected with the selected report parser, which is stored by page and bounding box coordinates. Report parsers can look up a text value by page and coordinate.

Step 3: Apply Template¶

Unless customized logic is needed, a GenericParser class can be used, which reads in a JSON file containing three keys: report type, identifiers, and data. Required keys of data are column, page, and pos. For further customization, see the get_block_data function documentation in CustomPDFReader. All keys from data (except column) are passed.

Check out the report templates on GitHub for examples.

In the simplest case, a report parser class looks something like the following [source]:

class SNCPatientReport2020(GenericReport):
    """SNCPatientReport parser for the new format released in 2020"""

    def __init__(self):
        """Initialization of a SNCPatientReport class"""
        template = join(DIRECTORIES["REPORT_TEMPLATES"], "sncpatient2020.json")
        GenericReport.__init__(self, template)

Then update the REPORT_CLASSES list in parser.py to include the new report parser class.

Step 4: Iterate¶

From the command-line, you can iterate over all files in a provided directory, and save the results into a CSV file per vendor/template:

$ iqdmpdf your/initial/dir

Or from a python console:

>>> from IQDMPDF.file_processor import process_files
>>> process_files("your/initial/dir")

Non-Template Based Parsing¶

If the data in the reports have varying coordinates, the code needs more customization. See the Delta4 report parser for examples/inspiration.

Building a New Template¶

Currently, building a new JSON template requires some python scripting to determine coordinates. The output from the following code will show all text bounding box coordinates and contents.

>>> from IQDMPDF.pdf_reader import CustomPDFReader
>>> data = CustomPDFReader("path/to/report.pdf")
>>> print(data)

Below is a sample of the output from: example_reports/sncpatient/UChicago/DCAM_example_1.pdf

page_index: 0, data_index: 21
bbox: [6.24, 445.18, 140.33, 463.88]
Absolute Dose Comparison
Difference (%)

page_index: 0, data_index: 22
bbox: [79.2, 445.18, 88.84, 452.14]
 : 2

page_index: 0, data_index: 23
bbox: [6.24, 432.94, 51.47, 439.9]
Distance (mm)

page_index: 0, data_index: 24
bbox: [79.2, 432.94, 88.84, 439.9]
 : 2

page_index: 0, data_index: 25
bbox: [6.24, 420.7, 49.8, 427.66]
Threshold (%)

page_index: 0, data_index: 26
bbox: [79.2, 420.7, 98.37, 427.66]
 : 10.0

The data object in the resulting JSON file for this data would look like:

[
    {"column": "Difference (%)", "page": 0, "pos": [79.2, 441.02]},
    {"column": "Distance (mm)", "page": 0, "pos": [79.2, 432.94]},
    {"column": "Threshold (%)", "page": 0, "pos": [79.2, 420.7]}
]

Note that the value for column doesn’t need to match any text in the PDF.

The pos element is assumed to be the bottom left corner of the bounding box by default. If the PDF layout has centered or right-aligned elements, you can specify mode to be any combination of bottom/center/top and left/center/right. (e.g., top-right or center-left; center is equivalent to center-center).

For example, if an element is more consistently found at the center of a bounding box, the data element could look like:

{
  "column": "Difference (%)",
  "page": 0,
  "pos": [88.79, 424.18],
  "mode": "center"
}