---
module: Invoice Parsing
date: 2026-02-07
problem_type: integration_failure
component: pdf_template_parser
symptoms:
  - "Bonanza Produce multi-invoice statement (13595522.pdf) fails to parse correctly"
  - "Single invoice template extracts only one invoice instead of four"
  - "Multi-invoice statement lacks I/L markers present in single invoices"
  - "Customer identifier extraction pattern requires different regex for statements"
root_cause: template_inadequate
resolution_type: template_fix
severity: high
tags: [pdf, parsing, invoice, bonanza-produce, multi-invoice, integration]
---

# Bonanza Produce Multi-Invoice Statement Template Fix

## Problem

Bonanza Produce sends two different invoice formats:

1. **Single invoices** (e.g., 03881260.pdf) with I/L markers and a specific layout
2. **Multi-invoice statements** (e.g., 13595522.pdf) containing 4 invoices per page

The single invoice template failed to parse multi-invoice statements because:

- Multi-invoice statements lack the I/L (Invoice/Location) markers used in single invoice templates
- The layout structure is completely different, with invoices listed as table rows instead of distinct sections
- Customer identifier extraction requires a different regex pattern

## Environment

- Component: PDF Template Parser (Clojure)
- Date: 2026-02-07
- Test File: `test/clj/auto_ap/parse/templates_test.clj`
- Template File: `src/clj/auto_ap/parse/templates.clj`
- Test Document: `dev-resources/13595522.pdf` (4 invoices on a single page)

## Symptoms

- Single invoice template parses only the first invoice from a multi-invoice statement
- Parse returns a single result instead of 4 separate invoice records
- `:customer-identifier` extraction returns empty or incorrect values for statements
- Test `parse-bonanza-produce-statement-13595522` expects 4 results but receives 1

## What Didn't Work

**Attempted Solution 1: Reuse single invoice template with `:multi` flag**

- Added `:multi #"\n"` and a `:multi-match?` pattern to the existing single invoice template
- **Why it failed:** The single invoice template's regex patterns (e.g., `I\s+([A-Z][A-Z\s]+?)\s{2,}.*?L\s+`) expect I/L markers that don't exist in multi-invoice statements. The layout structure is fundamentally different.

**Attempted Solution 2: Using simpler customer identifier pattern**

- Tried pattern `#"(.*?)\s+RETURN"` extracted from multi-invoice statement text
- **Why it failed:** This pattern alone doesn't account for the statement's column-based layout. It has to be combined with the `:multi` and `:multi-match?` flags to parse multiple invoices.

## Solution

Added a dedicated multi-invoice template that:

1. Uses different keywords to identify multi-invoice statements
2. Employs `:multi` and `:multi-match?` flags for multiple invoice extraction
3. Uses simpler regex patterns suitable for the statement layout

**Implementation:**

```clojure
;; Bonanza Produce Statement (multi-invoice)
{:vendor "Bonanza Produce"
 :keywords [#"The perishable agricultural commodities" #"SPARKS, NEVADA"]
 :extract {:invoice-number #"^\s+[0-9]{2}/[0-9]{2}/[0-9]{2}\s+([0-9]+)\s+INVOICE"
           :customer-identifier #"(.*?)\s+RETURN"
           :date #"^\s+([0-9]{2}/[0-9]{2}/[0-9]{2})"
           :total #"^\s+[0-9]{2}/[0-9]{2}/[0-9]{2}\s+[0-9]+\s+INVOICE\s+([\d.]+)"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}
 :multi #"\n"
 :multi-match? #"\s+[0-9]{2}/[0-9]{2}/[0-9]{2}\s+[0-9]+\s+INVOICE"}
```

**Key differences from single invoice template:**

- `:keywords`: Look for statement header text instead of phone number
- `:customer-identifier`: Pattern `#"(.*?)\s+RETURN"` works for statement format
- `:multi #"\n"`: Split the document text on newline boundaries
- `:multi-match?`: Match invoice header pattern to identify individual invoices
- No I/L markers: Patterns scan from left margin without location markers

## Why This Works

1. **Statement-specific keywords:** "The perishable agricultural commodities" and "SPARKS, NEVADA" uniquely identify multi-invoice statements vs. single invoices (which have phone number 530-544-4136)
2. **Multi-flag parsing:** The `:multi` and `:multi-match?` flags tell the parser to split the document on newlines and identify individual invoices using the date/invoice-number pattern, rather than treating the whole page as one invoice
3. **Simplified patterns:** Without I/L markers, patterns scan from line start (`^\s+`) and extract columns based on whitespace positions. The `:customer-identifier` pattern `(.*?)\s+RETURN` captures everything before "RETURN" on each line
4. **Separate templates:** Having distinct templates for single invoices vs. statements prevents conflict and allows optimization for each format

## Prevention

**When adding templates for vendors with multiple document formats:**

1. **Create separate templates:** Don't try to make one template handle both formats. Use distinct keywords to identify each format
2. **Test both single and multi-invoice documents:** Ensure templates parse the expected number of invoices:

   ```clojure
   (is (= 4 (count results)) "Should parse 4 invoices from statement")
   ```

3. **Verify `:multi` usage:** Multi-invoice templates should have both `:multi` and `:multi-match?` flags:

   ```clojure
   :multi #"\n"
   :multi-match? #"\s+[0-9]{2}/[0-9]{2}/[0-9]{2}\s+[0-9]+\s+INVOICE"
   ```

4. **Check pattern scope:** Multi-invoice statements often lack structural markers (I/L), so patterns should:
   - Use `^\s+` to anchor at line start
   - Extract from whitespace-separated columns
   - Avoid patterns requiring specific markers
5. **Run all template tests:** Before committing, run:

   ```bash
   lein test auto-ap.parse.templates-test
   ```

## Related Issues

- Single invoice template: `src/clj/auto_ap/parse/templates.clj` lines 756-765
- Similar multi-invoice patterns: Search for `:multi` and `:multi-match?` in `src/clj/auto_ap/parse/templates.clj`

## Key Files

- **Tests:** `test/clj/auto_ap/parse/templates_test.clj` (lines 36-53)
- **Template:** `src/clj/auto_ap/parse/templates.clj` (lines 767-777)
- **Test document:** `dev-resources/13595522.pdf`
- **Template parser:** `src/clj/auto_ap/parse.clj`
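
The split-and-filter behaviour described under "Why This Works" can be illustrated with a minimal, standalone sketch. It is not the code in `src/clj/auto_ap/parse.clj`: the namespace, the `parse-statement` function, and the sample text are all hypothetical, and the real layout of `13595522.pdf` may differ. Only the regexes are copied from the template. `:customer-identifier` and the `:parser` conversions are left out because this note does not show how the real parser applies them.

```clojure
(ns statement-sketch
  (:require [clojure.string :as str]))

;; Hypothetical statement text: one header line, two invoice rows, one footer.
(def sample-text
  (str/join "\n"
            ["  ACME CAFE            RETURN THIS PORTION"
             "  01/05/26  1001  INVOICE  125.50"
             "  01/12/26  1002  INVOICE  98.00"
             "  TOTAL DUE  223.50"]))

;; Patterns copied from the statement template.
(def multi #"\n")
(def multi-match? #"\s+[0-9]{2}/[0-9]{2}/[0-9]{2}\s+[0-9]+\s+INVOICE")

(def extract
  {:invoice-number #"^\s+[0-9]{2}/[0-9]{2}/[0-9]{2}\s+([0-9]+)\s+INVOICE"
   :date           #"^\s+([0-9]{2}/[0-9]{2}/[0-9]{2})"
   :total          #"^\s+[0-9]{2}/[0-9]{2}/[0-9]{2}\s+[0-9]+\s+INVOICE\s+([\d.]+)"})

(defn parse-statement
  "Assumed behaviour: split the text on `multi`, keep only the chunks that
  match `multi-match?`, then run every extract pattern against each chunk
  and keep the first capture group (nil when a pattern does not match)."
  [text]
  (->> (str/split text multi)
       (filter #(re-find multi-match? %))
       (mapv (fn [line]
               (into {}
                     (map (fn [[field pattern]]
                            [field (second (re-find pattern line))]))
                     extract)))))

(parse-statement sample-text)
```

On this sample the header and footer lines are dropped because they do not match `multi-match?`, which leaves one map per invoice row. Because each row is matched as its own string, the `^\s+` anchors behave as line-start anchors without needing a multiline regex flag.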