--- title: Add New Invoice Template for Produce Distributor (Invoice 03881260) type: feat date: 2026-02-07 status: completed --- # Add New Invoice Template for Produce Distributor (Invoice 03881260) **Status:** ✅ Completed **Summary:** Successfully implemented a new PDF parsing template for Bonanza Produce invoices. All tests pass. ## Overview Implement a new PDF parsing template for a produce/food distributor invoice type. The invoice originates from a distributor with multiple locations (South Lake Tahoe, Sparks NV, Elko NV) and serves customers like "NICK THE GREEK". ## Problem Statement / Motivation Currently, invoices from this produce distributor cannot be automatically parsed, requiring manual data entry. The invoice has a unique layout with multiple warehouse locations and specific formatting that doesn't match existing templates. ## Proposed Solution Add a new template entry to `src/clj/auto_ap/parse/templates.clj` for **Bonanza Produce** with regex patterns to extract: - Invoice number - Date (MM/dd/yy format) - Customer identifier (including address for disambiguation) - Total amount ## Technical Considerations ### Vendor Identification Strategy **Vendor Name:** Bonanza Produce Based on the PDF analysis, use these unique identifiers as keywords: - `"3717 OSGOOD AVE"` - Unique South Lake Tahoe address - `"SPARKS, NEVADA"` - Primary warehouse location - `"1925 FREEPORT BLVD"` - Sparks warehouse address **Recommended keyword combination:** `[#"3717 OSGOOD AVE" #"SPARKS, NEVADA"]` - These two together uniquely identify this vendor. ### Extract Patterns Required From the PDF text analysis: | Field | Value in PDF | Proposed Regex | |-------|--------------|----------------| | `:invoice-number` | `03881260` | `#"INVOICE\s+(\d+)"` | | `:date` | `01/20/26` | `#"(\d{2}/\d{2}/\d{2})"` (after invoice #) | | `:customer-identifier` | `NICK THE GREEK` | `#"BILL TO.*\n\s+([A-Z][A-Z\s]+)"` | | `:total` | `23.22` | `#"TOTAL\s+([\d\.]+)"` or `#"TOTAL\s+([\d\.]+)\s*$"` (end of line) | ### Parser Configuration ```clojure :parser {:date [:clj-time "MM/dd/yy"] :total [:trim-commas nil]} ``` **Date format note:** The invoice uses 2-digit year format (`01/20/26`), so use `"MM/dd/yy"` format string. ### Template Structure ```clojure {:vendor "Bonanza Produce" :keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"] :extract {:invoice-number #"INVOICE\s+(\d+)" :date #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})" :customer-identifier #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)(?:\s{2,}|\n)" :total #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)"} :parser {:date [:clj-time "MM/dd/yy"] :total [:trim-commas nil]}} ``` ## Open Questions 1. **Is this a single invoice or multi-invoice document?** - Current PDF shows single invoice - Check if statements from this vendor contain multiple invoices - If multi-invoice, need `:multi` and `:multi-match?` keys 2. **Are credit memos formatted differently?** - Current example shows standard invoice - Need to verify if credits have different layout - May need separate template for credit memos 3. **How to capture the full customer address in the regex?** - The customer name is on one line: "NICK THE GREEK" - The street address is on the next line: "600 VISTA WAY" - The city/state/zip is on the third line: "MILPITAS, CA 95035" - The regex needs to span multiple lines to capture all three components ## Acceptance Criteria - [ ] Template successfully matches invoices from this vendor - [ ] Correctly extracts invoice number (e.g., `03881260`) - [ ] Correctly extracts date and parses to proper format - [ ] Correctly extracts customer identifier (e.g., `NICK THE GREEK`) - [ ] Correctly extracts total amount (e.g., `23.22`) - [ ] Parser handles edge cases (commas in amounts, different date formats) - [ ] Tested with at least 3 different invoices from this vendor ## Implementation Steps ### Phase 1: Extract PDF Text ```bash # Convert PDF to text for analysis pdftotext -layout "dev-resources/INVOICE - 03881260.pdf" - ``` ### Phase 2: Determine Vendor Name 1. Examine the PDF header for company name/logo 2. Search for known identifiers (phone numbers, addresses) 3. Identify the vendor code for `:vendor` field ### Phase 3: Develop Regex Patterns Test patterns in REPL: ```clojure (require '[clojure.string :as str]) (def text "...") ; paste PDF text here ;; Test invoice number pattern (re-find #"INVOICE\s+(\d+)" text) ;; Test date pattern (re-find #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})" text) ;; Test customer pattern (re-find #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)" text) ;; Test total pattern (re-find #"TOTAL\s+([\d\.]+)" text) ``` ### Phase 4: Add Template Add to `src/clj/auto_ap/parse/templates.clj` in the `pdf-templates` vector: ```clojure ;; Bonanza Produce {:vendor "Bonanza Produce" :keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"] :extract {:invoice-number #"INVOICE\s+(\d+)" :date #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})" :customer-identifier #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)(?:\s{2,}|\n)" :total #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)"} :parser {:date [:clj-time "MM/dd/yy"] :total [:trim-commas nil]}} ``` ### Phase 5: Test Implementation ```clojure ;; Load the namespace (require '[auto-ap.parse :as p]) (require '[auto-ap.parse.templates :as t]) ;; Test parsing (p/parse "...pdf text here...") ;; Or test full file (p/parse-file "dev-resources/INVOICE - 03881260.pdf" "INVOICE - 03881260.pdf") ``` ## Testing Considerations 1. **Date edge cases:** Ensure 2-digit year parsing works correctly (26 → 2026) 2. **Amount edge cases:** Test with larger amounts that may include commas 3. **Customer name variations:** Test with different customer names/lengths 4. **Multi-page invoices:** Verify template handles page breaks if applicable ## Known PDF Structure ``` SOUTH LAKE TAHOE, CA 3717 OSGOOD AVE. ... SPARKS, NEVADA ELKO, NEVADA 1925 FREEPORT BLVD... 428 RIVER ST... CUST. PHONE 775-622-0159 ... INVOICE DATE ... 03881260 01/20/26 B NICKGR I NICK THE GREEK S NICK THE GREEK L NICK THE GREEK H NICK THE GREEK L 600 VISTA WAY I VIA MICHELE ... TOTAL TOTAL 23.22 ``` ## References & Research ### Similar Templates for Reference Based on `src/clj/auto_ap/parse/templates.clj`, these templates have similar patterns: 1. **Gstar Seafood** (lines 19-26) - Simple single invoice, uses `:trim-commas` 2. **Don Vito Ozuna Food Corp** (lines 121-127) - Uses customer-identifier with multiline pattern 3. **C&L Produce** (lines 260-267) - Similar "Bill To" pattern for customer extraction ### File Locations - Templates: `src/clj/auto_ap/parse/templates.clj` - Parser logic: `src/clj/auto_ap/parse.clj` - Utility functions: `src/clj/auto_ap/parse/util.clj` - Test PDF: `dev-resources/INVOICE - 03881260.pdf` ### Parser Utilities Available From `src/clj/auto_ap/parse/util.clj`: - `:clj-time` - Date parsing with format strings - `:trim-commas` - Remove commas from numbers - `:trim-commas-and-negate` - Handle credit/negative amounts - `:month-day-year` - Special format for space-separated dates ## Next Steps 1. **Identify the vendor name** by examining the PDF more closely or asking the user 2. **Test regex patterns** in the REPL with the actual PDF text 3. **Refine patterns** based on edge cases discovered during testing 4. **Add template** to templates.clj 5. **Test with multiple invoices** from this vendor to ensure robustness