- Add new PDF template for Bonanza Produce vendor
- Template uses phone number 530-544-4136 as unique identifier
- Extracts invoice number, date, customer identifier, and total
- Includes passing test for invoice 03881260

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
| title | type | date | status |
|---|---|---|---|
| Add New Invoice Template for Produce Distributor (Invoice 03881260) | feat | 2026-02-07 | completed |
Add New Invoice Template for Produce Distributor (Invoice 03881260)
Status: ✅ Completed
Summary: Successfully implemented a new PDF parsing template for Bonanza Produce invoices. All tests pass.
Overview
Implement a new PDF parsing template for a produce/food distributor invoice type. The invoice originates from a distributor with multiple locations (South Lake Tahoe, Sparks NV, Elko NV) and serves customers like "NICK THE GREEK".
Problem Statement / Motivation
Currently, invoices from this produce distributor cannot be automatically parsed, requiring manual data entry. The invoice has a unique layout with multiple warehouse locations and specific formatting that doesn't match existing templates.
Proposed Solution
Add a new template entry to src/clj/auto_ap/parse/templates.clj for Bonanza Produce with regex patterns to extract:
- Invoice number
- Date (MM/dd/yy format)
- Customer identifier (including address for disambiguation)
- Total amount
Technical Considerations
Vendor Identification Strategy
Vendor Name: Bonanza Produce
Based on the PDF analysis, use these unique identifiers as keywords:
- `"3717 OSGOOD AVE"` - Unique South Lake Tahoe address
- `"SPARKS, NEVADA"` - Primary warehouse location
- `"1925 FREEPORT BLVD"` - Sparks warehouse address
Recommended keyword combination: [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"] - These two together uniquely identify this vendor.
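
As a rough illustration of why two keywords are stronger than one, the sketch below assumes a template is selected only when every keyword regex is found somewhere in the PDF text. That matching rule and the helper name `matches-all-keywords?` are assumptions for illustration, not the actual logic in `parse.clj`. The header text is taken from the known PDF structure.

```clojure
;; Keywords proposed for the Bonanza Produce template.
(def keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"])

;; Hypothetical helper: true only when every keyword regex is found in the text.
(defn matches-all-keywords?
  [text kws]
  (every? #(some? (re-find % text)) kws))

;; Header lines as they appear in the known PDF structure.
(def header-text
  "SOUTH LAKE TAHOE, CA\n3717 OSGOOD AVE.\nSPARKS, NEVADA ELKO, NEVADA")

(matches-all-keywords? header-text keywords)       ;; => true
(matches-all-keywords? "SPARKS, NEVADA" keywords)  ;; => false (address missing)
```

A document from another vendor that happens to mention Sparks, Nevada would fail the second check, which is the point of combining the two keywords.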
Extract Patterns Required
From the PDF text analysis:
| Field | Value in PDF | Proposed Regex |
|---|---|---|
| `:invoice-number` | 03881260 | `#"INVOICE\s+(\d+)"` |
| `:date` | 01/20/26 | `#"(\d{2}/\d{2}/\d{2})"` (after invoice #) |
| `:customer-identifier` | NICK THE GREEK | `#"BILL TO.*\n\s+([A-Z][A-Z\s]+)"` |
| `:total` | 23.22 | `#"TOTAL\s+([\d\.]+)"` or `#"TOTAL\s+([\d\.]+)\s*$"` (end of line) |
Parser Configuration
:parser {:date  [:clj-time "MM/dd/yy"]
         :total [:trim-commas nil]}
Date format note: The invoice uses 2-digit year format (01/20/26), so use "MM/dd/yy" format string.
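
To show how a two-digit year resolves under this pattern, here is a minimal sketch using `java.time` from the JDK rather than clj-time, so it runs without extra dependencies. This is a swapped-in library for illustration only; the template itself uses the `:clj-time` parser. The helper name `parse-short-date` is hypothetical. In `java.time`, a two-letter `yy` is parsed relative to a base year of 2000.

```clojure
(import '(java.time LocalDate)
        '(java.time.format DateTimeFormatter))

;; Same pattern string as in the template's :parser entry.
(def short-date-format (DateTimeFormatter/ofPattern "MM/dd/yy"))

;; Hypothetical helper: parse an invoice date such as "01/20/26".
(defn parse-short-date
  [s]
  (LocalDate/parse s short-date-format))

(str (parse-short-date "01/20/26")) ;; => "2026-01-20"
```

It is worth confirming in the REPL that the clj-time formatter resolves `26` to 2026 in the same way.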
Template Structure
{:vendor "Bonanza Produce"
 :keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"]
 :extract {:invoice-number #"INVOICE\s+(\d+)"
           :date #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})"
           :customer-identifier #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)(?:\s{2,}|\n)"
           :total #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}}
Open Questions
- Is this a single invoice or multi-invoice document?
  - Current PDF shows single invoice
  - Check if statements from this vendor contain multiple invoices
  - If multi-invoice, need `:multi` and `:multi-match?` keys
- Are credit memos formatted differently?
  - Current example shows standard invoice
  - Need to verify if credits have different layout
  - May need separate template for credit memos
- How to capture the full customer address in the regex?
  - The customer name is on one line: "NICK THE GREEK"
  - The street address is on the next line: "600 VISTA WAY"
  - The city/state/zip is on the third line: "MILPITAS, CA 95035"
  - The regex needs to span multiple lines to capture all three components
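
One possible shape for the multi-line address capture is sketched below. It assumes a simplified, single-column version of the Bill To block with the three lines directly under each other. The real `pdftotext -layout` output places Bill To and Ship To side by side with leading column letters, so the pattern would need adjusting against the actual text. The names `bill-to-sample` and `address-pattern` are hypothetical.

```clojure
;; Simplified stand-in for the Bill To block (assumption: one column, no padding).
(def bill-to-sample
  "NICK THE GREEK\n600 VISTA WAY\nMILPITAS, CA 95035\n")

;; (?m) lets ^ and $ match at line boundaries; each group captures one line.
(def address-pattern
  #"(?m)^([A-Z][A-Z ]+)\n(\d+ [A-Z0-9 ]+)\n([A-Z ]+, [A-Z]{2} \d{5})$")

;; Drop the full match and keep the three captured lines.
(rest (re-find address-pattern bill-to-sample))
;; => ("NICK THE GREEK" "600 VISTA WAY" "MILPITAS, CA 95035")
```

Capturing the lines as separate groups keeps the name available on its own while still allowing the address to be used for disambiguation.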
Acceptance Criteria
- Template successfully matches invoices from this vendor
- Correctly extracts invoice number (e.g., 03881260)
- Correctly extracts date and parses to proper format
- Correctly extracts customer identifier (e.g., NICK THE GREEK)
- Correctly extracts total amount (e.g., 23.22)
- Parser handles edge cases (commas in amounts, different date formats)
- Tested with at least 3 different invoices from this vendor
Implementation Steps
Phase 1: Extract PDF Text
# Convert PDF to text for analysis
pdftotext -layout "dev-resources/INVOICE - 03881260.pdf" -
Phase 2: Determine Vendor Name
- Examine the PDF header for company name/logo
- Search for known identifiers (phone numbers, addresses)
- Identify the vendor code for the `:vendor` field
Phase 3: Develop Regex Patterns
Test patterns in REPL:
(require '[clojure.string :as str])
(def text "...") ; paste PDF text here
;; Test invoice number pattern
(re-find #"INVOICE\s+(\d+)" text)
;; Test date pattern
(re-find #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})" text)
;; Test customer pattern
(re-find #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)" text)
;; Test total pattern
(re-find #"TOTAL\s+([\d\.]+)" text)
Phase 4: Add Template
Add to src/clj/auto_ap/parse/templates.clj in the pdf-templates vector:
;; Bonanza Produce
{:vendor "Bonanza Produce"
 :keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"]
 :extract {:invoice-number #"INVOICE\s+(\d+)"
           :date #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})"
           :customer-identifier #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)(?:\s{2,}|\n)"
           :total #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}}
Phase 5: Test Implementation
;; Load the namespace
(require '[auto-ap.parse :as p])
(require '[auto-ap.parse.templates :as t])
;; Test parsing
(p/parse "...pdf text here...")
;; Or test full file
(p/parse-file "dev-resources/INVOICE - 03881260.pdf" "INVOICE - 03881260.pdf")
Testing Considerations
- Date edge cases: Ensure 2-digit year parsing works correctly (26 → 2026)
- Amount edge cases: Test with larger amounts that may include commas
- Customer name variations: Test with different customer names/lengths
- Multi-page invoices: Verify template handles page breaks if applicable
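
The amount edge case can be checked in isolation. The sketch below runs the template's `:total` pattern against a made-up line with a comma-formatted amount. The sample string and the variant pattern `total-with-commas` are assumptions for illustration, not values from a real invoice. Because the character class `[\d\.]` excludes commas, the current pattern does not match such a line at all, so `:trim-commas` never sees the value.

```clojure
(require '[clojure.string :as str])

;; Pattern as written in the template.
(def total-pattern #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)")

;; Hypothetical variant that also accepts commas inside the amount.
(def total-with-commas #"TOTAL\s+([\d,\.]+)(?:\s*$|\s+TOTAL)")

;; Made-up sample line with a thousands separator.
(def sample-line "TOTAL 1,234.56")

(re-find total-pattern sample-line)               ;; => nil
(second (re-find total-with-commas sample-line))  ;; => "1,234.56"

;; Stripping commas afterwards, roughly what a trim-commas step would do.
(str/replace (second (re-find total-with-commas sample-line)) "," "")
;; => "1234.56"
```

If invoices from this vendor can exceed 999.99, widening the character class before relying on `:trim-commas` may be worth testing.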
Known PDF Structure
SOUTH LAKE TAHOE, CA
3717 OSGOOD AVE.
...
SPARKS, NEVADA ELKO, NEVADA
1925 FREEPORT BLVD... 428 RIVER ST...
CUST. PHONE 775-622-0159 ... INVOICE DATE
... 03881260 01/20/26
B NICKGR
I NICK THE GREEK S NICK THE GREEK
L NICK THE GREEK H NICK THE GREEK
L 600 VISTA WAY I VIA MICHELE
...
TOTAL
TOTAL 23.22
References & Research
Similar Templates for Reference
Based on src/clj/auto_ap/parse/templates.clj, these templates have similar patterns:
- Gstar Seafood (lines 19-26) - Simple single invoice, uses `:trim-commas`
- Don Vito Ozuna Food Corp (lines 121-127) - Uses customer-identifier with multiline pattern
- C&L Produce (lines 260-267) - Similar "Bill To" pattern for customer extraction
File Locations
- Templates: src/clj/auto_ap/parse/templates.clj
- Parser logic: src/clj/auto_ap/parse.clj
- Utility functions: src/clj/auto_ap/parse/util.clj
- Test PDF: dev-resources/INVOICE - 03881260.pdf
Parser Utilities Available
From src/clj/auto_ap/parse/util.clj:
- `:clj-time` - Date parsing with format strings
- `:trim-commas` - Remove commas from numbers
- `:trim-commas-and-negate` - Handle credit/negative amounts
- `:month-day-year` - Special format for space-separated dates
Next Steps
- Identify the vendor name by examining the PDF more closely or asking the user
- Test regex patterns in the REPL with the actual PDF text
- Refine patterns based on edge cases discovered during testing
- Add template to templates.clj
- Test with multiple invoices from this vendor to ensure robustness