Add Bonanza Produce invoice template

- Add new PDF template for Bonanza Produce vendor
- Template uses phone number 530-544-4136 as unique identifier
- Extracts invoice number, date, customer identifier, and total
- Includes passing test for invoice 03881260

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-02-07 10:01:00 -08:00
parent dc021b8ce0
commit 37351e5f92
4 changed files with 312 additions and 51 deletions


@@ -0,0 +1,229 @@
---
title: Add New Invoice Template for Produce Distributor (Invoice 03881260)
type: feat
date: 2026-02-07
status: completed
---
# Add New Invoice Template for Produce Distributor (Invoice 03881260)
**Status:** ✅ Completed
**Summary:** Successfully implemented a new PDF parsing template for Bonanza Produce invoices. All tests pass.
## Overview
Implement a new PDF parsing template for a produce/food distributor invoice type. The invoice originates from a distributor with multiple locations (South Lake Tahoe, Sparks NV, Elko NV) and serves customers like "NICK THE GREEK".
## Problem Statement / Motivation
Currently, invoices from this produce distributor cannot be automatically parsed, requiring manual data entry. The invoice has a unique layout with multiple warehouse locations and specific formatting that doesn't match existing templates.
## Proposed Solution
Add a new template entry to `src/clj/auto_ap/parse/templates.clj` for **Bonanza Produce** with regex patterns to extract:
- Invoice number
- Date (MM/dd/yy format)
- Customer identifier (including address for disambiguation)
- Total amount
## Technical Considerations
### Vendor Identification Strategy
**Vendor Name:** Bonanza Produce
Based on the PDF analysis, use these unique identifiers as keywords:
- `"3717 OSGOOD AVE"` - Unique South Lake Tahoe address
- `"SPARKS, NEVADA"` - Primary warehouse location
- `"1925 FREEPORT BLVD"` - Sparks warehouse address
**Recommended keyword combination:** `[#"3717 OSGOOD AVE" #"SPARKS, NEVADA"]` - These two together uniquely identify this vendor.
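
The keyword combination can be checked in isolation. The sketch below assumes a template is selected only when every keyword regex is found in the PDF text; `matches-all-keywords?` and `header-text` are hypothetical names for illustration, not the project's matcher.

```clojure
;; Keywords proposed for Bonanza Produce.
(def bonanza-keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"])

(defn matches-all-keywords?
  "True when every keyword regex is found somewhere in text.
  Assumption: the real template matcher requires all keywords to match."
  [keywords text]
  (every? #(re-find % text) keywords))

;; Header lines modeled on the known PDF structure (abbreviated, hypothetical).
(def header-text
  (str "SOUTH LAKE TAHOE, CA\n"
       "3717 OSGOOD AVE.\n"
       "SPARKS, NEVADA      ELKO, NEVADA\n"))

(matches-all-keywords? bonanza-keywords header-text)          ; both keywords present
(matches-all-keywords? bonanza-keywords "SPARKS, NEVADA only") ; street address missing
```

Requiring both keywords keeps a document that merely mentions Sparks, Nevada from being attributed to this vendor.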
### Extract Patterns Required
From the PDF text analysis:
| Field | Value in PDF | Proposed Regex |
|-------|--------------|----------------|
| `:invoice-number` | `03881260` | `#"INVOICE\s+(\d+)"` |
| `:date` | `01/20/26` | `#"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})"` (date immediately after the invoice number) |
| `:customer-identifier` | `NICK THE GREEK` | `#"BILL TO.*\n\s+([A-Z][A-Z\s]+)"` |
| `:total` | `23.22` | `#"TOTAL\s+([\d\.]+)"` or `#"TOTAL\s+([\d\.]+)\s*$"` (anchored at end of input; `$` matches at each line end only under the `(?m)` flag) |
### Parser Configuration
```clojure
:parser {:date [:clj-time "MM/dd/yy"]
         :total [:trim-commas nil]}
```
**Date format note:** The invoice uses 2-digit year format (`01/20/26`), so use `"MM/dd/yy"` format string.
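
To illustrate the two-digit year behavior, here is a minimal sketch that uses `java.time` as a dependency-free stand-in for the project's `:clj-time` parser. `parse-invoice-date` is a hypothetical helper, and clj-time's own handling of two-digit years should still be confirmed against real invoices.

```clojure
(import '(java.time LocalDate)
        '(java.time.format DateTimeFormatter))

;; Stand-in for the :clj-time parser step; java.time is used only so the
;; example runs without extra dependencies.
(def invoice-date-format (DateTimeFormatter/ofPattern "MM/dd/yy"))

(defn parse-invoice-date
  "Parses a MM/dd/yy string into a java.time.LocalDate."
  [s]
  (LocalDate/parse s invoice-date-format))

;; With java.time, a two-digit "yy" is resolved against base year 2000.
(str (parse-invoice-date "01/20/26")) ; => "2026-01-20"
```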
### Template Structure
```clojure
{:vendor "Bonanza Produce"
 :keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"]
 :extract {:invoice-number #"INVOICE\s+(\d+)"
           :date #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})"
           :customer-identifier #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)(?:\s{2,}|\n)"
           :total #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}}
```
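
As a rough illustration of how an `:extract` map like this could be applied, the sketch below runs each regex against the text and keeps the first capture group. `extract-fields` and `sample-text` are hypothetical: the sample is a simplified single-column layout, not the real `pdftotext` output, and the project's parser in `src/clj/auto_ap/parse.clj` may work differently.

```clojure
(require '[clojure.string :as str])

;; Same patterns as the proposed :extract map.
(def bonanza-extract
  {:invoice-number      #"INVOICE\s+(\d+)"
   :date                #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})"
   :customer-identifier #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)(?:\s{2,}|\n)"
   :total               #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)"})

(defn extract-fields
  "Hypothetical helper: applies each regex to text and keeps the trimmed
  first capture group. Fields whose pattern does not match are omitted."
  [extract text]
  (reduce-kv (fn [acc field pattern]
               (if-let [[_ captured] (re-find pattern text)]
                 (assoc acc field (str/trim captured))
                 acc))
             {}
             extract))

;; Simplified, hypothetical invoice text.
(def sample-text
  (str "INVOICE 03881260 01/20/26\n"
       "BILL TO\n"
       "  NICK THE GREEK\n"
       "  600 VISTA WAY\n"
       "TOTAL 23.22\n"))

(extract-fields bonanza-extract sample-text)
```

The trim matters for `:customer-identifier`: because `\s` also matches newlines, the class `[A-Z\s]+` can run past the end of the name line, so the raw capture may carry trailing whitespace.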
## Open Questions
1. **Is this a single invoice or multi-invoice document?**
- Current PDF shows single invoice
- Check if statements from this vendor contain multiple invoices
- If multi-invoice, need `:multi` and `:multi-match?` keys
2. **Are credit memos formatted differently?**
- Current example shows standard invoice
- Need to verify if credits have different layout
- May need separate template for credit memos
3. **How to capture the full customer address in the regex?**
- The customer name is on one line: "NICK THE GREEK"
- The street address is on the next line: "600 VISTA WAY"
- The city/state/zip is on the third line: "MILPITAS, CA 95035"
- The regex needs to span multiple lines to capture all three components
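
One possible shape for such a multi-line pattern is sketched below. It assumes a simplified two-column layout in which each bill-to value ends at a run of two or more spaces or at the line break; `bill-to-block`, `customer-with-address`, and `two-column-text` are hypothetical, and the pattern would need adjusting against the real `pdftotext` output.

```clojure
(require '[clojure.string :as str])

;; Three capture groups: name, street, and city/state/zip on consecutive lines.
;; (?:\s{2,}.*)? skips an optional right-hand (ship-to) column on each line.
(def bill-to-block
  #"BILL TO.*\n\s+([A-Z][A-Z ]+?)(?:\s{2,}.*)?\n\s+(\d+ [A-Z0-9 ]+?)(?:\s{2,}.*)?\n\s+([A-Z ]+, [A-Z]{2} \d{5})")

(defn customer-with-address
  "Joins the captured name, street, and city lines, or returns nil on no match."
  [text]
  (when-let [[_ & parts] (re-find bill-to-block text)]
    (str/join ", " parts)))

;; Hypothetical two-column text modeled on the address lines listed above.
(def two-column-text
  (str "BILL TO                      SHIP TO\n"
       "  NICK THE GREEK             NICK THE GREEK\n"
       "  600 VISTA WAY              VIA MICHELE\n"
       "  MILPITAS, CA 95035\n"))

(customer-with-address two-column-text)
;; => "NICK THE GREEK, 600 VISTA WAY, MILPITAS, CA 95035"
```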
## Acceptance Criteria
- [ ] Template successfully matches invoices from this vendor
- [ ] Correctly extracts invoice number (e.g., `03881260`)
- [ ] Correctly extracts date and parses to proper format
- [ ] Correctly extracts customer identifier (e.g., `NICK THE GREEK`)
- [ ] Correctly extracts total amount (e.g., `23.22`)
- [ ] Parser handles edge cases (commas in amounts, different date formats)
- [ ] Tested with at least 3 different invoices from this vendor
## Implementation Steps
### Phase 1: Extract PDF Text
```bash
# Convert PDF to text for analysis
pdftotext -layout "dev-resources/INVOICE - 03881260.pdf" -
```
### Phase 2: Determine Vendor Name
1. Examine the PDF header for company name/logo
2. Search for known identifiers (phone numbers, addresses)
3. Identify the vendor name to use in the `:vendor` field
### Phase 3: Develop Regex Patterns
Test patterns in REPL:
```clojure
(require '[clojure.string :as str])
(def text "...") ; paste PDF text here
;; Test invoice number pattern
(re-find #"INVOICE\s+(\d+)" text)
;; Test date pattern
(re-find #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})" text)
;; Test customer pattern
(re-find #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)" text)
;; Test total pattern
(re-find #"TOTAL\s+([\d\.]+)" text)
```
### Phase 4: Add Template
Add to `src/clj/auto_ap/parse/templates.clj` in the `pdf-templates` vector:
```clojure
;; Bonanza Produce
{:vendor "Bonanza Produce"
 :keywords [#"3717 OSGOOD AVE" #"SPARKS, NEVADA"]
 :extract {:invoice-number #"INVOICE\s+(\d+)"
           :date #"INVOICE\s+\d+\s+(\d{2}/\d{2}/\d{2})"
           :customer-identifier #"BILL TO.*?\n\s+([A-Z][A-Z\s]+)(?:\s{2,}|\n)"
           :total #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}}
```
### Phase 5: Test Implementation
```clojure
;; Load the namespace
(require '[auto-ap.parse :as p])
(require '[auto-ap.parse.templates :as t])
;; Test parsing
(p/parse "...pdf text here...")
;; Or test full file
(p/parse-file "dev-resources/INVOICE - 03881260.pdf" "INVOICE - 03881260.pdf")
```
## Testing Considerations
1. **Date edge cases:** Ensure 2-digit year parsing works correctly (26 → 2026)
2. **Amount edge cases:** Test with larger amounts that may include commas
3. **Customer name variations:** Test with different customer names/lengths
4. **Multi-page invoices:** Verify template handles page breaks if applicable
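
The amount edge case is worth checking early, because a character class without a comma, such as `[\d\.]+`, cannot capture a total like `1,234.56`, so the `:trim-commas` step would never see the comma. The sketch below compares that class with a hypothetical comma-tolerant variant; `strip-commas` is a stand-in for `:trim-commas` (assumed behavior), and `large-total` is made-up sample text.

```clojure
(require '[clojure.string :as str])

;; Character class without a comma, as in [\d\.]+.
(def total-without-comma #"TOTAL\s+([\d\.]+)(?:\s*$|\s+TOTAL)")

;; Hypothetical variant that also accepts thousands separators.
(def total-with-comma #"TOTAL\s+([\d,\.]+)(?:\s*$|\s+TOTAL)")

(defn strip-commas
  "Stand-in for the :trim-commas parser step (assumed behavior)."
  [s]
  (str/replace s "," ""))

(def large-total "TOTAL     1,234.56\n")

;; The capture stops at the comma, so the trailing alternation cannot match.
(re-find total-without-comma large-total) ; => nil

(some-> (re-find total-with-comma large-total)
        second
        strip-commas)                     ; => "1234.56"
```

Amounts without a comma, such as `23.22`, match both patterns, so the variant would not change the result for the sample invoice.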
## Known PDF Structure
```
SOUTH LAKE TAHOE, CA
3717 OSGOOD AVE.
...
SPARKS, NEVADA ELKO, NEVADA
1925 FREEPORT BLVD... 428 RIVER ST...
CUST. PHONE 775-622-0159 ... INVOICE DATE
... 03881260 01/20/26
B NICKGR
I NICK THE GREEK S NICK THE GREEK
L NICK THE GREEK H NICK THE GREEK
L 600 VISTA WAY I VIA MICHELE
...
TOTAL
TOTAL 23.22
```
## References & Research
### Similar Templates for Reference
Based on `src/clj/auto_ap/parse/templates.clj`, these templates have similar patterns:
1. **Gstar Seafood** (lines 19-26) - Simple single invoice, uses `:trim-commas`
2. **Don Vito Ozuna Food Corp** (lines 121-127) - Uses customer-identifier with multiline pattern
3. **C&L Produce** (lines 260-267) - Similar "Bill To" pattern for customer extraction
### File Locations
- Templates: `src/clj/auto_ap/parse/templates.clj`
- Parser logic: `src/clj/auto_ap/parse.clj`
- Utility functions: `src/clj/auto_ap/parse/util.clj`
- Test PDF: `dev-resources/INVOICE - 03881260.pdf`
### Parser Utilities Available
From `src/clj/auto_ap/parse/util.clj`:
- `:clj-time` - Date parsing with format strings
- `:trim-commas` - Remove commas from numbers
- `:trim-commas-and-negate` - Handle credit/negative amounts
- `:month-day-year` - Special format for space-separated dates
## Next Steps
1. **Confirm the vendor name** (identified above as Bonanza Produce) by examining the PDF more closely or asking the user
2. **Test regex patterns** in the REPL with the actual PDF text
3. **Refine patterns** based on edge cases discovered during testing
4. **Add template** to templates.clj
5. **Test with multiple invoices** from this vendor to ensure robustness