- Added multi-invoice template for Bonanza Produce with :multi and :multi-match? flags
- Template uses keywords for statement header to identify multi-invoice format
- Extracts invoice-number, date, customer-identifier (from RETURN line), and total
- Parses 4 invoices from statement PDF 13595522.pdf
- All tests pass (29 assertions, 0 failures, 0 errors)
- Added test: parse-bonanza-produce-statement-13595522
- Updated invoice-template-creator skill: emphasized test-first approach
| name | description | license |
|---|---|---|
| invoice-template-creator | This skill creates PDF invoice parsing templates for the Integreat system. It should be used when adding support for a new vendor invoice format that needs to be automatically parsed. | Complete terms in LICENSE.txt |
Invoice Template Creator
This skill automates the creation of invoice parsing templates for the Integreat system. It generates both the template definition and a corresponding test file based on a sample PDF invoice.
When to Use This Skill
Use this skill when you need to add support for a new vendor invoice format that cannot be parsed by existing templates. This typically happens when:
- A new vendor sends invoices in a unique format
- An existing vendor changes their invoice layout
- You encounter an invoice that fails to parse with current templates
Prerequisites
Before using this skill, ensure you have:
- Placed a sample PDF invoice file in the dev-resources/ directory
- Identified the vendor name
- Identified unique text patterns in the invoice (phone numbers, addresses, etc.) that can distinguish this vendor
- Determined the expected values for key fields (invoice number, date, customer name, total)
Usage Workflow
Step 1: Analyze the PDF
First, extract and analyze the PDF text to understand its structure:
pdftotext -layout "dev-resources/FILENAME.pdf" -
Look for:
- Vendor identifiers: Phone numbers, addresses, or unique text that identifies this vendor
- Field patterns: How invoice number, date, customer name, and total appear in the text
- Layout quirks: Multi-line fields, special formatting, or unusual spacing
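Before writing a template, it can help to check candidate vendor identifiers against the extracted text at the REPL. The snippet below is a minimal sketch: `sample-text` is a made-up stand-in for `pdftotext -layout` output, and `keyword-hits` is a hypothetical helper, not part of the Integreat codebase.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample standing in for `pdftotext -layout` output.
(def sample-text
  (str/join "\n"
            ["ACME PRODUCE            (555) 010-0000"
             "Invoice # 12345         Date: 01/15/26"
             "Bill To: Customer Name"
             "Total: $1,234.50"]))

(defn keyword-hits
  "Returns a map of regex source -> whether the regex is found in text."
  [patterns text]
  (into {} (map (fn [p] [(str p) (boolean (re-find p text))]) patterns)))

;; Two candidates appear in the text, the third does not.
(keyword-hits [#"ACME PRODUCE" #"\(555\) 010-0000" #"Not Present"] sample-text)
```

Any candidate that reports false here would prevent a template from matching if it were used as a keyword.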
Step 2: Define Expected Values
Document the expected values for each field:
| Field | Expected Value | Notes |
|---|---|---|
| Vendor Name | "Vendor Name" | Company name as it should appear |
| Invoice Number | "12345" | The invoice identifier |
| Date | "01/15/26" | Format found in PDF |
| Customer Name | "Customer Name" | As it appears on invoice |
| Customer Address | "123 Main St" | Street address if available |
| Total | "100.00" | Amount |
Step 3: Create the Template and Test
The skill will:
- Create a test file at test/clj/auto_ap/parse/templates_test.clj (or add to the existing one)
  - Test parses the PDF file
  - Verifies all expected values are extracted correctly
  - Follows existing test patterns
- Add the template to src/clj/auto_ap/parse/templates.clj
  - Adds an entry to the pdf-templates vector
  - Includes:
    - :vendor - Vendor name
    - :keywords - Regex patterns to identify this vendor (must match all)
    - :extract - Regex patterns for each field
    - :parser - Optional date/number parsers
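The "must match all" rule for :keywords can be pictured with the sketch below. It is an assumption about how selection behaves, not the actual Integreat implementation: `template-matches?`, `first-matching-template`, and `example-templates` are hypothetical names, and "first matching template wins" is assumed only because position in the vector is said to matter.

```clojure
(defn template-matches?
  "True when every regex in the template's :keywords is found in text."
  [template text]
  (every? #(re-find % text) (:keywords template)))

(defn first-matching-template
  "Returns the first template in the vector whose keywords all match, or nil."
  [templates text]
  (first (filter #(template-matches? % text) templates)))

;; Hypothetical templates, reduced to the keys needed for matching.
(def example-templates
  [{:vendor "Other Vendor" :keywords [#"OTHER VENDOR"]}
   {:vendor "Vendor Name"  :keywords [#"ACME PRODUCE" #"555-0100"]}])

(:vendor (first-matching-template example-templates "ACME PRODUCE  Phone 555-0100"))
;; => "Vendor Name"

;; Only one of the two keywords is present, so nothing matches.
(first-matching-template example-templates "ACME PRODUCE")
;; => nil
```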
Step 4: Iterative Refinement
Run the test to see if it passes:
lein test auto-ap.parse.templates-test
If it fails, examine the debug output and refine the regex patterns. Common issues:
- Template doesn't match: Keywords don't actually appear in the PDF text
- Field is nil: Regex capture group doesn't match the actual text format
- Wrong value captured: Regex is too greedy or matches wrong text
Template Structure Reference
Basic Template Format
{:vendor "Vendor Name"
 :keywords [#"unique-pattern-1" #"unique-pattern-2"]
 :extract {:invoice-number #"Invoice\s+#\s+(\d+)"
           :date #"Date:\s+(\d{2}/\d{2}/\d{2})"
           :customer-identifier #"Bill To:\s+([A-Za-z\s]+)"
           :total #"Total:\s+\$([\d,]+\.\d{2})"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}}
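To see what the :extract regexes in such a template would capture, a rough REPL sketch like the following can be used. `extract-fields` is a hypothetical helper written for illustration; it assumes each field takes the first capture group, which may differ from what the real parser does.

```clojure
(defn extract-fields
  "Applies each regex in :extract to text and keeps capture group 1.
  Returns nil for a field when the regex does not match or has no group."
  [template text]
  (into {}
        (map (fn [[field pattern]]
               (let [m (re-find pattern text)]
                 ;; re-find returns a vector only when the regex has groups.
                 [field (when (vector? m) (second m))])))
        (:extract template)))

;; Hypothetical template and text for illustration.
(def example-template
  {:vendor "Vendor Name"
   :extract {:invoice-number #"Invoice\s+#\s+(\d+)"
             :date #"Date:\s+(\d{2}/\d{2}/\d{2})"
             :total #"Total:\s+\$([\d,]+\.\d{2})"}})

(extract-fields example-template
                "Invoice # 12345   Date: 01/15/26\nTotal: $1,234.50")
;; => {:invoice-number "12345", :date "01/15/26", :total "1,234.50"}
```

The values come back as raw strings here; converting them is the job of the :parser entries.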
Field Extraction Patterns
Invoice Number:
- Look for: "Invoice #12345" or "INV: 12345"
- Pattern: #"Invoice\s*#?\s*(\d+)" or #"INV:\s*(\d+)"
Date:
- Common formats: "01/15/26", "Jan 15, 2026", "2026-01-15"
- Pattern: #"(\d{2}/\d{2}/\d{2})" for MM/dd/yy
- Parser: :date [:clj-time "MM/dd/yy"]
Customer Identifier:
- Look for: "Bill To: Customer Name" or "Sold To: Customer Name"
- Pattern: #"Bill To:\s+([A-Za-z\s]+?)(?=\s{2,}|\n)"
- Use non-greedy +? and lookahead (?=...) to stop at boundaries
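The difference between the greedy and the bounded customer pattern is easiest to see side by side. The sketch below uses a made-up layout line (`layout-line` is hypothetical) in which a second column follows the customer name.

```clojure
;; Hypothetical line from a two-column layout.
(def layout-line "Bill To: Customer Name     Ship To: Warehouse\n")

;; Greedy: letters and whitespace run on into the next column.
(second (re-find #"Bill To:\s+([A-Za-z\s]+)" layout-line))
;; => "Customer Name     Ship To"

;; Non-greedy with lookahead: stops before a run of 2+ spaces or a newline.
(second (re-find #"Bill To:\s+([A-Za-z\s]+?)(?=\s{2,}|\n)" layout-line))
;; => "Customer Name"
```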
Total:
- Look for: "Total: $100.00" or "Amount Due: 100.00"
- Pattern: #"Total:\s+\$?([\d,]+\.\d{2})"
- Parser: :total [:trim-commas nil] removes commas
Advanced Patterns
Multi-line customer address: When customer info spans multiple lines (name + address):
:customer-identifier #"(?s)I\s+([A-Z][A-Z\s]+?)\s{2,}.*?L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)"
:account-number #"(?s)L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)"
The (?s) flag makes . match newlines. Use non-greedy +? and lookaheads (?=...) to capture clean values.
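A small REPL check shows the effect of (?s). The text and pattern below are simplified, hypothetical examples rather than the patterns from a real template.

```clojure
;; Hypothetical two-line customer block.
(def two-line-text "Bill To: ACME FOODS\n         123 MAIN ST")

;; Without (?s), `.` does not cross the newline, so nothing matches.
(re-find #"ACME FOODS.*?(\d+ MAIN ST)" two-line-text)
;; => nil

;; With (?s), `.` also matches the newline and the address is captured.
(second (re-find #"(?s)ACME FOODS.*?(\d+ MAIN ST)" two-line-text))
;; => "123 MAIN ST"
```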
Multiple date formats:
:parser {:date [:clj-time ["MM/dd/yy" "yyyy-MM-dd"]]}
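The idea behind a list of formats is presumably "try each in order until one parses". The sketch below illustrates that idea only: it swaps in java.time for clj-time so that it runs without extra dependencies, and `parse-date-any` is a hypothetical helper, not the :clj-time parser itself.

```clojure
(import '(java.time LocalDate)
        '(java.time.format DateTimeFormatter DateTimeParseException))

(defn parse-date-any
  "Tries each format in order; returns the first LocalDate that parses, else nil."
  [formats s]
  (some (fn [fmt]
          (try
            (LocalDate/parse s (DateTimeFormatter/ofPattern fmt))
            (catch DateTimeParseException _ nil)))
        formats))

;; Both inputs resolve to the same date, 2026-01-15.
(parse-date-any ["MM/dd/yy" "yyyy-MM-dd"] "01/15/26")
(parse-date-any ["MM/dd/yy" "yyyy-MM-dd"] "2026-01-15")

;; Unparseable input yields nil instead of throwing.
(parse-date-any ["MM/dd/yy" "yyyy-MM-dd"] "Jan 15, 2026")
```

Comparing the year, month, and day of the result in a test avoids surprises from time zones or differing date types.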
Credit memos (negative amounts):
:parser {:total [:trim-commas-and-negate nil]}
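What the two total parsers are expected to do can be sketched as plain functions. These are hypothetical stand-ins named after the parser keywords; the real parsers may return a different numeric type or a string.

```clojure
(require '[clojure.string :as str])

(defn trim-commas
  "Hypothetical stand-in for :trim-commas: strips commas, then parses a decimal."
  [s]
  (bigdec (str/replace s "," "")))

(defn trim-commas-and-negate
  "Hypothetical stand-in for :trim-commas-and-negate: flips the sign after trimming."
  [s]
  (- (trim-commas s)))

(trim-commas "1,234.50")            ;; => 1234.50M
(trim-commas-and-negate "1,234.50") ;; => -1234.50M
```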
Testing Best Practices
- IMPORTANT, CRITICAL!! Start with a failing test - Define expected values before implementing
- Test actual PDF parsing - Use parse-file or parse with real PDF text
- Verify each field individually - Separate assertions for clarity
- Handle date comparisons carefully - Compare year/month/day separately if needed
- Use str/trim - Account for extra whitespace in extracted values
Example Test Structure
(deftest parse-vendor-invoice-12345
  (testing "Should parse Vendor invoice with expected values"
    (let [results (sut/parse-file (io/file "dev-resources/INVOICE.pdf")
                                  "INVOICE.pdf")
          result (first results)]
      (is (some? results) "Should return results")
      (is (some? result) "Template should match")
      (when result
        (is (= "Vendor Name" (:vendor-code result)))
        (is (= "12345" (:invoice-number result)))
        (is (= "Customer Name" (:customer-identifier result)))
        (is (= "100.00" (:total result)))))))
Common Pitfalls
- Keywords must all match - Every pattern in :keywords must be found in the PDF
- Capture groups required - Regexes need () to extract values
- PDF text != visual text - Layout may differ from what you see visually
- Greedy quantifiers - Use +? instead of + to avoid over-matching
- Case sensitivity - Regex is case-sensitive unless you use the (?i) flag
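Two of these pitfalls, case sensitivity and missing capture groups, are quick to reproduce at the REPL. The input strings below are made up for illustration.

```clojure
;; Case-sensitive by default: lowercase pattern misses uppercase text.
(re-find #"invoice\s+#\s+(\d+)" "INVOICE # 12345")
;; => nil

;; (?i) makes the match case-insensitive.
(re-find #"(?i)invoice\s+#\s+(\d+)" "INVOICE # 12345")
;; => ["INVOICE # 12345" "12345"]

;; Without a capture group, re-find returns the whole match as a string,
;; so there is no group 1 to extract.
(re-find #"Invoice\s+#\s+\d+" "Invoice # 12345")
;; => "Invoice # 12345"
```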
Post-Creation Checklist
After creating the template:
- Test passes: lein test auto-ap.parse.templates-test
- Format is correct: lein cljfmt check
- Code compiles: lein check
- Template is in the correct position in the pdf-templates vector
- Keywords uniquely identify this vendor (won't match other templates)
- Test file follows naming conventions
Integration with Workflow
This skill is typically used as part of a larger workflow:
- User provides PDF and requirements
- This skill creates template and test
- User reviews and refines if needed
- Test is run to verify extraction
- Code is committed
The skill ensures consistency with existing patterns and reduces manual boilerplate when adding new vendor support.