Add invoice-template-creator skill for automated template generation

New repository-based skill at .claude/skills/invoice-template-creator/:
- SKILL.md: Complete guide for creating invoice parsing templates
- references/examples.md: Common patterns and template examples
- Covers vendor identification, regex patterns, field extraction
- Includes testing strategies and common pitfalls

Updated AGENTS.md with reference to the new skill.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-02-07 14:53:08 -08:00
parent 98a3e0dda6
commit 26dbde5bd3
3 changed files with 389 additions and 0 deletions

@@ -0,0 +1,201 @@
---
name: invoice-template-creator
description: This skill creates PDF invoice parsing templates for the Integreat system. It should be used when adding support for a new vendor invoice format that needs to be automatically parsed.
license: Complete terms in LICENSE.txt
---
# Invoice Template Creator
This skill automates the creation of invoice parsing templates for the Integreat system. It generates both the template definition and a corresponding test file based on a sample PDF invoice.
## When to Use This Skill
Use this skill when you need to add support for a new vendor invoice format that cannot be parsed by existing templates. This typically happens when:
- A new vendor sends invoices in a unique format
- An existing vendor changes their invoice layout
- You encounter an invoice that fails to parse with current templates
## Prerequisites
Before using this skill, make sure you have:
1. A sample PDF invoice placed in the `dev-resources/` directory
2. The vendor name
3. Unique text patterns in the invoice (phone numbers, addresses, etc.) that distinguish this vendor from others
4. The expected values for key fields (invoice number, date, customer name, total)
## Usage Workflow
### Step 1: Analyze the PDF
First, extract and analyze the PDF text to understand its structure:
```bash
pdftotext -layout "dev-resources/FILENAME.pdf" -
```
Look for:
- **Vendor identifiers**: Phone numbers, addresses, or unique text that identifies this vendor
- **Field patterns**: How invoice number, date, customer name, and total appear in the text
- **Layout quirks**: Multi-line fields, special formatting, or unusual spacing
### Step 2: Define Expected Values
Document the expected values for each field:
| Field | Expected Value | Notes |
|-------|---------------|-------|
| Vendor Name | "Vendor Name" | Company name as it should appear |
| Invoice Number | "12345" | The invoice identifier |
| Date | "01/15/26" | Format found in PDF |
| Customer Name | "Customer Name" | As it appears on invoice |
| Customer Address | "123 Main St" | Street address if available |
| Total | "100.00" | Amount |
### Step 3: Create the Template and Test
The skill will:
1. **Create a test file** at `test/clj/auto_ap/parse/templates_test.clj` (or add to existing)
   - Test parses the PDF file
   - Verifies all expected values are extracted correctly
   - Follows existing test patterns
2. **Add template to** `src/clj/auto_ap/parse/templates.clj`
   - Adds entry to `pdf-templates` vector
   - Includes:
     - `:vendor` - Vendor name
     - `:keywords` - Regex patterns to identify this vendor (must match all)
     - `:extract` - Regex patterns for each field
     - `:parser` - Optional date/number parsers
### Step 4: Iterative Refinement
Run the test to see if it passes:
```bash
lein test auto-ap.parse.templates-test
```
If it fails, examine the debug output and refine the regex patterns. Common issues:
- **Template doesn't match**: Keywords don't actually appear in the PDF text
- **Field is nil**: Regex capture group doesn't match the actual text format
- **Wrong value captured**: Regex is too greedy or matches wrong text
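
When a test fails, it can help to check each pattern in isolation at the REPL before touching the template. The following is a minimal sketch, not the project's own debug tooling: `sample-text`, `template`, `keyword-report`, and `extract-report` are hypothetical names, and the sample text stands in for real `pdftotext -layout` output.

```clojure
;; Hypothetical text standing in for `pdftotext -layout` output.
(def sample-text
  (str "ACME PRODUCE            Phone: 555-010-0199\n"
       "Invoice #  12345        Date: 01/15/26\n"
       "Total:  $1,234.50\n"))

;; Hypothetical template in the shape described in this guide.
(def template
  {:vendor   "Acme Produce"
   :keywords [#"ACME PRODUCE" #"555-010-0199"]
   :extract  {:invoice-number #"Invoice\s+#\s+(\d+)"
              :date           #"Date:\s+(\d{2}/\d{2}/\d{2})"
              :total          #"Total:\s+\$([\d,]+\.\d{2})"}})

(defn keyword-report
  "Maps each keyword pattern (as a string) to whether it is found in text."
  [template text]
  (into {}
        (map (fn [re] [(str re) (boolean (re-find re text))]))
        (:keywords template)))

(defn extract-report
  "Maps each field to its first capture group, or nil when nothing matches."
  [template text]
  (into {}
        (map (fn [[field re]]
               (let [m (re-find re text)]
                 [field (when m (if (vector? m) (second m) m))])))
        (:extract template)))

(keyword-report template sample-text)
;; => {"ACME PRODUCE" true, "555-010-0199" true}

(extract-report template sample-text)
;; => {:invoice-number "12345", :date "01/15/26", :total "1,234.50"}
```

A `false` in the keyword report points at the "template doesn't match" case, and a `nil` in the extract report points at the "field is nil" case.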
## Template Structure Reference
### Basic Template Format
```clojure
{:vendor "Vendor Name"
 :keywords [#"unique-pattern-1" #"unique-pattern-2"]
 :extract {:invoice-number #"Invoice\s+#\s+(\d+)"
           :date #"Date:\s+(\d{2}/\d{2}/\d{2})"
           :customer-identifier #"Bill To:\s+([A-Za-z\s]+)"
           :total #"Total:\s+\$([\d,]+\.\d{2})"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}}
```
### Field Extraction Patterns
**Invoice Number:**
- Look for: `"Invoice #12345"` or `"INV: 12345"`
- Pattern: `#"Invoice\s*#?\s*(\d+)"` or `#"INV:\s*(\d+)"`
**Date:**
- Common formats: `"01/15/26"`, `"Jan 15, 2026"`, `"2026-01-15"`
- Pattern: `#"(\d{2}/\d{2}/\d{2})"` for MM/dd/yy
- Parser: `:date [:clj-time "MM/dd/yy"]`
**Customer Identifier:**
- Look for: `"Bill To: Customer Name"` or `"Sold To: Customer Name"`
- Pattern: `#"Bill To:\s+([A-Za-z\s]+?)(?=\s{2,}|\n)"`
- Use non-greedy `+?` and lookahead `(?=...)` to stop at boundaries
**Total:**
- Look for: `"Total: $100.00"` or `"Amount Due: 100.00"`
- Pattern: `#"Total:\s+\$?([\d,]+\.\d{2})"`
- Parser: `:total [:trim-commas nil]` removes commas
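
The patterns above can be tried against short strings at the REPL. This sketch uses only `re-find` from `clojure.core`; the `capture` helper and the sample strings are hypothetical and exist only for illustration.

```clojure
(defn capture
  "Returns the first capture group of re in text, or nil when there is no
  match. Assumes re contains at least one capture group."
  [re text]
  (second (re-find re text)))

(capture #"Invoice\s*#?\s*(\d+)" "Invoice #12345")
;; => "12345"

(capture #"INV:\s*(\d+)" "INV: 12345")
;; => "12345"

(capture #"Total:\s+\$?([\d,]+\.\d{2})" "Total: $1,100.00")
;; => "1,100.00"

;; The label is part of the pattern, so "Amount Due" needs its own regex.
(capture #"Total:\s+\$?([\d,]+\.\d{2})" "Amount Due: 100.00")
;; => nil

;; The reluctant group plus lookahead stops at the first run of 2+ spaces.
(capture #"Bill To:\s+([A-Za-z\s]+?)(?=\s{2,}|\n)"
         "Bill To: Corner Cafe   Terms: Net 30\n")
;; => "Corner Cafe"
```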
### Advanced Patterns
**Multi-line customer address:**
When customer info spans multiple lines (name + address):
```clojure
:customer-identifier #"(?s)I\s+([A-Z][A-Z\s]+?)\s{2,}.*?L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)"
:account-number #"(?s)L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)"
```
The `(?s)` flag makes `.` match newlines. Use non-greedy `+?` and lookaheads `(?=...)` to capture clean values.
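
To see the effect of `(?s)` in isolation, the sketch below runs the same pattern with and without the flag. The text and the `clean-lines` helper are hypothetical; the pattern is deliberately simpler than the template above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical multi-line block, as it might appear in extracted PDF text.
(def address-text "Bill To:\n  CORNER CAFE\n  123 MAIN ST\nTerms: Net 30\n")

;; Without (?s), `.` cannot cross the newline after "Bill To:", so no match.
(def without-dotall (re-find #"Bill To:(.*?)Terms" address-text))
;; => nil

;; With (?s), `.` also matches newlines and the whole block is captured.
(def with-dotall (second (re-find #"(?s)Bill To:(.*?)Terms" address-text)))

(defn clean-lines
  "Splits a captured block into trimmed, non-blank lines."
  [block]
  (->> (str/split-lines block)
       (map str/trim)
       (remove str/blank?)
       vec))

(clean-lines with-dotall)
;; => ["CORNER CAFE" "123 MAIN ST"]
```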
**Multiple date formats:**
```clojure
:parser {:date [:clj-time ["MM/dd/yy" "yyyy-MM-dd"]]}
```
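One plausible reading of a format list is "try each format in order and keep the first that parses". The sketch below illustrates that idea only; it is not the `:clj-time` parser. It swaps in `java.time` so it runs without extra dependencies, and `parse-date` is a hypothetical name.

```clojure
(import '(java.time LocalDate)
        '(java.time.format DateTimeFormatter DateTimeParseException))

(defn parse-date
  "Tries each format string in order and returns the first LocalDate that
  parses, or nil when none of the formats fit. Illustrative only."
  [formats s]
  (some (fn [fmt]
          (try
            (LocalDate/parse s (DateTimeFormatter/ofPattern fmt))
            (catch DateTimeParseException _ nil)))
        formats))

(parse-date ["MM/dd/yy" "yyyy-MM-dd"] "01/15/26")
;; => a LocalDate for 2026-01-15

(parse-date ["MM/dd/yy" "yyyy-MM-dd"] "2026-01-15")
;; => a LocalDate for 2026-01-15

(parse-date ["MM/dd/yy" "yyyy-MM-dd"] "Jan 15")
;; => nil
```

Order matters under this reading: list the most specific format first so an ambiguous string is not claimed by a looser one.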
**Credit memos (negative amounts):**
```clojure
:parser {:total [:trim-commas-and-negate nil]}
```
## Testing Best Practices
1. **Start with a failing test** - Define expected values before implementing
2. **Test actual PDF parsing** - Use `parse-file` or `parse` with real PDF text
3. **Verify each field individually** - Separate assertions for clarity
4. **Handle date comparisons carefully** - Compare year/month/day separately if needed
5. **Use `str/trim`** - Account for extra whitespace in extracted values
## Example Test Structure
```clojure
(deftest parse-vendor-invoice-12345
  (testing "Should parse Vendor invoice with expected values"
    (let [results (sut/parse-file (io/file "dev-resources/INVOICE.pdf")
                                  "INVOICE.pdf")
          result (first results)]
      (is (some? results) "Should return results")
      (is (some? result) "Template should match")
      (when result
        (is (= "Vendor Name" (:vendor-code result)))
        (is (= "12345" (:invoice-number result)))
        (is (= "Customer Name" (:customer-identifier result)))
        (is (= "100.00" (:total result)))))))
```
## Common Pitfalls
1. **Keywords must all match** - Every pattern in `:keywords` must be found in the PDF
2. **Capture groups required** - Regexes need `()` to extract values
3. **PDF text != visual text** - Layout may differ from what you see visually
4. **Greedy quantifiers** - Use `+?` instead of `+` to avoid over-matching
5. **Case sensitivity** - Regexes are case-sensitive unless you use the `(?i)` flag
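
Several of these pitfalls are easy to reproduce at the REPL. The sketch below uses a hypothetical line of layout-style text and plain `re-find`; nothing here depends on the template code.

```clojure
(require '[clojure.string :as str])

;; Hypothetical line with columns separated by runs of spaces.
(def row "Customer: CORNER CAFE   Route: 12   Driver: SAM\n")

;; Pitfall 2: without a capture group, re-find returns the whole match.
(re-find #"Route:\s+\d+" row)
;; => "Route: 12"
(re-find #"Route:\s+(\d+)" row)
;; => ["Route: 12" "12"]

;; Pitfall 4: the greedy group runs past the name into the next column.
(def greedy (second (re-find #"Customer:\s+(.+)\s{2,}" row)))
(str/includes? greedy "Route")
;; => true

;; The reluctant group with a lookahead stops at the first column gap.
(def reluctant (second (re-find #"Customer:\s+(.+?)(?=\s{2,}|\n)" row)))
;; => "CORNER CAFE"

;; Pitfall 5: matching is case-sensitive unless (?i) is set.
(re-find #"corner cafe" row)
;; => nil
(re-find #"(?i)corner cafe" row)
;; => "CORNER CAFE"
```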
## Post-Creation Checklist
After creating the template:
- [ ] Test passes: `lein test auto-ap.parse.templates-test`
- [ ] Format is correct: `lein cljfmt check`
- [ ] Code compiles: `lein check`
- [ ] Template is in correct position in `pdf-templates` vector
- [ ] Keywords uniquely identify this vendor (won't match other templates)
- [ ] Test file follows naming conventions
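
The keyword-uniqueness item can be spot-checked with a small helper. This is a sketch under the rule stated above that every pattern in `:keywords` must match; `known-templates` and `matching-vendors` are hypothetical names, and the real lookup in `templates.clj` may differ.

```clojure
;; Two entries reduced to the keys needed for matching.
(def known-templates
  [{:vendor "Bonanza Produce"         :keywords [#"530-544-4136"]}
   {:vendor "General Produce Company" :keywords [#"916-552-6495"]}])

(defn matching-vendors
  "Returns the vendors whose keyword patterns are all found in text."
  [templates text]
  (->> templates
       (filter (fn [t] (every? #(re-find % text) (:keywords t))))
       (mapv :vendor)))

(matching-vendors known-templates "BONANZA PRODUCE  530-544-4136\nNO 12345678")
;; => ["Bonanza Produce"]

;; More than one vendor in the result means the keywords are ambiguous.
(matching-vendors known-templates "Call 530-544-4136 or 916-552-6495")
;; => ["Bonanza Produce" "General Produce Company"]
```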
## Integration with Workflow
This skill is typically used as part of a larger workflow:
1. User provides PDF and requirements
2. This skill creates template and test
3. User reviews and refines if needed
4. Test is run to verify extraction
5. Code is committed
The skill ensures consistency with existing patterns and reduces manual boilerplate when adding new vendor support.

@@ -0,0 +1,188 @@
# Invoice Template Examples
## Simple Single Invoice
```clojure
{:vendor "Gstar Seafood"
 :keywords [#"G Star Seafood"]
 :extract {:total #"Total\s{2,}([\d\-,]+\.\d{2,2}+)"
           :customer-identifier #"(.*?)(?:\s+)Invoice #"
           :date #"Invoice Date\s{2,}([0-9]+/[0-9]+/[0-9]+)"
           :invoice-number #"Invoice #\s+(\d+)"}
 :parser {:date [:clj-time "MM/dd/yyyy"]
          :total [:trim-commas nil]}}
```
## Multi-Invoice Statement
```clojure
{:vendor "Southbay Fresh Produce"
 :keywords [#"(SOUTH BAY FRESH PRODUCE|SOUTH BAY PRODUCE)"]
 :extract {:date #"^([0-9]+/[0-9]+/[0-9]+)"
           :customer-identifier #"To:[^\n]*\n\s+([A-Za-z' ]+)\s{2}"
           :invoice-number #"INV #\/(\d+)"
           :total #"\$([0-9.]+)\."}
 :parser {:date [:clj-time "MM/dd/yyyy"]}
 :multi #"\n"
 :multi-match? #"^[0-9]+/[0-9]+/[0-9]+\s+INV "}
```
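The `:multi` and `:multi-match?` keys are not explained here. One plausible reading, which is an assumption and not confirmed by this document, is that the text is split on `:multi` and only the segments matching `:multi-match?` are treated as invoices. The sketch below illustrates that reading with hypothetical statement text and a hypothetical `split-segments` helper.

```clojure
(require '[clojure.string :as str])

;; Hypothetical statement text with one invoice per line.
(def statement-text
  (str "SOUTH BAY FRESH PRODUCE\n"
       "Statement for CORNER CAFE\n"
       "01/05/2026  INV #/1001   120.50\n"
       "01/12/2026  INV #/1002   98.75\n"
       "Thank you for your business\n"))

(defn split-segments
  "Splits text on `multi` and keeps only segments matching `multi-match?`.
  Assumed behaviour, for illustration only."
  [text multi multi-match?]
  (->> (str/split text multi)
       (filterv #(re-find multi-match? %))))

(def segments
  (split-segments statement-text
                  #"\n"
                  #"^[0-9]+/[0-9]+/[0-9]+\s+INV "))

;; Each surviving segment is then run through the per-field patterns.
(mapv (fn [seg]
        {:date           (second (re-find #"^([0-9]+/[0-9]+/[0-9]+)" seg))
         :invoice-number (second (re-find #"INV #\/(\d+)" seg))})
      segments)
;; => [{:date "01/05/2026", :invoice-number "1001"}
;;     {:date "01/12/2026", :invoice-number "1002"}]
```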
## Customer with Address (Multi-line)
```clojure
{:vendor "Bonanza Produce"
 :keywords [#"530-544-4136"]
 :extract {:invoice-number #"NO\s+(\d{8,})\s+\d{2}/\d{2}/\d{2}"
           :date #"NO\s+\d{8,}\s+(\d{2}/\d{2}/\d{2})"
           :customer-identifier #"(?s)I\s+([A-Z][A-Z\s]+?)\s{2,}.*?L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)"
           :account-number #"(?s)L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)"
           :total #"SHIPPED\s+[\d\.]+\s+TOTAL\s+([\d\.]+)"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas nil]}}
```
## Credit Memo (Negative Amounts)
```clojure
{:vendor "General Produce Company"
 :keywords [#"916-552-6495"]
 :extract {:date #"DATE.*\n.*\n.*?([0-9]+/[0-9]+/[0-9]+)"
           :invoice-number #"CREDIT NO.*\n.*\n.*?(\d{5,}?)\s+"
           :account-number #"CUST NO.*\n.*\n\s+(\d+)"
           :total #"TOTAL:\s+\|\s*(.*)"}
 :parser {:date [:clj-time "MM/dd/yy"]
          :total [:trim-commas-and-negate nil]}}
```
## Complex Date Parsing
```clojure
{:vendor "Ben E. Keith"
 :keywords [#"BEN E. KEITH"]
 :extract {:date #"Customer No Mo Day Yr.*?\n.*?\d{5,}\s{2,}(\d+\s+\d+\s+\d+)"
           :customer-identifier #"Customer No Mo Day Yr.*?\n.*?(\d{5,})"
           :invoice-number #"Invoice No.*?\n.*?(\d{8,})"
           :total #"Total Invoice.*?\n.*?([\-]?[0-9]+\.[0-9]{2,})"}
 :parser {:date [:month-day-year nil]
          :total [:trim-commas-and-negate nil]}}
```
## Multiple Date Formats
```clojure
{:vendor "RNDC"
 :keywords [#"P.O.Box 743564"]
 :extract {:date #"(?:INVOICE|CREDIT) DATE\n(?:.*?)(\S+)\n"
           :account-number #"Store Number:\s+(\d+)"
           :invoice-number #"(?:INVOICE|CREDIT) DATE\n(?:.*?)\s{2,}(\d+?)\s+\S+\n"
           :total #"Net Amount(?:.*\n){4}(?:.*?)([\-]?[0-9\.]+)\n"}
 :parser {:date [:clj-time ["MM/dd/yy" "dd-MMM-yy"]]
          :total [:trim-commas-and-negate nil]}}
```
## Common Regex Patterns
### Phone Numbers
```clojure
#"\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}"
```
### Dollar Amounts
```clojure
#"\$?([0-9,]+\.[0-9]{2})"
```
### Dates (MM/dd/yy)
```clojure
#"([0-9]{2}/[0-9]{2}/[0-9]{2})"
```
### Dates (MM/dd/yyyy)
```clojure
#"([0-9]{2}/[0-9]{2}/[0-9]{4})"
```
### Multi-line Text (dotall mode)
```clojure
#"(?s)start.*?end"
```
### Non-greedy Match
```clojure
#"(pattern.+?)"
```
### Lookahead Boundary
```clojure
#"value(?=\s{2,}|\n)"
```
## Field Extraction Strategies
### 1. Simple Line-based
Use `[^\n]+` to capture everything up to the end of the line:
```clojure
#"Invoice:\s+([^\n]+)"
```
### 2. Whitespace Boundary
Use `(?=\s{2,}|\n)` to stop at multiple spaces or newline:
```clojure
#"Customer:\s+(.+?)(?=\s{2,}|\n)"
```
### 3. Specific Marker
Match until a specific pattern is found:
```clojure
#"(?s)Start(.*?)End"
```
### 4. Multi-part Extraction
Use multiple capture groups for related fields:
```clojure
#"Date:\s+(\d{2})/(\d{2})/(\d{2})"
```
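With several capture groups, `re-find` returns a vector holding the whole match followed by each group, which destructures cleanly. A minimal sketch, with `date-parts` as a hypothetical helper name:

```clojure
(defn date-parts
  "Splits a MM/dd/yy date after a \"Date:\" label into numeric parts.
  Returns nil when the pattern does not match."
  [text]
  (when-let [[_ mm dd yy] (re-find #"Date:\s+(\d{2})/(\d{2})/(\d{2})" text)]
    {:month (Long/parseLong mm)
     :day   (Long/parseLong dd)
     :year  (Long/parseLong yy)}))

(date-parts "Date: 01/15/26")
;; => {:month 1, :day 15, :year 26}

(date-parts "Date: January 15")
;; => nil
```

The year stays two-digit here; expanding it to a full year is left to whichever date parser the template uses.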
## Parser Options
### Date Parsers
- `[:clj-time "MM/dd/yyyy"]` - Standard US date
- `[:clj-time "MM/dd/yy"]` - 2-digit year
- `[:clj-time "MMM dd, yyyy"]` - Named month
- `[:clj-time ["MM/dd/yy" "yyyy-MM-dd"]]` - Multiple formats
- `[:month-day-year nil]` - Space-separated (1 15 26)
### Number Parsers
- `[:trim-commas nil]` - Remove commas from numbers
- `[:trim-commas-and-negate nil]` - Handle negative/credit amounts
- `[:trim-commas-and-remove-dollars nil]` - Remove $ and commas
- `nil` - No parsing, return raw string
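
Going only by the descriptions above, two of the number parsers can be pictured as plain string clean-ups. The functions below are hypothetical stand-ins, not the project's implementations; whether the real parsers return strings or numbers is not stated here, and `:trim-commas-and-negate` is left out because its exact behaviour is not described.

```clojure
(require '[clojure.string :as str])

(defn trim-commas*
  "Hypothetical stand-in for :trim-commas. Removes thousands separators."
  [s]
  (str/replace s "," ""))

(defn trim-commas-and-remove-dollars*
  "Hypothetical stand-in for :trim-commas-and-remove-dollars.
  Removes commas and dollar signs."
  [s]
  (str/replace s #"[,$]" ""))

(trim-commas* "1,234.50")
;; => "1234.50"

(trim-commas-and-remove-dollars* "$1,234.50")
;; => "1234.50"
```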
## Testing Patterns
### Basic Test Structure
```clojure
(deftest parse-vendor-invoice
  (testing "Should parse vendor invoice"
    (let [results (sut/parse-file (io/file "dev-resources/INVOICE.pdf")
                                  "INVOICE.pdf")
          result (first results)]
      (is (some? result))
      (is (= "Vendor" (:vendor-code result)))
      (is (= "12345" (:invoice-number result))))))
```
### Date Testing
```clojure
(let [d (:date result)]
  (is (= 2026 (time/year d)))
  (is (= 1 (time/month d)))
  (is (= 15 (time/day d))))
```
### Multi-field Verification
```clojure
(is (= "Expected Name" (:customer-identifier result)))
(is (= "Expected Street" (str/trim (:account-number result))))
(is (= "Expected City, ST 12345" (str/trim (:location result))))
```