diff --git a/.claude/skills/invoice-template-creator/SKILL.md b/.claude/skills/invoice-template-creator/SKILL.md new file mode 100644 index 00000000..2afd24a0 --- /dev/null +++ b/.claude/skills/invoice-template-creator/SKILL.md @@ -0,0 +1,201 @@ +--- +name: invoice-template-creator +description: This skill creates PDF invoice parsing templates for the Integreat system. It should be used when adding support for a new vendor invoice format that needs to be automatically parsed. +license: Complete terms in LICENSE.txt +--- + +# Invoice Template Creator + +This skill automates the creation of invoice parsing templates for the Integreat system. It generates both the template definition and a corresponding test file based on a sample PDF invoice. + +## When to Use This Skill + +Use this skill when you need to add support for a new vendor invoice format that cannot be parsed by existing templates. This typically happens when: + +- A new vendor sends invoices in a unique format +- An existing vendor changes their invoice layout +- You encounter an invoice that fails to parse with current templates + +## Prerequisites + +Before using this skill, ensure you have: + +1. A sample PDF invoice file placed in `dev-resources/` directory +2. Identified the vendor name +3. Identified unique text patterns in the invoice (phone numbers, addresses, etc.) that can distinguish this vendor +4. Know the expected values for key fields (invoice number, date, customer name, total) + +## Usage Workflow + +### Step 1: Analyze the PDF + +First, extract and analyze the PDF text to understand its structure: + +```bash +pdftotext -layout "dev-resources/FILENAME.pdf" - +``` + +Look for: +- **Vendor identifiers**: Phone numbers, addresses, or unique text that identifies this vendor +- **Field patterns**: How invoice number, date, customer name, and total appear in the text +- **Layout quirks**: Multi-line fields, special formatting, or unusual spacing + +### Step 2: Define Expected Values + +Document the expected values for each field: + +| Field | Expected Value | Notes | +|-------|---------------|-------| +| Vendor Name | "Vendor Name" | Company name as it should appear | +| Invoice Number | "12345" | The invoice identifier | +| Date | "01/15/26" | Format found in PDF | +| Customer Name | "Customer Name" | As it appears on invoice | +| Customer Address | "123 Main St" | Street address if available | +| Total | "100.00" | Amount | + +### Step 3: Create the Template and Test + +The skill will: + +1. **Create a test file** at `test/clj/auto_ap/parse/templates_test.clj` (or add to existing) + - Test parses the PDF file + - Verifies all expected values are extracted correctly + - Follows existing test patterns + +2. **Add template to** `src/clj/auto_ap/parse/templates.clj` + - Adds entry to `pdf-templates` vector + - Includes: + - `:vendor` - Vendor name + - `:keywords` - Regex patterns to identify this vendor (must match all) + - `:extract` - Regex patterns for each field + - `:parser` - Optional date/number parsers + +### Step 4: Iterative Refinement + +Run the test to see if it passes: + +```bash +lein test auto-ap.parse.templates-test +``` + +If it fails, examine the debug output and refine the regex patterns. Common issues: + +- **Template doesn't match**: Keywords don't actually appear in the PDF text +- **Field is nil**: Regex capture group doesn't match the actual text format +- **Wrong value captured**: Regex is too greedy or matches wrong text + +## Template Structure Reference + +### Basic Template Format + +```clojure +{:vendor "Vendor Name" + :keywords [#"unique-pattern-1" #"unique-pattern-2"] + :extract {:invoice-number #"Invoice\s+#\s+(\d+)" + :date #"Date:\s+(\d{2}/\d{2}/\d{2})" + :customer-identifier #"Bill To:\s+([A-Za-z\s]+)" + :total #"Total:\s+\$([\d,]+\.\d{2})"} + :parser {:date [:clj-time "MM/dd/yy"] + :total [:trim-commas nil]}} +``` + +### Field Extraction Patterns + +**Invoice Number:** +- Look for: `"Invoice #12345"` or `"INV: 12345"` +- Pattern: `#"Invoice\s*#?\s*(\d+)"` or `#"INV:\s*(\d+)"` + +**Date:** +- Common formats: `"01/15/26"`, `"Jan 15, 2026"`, `"2026-01-15"` +- Pattern: `#"(\d{2}/\d{2}/\d{2})"` for MM/dd/yy +- Parser: `:date [:clj-time "MM/dd/yy"]` + +**Customer Identifier:** +- Look for: `"Bill To: Customer Name"` or `"Sold To: Customer Name"` +- Pattern: `#"Bill To:\s+([A-Za-z\s]+?)(?=\s{2,}|\n)"` +- Use non-greedy `+?` and lookahead `(?=...)` to stop at boundaries + +**Total:** +- Look for: `"Total: $100.00"` or `"Amount Due: 100.00"` +- Pattern: `#"Total:\s+\$?([\d,]+\.\d{2})"` +- Parser: `:total [:trim-commas nil]` removes commas + +### Advanced Patterns + +**Multi-line customer address:** +When customer info spans multiple lines (name + address): + +```clojure +:customer-identifier #"(?s)I\s+([A-Z][A-Z\s]+?)\s{2,}.*?L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)" +:account-number #"(?s)L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)" +``` + +The `(?s)` flag makes `.` match newlines. Use non-greedy `+?` and lookaheads `(?=...)` to capture clean values. + +**Multiple date formats:** + +```clojure +:parser {:date [:clj-time ["MM/dd/yy" "yyyy-MM-dd"]]} +``` + +**Credit memos (negative amounts):** + +```clojure +:parser {:total [:trim-commas-and-negate nil]} +``` + +## Testing Best Practices + +1. **Start with a failing test** - Define expected values before implementing +2. **Test actual PDF parsing** - Use `parse-file` or `parse` with real PDF text +3. **Verify each field individually** - Separate assertions for clarity +4. **Handle date comparisons carefully** - Compare year/month/day separately if needed +5. **Use `str/trim`** - Account for extra whitespace in extracted values + +## Example Test Structure + +```clojure +(deftest parse-vendor-invoice-12345 + (testing "Should parse Vendor invoice with expected values" + (let [results (sut/parse-file (io/file "dev-resources/INVOICE.pdf") + "INVOICE.pdf") + result (first results)] + (is (some? results) "Should return results") + (is (some? result) "Template should match") + (when result + (is (= "Vendor Name" (:vendor-code result))) + (is (= "12345" (:invoice-number result))) + (is (= "Customer Name" (:customer-identifier result))) + (is (= "100.00" (:total result))))))) +``` + +## Common Pitfalls + +1. **Keywords must all match** - Every pattern in `:keywords` must be found in the PDF +2. **Capture groups required** - Regexes need `()` to extract values +3. **PDF text != visual text** - Layout may differ from what you see visually +4. **Greedy quantifiers** - Use `+?` instead of `+` to avoid over-matching +5. **Case sensitivity** - Regex is case-sensitive unless you use `(?i)` flag + +## Post-Creation Checklist + +After creating the template: + +- [ ] Test passes: `lein test auto-ap.parse.templates-test` +- [ ] Format is correct: `lein cljfmt check` +- [ ] Code compiles: `lein check` +- [ ] Template is in correct position in `pdf-templates` vector +- [ ] Keywords uniquely identify this vendor (won't match other templates) +- [ ] Test file follows naming conventions + +## Integration with Workflow + +This skill is typically used as part of a larger workflow: + +1. User provides PDF and requirements +2. This skill creates template and test +3. User reviews and refines if needed +4. Test is run to verify extraction +5. Code is committed + +The skill ensures consistency with existing patterns and reduces manual boilerplate when adding new vendor support. diff --git a/.claude/skills/invoice-template-creator/references/examples.md b/.claude/skills/invoice-template-creator/references/examples.md new file mode 100644 index 00000000..954b60ed --- /dev/null +++ b/.claude/skills/invoice-template-creator/references/examples.md @@ -0,0 +1,188 @@ +# Invoice Template Examples + +## Simple Single Invoice + +```clojure +{:vendor "Gstar Seafood" + :keywords [#"G Star Seafood"] + :extract {:total #"Total\s{2,}([\d\-,]+\.\d{2,2}+)" + :customer-identifier #"(.*?)(?:\s+)Invoice #" + :date #"Invoice Date\s{2,}([0-9]+/[0-9]+/[0-9]+)" + :invoice-number #"Invoice #\s+(\d+)"} + :parser {:date [:clj-time "MM/dd/yyyy"] + :total [:trim-commas nil]}} +``` + +## Multi-Invoice Statement + +```clojure +{:vendor "Southbay Fresh Produce" + :keywords [#"(SOUTH BAY FRESH PRODUCE|SOUTH BAY PRODUCE)"] + :extract {:date #"^([0-9]+/[0-9]+/[0-9]+)" + :customer-identifier #"To:[^\n]*\n\s+([A-Za-z' ]+)\s{2}" + :invoice-number #"INV #\/(\d+)" + :total #"\$([0-9.]+)\."} + :parser {:date [:clj-time "MM/dd/yyyy"]} + :multi #"\n" + :multi-match? #"^[0-9]+/[0-9]+/[0-9]+\s+INV "} +``` + +## Customer with Address (Multi-line) + +```clojure +{:vendor "Bonanza Produce" + :keywords [#"530-544-4136"] + :extract {:invoice-number #"NO\s+(\d{8,})\s+\d{2}/\d{2}/\d{2}" + :date #"NO\s+\d{8,}\s+(\d{2}/\d{2}/\d{2})" + :customer-identifier #"(?s)I\s+([A-Z][A-Z\s]+?)\s{2,}.*?L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)" + :account-number #"(?s)L\s+([0-9][A-Z0-9\s]+?)(?=\s{2,}|\n)" + :total #"SHIPPED\s+[\d\.]+\s+TOTAL\s+([\d\.]+)"} + :parser {:date [:clj-time "MM/dd/yy"] + :total [:trim-commas nil]}} +``` + +## Credit Memo (Negative Amounts) + +```clojure +{:vendor "General Produce Company" + :keywords [#"916-552-6495"] + :extract {:date #"DATE.*\n.*\n.*?([0-9]+/[0-9]+/[0-9]+)" + :invoice-number #"CREDIT NO.*\n.*\n.*?(\d{5,}?)\s+" + :account-number #"CUST NO.*\n.*\n\s+(\d+)" + :total #"TOTAL:\s+\|\s*(.*)"} + :parser {:date [:clj-time "MM/dd/yy"] + :total [:trim-commas-and-negate nil]}} +``` + +## Complex Date Parsing + +```clojure +{:vendor "Ben E. Keith" + :keywords [#"BEN E. KEITH"] + :extract {:date #"Customer No Mo Day Yr.*?\n.*?\d{5,}\s{2,}(\d+\s+\d+\s+\d+)" + :customer-identifier #"Customer No Mo Day Yr.*?\n.*?(\d{5,})" + :invoice-number #"Invoice No.*?\n.*?(\d{8,})" + :total #"Total Invoice.*?\n.*?([\-]?[0-9]+\.[0-9]{2,})"} + :parser {:date [:month-day-year nil] + :total [:trim-commas-and-negate nil]}} +``` + +## Multiple Date Formats + +```clojure +{:vendor "RNDC" + :keywords [#"P.O.Box 743564"] + :extract {:date #"(?:INVOICE|CREDIT) DATE\n(?:.*?)(\S+)\n" + :account-number #"Store Number:\s+(\d+)" + :invoice-number #"(?:INVOICE|CREDIT) DATE\n(?:.*?)\s{2,}(\d+?)\s+\S+\n" + :total #"Net Amount(?:.*\n){4}(?:.*?)([\-]?[0-9\.]+)\n"} + :parser {:date [:clj-time ["MM/dd/yy" "dd-MMM-yy"]] + :total [:trim-commas-and-negate nil]}} +``` + +## Common Regex Patterns + +### Phone Numbers +```clojure +#"\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}" +``` + +### Dollar Amounts +```clojure +#"\$?([0-9,]+\.[0-9]{2})" +``` + +### Dates (MM/dd/yy) +```clojure +#"([0-9]{2}/[0-9]{2}/[0-9]{2})" +``` + +### Dates (MM/dd/yyyy) +```clojure +#"([0-9]{2}/[0-9]{2}/[0-9]{4})" +``` + +### Multi-line Text (dotall mode) +```clojure +#"(?s)start.*?end" +``` + +### Non-greedy Match +```clojure +#"(pattern.+?)" +``` + +### Lookahead Boundary +```clojure +#"value(?=\s{2,}|\n)" +``` + +## Field Extraction Strategies + +### 1. Simple Line-based +Use `[^\n]*` to match until end of line: +```clojure +#"Invoice:\s+([^\n]+)" +``` + +### 2. Whitespace Boundary +Use `(?=\s{2,}|\n)` to stop at multiple spaces or newline: +```clojure +#"Customer:\s+(.+?)(?=\s{2,}|\n)" +``` + +### 3. Specific Marker +Match until a specific pattern is found: +```clojure +#"(?s)Start(.*?)End" +``` + +### 4. Multi-part Extraction +Use multiple capture groups for related fields: +```clojure +#"Date:\s+(\d{2})/(\d{2})/(\d{2})" +``` + +## Parser Options + +### Date Parsers +- `[:clj-time "MM/dd/yyyy"]` - Standard US date +- `[:clj-time "MM/dd/yy"]` - 2-digit year +- `[:clj-time "MMM dd, yyyy"]` - Named month +- `[:clj-time ["MM/dd/yy" "yyyy-MM-dd"]]` - Multiple formats +- `[:month-day-year nil]` - Space-separated (1 15 26) + +### Number Parsers +- `[:trim-commas nil]` - Remove commas from numbers +- `[:trim-commas-and-negate nil]` - Handle negative/credit amounts +- `[:trim-commas-and-remove-dollars nil]` - Remove $ and commas +- `nil` - No parsing, return raw string + +## Testing Patterns + +### Basic Test Structure +```clojure +(deftest parse-vendor-invoice + (testing "Should parse vendor invoice" + (let [results (sut/parse-file (io/file "dev-resources/INVOICE.pdf") + "INVOICE.pdf") + result (first results)] + (is (some? result)) + (is (= "Vendor" (:vendor-code result))) + (is (= "12345" (:invoice-number result)))))) +``` + +### Date Testing +```clojure +(let [d (:date result)] + (is (= 2026 (time/year d))) + (is (= 1 (time/month d))) + (is (= 15 (time/day d)))) +``` + +### Multi-field Verification +```clojure +(is (= "Expected Name" (:customer-identifier result))) +(is (= "Expected Street" (str/trim (:account-number result)))) +(is (= "Expected City, ST 12345" (str/trim (:location result)))) +``` diff --git a/dev-resources/13595522.pdf b/dev-resources/13595522.pdf new file mode 100644 index 00000000..4f4845a0 Binary files /dev/null and b/dev-resources/13595522.pdf differ