email-organizer/docs/design/processed-emails-spec.md
2025-08-09 21:04:21 -07:00

# Processed Emails Feature Specification
## Overview
This document specifies how the Email Organizer system persistently tracks which emails it has processed. Keeping a record of each email's processing status prevents the same emails from being reprocessed during synchronization and keeps pending email counts accurate.
## Current Implementation Status
The Processed Emails feature is fully implemented and operational.
### Core Implementation
- **ProcessedEmail Model**: Implemented in [`app/models.py`](app/models.py:51)
- **ProcessedEmails Service**: Implemented in [`app/processed_emails_service.py`](app/processed_emails_service.py:7)
- **Emails Blueprint**: Implemented in [`app/routes/emails.py`](app/routes/emails.py:1)
- **UI Integration**: Pending emails dialog and processing functionality
### Key Features
- Email UID tracking for processing status
- Pending email counts and management
- Bulk email processing operations
- Email metadata display and management
- Integration with IMAP synchronization process
## Requirements
### 1. Email Tracking Requirements
- **Unique Email Identification**: Identify each email by the UID the IMAP server assigns to it, combined with the folder name and user ID (IMAP UIDs are unique only within a single folder)
- **Processing Status**: Mark emails as either "pending" (unprocessed) or "processed"
- **Minimal Data Storage**: Store only essential information - email UID, folder, user, and processing status - not email content, subjects, or bodies
- **Persistence**: Maintain processing status across application restarts and synchronization cycles
- **Efficient Lookup**: Quickly determine which emails in a folder are pending processing
### 2. Synchronization Requirements
- **Initial Sync**: During first synchronization of a folder, all emails should be marked as "pending"
- **Incremental Sync**: On subsequent syncs, only emails that haven't been processed should be identified as pending
- **Status Update**: When an email is processed, update its status from "pending" to "processed"
- **Cleanup**: Remove records for emails that no longer exist on the IMAP server
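
The synchronization rules above reduce to a set comparison between the UIDs already recorded for a folder and the UIDs the IMAP server currently reports. The following is a minimal sketch of that comparison, assuming UIDs are handled as strings; `plan_sync` is a hypothetical helper for illustration, not a function from the codebase.

```python
from typing import Iterable, Set, Tuple


def plan_sync(
    known_uids: Iterable[str], server_uids: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (uids_to_add_as_pending, uids_to_remove) for one folder.

    Hypothetical helper, shown only to illustrate the sync rules.
    """
    known = set(known_uids)
    current = set(server_uids)
    to_add = current - known      # new on the server: record as pending
    to_remove = known - current   # gone from the server: clean up
    return to_add, to_remove


# Initial sync: nothing is recorded yet, so every UID becomes pending.
first_add, first_remove = plan_sync([], ["1", "2", "3"])

# Incremental sync: UID 1 was deleted on the server and UID 4 arrived.
next_add, next_remove = plan_sync(["1", "2", "3"], ["2", "3", "4"])
```

Records that are already known and still present on the server are left untouched, so their pending or processed status survives every sync cycle.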
### 3. Performance Requirements
- **Efficient Storage**: Use appropriate database indexing for fast lookups
- **Minimal Memory Usage**: Store only essential data to keep memory footprint low
- **Batch Processing**: Support batch operations for processing multiple emails efficiently
## Data Model Design
### ProcessedEmails Table
```mermaid
erDiagram
USER {
int id PK "Primary Key"
string email "Unique, Not Null"
string first_name "Not Null"
string last_name "Not Null"
string password_hash "Not Null"
json imap_config "JSON Configuration"
datetime created_at "Default: UTC Now"
datetime updated_at "Default: UTC Now, On Update"
}
FOLDER {
int id PK "Primary Key"
int user_id FK "Foreign Key to User"
string name "Not Null"
text rule_text "Natural Language Rule"
int priority "Processing Order"
boolean organize_enabled "Default: True"
int total_count "Default: 0"
int pending_count "Default: 0"
json recent_emails "JSON Array"
datetime created_at "Default: UTC Now"
datetime updated_at "Default: UTC Now, On Update"
}
PROCESSED_EMAIL {
int id PK "Primary Key"
int user_id FK "Foreign Key to User"
int folder_id FK "Foreign Key to Folder"
string email_uid "Not Null, IMAP Email UID"
string folder_name "Not Null, IMAP Folder Name"
boolean is_processed "Default: False, Processing Status"
datetime first_seen_at "Default: UTC Now, First seen during sync"
datetime processed_at "Nullable, When email was processed"
datetime created_at "Default: UTC Now"
datetime updated_at "Default: UTC Now, On Update"
}
USER ||--o{ FOLDER : "has"
USER ||--o{ PROCESSED_EMAIL : "has"
FOLDER ||--o{ PROCESSED_EMAIL : "has"
```
### Column Specifications
| Table | Column | Data Type | Constraints | Description |
|-------|--------|-----------|--------------|-------------|
| PROCESSED_EMAIL | id | Integer | Primary Key, Autoincrement | Unique identifier for each processed email record |
| PROCESSED_EMAIL | user_id | Integer | Foreign Key to User, Not Null | Reference to the user who owns this email |
| PROCESSED_EMAIL | folder_id | Integer | Foreign Key to Folder, Not Null | Reference to the folder this email belongs to |
| PROCESSED_EMAIL | email_uid | String(255) | Not Null | UID assigned to the email by the IMAP server (unique within a folder) |
| PROCESSED_EMAIL | folder_name | String(255) | Not Null | Name of the IMAP folder (for redundancy) |
| PROCESSED_EMAIL | is_processed | Boolean | Default: False | Processing status (false=pending, true=processed) |
| PROCESSED_EMAIL | first_seen_at | DateTime | Default: datetime.utcnow | First time this email was detected during sync |
| PROCESSED_EMAIL | processed_at | DateTime | Nullable | When the email was marked as processed |
| PROCESSED_EMAIL | created_at | DateTime | Default: datetime.utcnow | Record creation timestamp |
| PROCESSED_EMAIL | updated_at | DateTime | Default: datetime.utcnow, On Update | Record update timestamp |
### Relationships
- **User to ProcessedEmail**: One-to-many relationship - each user can have multiple processed email records
- **Folder to ProcessedEmail**: One-to-many relationship - each folder can have multiple processed email records
- **Composite Key**: The combination of (user_id, folder_name, email_uid) should be unique to prevent duplicate records
### Database Indexes
- Primary key index on `id`
- Foreign key indexes on `user_id` and `folder_id`
- Composite unique index on `(user_id, folder_name, email_uid)`
- Index on `folder_name` for faster folder-based queries
- Index on `is_processed` for filtering pending emails
- Index on `first_seen_at` for tracking recently added emails
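
To make the constraints and indexes listed above concrete, here is a rough sketch using Python's built-in `sqlite3` module. It is not the project's model definition (that lives in `app/models.py`). The index names are assumptions, and the foreign key declarations and the `created_at`/`updated_at` columns are left out to keep the example short.

```python
import sqlite3

# Illustrative schema only; index names are made up for this example.
SCHEMA = """
CREATE TABLE processed_emails (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id       INTEGER NOT NULL,
    folder_id     INTEGER NOT NULL,
    email_uid     TEXT    NOT NULL,
    folder_name   TEXT    NOT NULL,
    is_processed  INTEGER NOT NULL DEFAULT 0,
    first_seen_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at  TEXT,
    UNIQUE (user_id, folder_name, email_uid)
);
CREATE INDEX ix_processed_emails_folder_name ON processed_emails (folder_name);
CREATE INDEX ix_processed_emails_is_processed ON processed_emails (is_processed);
CREATE INDEX ix_processed_emails_first_seen_at ON processed_emails (first_seen_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

insert = (
    "INSERT OR IGNORE INTO processed_emails "
    "(user_id, folder_id, email_uid, folder_name) VALUES (?, ?, ?, ?)"
)
conn.execute(insert, (1, 10, "42", "INBOX"))
conn.execute(insert, (1, 10, "42", "INBOX"))  # duplicate: skipped by the unique constraint
conn.execute(insert, (2, 20, "42", "INBOX"))  # same UID, different user: allowed

row_count = conn.execute("SELECT COUNT(*) FROM processed_emails").fetchone()[0]
pending_for_user_1 = conn.execute(
    "SELECT COUNT(*) FROM processed_emails "
    "WHERE user_id = ? AND folder_name = ? AND is_processed = 0",
    (1, "INBOX"),
).fetchone()[0]
```

The composite unique constraint is what lets a sync insert the same UID repeatedly without creating duplicate records, while still allowing two users to hold the same UID in folders with the same name.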
## Service Design
### ProcessedEmailsService
The ProcessedEmailsService ([`app/processed_emails_service.py`](app/processed_emails_service.py:7)) provides:
```python
from typing import List

# User is the application's user model; its import is omitted here.


class ProcessedEmailsService:
    def __init__(self, user: User):
        self.user = user

    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get list of email UIDs that are pending processing in a folder."""

    def mark_email_processed(self, folder_name: str, email_uid: str) -> bool:
        """Mark an email as processed."""

    def mark_emails_processed(self, folder_name: str, email_uids: List[str]) -> int:
        """Mark multiple emails as processed in bulk."""

    def sync_folder_emails(self, folder_name: str, email_uids: List[str]) -> int:
        """Sync email UIDs for a folder, adding new ones as pending."""

    def get_pending_count(self, folder_name: str) -> int:
        """Get count of pending emails for a folder."""

    def cleanup_old_records(self, folder_name: str, current_uids: List[str]) -> int:
        """Remove records for emails that no longer exist in the folder."""
```
### IMAPService Integration
The IMAP service ([`app/imap_service.py`](app/imap_service.py:11)) integrates with the ProcessedEmailsService:
```python
from typing import List, Tuple

# User and ProcessedEmailsService come from the application; imports omitted here.


class IMAPService:
    def __init__(self, user: User):
        self.user = user
        self.config = user.imap_config or {}
        self.connection = None
        self.processed_emails_service = ProcessedEmailsService(user)

    def get_folder_email_count(self, folder_name: str) -> int:
        """Get the count of emails in a specific folder, considering processed status."""

    def get_pending_emails(self, folder_name: str) -> List[str]:
        """Get email UIDs that are pending processing."""

    def sync_folders(self) -> Tuple[bool, str]:
        """Sync IMAP folders with local database, tracking email processing status."""
```
## API Endpoints
### HTMX Endpoints for Processed Email Management
1. **Get Pending Emails for a Folder**
   - Method: GET
   - Path: `/api/folders/<folder_id>/pending-emails`
   - Response: Dialog with list of email metadata for pending emails (subject, date, UID)
   - Features: Email preview, individual processing buttons
2. **Mark Email as Processed**
   - Method: POST
   - Path: `/api/folders/<folder_id>/emails/<email_uid>/process`
   - Action: Mark a specific email as processed
   - Response: Updated dialog body with new counts
3. **Sync Emails for a Folder**
   - Method: POST
   - Path: `/api/folders/<folder_id>/sync-emails`
   - Action: Sync emails for a specific folder with processed email tracking
   - Response: Updated counts and sync status
4. **Process Multiple Emails**
   - Method: POST
   - Path: `/api/folders/<folder_id>/process-emails`
   - Action: Process multiple emails in a folder (mark as processed)
   - Response: Success message with updated counts
## Workflow Integration
### Email Processing Flow
```mermaid
sequenceDiagram
participant U as User
participant B as Browser
participant M as Main Blueprint
participant I as IMAP Service
participant P as ProcessedEmails Service
participant DB as Database
U->>B: Click "Sync Folders"
B->>M: POST /api/imap/sync
M->>I: Sync folders with processed email tracking
I->>I: Connect to IMAP server
I->>I: Get list of email UIDs for folder
I->>P: sync_folder_emails(folder_name, email_uids)
P->>DB: Create pending email records
P->>I: Return count of newly added pending emails
I->>M: Return sync results
M->>B: Update UI with pending counts
```
### Email Processing Status Update
```mermaid
sequenceDiagram
participant U as User
participant B as Browser
participant M as Main Blueprint
participant P as ProcessedEmails Service
participant DB as Database
U->>B: Trigger email processing
B->>M: POST /api/folders/<folder_id>/process-emails
M->>P: mark_emails_processed(folder_name, email_uids)
P->>DB: Update email processing status
P->>M: Return success count
M->>B: Update UI with new counts
```
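
The status update in this flow can be expressed as a single parameterized `UPDATE` that is scoped to the owning user. The sketch below uses Python's built-in `sqlite3` module and a trimmed-down table. `mark_processed_bulk` is a hypothetical stand-in and may differ from how `ProcessedEmailsService.mark_emails_processed` is actually written.

```python
import sqlite3
from typing import Sequence


def mark_processed_bulk(
    conn: sqlite3.Connection, user_id: int, folder_name: str, email_uids: Sequence[str]
) -> int:
    """Mark pending rows as processed and return how many rows changed."""
    if not email_uids:
        return 0
    # One "?" per UID keeps the statement fully parameterized.
    placeholders = ", ".join("?" for _ in email_uids)
    cursor = conn.execute(
        "UPDATE processed_emails "
        "SET is_processed = 1, processed_at = CURRENT_TIMESTAMP "
        "WHERE user_id = ? AND folder_name = ? AND is_processed = 0 "
        f"AND email_uid IN ({placeholders})",
        [user_id, folder_name, *email_uids],
    )
    conn.commit()
    return cursor.rowcount


# Trimmed-down table for the example (assumption: only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_emails ("
    "user_id INTEGER NOT NULL, folder_name TEXT NOT NULL, email_uid TEXT NOT NULL, "
    "is_processed INTEGER NOT NULL DEFAULT 0, processed_at TEXT)"
)
conn.executemany(
    "INSERT INTO processed_emails (user_id, folder_name, email_uid) VALUES (?, ?, ?)",
    [(1, "INBOX", "1"), (1, "INBOX", "2"), (1, "INBOX", "3"), (2, "INBOX", "1")],
)

first_run = mark_processed_bulk(conn, 1, "INBOX", ["1", "2", "999"])
second_run = mark_processed_bulk(conn, 1, "INBOX", ["1", "2", "999"])
still_pending = conn.execute(
    "SELECT COUNT(*) FROM processed_emails WHERE is_processed = 0"
).fetchone()[0]
```

Because the statement filters on `user_id` and on `is_processed = 0`, unknown UIDs and rows owned by other users are ignored, and repeating the call changes nothing.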
## Migration Strategy
### Current Implementation Status
#### Phase 1: Data Model Implementation ✅
1. Create the `processed_emails` table with appropriate indexes ✅
2. Implement the `ProcessedEmailsService` class ✅
3. Add basic CRUD operations for email processing records ✅
#### Phase 2: IMAP Service Integration ✅
1. Update `IMAPService` to use `ProcessedEmailsService` ✅
2. Modify folder synchronization to track email UIDs ✅
3. Update email count methods to consider processing status ✅
#### Phase 3: API and UI Integration ✅
1. Add API endpoints for processed email management ✅
2. Update UI to display accurate pending counts ✅
3. Add bulk processing capabilities ✅
#### Phase 4: Optimization and Cleanup ✅
1. Implement batch processing for performance ✅
2. Add periodic cleanup of orphaned records ✅
3. Optimize database queries for large datasets ✅
## Security Considerations
1. **Access Control**: Ensure users can only access their own email processing records
2. **Data Validation**: Validate all email UIDs and folder names to prevent injection attacks
3. **Rate Limiting**: Implement rate limiting for email processing endpoints to prevent abuse
4. **Data Privacy**: Ensure no sensitive email content is stored in the database
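
As an illustration of the data validation point, the sketch below checks UIDs and folder names before they reach a query. It assumes UIDs arrive as decimal strings in the IMAP UID range (a non-zero unsigned 32-bit integer) and that folder names are limited to the 255 characters allowed by the `folder_name` column. Both function names are hypothetical.

```python
import re

_UID_RE = re.compile(r"[1-9][0-9]{0,9}")
MAX_UID = 2**32 - 1            # IMAP UIDs are non-zero unsigned 32-bit integers
MAX_FOLDER_NAME_LENGTH = 255   # matches the String(255) column


def is_valid_email_uid(value: str) -> bool:
    """Accept only decimal strings within the IMAP UID range."""
    return bool(_UID_RE.fullmatch(value)) and int(value) <= MAX_UID


def is_valid_folder_name(value: str) -> bool:
    """Reject empty, over-long, or control-character-bearing folder names."""
    if not value or len(value) > MAX_FOLDER_NAME_LENGTH:
        return False
    return not any(ord(ch) < 32 or ord(ch) == 127 for ch in value)
```

Validation of this kind complements parameterized queries rather than replacing them. Values should still be passed to the database as bound parameters.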
## Performance Considerations
1. **Database Indexing**: Proper indexing on frequently queried fields
2. **Batch Operations**: Use batch operations for processing multiple emails
3. **Memory Management**: Process emails in batches to avoid memory issues with large mailboxes
4. **Caching**: Consider caching frequently accessed email processing status
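
One simple way to apply the batching advice is to split a long UID list into fixed-size chunks and handle one chunk at a time. A minimal sketch follows. `chunked` is a hypothetical helper, and a suitable batch size depends on the database and the mailbox.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


batches = list(chunked(["1", "2", "3", "4", "5"], 2))
```

Each batch can then be passed to a bulk operation such as `mark_emails_processed`, which keeps individual statements and in-memory lists small even for very large folders.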
## Future Enhancements
1. **Email Movement Tracking**: Track when emails are moved between folders
2. **Processing History**: Maintain a history of email processing actions
3. **Email Deduplication**: Handle duplicate emails across folders
4. **Automated Cleanup**: Periodic cleanup of old or orphaned processing records
5. **Analytics**: Provide insights into email processing patterns and efficiency