# Phase 4: PDF Ingest - Research
**Researched:** 2026-03-19
**Domain:** PDF file ingestion, SkySlope/URE forms API investigation, PDF rendering, file storage
**Confidence:** MEDIUM
## Summary
Phase 4 adds document management to the agent portal: pulling PDF forms from the SkySlope/URE forms library, copying them into a client-specific storage folder, and rendering them in the browser. The critical unknown going into this phase was whether the utahrealestate.com vendor API exposes a forms library. Research indicates that the URE Web API is a **RESO OData v4 MLS listing data API** with no documented forms/PDF download endpoint. The SkySlope forms integration is a member-facing web application (skyslope.utahrealestate.com), not a programmatic API accessible to third-party apps, and no public SkySlope API for fetching form PDFs was found.
The practical conclusion: forms must be seeded manually. An agent (or developer) downloads the relevant forms from the SkySlope member portal and places them in a seed directory. The application reads that seed directory to populate the forms library. A cron job or manual re-seed script handles the monthly sync requirement (DOC-02). This approach is explicitly called out in REQUIREMENTS.md as the fallback if the vendor API does not expose forms — and it does not.
For PDF rendering, PDF.js (mozilla/pdf.js) is the correct choice. It renders PDFs natively in the browser with no iframe/plugin quirks, and — critically — it provides a canvas-based layer system that Phase 5 field overlay work will build on. Using a raw `<iframe>` or `<embed>` would make Phase 5 field placement impossible.
**Primary recommendation:** Seed forms from a static directory of manually downloaded PDFs; use `react-pdf` (which wraps PDF.js) for browser rendering; store per-client document copies under `uploads/clients/{clientId}/`.
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
- **Forms library source:** Primary source is SkySlope / URE Legacy Forms Library from utahrealestate.com (MLS Forms + URE Legacy Forms Library). Research needed on public API; scraping is a route if no API exists. Forms library syncs at least monthly. File picker upload is a backup for custom/non-standard forms.
- **Upload entry point:** Agent uploads from client profile page via "Add Document" button. Modal shows forms library list with search. File picker option also in modal.
- **Document naming:** Modal pre-fills name from template name. Agent can edit. Name only — no extra metadata. Status auto-sets to Draft. Multiple instances of same template allowed.
- **File storage:** When agent adds a template to a client, a copy is saved to `uploads/clients/{clientId}/`. Soft delete: record hidden from UI, file kept on disk.
- **PDF viewer:** Minimal chrome. PDF fills the page. Controls: page nav, zoom, download. Back link to client profile. Render method: Claude's discretion (researcher evaluate PDF.js vs browser embed — PDF.js preferred if it sets up Phase 5 field overlay cleanly).
- **Post-upload flow:** Stay on client profile after upload. New document appears in documents list. Progress indicator inside modal while saving.
### Claude's Discretion
- PDF rendering library choice (PDF.js vs iframe/embed)
- Exact storage path conventions within `uploads/clients/{id}/`
- Error handling for failed uploads or missing templates
- Forms library sync mechanism implementation details
### Deferred Ideas (OUT OF SCOPE)
- None — discussion stayed within phase scope
</user_constraints>
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|-----------------|
| DOC-01 | Agent can browse and import PDF forms from the utahrealestate.com vendor API (vendor.utahrealestate.com/webapi): investigate API capability; fall back to manual upload if forms API is not available | URE Web API is MLS listing data only (RESO OData v4). No forms/PDF endpoint is documented. Fallback to manual seed directory is the correct path. |
| DOC-02 | Forms library syncs automatically on at least a monthly basis to reflect new/updated forms | Implement a seed script (`scripts/seed-forms.ts`) that reads a local `seeds/forms/` directory. "Sync" = re-run seed script after agent manually downloads updated PDFs from SkySlope portal. Can be triggered via npm script or cron. |
| DOC-03 | Agent can view an imported PDF document in the browser | Use `react-pdf` (PDF.js wrapper). Renders pages on canvas — required for Phase 5 field overlay. Supports page navigation and zoom. |
</phase_requirements>
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| react-pdf | ^9.x | PDF rendering in browser | Wraps PDF.js; canvas-based rendering enables Phase 5 field overlays; maintained by wojtekmaj; widely used in React ecosystem |
| multer | ^1.4.x | Multipart form handling for file picker uploads | Standard Express-style multipart middleware; needs an adapter to work with App Router route handlers, which receive a Web `Request` |
| uuid | ^11.x | Generate unique filenames for stored PDFs | Prevents collisions; likely already in the project from Phase 3 |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| pdfjs-dist | ^4.x | PDF.js core (dependency of react-pdf) | Installed automatically with react-pdf |
| node:fs/promises | built-in | Read seed directory, copy files to client folders | Server-side file operations without extra deps |
| node:path | built-in | Safe path construction for uploads dir | Prevents path traversal bugs |
| node:crypto | built-in | Generate unique doc IDs if uuid not present | Fallback only |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| react-pdf | `<iframe src="...">` or `<embed>` | iframe/embed works for DOC-03 in isolation but makes Phase 5 field placement on canvas impossible — ruled out |
| react-pdf | @react-pdf/renderer | renderer is for GENERATING PDFs, not VIEWING them — wrong tool |
| multer | Next.js formData() built-in | Built-in works for small files; multer gives streaming + size limits for large PDFs |
| manual seed dir | Scraping SkySlope portal | Scraping violates ToS (explicit in REQUIREMENTS.md Out of Scope table); no public API exists |
**Installation:**
```bash
npm install react-pdf multer uuid
npm install --save-dev @types/multer
```
Note: `react-pdf` v9+ depends on `pdfjs-dist`, which npm installs alongside it. Follow the react-pdf docs for worker configuration in Next.js.
## Architecture Patterns
### Recommended Project Structure
```
src/
├── app/
│   ├── portal/
│   │   ├── clients/[id]/
│   │   │   └── page.tsx          # Already exists (Phase 3); add "Add Document" button
│   │   └── documents/[docId]/
│   │       └── page.tsx          # NEW: PDF viewer page
│   └── api/
│       ├── documents/
│       │   ├── route.ts          # POST: create document record (from library or file picker)
│       │   └── [id]/
│       │       ├── route.ts      # DELETE: soft delete
│       │       └── file/
│       │           └── route.ts  # GET: serve PDF file (authenticated)
│       └── forms-library/
│           └── route.ts          # GET: list available seed forms
seeds/
└── forms/                        # Manually downloaded PDFs from SkySlope portal
    ├── purchase-agreement.pdf
    ├── listing-agreement.pdf
    └── ...
uploads/
└── clients/
    └── {clientId}/
        └── {uuid}.pdf            # Per-client document copies
scripts/
└── seed-forms.ts                 # Reads seeds/forms/, upserts into form_templates table
```
### Pattern 1: Forms Library Seed Table
**What:** A `form_templates` table in the database holds metadata (name, filename) for each seeded PDF. The `seeds/forms/` directory holds the actual PDF files. The seed script syncs the directory to the table.
**When to use:** All forms library browsing queries hit the DB (fast, searchable). File reads only happen when a form is copied to a client folder.
**Example:**
```typescript
// scripts/seed-forms.ts
import { readdir } from 'node:fs/promises';
import path from 'node:path';
import { db } from '@/db';
import { formTemplates } from '@/db/schema';

const SEEDS_DIR = path.join(process.cwd(), 'seeds/forms');

async function seedForms() {
  const files = await readdir(SEEDS_DIR);
  const pdfs = files.filter(f => f.toLowerCase().endsWith('.pdf'));
  for (const filename of pdfs) {
    // Derive a display name: "listing-agreement.pdf" -> "listing agreement"
    const name = filename.replace(/\.pdf$/i, '').replace(/-/g, ' ');
    // Upsert keyed on filename so re-running the script is idempotent
    await db.insert(formTemplates)
      .values({ name, filename })
      .onConflictDoUpdate({
        target: formTemplates.filename,
        set: { name, updatedAt: new Date() },
      });
  }
  console.log(`Seeded ${pdfs.length} forms`);
}

// Exit non-zero on failure so a cron job or npm script can detect it
seedForms().catch(err => {
  console.error(err);
  process.exit(1);
});
```
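The seed script upserts every PDF it finds, but it does not report templates whose files have since been removed from `seeds/forms/`. As a minimal sketch of how a monthly re-sync could surface that drift, the helper below compares filenames on disk with filenames already in `form_templates`. The names `SeedDiff` and `diffSeedFiles` are hypothetical and not part of the project.

```typescript
// Hypothetical helper; not part of the existing codebase.
interface SeedDiff {
  added: string[];   // on disk, not yet in form_templates
  removed: string[]; // in form_templates, file no longer on disk
}

function diffSeedFiles(onDisk: string[], inDb: string[]): SeedDiff {
  // Only PDFs count as seed forms; ignore stray files such as notes or .DS_Store
  const pdfsOnDisk = onDisk.filter(f => f.toLowerCase().endsWith('.pdf'));
  const diskSet = new Set(pdfsOnDisk);
  const dbSet = new Set(inDb);
  return {
    added: pdfsOnDisk.filter(f => !dbSet.has(f)).sort(),
    removed: inDb.filter(f => !diskSet.has(f)).sort(),
  };
}
```

Logging `removed` rather than deleting rows keeps existing client documents that reference a retired template intact.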
### Pattern 2: Copy-on-Add (Template → Client Document)
**What:** When agent adds a form to a client, copy the seed PDF to `uploads/clients/{clientId}/{uuid}.pdf`. Insert a `documents` record pointing to the copy. Phase 5 writes field data against the copy only, never the template.
**When to use:** Every time agent clicks "Add Document" from the library.
**Example:**
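```typescript
// src/app/api/documents/route.ts (POST handler)
import { copyFile, mkdir } from 'node:fs/promises';
import path from 'node:path';
import { eq } from 'drizzle-orm';
import { v4 as uuidv4 } from 'uuid';
import { auth } from '@/auth';
import { db } from '@/db';
import { documents, formTemplates } from '@/db/schema';

export async function POST(req: Request) {
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const { clientId, formTemplateId, name } = await req.json();
  const template = await db.query.formTemplates.findFirst({
    where: eq(formTemplates.id, formTemplateId),
  });
  if (!template) return new Response('Template not found', { status: 404 });

  const docId = uuidv4();
  // clientId comes from the request body; validate it before using it in a path
  const destDir = path.join(process.cwd(), 'uploads/clients', String(clientId));
  const destPath = path.join(destDir, `${docId}.pdf`);
  const srcPath = path.join(process.cwd(), 'seeds/forms', template.filename);

  // Copy the template so later edits never touch the seed file
  await mkdir(destDir, { recursive: true });
  await copyFile(srcPath, destPath);

  const [doc] = await db.insert(documents).values({
    id: docId,
    clientId,
    formTemplateId,
    name,
    filePath: `clients/${clientId}/${docId}.pdf`, // relative to uploads/
    status: 'draft',
  }).returning();

  return Response.json(doc, { status: 201 });
}
```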
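Because `clientId` arrives in the request body and ends up inside a filesystem path, it is worth validating before the copy. The sketch below shows one way to build the relative storage path while rejecting anything that is not a plain identifier. It assumes client and document IDs consist only of letters, digits, hyphens, and underscores (true for UUIDs and integer keys); `clientDocRelativePath` is a hypothetical name.

```typescript
// Hypothetical helper; assumes IDs are UUIDs or integers rendered as strings.
const SAFE_ID = /^[A-Za-z0-9_-]+$/;

function clientDocRelativePath(clientId: string, docId: string): string {
  // Rejects "..", slashes, and empty strings before they reach path.join()
  if (!SAFE_ID.test(clientId) || !SAFE_ID.test(docId)) {
    throw new Error('Invalid id for storage path');
  }
  // Forward slashes keep the stored value portable across platforms
  return `clients/${clientId}/${docId}.pdf`;
}
```

The returned value is what would be stored in `filePath`; the absolute path is derived at read time by joining it onto the uploads base directory.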
### Pattern 3: Authenticated PDF Serving
**What:** PDFs are served via an API route that checks auth before streaming the file. Never expose the `uploads/` directory as a static asset.
**When to use:** The PDF viewer page fetches the PDF via `/api/documents/{id}/file` — an authenticated GET route.
**Example:**
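```typescript
// src/app/api/documents/[id]/file/route.ts
import { readFile } from 'node:fs/promises';
import path from 'node:path';
import { eq } from 'drizzle-orm';
import { auth } from '@/auth';
import { db } from '@/db';
import { documents } from '@/db/schema';

const UPLOADS_BASE = path.join(process.cwd(), 'uploads');

// Note: on Next.js 15+, `params` is a Promise and must be awaited.
export async function GET(req: Request, { params }: { params: { id: string } }) {
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const doc = await db.query.documents.findFirst({
    where: eq(documents.id, params.id),
  });
  if (!doc) return new Response('Not found', { status: 404 });

  // Reject any stored path that resolves outside uploads/ (see Pitfall 2)
  const filePath = path.join(UPLOADS_BASE, doc.filePath);
  if (!filePath.startsWith(UPLOADS_BASE + path.sep)) {
    return new Response('Forbidden', { status: 403 });
  }

  const buffer = await readFile(filePath);
  return new Response(buffer, {
    headers: { 'Content-Type': 'application/pdf' },
  });
}
```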
### Pattern 4: react-pdf Viewer Component
**What:** Client component using `react-pdf` to render PDF pages on canvas. Supports page navigation and zoom.
**When to use:** Document detail page (`/portal/documents/[docId]`).
**Example:**
```typescript
// src/app/portal/documents/[docId]/_components/PdfViewer.tsx
'use client';

import { useState } from 'react';
import { Document, Page, pdfjs } from 'react-pdf';
import 'react-pdf/dist/Page/AnnotationLayer.css';
import 'react-pdf/dist/Page/TextLayer.css';

// Worker setup, required for Next.js
pdfjs.GlobalWorkerOptions.workerSrc = new URL(
  'pdfjs-dist/build/pdf.worker.min.mjs',
  import.meta.url
).toString();

export function PdfViewer({ docId }: { docId: string }) {
  const [numPages, setNumPages] = useState<number>(0);
  const [pageNumber, setPageNumber] = useState(1);
  const [scale, setScale] = useState(1.0);

  return (
    <div>
      <Document
        file={`/api/documents/${docId}/file`}
        onLoadSuccess={({ numPages }) => setNumPages(numPages)}
      >
        <Page pageNumber={pageNumber} scale={scale} />
      </Document>
      <div>
        <button onClick={() => setPageNumber(p => Math.max(1, p - 1))}>Prev</button>
        <span>{pageNumber} / {numPages}</span>
        <button onClick={() => setPageNumber(p => Math.min(numPages, p + 1))}>Next</button>
        <button onClick={() => setScale(s => s + 0.2)}>Zoom In</button>
        <button onClick={() => setScale(s => Math.max(0.4, s - 0.2))}>Zoom Out</button>
      </div>
    </div>
  );
}
```
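In the viewer example, Zoom In has no upper bound, and clicking Next before the document loads (`numPages` still 0) would set the page to 0. A small sketch of clamping helpers that could back those handlers follows. `MAX_SCALE` is an assumption, since the source does not specify an upper zoom limit, and the function names are hypothetical.

```typescript
// Hypothetical helpers for the viewer controls; not from the project.
const MIN_SCALE = 0.4; // same lower bound as the Zoom Out handler
const MAX_SCALE = 3.0; // assumption: no upper bound is given in the source
const SCALE_STEP = 0.2;

function clampPage(page: number, numPages: number): number {
  // Before the document loads numPages is 0; stay on page 1
  if (numPages < 1) return 1;
  return Math.min(numPages, Math.max(1, Math.trunc(page)));
}

function stepScale(scale: number, direction: 1 | -1): number {
  const next = scale + direction * SCALE_STEP;
  const clamped = Math.min(MAX_SCALE, Math.max(MIN_SCALE, next));
  // Round to two decimals so repeated steps do not accumulate float drift
  return Math.round(clamped * 100) / 100;
}
```

With these, the handlers would read `setPageNumber(p => clampPage(p + 1, numPages))` and `setScale(s => stepScale(s, 1))`.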
### Pattern 5: Forms Library Modal with Search
**What:** Client component modal. Fetches form templates list from `/api/forms-library`. Filters client-side by name as agent types. Has separate "Browse files" button for custom PDF upload.
**When to use:** "Add Document" button on client profile page.
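A minimal sketch of the client-side filtering this pattern describes, assuming the modal holds the `{ id, name }` list returned by `/api/forms-library`. The names `FormTemplateSummary` and `filterTemplates` are hypothetical.

```typescript
// Hypothetical shape matching the id/name fields the forms-library route returns.
interface FormTemplateSummary {
  id: number;
  name: string;
}

function filterTemplates(
  templates: FormTemplateSummary[],
  query: string
): FormTemplateSummary[] {
  // Split the query into words so "agreement purchase" still matches "Purchase Agreement"
  const terms = query.trim().toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return templates;
  return templates.filter(t => {
    const name = t.name.toLowerCase();
    return terms.every(term => name.includes(term));
  });
}
```

Filtering in memory avoids a request per keystroke, which is reasonable while the library stays in the tens of forms.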
### Schema Additions
```typescript
// Addition to src/db/schema.ts
import { pgTable, serial, text, timestamp } from 'drizzle-orm/pg-core';

export const formTemplates = pgTable('form_templates', {
  id: serial('id').primaryKey(),
  name: text('name').notNull(),
  filename: text('filename').notNull().unique(), // filename in seeds/forms/
  createdAt: timestamp('created_at').defaultNow(),
  updatedAt: timestamp('updated_at').defaultNow(),
});

// documents table already partially defined in Phase 3 (documentStatusEnum exists)
// Add columns (integer must also be imported from 'drizzle-orm/pg-core'):
//   formTemplateId: integer('form_template_id').references(() => formTemplates.id)
//   filePath: text('file_path').notNull() // relative path within uploads/
//   name: text('name').notNull()
```
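The locked decisions call for soft delete (record hidden from the UI, file kept on disk), but the schema notes do not name a column for it. One possible approach, sketched here under the assumption of a nullable `deletedAt` timestamp, is shown as plain functions. `DocumentRow`, `softDelete`, and `visibleDocuments` are hypothetical names, and the column itself is an assumption rather than something the source specifies.

```typescript
// Hypothetical row shape; deletedAt is an assumed column, not in the source schema.
interface DocumentRow {
  id: string;
  name: string;
  deletedAt: Date | null;
}

// Mark a document as deleted without touching the file on disk
function softDelete(doc: DocumentRow, now: Date): DocumentRow {
  return { ...doc, deletedAt: now };
}

// Documents shown in the client profile list exclude soft-deleted rows
function visibleDocuments(docs: DocumentRow[]): DocumentRow[] {
  return docs.filter(d => d.deletedAt === null);
}
```

A timestamp rather than a boolean also records when the document was hidden, which may help if files are ever purged later.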
### Anti-Patterns to Avoid
- **Serving uploads as static assets:** Never put `uploads/` under `public/`. All PDF serving must go through authenticated API routes.
- **Storing absolute paths in DB:** Store relative paths only (e.g., `clients/{id}/{uuid}.pdf`). Absolute paths break when the Docker volume mount changes.
- **Mutating seed templates:** The `seeds/forms/` directory is read-only source of truth. Never write or delete files there programmatically.
- **Using `<iframe src="/api/documents/id/file">` for the viewer:** Works for DOC-03 but blocks Phase 5 canvas overlay. Use react-pdf.
- **Loading the PDF worker from CDN:** In local/Docker environment with no internet guarantee, configure the worker from `node_modules` via `import.meta.url`.
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| PDF rendering in browser | Custom canvas renderer | `react-pdf` (PDF.js) | PDF rendering handles fonts, embedded images, encodings, cross-browser canvas quirks — thousands of edge cases |
| Multipart file upload parsing | Custom body parser | `multer` or Next.js built-in formData | Boundary parsing, temp file buffering, size limits — complex protocol details |
| Unique filename generation | `Date.now() + Math.random()` | `uuid` v4 | UUID v4 provides genuine collision resistance; timestamp-based names can collide under concurrent requests |
| Path traversal prevention | String sanitization | `path.join()` + allowlist check | path.join normalizes `../` sequences; add explicit check that resolved path starts with allowed base dir |
**Key insight:** PDF rendering looks simple (just show the bytes) but requires a complete PDF interpreter for production use. PDF.js is the de facto standard open-source PDF renderer for the browser.
## Common Pitfalls
### Pitfall 1: react-pdf Worker Not Configured for Next.js
**What goes wrong:** `pdfjs.GlobalWorkerOptions.workerSrc` is not set, or set to a CDN URL. PDF renders blank or throws "Setting up fake worker" warning. In Next.js with webpack, the worker import requires special handling.
**Why it happens:** PDF.js uses a Web Worker for parsing. Next.js webpack doesn't automatically handle `.worker.mjs` imports.
**How to avoid:** Set `workerSrc` using `new URL('pdfjs-dist/build/pdf.worker.min.mjs', import.meta.url).toString()` at module scope in the same file that renders `<Document>`, as in Pattern 4. This is the pattern the react-pdf docs describe for bundlers. Do NOT use a `cdnjs.cloudflare.com` URL in a local/offline environment.
**Warning signs:** Console shows "Setting up fake worker" or PDF page renders blank.
### Pitfall 2: Path Traversal in File Serving Route
**What goes wrong:** `doc.filePath` from the DB contains `../../etc/passwd`. The file serving route reads it and returns sensitive data.
**Why it happens:** DB record was tampered with, or filePath validation was skipped.
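**How to avoid:** After building the absolute path with `path.join()`, assert that it sits inside the expected `uploads/` base directory before reading. Compare against the base plus a path separator; a bare prefix check would also accept a sibling directory such as `uploads-other/`:
```typescript
const UPLOADS_BASE = path.join(process.cwd(), 'uploads');
const filePath = path.join(UPLOADS_BASE, doc.filePath);
// The trailing separator stops "/app/uploads-other/..." from passing the prefix check
if (!filePath.startsWith(UPLOADS_BASE + path.sep)) {
  return new Response('Forbidden', { status: 403 });
}
```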
**Warning signs:** Any `..` sequence appearing in a file path.
### Pitfall 3: `uploads/` Excluded from Docker Volume
**What goes wrong:** Files are saved to `uploads/` successfully in development but disappear on container restart because `uploads/` is not mounted as a Docker volume.
**Why it happens:** Docker containers have ephemeral filesystems by default. If the volume mount targets `/app` but excludes subdirectories, files don't persist.
**How to avoid:** Ensure `docker-compose.yml` mounts `./uploads:/app/uploads` as a named or bind-mount volume. Validate survival of uploads across container restarts as an explicit acceptance criterion.
**Warning signs:** Uploaded files exist in dev but disappear after `docker compose restart`.
### Pitfall 4: react-pdf CSS Not Imported
**What goes wrong:** PDF renders but text is missing or annotation links are invisible.
**Why it happens:** react-pdf requires explicit CSS imports for the annotation and text layers.
**How to avoid:** Import both CSS files in the viewer component:
```typescript
import 'react-pdf/dist/Page/AnnotationLayer.css';
import 'react-pdf/dist/Page/TextLayer.css';
```
### Pitfall 5: File Size Limits on API Route
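**What goes wrong:** Agent uploads a large PDF via the file picker (some real estate contracts are 10-15MB) and the request is rejected for exceeding the body size limit.
**Why it happens:** Pages Router API routes parse the request body with a default `bodyParser` size limit of 1MB per the Next.js docs (the 4MB figure often quoted is the default API response size limit).
**How to avoid:** In a Pages Router API route, export `config` to raise the limit, or disable body parsing and stream with `multer`:
```typescript
export const config = { api: { bodyParser: { sizeLimit: '20mb' } } };
```
This `config` export applies only to Pages Router API routes. The routes in this phase live under `src/app/api/` (App Router), where the handler reads the upload with `await request.formData()`; enforce the size limit explicitly by checking the file size before writing to disk. `multer` is Express-style middleware and needs an adapter to run inside a route handler.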
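As a sketch of the explicit check an App Router upload handler could run before saving a file, the function below rejects empty, oversized, or non-PDF uploads by inspecting the raw bytes. The 20MB ceiling mirrors the `sizeLimit` shown for the Pages Router config; `checkPdfUpload` and `UploadCheck` are hypothetical names.

```typescript
// Hypothetical validator; the limit mirrors the '20mb' sizeLimit example.
const MAX_PDF_BYTES = 20 * 1024 * 1024;

type UploadCheck = { ok: true } | { ok: false; reason: string };

function checkPdfUpload(
  bytes: Uint8Array,
  maxBytes: number = MAX_PDF_BYTES
): UploadCheck {
  if (bytes.byteLength === 0) return { ok: false, reason: 'empty file' };
  if (bytes.byteLength > maxBytes) return { ok: false, reason: 'file too large' };
  // PDF files begin with the ASCII signature "%PDF-"
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d];
  const hasSignature = signature.every((b, i) => bytes[i] === b);
  return hasSignature ? { ok: true } : { ok: false, reason: 'not a PDF' };
}
```

In a route handler, the bytes would come from the uploaded `File` in the form data, for example `new Uint8Array(await file.arrayBuffer())`, where the form field name is whatever the modal uses. Checking the signature rather than the client-supplied MIME type avoids trusting the browser's `Content-Type`.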
### Pitfall 6: Missing `seeds/forms/` Directory in Docker Build
**What goes wrong:** Container starts but forms library is empty because `seeds/forms/` was not included in the Docker image or volume mount.
**Why it happens:** `.dockerignore` or volume mounts exclude the seeds directory.
**How to avoid:** Either include `seeds/forms/` in the Docker image (acceptable since these are non-sensitive read-only PDFs), or mount it as a read-only volume. Document the required setup step in a project README.
## Code Examples
### Seed Script Run Command
```bash
# package.json scripts
"seed:forms": "npx tsx scripts/seed-forms.ts"
# Monthly re-sync: download new PDFs to seeds/forms/, re-run seed:forms
```
### Forms Library API Route
```typescript
// src/app/api/forms-library/route.ts
import { auth } from '@/auth';
import { db } from '@/db';
import { formTemplates } from '@/db/schema';

export async function GET() {
  const session = await auth();
  if (!session) return new Response('Unauthorized', { status: 401 });

  const forms = await db.select({
    id: formTemplates.id,
    name: formTemplates.name,
  }).from(formTemplates).orderBy(formTemplates.name);

  return Response.json(forms);
}
```
### Drizzle Migration for New Tables
```bash
npx drizzle-kit generate # generates new migration
npx drizzle-kit migrate # applies migration
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `pdf.js` manual integration | `react-pdf` wrapper | ~2018 | react-pdf handles worker config, React lifecycle, and canvas management |
| URE RETS data feed | RESO Web API (OData v4) | 2020 | Listing data only — no forms endpoint has ever been exposed |
| SkySlope separate product | SkySlope integrated into URE member portal | ~2022 | Forms are now SkySlope-managed; no independent URE forms API |
**Deprecated/outdated:**
- RETS protocol: URE retired RETS in 2020; all data feeds now go through the RESO Web API
- `@react-pdf/renderer`: This is a PDF *generator*, not a viewer. It serves a different use case and should not be confused with `react-pdf`
## Open Questions
1. **Does the vendor.utahrealestate.com/webapi expose any forms endpoints?**
- What we know: The API is RESO-certified OData v4. Documented endpoints cover MLS listing resources (Property, Member, Office, Media, OpenHouse). No forms/documents endpoint is documented publicly.
- What's unclear: Whether a vendor account with elevated permissions might expose additional endpoints. The SkySlope integration is member-facing only.
- Recommendation: Proceed with manual seed directory approach (DOC-01 fallback explicitly stated in REQUIREMENTS.md). If Teressa has vendor API credentials, check the `/api/$metadata` endpoint for any `Form` or `Document` resource — but don't block Phase 4 on this.
2. **react-pdf v9 Next.js App Router compatibility**
- What we know: react-pdf v9 ships as ESM. Next.js App Router has improved ESM support but some configurations require `transpilePackages`.
- What's unclear: Whether `react-pdf` needs to be added to `next.config.js` `transpilePackages` array for this project's Next.js version.
- Recommendation: After installing, if build errors appear referencing react-pdf ESM, add `transpilePackages: ['react-pdf', 'pdfjs-dist']` to `next.config.js`.
3. **How many seed forms are needed initially?**
- What we know: Teressa uses forms from the SkySlope/URE portal. The exact list is not known.
- What's unclear: Whether 5 forms or 50+ forms are typical for a Utah real estate agent's workflow.
- Recommendation: Start with the most common forms (Purchase Agreement, Listing Agreement, Buyer Representation Agreement, Counter Offer, Addendum). Agent can expand the seed directory as needed.
## Sources
### Primary (HIGH confidence)
- [vendor.utahrealestate.com/webapi/docs](https://vendor.utahrealestate.com/webapi/docs) — Verified URE Web API is RESO OData v4, listing data only, no forms endpoint documented
- [resoapi.utahrealestate.com/login](https://resoapi.utahrealestate.com/login) — URE RESO API login page
- [skyslope.utahrealestate.com](https://skyslope.utahrealestate.com/) — SkySlope/URE integration is member-facing portal, no public API for forms
### Secondary (MEDIUM confidence)
- [SkySlope press release: URE partnership](https://skyslope.com/press-release/utahrealestate-com-partners-with-skyslope-forms-to-provide-a-best-in-class-digital-forms-solution/) — Confirms SkySlope Forms is the forms product for URE members; member benefit, not API
- react-pdf documentation and GitHub (wojtekmaj/react-pdf) — Canvas-based PDF rendering pattern, worker configuration for bundlers
- REQUIREMENTS.md Out of Scope table — Explicitly lists "utahrealestate.com forms scraping (automated credential login) — Likely violates ToS; use vendor API or manual upload instead"
### Tertiary (LOW confidence)
- WebSearch results on URE GitHub organization — No forms-related repositories found; confirms no open-source forms API
## Metadata
**Confidence breakdown:**
- Standard stack (react-pdf, multer, uuid): MEDIUM — react-pdf is well-established but Next.js App Router ESM compatibility needs verification at install time
- Architecture (seed dir, copy-on-add, authenticated serving): HIGH — standard file management patterns, no novel approaches
- SkySlope/URE API finding (no forms API): MEDIUM — based on public documentation; a vendor account might reveal more endpoints, but public evidence strongly indicates forms are member-portal-only
- Pitfalls: HIGH — react-pdf worker config, path traversal, Docker volume persistence are well-documented real-world issues
**Research date:** 2026-03-19
**Valid until:** 2026-04-19 (stable domain; re-check react-pdf major version if >30 days)