Discovery Phase
The Discovery phase is a non-destructive, read-only analysis pass over a DITA package.
Its sole purpose is to observe, classify, and report what exists in the package before any transformation occurs.
Discovery does not:
- modify files
- rename artifacts
- infer intent heuristically
- “fix” malformed content
Discovery exists to prevent the processor from lying to itself.
Why Discovery Exists
Bulk-generated DITA packages are not uniform. They vary across:
- authoring systems
- export pipelines
- organizational conventions
- historical drift
- partial migrations
Earlier versions of the processor implicitly assumed:
- a single main map
- predictable map roles
- consistent filename conventions
Those assumptions held for tests, not reality.
Discovery formalizes the gap between what we expect and what we observe.
Discovery Responsibilities
Discovery performs the following responsibilities in strict order:
1. Scan the filesystem
2. Parse XML safely
3. Identify artifacts
4. Classify artifacts
5. Validate invariants
6. Produce a report
Each responsibility is isolated and auditable.
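The first two responsibilities can be sketched as a single read-only pass. This is a minimal illustration, not the processor's actual scanner: `ScannedArtifact` and `scan_package` are hypothetical names, and "safely" is read here only as "a parse failure is recorded instead of aborting the scan". Hardening against hostile XML is out of scope for the sketch.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ScannedArtifact:
    """Hypothetical record for one file observed during the scan."""

    path: Path
    category: str             # "map" or "topic", taken from the file extension
    root_tag: str | None      # None when the XML could not be parsed
    parse_error: str | None   # parser message when parsing failed


def scan_package(package_dir: Path) -> list[ScannedArtifact]:
    """Walk the package without writing anything and record what is there."""
    suffix_to_category = {".ditamap": "map", ".dita": "topic"}
    artifacts = []
    # Sorting keeps the output order stable across runs.
    for path in sorted(package_dir.rglob("*")):
        category = suffix_to_category.get(path.suffix.lower())
        if category is None or not path.is_file():
            continue
        try:
            root_tag = ET.parse(path).getroot().tag
            error = None
        except ET.ParseError as exc:
            # Malformed content is reported, never repaired.
            root_tag, error = None, str(exc)
        artifacts.append(ScannedArtifact(path, category, root_tag, error))
    return artifacts
```

A malformed topic still appears in the result, with `root_tag` set to `None` and the parser message preserved for the report.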
Artifact Types
Discovery recognizes two primary artifact categories:
| Category | Description |
|---|---|
| Map | .ditamap files |
| Topic | .dita topic files |
Classification is descriptive, not normative.
Map Classification
Each discovered map is assigned exactly one MapType.
Defined in:
dita_package_processor.knowledge.map_types
Supported Map Types
| MapType | Meaning |
|---|---|
| MAIN | Primary entry-point map |
| ABSTRACT | Overview / abstract content |
| CONTENT | Regular content map |
| GLOSSARY | Definition or glossary map |
| CONTAINER | Structural wrapper map |
| UNKNOWN | Unclassifiable |
Classification is determined by:
- structural signatures
- referenced content
- known patterns
- filename signals (last resort)
Map Classification Flow
flowchart TD
A[DITA Map] --> B{Matches Known Pattern?}
B -->|Yes| C[Assign MapType]
B -->|No| D{Structural Signals?}
D -->|Yes| C
D -->|No| E[UNKNOWN]
Important:
UNKNOWN is a valid classification. It is not an error.
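The flow above could be expressed roughly as follows. The `MapType` member names come from the table; the enum values, the `classify_map` function, and both example checks are assumptions made for illustration. In particular, treating a mapref-only map as `CONTAINER` is a guess at one plausible rule, not a documented one.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from enum import Enum
from typing import Callable, Optional


class MapType(Enum):
    MAIN = "main"
    ABSTRACT = "abstract"
    CONTENT = "content"
    GLOSSARY = "glossary"
    CONTAINER = "container"
    UNKNOWN = "unknown"


# A check inspects a parsed map and either names a type or abstains with None.
Check = Callable[[ET.Element], Optional[MapType]]


def classify_map(
    root: ET.Element,
    known_patterns: list[Check],
    structural_signals: list[Check],
) -> MapType:
    """Known patterns first, then structural signals, otherwise UNKNOWN."""
    for check in [*known_patterns, *structural_signals]:
        result = check(root)
        if result is not None:
            return result
    # Falling through is a normal outcome, not an exception.
    return MapType.UNKNOWN


def glossary_title_pattern(root: ET.Element) -> Optional[MapType]:
    """Hypothetical known pattern: a map whose title is 'Glossary'."""
    title = (root.findtext("title") or "").strip().lower()
    return MapType.GLOSSARY if title == "glossary" else None


def mapref_only_signal(root: ET.Element) -> Optional[MapType]:
    """Hypothetical structural signal: every child is a <mapref>."""
    children = list(root)
    if children and all(child.tag == "mapref" for child in children):
        return MapType.CONTAINER
    return None
```

Because every check may abstain, each map still ends up with exactly one `MapType`, and `UNKNOWN` is simply what remains when nothing matches.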
Topic Classification
Topics are classified separately from maps.
Supported Topic Types
| TopicType | Meaning |
|---|---|
| CONTENT | Concept, task, reference |
| GLOSSARY | <glossentry> |
| UNKNOWN | Unclear or malformed |
Topic classification is currently shallow and intentionally conservative.
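A shallow classifier in this spirit might look only at the root element. The sketch below is an assumption about how that could work: `TopicType` mirrors the table, while `classify_topic` and `CONTENT_ROOTS` are hypothetical names. Anything outside the listed roots, including a generic `<topic>`, falls through to `UNKNOWN` here, which may be stricter than the real rules.

```python
import xml.etree.ElementTree as ET
from enum import Enum


class TopicType(Enum):
    CONTENT = "content"
    GLOSSARY = "glossary"
    UNKNOWN = "unknown"


# Root elements the table lists under CONTENT.
CONTENT_ROOTS = {"concept", "task", "reference"}


def classify_topic(xml_text: str) -> TopicType:
    """Classify a topic from its root element only."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Malformed topics are labelled, not repaired.
        return TopicType.UNKNOWN
    if root.tag == "glossentry":
        return TopicType.GLOSSARY
    if root.tag in CONTENT_ROOTS:
        return TopicType.CONTENT
    return TopicType.UNKNOWN
```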
Structural Signatures
Discovery relies on signatures, not guesses.
Signatures are structural markers such as:
- root element names
- presence of <glossentry>
- mapref-only maps
- empty wrapper maps
Signatures are defined in:
dita_package_processor/discovery/signatures.py
Signatures may evolve, but they are always:
- explicit
- testable
- documented
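One way to keep signatures explicit, testable, and documented is to give each a name, a description, and a pure predicate. The following is a sketch under that assumption. The `Signature` class, the predicate functions, and the signature names are hypothetical and do not reflect the contents of `signatures.py`.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Signature:
    name: str                               # explicit: every signature is named
    description: str                        # documented: says what it detects
    matches: Callable[[ET.Element], bool]   # testable: a pure predicate


def _body(root: ET.Element) -> list[ET.Element]:
    """Children of a map, ignoring its <title>."""
    return [child for child in root if child.tag != "title"]


def _is_mapref_only(root: ET.Element) -> bool:
    body = _body(root)
    return root.tag == "map" and bool(body) and all(c.tag == "mapref" for c in body)


def _is_empty_wrapper(root: ET.Element) -> bool:
    return root.tag == "map" and not _body(root)


def _contains_glossentry(root: ET.Element) -> bool:
    return root.tag == "glossentry" or root.find(".//glossentry") is not None


SIGNATURES = [
    Signature("mapref_only_map", "Map whose body holds only <mapref>", _is_mapref_only),
    Signature("empty_wrapper_map", "Map with no references at all", _is_empty_wrapper),
    Signature("contains_glossentry", "Document containing <glossentry>", _contains_glossentry),
]


def matching_signatures(root: ET.Element) -> list[str]:
    """Return the names of all signatures the document exhibits."""
    return [s.name for s in SIGNATURES if s.matches(root)]
```

Each predicate can be unit-tested against a small XML fragment, and the result is a list of observations rather than a decision.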
Known Patterns
Discovery also consults known patterns expressed in YAML.
Defined in:
dita_package_processor/knowledge/known_patterns.yaml
Patterns encode:
- historical structures
- vendor-specific exports
- known glossary layouts
- migration artifacts
Patterns are data, not code.
This allows:
- incremental expansion
- corpus-specific tuning
- safe experimentation
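To illustrate what "data, not code" buys, the sketch below treats patterns as plain dictionaries, roughly as they might look after the YAML file has been loaded. The schema (`root`, `required_child`, `map_type`), the entry names, and `match_known_pattern` are all invented for this example. The real layout of `known_patterns.yaml` is not shown in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical entries, as they might appear once loaded from YAML.
KNOWN_PATTERNS = [
    {
        "name": "vendor_glossary_export",
        "root": "map",
        "required_child": "glossref",
        "map_type": "GLOSSARY",
    },
    {
        "name": "legacy_wrapper",
        "root": "map",
        "required_child": "mapref",
        "map_type": "CONTAINER",
    },
]


def match_known_pattern(root, patterns):
    """Return the map_type of the first matching pattern, or None."""
    for pattern in patterns:
        has_child = root.find(pattern["required_child"]) is not None
        if root.tag == pattern["root"] and has_child:
            return pattern["map_type"]
    return None
```

Supporting a new export layout then means appending an entry. The matcher itself does not change, which is what makes experimentation cheap to try and cheap to revert.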
Invariants
After classification, Discovery validates invariants.
Invariants are structural truths that must hold for transformation to proceed.
Examples:
- Exactly one MAIN map
- At most one ABSTRACT map
- At most one GLOSSARY map
Defined in:
dita_package_processor/knowledge/invariants.py
Violation of invariants:
- does not crash Discovery
- does block transformation
- produces explicit diagnostics
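The three example invariants can be checked by counting classifications and collecting diagnostics instead of raising. This is a minimal sketch. `Violation`, `validate_invariants`, and the invariant identifiers are hypothetical names, and map types are passed as plain strings to keep the example self-contained.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    invariant: str   # stable identifier, useful for machine consumers
    message: str     # explicit diagnostic for human readers


def validate_invariants(map_types: list[str]) -> list[Violation]:
    """Check the example invariants; return violations instead of raising."""
    counts = Counter(map_types)
    violations = []
    if counts["MAIN"] != 1:
        violations.append(
            Violation(
                "exactly_one_main",
                f"expected exactly 1 MAIN map, found {counts['MAIN']}",
            )
        )
    for name in ("ABSTRACT", "GLOSSARY"):
        if counts[name] > 1:
            violations.append(
                Violation(
                    f"at_most_one_{name.lower()}",
                    f"expected at most 1 {name} map, found {counts[name]}",
                )
            )
    return violations
```

An empty list means transformation may proceed. A non-empty list leaves Discovery running normally while giving the pipeline a concrete reason to stop.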
Invariant Validation Flow
sequenceDiagram
participant Scanner
participant Classifier
participant Validator
participant Report
Scanner->>Classifier: Artifacts
Classifier->>Validator: Classified Maps
Validator->>Report: Violations (if any)
Discovery Report
Discovery produces a structured report capturing:
- discovered maps and topics
- assigned classifications
- violated invariants
- unresolved ambiguities
Reports are designed to be:
- human-readable
- machine-readable
- diffable
- archived
Defined in:
dita_package_processor/discovery/report.py
Discovery reports are artifacts, not logs.
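A report with these properties could be modelled as a small dataclass serialized to JSON with stable key ordering. The sketch below is an assumption about shape only. `DiscoveryReport` and its field names are hypothetical and are not taken from `report.py`.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class DiscoveryReport:
    """Hypothetical report shape covering the four items listed above."""

    maps: dict[str, str] = field(default_factory=dict)      # path -> MapType name
    topics: dict[str, str] = field(default_factory=dict)    # path -> TopicType name
    violations: list[str] = field(default_factory=list)     # violated invariants
    ambiguities: list[str] = field(default_factory=list)    # unresolved cases

    def to_json(self) -> str:
        # sort_keys and a fixed indent keep the output stable, so two
        # reports over the same package diff cleanly line by line.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Sorting keys means the order in which files were discovered does not leak into the serialized report, which is what makes archived reports comparable across runs.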
What Discovery Does Not Do
Discovery explicitly avoids:
- guessing user intent
- renaming files
- mutating XML
- repairing broken content
- deciding transformation order
Discovery answers one question only:
“What is actually here?”
Relationship to the Transformation Pipeline
Discovery runs before any pipeline steps.
Transformation steps must:
- consume Discovery output
- obey classifications
- respect invariant failures
No step is allowed to “reinterpret” Discovery findings.
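One possible way to enforce this contract is a gate in front of the pipeline: refuse to run when violations exist, and hand each step a read-only view of the report. The names `run_pipeline` and `DiscoveryBlockedError`, and the dictionary-shaped report, are assumptions made for this sketch.

```python
from types import MappingProxyType


class DiscoveryBlockedError(RuntimeError):
    """Raised when invariant violations block transformation."""


def run_pipeline(report, steps):
    """Run steps against a Discovery report, or refuse if it has violations."""
    if report["violations"]:
        raise DiscoveryBlockedError(
            "transformation blocked by: " + ", ".join(report["violations"])
        )
    # Steps receive a read-only view, so they cannot replace findings.
    # Note: the proxy guards only the top level; nested values would
    # need deeper freezing in a real implementation.
    frozen = MappingProxyType(report)
    return [step(frozen) for step in steps]
```

A step that tries to assign into the report gets a `TypeError`, which turns "no reinterpretation" from a convention into something the runtime checks.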
Failure Modes (By Design)
Discovery may report:
- UNKNOWN maps
- ambiguous classifications
- invariant violations
This is not failure.
Failure is transforming content without understanding it.
Summary
Discovery turns a DITA package from an opaque blob into a described system.
It:
- narrows uncertainty
- surfaces assumptions
- creates defensible boundaries
Everything that follows is only as reliable as Discovery.
If Discovery is wrong, the processor must stop.
That is not caution.
That is integrity.