Discovery Phase¶

The Discovery phase is a non-destructive, read-only analysis pass over a DITA package.

Its sole purpose is to observe, classify, and report what exists in the package before any transformation occurs.

Discovery does not: - modify files - rename artifacts - infer intent heuristically - “fix” malformed content

Discovery exists to prevent the processor from lying to itself.

Why Discovery Exists¶

Bulk-generated DITA packages are not uniform. They vary across:

authoring systems
export pipelines
organizational conventions
historical drift
partial migrations

Earlier versions of the processor implicitly assumed: - a single main map - predictable map roles - consistent filename conventions

Those assumptions held for tests, not reality.

Discovery formalizes the gap between what we expect and what we observe.

Discovery Responsibilities¶

Discovery performs the following responsibilities in strict order:

Scan the filesystem
Parse XML safely
Identify artifacts
Classify artifacts
Validate invariants
Produce a report

Each responsibility is isolated and auditable.

Artifact Types¶

Discovery recognizes two primary artifact categories:

Category	Description
Map	`.ditamap` files
Topic	`.dita` topic files

Classification is descriptive, not normative.

Map Classification¶

Each discovered map is assigned exactly one MapType.

Defined in:
dita_package_processor.knowledge.map_types

Supported Map Types¶

MapType	Meaning
MAIN	Primary entry-point map
ABSTRACT	Overview / abstract content
CONTENT	Regular content map
GLOSSARY	Definition or glossary map
CONTAINER	Structural wrapper map
UNKNOWN	Unclassifiable

Classification is determined by: - structural signatures - referenced content - known patterns - filename signals (last resort)

Map Classification Flow¶

flowchart TD
    A[DITA Map] --> B{Matches Known Pattern?}
    B -->|Yes| C[Assign MapType]
    B -->|No| D{Structural Signals?}
    D -->|Yes| C
    D -->|No| E[UNKNOWN]

Important:
UNKNOWN is a valid classification. It is not an error.

Topic Classification¶

Topics are classified separately from maps.

Supported Topic Types¶

TopicType	Meaning
CONTENT	Concept, task, reference
GLOSSARY	`<glossentry>`
UNKNOWN	Unclear or malformed

Topic classification is currently shallow and intentionally conservative.

Structural Signatures¶

Discovery relies on signatures, not guesses.

Signatures are structural markers such as: - root element names - presence of <glossentry> - mapref-only maps - empty wrapper maps

Signatures are defined in:

dita_package_processor/discovery/signatures.py

Signatures may evolve, but they are always: - explicit - testable - documented

Known Patterns¶

Discovery also consults known patterns expressed in YAML.

Defined in:

dita_package_processor/knowledge/known_patterns.yaml

Patterns encode: - historical structures - vendor-specific exports - known glossary layouts - migration artifacts

Patterns are data, not code.

This allows: - incremental expansion - corpus-specific tuning - safe experimentation

Invariants¶

After classification, Discovery validates invariants.

Invariants are structural truths that must hold for transformation to proceed.

Examples: - Exactly one MAIN map - At most one ABSTRACT map - At most one GLOSSARY map

Defined in:

dita_package_processor/knowledge/invariants.py

Violation of invariants: - does not crash Discovery - does block transformation - produces explicit diagnostics

Invariant Validation Flow¶

sequenceDiagram
    participant Scanner
    participant Classifier
    participant Validator
    participant Report

    Scanner->>Classifier: Artifacts
    Classifier->>Validator: Classified Maps
    Validator->>Report: Violations (if any)

Discovery Report¶

Discovery produces a structured report capturing:

discovered maps and topics
assigned classifications
violated invariants
unresolved ambiguities

Reports are designed to be: - human-readable - machine-readable - diffable - archived

Defined in:

dita_package_processor/discovery/report.py

Discovery reports are artifacts, not logs.

What Discovery Does Not Do¶

Discovery explicitly avoids:

guessing user intent
renaming files
mutating XML
repairing broken content
deciding transformation order

Discovery answers one question only:

“What is actually here?”

Relationship to the Transformation Pipeline¶

Discovery runs before any pipeline steps.

Transformation steps must: - consume Discovery output - obey classifications - respect invariant failures

No step is allowed to “reinterpret” Discovery findings.

Failure Modes (By Design)¶

Discovery may report: - UNKNOWN maps - ambiguous classifications - invariant violations

This is not failure.

Failure is transforming content without understanding it.

Summary¶

Discovery turns a DITA package from an opaque blob into a described system.

It: - narrows uncertainty - surfaces assumptions - creates defensible boundaries

Everything that follows is only as reliable as Discovery.

If Discovery is wrong, the processor must stop.

That is not caution.
That is integrity.