Discovery

The Discovery module performs a read-only structural scan of a DITA package. Its job is to observe what exists, not to interpret intent or make decisions.

Discovery identifies artifacts, extracts relationships, and produces a graph-based representation of the package. It does not normalize structure, enforce rules, or infer transformations. All findings are recorded with evidence and confidence where applicable.

The output of discovery is a durable, schema-validated artifact that represents the structural truth of the package at a point in time. All downstream stages depend on this output and must treat it as authoritative.

dita_package_processor.discovery.classifiers

Discovery-time classifiers for DITA packages.

This module adapts declarative pattern evaluation into concrete classification outcomes used during discovery.

It performs no transformation and no inference beyond deterministic resolution of emitted pattern evidence.

Contract (Iteration 7 – locked):

  • Evidence means something was observed.
  • No evidence means nothing was observed.
  • Fallback evidence is not evidence and must never appear in output.
  • If classification is None:
      - confidence must be None
      - evidence must be []
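
The contract can be illustrated with a minimal sketch of what "deterministic resolution" might look like. The function name `resolve_sketch`, the evidence keys (`pattern_id`, `role`, `confidence`) and the resolution rule (highest confidence, ties broken by pattern id) are assumptions made for illustration; they are not taken from the module.

```python
from typing import Any, Dict, List, Optional, Tuple

Evidence = Dict[str, Any]


def resolve_sketch(
    evidence: List[Evidence],
) -> Tuple[Optional[str], Optional[float], List[Evidence]]:
    """Resolve emitted evidence into (classification, confidence, evidence).

    Hypothetical rule: pick the highest-confidence item and break ties by
    pattern_id so the result never depends on input order.
    """
    if not evidence:
        # Nothing observed: classification None, confidence None, evidence [].
        return None, None, []
    best = min(evidence, key=lambda e: (-e["confidence"], e["pattern_id"]))
    return best["role"], best["confidence"], list(evidence)
```

With no evidence the sketch returns `(None, None, [])`, which is the shape the last contract rule requires.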

classify_map(*, path, metadata)

Classify a DITA map using declarative pattern evaluation.

Map classifications are returned as MapType enum values.

classify_topic(*, path, metadata)

Classify a DITA topic using declarative pattern evaluation.

Topics are classified using TopicType.

dita_package_processor.discovery.graph

Dependency graph data structures for discovery.

This module defines the read-only graph model derived from discovery output.

Discovery is authoritative. The graph is a computed structure.

Schema contract:
- discovery.relationships use: from / to / type / pattern_id
- graph edges use: source / target / type / pattern_id

This module:
- consumes discovery relationships
- emits a stable graph contract
- never invents structure

DependencyEdge dataclass

Directed relationship between two artifacts.

Parameters:

Name        Type  Description                                      Default
source      str   Source artifact path.                            required
target      str   Target artifact path.                            required
edge_type   str   Relationship type (topicref, image, xref, etc).  required
pattern_id  str   Discovery pattern identifier.                    required

from_dict(data) classmethod

Build edge from graph serialization.

Expected keys:
- source
- target
- type
- pattern_id

from_relationship(data) classmethod

Build edge from discovery.relationship entry.

Expected keys:
- from
- to
- type
- pattern_id

to_dict()

Serialize edge to graph contract.

Output: { "source": "...", "target": "...", "type": "...", "pattern_id": "..." }
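
The two key sets are easy to confuse, so here is a minimal stand-in that shows the mapping. `EdgeSketch` is a hypothetical class written for this example, not the real `DependencyEdge`; only the field names and the from/to versus source/target key sets come from the text above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class EdgeSketch:
    """Illustrative stand-in for a directed edge between two artifacts."""

    source: str
    target: str
    edge_type: str
    pattern_id: str

    @classmethod
    def from_relationship(cls, data: Dict[str, Any]) -> "EdgeSketch":
        # Discovery relationships use from / to.
        return cls(data["from"], data["to"], data["type"], data["pattern_id"])

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "EdgeSketch":
        # Graph serialization uses source / target.
        return cls(data["source"], data["target"], data["type"], data["pattern_id"])

    def to_dict(self) -> Dict[str, str]:
        return {
            "source": self.source,
            "target": self.target,
            "type": self.edge_type,
            "pattern_id": self.pattern_id,
        }
```

In this sketch `from_dict(edge.to_dict())` round-trips to an equal edge, while `from_relationship` only accepts the discovery key names.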

DependencyGraph dataclass

Derived dependency graph.

Nodes are artifact paths. Edges are DependencyEdge instances.

from_dict(data) classmethod

Deserialize from graph serialization, not discovery JSON.

from_discovery(*, artifacts, relationships) classmethod

Build a graph from discovery JSON.

Discovery is authoritative. Graph must not contain unknown nodes.

Parameters:

Name           Type                      Description                 Default
artifacts      Iterable[Dict[str, Any]]  discovery["artifacts"]      required
relationships  Iterable[Dict[str, Any]]  discovery["relationships"]  required

incoming(node)

Edges that point to node.

outgoing(node)

Edges that originate from node.

to_dict()

Serialize graph.

{ "nodes": [...], "edges": [...] }
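
A rough sketch of how such a graph could be derived from discovery output follows. `GraphSketch` is hypothetical, the artifact key `path` is an assumption, and raising `ValueError` for an unknown node is one possible way to honour "must not contain unknown nodes"; the real class may behave differently.

```python
from typing import Any, Dict, Iterable, List


class GraphSketch:
    """Illustrative derived graph: nodes are paths, edges are plain dicts."""

    def __init__(self, nodes: List[str], edges: List[Dict[str, str]]) -> None:
        self.nodes = nodes
        self.edges = edges

    @classmethod
    def from_discovery(
        cls,
        *,
        artifacts: Iterable[Dict[str, Any]],
        relationships: Iterable[Dict[str, Any]],
    ) -> "GraphSketch":
        nodes = sorted({a["path"] for a in artifacts})  # "path" key is assumed
        known = set(nodes)
        edges = []
        for rel in relationships:
            if rel["from"] not in known or rel["to"] not in known:
                # Assumed policy: refuse edges that would invent a node.
                raise ValueError(f"unknown node in relationship: {rel}")
            edges.append({
                "source": rel["from"],
                "target": rel["to"],
                "type": rel["type"],
                "pattern_id": rel["pattern_id"],
            })
        return cls(nodes, edges)

    def incoming(self, node: str) -> List[Dict[str, str]]:
        return [e for e in self.edges if e["target"] == node]

    def outgoing(self, node: str) -> List[Dict[str, str]]:
        return [e for e in self.edges if e["source"] == node]

    def to_dict(self) -> Dict[str, Any]:
        return {"nodes": list(self.nodes), "edges": [dict(e) for e in self.edges]}
```

Sorting the node list keeps the serialized form stable regardless of the order in which artifacts were discovered.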

dita_package_processor.discovery.models

Discovery data models.

These models represent strictly observational records of what was found during DITA package discovery.

Design Principles
  • No inference
  • No transformation
  • No mutation of semantic meaning
  • Deterministic structure
  • Explicit invariants

Discovery records facts. Planning interprets them.

DiscoveryArtifact dataclass

Observational record of a discovered filesystem artifact.

Rules
  • Media artifacts are structural only.
  • classification requires confidence.
  • evidence requires classification.
  • No semantic inference is performed here.
classification_label()

Return normalized string label for classification.

Returns

Optional[str]

to_dict()

Serialize artifact for contract transfer.
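
The DiscoveryArtifact rules could be enforced at construction time roughly as below. `ArtifactSketch` and its field names are hypothetical, and reading "media artifacts are structural only" as "media carries no classification" is an interpretation, not something the source states.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ArtifactSketch:
    """Illustrative observational record with explicit invariants."""

    path: str
    artifact_type: str  # assumed values: "map", "topic", "media"
    classification: Optional[str] = None
    confidence: Optional[float] = None
    evidence: List[Dict[str, Any]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed reading of "structural only": media is never classified.
        if self.artifact_type == "media" and self.classification is not None:
            raise ValueError("media artifacts are structural only")
        if self.classification is not None and self.confidence is None:
            raise ValueError("classification requires confidence")
        if self.evidence and self.classification is None:
            raise ValueError("evidence requires classification")
```

Checking in `__post_init__` means an invalid record cannot exist at all, so later stages do not need to re-validate it.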

DiscoveryInventory dataclass

Aggregate container of discovered artifacts.

Mutable during discovery.

add_artifact(artifact=None, **kwargs)

Add a discovered artifact.

resolve_main_map()

Resolve the single MAIN map.

Returns

Path

Raises

ValueError
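
As a sketch only, resolution of the single MAIN map might look like this. The dictionary keys, the `"MAIN"` label string and the choice to raise when the count is anything other than one are assumptions; the source states only that a `Path` is returned and that `ValueError` can be raised.

```python
from pathlib import Path
from typing import Any, Dict, Iterable


def resolve_main_map_sketch(artifacts: Iterable[Dict[str, Any]]) -> Path:
    """Return the path of the only MAIN map, or raise ValueError."""
    mains = [
        a for a in artifacts
        if a.get("artifact_type") == "map" and a.get("classification") == "MAIN"
    ]
    if len(mains) != 1:
        # Assumed failure condition: zero or several candidates.
        raise ValueError(f"expected exactly one MAIN map, found {len(mains)}")
    return Path(mains[0]["path"])
```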

DiscoveryResult dataclass

Immutable result of discovery.

main_map()

Return resolved MAIN map.

DiscoverySummary dataclass

High-level summary of discovery.

dita_package_processor.discovery.path_normalizer

Path normalization utilities for discovery.

This module normalizes relative and absolute references found in DITA documents into deterministic, package-root–relative paths.

It ensures that the dependency graph is stable and free from duplicate edges caused by path variation such as:

- ./topics/a.dita
- ../topics/a.dita
- topics/../topics/a.dita

Every spelling that points at the same file must resolve to the same canonical path string.

Single responsibility:

Input:
- source file path
- raw reference string
- package root

Output:
- normalized package-root–relative POSIX path string

This module performs:
- no filesystem mutation
- no semantic validation
- no file existence checks

normalize_reference_path(*, source_path, reference, package_root)

Normalize a referenced path relative to the source file and package root.

The returned value is always:
- relative to the package root
- using POSIX-style separators
- fully normalized (no .. or . segments)
- stable across platforms

Absolute references are interpreted as package-root–anchored paths.

Parameters:

Name          Type  Description                                    Default
source_path   Path  Path of the file containing the reference.     required
reference     str   Raw reference value (e.g. from href or data).  required
package_root  Path  Root directory of the DITA package.            required

Returns:

Type  Description
str   Normalized, package-root–relative path using POSIX separators.

Raises:

Type        Description
ValueError  If the normalized path escapes the package root.
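
The behaviour described above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the module's implementation: the name `normalize_reference_sketch` is hypothetical, `source_path` is assumed to lie inside `package_root`, and URL fragments or backslash separators in the reference are not handled.

```python
import posixpath
from pathlib import Path


def normalize_reference_sketch(
    *, source_path: Path, reference: str, package_root: Path
) -> str:
    """Return a package-root-relative POSIX path for a raw reference."""
    if reference.startswith("/"):
        # Absolute references are treated as anchored at the package root.
        joined = reference.lstrip("/")
    else:
        # Resolve relative to the directory of the referencing file.
        source_dir = source_path.relative_to(package_root).parent.as_posix()
        joined = posixpath.join(source_dir, reference)

    # Purely lexical cleanup of "." and ".." segments; no filesystem access.
    normalized = posixpath.normpath(joined)

    if normalized == ".." or normalized.startswith("../"):
        raise ValueError(f"reference escapes package root: {reference!r}")
    return normalized
```

Because `posixpath.normpath` works on strings only, the sketch never touches the filesystem, which matches the "no file existence checks" constraint.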

dita_package_processor.discovery.patterns

Pattern evaluation for DITA discovery.

This module defines the declarative pattern model and the evaluation engine that converts observed discovery signals into evidence.

It performs:

  • No resolution
  • No ranking
  • No inference
  • No mutation

It only answers:

“Given this artifact and these signals, what evidence exists?”

Fallback patterns are classification helpers only. They fire only when explicitly enabled.

Evidence dataclass

Evidence emitted when a pattern matches an artifact.

Pattern dataclass

Declarative structural pattern.

Parameters

id : str
    Unique pattern identifier.
applies_to : str
    Artifact type this pattern applies to.
signals : Dict[str, Any]
    Signal requirements.
asserts : Dict[str, Any]
    Assertion payload. Must contain:
    - role
    - confidence
rationale : List[str]
    Human-readable reasoning.

PatternEvaluator

Evaluate patterns against a single discovery artifact.

Modes

Observation mode (default)
- fallback patterns ignored

Classification mode (allow_fallback=True)
- fallback patterns fire only if no semantic match

evaluate(artifact, *, allow_fallback=False)

Evaluate all patterns against an artifact.

Parameters

artifact : DiscoveryArtifact
    Artifact to evaluate.
allow_fallback : bool
    Enable fallback emission if no semantic match.

Returns

List[Evidence]
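
The two modes can be sketched as follows. Everything here is illustrative: `PatternSketch`, its `fallback` flag, the flattened `role` and `confidence` fields, and matching by exact signal equality are assumptions, since the source does not say how fallback patterns are marked or how signals are compared.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class PatternSketch:
    id: str
    signals: Dict[str, Any]
    role: str
    confidence: float
    fallback: bool = False  # hypothetical marker for fallback patterns


def evaluate_sketch(
    patterns: List[PatternSketch],
    observed: Dict[str, Any],
    *,
    allow_fallback: bool = False,
) -> List[Dict[str, Any]]:
    """Return evidence for every pattern whose signals all match."""

    def matches(p: PatternSketch) -> bool:
        return all(observed.get(k) == v for k, v in p.signals.items())

    def emit(p: PatternSketch) -> Dict[str, Any]:
        return {"pattern_id": p.id, "role": p.role, "confidence": p.confidence}

    semantic = [emit(p) for p in patterns if not p.fallback and matches(p)]
    if semantic or not allow_fallback:
        # Observation mode, or a semantic match exists: fallbacks stay silent.
        return semantic
    return [emit(p) for p in patterns if p.fallback and matches(p)]
```

Note that the sketch does no ranking: every matching pattern emits evidence, and choosing among them is left to the caller.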

dita_package_processor.discovery.relationships

Relationship extraction for DITA discovery.

This module extracts explicit, syntactic relationships between already-discovered artifacts by parsing DITA XML files.

It does NOT:
- classify artifacts
- mutate files
- infer semantic intent

It ONLY records factual dependencies expressed in XML:

  • map → topic via
  • map → map via
  • topic → media via , ,

    All emitted relationships conform strictly to the discovery schema:

    { "source": "", "target": "", "type": "", "pattern_id": "" }

    RelationshipExtractor

    Extracts structural relationships between DITA artifacts by parsing XML.

    This extractor is purely syntactic and observational. It assumes:
    - Artifacts have already been discovered
    - Paths are package-relative
    - Files are valid XML

    __init__(package_root)

    Parameters:

    Name          Type  Description                          Default
    package_root  Path  Root directory of the DITA package.  required

    extract(artifacts)

    Extract relationships from a set of discovered artifacts.

    Parameters:

    Name       Type                  Description                               Default
    artifacts  List[Dict[str, str]]  List of discovery artifact dictionaries.  required

    Returns:

    Type                  Description
    List[Dict[str, str]]  List of relationship dictionaries conforming to schema.
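
To make "purely syntactic" concrete, here is a small sketch that records one relationship per `href` found on a `<topicref>` element. The function name, the `pattern_id` value and the decision to look only at `<topicref>` are assumptions for illustration; the keys follow the relationship shape shown earlier in this section, and the target is left as the raw `href` rather than a normalized path.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_topicrefs_sketch(map_path: str, xml_text: str) -> List[Dict[str, str]]:
    """Record one relationship per <topicref href="..."> in a map."""
    root = ET.fromstring(xml_text)
    relationships = []
    for element in root.iter("topicref"):
        href = element.get("href")
        if not href:
            # Nothing is stated in the XML, so nothing is recorded.
            continue
        relationships.append({
            "source": map_path,
            "target": href,  # raw value; path normalization is a separate step
            "type": "topicref",
            "pattern_id": "sketch.topicref",  # hypothetical identifier
        })
    return relationships
```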

    dita_package_processor.discovery.report

    Reporting utilities for DITA package discovery.

    This module converts a fully populated DiscoveryInventory into a stable, JSON-serializable discovery contract defined by discovery.schema.json.

    It performs no classification, parsing, or filesystem access. It only serializes already-discovered data into a schema-valid structure.

    DiscoveryReport dataclass

    Materialized discovery report summarizing a DiscoveryInventory.

    This is a pure reporting layer that emits a schema-locked discovery contract.

    summary()

    Return a simple artifact type histogram.

    Contract: { "map": , "topic": , "media": , }

    to_dict()

    Serialize the discovery report into a schema-valid structure:

    { "artifacts": [...], "relationships": [...], "summary": {...} }

    Note:
    - Graph internals (source/target) are normalized into discovery contract fields (from/to) here.
    - The graph itself is not exposed.
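
The two reporting steps described here, the type histogram and the source/target to from/to renaming, might be sketched like this. Both function names and the `artifact_type` key are hypothetical, and reporting a zero for a type with no artifacts is an assumption.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summary_sketch(artifacts: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count artifacts per type, always reporting map, topic and media."""
    counts = Counter(a["artifact_type"] for a in artifacts)
    return {kind: counts.get(kind, 0) for kind in ("map", "topic", "media")}


def edge_to_relationship_sketch(edge: Dict[str, str]) -> Dict[str, str]:
    """Rename graph keys (source/target) to discovery keys (from/to)."""
    return {
        "from": edge["source"],
        "to": edge["target"],
        "type": edge["type"],
        "pattern_id": edge["pattern_id"],
    }
```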

    dita_package_processor.discovery.scanner

    Filesystem and XML scanner for DITA package discovery.

    This module performs strictly read-only inspection of a DITA package directory and produces a DiscoveryInventory.

    Responsibilities
    • Identify maps, topics, and media artifacts
    • Extract shallow metadata
    • Perform classification via classifier modules
    • Extract relationships
    • Build dependency graph
    • Annotate structural weight (node_count)
    • Resolve a single deterministic MAIN map

    This module does NOT:
    - Infer intent beyond structural evidence
    - Modify files
    - Perform planning

    DiscoveryScanner

    Scan a DITA package directory and produce a DiscoveryInventory.

    Guarantees:
    - Exactly one MAIN map is selected deterministically.
    - node_count metadata is annotated for all maps.

    __init__(package_dir)

    Initialize scanner.

    Parameters

    package_dir : Path
        Root directory of the DITA package.

    scan()

    Perform full discovery scan.

    Returns

    DiscoveryInventory
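
The first responsibility, identifying maps, topics and media, can be illustrated without touching the filesystem. The sketch below buckets package-relative path strings by suffix; the suffix choices (`.ditamap`, `.dita`, a handful of image types) and the function name are assumptions, and the real scanner may use different signals.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Assumed set of media suffixes, chosen for illustration only.
MEDIA_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def bucket_paths_sketch(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group package-relative paths into map / topic / media buckets."""
    inventory: Dict[str, List[str]] = {"map": [], "topic": [], "media": []}
    for raw in sorted(paths):  # sorted input keeps the output deterministic
        suffix = PurePosixPath(raw).suffix.lower()
        if suffix == ".ditamap":
            inventory["map"].append(raw)
        elif suffix == ".dita":
            inventory["topic"].append(raw)
        elif suffix in MEDIA_SUFFIXES:
            inventory["media"].append(raw)
        # Anything else is ignored rather than guessed at.
    return inventory
```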

    dita_package_processor.discovery.signatures

    Structural signature extraction for DITA discovery.

    This module defines signatures: normalized, comparable observations derived from DITA XML structure. Signatures are pure data summaries and contain no classification logic.

    Signatures exist to separate:

    • observation (what is present)
    • interpretation (what it means)

    They are stable, testable, and safe to evolve independently.

    MapSignature dataclass

    Structural signature extracted from a DITA map.

    This signature captures what the map contains, not what it represents or how it should be classified.

    TopicSignature dataclass

    Structural signature extracted from a DITA topic.

    extract_map_signature(map_path)

    Extract a structural signature from a DITA map.

    Extraction failures are non-fatal and result in partial signatures.

    extract_topic_signature(topic_path)

    Extract a structural signature from a DITA topic.

    Extraction failures are non-fatal and result in partial signatures.

    has_maprefs(root)

    Return True if the XML element contains any <mapref> elements.

    has_title(root)

    Return True if the XML element contains a <title> element.

    has_topicrefs(root)

    Return True if the XML element contains any <topicref> elements.
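
Helpers of this kind are short to write with `xml.etree.ElementTree`. The versions below are sketches with a `_sketch` suffix to mark them as illustrative; searching all descendants (rather than direct children only) and ignoring XML namespaces are assumptions.

```python
import xml.etree.ElementTree as ET


def has_topicrefs_sketch(root: ET.Element) -> bool:
    # Compare with None explicitly: an Element without children is falsy.
    return root.find(".//topicref") is not None


def has_maprefs_sketch(root: ET.Element) -> bool:
    return root.find(".//mapref") is not None


def has_title_sketch(root: ET.Element) -> bool:
    return root.find(".//title") is not None
```

For example, parsing `<map><title>T</title><topicref href="a.dita"/></map>` with `ET.fromstring` yields a root for which the title and topicref helpers return True and the mapref helper returns False.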