Discovery
The Discovery module performs a read-only structural scan of a DITA package. Its job is to observe what exists, not to interpret intent or make decisions.
Discovery identifies artifacts, extracts relationships, and produces a graph-based representation of the package. It does not normalize structure, enforce rules, or infer transformations. All findings are recorded with evidence and confidence where applicable.
The output of discovery is a durable, schema-validated artifact that represents the structural truth of the package at a point in time. All downstream stages depend on this output and must treat it as authoritative.
dita_package_processor.discovery.classifiers
Discovery-time classifiers for DITA packages.
This module adapts declarative pattern evaluation into concrete classification outcomes used during discovery.
It performs no transformation and no inference beyond deterministic resolution of emitted pattern evidence.
Contract (Iteration 7 – locked):
- Evidence means something was observed.
- No evidence means nothing was observed.
- Fallback evidence is not evidence and must never appear in output.
- If classification is None:
  - confidence must be None
  - evidence must be []
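The contract above can be expressed as a small check over a serialized classification record. This is only an illustrative sketch: the function name `violates_contract` and the dict keys `classification`, `confidence`, and `evidence` are assumptions based on the wording of the contract, not the module's documented API.

```python
from typing import Any, Dict


def violates_contract(record: Dict[str, Any]) -> bool:
    """Return True if a classification record breaks the contract (sketch).

    Assumption: the record is a plain dict with the keys
    "classification", "confidence", and "evidence".
    """
    if record.get("classification") is None:
        # No classification: confidence must be None and evidence must be [].
        return record.get("confidence") is not None or record.get("evidence") != []
    return False


unclassified = {"classification": None, "confidence": None, "evidence": []}
leaky = {"classification": None, "confidence": 0.4, "evidence": []}
```

Here `unclassified` satisfies the contract, while `leaky` reports a confidence without a classification and would be rejected.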
dita_package_processor.discovery.graph
Dependency graph data structures for discovery.
This module defines the read-only graph model derived from discovery output.
Discovery is authoritative. The graph is a computed structure.
Schema contract:
- discovery.relationships use: from / to / type / pattern_id
- graph edges use: source / target / type / pattern_id
This module:
- consumes discovery relationships
- emits a stable graph contract
- never invents structure
DependencyEdge
dataclass
Directed relationship between two artifacts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str | Source artifact path. | required |
| target | str | Target artifact path. | required |
| edge_type | str | Relationship type (topicref, image, xref, etc). | required |
| pattern_id | str | Discovery pattern identifier. | required |
from_dict(data)
classmethod
Build edge from graph serialization.
Expected keys:
- source
- target
- type
- pattern_id
from_relationship(data)
classmethod
Build edge from discovery.relationship entry.
Expected keys:
- from
- to
- type
- pattern_id
to_dict()
Serialize edge to graph contract.
Output: { "source": "...", "target": "...", "type": "...", "pattern_id": "..." }
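The key mapping between the two contracts can be sketched as follows. `EdgeSketch` is a hypothetical stand-in for DependencyEdge, written only from the field names and expected keys listed above; the real class may differ in details such as validation. The paths and the pattern identifier are made-up sample values.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class EdgeSketch:
    """Hypothetical stand-in for DependencyEdge (illustration only)."""

    source: str
    target: str
    edge_type: str
    pattern_id: str

    @classmethod
    def from_relationship(cls, data: Dict[str, Any]) -> "EdgeSketch":
        # Discovery relationships use from / to / type / pattern_id.
        return cls(data["from"], data["to"], data["type"], data["pattern_id"])

    def to_dict(self) -> Dict[str, str]:
        # Graph edges use source / target / type / pattern_id.
        return {
            "source": self.source,
            "target": self.target,
            "type": self.edge_type,
            "pattern_id": self.pattern_id,
        }


# Sample values below are invented for the example.
edge = EdgeSketch.from_relationship(
    {"from": "main.ditamap", "to": "topics/a.dita", "type": "topicref", "pattern_id": "example-topicref"}
)
serialized = edge.to_dict()
```

Note that the Python attribute is `edge_type` while the serialized key is `type`, matching the parameter table and the output contract above.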
DependencyGraph
dataclass
Derived dependency graph.
Nodes are artifact paths. Edges are DependencyEdge instances.
from_dict(data)
classmethod
Deserialize from graph serialization, not discovery JSON.
from_discovery(*, artifacts, relationships)
classmethod
Build a graph from discovery JSON.
Discovery is authoritative. Graph must not contain unknown nodes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| artifacts | Iterable[Dict[str, Any]] | discovery["artifacts"] | required |
| relationships | Iterable[Dict[str, Any]] | discovery["relationships"] | required |
incoming(node)
Edges that point to node.
outgoing(node)
Edges that originate from node.
to_dict()
Serialize graph.
{ "nodes": [...], "edges": [...] }
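A minimal sketch of how a graph in this shape could be derived from discovery data is shown below. It is not the package's implementation: `build_graph`, `incoming`, and `outgoing` are free functions written for illustration, the artifact key `path` is an assumption, and raising ValueError for a relationship that references an unknown artifact is one possible way to honor the rule that the graph must not contain unknown nodes.

```python
from typing import Any, Dict, Iterable, List


def build_graph(
    *, artifacts: Iterable[Dict[str, Any]], relationships: Iterable[Dict[str, Any]]
) -> Dict[str, List[Any]]:
    """Build {"nodes": [...], "edges": [...]} from discovery data (sketch)."""
    # Assumption: each artifact dict stores its path under the key "path".
    nodes = sorted({artifact["path"] for artifact in artifacts})
    known = set(nodes)
    edges: List[Dict[str, str]] = []
    for rel in relationships:
        # Assumption: dangling relationships are rejected rather than skipped.
        if rel["from"] not in known or rel["to"] not in known:
            raise ValueError(f"relationship references unknown artifact: {rel}")
        edges.append(
            {
                "source": rel["from"],
                "target": rel["to"],
                "type": rel["type"],
                "pattern_id": rel["pattern_id"],
            }
        )
    return {"nodes": nodes, "edges": edges}


def outgoing(graph: Dict[str, List[Any]], node: str) -> List[Dict[str, str]]:
    """Edges that originate from node."""
    return [edge for edge in graph["edges"] if edge["source"] == node]


def incoming(graph: Dict[str, List[Any]], node: str) -> List[Dict[str, str]]:
    """Edges that point to node."""
    return [edge for edge in graph["edges"] if edge["target"] == node]


# Invented sample data for the example.
discovery = {
    "artifacts": [{"path": "main.ditamap"}, {"path": "topics/a.dita"}],
    "relationships": [
        {"from": "main.ditamap", "to": "topics/a.dita", "type": "topicref", "pattern_id": "example-topicref"}
    ],
}
graph = build_graph(artifacts=discovery["artifacts"], relationships=discovery["relationships"])
```

Sorting the node list keeps the serialized output stable across runs, which fits the "stable graph contract" goal stated for this module.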
dita_package_processor.discovery.models
Discovery data models.
These models represent strictly observational records of what was found during DITA package discovery.
Design Principles
- No inference
- No transformation
- No mutation of semantic meaning
- Deterministic structure
- Explicit invariants
Discovery records facts. Planning interprets them.
DiscoveryArtifact
dataclass
Observational record of a discovered filesystem artifact.
Rules
- Media artifacts are structural only.
- classification requires confidence.
- evidence requires classification.
- No semantic inference is performed here.
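The two dependency rules above (classification requires confidence, evidence requires classification) could be enforced at construction time. The following is a sketch under assumptions: `ArtifactSketch` and its field names are hypothetical, and the actual DiscoveryArtifact fields are not listed in this reference.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ArtifactSketch:
    """Hypothetical stand-in for DiscoveryArtifact (field names are assumed)."""

    path: str
    artifact_type: str
    classification: Optional[str] = None
    confidence: Optional[float] = None
    evidence: List[Dict[str, Any]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Rule: classification requires confidence.
        if self.classification is not None and self.confidence is None:
            raise ValueError("classification requires confidence")
        # Rule: evidence requires classification.
        if self.evidence and self.classification is None:
            raise ValueError("evidence requires classification")


# An unclassified artifact with no evidence is valid.
plain = ArtifactSketch(path="topics/a.dita", artifact_type="topic")
```

Checking the invariants in `__post_init__` means an invalid record cannot be constructed through the normal constructor, which suits a model meant to hold only observed facts.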
DiscoveryInventory
dataclass
DiscoverySummary
dataclass
High-level summary of discovery.
dita_package_processor.discovery.path_normalizer
Path normalization utilities for discovery.
This module normalizes relative and absolute references found in DITA documents into deterministic, package-root–relative paths.
It ensures that the dependency graph is stable and free from duplicate edges caused by path variation such as:
- ./topics/a.dita
- ../topics/a.dita
- topics/../topics/a.dita
All of these must resolve to the same canonical path string.
Single responsibility:
Input:
- source file path
- raw reference string
- package root
Output:
- normalized package-root–relative POSIX path string
This module performs:
- no filesystem mutation
- no semantic validation
- no file existence checks
normalize_reference_path(*, source_path, reference, package_root)
Normalize a referenced path relative to the source file and package root.
The returned value is always:
- relative to the package root
- using POSIX-style separators
- fully normalized (no .. or . segments)
- stable across platforms
Absolute references are interpreted as package-root–anchored paths.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_path | Path | Path of the file containing the reference. | required |
| reference | str | Raw reference value (e.g. from | required |
| package_root | Path | Root directory of the DITA package. | required |
Returns:
| Type | Description |
|---|---|
| str | Normalized, package-root–relative path using POSIX separators. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the normalized path escapes the package root. |
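The behavior described above can be approximated with the standard library's `posixpath` module. The sketch below is not the package's implementation: `normalize_reference_sketch` is a hypothetical name, it assumes `source_path` is either package-relative or located under `package_root`, and it does not handle URL fragments or query strings.

```python
import posixpath
from pathlib import Path


def normalize_reference_sketch(*, source_path: Path, reference: str, package_root: Path) -> str:
    """Resolve a reference to a package-root-relative POSIX path (sketch)."""
    ref = reference.replace("\\", "/")
    if ref.startswith("/"):
        # Absolute references are treated as anchored at the package root.
        combined = ref.lstrip("/")
    else:
        src = source_path
        if src.is_absolute():
            # Assumption: absolute source paths live under package_root.
            src = src.relative_to(package_root)
        # Relative references resolve against the source file's directory.
        combined = posixpath.join(src.parent.as_posix(), ref)
    # normpath collapses "." and ".." segments without touching the filesystem.
    normalized = posixpath.normpath(combined)
    if normalized == ".." or normalized.startswith("../"):
        raise ValueError(f"reference escapes package root: {reference!r}")
    return normalized


root = Path("pkg")  # hypothetical package root
variants = [
    normalize_reference_sketch(
        source_path=Path("topics/b.dita"), reference="./a.dita", package_root=root
    ),
    normalize_reference_sketch(
        source_path=Path("maps/main.ditamap"), reference="../topics/a.dita", package_root=root
    ),
    normalize_reference_sketch(
        source_path=Path("main.ditamap"), reference="topics/../topics/a.dita", package_root=root
    ),
]
```

All three entries in `variants` resolve to the same string, `topics/a.dita`, so edges built from them would not be duplicated. Because the function is purely lexical, it never checks whether the target file exists, consistent with the module's stated scope.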
dita_package_processor.discovery.patterns
Pattern evaluation for DITA discovery.
This module defines the declarative pattern model and the evaluation engine that converts observed discovery signals into evidence.
It performs:
- No resolution
- No ranking
- No inference
- No mutation
It only answers:
“Given this artifact and these signals, what evidence exists?”
Fallback patterns are classification helpers only. They fire only when explicitly enabled.
Evidence
dataclass
Evidence emitted when a pattern matches an artifact.
Pattern
dataclass
Declarative structural pattern.
Parameters:
| Name | Type | Description |
|---|---|---|
| id | str | Unique pattern identifier. |
| applies_to | str | Artifact type this pattern applies to. |
| signals | Dict[str, Any] | Signal requirements. |
| asserts | Dict[str, Any] | Assertion payload. Must contain role and confidence. |
| rationale | List[str] | Human-readable reasoning. |
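To make the pattern model concrete, here is a sketch of declarative evaluation. It rests on several assumptions: `PatternSketch` and `evaluate` are hypothetical names, a pattern is taken to match when every required signal equals the observed value, and the shape of the returned evidence dict is invented for illustration. The real matching semantics and the Evidence fields are not specified in this reference.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class PatternSketch:
    """Hypothetical stand-in for Pattern, using the documented field names."""

    id: str
    applies_to: str
    signals: Dict[str, Any]
    asserts: Dict[str, Any]
    rationale: List[str] = field(default_factory=list)


def evaluate(
    pattern: PatternSketch, artifact_type: str, observed: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Return an evidence dict if the pattern matches, else None (sketch)."""
    if pattern.applies_to != artifact_type:
        return None
    # Assumption: every required signal must equal the observed value.
    for name, required in pattern.signals.items():
        if observed.get(name) != required:
            return None
    # Assumption: evidence carries the pattern id plus the asserted payload.
    return {
        "pattern_id": pattern.id,
        "role": pattern.asserts["role"],
        "confidence": pattern.asserts["confidence"],
        "rationale": list(pattern.rationale),
    }


# Invented sample pattern and signals.
pattern = PatternSketch(
    id="example-map-with-maprefs",
    applies_to="map",
    signals={"has_maprefs": True},
    asserts={"role": "MAIN", "confidence": 0.9},
    rationale=["Map contains mapref elements."],
)
evidence = evaluate(pattern, "map", {"has_maprefs": True, "has_title": True})
```

The function only reports whether evidence exists. Choosing between competing evidence is left to the classifier layer, in line with the "no resolution, no ranking" statement above.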
dita_package_processor.discovery.relationships
Relationship extraction for DITA discovery.
This module extracts explicit, syntactic relationships between already-discovered artifacts by parsing DITA XML files.
It does NOT:
- classify artifacts
- mutate files
- infer semantic intent
It ONLY records factual dependencies expressed in XML:
- map → topic via
- map → map via
- topic → media via
,
All emitted relationships conform strictly to the discovery schema:
{
"source": "
RelationshipExtractor
Extracts structural relationships between DITA artifacts by parsing XML.
This extractor is purely syntactic and observational. It assumes:
- Artifacts have already been discovered
- Paths are package-relative
- Files are valid XML
__init__(package_root)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| package_root | Path | Root directory of the DITA package. | required |
extract(artifacts)
Extract relationships from a set of discovered artifacts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| artifacts | List[Dict[str, str]] | List of discovery artifact dictionaries. | required |
Returns:
| Type | Description |
|---|---|
| List[Dict[str, str]] | List of relationship dictionaries conforming to schema. |
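As a rough illustration of purely syntactic extraction, the sketch below collects topicref targets from a map using the standard library's ElementTree. It is not the RelationshipExtractor implementation: `extract_topicrefs` is a hypothetical helper, it parses an in-memory string instead of a file, the output keys follow the from / to / type / pattern_id form stated in the graph module's schema contract, and the pattern identifier is a placeholder.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample map for the example.
MAP_XML = """<map>
  <title>Example</title>
  <topicref href="topics/a.dita"/>
  <topicref href="topics/b.dita">
    <topicref href="topics/c.dita"/>
  </topicref>
</map>"""


def extract_topicrefs(map_path: str, xml_text: str) -> List[Dict[str, str]]:
    """Record one relationship per topicref that carries an href (sketch)."""
    root = ET.fromstring(xml_text)
    relationships: List[Dict[str, str]] = []
    # iter() walks the tree in document order, including nested topicrefs.
    for element in root.iter("topicref"):
        href = element.get("href")
        if href:
            relationships.append(
                {
                    "from": map_path,
                    "to": href,
                    "type": "topicref",
                    "pattern_id": "example-topicref",  # placeholder value
                }
            )
    return relationships


relationships = extract_topicrefs("main.ditamap", MAP_XML)
```

In this example the map sits at the package root, so the href values are already package-relative. For maps in subdirectories, the raw href would first need to be normalized against the source file's location.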
dita_package_processor.discovery.report
Reporting utilities for DITA package discovery.
This module converts a fully populated DiscoveryInventory into a stable, JSON-serializable discovery contract defined by discovery.schema.json.
It performs no classification, parsing, or filesystem access. It only serializes already-discovered data into a schema-valid structure.
DiscoveryReport
dataclass
Materialized discovery report summarizing a DiscoveryInventory.
This is a pure reporting layer that emits a schema-locked discovery contract.
summary()
Return a simple artifact type histogram.
Contract:
{
"map":
to_dict()
Serialize the discovery report into a schema-valid structure:
{ "artifacts": [...], "relationships": [...], "summary": {...} }
Note:
- Graph internals (source/target) are normalized into discovery contract fields (from/to) here.
- The graph itself is not exposed.
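The two reporting steps described above, a type histogram and the conversion of graph edges back to discovery relationship fields, might look like the following. This is a sketch under assumptions: `type_histogram` and `edges_to_relationships` are hypothetical helpers, and the artifact key `artifact_type` is assumed, since the artifact schema is not reproduced in this reference.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def type_histogram(artifacts: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count artifacts per type (sketch)."""
    # Assumption: each artifact dict stores its type under "artifact_type".
    return dict(Counter(artifact["artifact_type"] for artifact in artifacts))


def edges_to_relationships(edges: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map graph fields (source/target) to discovery fields (from/to)."""
    return [
        {
            "from": edge["source"],
            "to": edge["target"],
            "type": edge["type"],
            "pattern_id": edge["pattern_id"],
        }
        for edge in edges
    ]


# Invented sample data for the example.
artifacts = [
    {"path": "main.ditamap", "artifact_type": "map"},
    {"path": "topics/a.dita", "artifact_type": "topic"},
    {"path": "topics/b.dita", "artifact_type": "topic"},
]
edges = [
    {"source": "main.ditamap", "target": "topics/a.dita", "type": "topicref", "pattern_id": "example-topicref"}
]
report = {
    "artifacts": artifacts,
    "relationships": edges_to_relationships(edges),
    "summary": type_histogram(artifacts),
}
```

The resulting `report` has the three top-level keys named in the contract, and its relationship entries carry no graph-specific field names.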
dita_package_processor.discovery.scanner
Filesystem and XML scanner for DITA package discovery.
This module performs strictly read-only inspection of a DITA package directory and produces a DiscoveryInventory.
Responsibilities
- Identify maps, topics, and media artifacts
- Extract shallow metadata
- Perform classification via classifier modules
- Extract relationships
- Build dependency graph
- Annotate structural weight (node_count)
- Resolve a single deterministic MAIN map
This module does NOT:
- Infer intent beyond structural evidence
- Modify files
- Perform planning
DiscoveryScanner
Scan a DITA package directory and produce a DiscoveryInventory.
Guarantees:
- Exactly one MAIN map is selected deterministically.
- node_count metadata is annotated for all maps.
dita_package_processor.discovery.signatures
Structural signature extraction for DITA discovery.
This module defines signatures: normalized, comparable observations derived from DITA XML structure. Signatures are pure data summaries and contain no classification logic.
Signatures exist to separate:
- observation (what is present)
- interpretation (what it means)
They are stable, testable, and safe to evolve independently.
MapSignature
dataclass
Structural signature extracted from a DITA map.
This signature captures what the map contains, not what it represents or how it should be classified.
TopicSignature
dataclass
Structural signature extracted from a DITA topic.
extract_map_signature(map_path)
Extract a structural signature from a DITA map.
Extraction failures are non-fatal and result in partial signatures.
extract_topic_signature(topic_path)
Extract a structural signature from a DITA topic.
Extraction failures are non-fatal and result in partial signatures.
has_maprefs(root)
Return True if the XML element contains any <mapref> elements.
has_title(root)
Return True if the XML element contains a <title> element.
has_topicrefs(root)
Return True if the XML element contains any <topicref> elements.
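Predicates of this kind can be written in a few lines with ElementTree. The sketch below uses a single hypothetical helper, `has_descendant`, and assumes that "contains" means any descendant at any depth and that the elements are not namespaced. The actual functions may search differently.

```python
import xml.etree.ElementTree as ET


def has_descendant(root: ET.Element, tag: str) -> bool:
    """Return True if any descendant of root has the given tag (sketch)."""
    # ".//tag" searches all levels beneath root, not root itself.
    return root.find(f".//{tag}") is not None


# Invented sample document.
root = ET.fromstring(
    '<map><title>Guide</title><topicref href="topics/a.dita"/></map>'
)
found_topicrefs = has_descendant(root, "topicref")
found_maprefs = has_descendant(root, "mapref")
found_title = has_descendant(root, "title")
```

For this sample, `found_topicrefs` and `found_title` are True and `found_maprefs` is False. Boolean signals like these are the kind of observation a signature can record without deciding what the map represents.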