Design: DITA Package Processor¶
Overview¶
The DITA Package Processor is a deterministic, batch-oriented transformation engine for bulk-generated DITA 1.3 packages.
It is designed to:
- Normalize inconsistent, machine-generated DITA into a predictable structure
- Execute transformations in a strictly ordered, auditable pipeline
- Support incremental extension without destabilizing existing behavior
- Operate safely on real-world XML, not idealized samples
- Be testable end-to-end using real filesystem fixtures
This system deliberately avoids: - Implicit behavior - Runtime inference - Interactive workflows - Framework-style indirection
If it runs, it runs because the configuration explicitly says so.
Core Design Goals¶
1. Deterministic Execution¶
- All behavior is driven by explicit configuration
- Execution order is fixed and visible
- Each step runs exactly once per invocation
- No conditional branching hidden inside steps
The same input and configuration always produce the same output.
2. Extensibility Without Refactoring¶
- New transformations are added as new pipeline steps
- Existing steps are not modified to accommodate new behavior
- Extensions do not silently alter core semantics
Growth happens by addition, not mutation.
3. Strong Separation of Concerns¶
Each concern lives in exactly one place:
| Concern | Location |
|---|---|
| CLI contract | cli.py |
| Runtime configuration | pyproject.toml |
| Execution orchestration | pipeline.py |
| Shared state | context.py |
| XML manipulation | dita_xml.py |
| Individual transformations | steps/* |
This separation is enforced structurally, not by convention.
4. Testability Over Cleverness¶
- Steps are testable in isolation
- The full pipeline is testable end-to-end
- Tests operate on real XML and real directories
- No heavy mocking of XML trees or filesystem behavior
If a transformation matters, it is validated structurally.
5. DITA-Aware, Not DITA-Fragile¶
The processor uses conservative heuristics that tolerate:
- Imperfect XML
- Inconsistent authoring practices
- Slight schema deviations common in bulk exports
This is intentional. Real DITA packages are messy.
High-Level Architecture¶
CLI
│
▼
Pipeline
│
├── RemoveIndexMapStep
├── RenameMainMapStep
├── ProcessMapsStep
└── RefactorGlossaryStep
│
▼
ProcessingContext
Supporting layers:
dita_xml.py– Safe XML parsing and transformation helpersutils.py– Filename and string utilitiessteps/*– Independent, single-responsibility processing units
Key Design Patterns¶
1. Pipeline Pattern¶
Where: pipeline.Pipeline
The processor uses a classic Pipeline pattern:
- A pipeline is an ordered list of steps
- Each step performs a transformation
- The pipeline controls execution order, logging, and error propagation
for step in self._steps:
logger.info("Running step: %s", step.name)
step.run(context, logger)
Why this matters
- Execution order is explicit and reviewable
- Steps can be added, removed, or reordered safely
- Behavior remains boring, predictable, and auditable
Boring is a feature.
2. Command / Strategy Hybrid (ProcessingStep)¶
Where: steps.base.ProcessingStep
Each step implements a shared interface:
class ProcessingStep(abc.ABC):
name: str
@abc.abstractmethod
def run(self, context, logger) -> None:
...
Each step acts as:
- A Command: “Perform this transformation”
- A Strategy: One interchangeable behavior in the pipeline
Benefits
- Steps are self-contained
- Steps do not call each other
- Steps do not manage execution flow
- Steps can be tested independently
This avoids the “giant script with flags” failure mode.
3. Context Object Pattern¶
Where: context.ProcessingContext
The ProcessingContext centralizes:
- Runtime configuration values
- Resolved filesystem paths
- Derived state shared across steps
Instead of globals or parameter sprawl, the pipeline passes a single context object:
@dataclass
class ProcessingContext:
package_dir: Path
docx_stem: str
main_map_path: Optional[Path]
renamed_main_map_path: Optional[Path]
Why this matters
- Shared state is explicit and inspectable
- Steps remain loosely coupled
- New derived values can be added without breaking existing steps
4. Template Method (Implicit)¶
Where: Pipeline execution loop
The pipeline enforces a fixed execution structure:
- Setup
- Step execution
- Logging
- Failure propagation
Steps decide what to do.
The pipeline decides when and how they run.
This prevents steps from: - Calling other steps - Managing logging inconsistently - Reordering execution
5. Facade Pattern for XML Operations¶
Where: dita_xml.py
All XML manipulation is wrapped behind a small, focused API:
read_xmlwrite_xmlget_map_titleget_top_level_topicrefscreate_concept_topic_xmltransform_to_glossentry
This creates a Facade over lxml.
Benefits
- XPath logic is centralized
- XML handling remains consistent across steps
- DITA edge cases can be fixed in one place
If DITA conventions change, the blast radius is contained.
6. Functional Core, Imperative Shell¶
- Functional core
- XML tree transformations
- Slug generation
- Structural rewrites
- Imperative shell
- File I/O
- Logging
- CLI parsing
- Configuration loading
This separation improves: - Reasoning about correctness - Unit testing - Debugging failed runs
Step Responsibilities¶
RemoveIndexMapStep¶
- Reads
index.ditamap - Resolves the referenced main map
- Deletes
index.ditamap
Responsibility
Establish the true entry point and remove indirection.
RenameMainMapStep¶
- Renames the resolved main map to
<docx_stem>.ditamap
Why separate
Renaming is a structural operation and should not be entangled with content rewriting.
ProcessMapsStep¶
This step performs the core normalization work:
- Detects the abstract map
- Injects abstract content into the main map
- Numbers remaining maps deterministically
- Creates wrapper concept topics
- Reparents existing topicrefs under the wrapper
Cohesive responsibility
Normalize map structure and impose a deterministic hierarchy.
RefactorGlossaryStep¶
- Locates the definition node in the definition map
- Iterates its child topicrefs
- Converts each referenced topic into a
glossentryin place
This logic is isolated because glossary behavior evolves independently.
Error Handling Philosophy¶
- Fail fast for structural impossibilities
(missing index map, unresolved main map) - Warn and continue for content inconsistencies
(missing topics, unmatched navtitles) - No silent failures
Every failure mode is logged with step context.
Testing Strategy¶
Integration-First Testing¶
- Tests use
pytestandtmp_path - Real directories and XML files are created
- Assertions validate structural outcomes, not internal state
This avoids brittle mocks and validates real-world behavior.
Extensibility Scenarios¶
New behavior requires no refactoring:
Examples:
- Regex-based cleanup step
- Attribute normalization step
- DITA 1.2 → 1.3 migration step
- Metadata enrichment step
- Validation or linting step
To add behavior:
- Create a new step in
steps/ - Register it
- Add it to
pipeline.steps
Nothing else changes.
Design Non-Goals (Intentional)¶
This project does not attempt to be:
- A plugin framework
- An interactive assistant
- A workflow engine
- A schema repair tool
- A dynamic inference system
It is a batch processor.
Summary¶
The DITA Package Processor applies well-understood, conservative design patterns to a messy, real-world problem:
- Pipeline for orchestration
- Command/Strategy for extensibility
- Context object for shared state
- Facade for XML safety
The result is a system that scales in capability without collapsing under cleverness.
That is the design.