Pipeline Module
pipeline.py
End-to-end conversion pipeline linking DOCX reading, classification, and DITA writing.
Responsibilities
Read DOCX into RawDocument structures
Classify sections into semantic TopicModels
Persist topics/media + root DITA map via DitaWriter
The pipeline only orchestrates: it performs no structural transformations itself and delegates them to its collaborators.
- class dita_sop_converter.pipeline.ConverterPipeline(classifier=None)
Bases: object
DOCX → DITA conversion orchestrator.
Sequence
Read DOCX → RawDocument
Classify → TopicModels
Serialize to DITA topics + map
The pipeline creates the output directories before any writing begins, which avoids I/O errors from missing directories during media conversion.
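Creating directories ahead of writing could look roughly like the sketch below. The helper name `prepare_output_dirs` and the returned dictionary are hypothetical and not part of this module's documented API; the `topics/` and `media/` subfolder names are taken from the `output_dir` description of `run`.

```python
import os


def prepare_output_dirs(output_dir: str) -> dict:
    """Create the output layout up front.

    Hypothetical helper for illustration only; the real pipeline may
    organise this step differently.
    """
    root = os.path.abspath(output_dir)
    layout = {
        "root": root,
        "topics": os.path.join(root, "topics"),
        "media": os.path.join(root, "media"),
    }
    for path in layout.values():
        # exist_ok=True keeps the call idempotent across repeated runs.
        os.makedirs(path, exist_ok=True)
    return layout
```

Because every directory exists before the first file is written, a later media-conversion step never has to create folders on demand.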
- run(input_path, output_dir, map_id=None)
Execute the pipeline.
Return type:
str
- Parameters:
input_path (str)
output_dir (str)
map_id (str | None)
Parameters
- input_path : str
Path to the .docx SOP.
- output_dir : str
Destination folder for DITA output (topics/, media/, map).
- map_id : str or None
Optional explicit map ID override.
Returns
- str
Absolute path to generated .ditamap file.
Raises
- ValueError
When classifier returns no topics.
- Parameters:
classifier (Optional[BaseClassifier])
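The documented contract of `run` (read, classify, write, return the absolute map path, raise `ValueError` when no topics are produced) can be illustrated with a self-contained sketch. Everything here is a stand-in: `SketchPipeline`, the stub reader, classifier, and writer, and the simplified `RawDocument` and `TopicModel` classes are assumptions made for illustration and do not reproduce the real `ConverterPipeline`, `BaseClassifier`, or `DitaWriter`. The point at which directories are created relative to the empty-topics check is also assumed.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RawDocument:
    """Simplified stand-in for the reader's output."""
    sections: List[str] = field(default_factory=list)


@dataclass
class TopicModel:
    """Simplified stand-in for a classified topic."""
    title: str


class SketchPipeline:
    """Orchestration-only sketch: each step is delegated to a collaborator."""

    def __init__(
        self,
        reader: Callable[[str], RawDocument],
        classifier: Callable[[RawDocument], List[TopicModel]],
        writer: Callable[[List[TopicModel], str, Optional[str]], str],
    ) -> None:
        self.reader = reader
        self.classifier = classifier
        self.writer = writer

    def run(self, input_path: str, output_dir: str,
            map_id: Optional[str] = None) -> str:
        raw = self.reader(input_path)            # 1. DOCX -> RawDocument
        topics = self.classifier(raw)            # 2. RawDocument -> topics
        if not topics:
            raise ValueError("classifier returned no topics")
        os.makedirs(output_dir, exist_ok=True)   # directories before writing
        map_path = self.writer(topics, output_dir, map_id)  # 3. serialize
        return os.path.abspath(map_path)


# Stub collaborators (hypothetical) so the sketch runs without a real DOCX.
def stub_reader(path: str) -> RawDocument:
    return RawDocument(sections=["Purpose", "Procedure"])


def stub_classifier(raw: RawDocument) -> List[TopicModel]:
    return [TopicModel(title=s) for s in raw.sections]


def stub_writer(topics: List[TopicModel], output_dir: str,
                map_id: Optional[str]) -> str:
    map_path = os.path.join(output_dir, f"{map_id or 'map'}.ditamap")
    with open(map_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(t.title for t in topics))
    return map_path
```

A caller would typically wrap `run` in a `try`/`except ValueError` block to report documents for which no topics could be classified.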