Pipeline Module

pipeline.py

End-to-end conversion pipeline linking DOCX reading, classification, and DITA writing.

Responsibilities

  • Read DOCX into RawDocument structures

  • Classify sections into semantic TopicModels

  • Persist topics, media, and the root DITA map via DitaWriter

The pipeline only orchestrates; all structural transformations are delegated to its collaborators.

class dita_sop_converter.pipeline.ConverterPipeline(classifier=None)

Bases: object

DOCX → DITA conversion orchestrator.

Sequence

  1. Read DOCX → RawDocument

  2. Classify → TopicModels

  3. Serialize to DITA topics + map

The pipeline creates the output directories before any writing begins, so that media conversion does not fail later on a missing directory.
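The sequence above can be sketched as follows. This is a minimal illustration, not the actual ConverterPipeline source: SketchPipeline, SketchTopic, and write_map are hypothetical stand-ins, the collaborators are passed in as plain callables, and falling back to the input file stem when map_id is None is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class SketchTopic:
    """Hypothetical stand-in for a TopicModel."""
    topic_id: str
    title: str


def write_map(topics: List[SketchTopic], out_dir: Path, map_id: str) -> Path:
    """Hypothetical writer: placeholder files only, not real DITA markup."""
    for topic in topics:
        topic_file = out_dir / "topics" / f"{topic.topic_id}.dita"
        topic_file.write_text(topic.title, encoding="utf-8")
    map_path = out_dir / f"{map_id}.ditamap"
    map_path.write_text("\n".join(t.topic_id for t in topics), encoding="utf-8")
    return map_path


class SketchPipeline:
    """Illustrative orchestrator; it only wires the collaborators together."""

    def __init__(
        self,
        reader: Callable[[str], list],
        classifier: Callable[[list], List[SketchTopic]],
        writer: Callable[[List[SketchTopic], Path, str], Path],
    ) -> None:
        self.reader = reader
        self.classifier = classifier
        self.writer = writer

    def run(self, input_path: str, output_dir: str,
            map_id: Optional[str] = None) -> str:
        out = Path(output_dir).resolve()
        # Create the output folders up front, before any file is written.
        for sub in ("topics", "media"):
            (out / sub).mkdir(parents=True, exist_ok=True)

        raw = self.reader(input_path)        # 1. read DOCX
        topics = self.classifier(raw)        # 2. classify
        if not topics:
            raise ValueError("classifier returned no topics")

        # 3. serialize; the default map ID is an assumption of this sketch.
        resolved_id = map_id or Path(input_path).stem
        return str(self.writer(topics, out, resolved_id))
```

Passing the collaborators in keeps the orchestrator free of transformation logic, which matches the delegation described above.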

run(input_path, output_dir, map_id=None)

Execute the pipeline.

Return type:

str

Parameters:
  • input_path (str)

  • output_dir (str)

  • map_id (str | None)

Parameters

input_path : str

Path to the .docx SOP.

output_dir : str

Destination folder for DITA output (topics/, media/, map).

map_id : str or None

Optional explicit map ID override.

Returns

str

Absolute path to generated .ditamap file.

Raises

ValueError

Raised when the classifier returns no topics.
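A caller may want to treat an empty classification as a recoverable condition rather than a crash. The wrapper below is a hedged sketch of that: convert_sop and the SupportsRun protocol are hypothetical names, and the only behavior assumed is the documented run() signature and its ValueError.

```python
import sys
from typing import Optional, Protocol


class SupportsRun(Protocol):
    """Anything exposing the documented run() signature."""

    def run(self, input_path: str, output_dir: str,
            map_id: Optional[str] = None) -> str: ...


def convert_sop(pipeline: SupportsRun, input_path: str, output_dir: str,
                map_id: Optional[str] = None) -> Optional[str]:
    """Run the pipeline; return the .ditamap path, or None if no topics."""
    try:
        return pipeline.run(input_path, output_dir, map_id=map_id)
    except ValueError as exc:
        # Documented failure mode: the classifier produced no topics.
        print(f"No topics produced for {input_path}: {exc}", file=sys.stderr)
        return None
```

With the package installed, this would presumably be called as convert_sop(ConverterPipeline(), "sop.docx", "out"), with ConverterPipeline imported from dita_sop_converter.pipeline.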

Parameters:

classifier (Optional[BaseClassifier])