Classifier Module
classifier.py
Transforms RawDocument objects into TopicModel trees.
This module performs semantic classification only. It does not convert structured content into DITA form; that logic lives in the writer.
Responsibilities
Detect topic boundaries (per RawSection)
Infer TopicType from heading text
Convert raw blocks into semantic model classes
- Tag specialized cases:
RawNoteBlock
TableBlock.kind classification
Step-like commands in TASK topics (paragraphs only)
Non-responsibilities
ID normalization and uniqueness
Writing DITA XML or performing image conversion
Unwrapping layout tables or detecting step/action tables
- class dita_sop_converter.classifier.BaseClassifier
Bases: object
Abstract classifier interface.
Subclasses receive a RawDocument and produce a list of TopicModel instances.
- classify(raw_doc)
Convert a raw document into semantic topic models.
- Return type:
typing.List[dita_sop_converter.model.TopicModel]
- Parameters:
raw_doc (RawDocument)
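The interface above can be illustrated with a minimal sketch. The dataclasses here are simplified stand-ins for RawDocument, RawSection and TopicModel: their real fields are not shown on this page, so `title` and `sections` are assumptions, and `OneTopicPerSectionClassifier` is a hypothetical subclass written only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


# Stand-in types. The real classes live in dita_sop_converter;
# the fields used here (title, sections) are assumed for the example.
@dataclass
class RawSection:
    title: str


@dataclass
class RawDocument:
    sections: List[RawSection] = field(default_factory=list)


@dataclass
class TopicModel:
    title: str


class BaseClassifier:
    """Abstract interface: a RawDocument goes in, a list of TopicModel comes out."""

    def classify(self, raw_doc: RawDocument) -> List[TopicModel]:
        # Assumption: the base class leaves the method unimplemented.
        raise NotImplementedError


class OneTopicPerSectionClassifier(BaseClassifier):
    """Hypothetical subclass that emits one topic per raw section."""

    def classify(self, raw_doc: RawDocument) -> List[TopicModel]:
        return [TopicModel(title=section.title) for section in raw_doc.sections]


doc = RawDocument(sections=[RawSection("Purpose"), RawSection("Procedure")])
topics = OneTopicPerSectionClassifier().classify(doc)
```

Keeping the base class free of heuristics means other document styles could plug in their own classifier without touching the writer.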
- class dita_sop_converter.classifier.SopHeuristicClassifier
Bases: BaseClassifier
Heuristic classifier for SOP-style documents.
Strategy
Treat one RawSection as one DITA topic
Infer TopicType from title text
- Convert raw reader blocks into semantic block instances:
TableBlock classification based on row count
Paragraph-style RawBlock → Block or StepBlock (task only)
NOTE detection → RawNoteBlock
Images passed through without modification
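The title-based inference and task-only step detection in the strategy above might look roughly like the following sketch. Everything here is an assumption made for illustration: the keyword lists, the `STEP_RE` pattern, the helper names, and the `TopicType` members other than TASK are not documented on this page.

```python
import re
from enum import Enum


class TopicType(Enum):
    # Assumed members; only TASK is named in this page.
    CONCEPT = "concept"
    TASK = "task"
    REFERENCE = "reference"


# Hypothetical keyword lists; the real heuristics are not documented here.
TASK_WORDS = ("procedure", "steps", "instructions", "how to")
REFERENCE_WORDS = ("definitions", "references", "records", "appendix")

# Assumed pattern for a step-like paragraph: a leading "1." or "2)".
STEP_RE = re.compile(r"^\s*\d+[.)]\s+")


def infer_topic_type(title: str) -> TopicType:
    """Pick a topic type from keywords in the section title."""
    lowered = title.lower()
    if any(word in lowered for word in TASK_WORDS):
        return TopicType.TASK
    if any(word in lowered for word in REFERENCE_WORDS):
        return TopicType.REFERENCE
    return TopicType.CONCEPT  # fallback when no keyword matches


def is_step(paragraph: str, topic_type: TopicType) -> bool:
    """Treat a paragraph as a step only inside TASK topics."""
    return topic_type is TopicType.TASK and bool(STEP_RE.match(paragraph))
```

Gating step detection on the topic type mirrors the "task only" restriction: a numbered paragraph in a non-task topic stays an ordinary block.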
- FIGURE_RE = re.compile('^\\s*figure\\s+\\d+', re.IGNORECASE)
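As a quick illustration of what this pattern accepts, the sketch below rebuilds the same expression and checks a few strings. The page does not say where the classifier applies it, so the helper name `looks_like_figure_caption` and the idea that it is used for caption detection are assumptions.

```python
import re

# Same pattern as the class attribute: optional leading whitespace,
# the word "figure" in any case, at least one space, then digits.
FIGURE_RE = re.compile(r"^\s*figure\s+\d+", re.IGNORECASE)


def looks_like_figure_caption(text: str) -> bool:
    """Hypothetical helper: True when text starts like 'Figure 3'."""
    return FIGURE_RE.match(text) is not None


samples = [
    "Figure 3: Pump assembly",  # matches
    "  FIGURE 12",              # matches: leading spaces and case are ignored
    "See Figure 3",             # no match: pattern is anchored at the start
    "Figure A",                 # no match: a digit is required
    "Figure3",                  # no match: whitespace before the number is required
]
results = [looks_like_figure_caption(s) for s in samples]
```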
- classify(raw_doc)
Convert a RawDocument into a list of topic models.
- Return type:
typing.List[dita_sop_converter.model.TopicModel]
- Parameters:
raw_doc (RawDocument)