Classifier Module

classifier.py

Transforms RawDocument objects into TopicModel trees.

This module performs semantic classification only. It does not convert structured content into DITA form; that logic lives in the writer.

Responsibilities

  • Detect topic boundaries (per RawSection)

  • Infer TopicType from heading text

  • Convert raw blocks into semantic model classes

  • Tag specialized cases:
    • RawNoteBlock

    • TableBlock.kind classification

    • Step-like commands in TASK topics (paragraphs only)
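As a rough illustration of the "Infer TopicType from heading text" responsibility, the sketch below maps a heading to a topic type with ordered keyword rules. The TopicType members, the keyword lists, and the infer_topic_type helper are all assumptions made for this example; the module's actual rules are not documented here and may differ.

```python
import re
from enum import Enum


class TopicType(Enum):
    """Stand-in for the converter's TopicType; member names are assumed."""
    CONCEPT = "concept"
    TASK = "task"
    REFERENCE = "reference"


# Hypothetical keyword rules, checked in order. The real heuristics may differ.
_TITLE_RULES = [
    (re.compile(r"\b(procedure|instructions|steps|how to)\b", re.IGNORECASE),
     TopicType.TASK),
    (re.compile(r"\b(references?|definitions|records|appendix)\b", re.IGNORECASE),
     TopicType.REFERENCE),
]


def infer_topic_type(title: str) -> TopicType:
    """Return the type of the first matching rule, defaulting to CONCEPT."""
    for pattern, topic_type in _TITLE_RULES:
        if pattern.search(title):
            return topic_type
    return TopicType.CONCEPT


for heading in ("1.0 Purpose and Scope", "4.0 Procedure", "6.0 References"):
    print(heading, "->", infer_topic_type(heading).name)
```

Keeping the rules in an ordered list makes the precedence explicit when a heading contains keywords from more than one group.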

Non-responsibilities

  • ID normalization and uniqueness

  • Writing DITA XML or performing image conversion

  • Unwrapping layout tables or detecting step/action tables

class dita_sop_converter.classifier.BaseClassifier

Bases: object

Abstract classifier interface.

Subclasses receive a RawDocument and produce a list of TopicModel instances.

classify(raw_doc)

Convert a raw document into semantic topic models.

Return type:

typing.List[dita_sop_converter.model.TopicModel]

Parameters:

raw_doc (RawDocument)
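The interface contract can be sketched as follows. This is a minimal stand-alone illustration, not the package's code: the RawSection, RawDocument, and TopicModel dataclasses are simplified stand-ins with assumed fields, TitleOnlyClassifier is a hypothetical subclass, and abc.ABC is used here to enforce the abstract method even though the documented class derives directly from object.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class RawSection:
    """Stand-in for the reader's section type; fields are assumed."""
    title: str


@dataclass
class RawDocument:
    """Stand-in for the reader's document type; fields are assumed."""
    sections: List[RawSection] = field(default_factory=list)


@dataclass
class TopicModel:
    """Stand-in for dita_sop_converter.model.TopicModel; fields are assumed."""
    title: str


class BaseClassifier(ABC):
    """Illustrative interface: one method mapping a document to topics."""

    @abstractmethod
    def classify(self, raw_doc: RawDocument) -> List[TopicModel]:
        """Convert a raw document into semantic topic models."""


class TitleOnlyClassifier(BaseClassifier):
    """Hypothetical subclass that emits one title-only topic per section."""

    def classify(self, raw_doc: RawDocument) -> List[TopicModel]:
        return [TopicModel(title=section.title) for section in raw_doc.sections]


doc = RawDocument(sections=[RawSection("Purpose"), RawSection("Procedure")])
topics = TitleOnlyClassifier().classify(doc)
print([topic.title for topic in topics])
```

Callers depend only on classify(), so a different classification strategy can be swapped in without changing the reader or the writer.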

class dita_sop_converter.classifier.SopHeuristicClassifier

Bases: BaseClassifier

Heuristic classifier for SOP-style documents.

Strategy

  • Treat one RawSection as one DITA topic

  • Infer TopicType from title text

  • Convert raw reader blocks into semantic block instances:
    • TableBlock classification based on row count

    • Paragraph-style RawBlock → Block or StepBlock (task only)

    • NOTE detection → RawNoteBlock

    • Images passed through without modification
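The paragraph-handling part of the strategy above might look like the sketch below: NOTE detection comes first, step detection applies only inside task topics, and everything else becomes a plain block. The block classes are simplified stand-ins, and NOTE_RE, STEP_RE, the verb list, and convert_paragraph are hypothetical names chosen for this example rather than the classifier's real internals.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """Stand-in for a plain semantic paragraph block."""
    text: str


@dataclass
class StepBlock:
    """Stand-in for a task step block."""
    text: str


@dataclass
class RawNoteBlock:
    """Stand-in for the note block named in the strategy list."""
    text: str


# Hypothetical patterns; the actual heuristics are not documented here.
NOTE_RE = re.compile(r"^\s*note\s*[:\-]\s*", re.IGNORECASE)
STEP_RE = re.compile(
    r"^\s*(\d+[.)]\s+)?(press|select|click|enter|open|close|verify|remove|install)\b",
    re.IGNORECASE,
)


def convert_paragraph(text: str, is_task: bool):
    """Map one paragraph to a note, a step (task topics only), or a block."""
    note = NOTE_RE.match(text)
    if note:
        # Drop the "NOTE:" label and keep only the note body.
        return RawNoteBlock(text[note.end():].strip())
    if is_task and STEP_RE.match(text):
        return StepBlock(text.strip())
    return Block(text.strip())


print(convert_paragraph("NOTE: Wear gloves.", is_task=True))
print(convert_paragraph("Press the start button.", is_task=True))
print(convert_paragraph("Press the start button.", is_task=False))
```

Checking for notes before steps keeps a line such as "NOTE: Verify the seal" from being treated as a step.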

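FIGURE_RE = re.compile('^\\s*figure\\s+\\d+', re.IGNORECASE)

Class-level pattern matching text that begins with the word "figure" followed by whitespace and a number, such as "Figure 3". Matching is case-insensitive and tolerates leading whitespace.
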
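To make the behavior of FIGURE_RE concrete, the snippet below rebuilds the same pattern and applies it to a few strings. The sample strings are invented for illustration, and the use of match() is an assumption, since the reference does not say how the classifier applies the pattern.

```python
import re

# Same pattern as the documented class attribute, written as a raw string.
FIGURE_RE = re.compile(r"^\s*figure\s+\d+", re.IGNORECASE)

samples = [
    "Figure 1: Pump assembly",      # matches: label at the start
    "  FIGURE 12 - Wiring diagram",  # matches: leading spaces, any case
    "See Figure 3 for details",     # no match: label is not at the start
    "Figure A: Overview",           # no match: a digit is required
    "Figure3",                      # no match: whitespace is required
]

for text in samples:
    print(f"{text!r}: {bool(FIGURE_RE.match(text))}")
```

Because the pattern is anchored with ^, only text that starts with the figure label matches; an in-sentence mention such as "See Figure 3" does not.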
classify(raw_doc)

Convert a RawDocument into a list of topic models.

Return type:

typing.List[dita_sop_converter.model.TopicModel]

Parameters:

raw_doc (RawDocument)