docx_reader Module

docx_reader.py

Extract structural content from DOCX files for semantic conversion.

Produces a RawDocument tree composed of:

RawDocument
    RawSection[]
        Block[]

Preserves:

document order
heading boundaries
paragraph text
table structure
image extraction
NOTE heuristics
Figure caption detection (Figure X - …)

class dita_sop_converter.docx_reader.DocxReader

Bases: object

Structural extractor for SOP-style DOCX documents.

Responsibilities

Preserve document order
Detect heading boundaries with style heuristics
Extract tables and table rows
Extract inline/table images
Detect NOTE and FIGURE captions heuristically
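The heading-boundary responsibility can be illustrated with a minimal sketch. It is not the module's actual implementation: `split_into_sections` is a hypothetical helper, the input is assumed to be a flat list of `(style_name, text)` pairs, and the output is plain dicts rather than RawSection objects. Only the heading pattern is taken from this class.

```python
import re

# Same pattern as DocxReader.HEADING_STYLE_RE documented on this class.
HEADING_STYLE_RE = re.compile(r".*heading\s*([1-9]).*", re.IGNORECASE)


def split_into_sections(paragraphs):
    """Group (style_name, text) pairs into sections at heading boundaries.

    Hypothetical helper for illustration; returns a list of dicts.
    """
    sections = []
    current = None
    for style_name, text in paragraphs:
        match = HEADING_STYLE_RE.match(style_name)
        if match:
            # A heading style closes the previous section and opens a new one.
            current = {"title": text, "level": int(match.group(1)), "blocks": []}
            sections.append(current)
        elif current is not None:
            # Non-heading paragraphs attach to the most recent section,
            # which keeps document order intact.
            current["blocks"].append(text)
        # Text before the first heading is dropped in this sketch.
    return sections


paragraphs = [
    ("Heading 1", "Purpose"),
    ("Normal", "This SOP describes pump maintenance."),
    ("Heading 2", "Scope"),
    ("Normal", "Applies to all site pumps."),
]
sections = split_into_sections(paragraphs)
```

How the real reader treats content that appears before the first heading is not stated in this reference; the sketch simply skips it.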

FIGURE_RE = re.compile('^figure\\s+\\d+', re.IGNORECASE)

HEADING_STYLE_RE = re.compile('.*heading\\s*([1-9]).*', re.IGNORECASE)

NOTE_RE = re.compile('^\\s*(note|warning|caution)\\b[:\\-]?\\s*(.*)$', re.IGNORECASE)
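The three patterns can be exercised on their own to see what they accept. The sketch below recompiles them exactly as documented; the helper functions (`heading_level`, `parse_note`, `is_figure_caption`) are hypothetical wrappers, and the use of `match` rather than `search` is an assumption about how the reader applies them.

```python
import re

# Patterns copied from the DocxReader class attributes.
FIGURE_RE = re.compile(r"^figure\s+\d+", re.IGNORECASE)
HEADING_STYLE_RE = re.compile(r".*heading\s*([1-9]).*", re.IGNORECASE)
NOTE_RE = re.compile(r"^\s*(note|warning|caution)\b[:\-]?\s*(.*)$", re.IGNORECASE)


def heading_level(style_name):
    """Return the heading level encoded in a style name, or None."""
    m = HEADING_STYLE_RE.match(style_name)
    return int(m.group(1)) if m else None


def parse_note(text):
    """Return (kind, body) for NOTE/WARNING/CAUTION paragraphs, or None."""
    m = NOTE_RE.match(text)
    return (m.group(1).lower(), m.group(2)) if m else None


def is_figure_caption(text):
    """True when the paragraph starts with 'Figure <number>'."""
    return FIGURE_RE.match(text) is not None


level = heading_level("Heading 2")                      # 2
no_level = heading_level("Body Text")                   # None
note = parse_note("NOTE: Wear gloves.")                 # ("note", "Wear gloves.")
not_a_note = parse_note("Notebook entries are kept.")   # None, \b rejects "Notebook"
caption = is_figure_caption("Figure 3 - Pump layout")   # True
mention = is_figure_caption("See Figure 3")             # False, pattern is anchored
```

Note that HEADING_STYLE_RE captures a single digit, so a style named "Heading 10" would be read as level 1.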

read(path)

Parameters: path (str)
Return type: dita_sop_converter.docx_reader.RawDocument
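A typical call might look like the sketch below. The `outline` helper is hypothetical and relies only on the `sections`, `title`, `level`, and `blocks` attributes documented for RawDocument and RawSection. The commented-out lines assume the package is installed and that DocxReader takes no constructor arguments, which this reference suggests but does not state; the file path is a placeholder.

```python
from types import SimpleNamespace


def outline(document):
    """Return one indented line per section of a RawDocument-like object."""
    lines = []
    for section in document.sections:
        indent = "  " * max(section.level - 1, 0)
        lines.append(f"{indent}{section.title} ({len(section.blocks)} blocks)")
    return lines


# Assumed usage with the real reader (not executed here):
#
#   from dita_sop_converter.docx_reader import DocxReader
#   document = DocxReader().read("procedure.docx")  # placeholder path
#   print(document.title)
#   print("\n".join(outline(document)))

# Stand-in object with the same attribute names, so the helper can be
# tried without a DOCX file.
demo = SimpleNamespace(
    title="Demo",
    sections=[
        SimpleNamespace(title="Purpose", level=1, blocks=["a", "b"]),
        SimpleNamespace(title="Scope", level=2, blocks=[]),
    ],
)
demo_lines = outline(demo)
```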

class dita_sop_converter.docx_reader.RawBlock(style_name, text, is_table=False, is_caption=False)

Bases: object

Low-level paragraph block extracted directly from DOCX.

Attributes

style_name : str

text : str

is_table : bool

True only for table cell paragraphs.

is_caption : bool

True for paragraphs starting with “Figure X”.

is_caption: bool = False

is_table: bool = False

style_name: str

text: str

Parameters:

style_name (str)
text (str)
is_table (bool)
is_caption (bool)

class dita_sop_converter.docx_reader.RawDocument(title, sections)

Bases: object

Parameters:

title (str | None)
sections (List[RawSection])

sections: List[RawSection]

title: str | None

class dita_sop_converter.docx_reader.RawSection(title, level, blocks=<factory>)

Bases: object

Parameters:

title (str)
level (int)
blocks (List[RawBlock | RawNoteBlock | ImageBlock | TableBlock])

blocks: List[RawBlock | RawNoteBlock | ImageBlock | TableBlock]

level: int

title: str
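For testing downstream converters without a DOCX file, the tree can be built by hand. The classes below are stand-ins that mirror the constructor signatures documented above; they are assumed to be dataclasses (the `blocks=<factory>` default suggests this), and in real code the originals would be imported from dita_sop_converter.docx_reader instead. RawNoteBlock, ImageBlock, and TableBlock are left out because their fields are not described in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RawBlock:  # stand-in mirroring the documented signature
    style_name: str
    text: str
    is_table: bool = False
    is_caption: bool = False


@dataclass
class RawSection:  # stand-in; the real blocks list also accepts other block types
    title: str
    level: int
    blocks: List[RawBlock] = field(default_factory=list)


@dataclass
class RawDocument:  # stand-in
    title: Optional[str]
    sections: List[RawSection]


doc = RawDocument(
    title="Pump Maintenance SOP",
    sections=[
        RawSection("Purpose", 1, [RawBlock("Normal", "Describes routine pump checks.")]),
        RawSection("Procedure", 1, [
            RawBlock("Normal", "Isolate the pump."),
            RawBlock("Caption", "Figure 1 - Isolation valve", is_caption=True),
        ]),
    ],
)

# Walk the tree in document order and collect figure captions.
# getattr guards against block types that carry no is_caption field.
captions = [
    block.text
    for section in doc.sections
    for block in section.blocks
    if getattr(block, "is_caption", False)
]
```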