docx_reader Module

docx_reader.py

Extract structural content from DOCX files for semantic conversion.

Produces a RawDocument tree composed of:

RawDocument
    RawSection[]
        Block[]

Preserves:

  • document order

  • heading boundaries

  • paragraph text

  • table structure

  • image extraction

  • NOTE heuristics

  • Figure caption detection (Figure X - …)
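
To make "document order" concrete: in a DOCX package, paragraphs and tables are sibling elements under the body of word/document.xml, so walking those children in sequence keeps them interleaved as they appear on the page. The following is a minimal standard-library sketch of that idea, not DocxReader's actual implementation. The names iter_body_items and _text are hypothetical, and the sketch ignores styles, images, and nested tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element):
    """Concatenate the text of all w:t runs under an element (hypothetical helper)."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def iter_body_items(docx_bytes):
    """Yield ("paragraph", str) or ("table", rows) in body order.

    Hypothetical helper for illustration only; assumes a well-formed
    document with a w:body element.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    for child in body:
        if child.tag == W + "p":
            yield "paragraph", _text(child)
        elif child.tag == W + "tbl":
            # Only direct rows and cells are visited, so nested tables
            # are flattened into the text of their parent cell.
            rows = [
                [_text(cell) for cell in row.findall(W + "tc")]
                for row in child.findall(W + "tr")
            ]
            yield "table", rows
```

Because paragraphs and tables come out of a single loop over the body, a table placed between two paragraphs is yielded between them, which is the ordering guarantee the module description refers to.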

class dita_sop_converter.docx_reader.DocxReader

Bases: object

Structural extractor for SOP-style DOCX documents.

Responsibilities

  • Preserve document order

  • Detect heading boundaries with style heuristics

  • Extract tables and table rows

  • Extract inline/table images

  • Detect NOTE and FIGURE captions heuristically

FIGURE_RE = re.compile('^figure\\s+\\d+', re.IGNORECASE)
HEADING_STYLE_RE = re.compile('.*heading\\s*([1-9]).*', re.IGNORECASE)
NOTE_RE = re.compile('^\\s*(note|warning|caution)\\b[:\\-]?\\s*(.*)$', re.IGNORECASE)
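
The three patterns above can be exercised on their own. The sketch below copies them verbatim and wraps each in a small helper. The helper names (heading_level, split_note, is_figure_caption) are hypothetical, and whether DocxReader applies the patterns with match() or in some other way is an assumption.

```python
import re

# Patterns copied from the DocxReader class attributes listed above.
FIGURE_RE = re.compile(r"^figure\s+\d+", re.IGNORECASE)
HEADING_STYLE_RE = re.compile(r".*heading\s*([1-9]).*", re.IGNORECASE)
NOTE_RE = re.compile(r"^\s*(note|warning|caution)\b[:\-]?\s*(.*)$", re.IGNORECASE)


def heading_level(style_name):
    """Return the heading level found in a style name, or None (hypothetical helper)."""
    m = HEADING_STYLE_RE.match(style_name)
    return int(m.group(1)) if m else None


def split_note(text):
    """Return (kind, body) for NOTE-like paragraphs, or None (hypothetical helper)."""
    m = NOTE_RE.match(text)
    return (m.group(1).upper(), m.group(2)) if m else None


def is_figure_caption(text):
    """True when the paragraph starts with 'Figure <number>' (hypothetical helper)."""
    return FIGURE_RE.match(text) is not None


print(heading_level("Heading 2"))          # 2
print(split_note("NOTE: Wear gloves."))    # ('NOTE', 'Wear gloves.')
print(is_figure_caption("See Figure 3"))   # False
```

Two properties of the patterns as written are worth knowing. HEADING_STYLE_RE captures a single digit from 1 to 9, so a style named "Heading 10" yields level 1. NOTE_RE requires a word boundary after the keyword, so "Notes on usage" is not treated as a note.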
read(path)
Return type:

dita_sop_converter.docx_reader.RawDocument

Parameters:

path (str)

class dita_sop_converter.docx_reader.RawBlock(style_name, text, is_table=False, is_caption=False)

Bases: object

Low-level paragraph block extracted directly from DOCX.

Attributes

style_name : str

text : str

is_table : bool

True only for table cell paragraphs.

is_caption : bool

True for paragraphs starting with “Figure X”.

is_caption: bool = False
is_table: bool = False
style_name: str
text: str
Parameters:
  • style_name (str)

  • text (str)

  • is_table (bool)

  • is_caption (bool)

class dita_sop_converter.docx_reader.RawDocument(title, sections)

Bases: object

sections: List[RawSection]
title: str | None
Parameters:
  • title (str | None)

  • sections (List[RawSection])
class dita_sop_converter.docx_reader.RawSection(title, level, blocks=<factory>)

Bases: object

blocks: List[RawBlock | RawNoteBlock | ImageBlock | TableBlock]
level: int
title: str
Parameters:
  • title (str)

  • level (int)

  • blocks (List[RawBlock | RawNoteBlock | ImageBlock | TableBlock])
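
The signatures in this module (keyword defaults on is_table and is_caption, blocks=<factory>) are consistent with plain dataclasses. The sketch below assumes that and mirrors the documented fields to show how the RawDocument tree fits together. It does not import the real module. RawNoteBlock, ImageBlock, and TableBlock are left out because their fields are not described here, and outline is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RawBlock:
    style_name: str
    text: str
    is_table: bool = False
    is_caption: bool = False


@dataclass
class RawSection:
    title: str
    level: int
    # The documented type also allows RawNoteBlock, ImageBlock and TableBlock;
    # they are omitted because their fields are not documented in this section.
    blocks: List[RawBlock] = field(default_factory=list)


@dataclass
class RawDocument:
    title: Optional[str]
    sections: List[RawSection]


def outline(doc):
    """Return one indented line per section with its block count (hypothetical helper)."""
    return [
        f"{'  ' * (s.level - 1)}{s.title} ({len(s.blocks)} blocks)"
        for s in doc.sections
    ]


doc = RawDocument(
    title="Example SOP",
    sections=[
        RawSection("Scope", 1, [RawBlock("Normal", "Applies to all sites.")]),
        RawSection("Details", 2),
    ],
)
print("\n".join(outline(doc)))
```

Using default_factory=list gives every RawSection its own empty blocks list, which matches the blocks=<factory> shown in the signature.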