docx_reader Module
docx_reader.py
Extract structural content from DOCX files for semantic conversion.
Produces a RawDocument tree composed of:
- RawDocument
  - RawSection[]
    - Block[]
Preserves:
- document order
- heading boundaries
- paragraph text
- table structure
- image extraction
- NOTE heuristics
- Figure caption detection (Figure X - …)
- class dita_sop_converter.docx_reader.DocxReader
Bases: object
Structural extractor for SOP-style DOCX documents.
Responsibilities
- Preserve document order
- Detect heading boundaries with style heuristics
- Extract tables and table rows
- Extract inline and table images
- Detect NOTE and FIGURE captions heuristically
- FIGURE_RE = re.compile('^figure\\s+\\d+', re.IGNORECASE)
- HEADING_STYLE_RE = re.compile('.*heading\\s*([1-9]).*', re.IGNORECASE)
- NOTE_RE = re.compile('^\\s*(note|warning|caution)\\b[:\\-]?\\s*(.*)$', re.IGNORECASE)
- read(path)
- Return type:
- Parameters:
path (str)
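
The three class-level patterns can be tried out in isolation. The sketch below copies the regular expressions from the attribute listing above; the helper functions heading_level, split_note, and is_figure_caption are hypothetical names introduced only for illustration, and how DocxReader applies the patterns internally is not documented here.

```python
import re

# Patterns copied from the DocxReader class attributes listed above.
FIGURE_RE = re.compile(r"^figure\s+\d+", re.IGNORECASE)
HEADING_STYLE_RE = re.compile(r".*heading\s*([1-9]).*", re.IGNORECASE)
NOTE_RE = re.compile(r"^\s*(note|warning|caution)\b[:\-]?\s*(.*)$", re.IGNORECASE)


def heading_level(style_name):
    """Hypothetical helper: heading level encoded in a style name, or None."""
    m = HEADING_STYLE_RE.match(style_name)
    return int(m.group(1)) if m else None


def split_note(text):
    """Hypothetical helper: (kind, body) for NOTE-like paragraphs, or None."""
    m = NOTE_RE.match(text)
    return (m.group(1).lower(), m.group(2)) if m else None


def is_figure_caption(text):
    """Hypothetical helper: True when the text starts with 'Figure <number>'."""
    return FIGURE_RE.match(text) is not None


print(heading_level("Heading 2"))
print(split_note("NOTE: Wear gloves."))
print(is_figure_caption("Figure 3 - Pump assembly"))
```

Two properties follow from the patterns as written: HEADING_STYLE_RE captures a single digit from 1 to 9, so a style named "Heading 10" would be read as level 1, and FIGURE_RE is anchored at the start of the text, so "See Figure 3" is not treated as a caption.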
- class dita_sop_converter.docx_reader.RawBlock(style_name, text, is_table=False, is_caption=False)
Bases: object
Low-level paragraph block extracted directly from DOCX.
Attributes
- style_name : str
- text : str
- is_table : bool
  True only for table cell paragraphs.
- is_caption : bool
  True for paragraphs starting with "Figure X".
- is_caption: bool = False
- is_table: bool = False
- style_name: str
- text: str
- Parameters:
style_name (str)
text (str)
is_table (bool)
is_caption (bool)
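
The signature and defaults above suggest a plain record type. The following is a minimal stand-in, assuming a dataclass-like shape; it is not the real dita_sop_converter class, and the style names and texts are made-up sample values.

```python
from dataclasses import dataclass


@dataclass
class RawBlock:
    # Hypothetical stand-in mirroring the documented signature, not the real class.
    style_name: str
    text: str
    is_table: bool = False    # True only for table cell paragraphs
    is_caption: bool = False  # True for paragraphs starting with "Figure X"


# Sample values below are illustrative only.
body = RawBlock(style_name="Normal", text="Close the valve.")
caption = RawBlock("Caption", "Figure 2 - Valve assembly", is_caption=True)
cell = RawBlock("Table Text", "Torque: 5 Nm", is_table=True)

print(body)
```

Both flags default to False, so an ordinary body paragraph needs only a style name and its text.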
- class dita_sop_converter.docx_reader.RawDocument(title, sections)
Bases: object
- Parameters:
title (str | None)
sections (List[RawSection])
- sections: List[RawSection]
- title: str | None
- class dita_sop_converter.docx_reader.RawSection(title, level, blocks=<factory>)
Bases: object
- Parameters:
title (str)
level (int)
blocks (List[RawBlock | RawNoteBlock | ImageBlock | TableBlock])
- blocks: List[RawBlock | RawNoteBlock | ImageBlock | TableBlock]
- level: int
- title: str
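
To show how the pieces fit together, here is a sketch of the tree and a simple traversal. The three classes are hypothetical stand-ins that mirror the documented fields (assuming dataclasses, with blocks defaulting to an empty list as the <factory> default suggests); the real blocks list may also hold RawNoteBlock, ImageBlock, and TableBlock, which are left out here. The outline function and the sample document are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RawBlock:  # stand-in, see the documented RawBlock fields
    style_name: str
    text: str
    is_table: bool = False
    is_caption: bool = False


@dataclass
class RawSection:  # stand-in; the real blocks list accepts more block types
    title: str
    level: int
    blocks: List[RawBlock] = field(default_factory=list)


@dataclass
class RawDocument:  # stand-in; sections are a flat list ordered as in the DOCX
    title: Optional[str]
    sections: List[RawSection]


def outline(doc):
    """Hypothetical helper: one indented line per section, in document order."""
    lines = []
    for section in doc.sections:
        indent = "  " * (section.level - 1)
        lines.append(f"{indent}{section.title} ({len(section.blocks)} blocks)")
    return lines


doc = RawDocument(
    title="Pump Maintenance SOP",
    sections=[
        RawSection("Scope", 1, [RawBlock("Normal", "Applies to all pumps.")]),
        RawSection("Procedure", 1),
        RawSection("Shutdown", 2, [
            RawBlock("Normal", "Close the valve."),
            RawBlock("Caption", "Figure 1 - Valve", is_caption=True),
        ]),
    ],
)

print("\n".join(outline(doc)))
```

Because sections is a flat list, nesting is carried only by each section's level field; a consumer that needs a hierarchy has to rebuild it from the levels, as the indentation in outline does.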