Verwenden Sie diese Fähigkeit immer dann, wenn eine Tabellenkalkulationsdatei die primäre Eingabe oder Ausgabe ist. Damit ist jede Aufgabe gemeint, bei der der Benutzer Folgendes tun möchte: eine vorhandene XLSX-, XLSM-, CSV- oder TSV-Datei öffnen, lesen, bearbeiten oder reparieren (z. B. Spalten hinzufügen, Formeln berechnen, formatieren, Diagramme erstellen, unordentliche Daten bereinigen); Erstellen Sie eine neue Tabelle von Grund auf oder aus anderen Datenquellen. oder zwischen tabellarischen Dateiformaten konvertieren. Wird insbesondere dann ausgelöst, wenn der Benutzer mit Namen oder Pfad auf eine Tabellenkalkulationsdatei verweist – auch beiläufig (z. B. „die XLSX-Datei in meinen Downloads“) – und möchte, dass etwas damit gemacht oder daraus erstellt wird. Lösen Sie außerdem die Bereinigung oder Umstrukturierung unordentlicher tabellarischer Datendateien (fehlerhafte Zeilen, falsch platzierte Kopfzeilen, Junk-Daten) in ordnungsgemäße Tabellenkalkulationen aus. Bei dem Liefergegenstand muss es sich um eine Tabellenkalkulationsdatei handeln. NICHT auslösen, wenn das primäre Ergebnis ein Word-Dokument, ein HTML-Bericht, ein eigenständiges Python-Skript, eine Datenbankpipeline oder eine Google Sheets-API-Integration ist, selbst wenn es sich um tabellarische Daten handelt.

Quelle: Inhalt angepasst von Anthropics/Skills (MIT).

Alle Excel-Dateien

Professionelle Schriftart

Verwenden Sie für alle Lieferungen eine konsistente, professionelle Schriftart (z. B. Arial, Times New Roman), sofern der Benutzer nichts anderes anweist

Keine Formelfehler

Jedes Excel-Modell MUSS mit NULL Formelfehlern geliefert werden (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?)

Vorhandene Vorlagen beibehalten (beim Aktualisieren von Vorlagen)

Studieren Sie vorhandene Formate, Stile und Konventionen und passen Sie sie beim Ändern von Dateien GENAU an
Erzwingen Sie niemals eine standardisierte Formatierung von Dateien mit etablierten Mustern
Bestehende Vorlagenkonventionen setzen diese Richtlinien IMMER außer Kraft

Finanzmodelle

Farbcodierungsstandards

Sofern vom Benutzer oder der vorhandenen Vorlage nicht anders angegeben

Branchenübliche Farbkonventionen

Blauer Text (RGB: 0,0,255): Hartcodierte Eingaben und Zahlen, die Benutzer für Szenarien ändern
Schwarzer Text (RGB: 0,0,0): ALLE Formeln und Berechnungen
Grüner Text (RGB: 0,128,0): Links, die von anderen Arbeitsblättern innerhalb derselben Arbeitsmappe abgerufen werden
Roter Text (RGB: 255,0,0): Externe Links zu anderen Dateien
Gelber Hintergrund (RGB: 255.255,0): Wichtige Annahmen, die Aufmerksamkeit erfordern, oder Zellen, die aktualisiert werden müssen

Standards für die Zahlenformatierung

Erforderliche Formatregeln

Jahre: Als Textzeichenfolgen formatieren (z. B. „2024“, nicht „2.024“)
Währung: Verwenden Sie das Format $#,##0; Geben Sie IMMER Einheiten in den Kopfzeilen an („Umsatz ($mm)“)
Nullen: Verwenden Sie Zahlenformatierung, um alle Nullen zu „-“ zu machen, einschließlich Prozentangaben (z. B. „$#,##0;($#,##0);-“)
Prozentsätze: Standardmäßig 0,0 %-Format (eine Dezimalstelle)
Multiples: Format als 0,0x für Bewertungsmultiplikatoren (EV/EBITDA, KGV)
Negative Zahlen: Verwenden Sie Klammern (123), nicht minus -123

Formelkonstruktionsregeln

Platzierung der Annahmen

Platzieren Sie ALLE Annahmen (Wachstumsraten, Margen, Vielfache usw.) in separaten Annahmezellen
Verwenden Sie Zellbezüge anstelle von fest codierten Werten in Formeln
Beispiel: Verwenden Sie =B5*(1+$B$6) anstelle von =B5*1,05

Verhinderung von Formelfehlern

Überprüfen Sie, ob alle Zellbezüge korrekt sind
Überprüfen Sie die Bereiche auf Abweichungsfehler
Sorgen Sie für konsistente Formeln über alle Prognosezeiträume hinweg
Test mit Randfällen (Nullwerte, negative Zahlen)
Stellen Sie sicher, dass es keine unbeabsichtigten Zirkelverweise gibt

Dokumentationsanforderungen für Hardcodes

Kommentar oder in den Zellen daneben (am Ende der Tabelle). Format: „Quelle: [System/Dokument], [Datum], [Spezifische Referenz], [URL, falls zutreffend]“
Beispiele:
- „Quelle: Unternehmen 10-K, GJ2024, Seite 45, Umsatzmitteilung, [SEC EDGAR URL]“
- „Quelle: Unternehmen 10-Q, Q2 2025, Anlage 99.1, [SEC EDGAR URL]“
- „Quelle: Bloomberg Terminal, 15.08.2025, AAPL US Equity“
- „Quelle: FactSet, 20.08.2025, Bildschirm „Konsensschätzungen““

XLSX-Erstellung, Bearbeitung und Analyse

Übersicht

Ein Benutzer bittet Sie möglicherweise, den Inhalt einer XLSX-Datei zu erstellen, zu bearbeiten oder zu analysieren. Für unterschiedliche Aufgaben stehen Ihnen unterschiedliche Tools und Workflows zur Verfügung.

Wichtige Anforderungen

LibreOffice für die Neuberechnung von Formeln erforderlich: Sie können davon ausgehen, dass LibreOffice für die Neuberechnung von Formelwerten mithilfe desscripts/recalc.py-Skripts installiert ist. Das Skript konfiguriert LibreOffice automatisch bei der ersten Ausführung, auch in Sandbox-Umgebungen, in denen Unix-Sockets eingeschränkt sind (verwaltet vonscripts/office/soffice.py).

Daten lesen und analysieren

Datenanalyse mit Pandas

Verwenden Sie für Datenanalyse, Visualisierung und grundlegende Vorgänge Pandas, das leistungsstarke Datenbearbeitungsfunktionen bietet:

import pandas as pd

# Read Excel
df = pd.read_excel('file.xlsx')  # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)  # All sheets as dict

# Analyze
df.head()      # Preview data
df.info()      # Column info
df.describe()  # Statistics

# Write Excel
df.to_excel('output.xlsx', index=False)

Excel-Datei-Workflows

KRITISCH: Verwenden Sie Formeln, keine fest codierten Werte

Verwenden Sie immer Excel-Formeln, anstatt Werte in Python zu berechnen und fest zu codieren. Dadurch wird sichergestellt, dass die Tabelle dynamisch und aktualisierbar bleibt.

FALSCH – Berechnete Werte fest codieren

# Bad: Calculating in Python and hardcoding result
total = df['Sales'].sum()
sheet['B10'] = total  # Hardcodes 5000

# Bad: Computing growth rate in Python
growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue']
sheet['C5'] = growth  # Hardcodes 0.15

# Bad: Python calculation for average
avg = sum(values) / len(values)
sheet['D20'] = avg  # Hardcodes 42.5

RICHTIG – Excel-Formeln verwenden

# Good: Let Excel calculate the sum
sheet['B10'] = '=SUM(B2:B9)'

# Good: Growth rate as Excel formula
sheet['C5'] = '=(C4-C2)/C2'

# Good: Average using Excel function
sheet['D20'] = '=AVERAGE(D2:D19)'

Dies gilt für ALLE Berechnungen – Summen, Prozentsätze, Verhältnisse, Differenzen usw. Die Kalkulationstabelle sollte in der Lage sein, neu zu berechnen, wenn sich Quelldaten ändern.

Gemeinsamer Arbeitsablauf

Tool auswählen: Pandas für Daten, openpyxl für Formeln/Formatierung
Erstellen/Laden: Neue Arbeitsmappe erstellen oder vorhandene Datei laden
Ändern: Daten, Formeln und Formatierungen hinzufügen/bearbeiten
Speichern: In Datei schreiben
Formeln neu berechnen (OBLIGATORISCH, WENN FORMELN VERWENDET WERDEN): Verwenden Sie das Skript scripts/recalc.py
```
python scripts/recalc.py output.xlsx
```
Überprüfen und beheben Sie alle Fehler:
- Das Skript gibt JSON mit Fehlerdetails zurück
- Wennstatus``errors_foundist, überprüfen Sieerror_summaryauf bestimmte Fehlertypen und -orte
- Beheben Sie die identifizierten Fehler und führen Sie erneut eine Neuberechnung durch
- Häufige zu behebende Fehler:
  - #REF!: Ungültige Zellbezüge
  - #DIV/0!: Division durch Null
  - #VALUE!: Falscher Datentyp in der Formel
  - #NAME?: Unbekannter Formelname

Erstellen neuer Excel-Dateien

# Using openpyxl for formulas and formatting
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active

# Add data
sheet['A1'] = 'Hello'
sheet['B1'] = 'World'
sheet.append(['Row', 'of', 'data'])

# Add formula
sheet['B2'] = '=SUM(A1:A10)'

# Formatting
sheet['A1'].font = Font(bold=True, color='FF0000')
sheet['A1'].fill = PatternFill('solid', start_color='FFFF00')
sheet['A1'].alignment = Alignment(horizontal='center')

# Column width
sheet.column_dimensions['A'].width = 20

wb.save('output.xlsx')

Bearbeiten vorhandener Excel-Dateien

# Using openpyxl to preserve formulas and formatting
from openpyxl import load_workbook

# Load existing file
wb = load_workbook('existing.xlsx')
sheet = wb.active  # or wb['SheetName'] for specific sheet

# Working with multiple sheets
for sheet_name in wb.sheetnames:
    sheet = wb[sheet_name]
    print(f"Sheet: {sheet_name}")

# Modify cells
sheet['A1'] = 'New Value'
sheet.insert_rows(2)  # Insert row at position 2
sheet.delete_cols(3)  # Delete column 3

# Add new sheet
new_sheet = wb.create_sheet('NewSheet')
new_sheet['A1'] = 'Data'

wb.save('modified.xlsx')

Formeln neu berechnen

Von openpyxl erstellte oder geänderte Excel-Dateien enthalten Formeln als Zeichenfolgen, jedoch keine berechneten Werte. Verwenden Sie das bereitgestelltescripts/recalc.py-Skript, um Formeln neu zu berechnen:

python scripts/recalc.py <excel_file> [timeout_seconds]

Beispiel:

python scripts/recalc.py output.xlsx 30

Das Skript:

Richtet das LibreOffice-Makro beim ersten Start automatisch ein
Berechnet alle Formeln in allen Blättern neu
Durchsucht ALLE Zellen nach Excel-Fehlern (#REF!, #DIV/0! usw.)
Gibt JSON mit detaillierten Fehlerorten und -anzahlen zurück
Funktioniert sowohl unter Linux als auch unter macOS

Checkliste zur Formelüberprüfung

Schnelle Überprüfungen, um sicherzustellen, dass Formeln korrekt funktionieren:

Wesentliche Überprüfung

Testen Sie 2-3 Beispielreferenzen: Stellen Sie sicher, dass sie korrekte Werte abrufen, bevor Sie das vollständige Modell erstellen
Spaltenzuordnung: Übereinstimmung der Excel-Spalten bestätigen (z. B. Spalte 64 = BL, nicht BK)
Zeilenversatz: Denken Sie daran, dass Excel-Zeilen 1-indiziert sind (DataFrame-Zeile 5 = Excel-Zeile 6)

Häufige Fallstricke

NaN-Behandlung: Mitpd.notna()auf Nullwerte prüfen
Spalten ganz rechts: Geschäftsjahresdaten häufig in Spalten 50+
Mehrere Übereinstimmungen: Alle Vorkommen durchsuchen, nicht nur das erste
Division durch Null: Überprüfen Sie die Nenner, bevor Sie/in Formeln verwenden (#DIV/0!)
Falsche Referenzen: Überprüfen Sie, ob alle Zellreferenzen auf die beabsichtigten Zellen verweisen (#REF!)
Blattübergreifende Verweise: Verwenden Sie das richtige Format (Blatt1!A1) zum Verknüpfen von Blättern

Formelteststrategie

Klein anfangen: Testen Sie Formeln an 2-3 Zellen, bevor Sie sie großflächig anwenden
Abhängigkeiten überprüfen: Überprüfen Sie, ob alle in Formeln referenzierten Zellen vorhanden sind
Testrandfälle: Null, negative und sehr große Werte einschließen

Interpretieren der Ausgabe von scripts/recalc.py

Das Skript gibt JSON mit Fehlerdetails zurück:

{
  "status": "success",           // or "errors_found"
  "total_errors": 0,              // Total error count
  "total_formulas": 42,           // Number of formulas in file
  "error_summary": {              // Only present if errors found
    "#REF!": {
      "count": 2,
      "locations": ["Sheet1!B5", "Sheet1!C10"]
    }
  }
}

Best Practices

Bibliotheksauswahl

Pandas: Am besten für Datenanalyse, Massenoperationen und einfachen Datenexport geeignet
openpyxl: Am besten für komplexe Formatierungen, Formeln und Excel-spezifische Funktionen geeignet

Arbeiten mit openpyxl

Zellindizes basieren auf 1 (Zeile=1, Spalte=1 bezieht sich auf Zelle A1)
Verwenden Siedata_only=True, um berechnete Werte zu lesen:load_workbook('file.xlsx', data_only=True)
Warnung: Beim Öffnen mitdata_only=Trueund Speichern werden Formeln durch Werte ersetzt und gehen dauerhaft verloren
Für große Dateien: Verwenden Sieread_only=Truezum Lesen oderwrite_only=Truezum Schreiben
Formeln bleiben erhalten, aber nicht ausgewertet – verwenden Sie scripts/recalc.py, um Werte zu aktualisieren

Arbeiten mit Pandas

Geben Sie Datentypen an, um Inferenzprobleme zu vermeiden:pd.read_excel('file.xlsx', dtype={'id': str})
Lesen Sie bei großen Dateien bestimmte Spalten:pd.read_excel('file.xlsx', usecols=['A', 'C', 'E'])
Gehen Sie richtig mit Datumsangaben um:pd.read_excel('file.xlsx', parse_dates=['date_column'])

Richtlinien zum Codestil

WICHTIG: Beim Generieren von Python-Code für Excel-Operationen:

Schreiben Sie minimalen, prägnanten Python-Code ohne unnötige Kommentare
Vermeiden Sie ausführliche Variablennamen und redundante Operationen
Vermeiden Sie unnötige Druckanweisungen

Für Excel-Dateien selbst:

Fügen Sie Kommentare zu Zellen mit komplexen Formeln oder wichtigen Annahmen hinzu
Dokumentieren Sie Datenquellen für hartcodierte Werte
Fügen Sie Hinweise zu wichtigen Berechnungen und Modellabschnitten hinzu

Ressourcendateien

LIZENZ.txt

LICENSE.txt herunterladen

Binäre Ressource

scripts/office/helpers/init.py

Scripts/office/helpers/init.py herunterladen

Binäre Ressource

scripts/office/helpers/merge_runs.py

Download scripts/office/helpers/merge_runs.py

"""Merge adjacent runs with identical formatting in DOCX.

Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).

Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""

from pathlib import Path

import defusedxml.minidom


def merge_runs(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        _remove_elements(root, "proofErr")
        _strip_run_rsid_attrs(root)

        containers = {run.parentNode for run in _find_elements(root, "r")}

        merge_count = 0
        for container in containers:
            merge_count += _merge_runs_in(container)

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Merged {merge_count} runs"

    except Exception as e:
        return 0, f"Error: {e}"




def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def _get_child(parent, tag: str):
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                return child
    return None


def _get_children(parent, tag: str) -> list:
    results = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(child)
    return results


def _is_adjacent(elem1, elem2) -> bool:
    node = elem1.nextSibling
    while node:
        if node == elem2:
            return True
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return False




def _remove_elements(root, tag: str):
    for elem in _find_elements(root, tag):
        if elem.parentNode:
            elem.parentNode.removeChild(elem)


def _strip_run_rsid_attrs(root):
    for run in _find_elements(root, "r"):
        for attr in list(run.attributes.values()):
            if "rsid" in attr.name.lower():
                run.removeAttribute(attr.name)




def _merge_runs_in(container) -> int:
    merge_count = 0
    run = _first_child_run(container)

    while run:
        while True:
            next_elem = _next_element_sibling(run)
            if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
                _merge_run_content(run, next_elem)
                container.removeChild(next_elem)
                merge_count += 1
            else:
                break

        _consolidate_text(run)
        run = _next_sibling_run(run)

    return merge_count


def _first_child_run(container):
    for child in container.childNodes:
        if child.nodeType == child.ELEMENT_NODE and _is_run(child):
            return child
    return None


def _next_element_sibling(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


def _next_sibling_run(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if _is_run(sibling):
                return sibling
        sibling = sibling.nextSibling
    return None


def _is_run(node) -> bool:
    name = node.localName or node.tagName
    return name == "r" or name.endswith(":r")


def _can_merge(run1, run2) -> bool:
    rpr1 = _get_child(run1, "rPr")
    rpr2 = _get_child(run2, "rPr")

    if (rpr1 is None) != (rpr2 is None):
        return False
    if rpr1 is None:
        return True
    return rpr1.toxml() == rpr2.toxml()  


def _merge_run_content(target, source):
    for child in list(source.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name != "rPr" and not name.endswith(":rPr"):
                target.appendChild(child)


def _consolidate_text(run):
    t_elements = _get_children(run, "t")

    for i in range(len(t_elements) - 1, 0, -1):
        curr, prev = t_elements[i], t_elements[i - 1]

        if _is_adjacent(prev, curr):
            prev_text = prev.firstChild.data if prev.firstChild else ""
            curr_text = curr.firstChild.data if curr.firstChild else ""
            merged = prev_text + curr_text

            if prev.firstChild:
                prev.firstChild.data = merged
            else:
                prev.appendChild(run.ownerDocument.createTextNode(merged))

            if merged.startswith(" ") or merged.endswith(" "):
                prev.setAttribute("xml:space", "preserve")
            elif prev.hasAttribute("xml:space"):
                prev.removeAttribute("xml:space")

            run.removeChild(curr)

scripts/office/helpers/simplify_redlines.py

Download scripts/office/helpers/simplify_redlines.py

"""Simplify tracked changes by merging adjacent w:ins or w:del elements.

Merges adjacent <w:ins> elements from the same author into a single element.
Same for <w:del> elements. This makes heavily-redlined documents easier to
work with by reducing the number of tracked change wrappers.

Rules:
- Only merges w:ins with w:ins, w:del with w:del (same element type)
- Only merges if same author (ignores timestamp differences)
- Only merges if truly adjacent (only whitespace between them)
"""

import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

import defusedxml.minidom

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def simplify_redlines(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        merge_count = 0

        containers = _find_elements(root, "p") + _find_elements(root, "tc")

        for container in containers:
            merge_count += _merge_tracked_changes_in(container, "ins")
            merge_count += _merge_tracked_changes_in(container, "del")

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Simplified {merge_count} tracked changes"

    except Exception as e:
        return 0, f"Error: {e}"


def _merge_tracked_changes_in(container, tag: str) -> int:
    merge_count = 0

    tracked = [
        child
        for child in container.childNodes
        if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
    ]

    if len(tracked) < 2:
        return 0

    i = 0
    while i < len(tracked) - 1:
        curr = tracked[i]
        next_elem = tracked[i + 1]

        if _can_merge_tracked(curr, next_elem):
            _merge_tracked_content(curr, next_elem)
            container.removeChild(next_elem)
            tracked.pop(i + 1)
            merge_count += 1
        else:
            i += 1

    return merge_count


def _is_element(node, tag: str) -> bool:
    name = node.localName or node.tagName
    return name == tag or name.endswith(f":{tag}")


def _get_author(elem) -> str:
    author = elem.getAttribute("w:author")
    if not author:
        for attr in elem.attributes.values():
            if attr.localName == "author" or attr.name.endswith(":author"):
                return attr.value
    return author


def _can_merge_tracked(elem1, elem2) -> bool:
    if _get_author(elem1) != _get_author(elem2):
        return False

    node = elem1.nextSibling
    while node and node != elem2:
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling

    return True


def _merge_tracked_content(target, source):
    while source.firstChild:
        child = source.firstChild
        source.removeChild(child)
        target.appendChild(child)


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
    if not doc_xml_path.exists():
        return {}

    try:
        tree = ET.parse(doc_xml_path)
        root = tree.getroot()
    except ET.ParseError:
        return {}

    namespaces = {"w": WORD_NS}
    author_attr = f"{{{WORD_NS}}}author"

    authors: dict[str, int] = {}
    for tag in ["ins", "del"]:
        for elem in root.findall(f".//w:{tag}", namespaces):
            author = elem.get(author_attr)
            if author:
                authors[author] = authors.get(author, 0) + 1

    return authors


def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
    try:
        with zipfile.ZipFile(docx_path, "r") as zf:
            if "word/document.xml" not in zf.namelist():
                return {}
            with zf.open("word/document.xml") as f:
                tree = ET.parse(f)
                root = tree.getroot()

                namespaces = {"w": WORD_NS}
                author_attr = f"{{{WORD_NS}}}author"

                authors: dict[str, int] = {}
                for tag in ["ins", "del"]:
                    for elem in root.findall(f".//w:{tag}", namespaces):
                        author = elem.get(author_attr)
                        if author:
                            authors[author] = authors.get(author, 0) + 1
                return authors
    except (zipfile.BadZipFile, ET.ParseError):
        return {}


def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
    modified_xml = modified_dir / "word" / "document.xml"
    modified_authors = get_tracked_change_authors(modified_xml)

    if not modified_authors:
        return default

    original_authors = _get_authors_from_docx(original_docx)

    new_changes: dict[str, int] = {}
    for author, count in modified_authors.items():
        original_count = original_authors.get(author, 0)
        diff = count - original_count
        if diff > 0:
            new_changes[author] = diff

    if not new_changes:
        return default

    if len(new_changes) == 1:
        return next(iter(new_changes))

    raise ValueError(
        f"Multiple authors added new changes: {new_changes}. "
        "Cannot infer which author to validate."
    )

scripts/office/pack.py

Scripts/office/pack.py herunterladen

"""Pack a directory into a DOCX, PPTX, or XLSX file.

Validates with auto-repair, condenses XML formatting, and creates the Office file.

Usage:
    python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]

Examples:
    python pack.py unpacked/ output.docx --original input.docx
    python pack.py unpacked/ output.pptx --validate false
"""

import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path

import defusedxml.minidom

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator

def pack(
    input_directory: str,
    output_file: str,
    original_file: str | None = None,
    validate: bool = True,
    infer_author_func=None,
) -> tuple[None, str]:
    input_dir = Path(input_directory)
    output_path = Path(output_file)
    suffix = output_path.suffix.lower()

    if not input_dir.is_dir():
        return None, f"Error: {input_dir} is not a directory"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"

    if validate and original_file:
        original_path = Path(original_file)
        if original_path.exists():
            success, output = _run_validation(
                input_dir, original_path, suffix, infer_author_func
            )
            if output:
                print(output)
            if not success:
                return None, f"Error: Validation failed for {input_dir}"

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                _condense_xml(xml_file)

        output_path.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

    return None, f"Successfully packed {input_dir} to {output_file}"


def _run_validation(
    unpacked_dir: Path,
    original_file: Path,
    suffix: str,
    infer_author_func=None,
) -> tuple[bool, str | None]:
    output_lines = []
    validators = []

    if suffix == ".docx":
        author = "Claude"
        if infer_author_func:
            try:
                author = infer_author_func(unpacked_dir, original_file)
            except ValueError as e:
                print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)

        validators = [
            DOCXSchemaValidator(unpacked_dir, original_file),
            RedliningValidator(unpacked_dir, original_file, author=author),
        ]
    elif suffix == ".pptx":
        validators = [PPTXSchemaValidator(unpacked_dir, original_file)]

    if not validators:
        return True, None

    total_repairs = sum(v.repair() for v in validators)
    if total_repairs:
        output_lines.append(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        output_lines.append("All validations PASSED!")

    return success, "\n".join(output_lines) if output_lines else None


def _condense_xml(xml_file: Path) -> None:
    try:
        with open(xml_file, encoding="utf-8") as f:
            dom = defusedxml.minidom.parse(f)

        for element in dom.getElementsByTagName("*"):
            if element.tagName.endswith(":t"):
                continue

            for child in list(element.childNodes):
                if (
                    child.nodeType == child.TEXT_NODE
                    and child.nodeValue
                    and child.nodeValue.strip() == ""
                ) or child.nodeType == child.COMMENT_NODE:
                    element.removeChild(child)

        xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
    except Exception as e:
        print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Pack a directory into a DOCX, PPTX, or XLSX file"
    )
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument(
        "--original",
        help="Original file for validation comparison",
    )
    parser.add_argument(
        "--validate",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Run validation with auto-repair (default: true)",
    )
    args = parser.parse_args()

    _, message = pack(
        args.input_directory,
        args.output_file,
        original_file=args.original,
        validate=args.validate,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)

scripts/office/schemas/ISO-IEC29500-4_2016/dml-chart.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-chart.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/dml-main.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-main.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/dml-picture.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-picture.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/pml.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/pml.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-math.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-math.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/sml.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/sml.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/vml-main.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/vml-main.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/wml.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/wml.xsd

Binäre Ressource

scripts/office/schemas/ISO-IEC29500-4_2016/xml.xsd

Download scripts/office/schemas/ISO-IEC29500-4_2016/xml.xsd

Binäre Ressource

scripts/office/schemas/ecma/fouth-edition/opc-contentTypes.xsd

Download scripts/office/schemas/ecma/fouth-edition/opc-contentTypes.xsd

Binäre Ressource

scripts/office/schemas/ecma/fouth-edition/opc-coreProperties.xsd

Download scripts/office/schemas/ecma/fouth-edition/opc-coreProperties.xsd

Binäre Ressource

scripts/office/schemas/ecma/fouth-edition/opc-digSig.xsd

Download scripts/office/schemas/ecma/fouth-edition/opc-digSig.xsd

Binäre Ressource

scripts/office/schemas/ecma/fouth-edition/opc-relationships.xsd

Download scripts/office/schemas/ecma/fouth-edition/opc-relationships.xsd

Binäre Ressource

scripts/office/schemas/mce/mc.xsd

Download scripts/office/schemas/mce/mc.xsd

Binäre Ressource

scripts/office/schemas/microsoft/wml-2010.xsd

Download scripts/office/schemas/microsoft/wml-2010.xsd

Binäre Ressource

scripts/office/schemas/microsoft/wml-2012.xsd

Scripts/office/schemas/microsoft/wml-2012.xsd herunterladen

Binäre Ressource

scripts/office/schemas/microsoft/wml-2018.xsd

Scripts/office/schemas/microsoft/wml-2018.xsd herunterladen

Binäre Ressource

scripts/office/schemas/microsoft/wml-cex-2018.xsd

Scripts/office/schemas/microsoft/wml-cex-2018.xsd herunterladen

Binäre Ressource

scripts/office/schemas/microsoft/wml-cid-2016.xsd

Scripts/office/schemas/microsoft/wml-cid-2016.xsd herunterladen

Binäre Ressource

scripts/office/schemas/microsoft/wml-sdtdatahash-2020.xsd

Download scripts/office/schemas/microsoft/wml-sdtdatahash-2020.xsd

Binäre Ressource

scripts/office/schemas/microsoft/wml-symex-2015.xsd

Download scripts/office/schemas/microsoft/wml-symex-2015.xsd

Binäre Ressource

scripts/office/soffice.py

Scripts/office/soffice.py herunterladen

"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs).  Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.

Usage:
    from office.soffice import run_soffice, get_soffice_env

    # Option 1 – run soffice directly
    result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])

    # Option 2 – get env dict for your own subprocess calls
    env = get_soffice_env()
    subprocess.run(["soffice", ...], env=env)
"""

import os
import socket
import subprocess
import tempfile
from pathlib import Path


def get_soffice_env() -> dict:
    env = os.environ.copy()
    env["SAL_USE_VCLPLUGIN"] = "svp"

    if _needs_shim():
        shim = _ensure_shim()
        env["LD_PRELOAD"] = str(shim)

    return env


def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
    env = get_soffice_env()
    return subprocess.run(["soffice"] + args, env=env, **kwargs)



_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"


def _needs_shim() -> bool:
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.close()
        return False
    except OSError:
        return True


def _ensure_shim() -> Path:
    if _SHIM_SO.exists():
        return _SHIM_SO

    src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
    src.write_text(_SHIM_SOURCE)
    subprocess.run(
        ["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
        check=True,
        capture_output=True,
    )
    src.unlink()
    return _SHIM_SO



_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);

/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024];            /* accept() blocks reading this */
static int wake_w[1024];            /* close()  writes to this      */
static int listener_fd = -1;        /* FD that received listen()    */

__attribute__((constructor))
static void init(void) {
    real_socket     = dlsym(RTLD_NEXT, "socket");
    real_socketpair = dlsym(RTLD_NEXT, "socketpair");
    real_listen     = dlsym(RTLD_NEXT, "listen");
    real_accept     = dlsym(RTLD_NEXT, "accept");
    real_close      = dlsym(RTLD_NEXT, "close");
    real_read       = dlsym(RTLD_NEXT, "read");
    for (int i = 0; i < 1024; i++) {
        peer_of[i] = -1;
        wake_r[i]  = -1;
        wake_w[i]  = -1;
    }
}

/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
    if (domain == AF_UNIX) {
        int fd = real_socket(domain, type, protocol);
        if (fd >= 0) return fd;
        /* socket(AF_UNIX) blocked – fall back to socketpair(). */
        int sv[2];
        if (real_socketpair(domain, type, protocol, sv) == 0) {
            if (sv[0] >= 0 && sv[0] < 1024) {
                is_shimmed[sv[0]] = 1;
                peer_of[sv[0]]    = sv[1];
                int wp[2];
                if (pipe(wp) == 0) {
                    wake_r[sv[0]] = wp[0];
                    wake_w[sv[0]] = wp[1];
                }
            }
            return sv[0];
        }
        errno = EPERM;
        return -1;
    }
    return real_socket(domain, type, protocol);
}

/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        listener_fd = sockfd;
        return 0;
    }
    return real_listen(sockfd, backlog);
}

/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        /* Block until close() writes to the wake pipe. */
        if (wake_r[sockfd] >= 0) {
            char buf;
            real_read(wake_r[sockfd], &buf, 1);
        }
        errno = ECONNABORTED;
        return -1;
    }
    return real_accept(sockfd, addr, addrlen);
}

/* ---- close ----------------------------------------------------------- */
int close(int fd) {
    if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
        int was_listener = (fd == listener_fd);
        is_shimmed[fd] = 0;

        if (wake_w[fd] >= 0) {              /* unblock accept() */
            char c = 0;
            write(wake_w[fd], &c, 1);
            real_close(wake_w[fd]);
            wake_w[fd] = -1;
        }
        if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd]  = -1; }
        if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }

        if (was_listener)
            _exit(0);                        /* conversion done – exit */
    }
    return real_close(fd);
}
"""



if __name__ == "__main__":
    import sys
    result = run_soffice(sys.argv[1:])
    sys.exit(result.returncode)

scripts/office/unpack.py

Scripts/office/unpack.py herunterladen

"""Unpack Office files (DOCX, PPTX, XLSX) for editing.

Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)

Usage:
    python unpack.py <office_file> <output_dir> [options]

Examples:
    python unpack.py document.docx unpacked/
    python unpack.py presentation.pptx unpacked/
    python unpack.py document.docx unpacked/ --merge-runs false
"""

import argparse
import sys
import zipfile
from pathlib import Path

import defusedxml.minidom

from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines

SMART_QUOTE_REPLACEMENTS = {
    "\u201c": "&#x201C;",  
    "\u201d": "&#x201D;",  
    "\u2018": "&#x2018;",  
    "\u2019": "&#x2019;",  
}


def unpack(
    input_file: str,
    output_directory: str,
    merge_runs: bool = True,
    simplify_redlines: bool = True,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_directory)
    suffix = input_path.suffix.lower()

    if not input_path.exists():
        return None, f"Error: {input_file} does not exist"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"

    try:
        output_path.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(input_path, "r") as zf:
            zf.extractall(output_path)

        xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
        for xml_file in xml_files:
            _pretty_print_xml(xml_file)

        message = f"Unpacked {input_file} ({len(xml_files)} XML files)"

        if suffix == ".docx":
            if simplify_redlines:
                simplify_count, _ = do_simplify_redlines(str(output_path))
                message += f", simplified {simplify_count} tracked changes"

            if merge_runs:
                merge_count, _ = do_merge_runs(str(output_path))
                message += f", merged {merge_count} runs"

        for xml_file in xml_files:
            _escape_smart_quotes(xml_file)

        return None, message

    except zipfile.BadZipFile:
        return None, f"Error: {input_file} is not a valid Office file"
    except Exception as e:
        return None, f"Error unpacking: {e}"


def _pretty_print_xml(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        dom = defusedxml.minidom.parseString(content)
        xml_file.write_bytes(dom.toprettyxml(indent="  ", encoding="utf-8"))
    except Exception:
        pass  


def _escape_smart_quotes(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        for char, entity in SMART_QUOTE_REPLACEMENTS.items():
            content = content.replace(char, entity)
        xml_file.write_text(content, encoding="utf-8")
    except Exception:
        pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
    )
    parser.add_argument("input_file", help="Office file to unpack")
    parser.add_argument("output_directory", help="Output directory")
    parser.add_argument(
        "--merge-runs",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
    )
    parser.add_argument(
        "--simplify-redlines",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
    )
    args = parser.parse_args()

    _, message = unpack(
        args.input_file,
        args.output_directory,
        merge_runs=args.merge_runs,
        simplify_redlines=args.simplify_redlines,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)

scripts/office/validate.py

Scripts/office/validate.py herunterladen

"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.

Usage:
    python validate.py <path> [--original <original_file>] [--auto-repair] [--author NAME]

The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory

Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""

import argparse
import sys
import tempfile
import zipfile
from pathlib import Path

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "path",
        help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "--original",
        required=False,
        default=None,
        help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    parser.add_argument(
        "--auto-repair",
        action="store_true",
        help="Automatically repair common issues (hex IDs, whitespace preservation)",
    )
    parser.add_argument(
        "--author",
        default="Claude",
        help="Author name for redlining validation (default: Claude)",
    )
    args = parser.parse_args()

    path = Path(args.path)
    assert path.exists(), f"Error: {path} does not exist"

    original_file = None
    if args.original:
        original_file = Path(args.original)
        assert original_file.is_file(), f"Error: {original_file} is not a file"
        assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
            f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
        )

    file_extension = (original_file or path).suffix.lower()
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
    )

    if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
        temp_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path, "r") as zf:
            zf.extractall(temp_dir)
        unpacked_dir = Path(temp_dir)
    else:
        assert path.is_dir(), f"Error: {path} is not a directory or Office file"
        unpacked_dir = path

    match file_extension:
        case ".docx":
            validators = [
                DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
            if original_file:
                validators.append(
                    RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)  
                )
        case ".pptx":
            validators = [
                PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    if args.auto_repair:
        total_repairs = sum(v.repair() for v in validators)
        if total_repairs:
            print(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()

scripts/office/validators/init.py

Scripts/office/validators/init.py herunterladen

"""
Validation modules for Word document processing.
"""

from .base import BaseSchemaValidator
from .docx import DOCXSchemaValidator
from .pptx import PPTXSchemaValidator
from .redlining import RedliningValidator

__all__ = [
    "BaseSchemaValidator",
    "DOCXSchemaValidator",
    "PPTXSchemaValidator",
    "RedliningValidator",
]

scripts/office/validators/base.py

Scripts/office/validators/base.py herunterladen

Binäre Ressource

scripts/office/validators/docx.py

Scripts/office/validators/docx.py herunterladen

Binäre Ressource

scripts/office/validators/pptx.py

Scripts/office/validators/pptx.py herunterladen

Binäre Ressource

scripts/office/validators/redlining.py

Download scripts/office/validators/redlining.py

Binäre Ressource

scripts/recalc.py

Scripts/recalc.py herunterladen

"""
Excel Formula Recalculation Script
Recalculates all formulas in an Excel file using LibreOffice
"""

import json
import os
import platform
import subprocess
import sys
from pathlib import Path

from office.soffice import get_soffice_env

from openpyxl import load_workbook

MACRO_DIR_MACOS = "~/Library/Application Support/LibreOffice/4/user/basic/Standard"
MACRO_DIR_LINUX = "~/.config/libreoffice/4/user/basic/Standard"
MACRO_FILENAME = "Module1.xba"

RECALCULATE_MACRO = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
    Sub RecalculateAndSave()
      ThisComponent.calculateAll()
      ThisComponent.store()
      ThisComponent.close(True)
    End Sub
</script:module>"""


def has_gtimeout():
    try:
        subprocess.run(
            ["gtimeout", "--version"], capture_output=True, timeout=1, check=False
        )
        return True
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False


def setup_libreoffice_macro():
    macro_dir = os.path.expanduser(
        MACRO_DIR_MACOS if platform.system() == "Darwin" else MACRO_DIR_LINUX
    )
    macro_file = os.path.join(macro_dir, MACRO_FILENAME)

    if (
        os.path.exists(macro_file)
        and "RecalculateAndSave" in Path(macro_file).read_text()
    ):
        return True

    if not os.path.exists(macro_dir):
        subprocess.run(
            ["soffice", "--headless", "--terminate_after_init"],
            capture_output=True,
            timeout=10,
            env=get_soffice_env(),
        )
        os.makedirs(macro_dir, exist_ok=True)

    try:
        Path(macro_file).write_text(RECALCULATE_MACRO)
        return True
    except Exception:
        return False


def recalc(filename, timeout=30):
    if not Path(filename).exists():
        return {"error": f"File {filename} does not exist"}

    abs_path = str(Path(filename).absolute())

    if not setup_libreoffice_macro():
        return {"error": "Failed to setup LibreOffice macro"}

    cmd = [
        "soffice",
        "--headless",
        "--norestore",
        "vnd.sun.star.script:Standard.Module1.RecalculateAndSave?language=Basic&location=application",
        abs_path,
    ]

    if platform.system() == "Linux":
        cmd = ["timeout", str(timeout)] + cmd
    elif platform.system() == "Darwin" and has_gtimeout():
        cmd = ["gtimeout", str(timeout)] + cmd

    result = subprocess.run(cmd, capture_output=True, text=True, env=get_soffice_env())

    if result.returncode != 0 and result.returncode != 124:  
        error_msg = result.stderr or "Unknown error during recalculation"
        if "Module1" in error_msg or "RecalculateAndSave" not in error_msg:
            return {"error": "LibreOffice macro not configured properly"}
        return {"error": error_msg}

    try:
        wb = load_workbook(filename, data_only=True)

        excel_errors = [
            "#VALUE!",
            "#DIV/0!",
            "#REF!",
            "#NAME?",
            "#NULL!",
            "#NUM!",
            "#N/A",
        ]
        error_details = {err: [] for err in excel_errors}
        total_errors = 0

        for sheet_name in wb.sheetnames:
            ws = wb[sheet_name]
            for row in ws.iter_rows():
                for cell in row:
                    if cell.value is not None and isinstance(cell.value, str):
                        for err in excel_errors:
                            if err in cell.value:
                                location = f"{sheet_name}!{cell.coordinate}"
                                error_details[err].append(location)
                                total_errors += 1
                                break

        wb.close()

        result = {
            "status": "success" if total_errors == 0 else "errors_found",
            "total_errors": total_errors,
            "error_summary": {},
        }

        for err_type, locations in error_details.items():
            if locations:
                result["error_summary"][err_type] = {
                    "count": len(locations),
                    "locations": locations[:20],  
                }

        wb_formulas = load_workbook(filename, data_only=False)
        formula_count = 0
        for sheet_name in wb_formulas.sheetnames:
            ws = wb_formulas[sheet_name]
            for row in ws.iter_rows():
                for cell in row:
                    if (
                        cell.value
                        and isinstance(cell.value, str)
                        and cell.value.startswith("=")
                    ):
                        formula_count += 1
        wb_formulas.close()

        result["total_formulas"] = formula_count

        return result

    except Exception as e:
        return {"error": str(e)}


def main():
    if len(sys.argv) < 2:
        print("Usage: python recalc.py <excel_file> [timeout_seconds]")
        print("\nRecalculates all formulas in an Excel file using LibreOffice")
        print("\nReturns JSON with error details:")
        print("  - status: 'success' or 'errors_found'")
        print("  - total_errors: Total number of Excel errors found")
        print("  - total_formulas: Number of formulas in the file")
        print("  - error_summary: Breakdown by error type with locations")
        print("    - #VALUE!, #DIV/0!, #REF!, #NAME?, #NULL!, #NUM!, #N/A")
        sys.exit(1)

    filename = sys.argv[1]
    timeout = int(sys.argv[2]) if len(sys.argv) > 2 else 30

    result = recalc(filename, timeout)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()

Siehe in GitHub

Quelle: Inhalt angepasst von Anthropics/Skills (MIT).

Alle Excel-Dateien

Professionelle Schriftart

Verwenden Sie für alle Lieferungen eine konsistente, professionelle Schriftart (z. B. Arial, Times New Roman), sofern der Benutzer nichts anderes anweist

Keine Formelfehler

Jedes Excel-Modell MUSS mit NULL Formelfehlern geliefert werden (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?)

Vorhandene Vorlagen beibehalten (beim Aktualisieren von Vorlagen)

Studieren Sie vorhandene Formate, Stile und Konventionen und passen Sie sie beim Ändern von Dateien GENAU an
Erzwingen Sie niemals eine standardisierte Formatierung von Dateien mit etablierten Mustern
Bestehende Vorlagenkonventionen setzen diese Richtlinien IMMER außer Kraft

Finanzmodelle

Farbcodierungsstandards

Sofern vom Benutzer oder der vorhandenen Vorlage nicht anders angegeben

Branchenübliche Farbkonventionen

Blauer Text (RGB: 0,0,255): Hartcodierte Eingaben und Zahlen, die Benutzer für Szenarien ändern
Schwarzer Text (RGB: 0,0,0): ALLE Formeln und Berechnungen
Grüner Text (RGB: 0,128,0): Links, die von anderen Arbeitsblättern innerhalb derselben Arbeitsmappe abgerufen werden
Roter Text (RGB: 255,0,0): Externe Links zu anderen Dateien
Gelber Hintergrund (RGB: 255.255,0): Wichtige Annahmen, die Aufmerksamkeit erfordern, oder Zellen, die aktualisiert werden müssen

Standards für die Zahlenformatierung

Erforderliche Formatregeln

Jahre: Als Textzeichenfolgen formatieren (z. B. „2024“, nicht „2.024“)
Währung: Verwenden Sie das Format $#,##0; Geben Sie IMMER Einheiten in den Kopfzeilen an („Umsatz ($mm)“)
Nullen: Verwenden Sie Zahlenformatierung, um alle Nullen zu „-“ zu machen, einschließlich Prozentangaben (z. B. „$#,##0;($#,##0);-“)
Prozentsätze: Standardmäßig 0,0 %-Format (eine Dezimalstelle)
Multiples: Format als 0,0x für Bewertungsmultiplikatoren (EV/EBITDA, KGV)
Negative Zahlen: Verwenden Sie Klammern (123), nicht minus -123

Formelkonstruktionsregeln

Platzierung der Annahmen

Platzieren Sie ALLE Annahmen (Wachstumsraten, Margen, Vielfache usw.) in separaten Annahmezellen
Verwenden Sie Zellbezüge anstelle von fest codierten Werten in Formeln
Beispiel: Verwenden Sie =B5*(1+$B$6) anstelle von =B5*1,05

Verhinderung von Formelfehlern

Überprüfen Sie, ob alle Zellbezüge korrekt sind
Überprüfen Sie die Bereiche auf Abweichungsfehler
Sorgen Sie für konsistente Formeln über alle Prognosezeiträume hinweg
Test mit Randfällen (Nullwerte, negative Zahlen)
Stellen Sie sicher, dass es keine unbeabsichtigten Zirkelverweise gibt

Dokumentationsanforderungen für Hardcodes

Kommentar oder in den Zellen daneben (am Ende der Tabelle). Format: „Quelle: [System/Dokument], [Datum], [Spezifische Referenz], [URL, falls zutreffend]“
Beispiele:
- „Quelle: Unternehmen 10-K, GJ2024, Seite 45, Umsatzmitteilung, [SEC EDGAR URL]“
- „Quelle: Unternehmen 10-Q, Q2 2025, Anlage 99.1, [SEC EDGAR URL]“
- „Quelle: Bloomberg Terminal, 15.08.2025, AAPL US Equity“
- „Quelle: FactSet, 20.08.2025, Bildschirm „Konsensschätzungen““

XLSX-Erstellung, Bearbeitung und Analyse

Übersicht

Wichtige Anforderungen

Daten lesen und analysieren

Datenanalyse mit Pandas

Verwenden Sie für Datenanalyse, Visualisierung und grundlegende Vorgänge Pandas, das leistungsstarke Datenbearbeitungsfunktionen bietet:

import pandas as pd

# Read Excel
df = pd.read_excel('file.xlsx')  # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None)  # All sheets as dict

# Analyze
df.head()      # Preview data
df.info()      # Column info
df.describe()  # Statistics

# Write Excel
df.to_excel('output.xlsx', index=False)

# Bad: Calculating in Python and hardcoding result
total = df['Sales'].sum()
sheet['B10'] = total  # Hardcodes 5000

# Bad: Computing growth rate in Python
growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue']
sheet['C5'] = growth  # Hardcodes 0.15

# Bad: Python calculation for average
avg = sum(values) / len(values)
sheet['D20'] = avg  # Hardcodes 42.5

RICHTIG – Excel-Formeln verwenden

# Good: Let Excel calculate the sum
sheet['B10'] = '=SUM(B2:B9)'

# Good: Growth rate as Excel formula
sheet['C5'] = '=(C4-C2)/C2'

# Good: Average using Excel function
sheet['D20'] = '=AVERAGE(D2:D19)'

Dies gilt für ALLE Berechnungen – Summen, Prozentsätze, Verhältnisse, Differenzen usw. Die Kalkulationstabelle sollte in der Lage sein, neu zu berechnen, wenn sich Quelldaten ändern.

Gemeinsamer Arbeitsablauf

Tool auswählen: Pandas für Daten, openpyxl für Formeln/Formatierung
Erstellen/Laden: Neue Arbeitsmappe erstellen oder vorhandene Datei laden
Ändern: Daten, Formeln und Formatierungen hinzufügen/bearbeiten
Speichern: In Datei schreiben
Formeln neu berechnen (OBLIGATORISCH, WENN FORMELN VERWENDET WERDEN): Verwenden Sie das Skript scripts/recalc.py
```
python scripts/recalc.py output.xlsx
```
Überprüfen und beheben Sie alle Fehler:
- Das Skript gibt JSON mit Fehlerdetails zurück
- Wennstatus``errors_foundist, überprüfen Sieerror_summaryauf bestimmte Fehlertypen und -orte
- Beheben Sie die identifizierten Fehler und führen Sie erneut eine Neuberechnung durch
- Häufige zu behebende Fehler:
  - #REF!: Ungültige Zellbezüge
  - #DIV/0!: Division durch Null
  - #VALUE!: Falscher Datentyp in der Formel
  - #NAME?: Unbekannter Formelname

Erstellen neuer Excel-Dateien

# Using openpyxl for formulas and formatting
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active

# Add data
sheet['A1'] = 'Hello'
sheet['B1'] = 'World'
sheet.append(['Row', 'of', 'data'])

# Add formula
sheet['B2'] = '=SUM(A1:A10)'

# Formatting
sheet['A1'].font = Font(bold=True, color='FF0000')
sheet['A1'].fill = PatternFill('solid', start_color='FFFF00')
sheet['A1'].alignment = Alignment(horizontal='center')

# Column width
sheet.column_dimensions['A'].width = 20

wb.save('output.xlsx')

Bearbeiten vorhandener Excel-Dateien

# Using openpyxl to preserve formulas and formatting
from openpyxl import load_workbook

# Load existing file
wb = load_workbook('existing.xlsx')
sheet = wb.active  # or wb['SheetName'] for specific sheet

# Working with multiple sheets
for sheet_name in wb.sheetnames:
    sheet = wb[sheet_name]
    print(f"Sheet: {sheet_name}")

# Modify cells
sheet['A1'] = 'New Value'
sheet.insert_rows(2)  # Insert row at position 2
sheet.delete_cols(3)  # Delete column 3

# Add new sheet
new_sheet = wb.create_sheet('NewSheet')
new_sheet['A1'] = 'Data'

wb.save('modified.xlsx')

Formeln neu berechnen

python scripts/recalc.py <excel_file> [timeout_seconds]

Beispiel:

python scripts/recalc.py output.xlsx 30

Das Skript:

Richtet das LibreOffice-Makro beim ersten Start automatisch ein
Berechnet alle Formeln in allen Blättern neu
Durchsucht ALLE Zellen nach Excel-Fehlern (#REF!, #DIV/0! usw.)
Gibt JSON mit detaillierten Fehlerorten und -anzahlen zurück
Funktioniert sowohl unter Linux als auch unter macOS

Checkliste zur Formelüberprüfung

Schnelle Überprüfungen, um sicherzustellen, dass Formeln korrekt funktionieren:

Wesentliche Überprüfung

Testen Sie 2-3 Beispielreferenzen: Stellen Sie sicher, dass sie korrekte Werte abrufen, bevor Sie das vollständige Modell erstellen
Spaltenzuordnung: Übereinstimmung der Excel-Spalten bestätigen (z. B. Spalte 64 = BL, nicht BK)
Zeilenversatz: Denken Sie daran, dass Excel-Zeilen 1-indiziert sind (DataFrame-Zeile 5 = Excel-Zeile 6)

Häufige Fallstricke

NaN-Behandlung: Mitpd.notna()auf Nullwerte prüfen
Spalten ganz rechts: Geschäftsjahresdaten häufig in Spalten 50+
Mehrere Übereinstimmungen: Alle Vorkommen durchsuchen, nicht nur das erste
Division durch Null: Überprüfen Sie die Nenner, bevor Sie/in Formeln verwenden (#DIV/0!)
Falsche Referenzen: Überprüfen Sie, ob alle Zellreferenzen auf die beabsichtigten Zellen verweisen (#REF!)
Blattübergreifende Verweise: Verwenden Sie das richtige Format (Blatt1!A1) zum Verknüpfen von Blättern

Formelteststrategie

Klein anfangen: Testen Sie Formeln an 2-3 Zellen, bevor Sie sie großflächig anwenden
Abhängigkeiten überprüfen: Überprüfen Sie, ob alle in Formeln referenzierten Zellen vorhanden sind
Testrandfälle: Null, negative und sehr große Werte einschließen

Interpretieren der Ausgabe von scripts/recalc.py

Das Skript gibt JSON mit Fehlerdetails zurück:

{
  "status": "success",           // or "errors_found"
  "total_errors": 0,              // Total error count
  "total_formulas": 42,           // Number of formulas in file
  "error_summary": {              // Only present if errors found
    "#REF!": {
      "count": 2,
      "locations": ["Sheet1!B5", "Sheet1!C10"]
    }
  }
}

Best Practices

Bibliotheksauswahl

Pandas: Am besten für Datenanalyse, Massenoperationen und einfachen Datenexport geeignet
openpyxl: Am besten für komplexe Formatierungen, Formeln und Excel-spezifische Funktionen geeignet

Arbeiten mit openpyxl

Zellindizes basieren auf 1 (Zeile=1, Spalte=1 bezieht sich auf Zelle A1)
Verwenden Siedata_only=True, um berechnete Werte zu lesen:load_workbook('file.xlsx', data_only=True)
Warnung: Beim Öffnen mitdata_only=Trueund Speichern werden Formeln durch Werte ersetzt und gehen dauerhaft verloren
Für große Dateien: Verwenden Sieread_only=Truezum Lesen oderwrite_only=Truezum Schreiben
Formeln bleiben erhalten, aber nicht ausgewertet – verwenden Sie scripts/recalc.py, um Werte zu aktualisieren

Arbeiten mit Pandas

Geben Sie Datentypen an, um Inferenzprobleme zu vermeiden:pd.read_excel('file.xlsx', dtype={'id': str})
Lesen Sie bei großen Dateien bestimmte Spalten:pd.read_excel('file.xlsx', usecols=['A', 'C', 'E'])
Gehen Sie richtig mit Datumsangaben um:pd.read_excel('file.xlsx', parse_dates=['date_column'])

Richtlinien zum Codestil

WICHTIG: Beim Generieren von Python-Code für Excel-Operationen:

Schreiben Sie minimalen, prägnanten Python-Code ohne unnötige Kommentare
Vermeiden Sie ausführliche Variablennamen und redundante Operationen
Vermeiden Sie unnötige Druckanweisungen

Für Excel-Dateien selbst:

Fügen Sie Kommentare zu Zellen mit komplexen Formeln oder wichtigen Annahmen hinzu
Dokumentieren Sie Datenquellen für hartcodierte Werte
Fügen Sie Hinweise zu wichtigen Berechnungen und Modellabschnitten hinzu

"""Merge adjacent runs with identical formatting in DOCX.

Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).

Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""

from pathlib import Path

import defusedxml.minidom


def merge_runs(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        _remove_elements(root, "proofErr")
        _strip_run_rsid_attrs(root)

        containers = {run.parentNode for run in _find_elements(root, "r")}

        merge_count = 0
        for container in containers:
            merge_count += _merge_runs_in(container)

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Merged {merge_count} runs"

    except Exception as e:
        return 0, f"Error: {e}"




def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def _get_child(parent, tag: str):
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                return child
    return None


def _get_children(parent, tag: str) -> list:
    results = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(child)
    return results


def _is_adjacent(elem1, elem2) -> bool:
    node = elem1.nextSibling
    while node:
        if node == elem2:
            return True
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return False




def _remove_elements(root, tag: str):
    for elem in _find_elements(root, tag):
        if elem.parentNode:
            elem.parentNode.removeChild(elem)


def _strip_run_rsid_attrs(root):
    for run in _find_elements(root, "r"):
        for attr in list(run.attributes.values()):
            if "rsid" in attr.name.lower():
                run.removeAttribute(attr.name)




def _merge_runs_in(container) -> int:
    merge_count = 0
    run = _first_child_run(container)

    while run:
        while True:
            next_elem = _next_element_sibling(run)
            if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
                _merge_run_content(run, next_elem)
                container.removeChild(next_elem)
                merge_count += 1
            else:
                break

        _consolidate_text(run)
        run = _next_sibling_run(run)

    return merge_count


def _first_child_run(container):
    for child in container.childNodes:
        if child.nodeType == child.ELEMENT_NODE and _is_run(child):
            return child
    return None


def _next_element_sibling(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


def _next_sibling_run(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if _is_run(sibling):
                return sibling
        sibling = sibling.nextSibling
    return None


def _is_run(node) -> bool:
    name = node.localName or node.tagName
    return name == "r" or name.endswith(":r")


def _can_merge(run1, run2) -> bool:
    rpr1 = _get_child(run1, "rPr")
    rpr2 = _get_child(run2, "rPr")

    if (rpr1 is None) != (rpr2 is None):
        return False
    if rpr1 is None:
        return True
    return rpr1.toxml() == rpr2.toxml()  


def _merge_run_content(target, source):
    for child in list(source.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name != "rPr" and not name.endswith(":rPr"):
                target.appendChild(child)


def _consolidate_text(run):
    t_elements = _get_children(run, "t")

    for i in range(len(t_elements) - 1, 0, -1):
        curr, prev = t_elements[i], t_elements[i - 1]

        if _is_adjacent(prev, curr):
            prev_text = prev.firstChild.data if prev.firstChild else ""
            curr_text = curr.firstChild.data if curr.firstChild else ""
            merged = prev_text + curr_text

            if prev.firstChild:
                prev.firstChild.data = merged
            else:
                prev.appendChild(run.ownerDocument.createTextNode(merged))

            if merged.startswith(" ") or merged.endswith(" "):
                prev.setAttribute("xml:space", "preserve")
            elif prev.hasAttribute("xml:space"):
                prev.removeAttribute("xml:space")

            run.removeChild(curr)

scripts/office/helpers/simplify_redlines.py

Download scripts/office/helpers/simplify_redlines.py

"""Simplify tracked changes by merging adjacent w:ins or w:del elements.

Merges adjacent <w:ins> elements from the same author into a single element.
Same for <w:del> elements. This makes heavily-redlined documents easier to
work with by reducing the number of tracked change wrappers.

Rules:
- Only merges w:ins with w:ins, w:del with w:del (same element type)
- Only merges if same author (ignores timestamp differences)
- Only merges if truly adjacent (only whitespace between them)
"""

import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

import defusedxml.minidom

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def simplify_redlines(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        merge_count = 0

        containers = _find_elements(root, "p") + _find_elements(root, "tc")

        for container in containers:
            merge_count += _merge_tracked_changes_in(container, "ins")
            merge_count += _merge_tracked_changes_in(container, "del")

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Simplified {merge_count} tracked changes"

    except Exception as e:
        return 0, f"Error: {e}"


def _merge_tracked_changes_in(container, tag: str) -> int:
    merge_count = 0

    tracked = [
        child
        for child in container.childNodes
        if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
    ]

    if len(tracked) < 2:
        return 0

    i = 0
    while i < len(tracked) - 1:
        curr = tracked[i]
        next_elem = tracked[i + 1]

        if _can_merge_tracked(curr, next_elem):
            _merge_tracked_content(curr, next_elem)
            container.removeChild(next_elem)
            tracked.pop(i + 1)
            merge_count += 1
        else:
            i += 1

    return merge_count


def _is_element(node, tag: str) -> bool:
    name = node.localName or node.tagName
    return name == tag or name.endswith(f":{tag}")


def _get_author(elem) -> str:
    author = elem.getAttribute("w:author")
    if not author:
        for attr in elem.attributes.values():
            if attr.localName == "author" or attr.name.endswith(":author"):
                return attr.value
    return author


def _can_merge_tracked(elem1, elem2) -> bool:
    if _get_author(elem1) != _get_author(elem2):
        return False

    node = elem1.nextSibling
    while node and node != elem2:
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling

    return True


def _merge_tracked_content(target, source):
    while source.firstChild:
        child = source.firstChild
        source.removeChild(child)
        target.appendChild(child)


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
    if not doc_xml_path.exists():
        return {}

    try:
        tree = ET.parse(doc_xml_path)
        root = tree.getroot()
    except ET.ParseError:
        return {}

    namespaces = {"w": WORD_NS}
    author_attr = f"{{{WORD_NS}}}author"

    authors: dict[str, int] = {}
    for tag in ["ins", "del"]:
        for elem in root.findall(f".//w:{tag}", namespaces):
            author = elem.get(author_attr)
            if author:
                authors[author] = authors.get(author, 0) + 1

    return authors


def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
    try:
        with zipfile.ZipFile(docx_path, "r") as zf:
            if "word/document.xml" not in zf.namelist():
                return {}
            with zf.open("word/document.xml") as f:
                tree = ET.parse(f)
                root = tree.getroot()

                namespaces = {"w": WORD_NS}
                author_attr = f"{{{WORD_NS}}}author"

                authors: dict[str, int] = {}
                for tag in ["ins", "del"]:
                    for elem in root.findall(f".//w:{tag}", namespaces):
                        author = elem.get(author_attr)
                        if author:
                            authors[author] = authors.get(author, 0) + 1
                return authors
    except (zipfile.BadZipFile, ET.ParseError):
        return {}


def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
    modified_xml = modified_dir / "word" / "document.xml"
    modified_authors = get_tracked_change_authors(modified_xml)

    if not modified_authors:
        return default

    original_authors = _get_authors_from_docx(original_docx)

    new_changes: dict[str, int] = {}
    for author, count in modified_authors.items():
        original_count = original_authors.get(author, 0)
        diff = count - original_count
        if diff > 0:
            new_changes[author] = diff

    if not new_changes:
        return default

    if len(new_changes) == 1:
        return next(iter(new_changes))

    raise ValueError(
        f"Multiple authors added new changes: {new_changes}. "
        "Cannot infer which author to validate."
    )

scripts/office/pack.py

Scripts/office/pack.py herunterladen

"""Pack a directory into a DOCX, PPTX, or XLSX file.

Validates with auto-repair, condenses XML formatting, and creates the Office file.

Usage:
    python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]

Examples:
    python pack.py unpacked/ output.docx --original input.docx
    python pack.py unpacked/ output.pptx --validate false
"""

import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path

import defusedxml.minidom

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator

def pack(
    input_directory: str,
    output_file: str,
    original_file: str | None = None,
    validate: bool = True,
    infer_author_func=None,
) -> tuple[None, str]:
    input_dir = Path(input_directory)
    output_path = Path(output_file)
    suffix = output_path.suffix.lower()

    if not input_dir.is_dir():
        return None, f"Error: {input_dir} is not a directory"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"

    if validate and original_file:
        original_path = Path(original_file)
        if original_path.exists():
            success, output = _run_validation(
                input_dir, original_path, suffix, infer_author_func
            )
            if output:
                print(output)
            if not success:
                return None, f"Error: Validation failed for {input_dir}"

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                _condense_xml(xml_file)

        output_path.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

    return None, f"Successfully packed {input_dir} to {output_file}"


def _run_validation(
    unpacked_dir: Path,
    original_file: Path,
    suffix: str,
    infer_author_func=None,
) -> tuple[bool, str | None]:
    output_lines = []
    validators = []

    if suffix == ".docx":
        author = "Claude"
        if infer_author_func:
            try:
                author = infer_author_func(unpacked_dir, original_file)
            except ValueError as e:
                print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)

        validators = [
            DOCXSchemaValidator(unpacked_dir, original_file),
            RedliningValidator(unpacked_dir, original_file, author=author),
        ]
    elif suffix == ".pptx":
        validators = [PPTXSchemaValidator(unpacked_dir, original_file)]

    if not validators:
        return True, None

    total_repairs = sum(v.repair() for v in validators)
    if total_repairs:
        output_lines.append(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        output_lines.append("All validations PASSED!")

    return success, "\n".join(output_lines) if output_lines else None


def _condense_xml(xml_file: Path) -> None:
    try:
        with open(xml_file, encoding="utf-8") as f:
            dom = defusedxml.minidom.parse(f)

        for element in dom.getElementsByTagName("*"):
            if element.tagName.endswith(":t"):
                continue

            for child in list(element.childNodes):
                if (
                    child.nodeType == child.TEXT_NODE
                    and child.nodeValue
                    and child.nodeValue.strip() == ""
                ) or child.nodeType == child.COMMENT_NODE:
                    element.removeChild(child)

        xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
    except Exception as e:
        print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Pack a directory into a DOCX, PPTX, or XLSX file"
    )
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument(
        "--original",
        help="Original file for validation comparison",
    )
    parser.add_argument(
        "--validate",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Run validation with auto-repair (default: true)",
    )
    args = parser.parse_args()

    _, message = pack(
        args.input_directory,
        args.output_file,
        original_file=args.original,
        validate=args.validate,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)

"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs).  Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.

Usage:
    from office.soffice import run_soffice, get_soffice_env

    # Option 1 – run soffice directly
    result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])

    # Option 2 – get env dict for your own subprocess calls
    env = get_soffice_env()
    subprocess.run(["soffice", ...], env=env)
"""

import os
import socket
import subprocess
import tempfile
from pathlib import Path


def get_soffice_env() -> dict:
    env = os.environ.copy()
    env["SAL_USE_VCLPLUGIN"] = "svp"

    if _needs_shim():
        shim = _ensure_shim()
        env["LD_PRELOAD"] = str(shim)

    return env


def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
    env = get_soffice_env()
    return subprocess.run(["soffice"] + args, env=env, **kwargs)



_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"


def _needs_shim() -> bool:
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.close()
        return False
    except OSError:
        return True


def _ensure_shim() -> Path:
    if _SHIM_SO.exists():
        return _SHIM_SO

    src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
    src.write_text(_SHIM_SOURCE)
    subprocess.run(
        ["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
        check=True,
        capture_output=True,
    )
    src.unlink()
    return _SHIM_SO



_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);

/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024];            /* accept() blocks reading this */
static int wake_w[1024];            /* close()  writes to this      */
static int listener_fd = -1;        /* FD that received listen()    */

__attribute__((constructor))
static void init(void) {
    real_socket     = dlsym(RTLD_NEXT, "socket");
    real_socketpair = dlsym(RTLD_NEXT, "socketpair");
    real_listen     = dlsym(RTLD_NEXT, "listen");
    real_accept     = dlsym(RTLD_NEXT, "accept");
    real_close      = dlsym(RTLD_NEXT, "close");
    real_read       = dlsym(RTLD_NEXT, "read");
    for (int i = 0; i < 1024; i++) {
        peer_of[i] = -1;
        wake_r[i]  = -1;
        wake_w[i]  = -1;
    }
}

/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
    if (domain == AF_UNIX) {
        int fd = real_socket(domain, type, protocol);
        if (fd >= 0) return fd;
        /* socket(AF_UNIX) blocked – fall back to socketpair(). */
        int sv[2];
        if (real_socketpair(domain, type, protocol, sv) == 0) {
            if (sv[0] >= 0 && sv[0] < 1024) {
                is_shimmed[sv[0]] = 1;
                peer_of[sv[0]]    = sv[1];
                int wp[2];
                if (pipe(wp) == 0) {
                    wake_r[sv[0]] = wp[0];
                    wake_w[sv[0]] = wp[1];
                }
            }
            return sv[0];
        }
        errno = EPERM;
        return -1;
    }
    return real_socket(domain, type, protocol);
}

/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        listener_fd = sockfd;
        return 0;
    }
    return real_listen(sockfd, backlog);
}

/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        /* Block until close() writes to the wake pipe. */
        if (wake_r[sockfd] >= 0) {
            char buf;
            real_read(wake_r[sockfd], &buf, 1);
        }
        errno = ECONNABORTED;
        return -1;
    }
    return real_accept(sockfd, addr, addrlen);
}

/* ---- close ----------------------------------------------------------- */
int close(int fd) {
    if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
        int was_listener = (fd == listener_fd);
        is_shimmed[fd] = 0;

        if (wake_w[fd] >= 0) {              /* unblock accept() */
            char c = 0;
            write(wake_w[fd], &c, 1);
            real_close(wake_w[fd]);
            wake_w[fd] = -1;
        }
        if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd]  = -1; }
        if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }

        if (was_listener)
            _exit(0);                        /* conversion done – exit */
    }
    return real_close(fd);
}
"""



if __name__ == "__main__":
    import sys
    result = run_soffice(sys.argv[1:])
    sys.exit(result.returncode)

scripts/office/unpack.py

Scripts/office/unpack.py herunterladen

"""Unpack Office files (DOCX, PPTX, XLSX) for editing.

Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)

Usage:
    python unpack.py <office_file> <output_dir> [options]

Examples:
    python unpack.py document.docx unpacked/
    python unpack.py presentation.pptx unpacked/
    python unpack.py document.docx unpacked/ --merge-runs false
"""

import argparse
import sys
import zipfile
from pathlib import Path

import defusedxml.minidom

from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines

SMART_QUOTE_REPLACEMENTS = {
    "\u201c": "&#x201C;",  
    "\u201d": "&#x201D;",  
    "\u2018": "&#x2018;",  
    "\u2019": "&#x2019;",  
}


def unpack(
    input_file: str,
    output_directory: str,
    merge_runs: bool = True,
    simplify_redlines: bool = True,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_directory)
    suffix = input_path.suffix.lower()

    if not input_path.exists():
        return None, f"Error: {input_file} does not exist"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"

    try:
        output_path.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(input_path, "r") as zf:
            zf.extractall(output_path)

        xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
        for xml_file in xml_files:
            _pretty_print_xml(xml_file)

        message = f"Unpacked {input_file} ({len(xml_files)} XML files)"

        if suffix == ".docx":
            if simplify_redlines:
                simplify_count, _ = do_simplify_redlines(str(output_path))
                message += f", simplified {simplify_count} tracked changes"

            if merge_runs:
                merge_count, _ = do_merge_runs(str(output_path))
                message += f", merged {merge_count} runs"

        for xml_file in xml_files:
            _escape_smart_quotes(xml_file)

        return None, message

    except zipfile.BadZipFile:
        return None, f"Error: {input_file} is not a valid Office file"
    except Exception as e:
        return None, f"Error unpacking: {e}"


def _pretty_print_xml(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        dom = defusedxml.minidom.parseString(content)
        xml_file.write_bytes(dom.toprettyxml(indent="  ", encoding="utf-8"))
    except Exception:
        pass  


def _escape_smart_quotes(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        for char, entity in SMART_QUOTE_REPLACEMENTS.items():
            content = content.replace(char, entity)
        xml_file.write_text(content, encoding="utf-8")
    except Exception:
        pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
    )
    parser.add_argument("input_file", help="Office file to unpack")
    parser.add_argument("output_directory", help="Output directory")
    parser.add_argument(
        "--merge-runs",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
    )
    parser.add_argument(
        "--simplify-redlines",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
    )
    args = parser.parse_args()

    _, message = unpack(
        args.input_file,
        args.output_directory,
        merge_runs=args.merge_runs,
        simplify_redlines=args.simplify_redlines,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)

scripts/office/validate.py

Scripts/office/validate.py herunterladen

"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.

Usage:
    python validate.py <path> [--original <original_file>] [--auto-repair] [--author NAME]

The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory

Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""

import argparse
import sys
import tempfile
import zipfile
from pathlib import Path

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "path",
        help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "--original",
        required=False,
        default=None,
        help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    parser.add_argument(
        "--auto-repair",
        action="store_true",
        help="Automatically repair common issues (hex IDs, whitespace preservation)",
    )
    parser.add_argument(
        "--author",
        default="Claude",
        help="Author name for redlining validation (default: Claude)",
    )
    args = parser.parse_args()

    path = Path(args.path)
    assert path.exists(), f"Error: {path} does not exist"

    original_file = None
    if args.original:
        original_file = Path(args.original)
        assert original_file.is_file(), f"Error: {original_file} is not a file"
        assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
            f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
        )

    file_extension = (original_file or path).suffix.lower()
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
    )

    if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
        temp_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path, "r") as zf:
            zf.extractall(temp_dir)
        unpacked_dir = Path(temp_dir)
    else:
        assert path.is_dir(), f"Error: {path} is not a directory or Office file"
        unpacked_dir = path

    match file_extension:
        case ".docx":
            validators = [
                DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
            if original_file:
                validators.append(
                    RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)  
                )
        case ".pptx":
            validators = [
                PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    if args.auto_repair:
        total_repairs = sum(v.repair() for v in validators)
        if total_repairs:
            print(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()

scripts/office/validators/init.py

Scripts/office/validators/init.py herunterladen

"""
Validation modules for Word document processing.
"""

from .base import BaseSchemaValidator
from .docx import DOCXSchemaValidator
from .pptx import PPTXSchemaValidator
from .redlining import RedliningValidator

__all__ = [
    "BaseSchemaValidator",
    "DOCXSchemaValidator",
    "PPTXSchemaValidator",
    "RedliningValidator",
]

"""
Excel Formula Recalculation Script
Recalculates all formulas in an Excel file using LibreOffice
"""

import json
import os
import platform
import subprocess
import sys
from pathlib import Path

from office.soffice import get_soffice_env

from openpyxl import load_workbook

MACRO_DIR_MACOS = "~/Library/Application Support/LibreOffice/4/user/basic/Standard"
MACRO_DIR_LINUX = "~/.config/libreoffice/4/user/basic/Standard"
MACRO_FILENAME = "Module1.xba"

RECALCULATE_MACRO = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
    Sub RecalculateAndSave()
      ThisComponent.calculateAll()
      ThisComponent.store()
      ThisComponent.close(True)
    End Sub
</script:module>"""


def has_gtimeout():
    try:
        subprocess.run(
            ["gtimeout", "--version"], capture_output=True, timeout=1, check=False
        )
        return True
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False


def setup_libreoffice_macro():
    macro_dir = os.path.expanduser(
        MACRO_DIR_MACOS if platform.system() == "Darwin" else MACRO_DIR_LINUX
    )
    macro_file = os.path.join(macro_dir, MACRO_FILENAME)

    if (
        os.path.exists(macro_file)
        and "RecalculateAndSave" in Path(macro_file).read_text()
    ):
        return True

    if not os.path.exists(macro_dir):
        subprocess.run(
            ["soffice", "--headless", "--terminate_after_init"],
            capture_output=True,
            timeout=10,
            env=get_soffice_env(),
        )
        os.makedirs(macro_dir, exist_ok=True)

    try:
        Path(macro_file).write_text(RECALCULATE_MACRO)
        return True
    except Exception:
        return False


def recalc(filename, timeout=30):
    if not Path(filename).exists():
        return {"error": f"File {filename} does not exist"}

    abs_path = str(Path(filename).absolute())

    if not setup_libreoffice_macro():
        return {"error": "Failed to setup LibreOffice macro"}

    cmd = [
        "soffice",
        "--headless",
        "--norestore",
        "vnd.sun.star.script:Standard.Module1.RecalculateAndSave?language=Basic&location=application",
        abs_path,
    ]

    if platform.system() == "Linux":
        cmd = ["timeout", str(timeout)] + cmd
    elif platform.system() == "Darwin" and has_gtimeout():
        cmd = ["gtimeout", str(timeout)] + cmd

    result = subprocess.run(cmd, capture_output=True, text=True, env=get_soffice_env())

    if result.returncode != 0 and result.returncode != 124:  
        error_msg = result.stderr or "Unknown error during recalculation"
        if "Module1" in error_msg or "RecalculateAndSave" not in error_msg:
            return {"error": "LibreOffice macro not configured properly"}
        return {"error": error_msg}

    try:
        wb = load_workbook(filename, data_only=True)

        excel_errors = [
            "#VALUE!",
            "#DIV/0!",
            "#REF!",
            "#NAME?",
            "#NULL!",
            "#NUM!",
            "#N/A",
        ]
        error_details = {err: [] for err in excel_errors}
        total_errors = 0

        for sheet_name in wb.sheetnames:
            ws = wb[sheet_name]
            for row in ws.iter_rows():
                for cell in row:
                    if cell.value is not None and isinstance(cell.value, str):
                        for err in excel_errors:
                            if err in cell.value:
                                location = f"{sheet_name}!{cell.coordinate}"
                                error_details[err].append(location)
                                total_errors += 1
                                break

        wb.close()

        result = {
            "status": "success" if total_errors == 0 else "errors_found",
            "total_errors": total_errors,
            "error_summary": {},
        }

        for err_type, locations in error_details.items():
            if locations:
                result["error_summary"][err_type] = {
                    "count": len(locations),
                    "locations": locations[:20],  
                }

        wb_formulas = load_workbook(filename, data_only=False)
        formula_count = 0
        for sheet_name in wb_formulas.sheetnames:
            ws = wb_formulas[sheet_name]
            for row in ws.iter_rows():
                for cell in row:
                    if (
                        cell.value
                        and isinstance(cell.value, str)
                        and cell.value.startswith("=")
                    ):
                        formula_count += 1
        wb_formulas.close()

        result["total_formulas"] = formula_count

        return result

    except Exception as e:
        return {"error": str(e)}


def main():
    if len(sys.argv) < 2:
        print("Usage: python recalc.py <excel_file> [timeout_seconds]")
        print("\nRecalculates all formulas in an Excel file using LibreOffice")
        print("\nReturns JSON with error details:")
        print("  - status: 'success' or 'errors_found'")
        print("  - total_errors: Total number of Excel errors found")
        print("  - total_formulas: Number of formulas in the file")
        print("  - error_summary: Breakdown by error type with locations")
        print("    - #VALUE!, #DIV/0!, #REF!, #NAME?, #NULL!, #NUM!, #N/A")
        sys.exit(1)

    filename = sys.argv[1]
    timeout = int(sys.argv[2]) if len(sys.argv) > 2 else 30

    result = recalc(filename, timeout)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()

Xlsx

Inhaltsverzeichnis

Xlsx