Docx
Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of "Word doc", "Word document", ".docx", or requests to produce professional documents with formatting such as tables of contents, headings, page numbers, or headers. Also use it when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a "report", "memo", "letter", "template", or a similar deliverable as a Word or .docx file, use this skill. Do NOT use it for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation.
Source: Content adapted from anthropics/skills (MIT).
Overview
A .docx file is a ZIP archive containing XML files.
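To make the ZIP-of-XML structure concrete, here is a minimal Python sketch that uses only the standard library. It builds a tiny in-memory archive with placeholder parts (the XML bodies are hypothetical stand-ins, so Word would not open it) and lists the parts the same way you could list those of a real .docx.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted part names stored inside a .docx (ZIP) archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def build_demo_archive() -> bytes:
    """Build a tiny in-memory ZIP that mimics the layout of a .docx.

    The XML bodies are placeholders for illustration only, not valid WordprocessingML.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


demo_parts = list_parts(build_demo_archive())
print(demo_parts)
```

For a real file, pass the bytes read from disk to list_parts instead of the demo archive.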
Quick Reference
| Task | Approach |
|---|---|
| Read/analyze content | pandoc, or unpack for raw XML |
| Create a new document | Use docx-js - see Creating New Documents below |
| Edit an existing document | Unpack -> edit XML -> repack - see Editing Existing Documents below |
Converting .doc to .docx
Legacy .doc files must be converted before editing:
python scripts/office/soffice.py --headless --convert-to docx document.doc
Reading Content
# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md
# Raw XML access
python scripts/office/unpack.py document.docx unpacked/
Converting to Images
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
Accepting Tracked Changes
To produce a clean document with all tracked changes accepted (requires LibreOffice):
python scripts/accept_changes.py input.docx output.docx
Creating New Documents
Generate .docx files with JavaScript, then validate. Install: npm install -g docx
Setup
const fs = require('fs'); // needed for fs.writeFileSync below
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
        Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
        InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab,
        PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader,
        TabStopType, TabStopPosition, Column, SectionType,
        TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
        VerticalAlign, PageNumber, PageBreak } = require('docx');
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
python scripts/office/validate.py doc.docx
Page Size
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
  properties: {
    page: {
      size: {
        width: 12240,  // 8.5 inches in DXA
        height: 15840  // 11 inches in DXA
      },
      margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
    }
  },
  children: [/* content */]
}]
Common page sizes (DXA units, 1440 DXA = 1 inch):
| Paper | Width | Height | Content Width (1" margins) |
|---|---|---|---|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
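The numbers in the table follow from simple arithmetic, sketched here in Python. The helper names are illustrative and not part of docx-js; the only facts used are 1440 DXA per inch and the 1-inch margins stated above.

```python
DXA_PER_INCH = 1440  # from the text above: 1440 DXA = 1 inch


def inches_to_dxa(inches: float) -> int:
    """Convert inches to DXA (twentieths of a point)."""
    return round(inches * DXA_PER_INCH)


def content_width_dxa(page_width: int, left_margin: int = 1440, right_margin: int = 1440) -> int:
    """Usable width between the left and right margins, in DXA."""
    return page_width - left_margin - right_margin


letter_width = inches_to_dxa(8.5)                 # US Letter short edge
letter_content = content_width_dxa(letter_width)  # portrait, 1-inch margins
a4_content = content_width_dxa(11906)             # A4 width taken from the table
print(letter_width, letter_content, a4_content)
```

The same subtraction gives the width to use for full-width tables.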
Landscape orientation: docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
size: {
  width: 12240,   // Pass SHORT edge as width
  height: 15840,  // Pass LONG edge as height
  orientation: PageOrientation.LANDSCAPE  // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)
Styles (Override Built-in Headings)
Use Arial as the default font (universally supported). Keep headings black for readability.
const doc = new Document({
  styles: {
    default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
    paragraphStyles: [
      // IMPORTANT: Use exact IDs to override built-in styles
      { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 32, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
      { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 28, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
    ]
  }]
});
Lists (NEVER use unicode bullets)
// WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] })  // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] })  // BAD
// CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
  numbering: {
    config: [
      { reference: "bullets",
        levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
      { reference: "numbers",
        levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ numbering: { reference: "bullets", level: 0 },
        children: [new TextRun("Bullet item")] }),
      new Paragraph({ numbering: { reference: "numbers", level: 0 },
        children: [new TextRun("Numbered item")] }),
    ]
  }]
});
// Each reference creates INDEPENDENT numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)
Tables
CRITICAL: Tables need dual widths - set both columnWidths on the table AND width on each cell. Without both, tables render incorrectly on some platforms.
// CRITICAL: Always set table width for consistent rendering
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
new Table({
  width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
  columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
  rows: [
    new TableRow({
      children: [
        new TableCell({
          borders,
          width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
          shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
          margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
          children: [new Paragraph({ children: [new TextRun("Cell")] })]
        })
      ]
    })
  ]
})
Table width calculation:
Always use WidthType.DXA; WidthType.PERCENTAGE breaks in Google Docs.
// Table width = sum of columnWidths = content width
// US Letter with 1" margins: 12240 - 2880 = 9360 DXA
width: { size: 9360, type: WidthType.DXA },
columnWidths: [7000, 2360] // Must sum to table width
Width rules:
- Always use WidthType.DXA - never WidthType.PERCENTAGE (incompatible with Google Docs)
- Table width must equal the sum of columnWidths
- Cell width must match the corresponding columnWidth
- Cell margins are internal padding: they reduce the content area and do not add to the cell width
- For full-width tables: use the content width (page width minus left and right margins)
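The width rules above are easy to check mechanically before generating a table. The following Python sketch is illustrative only: split_columns and check_table_widths are hypothetical helpers, not docx-js functions, and split_columns assumes the fractions sum to 1.

```python
def split_columns(total_width: int, fractions: list[float]) -> list[int]:
    """Split total_width (DXA) into integer column widths that sum exactly to it."""
    widths = [int(total_width * fraction) for fraction in fractions]
    widths[-1] += total_width - sum(widths)  # last column absorbs any rounding remainder
    return widths


def check_table_widths(table_width: int, column_widths: list[int], cell_widths: list[int]) -> list[str]:
    """Return a list of rule violations; an empty list means the widths are consistent."""
    problems = []
    if sum(column_widths) != table_width:
        problems.append(f"columnWidths sum to {sum(column_widths)}, expected {table_width}")
    if list(cell_widths) != list(column_widths):
        problems.append("cell widths do not match columnWidths")
    return problems


columns = split_columns(9360, [0.75, 0.25])  # 9360 = US Letter content width
print(columns, check_table_widths(9360, columns, columns))
```

Giving the remainder to the last column keeps the sum exact even when a fraction such as one third does not divide the width evenly.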
Images
// CRITICAL: type parameter is REQUIRED
new Paragraph({
  children: [new ImageRun({
    type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
    data: fs.readFileSync("image.png"),
    transformation: { width: 200, height: 150 },
    altText: { title: "Title", description: "Desc", name: "Name" } // All three required
  })]
})
Page Breaks
// CRITICAL: PageBreak must be inside a Paragraph
new Paragraph({ children: [new PageBreak()] })
// Or use pageBreakBefore
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
Hyperlinks
// External link
new Paragraph({
  children: [new ExternalHyperlink({
    children: [new TextRun({ text: "Click here", style: "Hyperlink" })],
    link: "https://example.com",
  })]
})
// Internal link (bookmark + reference)
// 1. Create bookmark at destination
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [
  new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }),
]})
// 2. Link to it
new Paragraph({ children: [new InternalHyperlink({
  children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })],
  anchor: "chapter1",
})]})
Footnotes
const doc = new Document({
  footnotes: {
    1: { children: [new Paragraph("Source: Annual Report 2024")] },
    2: { children: [new Paragraph("See appendix for methodology")] },
  },
  sections: [{
    children: [new Paragraph({
      children: [
        new TextRun("Revenue grew 15%"),
        new FootnoteReferenceRun(1),
        new TextRun(" using adjusted metrics"),
        new FootnoteReferenceRun(2),
      ],
    })]
  }]
});
Tab Stops
// Right-align text on same line (e.g., date opposite a title)
new Paragraph({
  children: [
    new TextRun("Company Name"),
    new TextRun("\tJanuary 2025"),
  ],
  tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
})
// Dot leader (e.g., TOC-style)
new Paragraph({
  children: [
    new TextRun("Introduction"),
    new TextRun({ children: [
      new PositionalTab({
        alignment: PositionalTabAlignment.RIGHT,
        relativeTo: PositionalTabRelativeTo.MARGIN,
        leader: PositionalTabLeader.DOT,
      }),
      "3",
    ]}),
  ],
})
Multi-Column Layouts
// Equal-width columns
sections: [{
  properties: {
    column: {
      count: 2,          // number of columns
      space: 720,        // gap between columns in DXA (720 = 0.5 inch)
      equalWidth: true,
      separate: true,    // vertical line between columns
    },
  },
  children: [/* content flows naturally across columns */]
}]
// Custom-width columns (equalWidth must be false)
sections: [{
  properties: {
    column: {
      equalWidth: false,
      children: [
        new Column({ width: 5400, space: 720 }),
        new Column({ width: 3240 }),
      ],
    },
  },
  children: [/* content */]
}]
Force a column break by starting a new section with type: SectionType.NEXT_COLUMN.
Table of Contents
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
Headers/Footers
sections: [{
  properties: {
    page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
  },
  headers: {
    default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
  },
  footers: {
    default: new Footer({ children: [new Paragraph({
      children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
    })] })
  },
  children: [/* content */]
}]
Critical Rules for docx-js
- Set the page size explicitly - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- Landscape: pass portrait dimensions - docx-js swaps width/height internally; pass the short edge as width, the long edge as height, and set orientation: PageOrientation.LANDSCAPE
- Never use \n - use separate Paragraph elements
- Never use unicode bullets - use LevelFormat.BULLET with a numbering config
- PageBreak must be inside a Paragraph - a standalone PageBreak creates invalid XML
- ImageRun requires type - always specify png/jpg/etc
- Always set table width with DXA - never use WidthType.PERCENTAGE (breaks in Google Docs)
- Tables need dual widths - columnWidths on the table AND width on each cell, and both must match
- Table width = sum of columnWidths - for DXA, make sure they add up exactly
- Always add cell margins - use margins: { top: 80, bottom: 80, left: 120, right: 120 } for readable padding
- Use ShadingType.CLEAR - never SOLID for table shading
- Never use tables as dividers/rules - cells have a minimum height and render as empty boxes (including in headers/footers); instead use border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } } on a Paragraph. For two-column footers, use tab stops (see the Tab Stops section), not tables
- The TOC requires HeadingLevel only - no custom styles on heading paragraphs
- Override built-in styles - use the exact IDs: "Heading1", "Heading2", etc.
- Include outlineLevel - required for the TOC (0 for H1, 1 for H2, etc.)
Editing Existing Documents
Follow all 3 steps in order.
Step 1: Unpack
python scripts/office/unpack.py document.docx unpacked/
Extracts the XML, pretty-prints it, merges adjacent runs, and converts smart quotes to XML entities (&#x201C; etc.) so they survive editing. Use --merge-runs false to skip run merging.
Step 2: Edit XML
Edit the files in unpacked/word/. See the XML Reference below for patterns.
Use "Claude" as the author for tracked changes and comments, unless the user explicitly asks for a different name.
Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
CRITICAL: Use smart quotes for new content. When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
| Entity | Character |
|---|---|
| &#x2018; | ‘ (left single) |
| &#x2019; | ’ (right single / apostrophe) |
| &#x201C; | “ (left double) |
| &#x201D; | ” (right double) |
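To make the entity table concrete, this short Python sketch spells out the mapping as data. It is for illustration only (the editing workflow above still uses the Edit tool, not scripts), and smart_quotes_to_entities is a hypothetical helper name.

```python
# Smart-quote characters and the numeric XML entities listed in the table above.
SMART_QUOTE_ENTITIES = {
    "\u2018": "&#x2018;",  # left single quote
    "\u2019": "&#x2019;",  # right single quote / apostrophe
    "\u201C": "&#x201C;",  # left double quote
    "\u201D": "&#x201D;",  # right double quote
}


def smart_quotes_to_entities(text: str) -> str:
    """Replace each smart-quote character with its numeric XML entity."""
    for char, entity in SMART_QUOTE_ENTITIES.items():
        text = text.replace(char, entity)
    return text


encoded = smart_quotes_to_entities("Here\u2019s a quote: \u201CHello\u201D")
print(encoded)
```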
Adding comments: Use comment.py to handle the boilerplate across multiple XML files (the text must be pre-escaped XML):
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0  # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author"  # custom author name
Then add markers to document.xml (see Comments in the XML Reference).
Step 3: Pack
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Validates with auto-repair, condenses the XML, and creates the DOCX. Use --validate false to skip validation.
Auto-repair will fix:
- durableId >= 0x7FFFFFFF (regenerates a valid ID)
- Missing xml:space="preserve" on <w:t> with whitespace
Auto-repair will not fix:
- Malformed XML, invalid element nesting, missing relationships, schema violations
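Because auto-repair leaves malformed XML untouched, a quick well-formedness check before packing can save a round trip. This is a minimal sketch, not part of pack.py: find_malformed is a hypothetical helper, and it uses the standard-library parser for self-containment (the bundled scripts use defusedxml, which is the safer choice for untrusted files). It only detects well-formedness errors, not schema violations.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def find_malformed(parts: dict[str, str]) -> dict[str, str]:
    """Map part name -> parser error message for every part that is not well-formed XML."""
    errors = {}
    for name, xml_text in parts.items():
        try:
            xml.dom.minidom.parseString(xml_text)
        except ExpatError as exc:
            errors[name] = str(exc)
    return errors


NS = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
sample_parts = {
    "word/ok.xml": f"<w:p {NS}><w:r><w:t>ok</w:t></w:r></w:p>",
    "word/broken.xml": f"<w:p {NS}><w:r><w:t>missing close</w:t></w:p>",  # <w:r> never closed
}
malformed = find_malformed(sample_parts)
print(sorted(malformed))
```

In practice the dictionary would be filled by reading each .xml file under unpacked/.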
Common Pitfalls
- Replace entire <w:r> elements: when adding tracked changes, replace the whole <w:r>...</w:r> block with <w:del>...<w:ins>... as siblings. Do not inject tracked-change tags inside a run.
- Preserve <w:rPr> formatting: copy the original run's <w:rPr> block into your tracked-change runs to keep bold, font size, etc.
XML Reference
Schema Compliance
- Element order in <w:pPr>: <w:pStyle>, <w:numPr>, <w:spacing>, <w:ind>, <w:jc>, with <w:rPr> last
- Whitespace: add xml:space="preserve" to <w:t> elements with leading/trailing spaces
- RSIDs: must be 8-digit hex (e.g., 00AB1234)
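Two of the rules above can be expressed as simple predicates. The Python sketch below is illustrative: the function names are hypothetical, and accepting lowercase hex digits in an RSID is an assumption rather than something the text states.

```python
import re

RSID_PATTERN = re.compile(r"[0-9A-Fa-f]{8}")  # exactly 8 hex digits (case-insensitivity is an assumption)


def is_valid_rsid(value: str) -> bool:
    """True if value consists of exactly 8 hexadecimal digits."""
    return RSID_PATTERN.fullmatch(value) is not None


def needs_space_preserve(text: str) -> bool:
    """True if a <w:t> holding this text should carry xml:space="preserve"."""
    return text != text.strip()


print(is_valid_rsid("00AB1234"), is_valid_rsid("AB1234"), needs_space_preserve(" days."))
```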
Tracked Changes
Insertion:
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:t>inserted text</w:t></w:r>
</w:ins>
Deletion:
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
Inside <w:del>: use <w:delText> instead of <w:t>, and <w:delInstrText> instead of <w:instrText>.
Minimal edits - only mark what changes:
<!-- Change "30 days" to "60 days" -->
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
  <w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
  <w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t> days.</w:t></w:r>
Deleting entire paragraphs/list items - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add <w:del/> inside <w:pPr><w:rPr>:
<w:p>
  <w:pPr>
    <w:numPr>...</w:numPr>  <!-- list numbering if present -->
    <w:rPr>
      <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
    </w:rPr>
  </w:pPr>
  <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
    <w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
  </w:del>
</w:p>
Without the <w:del/> in <w:pPr><w:rPr>, accepting changes leaves an empty paragraph/list item.
Rejecting another author's insertion - nest a deletion inside their insertion:
<w:ins w:author="Jane" w:id="5">
  <w:del w:author="Claude" w:id="10">
    <w:r><w:delText>their inserted text</w:delText></w:r>
  </w:del>
</w:ins>
Restoring another author's deletion - add an insertion after it (do not modify their deletion):
<w:del w:author="Jane" w:id="5">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
  <w:r><w:t>deleted text</w:t></w:r>
</w:ins>
Comments
After running comment.py (see Step 2), add markers to document.xml. For replies, use the --parent flag and nest the reply's markers inside the parent's.
CRITICAL: <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
  <w:commentRangeStart w:id="1"/>
  <w:r><w:t>text</w:t></w:r>
  <w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
Images
- Add the image file to word/media/
- Add a relationship to word/_rels/document.xml.rels:
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
- Add a content type to [Content_Types].xml:
<Default Extension="png" ContentType="image/png"/>
- Reference it in document.xml:
<w:drawing>
  <wp:inline>
    <wp:extent cx="914400" cy="914400"/>  <!-- EMUs: 914400 = 1 inch -->
    <a:graphic>
      <a:graphicData uri=".../picture">
        <pic:pic>
          <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
        </pic:pic>
      </a:graphicData>
    </a:graphic>
  </wp:inline>
</w:drawing>
Dependencies
- pandoc: text extraction
- docx: npm install -g docx (new documents)
- LibreOffice: PDF conversion (auto-configured for sandboxed environments via scripts/office/soffice.py)
- Poppler: pdftoppm for images
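Before starting, it can help to confirm which of these tools are on the PATH. The sketch below is a hedged example using only the standard library; the executable names (pandoc, soffice, pdftoppm, npm) are taken from the commands in this document, and find_tools is a hypothetical helper.

```python
import shutil
from typing import Optional

# Executable names as they appear in the commands throughout this document.
REQUIRED_TOOLS = ("pandoc", "soffice", "pdftoppm", "npm")


def find_tools(names: tuple[str, ...] = REQUIRED_TOOLS) -> dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


tool_paths = find_tools()
for tool, path in tool_paths.items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

A missing soffice on the PATH is not necessarily fatal, since scripts/office/soffice.py is described above as auto-configuring LibreOffice in sandboxed environments.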
Resource Files
LICENSE.txt
Binary resource
scripts/__init__.py
scripts/accept_changes.py
"""Accept all tracked changes in a DOCX file using LibreOffice.

Requires LibreOffice (soffice) to be installed.
"""

import argparse
import logging
import shutil
import subprocess
from pathlib import Path

from office.soffice import get_soffice_env

logger = logging.getLogger(__name__)

LIBREOFFICE_PROFILE = "/tmp/libreoffice_docx_profile"
MACRO_DIR = f"{LIBREOFFICE_PROFILE}/user/basic/Standard"

ACCEPT_CHANGES_MACRO = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
    Sub AcceptAllTrackedChanges()
        Dim document As Object
        Dim dispatcher As Object
        document = ThisComponent.CurrentController.Frame
        dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
        dispatcher.executeDispatch(document, ".uno:AcceptAllTrackedChanges", "", 0, Array())
        ThisComponent.store()
        ThisComponent.close(True)
    End Sub
</script:module>"""


def accept_changes(
    input_file: str,
    output_file: str,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        return None, f"Error: Input file not found: {input_file}"

    if input_path.suffix.lower() != ".docx":
        return None, f"Error: Input file is not a DOCX file: {input_file}"

    # Work on a copy so the input file is never modified in place.
    try:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(input_path, output_path)
    except Exception as e:
        return None, f"Error: Failed to copy input file to output location: {e}"

    if not _setup_libreoffice_macro():
        return None, "Error: Failed to setup LibreOffice macro"

    cmd = [
        "soffice",
        "--headless",
        f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
        "--norestore",
        "vnd.sun.star.script:Standard.Module1.AcceptAllTrackedChanges?language=Basic&location=application",
        str(output_path.absolute()),
    ]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
            env=get_soffice_env(),
        )
    except subprocess.TimeoutExpired:
        # A timeout is reported as success here, exactly as in the original script.
        return (
            None,
            f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
        )

    if result.returncode != 0:
        return None, f"Error: LibreOffice failed: {result.stderr}"

    return (
        None,
        f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
    )


def _setup_libreoffice_macro() -> bool:
    macro_dir = Path(MACRO_DIR)
    macro_file = macro_dir / "Module1.xba"

    if macro_file.exists() and "AcceptAllTrackedChanges" in macro_file.read_text():
        return True

    if not macro_dir.exists():
        # Start LibreOffice once so it initializes the profile directory.
        subprocess.run(
            [
                "soffice",
                "--headless",
                f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
                "--terminate_after_init",
            ],
            capture_output=True,
            timeout=10,
            check=False,
            env=get_soffice_env(),
        )
        macro_dir.mkdir(parents=True, exist_ok=True)

    try:
        macro_file.write_text(ACCEPT_CHANGES_MACRO)
        return True
    except Exception as e:
        logger.warning(f"Failed to setup LibreOffice macro: {e}")
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Accept all tracked changes in a DOCX file"
    )
    parser.add_argument("input_file", help="Input DOCX file with tracked changes")
    parser.add_argument(
        "output_file", help="Output DOCX file (clean, no tracked changes)"
    )
    args = parser.parse_args()

    _, message = accept_changes(args.input_file, args.output_file)
    print(message)

    if "Error" in message:
        raise SystemExit(1)
scripts/comment.py
Binary resource
scripts/office/helpers/__init__.py
Binary resource
scripts/office/helpers/merge_runs.py
"""Merge adjacent runs with identical formatting in DOCX.

Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).

Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""

from pathlib import Path

import defusedxml.minidom


def merge_runs(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"

    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        _remove_elements(root, "proofErr")
        _strip_run_rsid_attrs(root)

        # Every element that directly contains at least one run.
        containers = {run.parentNode for run in _find_elements(root, "r")}

        merge_count = 0
        for container in containers:
            merge_count += _merge_runs_in(container)

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Merged {merge_count} runs"

    except Exception as e:
        return 0, f"Error: {e}"


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def _get_child(parent, tag: str):
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                return child
    return None


def _get_children(parent, tag: str) -> list:
    results = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(child)
    return results


def _is_adjacent(elem1, elem2) -> bool:
    # Adjacent means only whitespace text nodes sit between the two elements.
    node = elem1.nextSibling
    while node:
        if node == elem2:
            return True
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return False


def _remove_elements(root, tag: str):
    for elem in _find_elements(root, tag):
        if elem.parentNode:
            elem.parentNode.removeChild(elem)


def _strip_run_rsid_attrs(root):
    for run in _find_elements(root, "r"):
        for attr in list(run.attributes.values()):
            if "rsid" in attr.name.lower():
                run.removeAttribute(attr.name)


def _merge_runs_in(container) -> int:
    merge_count = 0
    run = _first_child_run(container)

    while run:
        # Keep absorbing following runs while their formatting matches.
        while True:
            next_elem = _next_element_sibling(run)
            if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
                _merge_run_content(run, next_elem)
                container.removeChild(next_elem)
                merge_count += 1
            else:
                break

        _consolidate_text(run)
        run = _next_sibling_run(run)

    return merge_count


def _first_child_run(container):
    for child in container.childNodes:
        if child.nodeType == child.ELEMENT_NODE and _is_run(child):
            return child
    return None


def _next_element_sibling(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


def _next_sibling_run(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if _is_run(sibling):
                return sibling
        sibling = sibling.nextSibling
    return None


def _is_run(node) -> bool:
    name = node.localName or node.tagName
    return name == "r" or name.endswith(":r")


def _can_merge(run1, run2) -> bool:
    rpr1 = _get_child(run1, "rPr")
    rpr2 = _get_child(run2, "rPr")

    if (rpr1 is None) != (rpr2 is None):
        return False
    if rpr1 is None:
        return True
    return rpr1.toxml() == rpr2.toxml()


def _merge_run_content(target, source):
    # Move everything except the run properties into the target run.
    for child in list(source.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name != "rPr" and not name.endswith(":rPr"):
                target.appendChild(child)


def _consolidate_text(run):
    t_elements = _get_children(run, "t")

    for i in range(len(t_elements) - 1, 0, -1):
        curr, prev = t_elements[i], t_elements[i - 1]

        if _is_adjacent(prev, curr):
            prev_text = prev.firstChild.data if prev.firstChild else ""
            curr_text = curr.firstChild.data if curr.firstChild else ""
            merged = prev_text + curr_text

            if prev.firstChild:
                prev.firstChild.data = merged
            else:
                prev.appendChild(run.ownerDocument.createTextNode(merged))

            if merged.startswith(" ") or merged.endswith(" "):
                prev.setAttribute("xml:space", "preserve")
            elif prev.hasAttribute("xml:space"):
                prev.removeAttribute("xml:space")

            run.removeChild(curr)
scripts/office/helpers/simplify_redlines.py
Télécharger scripts/office/helpers/simplify_redlines.py
"""Simplify tracked changes by merging adjacent w:ins or w:del elements.
Merges adjacent <w:ins> elements from the same author into a single element.
Same for <w:del> elements. This makes heavily-redlined documents easier to
work with by reducing the number of tracked change wrappers.
Rules:
- Only merges w:ins with w:ins, w:del with w:del (same element type)
- Only merges if same author (ignores timestamp differences)
- Only merges if truly adjacent (only whitespace between them)
"""
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
import defusedxml.minidom
WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
def simplify_redlines(input_dir: str) -> tuple[int, str]:
doc_xml = Path(input_dir) / "word" / "document.xml"
if not doc_xml.exists():
return 0, f"Error: {doc_xml} not found"
try:
dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
root = dom.documentElement
merge_count = 0
containers = _find_elements(root, "p") + _find_elements(root, "tc")
for container in containers:
merge_count += _merge_tracked_changes_in(container, "ins")
merge_count += _merge_tracked_changes_in(container, "del")
doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
return merge_count, f"Simplified {merge_count} tracked changes"
except Exception as e:
return 0, f"Error: {e}"
def _merge_tracked_changes_in(container, tag: str) -> int:
merge_count = 0
tracked = [
child
for child in container.childNodes
if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
]
if len(tracked) < 2:
return 0
i = 0
while i < len(tracked) - 1:
curr = tracked[i]
next_elem = tracked[i + 1]
if _can_merge_tracked(curr, next_elem):
_merge_tracked_content(curr, next_elem)
container.removeChild(next_elem)
tracked.pop(i + 1)
merge_count += 1
else:
i += 1
return merge_count
def _is_element(node, tag: str) -> bool:
name = node.localName or node.tagName
return name == tag or name.endswith(f":{tag}")
def _get_author(elem) -> str:
author = elem.getAttribute("w:author")
if not author:
for attr in elem.attributes.values():
if attr.localName == "author" or attr.name.endswith(":author"):
return attr.value
return author
def _can_merge_tracked(elem1, elem2) -> bool:
if _get_author(elem1) != _get_author(elem2):
return False
node = elem1.nextSibling
while node and node != elem2:
if node.nodeType == node.ELEMENT_NODE:
return False
if node.nodeType == node.TEXT_NODE and node.data.strip():
return False
node = node.nextSibling
return True
def _merge_tracked_content(target, source):
while source.firstChild:
child = source.firstChild
source.removeChild(child)
target.appendChild(child)
def _find_elements(root, tag: str) -> list:
results = []
def traverse(node):
if node.nodeType == node.ELEMENT_NODE:
name = node.localName or node.tagName
if name == tag or name.endswith(f":{tag}"):
results.append(node)
for child in node.childNodes:
traverse(child)
traverse(root)
return results


def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
    if not doc_xml_path.exists():
        return {}
    try:
        tree = ET.parse(doc_xml_path)
        root = tree.getroot()
    except ET.ParseError:
        return {}
    namespaces = {"w": WORD_NS}
    author_attr = f"{{{WORD_NS}}}author"
    authors: dict[str, int] = {}
    for tag in ["ins", "del"]:
        for elem in root.findall(f".//w:{tag}", namespaces):
            author = elem.get(author_attr)
            if author:
                authors[author] = authors.get(author, 0) + 1
    return authors


def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
    try:
        with zipfile.ZipFile(docx_path, "r") as zf:
            if "word/document.xml" not in zf.namelist():
                return {}
            with zf.open("word/document.xml") as f:
                tree = ET.parse(f)
                root = tree.getroot()
                namespaces = {"w": WORD_NS}
                author_attr = f"{{{WORD_NS}}}author"
                authors: dict[str, int] = {}
                for tag in ["ins", "del"]:
                    for elem in root.findall(f".//w:{tag}", namespaces):
                        author = elem.get(author_attr)
                        if author:
                            authors[author] = authors.get(author, 0) + 1
                return authors
    except (zipfile.BadZipFile, ET.ParseError):
        return {}


def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
    modified_xml = modified_dir / "word" / "document.xml"
    modified_authors = get_tracked_change_authors(modified_xml)
    if not modified_authors:
        return default
    original_authors = _get_authors_from_docx(original_docx)
    new_changes: dict[str, int] = {}
    for author, count in modified_authors.items():
        original_count = original_authors.get(author, 0)
        diff = count - original_count
        if diff > 0:
            new_changes[author] = diff
    if not new_changes:
        return default
    if len(new_changes) == 1:
        return next(iter(new_changes))
    raise ValueError(
        f"Multiple authors added new changes: {new_changes}. "
        "Cannot infer which author to validate."
    )

scripts/office/pack.py
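As a rough illustration of the condensing step this script applies before zipping, here is a minimal sketch using only the standard library. It assumes `xml.dom.minidom` in place of `defusedxml.minidom`; `condense_xml_text` and the sample markup are hypothetical names made up for this example and are not part of the script.

```python
from xml.dom import minidom


def condense_xml_text(xml_text: str) -> str:
    """Drop whitespace-only text nodes and comments, except inside *:t elements."""
    dom = minidom.parseString(xml_text)
    for element in dom.getElementsByTagName("*"):
        # Text runs (w:t, a:t, ...) may hold significant whitespace; leave them alone.
        if element.tagName.endswith(":t"):
            continue
        for child in list(element.childNodes):
            is_blank_text = (
                child.nodeType == child.TEXT_NODE
                and child.nodeValue is not None
                and child.nodeValue.strip() == ""
            )
            if is_blank_text or child.nodeType == child.COMMENT_NODE:
                element.removeChild(child)
    return dom.documentElement.toxml()


# Hypothetical pretty-printed fragment; the namespace URI is a placeholder.
sample = """<w:p xmlns:w="urn:example">
  <!-- indentation added by a pretty-printer -->
  <w:r>
    <w:t xml:space="preserve"> </w:t>
  </w:r>
</w:p>"""

condensed = condense_xml_text(sample)
print(condensed)
```

The single space inside `w:t` survives because text runs can carry significant whitespace, while the indentation and the comment are dropped.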
"""Pack a directory into a DOCX, PPTX, or XLSX file.

Validates with auto-repair, condenses XML formatting, and creates the Office file.

Usage:
    python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]

Examples:
    python pack.py unpacked/ output.docx --original input.docx
    python pack.py unpacked/ output.pptx --validate false
"""

import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path

import defusedxml.minidom

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def pack(
    input_directory: str,
    output_file: str,
    original_file: str | None = None,
    validate: bool = True,
    infer_author_func=None,
) -> tuple[None, str]:
    input_dir = Path(input_directory)
    output_path = Path(output_file)
    suffix = output_path.suffix.lower()

    if not input_dir.is_dir():
        return None, f"Error: {input_dir} is not a directory"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"

    if validate and original_file:
        original_path = Path(original_file)
        if original_path.exists():
            success, output = _run_validation(
                input_dir, original_path, suffix, infer_author_func
            )
            if output:
                print(output)
            if not success:
                return None, f"Error: Validation failed for {input_dir}"

    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                _condense_xml(xml_file)

        output_path.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

    return None, f"Successfully packed {input_dir} to {output_file}"


def _run_validation(
    unpacked_dir: Path,
    original_file: Path,
    suffix: str,
    infer_author_func=None,
) -> tuple[bool, str | None]:
    output_lines = []
    validators = []

    if suffix == ".docx":
        author = "Claude"
        if infer_author_func:
            try:
                author = infer_author_func(unpacked_dir, original_file)
            except ValueError as e:
                print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)
        validators = [
            DOCXSchemaValidator(unpacked_dir, original_file),
            RedliningValidator(unpacked_dir, original_file, author=author),
        ]
    elif suffix == ".pptx":
        validators = [PPTXSchemaValidator(unpacked_dir, original_file)]

    if not validators:
        return True, None

    total_repairs = sum(v.repair() for v in validators)
    if total_repairs:
        output_lines.append(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)
    if success:
        output_lines.append("All validations PASSED!")

    return success, "\n".join(output_lines) if output_lines else None


def _condense_xml(xml_file: Path) -> None:
    try:
        with open(xml_file, encoding="utf-8") as f:
            dom = defusedxml.minidom.parse(f)

        for element in dom.getElementsByTagName("*"):
            if element.tagName.endswith(":t"):
                continue

            for child in list(element.childNodes):
                if (
                    child.nodeType == child.TEXT_NODE
                    and child.nodeValue
                    and child.nodeValue.strip() == ""
                ) or child.nodeType == child.COMMENT_NODE:
                    element.removeChild(child)

        xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
    except Exception as e:
        print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Pack a directory into a DOCX, PPTX, or XLSX file"
    )
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument(
        "--original",
        help="Original file for validation comparison",
    )
    parser.add_argument(
        "--validate",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Run validation with auto-repair (default: true)",
    )
    args = parser.parse_args()

    _, message = pack(
        args.input_directory,
        args.output_file,
        original_file=args.original,
        validate=args.validate,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)

scripts/office/schemas/ISO-IEC29500-4_2016/dml-chart.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-main.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-picture.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/pml.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-math.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/sml.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-main.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/wml.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/xml.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-contentTypes.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-coreProperties.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-digSig.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-relationships.xsd
Binary resource
scripts/office/schemas/mce/mc.xsd
Binary resource
scripts/office/schemas/microsoft/wml-2010.xsd
Binary resource
scripts/office/schemas/microsoft/wml-2012.xsd
Binary resource
scripts/office/schemas/microsoft/wml-2018.xsd
Binary resource
scripts/office/schemas/microsoft/wml-cex-2018.xsd
Binary resource
scripts/office/schemas/microsoft/wml-cid-2016.xsd
Binary resource
scripts/office/schemas/microsoft/wml-sdtdatahash-2020.xsd
Binary resource
scripts/office/schemas/microsoft/wml-symex-2015.xsd
Binary resource
scripts/office/soffice.py
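The C shim embedded further down in this script relies on a wake pipe: the shimmed accept() blocks reading one end of a pipe, and close() writes a byte to the other end to release it. The same pattern can be sketched in Python with `os.pipe` and a thread. The names `blocked_accept` and `events` are invented for this illustration and do not appear in the script.

```python
import os
import threading

# One pipe per "listener": the read end blocks, the write end wakes it.
read_fd, write_fd = os.pipe()
events: list[str] = []


def blocked_accept() -> None:
    """Stand-in for the shimmed accept(): block until one byte arrives."""
    os.read(read_fd, 1)
    events.append("accept woke up")


worker = threading.Thread(target=blocked_accept)
worker.start()

# Stand-in for the shimmed close(): write one byte to release the reader.
events.append("closing listener")
os.write(write_fd, b"\0")

worker.join(timeout=5)
os.close(read_fd)
os.close(write_fd)
print(events)
```

The reader cannot record its event before the byte is written, so the order of `events` is deterministic in this sketch.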
"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs). Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.

Usage:
    from office.soffice import run_soffice, get_soffice_env

    # Option 1 – run soffice directly
    result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])

    # Option 2 – get env dict for your own subprocess calls
    env = get_soffice_env()
    subprocess.run(["soffice", ...], env=env)
"""

import os
import socket
import subprocess
import tempfile
from pathlib import Path


def get_soffice_env() -> dict:
    env = os.environ.copy()
    env["SAL_USE_VCLPLUGIN"] = "svp"

    if _needs_shim():
        shim = _ensure_shim()
        env["LD_PRELOAD"] = str(shim)

    return env


def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
    env = get_soffice_env()
    return subprocess.run(["soffice"] + args, env=env, **kwargs)


_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"


def _needs_shim() -> bool:
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.close()
        return False
    except OSError:
        return True


def _ensure_shim() -> Path:
    if _SHIM_SO.exists():
        return _SHIM_SO

    src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
    src.write_text(_SHIM_SOURCE)
    subprocess.run(
        ["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
        check=True,
        capture_output=True,
    )
    src.unlink()
    return _SHIM_SO


_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);

/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024];      /* accept() blocks reading this */
static int wake_w[1024];      /* close() writes to this */
static int listener_fd = -1;  /* FD that received listen() */

__attribute__((constructor))
static void init(void) {
    real_socket     = dlsym(RTLD_NEXT, "socket");
    real_socketpair = dlsym(RTLD_NEXT, "socketpair");
    real_listen     = dlsym(RTLD_NEXT, "listen");
    real_accept     = dlsym(RTLD_NEXT, "accept");
    real_close      = dlsym(RTLD_NEXT, "close");
    real_read       = dlsym(RTLD_NEXT, "read");
    for (int i = 0; i < 1024; i++) {
        peer_of[i] = -1;
        wake_r[i]  = -1;
        wake_w[i]  = -1;
    }
}

/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
    if (domain == AF_UNIX) {
        int fd = real_socket(domain, type, protocol);
        if (fd >= 0) return fd;
        /* socket(AF_UNIX) blocked – fall back to socketpair(). */
        int sv[2];
        if (real_socketpair(domain, type, protocol, sv) == 0) {
            if (sv[0] >= 0 && sv[0] < 1024) {
                is_shimmed[sv[0]] = 1;
                peer_of[sv[0]]    = sv[1];
                int wp[2];
                if (pipe(wp) == 0) {
                    wake_r[sv[0]] = wp[0];
                    wake_w[sv[0]] = wp[1];
                }
            }
            return sv[0];
        }
        errno = EPERM;
        return -1;
    }
    return real_socket(domain, type, protocol);
}

/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        listener_fd = sockfd;
        return 0;
    }
    return real_listen(sockfd, backlog);
}

/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        /* Block until close() writes to the wake pipe. */
        if (wake_r[sockfd] >= 0) {
            char buf;
            real_read(wake_r[sockfd], &buf, 1);
        }
        errno = ECONNABORTED;
        return -1;
    }
    return real_accept(sockfd, addr, addrlen);
}

/* ---- close ----------------------------------------------------------- */
int close(int fd) {
    if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
        int was_listener = (fd == listener_fd);
        is_shimmed[fd] = 0;
        if (wake_w[fd] >= 0) {  /* unblock accept() */
            char c = 0;
            write(wake_w[fd], &c, 1);
            real_close(wake_w[fd]);
            wake_w[fd] = -1;
        }
        if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd] = -1; }
        if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }
        if (was_listener)
            _exit(0);  /* conversion done – exit */
    }
    return real_close(fd);
}
"""


if __name__ == "__main__":
    import sys

    result = run_soffice(sys.argv[1:])
    sys.exit(result.returncode)

scripts/office/unpack.py
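One of the steps in this script replaces curly quotes with character references after unpacking. Here is a small sketch of why that is lossless for XML consumers, assuming the replacement values are hexadecimal references such as `&#x201C;`. `SMART_QUOTE_ENTITIES`, `escape_smart_quotes` and the sample string are illustrative names only.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: each curly quote becomes its hexadecimal character reference.
SMART_QUOTE_ENTITIES = {
    "\u201c": "&#x201C;",  # left double quotation mark
    "\u201d": "&#x201D;",  # right double quotation mark
    "\u2018": "&#x2018;",  # left single quotation mark
    "\u2019": "&#x2019;",  # right single quotation mark
}


def escape_smart_quotes(xml_text: str) -> str:
    """Replace curly quotes with ASCII-only character references."""
    for char, entity in SMART_QUOTE_ENTITIES.items():
        xml_text = xml_text.replace(char, entity)
    return xml_text


original = "<t>\u201cIt\u2019s done,\u201d she said.</t>"
escaped = escape_smart_quotes(original)
print(escaped)

# An XML parser decodes the references back to the same characters.
same_text = ET.fromstring(escaped).text == ET.fromstring(original).text
print(same_text)
```

The escaped file contains only ASCII in place of the quotes, yet any XML parser reads back the same text.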
"""Unpack Office files (DOCX, PPTX, XLSX) for editing.

Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)

Usage:
    python unpack.py <office_file> <output_dir> [options]

Examples:
    python unpack.py document.docx unpacked/
    python unpack.py presentation.pptx unpacked/
    python unpack.py document.docx unpacked/ --merge-runs false
"""

import argparse
import sys
import zipfile
from pathlib import Path

import defusedxml.minidom

from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines

SMART_QUOTE_REPLACEMENTS = {
    "\u201c": "&#x201C;",
    "\u201d": "&#x201D;",
    "\u2018": "&#x2018;",
    "\u2019": "&#x2019;",
}


def unpack(
    input_file: str,
    output_directory: str,
    merge_runs: bool = True,
    simplify_redlines: bool = True,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_directory)
    suffix = input_path.suffix.lower()

    if not input_path.exists():
        return None, f"Error: {input_file} does not exist"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"

    try:
        output_path.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(input_path, "r") as zf:
            zf.extractall(output_path)

        xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
        for xml_file in xml_files:
            _pretty_print_xml(xml_file)

        message = f"Unpacked {input_file} ({len(xml_files)} XML files)"

        if suffix == ".docx":
            if simplify_redlines:
                simplify_count, _ = do_simplify_redlines(str(output_path))
                message += f", simplified {simplify_count} tracked changes"

            if merge_runs:
                merge_count, _ = do_merge_runs(str(output_path))
                message += f", merged {merge_count} runs"

        for xml_file in xml_files:
            _escape_smart_quotes(xml_file)

        return None, message

    except zipfile.BadZipFile:
        return None, f"Error: {input_file} is not a valid Office file"
    except Exception as e:
        return None, f"Error unpacking: {e}"


def _pretty_print_xml(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        dom = defusedxml.minidom.parseString(content)
        xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="utf-8"))
    except Exception:
        pass


def _escape_smart_quotes(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        for char, entity in SMART_QUOTE_REPLACEMENTS.items():
            content = content.replace(char, entity)
        xml_file.write_text(content, encoding="utf-8")
    except Exception:
        pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
    )
    parser.add_argument("input_file", help="Office file to unpack")
    parser.add_argument("output_directory", help="Output directory")
    parser.add_argument(
        "--merge-runs",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
    )
    parser.add_argument(
        "--simplify-redlines",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
    )
    args = parser.parse_args()

    _, message = unpack(
        args.input_file,
        args.output_directory,
        merge_runs=args.merge_runs,
        simplify_redlines=args.simplify_redlines,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)

scripts/office/validate.py
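The `--author` option of this tool names the author whose tracked changes are checked. When that name is not known in advance, one way to find candidates is to count the `w:author` values on `w:ins` and `w:del` elements before and after editing. The sketch below works on in-memory XML strings and uses the hypothetical helpers `count_authors`, `new_change_authors` and `wrap`; it is an illustration and not the validator's own logic.

```python
import xml.etree.ElementTree as ET

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_authors(document_xml: str) -> dict[str, int]:
    """Count tracked insertions and deletions per author."""
    root = ET.fromstring(document_xml)
    author_attr = f"{{{WORD_NS}}}author"
    counts: dict[str, int] = {}
    for tag in ("ins", "del"):
        for elem in root.iter(f"{{{WORD_NS}}}{tag}"):
            author = elem.get(author_attr)
            if author:
                counts[author] = counts.get(author, 0) + 1
    return counts


def new_change_authors(original: dict[str, int], modified: dict[str, int]) -> dict[str, int]:
    """Keep only authors who have more tracked changes than in the original."""
    return {
        author: count - original.get(author, 0)
        for author, count in modified.items()
        if count > original.get(author, 0)
    }


def wrap(body: str) -> str:
    """Build a tiny, made-up document.xml around the given body markup."""
    return f'<w:document xmlns:w="{WORD_NS}"><w:body>{body}</w:body></w:document>'


original_xml = wrap('<w:ins w:author="Alice"/>')
modified_xml = wrap(
    '<w:ins w:author="Alice"/><w:ins w:author="Claude"/><w:del w:author="Claude"/>'
)

added = new_change_authors(count_authors(original_xml), count_authors(modified_xml))
print(added)
```

If exactly one author shows up in `added`, that name is a reasonable value to pass as `--author`.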
"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.

Usage:
    python validate.py <path> [--original <original_file>] [--auto-repair] [--author NAME]

The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory

Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""

import argparse
import sys
import tempfile
import zipfile
from pathlib import Path

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "path",
        help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "--original",
        required=False,
        default=None,
        help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    parser.add_argument(
        "--auto-repair",
        action="store_true",
        help="Automatically repair common issues (hex IDs, whitespace preservation)",
    )
    parser.add_argument(
        "--author",
        default="Claude",
        help="Author name for redlining validation (default: Claude)",
    )
    args = parser.parse_args()

    path = Path(args.path)
    assert path.exists(), f"Error: {path} does not exist"

    original_file = None
    if args.original:
        original_file = Path(args.original)
        assert original_file.is_file(), f"Error: {original_file} is not a file"
        assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
            f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
        )

    file_extension = (original_file or path).suffix.lower()
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
    )

    if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
        temp_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path, "r") as zf:
            zf.extractall(temp_dir)
        unpacked_dir = Path(temp_dir)
    else:
        assert path.is_dir(), f"Error: {path} is not a directory or Office file"
        unpacked_dir = path

    match file_extension:
        case ".docx":
            validators = [
                DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
            if original_file:
                validators.append(
                    RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)
                )
        case ".pptx":
            validators = [
                PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    if args.auto_repair:
        total_repairs = sum(v.repair() for v in validators)
        if total_repairs:
            print(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)
    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()

scripts/office/validators/__init__.py
"""
Validation modules for Word document processing.
"""

from .base import BaseSchemaValidator
from .docx import DOCXSchemaValidator
from .pptx import PPTXSchemaValidator
from .redlining import RedliningValidator

__all__ = [
    "BaseSchemaValidator",
    "DOCXSchemaValidator",
    "PPTXSchemaValidator",
    "RedliningValidator",
]

scripts/office/validators/base.py
Binary resource
scripts/office/validators/docx.py
Binary resource
scripts/office/validators/pptx.py
Binary resource
scripts/office/validators/redlining.py
Binary resource
scripts/templates/comments.xml
Binary resource
scripts/templates/commentsExtended.xml
Binary resource
scripts/templates/commentsExtensible.xml
Binary resource
scripts/templates/commentsIds.xml
Binary resource
scripts/templates/people.xml
Binary resource
Document co-authoring
Guide users through a structured workflow for co-authoring documentation. Use when the user wants to write documentation, proposals, technical specifications, decision documents, or similar structured content. This workflow helps users transfer context efficiently, refine content through iteration, and verify that the document works for its readers. Trigger when the user mentions writing documents, creating proposals, drafting specifications, or similar documentation tasks.
Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text and tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs, rotating pages, adding watermarks, creating new PDFs, filling in PDF forms, encrypting or decrypting PDFs, extracting images, and running OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.