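Documents
Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include any mention of "Word doc", "word document", or ".docx", or a request to produce a professional document with formatting such as a table of contents, headings, page numbers, or a letterhead. Also use it when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a "report", "memo", "letter", "template", or similar deliverable as a Word or .docx file, use this skill. Do NOT use it for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation.
Source: content adapted from anthropics/skills (MIT license).
Overview
A .docx file is a ZIP archive containing XML files.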
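Because the container is an ordinary ZIP archive, its parts can be listed with any ZIP library. The sketch below is illustrative only: it builds a tiny stand-in archive in memory (not a valid Word document) and lists its parts with Python's standard zipfile module. The helper names list_parts and build_stand_in are hypothetical, and the two part names simply mirror paths mentioned elsewhere in this document.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the parts stored in a .docx (ZIP) archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


def build_stand_in() -> bytes:
    """Build a minimal ZIP that mimics the layout of a .docx (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<w:document/>")
    return buf.getvalue()


parts = list_parts(build_stand_in())
print(parts)
```

For a real file, the same function could be called with the bytes read from disk, for example list_parts(open("document.docx", "rb").read()).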
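Quick Reference
| Task | Approach |
|---|---|
| Read/analyze content | pandoc or unpack for raw XML |
| Create new document | Use docx-js - see Creating New Documents below |
| Edit existing document | Unpack -> edit XML -> repack - see Editing Existing Documents below |
Converting .doc to .docx
Legacy .doc files must be converted before editing:
python scripts/office/soffice.py --headless --convert-to docx document.doc
Reading Content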
# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md
# Raw XML access
python scripts/office/unpack.py document.docx unpacked/
Converting to Images
python scripts/office/soffice.py --headless --convert-to pdf document.docx
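pdftoppm -jpeg -r 150 document.pdf page
Accepting Tracked Changes
To produce a clean document with all tracked changes accepted (requires LibreOffice):
python scripts/accept_changes.py input.docx output.docx
Creating New Documents
Generate .docx files with JavaScript, then validate. Install: npm install -g docx
Setup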
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab,
PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader,
TabStopType, TabStopPosition, Column, SectionType,
TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
VerticalAlign, PageNumber, PageBreak } = require('docx');
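const fs = require('fs'); // needed for fs.writeFileSync below
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
python scripts/office/validate.py doc.docx
Page Size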
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
properties: {
page: {
size: {
width: 12240, // 8.5 inches in DXA
height: 15840 // 11 inches in DXA
},
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
}
},
children: [/* content */]
}]
Common page sizes (DXA units, 1440 DXA = 1 inch):
| Paper | Width | Height | Content width (1" margins) |
|---|---|---|---|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
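The figures in the table follow from simple arithmetic: 1440 DXA per inch, and content width equal to page width minus the left and right margins. A minimal sketch of that arithmetic, with hypothetical helper names (inches_to_dxa, content_width) that are not part of docx-js:

```python
DXA_PER_INCH = 1440  # 1440 DXA = 1 inch


def inches_to_dxa(inches: float) -> int:
    """Convert a length in inches to DXA units."""
    return round(inches * DXA_PER_INCH)


def content_width(page_width: int, left: int = 1440, right: int = 1440) -> int:
    """Usable width in DXA: page width minus left and right margins."""
    return page_width - left - right


print(inches_to_dxa(8.5))    # US Letter width
print(inches_to_dxa(11))     # US Letter height
print(content_width(12240))  # US Letter with 1 inch margins
print(content_width(11906))  # A4 with 1 inch margins
```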
Landscape: docx-js swaps width and height internally, so pass the portrait dimensions and let it handle the swap:
size: {
width: 12240, // Pass SHORT edge as width
height: 15840, // Pass LONG edge as height
orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)
Styles (Override Built-in Headings)
Use Arial as the default font (universally supported). Keep titles black for readability.
const doc = new Document({
styles: {
default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
paragraphStyles: [
// IMPORTANT: Use exact IDs to override built-in styles
{ id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 32, bold: true, font: "Arial" },
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
{ id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 28, bold: true, font: "Arial" },
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
]
},
sections: [{
children: [
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
]
}]
});
Lists (NEVER use unicode bullets)
// WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("* Item")] }) // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD
// CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
numbering: {
config: [
{ reference: "bullets",
levels: [{ level: 0, format: LevelFormat.BULLET, text: "\u2022", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "numbers",
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
]
},
sections: [{
children: [
new Paragraph({ numbering: { reference: "bullets", level: 0 },
children: [new TextRun("Bullet item")] }),
new Paragraph({ numbering: { reference: "numbers", level: 0 },
children: [new TextRun("Numbered item")] }),
]
}]
});
// Each reference creates INDEPENDENT numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)
Tables
CRITICAL: Tables need dual widths - set columnWidths on the table AND width on each cell. Without both, tables render incorrectly on some platforms.
// CRITICAL: Always set table width for consistent rendering
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
new Table({
width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
rows: [
new TableRow({
children: [
new TableCell({
borders,
width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
children: [new Paragraph({ children: [new TextRun("Cell")] })]
})
]
})
]
})
Table width calculation:
Always use WidthType.DXA - WidthType.PERCENTAGE breaks in Google Docs.
// Table width = sum of columnWidths = content width
// US Letter with 1" margins: 12240 - 2880 = 9360 DXA
width: { size: 9360, type: WidthType.DXA },
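columnWidths: [7000, 2360] // Must sum to table width
Width rules:
- Always use WidthType.DXA - never WidthType.PERCENTAGE (incompatible with Google Docs)
- Table width must equal the sum of columnWidths
- Cell width must match the corresponding columnWidth
- Cell margins are internal padding - they reduce the content area and do not add to the cell width
- For full-width tables: use the content width (page width minus left and right margins)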
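The width rules are easy to check mechanically before generating a table. The following sketch uses hypothetical helper names (split_evenly, width_problems) and assumes no merged cells: it splits a content width into columns that sum exactly, then verifies that the column widths add up to the table width and that each row's cell widths match them.

```python
def split_evenly(total_width: int, columns: int) -> list[int]:
    """Split a DXA width into integer column widths that sum exactly to the total."""
    base, remainder = divmod(total_width, columns)
    return [base + 1 if i < remainder else base for i in range(columns)]


def width_problems(table_width: int, column_widths: list[int],
                   rows: list[list[int]]) -> list[str]:
    """Return a list of rule violations (empty when the widths are consistent)."""
    problems = []
    if sum(column_widths) != table_width:
        problems.append(
            f"columnWidths sum to {sum(column_widths)}, table width is {table_width}"
        )
    for row_index, cell_widths in enumerate(rows):
        if list(cell_widths) != list(column_widths):
            problems.append(f"row {row_index}: cell widths do not match columnWidths")
    return problems


columns = split_evenly(9026, 4)  # A4 content width with 1 inch margins
print(columns, sum(columns))
print(width_problems(9360, [7000, 2360], [[7000, 2360]]))
print(width_problems(9360, [4680, 4000], [[4680, 4000]]))
```

Distributing the remainder across the first columns keeps every width an integer while still summing exactly to the table width.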
Images
// CRITICAL: type parameter is REQUIRED
new Paragraph({
children: [new ImageRun({
type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
data: fs.readFileSync("image.png"),
transformation: { width: 200, height: 150 },
altText: { title: "Title", description: "Desc", name: "Name" } // All three required
})]
})
Page Breaks
// CRITICAL: PageBreak must be inside a Paragraph
new Paragraph({ children: [new PageBreak()] })
// Or use pageBreakBefore
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
Hyperlinks
// External link
new Paragraph({
children: [new ExternalHyperlink({
children: [new TextRun({ text: "Click here", style: "Hyperlink" })],
link: "https://example.com",
})]
})
// Internal link (bookmark + reference)
// 1. Create bookmark at destination
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [
new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }),
]})
// 2. Link to it
new Paragraph({ children: [new InternalHyperlink({
children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })],
anchor: "chapter1",
})]})
Footnotes
const doc = new Document({
footnotes: {
1: { children: [new Paragraph("Source: Annual Report 2024")] },
2: { children: [new Paragraph("See appendix for methodology")] },
},
sections: [{
children: [new Paragraph({
children: [
new TextRun("Revenue grew 15%"),
new FootnoteReferenceRun(1),
new TextRun(" using adjusted metrics"),
new FootnoteReferenceRun(2),
],
})]
}]
});
Tab Stops
// Right-align text on same line (e.g., date opposite a title)
new Paragraph({
children: [
new TextRun("Company Name"),
new TextRun("\tJanuary 2025"),
],
tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
})
// Dot leader (e.g., TOC-style)
new Paragraph({
children: [
new TextRun("Introduction"),
new TextRun({ children: [
new PositionalTab({
alignment: PositionalTabAlignment.RIGHT,
relativeTo: PositionalTabRelativeTo.MARGIN,
leader: PositionalTabLeader.DOT,
}),
"3",
]}),
],
})
Multi-Column Layouts
// Equal-width columns
sections: [{
properties: {
column: {
count: 2, // number of columns
space: 720, // gap between columns in DXA (720 = 0.5 inch)
equalWidth: true,
separate: true, // vertical line between columns
},
},
children: [/* content flows naturally across columns */]
}]
// Custom-width columns (equalWidth must be false)
sections: [{
properties: {
column: {
equalWidth: false,
children: [
new Column({ width: 5400, space: 720 }),
new Column({ width: 3240 }),
],
},
},
children: [/* content */]
}]
Force a column break by starting a new section with type: SectionType.NEXT_COLUMN.
Table of Contents
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
Headers/Footers
sections: [{
properties: {
page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
},
headers: {
default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
},
footers: {
default: new Footer({ children: [new Paragraph({
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
})] })
},
children: [/* content */]
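}]
Critical Rules for docx-js
- Set page size explicitly - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- Landscape: pass portrait dimensions - docx-js swaps width/height internally; pass the short edge as width, the long edge as height, and set orientation: PageOrientation.LANDSCAPE
- Never use \n - use separate Paragraph elements
- Never use unicode bullets - use LevelFormat.BULLET with a numbering config
- PageBreak must be in a Paragraph - a standalone PageBreak creates invalid XML
- ImageRun requires type - always specify png/jpg/etc
- Always set table width with DXA - never use WidthType.PERCENTAGE (breaks in Google Docs)
- Tables need dual widths - the columnWidths array AND cell width, and both must match
- Table width = sum of columnWidths - for DXA, ensure they add up exactly
- Always add cell margins - use margins: { top: 80, bottom: 80, left: 120, right: 120 } for readable padding
- Use ShadingType.CLEAR - never SOLID for table shading
- Never use tables as dividers/rules - cells have a minimum height and render as empty boxes (including in headers/footers); use border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } } on a Paragraph instead. For two-column footers, use tab stops (see the Tab Stops section), not tables
- TOC requires HeadingLevel only - no custom styles on heading paragraphs
- Override built-in styles - use exact IDs: "Heading1", "Heading2", etc.
- Include outlineLevel - required for TOC (0 for H1, 1 for H2, etc.)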
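Editing Existing Documents
Follow all 3 steps in order.
Step 1: Unpack
python scripts/office/unpack.py document.docx unpacked/
Extracts the XML, pretty-prints it, merges adjacent runs, and converts smart quotes to XML entities (&#x201C; etc.) so they survive editing. Use --merge-runs false to skip run merging.
Step 2: Edit XML
Edit the files in unpacked/word/. See the XML Reference below for patterns.
Use "Claude" as the author for tracked changes and comments, unless the user explicitly requests a different name.
Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity, whereas the Edit tool shows exactly what is being replaced.
CRITICAL: Use smart quotes for new content. When adding text with apostrophes or quotes, use XML entities to produce smart quotes: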
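<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
| Entity | Character |
|---|---|
| &#x2018; | ‘ (left single) |
| &#x2019; | ’ (right single / apostrophe) |
| &#x201C; | “ (left double) |
| &#x201D; | ” (right double) |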
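The mapping above can be expressed as a small lookup. The sketch below is only an illustration of how a string might be prepared before it is pasted into the XML, not a scripted editing workflow: to_xml_text is a hypothetical helper, and the only library call used is xml.sax.saxutils.escape from the Python standard library. XML special characters are escaped first so that the ampersands in the numeric entities are not escaped a second time.

```python
from xml.sax.saxutils import escape

# Smart-quote characters and the numeric entities that represent them
SMART_QUOTE_ENTITIES = {
    "\u2018": "&#x2018;",  # left single
    "\u2019": "&#x2019;",  # right single / apostrophe
    "\u201C": "&#x201C;",  # left double
    "\u201D": "&#x201D;",  # right double
}


def to_xml_text(text: str) -> str:
    """Escape &, <, > and then replace smart quotes with numeric entities."""
    escaped = escape(text)
    for char, entity in SMART_QUOTE_ENTITIES.items():
        escaped = escaped.replace(char, entity)
    return escaped


print(to_xml_text("Here\u2019s a quote: \u201CHello\u201D & more"))
```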
Adding comments: use comment.py to handle the boilerplate across multiple XML files (text must be pre-escaped XML):
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0
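python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author" # custom author name
Then add markers to document.xml (see Comments in the XML Reference).
Step 3: Pack
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Validates with auto-repair, condenses XML, and creates the DOCX. Use --validate false to skip.
Auto-repair will fix:
- durableId >= 0x7FFFFFFF (regenerates a valid ID)
- Missing xml:space="preserve" on <w:t> with whitespace
Auto-repair won't fix:
- Malformed XML, invalid element nesting, missing relationships, schema violations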
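The two repairable conditions can be stated as simple predicates. The sketch below is not the validator's actual code, which is not shown here. All function names are hypothetical, durableId is assumed to be available as an integer, and "with whitespace" is interpreted as leading or trailing whitespace.

```python
import random

DURABLE_ID_LIMIT = 0x7FFFFFFF  # IDs at or above this value are treated as invalid


def durable_id_needs_repair(durable_id: int) -> bool:
    """True when the ID falls in the range the auto-repair step regenerates."""
    return durable_id >= DURABLE_ID_LIMIT


def regenerate_durable_id(rng: random.Random) -> int:
    """Pick a replacement ID below the limit (one possible strategy, assumed)."""
    return rng.randrange(1, DURABLE_ID_LIMIT)


def needs_space_preserve(text: str) -> bool:
    """True when the text has leading or trailing whitespace."""
    return text != text.strip()


print(durable_id_needs_repair(0x7FFFFFFF))
print(durable_id_needs_repair(0x1234ABCD))
print(needs_space_preserve(" days."))
```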
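Common Pitfalls
- Replace entire <w:r> elements: when adding tracked changes, replace the whole <w:r>...</w:r> block with <w:del>...<w:ins>... as siblings. Do not inject tracked-change tags inside a run.
- Preserve <w:rPr> formatting: copy the original run's <w:rPr> block into the tracked-change runs to keep bold, font size, etc.
XML Reference
Schema Compliance
- Element order in <w:pPr>: <w:pStyle>, <w:numPr>, <w:spacing>, <w:ind>, <w:jc>, with <w:rPr> last
- Whitespace: add xml:space="preserve" to any <w:t> with leading/trailing spaces
- RSIDs: must be 8-digit hex (e.g., 00AB1234)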
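Two of these rules lend themselves to a quick sanity check. The sketch below uses hypothetical helper names and considers only the six <w:pPr> children listed above (the full schema defines more, which this check ignores): it validates an RSID with a regular expression and verifies that the known children appear in the listed relative order.

```python
import re

RSID_PATTERN = re.compile(r"[0-9A-Fa-f]{8}")
# Relative order of the w:pPr children named in this document (prefix omitted)
PPR_ORDER = ["pStyle", "numPr", "spacing", "ind", "jc", "rPr"]


def is_valid_rsid(value: str) -> bool:
    """True when the value is exactly eight hexadecimal digits."""
    return RSID_PATTERN.fullmatch(value) is not None


def ppr_children_in_order(child_names: list[str]) -> bool:
    """True when the known children appear in the listed relative order."""
    positions = [PPR_ORDER.index(name) for name in child_names if name in PPR_ORDER]
    return positions == sorted(positions)


print(is_valid_rsid("00AB1234"))
print(ppr_children_in_order(["pStyle", "jc", "rPr"]))
print(ppr_children_in_order(["rPr", "pStyle"]))
```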
Tracked Changes
Insertion:
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:t>inserted text</w:t></w:r>
</w:ins>
Deletion:
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
Inside <w:del>: use <w:delText> instead of <w:t>, and <w:delInstrText> instead of <w:instrText>.
Minimal edits - only mark what changes:
<!-- Change "30 days" to "60 days" -->
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
<w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
<w:r><w:t>60</w:t></w:r>
</w:ins>
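<w:r><w:t> days.</w:t></w:r>
Deleting entire paragraphs/list items - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so that it merges with the next paragraph. Add <w:del/> inside <w:pPr><w:rPr>: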
<w:p>
<w:pPr>
<w:numPr>...</w:numPr> <!-- list numbering if present -->
<w:rPr>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
</w:rPr>
</w:pPr>
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
</w:del>
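</w:p>
Without the <w:del/> in <w:pPr><w:rPr>, accepting the changes leaves an empty paragraph/list item.
Rejecting another author's insertion - nest a deletion inside their insertion: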
<w:ins w:author="Jane" w:id="5">
<w:del w:author="Claude" w:id="10">
<w:r><w:delText>their inserted text</w:delText></w:r>
</w:del>
</w:ins>
Restoring another author's deletion - add an insertion after it (do not modify their deletion):
<w:del w:author="Jane" w:id="5">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
<w:r><w:t>deleted text</w:t></w:r>
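</w:ins>
Comments
After running comment.py (see Step 2), add markers to document.xml. For replies, use the --parent flag and nest the reply's markers inside the parent's.
CRITICAL: <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.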
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
<w:commentRangeStart w:id="1"/>
<w:r><w:t>text</w:t></w:r>
<w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
Images
- Add the image file to word/media/
- Add a relationship to word/_rels/document.xml.rels:
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
- Add the content type to [Content_Types].xml:
<Default Extension="png" ContentType="image/png"/>
- Reference it in document.xml:
<w:drawing>
<wp:inline>
<wp:extent cx="914400" cy="914400"/> <!-- EMUs: 914400 = 1 inch -->
<a:graphic>
<a:graphicData uri=".../picture">
<pic:pic>
<pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
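</w:drawing>
Dependencies
- pandoc: text extraction
- docx: npm install -g docx (new documents)
- LibreOffice: PDF conversion (auto-configured for sandboxed environments via scripts/office/soffice.py)
- Poppler: pdftoppm for images
Resource Files
LICENSE.txt
Binary resource
scripts/__init__.py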
scripts/accept_changes.py
"""Accept all tracked changes in a DOCX file using LibreOffice.

Requires LibreOffice (soffice) to be installed.
"""
import argparse
import logging
import shutil
import subprocess
from pathlib import Path

from office.soffice import get_soffice_env

logger = logging.getLogger(__name__)

LIBREOFFICE_PROFILE = "/tmp/libreoffice_docx_profile"
MACRO_DIR = f"{LIBREOFFICE_PROFILE}/user/basic/Standard"
ACCEPT_CHANGES_MACRO = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
    Sub AcceptAllTrackedChanges()
        Dim document As Object
        Dim dispatcher As Object
        document = ThisComponent.CurrentController.Frame
        dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
        dispatcher.executeDispatch(document, ".uno:AcceptAllTrackedChanges", "", 0, Array())
        ThisComponent.store()
        ThisComponent.close(True)
    End Sub
</script:module>"""


def accept_changes(
    input_file: str,
    output_file: str,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_file)

    if not input_path.exists():
        return None, f"Error: Input file not found: {input_file}"
    if not input_path.suffix.lower() == ".docx":
        return None, f"Error: Input file is not a DOCX file: {input_file}"

    # Work on a copy so the input file is never modified in place.
    try:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(input_path, output_path)
    except Exception as e:
        return None, f"Error: Failed to copy input file to output location: {e}"

    if not _setup_libreoffice_macro():
        return None, "Error: Failed to setup LibreOffice macro"

    cmd = [
        "soffice",
        "--headless",
        f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
        "--norestore",
        "vnd.sun.star.script:Standard.Module1.AcceptAllTrackedChanges?language=Basic&location=application",
        str(output_path.absolute()),
    ]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
            env=get_soffice_env(),
        )
    except subprocess.TimeoutExpired:
        # A timeout is reported as success here (presumably because soffice
        # can linger after the macro has already stored the document).
        return (
            None,
            f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
        )

    if result.returncode != 0:
        return None, f"Error: LibreOffice failed: {result.stderr}"

    return (
        None,
        f"Successfully accepted all tracked changes: {input_file} -> {output_file}",
    )


def _setup_libreoffice_macro() -> bool:
    macro_dir = Path(MACRO_DIR)
    macro_file = macro_dir / "Module1.xba"

    if macro_file.exists() and "AcceptAllTrackedChanges" in macro_file.read_text():
        return True

    if not macro_dir.exists():
        # Start soffice once so that it initializes the user profile.
        subprocess.run(
            [
                "soffice",
                "--headless",
                f"-env:UserInstallation=file://{LIBREOFFICE_PROFILE}",
                "--terminate_after_init",
            ],
            capture_output=True,
            timeout=10,
            check=False,
            env=get_soffice_env(),
        )
        macro_dir.mkdir(parents=True, exist_ok=True)

    try:
        macro_file.write_text(ACCEPT_CHANGES_MACRO)
        return True
    except Exception as e:
        logger.warning(f"Failed to setup LibreOffice macro: {e}")
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Accept all tracked changes in a DOCX file"
    )
    parser.add_argument("input_file", help="Input DOCX file with tracked changes")
    parser.add_argument(
        "output_file", help="Output DOCX file (clean, no tracked changes)"
    )
    args = parser.parse_args()

    _, message = accept_changes(args.input_file, args.output_file)
    print(message)
    if "Error" in message:
        raise SystemExit(1)
scripts/comment.py
Binary resource
scripts/office/helpers/__init__.py
Binary resource
scripts/office/helpers/merge_runs.py
"""Merge adjacent runs with identical formatting in DOCX.

Merges adjacent <w:r> elements that have identical <w:rPr> properties.
Works on runs in paragraphs and inside tracked changes (<w:ins>, <w:del>).

Also:
- Removes rsid attributes from runs (revision metadata that doesn't affect rendering)
- Removes proofErr elements (spell/grammar markers that block merging)
"""
from pathlib import Path

import defusedxml.minidom


def merge_runs(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"
    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        _remove_elements(root, "proofErr")
        _strip_run_rsid_attrs(root)

        # Every element that directly contains at least one run
        containers = {run.parentNode for run in _find_elements(root, "r")}

        merge_count = 0
        for container in containers:
            merge_count += _merge_runs_in(container)

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Merged {merge_count} runs"
    except Exception as e:
        return 0, f"Error: {e}"


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def _get_child(parent, tag: str):
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                return child
    return None


def _get_children(parent, tag: str) -> list:
    results = []
    for child in parent.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(child)
    return results


def _is_adjacent(elem1, elem2) -> bool:
    # Adjacent means only whitespace text nodes lie between the two elements.
    node = elem1.nextSibling
    while node:
        if node == elem2:
            return True
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return False


def _remove_elements(root, tag: str):
    for elem in _find_elements(root, tag):
        if elem.parentNode:
            elem.parentNode.removeChild(elem)


def _strip_run_rsid_attrs(root):
    for run in _find_elements(root, "r"):
        for attr in list(run.attributes.values()):
            if "rsid" in attr.name.lower():
                run.removeAttribute(attr.name)


def _merge_runs_in(container) -> int:
    merge_count = 0
    run = _first_child_run(container)
    while run:
        # Keep absorbing following runs while their formatting is identical.
        while True:
            next_elem = _next_element_sibling(run)
            if next_elem and _is_run(next_elem) and _can_merge(run, next_elem):
                _merge_run_content(run, next_elem)
                container.removeChild(next_elem)
                merge_count += 1
            else:
                break
        _consolidate_text(run)
        run = _next_sibling_run(run)
    return merge_count


def _first_child_run(container):
    for child in container.childNodes:
        if child.nodeType == child.ELEMENT_NODE and _is_run(child):
            return child
    return None


def _next_element_sibling(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            return sibling
        sibling = sibling.nextSibling
    return None


def _next_sibling_run(node):
    sibling = node.nextSibling
    while sibling:
        if sibling.nodeType == sibling.ELEMENT_NODE:
            if _is_run(sibling):
                return sibling
        sibling = sibling.nextSibling
    return None


def _is_run(node) -> bool:
    name = node.localName or node.tagName
    return name == "r" or name.endswith(":r")


def _can_merge(run1, run2) -> bool:
    rpr1 = _get_child(run1, "rPr")
    rpr2 = _get_child(run2, "rPr")
    if (rpr1 is None) != (rpr2 is None):
        return False
    if rpr1 is None:
        return True
    return rpr1.toxml() == rpr2.toxml()


def _merge_run_content(target, source):
    # Move everything except the run properties into the target run.
    for child in list(source.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            name = child.localName or child.tagName
            if name != "rPr" and not name.endswith(":rPr"):
                target.appendChild(child)


def _consolidate_text(run):
    t_elements = _get_children(run, "t")
    # Walk backwards so that removing an element does not shift pending indices.
    for i in range(len(t_elements) - 1, 0, -1):
        curr, prev = t_elements[i], t_elements[i - 1]
        if _is_adjacent(prev, curr):
            prev_text = prev.firstChild.data if prev.firstChild else ""
            curr_text = curr.firstChild.data if curr.firstChild else ""
            merged = prev_text + curr_text
            if prev.firstChild:
                prev.firstChild.data = merged
            else:
                prev.appendChild(run.ownerDocument.createTextNode(merged))
            if merged.startswith(" ") or merged.endswith(" "):
                prev.setAttribute("xml:space", "preserve")
            elif prev.hasAttribute("xml:space"):
                prev.removeAttribute("xml:space")
            run.removeChild(curr)
scripts/office/helpers/simplify_redlines.py
"""Simplify tracked changes by merging adjacent w:ins or w:del elements.

Merges adjacent <w:ins> elements from the same author into a single element.
Same for <w:del> elements. This makes heavily-redlined documents easier to
work with by reducing the number of tracked change wrappers.

Rules:
- Only merges w:ins with w:ins, w:del with w:del (same element type)
- Only merges if same author (ignores timestamp differences)
- Only merges if truly adjacent (only whitespace between them)
"""
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

import defusedxml.minidom

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def simplify_redlines(input_dir: str) -> tuple[int, str]:
    doc_xml = Path(input_dir) / "word" / "document.xml"
    if not doc_xml.exists():
        return 0, f"Error: {doc_xml} not found"

    try:
        dom = defusedxml.minidom.parseString(doc_xml.read_text(encoding="utf-8"))
        root = dom.documentElement

        merge_count = 0
        containers = _find_elements(root, "p") + _find_elements(root, "tc")
        for container in containers:
            merge_count += _merge_tracked_changes_in(container, "ins")
            merge_count += _merge_tracked_changes_in(container, "del")

        doc_xml.write_bytes(dom.toxml(encoding="UTF-8"))
        return merge_count, f"Simplified {merge_count} tracked changes"
    except Exception as e:
        return 0, f"Error: {e}"


def _merge_tracked_changes_in(container, tag: str) -> int:
    merge_count = 0
    tracked = [
        child
        for child in container.childNodes
        if child.nodeType == child.ELEMENT_NODE and _is_element(child, tag)
    ]
    if len(tracked) < 2:
        return 0

    i = 0
    while i < len(tracked) - 1:
        curr = tracked[i]
        next_elem = tracked[i + 1]
        if _can_merge_tracked(curr, next_elem):
            _merge_tracked_content(curr, next_elem)
            container.removeChild(next_elem)
            tracked.pop(i + 1)
            merge_count += 1
        else:
            i += 1
    return merge_count


def _is_element(node, tag: str) -> bool:
    name = node.localName or node.tagName
    return name == tag or name.endswith(f":{tag}")


def _get_author(elem) -> str:
    author = elem.getAttribute("w:author")
    if not author:
        # Fall back to any attribute whose local name is "author".
        for attr in elem.attributes.values():
            if attr.localName == "author" or attr.name.endswith(":author"):
                return attr.value
    return author


def _can_merge_tracked(elem1, elem2) -> bool:
    if _get_author(elem1) != _get_author(elem2):
        return False
    # Only whitespace text may sit between the two elements.
    node = elem1.nextSibling
    while node and node != elem2:
        if node.nodeType == node.ELEMENT_NODE:
            return False
        if node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
        node = node.nextSibling
    return True


def _merge_tracked_content(target, source):
    while source.firstChild:
        child = source.firstChild
        source.removeChild(child)
        target.appendChild(child)


def _find_elements(root, tag: str) -> list:
    results = []

    def traverse(node):
        if node.nodeType == node.ELEMENT_NODE:
            name = node.localName or node.tagName
            if name == tag or name.endswith(f":{tag}"):
                results.append(node)
            for child in node.childNodes:
                traverse(child)

    traverse(root)
    return results


def get_tracked_change_authors(doc_xml_path: Path) -> dict[str, int]:
    if not doc_xml_path.exists():
        return {}
    try:
        tree = ET.parse(doc_xml_path)
        root = tree.getroot()
    except ET.ParseError:
        return {}

    namespaces = {"w": WORD_NS}
    author_attr = f"{{{WORD_NS}}}author"
    authors: dict[str, int] = {}
    for tag in ["ins", "del"]:
        for elem in root.findall(f".//w:{tag}", namespaces):
            author = elem.get(author_attr)
            if author:
                authors[author] = authors.get(author, 0) + 1
    return authors


def _get_authors_from_docx(docx_path: Path) -> dict[str, int]:
    try:
        with zipfile.ZipFile(docx_path, "r") as zf:
            if "word/document.xml" not in zf.namelist():
                return {}
            with zf.open("word/document.xml") as f:
                tree = ET.parse(f)
                root = tree.getroot()
                namespaces = {"w": WORD_NS}
                author_attr = f"{{{WORD_NS}}}author"
                authors: dict[str, int] = {}
                for tag in ["ins", "del"]:
                    for elem in root.findall(f".//w:{tag}", namespaces):
                        author = elem.get(author_attr)
                        if author:
                            authors[author] = authors.get(author, 0) + 1
                return authors
    except (zipfile.BadZipFile, ET.ParseError):
        return {}


def infer_author(modified_dir: Path, original_docx: Path, default: str = "Claude") -> str:
    modified_xml = modified_dir / "word" / "document.xml"
    modified_authors = get_tracked_change_authors(modified_xml)
    if not modified_authors:
        return default

    original_authors = _get_authors_from_docx(original_docx)

    # Count only the changes that were added relative to the original file.
    new_changes: dict[str, int] = {}
    for author, count in modified_authors.items():
        original_count = original_authors.get(author, 0)
        diff = count - original_count
        if diff > 0:
            new_changes[author] = diff

    if not new_changes:
        return default
    if len(new_changes) == 1:
        return next(iter(new_changes))
    raise ValueError(
        f"Multiple authors added new changes: {new_changes}. "
        "Cannot infer which author to validate."
    )
scripts/office/pack.py
"""Pack a directory into a DOCX, PPTX, or XLSX file.

Validates with auto-repair, condenses XML formatting, and creates the Office file.

Usage:
    python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]

Examples:
    python pack.py unpacked/ output.docx --original input.docx
    python pack.py unpacked/ output.pptx --validate false
"""
import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path

import defusedxml.minidom

from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator


def pack(
    input_directory: str,
    output_file: str,
    original_file: str | None = None,
    validate: bool = True,
    infer_author_func=None,
) -> tuple[None, str]:
    input_dir = Path(input_directory)
    output_path = Path(output_file)
    suffix = output_path.suffix.lower()

    if not input_dir.is_dir():
        return None, f"Error: {input_dir} is not a directory"
    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"

    # Validation only runs when an existing original file is supplied.
    if validate and original_file:
        original_path = Path(original_file)
        if original_path.exists():
            success, output = _run_validation(
                input_dir, original_path, suffix, infer_author_func
            )
            if output:
                print(output)
            if not success:
                return None, f"Error: Validation failed for {input_dir}"

    # Condense a temporary copy so the unpacked directory stays readable.
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_content_dir = Path(temp_dir) / "content"
        shutil.copytree(input_dir, temp_content_dir)

        for pattern in ["*.xml", "*.rels"]:
            for xml_file in temp_content_dir.rglob(pattern):
                _condense_xml(xml_file)

        output_path.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in temp_content_dir.rglob("*"):
                if f.is_file():
                    zf.write(f, f.relative_to(temp_content_dir))

    return None, f"Successfully packed {input_dir} to {output_file}"


def _run_validation(
    unpacked_dir: Path,
    original_file: Path,
    suffix: str,
    infer_author_func=None,
) -> tuple[bool, str | None]:
    output_lines = []
    validators = []

    if suffix == ".docx":
        author = "Claude"
        if infer_author_func:
            try:
                author = infer_author_func(unpacked_dir, original_file)
            except ValueError as e:
                print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)
        validators = [
            DOCXSchemaValidator(unpacked_dir, original_file),
            RedliningValidator(unpacked_dir, original_file, author=author),
        ]
    elif suffix == ".pptx":
        validators = [PPTXSchemaValidator(unpacked_dir, original_file)]

    if not validators:
        return True, None

    total_repairs = sum(v.repair() for v in validators)
    if total_repairs:
        output_lines.append(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)
    if success:
        output_lines.append("All validations PASSED!")

    return success, "\n".join(output_lines) if output_lines else None


def _condense_xml(xml_file: Path) -> None:
    try:
        with open(xml_file, encoding="utf-8") as f:
            dom = defusedxml.minidom.parse(f)

        for element in dom.getElementsByTagName("*"):
            # Leave text elements untouched so meaningful whitespace survives.
            if element.tagName.endswith(":t"):
                continue
            for child in list(element.childNodes):
                if (
                    child.nodeType == child.TEXT_NODE
                    and child.nodeValue
                    and child.nodeValue.strip() == ""
                ) or child.nodeType == child.COMMENT_NODE:
                    element.removeChild(child)

        xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
    except Exception as e:
        print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
        raise


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Pack a directory into a DOCX, PPTX, or XLSX file"
    )
    parser.add_argument("input_directory", help="Unpacked Office document directory")
    parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
    parser.add_argument(
        "--original",
        help="Original file for validation comparison",
    )
    parser.add_argument(
        "--validate",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Run validation with auto-repair (default: true)",
    )
    args = parser.parse_args()

    _, message = pack(
        args.input_directory,
        args.output_file,
        original_file=args.original,
        validate=args.validate,
    )
    print(message)
    if "Error" in message:
        sys.exit(1)
scripts/office/schemas/ISO-IEC29500-4_2016/dml-chart.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-main.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-picture.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/pml.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-math.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/sml.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-main.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/wml.xsd
Binary resource
scripts/office/schemas/ISO-IEC29500-4_2016/xml.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-contentTypes.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-coreProperties.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-digSig.xsd
Binary resource
scripts/office/schemas/ecma/fouth-edition/opc-relationships.xsd
Binary resource
scripts/office/schemas/mce/mc.xsd
Binary resource
scripts/office/schemas/microsoft/wml-2010.xsd
Binary resource
scripts/office/schemas/microsoft/wml-2012.xsd
Binary resource
scripts/office/schemas/microsoft/wml-2018.xsd
Binary resource
scripts/office/schemas/microsoft/wml-cex-2018.xsd
Binary resource
scripts/office/schemas/microsoft/wml-cid-2016.xsd
Binary resource
scripts/office/schemas/microsoft/wml-sdtdatahash-2020.xsd
Binary resource
scripts/office/schemas/microsoft/wml-symex-2015.xsd
Binary resource
scripts/office/soffice.py
"""
Helper for running LibreOffice (soffice) in environments where AF_UNIX
sockets may be blocked (e.g., sandboxed VMs). Detects the restriction
at runtime and applies an LD_PRELOAD shim if needed.

Usage:
    from office.soffice import run_soffice, get_soffice_env

    # Option 1 – run soffice directly
    result = run_soffice(["--headless", "--convert-to", "pdf", "input.docx"])

    # Option 2 – get env dict for your own subprocess calls
    env = get_soffice_env()
    subprocess.run(["soffice", ...], env=env)
"""

import os
import socket
import subprocess
import tempfile
from pathlib import Path


def get_soffice_env() -> dict:
    env = os.environ.copy()
    env["SAL_USE_VCLPLUGIN"] = "svp"

    if _needs_shim():
        shim = _ensure_shim()
        env["LD_PRELOAD"] = str(shim)

    return env


def run_soffice(args: list[str], **kwargs) -> subprocess.CompletedProcess:
    env = get_soffice_env()
    return subprocess.run(["soffice"] + args, env=env, **kwargs)


_SHIM_SO = Path(tempfile.gettempdir()) / "lo_socket_shim.so"


def _needs_shim() -> bool:
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.close()
        return False
    except OSError:
        return True


def _ensure_shim() -> Path:
    if _SHIM_SO.exists():
        return _SHIM_SO

    src = Path(tempfile.gettempdir()) / "lo_socket_shim.c"
    src.write_text(_SHIM_SOURCE)
    subprocess.run(
        ["gcc", "-shared", "-fPIC", "-o", str(_SHIM_SO), str(src), "-ldl"],
        check=True,
        capture_output=True,
    )
    src.unlink()
    return _SHIM_SO
_SHIM_SOURCE = r"""
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_socket)(int, int, int);
static int (*real_socketpair)(int, int, int, int[2]);
static int (*real_listen)(int, int);
static int (*real_accept)(int, struct sockaddr *, socklen_t *);
static int (*real_close)(int);
static int (*real_read)(int, void *, size_t);

/* Per-FD bookkeeping (FDs >= 1024 are passed through unshimmed). */
static int is_shimmed[1024];
static int peer_of[1024];
static int wake_r[1024];     /* accept() blocks reading this */
static int wake_w[1024];     /* close() writes to this */
static int listener_fd = -1; /* FD that received listen() */

__attribute__((constructor))
static void init(void) {
    real_socket = dlsym(RTLD_NEXT, "socket");
    real_socketpair = dlsym(RTLD_NEXT, "socketpair");
    real_listen = dlsym(RTLD_NEXT, "listen");
    real_accept = dlsym(RTLD_NEXT, "accept");
    real_close = dlsym(RTLD_NEXT, "close");
    real_read = dlsym(RTLD_NEXT, "read");
    for (int i = 0; i < 1024; i++) {
        peer_of[i] = -1;
        wake_r[i] = -1;
        wake_w[i] = -1;
    }
}

/* ---- socket ---------------------------------------------------------- */
int socket(int domain, int type, int protocol) {
    if (domain == AF_UNIX) {
        int fd = real_socket(domain, type, protocol);
        if (fd >= 0) return fd;
        /* socket(AF_UNIX) blocked – fall back to socketpair(). */
        int sv[2];
        if (real_socketpair(domain, type, protocol, sv) == 0) {
            if (sv[0] >= 0 && sv[0] < 1024) {
                is_shimmed[sv[0]] = 1;
                peer_of[sv[0]] = sv[1];
                int wp[2];
                if (pipe(wp) == 0) {
                    wake_r[sv[0]] = wp[0];
                    wake_w[sv[0]] = wp[1];
                }
            }
            return sv[0];
        }
        errno = EPERM;
        return -1;
    }
    return real_socket(domain, type, protocol);
}

/* ---- listen ---------------------------------------------------------- */
int listen(int sockfd, int backlog) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        listener_fd = sockfd;
        return 0;
    }
    return real_listen(sockfd, backlog);
}

/* ---- accept ---------------------------------------------------------- */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    if (sockfd >= 0 && sockfd < 1024 && is_shimmed[sockfd]) {
        /* Block until close() writes to the wake pipe. */
        if (wake_r[sockfd] >= 0) {
            char buf;
            real_read(wake_r[sockfd], &buf, 1);
        }
        errno = ECONNABORTED;
        return -1;
    }
    return real_accept(sockfd, addr, addrlen);
}

/* ---- close ----------------------------------------------------------- */
int close(int fd) {
    if (fd >= 0 && fd < 1024 && is_shimmed[fd]) {
        int was_listener = (fd == listener_fd);
        is_shimmed[fd] = 0;
        if (wake_w[fd] >= 0) { /* unblock accept() */
            char c = 0;
            write(wake_w[fd], &c, 1);
            real_close(wake_w[fd]);
            wake_w[fd] = -1;
        }
        if (wake_r[fd] >= 0) { real_close(wake_r[fd]); wake_r[fd] = -1; }
        if (peer_of[fd] >= 0) { real_close(peer_of[fd]); peer_of[fd] = -1; }
        if (was_listener)
            _exit(0); /* conversion done – exit */
    }
    return real_close(fd);
}
"""
if __name__ == "__main__":
    import sys

    result = run_soffice(sys.argv[1:])
    sys.exit(result.returncode)
scripts/office/unpack.py
"""Unpack Office files (DOCX, PPTX, XLSX) for editing.
Extracts the ZIP archive, pretty-prints XML files, and optionally:
- Merges adjacent runs with identical formatting (DOCX only)
- Simplifies adjacent tracked changes from same author (DOCX only)
Usage:
python unpack.py <office_file> <output_dir> [options]
Examples:
python unpack.py document.docx unpacked/
python unpack.py presentation.pptx unpacked/
python unpack.py document.docx unpacked/ --merge-runs false
"""
import argparse
import sys
import zipfile
from pathlib import Path
import defusedxml.minidom
from helpers.merge_runs import merge_runs as do_merge_runs
from helpers.simplify_redlines import simplify_redlines as do_simplify_redlines
# Map each smart quote to its numeric XML character reference.
SMART_QUOTE_REPLACEMENTS = {
    "\u201c": "&#x201C;",
    "\u201d": "&#x201D;",
    "\u2018": "&#x2018;",
    "\u2019": "&#x2019;",
}
def unpack(
    input_file: str,
    output_directory: str,
    merge_runs: bool = True,
    simplify_redlines: bool = True,
) -> tuple[None, str]:
    input_path = Path(input_file)
    output_path = Path(output_directory)
    suffix = input_path.suffix.lower()

    if not input_path.exists():
        return None, f"Error: {input_file} does not exist"

    if suffix not in {".docx", ".pptx", ".xlsx"}:
        return None, f"Error: {input_file} must be a .docx, .pptx, or .xlsx file"

    try:
        output_path.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(input_path, "r") as zf:
            zf.extractall(output_path)

        xml_files = list(output_path.rglob("*.xml")) + list(output_path.rglob("*.rels"))
        for xml_file in xml_files:
            _pretty_print_xml(xml_file)

        message = f"Unpacked {input_file} ({len(xml_files)} XML files)"

        if suffix == ".docx":
            if simplify_redlines:
                simplify_count, _ = do_simplify_redlines(str(output_path))
                message += f", simplified {simplify_count} tracked changes"

            if merge_runs:
                merge_count, _ = do_merge_runs(str(output_path))
                message += f", merged {merge_count} runs"

        for xml_file in xml_files:
            _escape_smart_quotes(xml_file)

        return None, message

    except zipfile.BadZipFile:
        return None, f"Error: {input_file} is not a valid Office file"
    except Exception as e:
        return None, f"Error unpacking: {e}"


def _pretty_print_xml(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        dom = defusedxml.minidom.parseString(content)
        xml_file.write_bytes(dom.toprettyxml(indent=" ", encoding="utf-8"))
    except Exception:
        pass


def _escape_smart_quotes(xml_file: Path) -> None:
    try:
        content = xml_file.read_text(encoding="utf-8")
        for char, entity in SMART_QUOTE_REPLACEMENTS.items():
            content = content.replace(char, entity)
        xml_file.write_text(content, encoding="utf-8")
    except Exception:
        pass
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Unpack an Office file (DOCX, PPTX, XLSX) for editing"
    )
    parser.add_argument("input_file", help="Office file to unpack")
    parser.add_argument("output_directory", help="Output directory")
    parser.add_argument(
        "--merge-runs",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent runs with identical formatting (DOCX only, default: true)",
    )
    parser.add_argument(
        "--simplify-redlines",
        type=lambda x: x.lower() == "true",
        default=True,
        metavar="true|false",
        help="Merge adjacent tracked changes from same author (DOCX only, default: true)",
    )
    args = parser.parse_args()

    _, message = unpack(
        args.input_file,
        args.output_directory,
        merge_runs=args.merge_runs,
        simplify_redlines=args.simplify_redlines,
    )
    print(message)

    if "Error" in message:
        sys.exit(1)
scripts/office/validate.py
"""
Command line tool to validate Office document XML files against XSD schemas and tracked changes.
Usage:
python validate.py <path> [--original <original_file>] [--auto-repair] [--author NAME]
The first argument can be either:
- An unpacked directory containing the Office document XML files
- A packed Office file (.docx/.pptx/.xlsx) which will be unpacked to a temp directory
Auto-repair fixes:
- paraId/durableId values that exceed OOXML limits
- Missing xml:space="preserve" on w:t elements with whitespace
"""
import argparse
import sys
import tempfile
import zipfile
from pathlib import Path
from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator
def main():
    parser = argparse.ArgumentParser(description="Validate Office document XML files")
    parser.add_argument(
        "path",
        help="Path to unpacked directory or packed Office file (.docx/.pptx/.xlsx)",
    )
    parser.add_argument(
        "--original",
        required=False,
        default=None,
        help="Path to original file (.docx/.pptx/.xlsx). If omitted, all XSD errors are reported and redlining validation is skipped.",
    )
    parser.add_argument(
        "-v",
        "--verbose",
        action="store_true",
        help="Enable verbose output",
    )
    parser.add_argument(
        "--auto-repair",
        action="store_true",
        help="Automatically repair common issues (hex IDs, whitespace preservation)",
    )
    parser.add_argument(
        "--author",
        default="Claude",
        help="Author name for redlining validation (default: Claude)",
    )
    args = parser.parse_args()

    path = Path(args.path)
    assert path.exists(), f"Error: {path} does not exist"

    original_file = None
    if args.original:
        original_file = Path(args.original)
        assert original_file.is_file(), f"Error: {original_file} is not a file"
        assert original_file.suffix.lower() in [".docx", ".pptx", ".xlsx"], (
            f"Error: {original_file} must be a .docx, .pptx, or .xlsx file"
        )

    file_extension = (original_file or path).suffix.lower()
    assert file_extension in [".docx", ".pptx", ".xlsx"], (
        f"Error: Cannot determine file type from {path}. Use --original or provide a .docx/.pptx/.xlsx file."
    )

    if path.is_file() and path.suffix.lower() in [".docx", ".pptx", ".xlsx"]:
        temp_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path, "r") as zf:
            zf.extractall(temp_dir)
        unpacked_dir = Path(temp_dir)
    else:
        assert path.is_dir(), f"Error: {path} is not a directory or Office file"
        unpacked_dir = path

    match file_extension:
        case ".docx":
            validators = [
                DOCXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
            if original_file:
                validators.append(
                    RedliningValidator(unpacked_dir, original_file, verbose=args.verbose, author=args.author)
                )
        case ".pptx":
            validators = [
                PPTXSchemaValidator(unpacked_dir, original_file, verbose=args.verbose),
            ]
        case _:
            print(f"Error: Validation not supported for file type {file_extension}")
            sys.exit(1)

    if args.auto_repair:
        total_repairs = sum(v.repair() for v in validators)
        if total_repairs:
            print(f"Auto-repaired {total_repairs} issue(s)")

    success = all(v.validate() for v in validators)

    if success:
        print("All validations PASSED!")

    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()
scripts/office/validators/__init__.py
"""
Validation modules for Word document processing.
"""

from .base import BaseSchemaValidator
from .docx import DOCXSchemaValidator
from .pptx import PPTXSchemaValidator
from .redlining import RedliningValidator

__all__ = [
    "BaseSchemaValidator",
    "DOCXSchemaValidator",
    "PPTXSchemaValidator",
    "RedliningValidator",
]
scripts/office/validators/base.py
Binary resource
scripts/office/validators/docx.py
Binary resource
scripts/office/validators/pptx.py
Binary resource
scripts/office/validators/redlining.py
Binary resource
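The validate.py docstring above lists "Missing xml:space="preserve" on w:t elements with whitespace" as one of its auto-repairs, but the validator sources are not shown here. The following is only a minimal sketch of what such a repair could look like, using the standard library's xml.dom.minidom. The function name add_space_preserve and the sample XML are hypothetical, and the real validators may work differently.

```python
from xml.dom import minidom

# WordprocessingML main namespace (the usual "w:" prefix).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def add_space_preserve(xml_text: str) -> tuple[str, int]:
    """Hypothetical helper: mark w:t elements with leading/trailing whitespace.

    Returns the repaired XML and the number of elements that were changed.
    """
    dom = minidom.parseString(xml_text)
    repairs = 0
    for node in dom.getElementsByTagNameNS(W_NS, "t"):
        text = "".join(
            child.data for child in node.childNodes if child.nodeType == child.TEXT_NODE
        )
        # Only touch elements whose text would lose whitespace without the attribute.
        if text != text.strip() and node.getAttribute("xml:space") != "preserve":
            node.setAttribute("xml:space", "preserve")
            repairs += 1
    return dom.toxml(), repairs


SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
)
repaired, count = add_space_preserve(SAMPLE)
print(count)
print(repaired)
```

Running the helper a second time on its own output should report zero repairs, which is a convenient property for a step that may run on every pack.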
scripts/templates/comments.xml
Binary resource
scripts/templates/commentsExtended.xml
Binary resource
scripts/templates/commentsExtensible.xml
Binary resource
scripts/templates/commentsIds.xml
Binary resource
scripts/templates/people.xml
Binary resource
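The unpack and pack scripts listed above treat an Office file as a ZIP archive of XML parts: unpacking pretty-prints each part for editing, and packing strips whitespace-only text nodes and comments before re-zipping. A rough, self-contained sketch of that round trip follows. It uses the standard library's xml.dom.minidom instead of defusedxml (which the scripts use and which is the safer choice for untrusted files), works on an in-memory stand-in archive rather than a real .docx, and the helper names pretty_print, condense, and round_trip are hypothetical. Skipping elements whose tag ends in ":t" is an assumption made here so that run text is never altered.

```python
import io
import zipfile
from xml.dom import minidom


def pretty_print(xml_bytes: bytes) -> bytes:
    """Indent XML for easier hand editing (the unpack direction)."""
    return minidom.parseString(xml_bytes).toprettyxml(indent="  ", encoding="utf-8")


def condense(xml_bytes: bytes) -> bytes:
    """Drop whitespace-only text nodes and comments (the pack direction)."""
    dom = minidom.parseString(xml_bytes)
    for element in dom.getElementsByTagName("*"):
        # Assumption: leave text elements such as w:t alone so that
        # meaningful spaces inside them are never removed.
        if element.tagName.endswith(":t"):
            continue
        for child in list(element.childNodes):
            blank_text = (
                child.nodeType == child.TEXT_NODE
                and child.nodeValue
                and child.nodeValue.strip() == ""
            )
            if blank_text or child.nodeType == child.COMMENT_NODE:
                element.removeChild(child)
    return dom.toxml(encoding="UTF-8")


def round_trip(archive: bytes) -> bytes:
    """Unzip, pretty-print, condense, and re-zip every .xml/.rels member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, zipfile.ZipFile(
        out, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name.endswith((".xml", ".rels")):
                data = condense(pretty_print(data))
            dst.writestr(name, data)
    return out.getvalue()


# Build a tiny stand-in archive (not a complete, valid .docx).
DOCUMENT_XML = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b"<w:body><w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p></w:body>"
    b"</w:document>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", DOCUMENT_XML)

result = round_trip(buf.getvalue())
with zipfile.ZipFile(io.BytesIO(result)) as zf:
    condensed = zf.read("word/document.xml")
print(condensed.decode("utf-8"))
```

In this sketch the indentation added by pretty_print is removed again by condense, while the trailing space in "Hello " survives because it sits inside a w:t element.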