PPTX to JSON - PraisonAI PPT
PPTX → JSON Extraction
Overview
pptx_to_json is the inverse of create_presentation(). It reads an existing .pptx file
and extracts its content as a dict that conforms to the praisonaippt JSON schema, the same
format used to create presentations.
This enables round-trip workflows:
JSON ──create_presentation()──► PPTX ──pptx_to_json()──► JSON
No new dependencies are required beyond python-pptx, which is already a core dependency.
Quick Start
Python API
from praisonaippt import pptx_to_json
# Returns a dict
data = pptx_to_json("presentation.pptx")
# Save to file (pretty-printed)
pptx_to_json("presentation.pptx", output_path="output.json")
# Compact JSON (no indentation)
pptx_to_json("presentation.pptx", output_path="output.json", pretty=False)
CLI
# Saves to presentation.json (auto-named)
praisonaippt convert-json presentation.pptx
# Specify output file
praisonaippt convert-json presentation.pptx --json-output output.json
# Compact JSON
praisonaippt convert-json presentation.pptx --json-output out.json --no-pretty
CLI Reference
Command
praisonaippt convert-json <input_file> [options]
Arguments
| Argument | Description |
|---|---|
| input_file | Path to .pptx or .ppt file to extract (required) |
Options
| Option | Default | Description |
|---|---|---|
| --json-output PATH | <input>.json | Output JSON file path |
| --output-format FORMAT | "json" | Output format ("json" or "yaml") |
| --pretty | True | Write indented, human-readable JSON |
| --no-pretty | - | Write compact single-line JSON |
Examples
# Basic extraction (auto output path)
praisonaippt convert-json my_presentation.pptx
# → writes my_presentation.json
# Named output
praisonaippt convert-json my_presentation.pptx --json-output extracted.json
# Compact JSON for embedding in scripts
praisonaippt convert-json my_presentation.pptx --json-output data.json --no-pretty
Error Handling
# File not found
praisonaippt convert-json missing.pptx
# Error: Input file not found: missing.pptx
# Wrong file type
praisonaippt convert-json slides.pdf
# Error: Input file must be a PowerPoint file (.pptx or .ppt)
Python API Reference
pptx_to_json()
def pptx_to_json(
    pptx_path: str,
    output_path: Optional[str] = None,
    pretty: bool = True,
    images_dir: Optional[str] = None,
    output_format: str = 'json',
) -> dict:
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| pptx_path | str | (required) | Path to .pptx or .ppt file |
| output_path | str or None | None | If set, writes output to this path |
| pretty | bool | True | Indent output JSON (set False for compact) |
| images_dir | str or None | None | Optional directory to save extracted images |
| output_format | str | "json" | Output format ("json" or "yaml") |
Returns
dict: conforms to the praisonaippt JSON schema (same structure as input files for create_presentation).
Raises
| Exception | When |
|---|---|
| ValueError | Input file has a non-PPTX extension |
| FileNotFoundError | Input file does not exist |
Examples
from praisonaippt import pptx_to_json
# --- In-memory ---
data = pptx_to_json("presentation.pptx")
print(data["presentation_title"])
print(len(data["sections"]))
# --- Save to file ---
pptx_to_json("presentation.pptx", output_path="output.json")
# --- Compact JSON ---
pptx_to_json("presentation.pptx", output_path="data.json", pretty=False)
# --- Error handling ---
try:
    data = pptx_to_json("missing.pptx")
except FileNotFoundError as e:
    print(f"File missing: {e}")
except ValueError as e:
    print(f"Wrong type: {e}")
PPTXToJSONConverter Class
For advanced use cases, use the converter class directly:
from praisonaippt.pptx_to_json import PPTXToJSONConverter
converter = PPTXToJSONConverter("presentation.pptx")
data = converter.convert()
Output JSON Schema
The output dict mirrors the praisonaippt input schema exactly, plus two metadata fields. The example is shown in YAML form:
_source: extracted
_extraction_warnings:
- 'background_image: file path not recoverable from PPTX binary'
presentation_title: Great Faith
presentation_subtitle: Mark 10:30 (NKJV)
slide_size: widescreen
slide_style:
  text_color: white
  reference_position: top
  alignment: left
  font_name: Palatino
  highlight_color: '#FF8C00'
  annotation_color: '#1E50C8'
sections:
- section: 1. Centurion
  verses:
  - reference: Matthew 8:5-10 (NKJV)
    text: 10 When Jesus heard it, He marveled...
    highlights:
    - I have not found such great faith
- section: To Be Victorious
  verses: []
- section: 1. They Didn't Wait for God
  verses:
  - reference: ''
    text: 'Woman with the Issue of Blood

      Centurion

      Canaanite'
    list_type: bullet
- section: 1. Tithe
  verses:
  - reference: ''
    text: 'מַעֲשֵׂר (maʿăśēr) - tithe

      עָשַׁר (ʿāšar) - to be rich'
    highlights:
    - מַעֲשֵׂר
    - עָשַׁר
    large_text:
      מַעֲשֵׂר: 80
      עָשַׁר: 80
Metadata Fields
| Field | Value | Description |
|---|---|---|
| _source | "extracted" | Always present; marks the data as extracted (not hand-authored) |
| _extraction_warnings | list of strings | Populated when features could not be fully recovered |
Note: The _source and _extraction_warnings keys are ignored by create_presentation(); strip them or leave them in for a round-trip, both work.
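If you prefer to strip the metadata keys before storing or diffing the result, a small helper is enough. The following is a minimal sketch, not part of praisonaippt: the function name split_metadata is hypothetical, and the sample dict is hand-written to stand in for real pptx_to_json() output.

```python
from __future__ import annotations

from typing import Any

# Metadata keys as documented in the table above.
METADATA_KEYS = ("_source", "_extraction_warnings")


def split_metadata(extracted: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return (content without metadata keys, list of extraction warnings).

    Hypothetical helper; the input dict is left unchanged.
    """
    warnings = list(extracted.get("_extraction_warnings", []))
    content = {k: v for k, v in extracted.items() if k not in METADATA_KEYS}
    return content, warnings


# Hand-written stand-in for the dict returned by pptx_to_json().
sample = {
    "_source": "extracted",
    "_extraction_warnings": [
        "background_image: file path not recoverable from PPTX binary"
    ],
    "presentation_title": "Great Faith",
    "sections": [],
}

content, warnings = split_metadata(sample)
for message in warnings:
    print(f"warning: {message}")
```

Surfacing the warnings separately makes it easier to spot lossy extractions before the data is reused.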
Feature Extraction Table
All features from the praisonaippt JSON schema are handled:
| Feature | Extraction | Notes |
|---|---|---|
| presentation_title | Lossless | Largest font run on slide 0 |
| presentation_subtitle | Lossless | Second text block on slide 0 (may be a Bible reference) |
| slide_size | Lossless | Mapped from slide dimensions |
| slide_style.background_color | Lossless | Extracted from solid fill |
| slide_style.background_image | Lossy | Image detected but path not recoverable; noted in _extraction_warnings |
| slide_style.text_color | Best-effort | Most common run color |
| slide_style.reference_position | Lossless | Inferred from textbox vertical position |
| slide_style.alignment | Best-effort | Most frequent paragraph alignment |
| slide_style.font_name | Lossless | Most common run.font.name |
| slide_style.highlight_color | Best-effort | Most common non-body run color |
| slide_style.annotation_color | Best-effort | Superscript-baseline run color |
| sections[].section | Lossless | Single bold block on section slide |
| sections[].verses = [] | Lossless | Empty sections preserved |
| verses[].reference | Lossless | Detected via Unicode-aware regex |
| verses[].reference = "" | Lossless | Empty reference emitted correctly |
| verses[].text | Lossless | Body content including verse numbers |
| verses[].text = "" | Lossless | Empty text preserved |
| verses[].highlights (strings) | Best-effort | Colored runs matching highlight color |
| verses[].highlights (objects with color/bold/underline) | Lossless | Run color/formatting extracted |
| verses[].highlights[].annotation number | Lossy | Bubble chars (❶❷) not recoverable; omitted from output |
| verses[].large_text | Lossless | Runs with font size ≥ 1.4× body size |
| verses[].list_type | Lossless | Bullet prefix • → "bullet", N. → "numbered" |
| verses[].font_size | Lossless | Per-verse body font size |
| verses[].alignment | Lossless | Paragraph alignment |
| Tamil / Hebrew / Unicode text | Lossless | Full Unicode support throughout |
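Several rows above describe "most common value" heuristics and a 1.4× size threshold for large_text. The sketch below only illustrates those two ideas on plain Python values; it is not the library's implementation, and the helper names most_common and is_large_text are assumptions made for this example.

```python
from collections import Counter


def most_common(values, default=None):
    """Return the most frequent non-None value, or default if there is none."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else default


def is_large_text(run_size_pt, body_size_pt, ratio=1.4):
    """True when a run's font size is at least ratio times the body size."""
    return run_size_pt >= ratio * body_size_pt


# Font names as they might be read from individual runs (None = inherited).
run_fonts = ["Palatino", "Palatino", None, "Arial"]
print(most_common(run_fonts))

# An 80 pt run against a 32 pt body clears the threshold; a 36 pt run does not.
print(is_large_text(80, 32))
print(is_large_text(36, 32))
```

Because these are frequency and ratio heuristics, slides with unusual formatting can produce surprising values, which is why some rows are marked best-effort.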
Round-Trip Workflow
from praisonaippt import create_presentation, pptx_to_json
# 1. Create PPTX from JSON
data = {
    "presentation_title": "Great Faith",
    "sections": [
        {
            "section": "1. Centurion",
            "verses": [
                {
                    "reference": "Matthew 8:10 (NKJV)",
                    "text": "I have not found such great faith.",
                    "highlights": ["great faith"]
                }
            ]
        }
    ]
}
pptx_path = create_presentation(data, output_file="great_faith.pptx")
# 2. Extract JSON back
extracted = pptx_to_json(pptx_path)
# 3. Feed back into create_presentation (works without stripping _ keys)
roundtrip = create_presentation(extracted, output_file="roundtrip.pptx")
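To check how faithful a round trip was, one option is to compare the original dict with the extracted one while ignoring underscore-prefixed metadata keys. This is a rough sketch under that assumption; strip_private and changed_fields are hypothetical helpers, and the dicts are hand-written rather than real extraction output.

```python
def strip_private(value):
    """Recursively drop dict keys that start with an underscore."""
    if isinstance(value, dict):
        return {
            k: strip_private(v)
            for k, v in value.items()
            if not (isinstance(k, str) and k.startswith("_"))
        }
    if isinstance(value, list):
        return [strip_private(v) for v in value]
    return value


def changed_fields(original, extracted):
    """Top-level keys of original whose value differs in extracted."""
    cleaned = strip_private(extracted)
    return sorted(k for k, v in original.items() if cleaned.get(k) != v)


original = {
    "presentation_title": "Great Faith",
    "sections": [{"section": "1. Centurion", "verses": []}],
}
# Stand-in for pptx_to_json() output: same content plus metadata and defaults.
extracted = {
    "_source": "extracted",
    "_extraction_warnings": [],
    "presentation_title": "Great Faith",
    "slide_size": "widescreen",
    "sections": [{"section": "1. Centurion", "verses": []}],
}

print(changed_fields(original, extracted))  # prints []
```

Only keys present in the original are compared, so extra fields that the extractor fills in (such as slide_size here) do not count as differences.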
Batch Extraction
import os
from praisonaippt import pptx_to_json
pptx_files = [f for f in os.listdir('.') if f.endswith('.pptx')]
for pptx in pptx_files:
    json_out = pptx.replace('.pptx', '.json')
    pptx_to_json(pptx, output_path=json_out)
    print(f"Extracted: {pptx} → {json_out}")
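For larger batches it can help to keep going when one file fails and to collect the errors. The sketch below takes the extractor as a parameter so it runs without praisonaippt installed. batch_extract and fake_extract are hypothetical names, and catching FileNotFoundError and ValueError assumes the exceptions listed in the Raises table are the relevant ones.

```python
import tempfile
from pathlib import Path
from typing import Callable


def batch_extract(directory, extract: Callable[..., dict]):
    """Call extract(path, output_path=...) for every .pptx file in directory.

    Returns (written_paths, failures); failures maps file name to error text.
    """
    written, failures = [], {}
    for pptx in sorted(Path(directory).glob("*.pptx")):
        json_out = pptx.with_suffix(".json")
        try:
            extract(str(pptx), output_path=str(json_out))
        except (FileNotFoundError, ValueError) as exc:
            failures[pptx.name] = str(exc)
        else:
            written.append(json_out)
    return written, failures


def fake_extract(path, output_path=None):
    """Stand-in for pptx_to_json so the sketch runs without real PPTX files."""
    if Path(path).name.startswith("broken"):
        raise ValueError("simulated extraction failure")
    Path(output_path).write_text("{}", encoding="utf-8")
    return {}


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pptx", "broken.pptx", "notes.txt"):
        (Path(tmp) / name).touch()
    written, failures = batch_extract(tmp, fake_extract)
    written_names = [p.name for p in written]
    print(written_names, failures)
```

In real use you would pass the library function instead of the stand-in, for example batch_extract(".", pptx_to_json).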
Known Limitations
| Limitation | Reason | Workaround |
|---|---|---|
| background_image path not in output | Image bytes are stored in the ZIP; path metadata stripped by PowerPoint | Manually add the path back to slide_style.background_image |
| Annotation numbers (❶❷) not recovered | Bubble chars are rendered glyphs with no metadata | Re-add annotation numbers manually if needed |
| Style heuristics are best-effort | No semantic metadata in PPTX format | Visual inspection recommended for complex slides |
| Externally-authored PPTX may differ | Non-praisonaippt PPTX may use different layouts | Heuristics handle most cases; failures produce per-slide warnings |
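The background_image workaround can be scripted once you know where the original image lives. This is a minimal sketch: restore_background_image is a hypothetical helper, the path assets/background.jpg is made up, and matching warnings by the "background_image" prefix is an assumption based on the example warning shown earlier.

```python
import copy


def restore_background_image(extracted, image_path):
    """Return a copy of extracted with slide_style.background_image filled in.

    An existing background_image value is kept; the input is not modified.
    """
    data = copy.deepcopy(extracted)
    data.setdefault("slide_style", {}).setdefault("background_image", image_path)
    # Drop the warning that this step resolves (prefix match is an assumption).
    if "_extraction_warnings" in data:
        data["_extraction_warnings"] = [
            w for w in data["_extraction_warnings"]
            if not w.startswith("background_image")
        ]
    return data


# Hand-written stand-in for pptx_to_json() output.
extracted = {
    "_source": "extracted",
    "_extraction_warnings": [
        "background_image: file path not recoverable from PPTX binary"
    ],
    "presentation_title": "Great Faith",
    "slide_style": {"text_color": "white"},
}

patched = restore_background_image(extracted, "assets/background.jpg")
print(patched["slide_style"])
```

The patched dict can then be passed back to create_presentation() like any other input.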
Related Documentation
- Python API Reference
- Command Reference
- PDF Conversion Guide
- Rich Text Formatting Guide
- Examples and Templates
Need help? Open an issue on GitHub