Format options control which data types are included in your response. Specify one or more formats to receive HTML, markdown, screenshots, or a list of page links alongside your extracted data.
Common uses:
- Full page content: Get raw HTML for custom processing
- Readable text: Convert pages to clean markdown format
- Visual records: Capture screenshots for monitoring or archival
- Link extraction: Get all URLs from the page for further crawling
You can combine multiple formats in a single request. All specified formats will be included in the response.
Supported parameters
Available in: Extract and Crawl.
| Parameter | Type | Description | Default |
|---|---|---|---|
| formats | List (String) | Sets which data types are included in your response | ["html"] |
| Format | Description | Use Case |
|---|---|---|
| html | Raw HTML content | Custom parsing, archival, full DOM access |
| markdown | Clean markdown conversion | Readable text, content analysis, LLM processing |
| screenshot | Page screenshot (base64) | Visual verification, monitoring, documentation |
| links | All extracted URLs | Link discovery, crawling, sitemap building |
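Since the formats parameter accepts only the four values in the table above, it can be useful to check a list client-side before sending a request. The following is a minimal sketch: the helper name normalize_formats is hypothetical and not part of the Nimble SDK, and the fallback to ["html"] simply mirrors the documented default.

```python
SUPPORTED_FORMATS = ("html", "markdown", "screenshot", "links")


def normalize_formats(formats=None):
    """Return a de-duplicated list of formats, rejecting unsupported values.

    Hypothetical helper for illustration; not part of the Nimble SDK.
    """
    if not formats:
        # Mirror the documented default when nothing is specified.
        return ["html"]
    unknown = [name for name in formats if name not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported formats: {unknown}")
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(formats))
```

The returned list can then be passed as the "formats" value of a request.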
Usage
Request a single format: html (the default). Best for:
- Custom HTML parsing
- Preserving exact page structure
- Accessing all DOM elements and attributes
- Archival purposes
from nimble import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")
result = nimble.extract({
    "url": "https://www.example.com",
    "formats": ["html"]  # default
})
# Access HTML content
html_content = result["data"]["html"]
print(html_content)
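As one illustration of custom HTML parsing, the sketch below pulls the page title out of an HTML string using only Python's standard library. The sample_html value is a made-up stand-in for result["data"]["html"], and TitleParser and extract_title are hypothetical names, not SDK functions.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only accumulate text while inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract_title(html_content):
    parser = TitleParser()
    parser.feed(html_content)
    return parser.title.strip()


# Stand-in for result["data"]["html"]; not real API output.
sample_html = (
    "<!DOCTYPE html><html><head><title>Example Domain</title></head>"
    "<body><h1>Hello</h1></body></html>"
)
print(extract_title(sample_html))
```

For heavier parsing work, a dedicated HTML library may be a better fit than the standard-library parser.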
Combine multiple formats to get different data representations:
from nimble import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")
result = nimble.extract({
    "url": "https://www.example.com",
    "formats": ["html", "markdown", "screenshot", "links"]
})
print(result)
Convert the page to clean, readable markdown. Best for:
- Clean text extraction
- Content analysis
- LLM processing
- Human-readable output
from nimble import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")
result = nimble.extract({
    "url": "https://www.example.com/article",
    "formats": ["markdown"]
})
# Access markdown content
markdown_content = result["data"]["markdown"]
print(markdown_content)
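For content analysis, a common first step is to outline the document by its headings. The sketch below is one way to do that with plain string handling. It assumes the markdown uses "#"-style headings, and extract_headings is a hypothetical helper rather than an SDK function.

```python
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def extract_headings(markdown_content):
    """Return (level, text) pairs for '#'-style headings, skipping fenced code."""
    headings = []
    in_fence = False
    for line in markdown_content.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Toggle on every fence line so '#' comments in code are ignored.
            in_fence = not in_fence
            continue
        if in_fence or not stripped.startswith("#"):
            continue
        hashes = len(stripped) - len(stripped.lstrip("#"))
        text = stripped[hashes:]
        # A heading needs 1-6 '#' characters followed by whitespace.
        if 1 <= hashes <= 6 and text[:1].isspace():
            headings.append((hashes, text.strip()))
    return headings


# Stand-in for result["data"]["markdown"]; not real API output.
sample_markdown = "# Article Title\n\nThis is the article content...\n\n## Details\n"
print(extract_headings(sample_markdown))
```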
Capture a visual snapshot of the page. Best for:
- Visual verification
- Monitoring page changes
- Documentation and reporting
- Debugging layout issues
import base64

from nimble import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")
result = nimble.extract({
    "url": "https://www.example.com",
    "formats": ["screenshot"],
    "render": True  # screenshots require rendering
})
# Access screenshot (base64 encoded)
screenshot_data = result["data"]["screenshot"]
# Decode and save
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(screenshot_data))
Screenshots require page rendering to be enabled (render: true). The image is returned as base64-encoded PNG data.
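Because the screenshot arrives as base64-encoded PNG data, a quick sanity check before writing it to disk is to confirm that the decoded bytes begin with the standard 8-byte PNG signature. This is a minimal sketch; decode_screenshot is a hypothetical helper, not part of the SDK.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(screenshot_data):
    """Decode a base64 screenshot string and confirm it looks like a PNG."""
    image_bytes = base64.b64decode(screenshot_data)
    if not image_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("Decoded data does not start with the PNG signature")
    return image_bytes
```

The returned bytes can be written to a file opened in "wb" mode, as in the example above.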
Extract all URLs found on the page. Best for:
- Link discovery
- Building sitemaps
- Crawling workflows
- Finding internal/external links
from nimble import Nimble

nimble = Nimble(api_key="YOUR-API-KEY")
result = nimble.extract({
    "url": "https://www.example.com",
    "formats": ["links"]
})
# Access extracted links
links = result["data"]["links"]
for link in links:
    print(link)
Combining with other features
Formats can be combined with parsing, browser actions, and other features in the same request:
from nimble import Nimble
from pydantic import BaseModel

# Schema describing the structured data to parse from the page
class Product(BaseModel):
    name: str
    price: float

nimble = Nimble(api_key="YOUR-API-KEY")
result = nimble.extract({
    "url": "https://www.example.com/product",
    "render": True,
    "formats": ["html", "markdown", "screenshot"],
    "schema": Product,
    "browser_actions": [
        {
            "wait": {
                "delay": 1000
            }
        }
    ]
})
# Access different formats
product_data = result["data"]["parsed"]
html = result["data"]["html"]
markdown = result["data"]["markdown"]
screenshot = result["data"]["screenshot"]
Example response
When formats are specified, all requested data is included in the response. The response includes:
- html: Raw HTML if requested
- markdown: Converted markdown if requested
- screenshot: Base64-encoded PNG if requested
- links: Array of extracted URLs if requested
- parsed: Structured data if parsing was used
- metadata: Execution details and formats included
{
  "status": "success",
  "data": {
    "html": "<!DOCTYPE html><html><head>...</head><body>...</body></html>",
    "markdown": "# Article Title\n\nThis is the article content...",
    "screenshot": "iVBORw0KGgoAAAANSUhEUgAAA...",
    "links": [
      "https://www.example.com/about",
      "https://www.example.com/contact",
      "https://www.example.com/products",
      "https://external-site.com"
    ],
    "parsed": {
      "title": "Example Article",
      "author": "John Doe"
    }
  },
  "metadata": {
    "driver": "vx8",
    "execution_time_ms": 1850,
    "formats_included": ["html", "markdown", "screenshot", "links"]
  }
}
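Since each format key appears under data only when it was requested, downstream code may want to confirm that everything it asked for actually came back. The sketch below assumes the response is a plain dictionary shaped like the example above. missing_formats is a hypothetical helper, and sample_response is illustrative data rather than real API output.

```python
def missing_formats(response, requested):
    """List requested formats that are absent or empty in response["data"]."""
    data = response.get("data") or {}
    return [name for name in requested if not data.get(name)]


# Illustrative response with only two formats present.
sample_response = {
    "status": "success",
    "data": {
        "html": "<!DOCTYPE html><html>...</html>",
        "links": ["https://www.example.com/about"],
    },
}
print(missing_formats(sample_response, ["html", "markdown", "links"]))
```

Note that this treats an empty value, such as a page with no links, the same as a missing one. Whether that is appropriate depends on the workflow.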
Best practices
Choose formats based on your needs:
- Use html when you need full DOM access
- Use markdown for clean text and content analysis
- Use screenshot for visual verification
- Use links for discovering URLs to crawl
Avoid unnecessary formats:
# ❌ Don't request all formats if you only need one
formats=["html", "markdown", "screenshot", "links"]
# ✅ Request only what you need
formats=["markdown"]
- Each format adds processing time
- Screenshots require rendering and are slower
- HTML and markdown are faster to generate
- Request only needed formats for optimal performance
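One way to keep requests lean is to build the payload from the formats you actually need and switch on rendering only when a screenshot is among them. The sketch below uses the url, formats, and render keys shown in the earlier examples. build_extract_payload is a hypothetical helper, not an SDK function.

```python
def build_extract_payload(url, formats):
    """Build a request dict, enabling rendering only when a screenshot is requested."""
    payload = {"url": url, "formats": list(formats)}
    if "screenshot" in payload["formats"]:
        # Screenshots require page rendering to be enabled.
        payload["render"] = True
    return payload


print(build_extract_payload("https://www.example.com", ["markdown"]))
print(build_extract_payload("https://www.example.com", ["markdown", "screenshot"]))
```

The resulting dictionary could then be passed to nimble.extract in place of a hand-written one.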
Link filtering
Process links after extraction:
# Filter internal links only
internal_links = [
    link for link in result["data"]["links"]
    if link.startswith("https://www.example.com")
]
# Filter by file type
pdf_links = [
    link for link in result["data"]["links"]
    if link.endswith(".pdf")
]
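Plain prefix and suffix checks can misclassify some URLs: a prefix test also matches hosts such as www.example.com.evil.test, and a suffix test misses PDF links that carry a query string. A slightly stricter variant, sketched here with the standard library's urllib.parse, compares the parsed hostname and path instead. The helper names and the sample links are illustrative assumptions.

```python
from urllib.parse import urlparse


def is_internal(link, site_host="www.example.com"):
    """True when the link's hostname is exactly the site's hostname."""
    return urlparse(link).hostname == site_host


def has_extension(link, extension=".pdf"):
    """True when the URL path (ignoring query and fragment) ends with the extension."""
    return urlparse(link).path.lower().endswith(extension)


# Illustrative links; in practice this would be result["data"]["links"].
links = [
    "https://www.example.com/about",
    "https://www.example.com.evil.test/login",
    "https://www.example.com/files/report.PDF?download=1",
    "https://external-site.com",
]
internal_links = [link for link in links if is_internal(link)]
pdf_links = [link for link in links if has_extension(link)]
print(internal_links)
print(pdf_links)
```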