Written By Hanzala Saleem
Updated At May 21, 2026 | 10 min read
A webpage to Markdown converter fetches a live URL, renders the page in a real browser, strips out ads, navigation, scripts, and other non-content noise, then returns clean Markdown (.md). The fastest path for one-off use is a browser extension. The fastest path for production is an API like ScreenshotAPI's extract_markdown parameter. Both paths are covered below with working code.
This guide compares every practical approach to converting a webpage to Markdown: browser extensions for one-off grabs, command-line tools for local scripting, Python and Node.js libraries for static HTML, headless-browser pipelines for JavaScript-rendered SPAs, and managed APIs for automated workflows. Working code examples in Python and Node.js are included for the API path, along with a feature comparison table you can use to pick the right tool.
A webpage to Markdown converter is a tool or API that fetches a live URL, renders the page (executing JavaScript when needed), extracts only the meaningful content (headings, paragraphs, code blocks, tables, links), strips out navigation, ads, scripts, cookie banners, and trackers, and returns the result as clean Markdown in a .md file.
The distinction matters: converting a raw HTML file to Markdown is straightforward. Converting a live webpage is harder because many modern sites are JavaScript-rendered SPAs (React, Next.js, Vue), where the real content doesn't exist in the initial HTML response; it loads dynamically. For those sites, a webpage to Markdown tool must render the page in a headless browser before extraction.
Markdown is a lightweight markup language created by John Gruber, in collaboration with Aaron Swartz, in 2004. The premise: write plain text with a handful of readable formatting conventions, and it renders as structured HTML. The original 2004 release defined the core syntax still used today; later variants like CommonMark, GitHub Flavored Markdown, and MultiMarkdown standardized or extended it but kept the core.
Everything in Markdown is designed to be readable even before it's rendered, which is exactly why it's become the standard format for:
Raw HTML is not the same as readable content. A typical modern webpage carries tens of kilobytes of <div> soup, inline event handlers, cookie banners, and CSS class names that carry zero informational value.
Converting a webpage or URL to Markdown solves this cleanly. You get headings, paragraphs, code blocks, tables, and links without the surrounding noise.
At a technical level, converting a live URL to Markdown involves four distinct steps:
A plain HTTP GET returns only the initial HTML shell. On a server-rendered or static site, that may be enough, but for the JavaScript-heavy frameworks that dominate modern web development (React, Vue, Next.js, SvelteKit, Nuxt), the meaningful content does not exist in the initial response. A headless browser (Chromium driven by Playwright or Puppeteer, Firefox via geckodriver) must launch, navigate to the URL, wait for the page to finish executing, and then capture the fully rendered DOM. This is the most expensive step in the pipeline, and the one most converters get wrong.
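One way to avoid launching a headless browser for every URL is to check whether a plain GET already returned real text. The sketch below is a rough heuristic, not part of any tool named in this guide: the class name, the function name, and the 200-character threshold are illustrative assumptions.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts text characters outside script, style, noscript, and template."""

    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(raw_html, min_visible_chars=200):
    """Return True when the raw HTML carries very little visible text.

    min_visible_chars is an arbitrary illustrative threshold, not a standard.
    """
    counter = VisibleTextCounter()
    counter.feed(raw_html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

A pipeline could call `looks_like_js_shell` on the plain GET response and fall back to browser rendering only when it returns True. Pages that ship a loading message or a large server-rendered header can fool a check like this, so treat it as a cost optimization rather than a guarantee.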
Raw DOM still contains navigation menus, footers, sidebars, related-post widgets, comment sections, and ads. The conversion logic identifies the primary content block, typically using readability-style heuristics (text density, link-to-text ratio, heading proximity) or a tuned content-extraction library like Mozilla's Readability.js. Everything else is discarded.
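To make the text-density and link-to-text heuristics concrete, here is a heavily simplified sketch. It is not how Readability.js works internally; the container list, the scoring formula, and every name in it are assumptions chosen for illustration, and it expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements treated as candidate content containers (an illustrative choice)
CANDIDATES = {"article", "main", "section", "div", "nav", "aside", "footer", "header"}
IGNORED = {"script", "style"}


class BlockScorer(HTMLParser):
    """Credits each text run to the innermost open candidate container."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict per candidate container seen
        self.open = []        # indexes into self.blocks for open containers
        self.link_depth = 0
        self.ignore_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATES:
            self.blocks.append({"tag": tag, "text": 0, "link_text": 0})
            self.open.append(len(self.blocks) - 1)
        elif tag == "a":
            self.link_depth += 1
        elif tag in IGNORED:
            self.ignore_depth += 1

    def handle_endtag(self, tag):
        if tag in CANDIDATES and self.open:
            self.open.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in IGNORED and self.ignore_depth:
            self.ignore_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        if length and self.open and not self.ignore_depth:
            block = self.blocks[self.open[-1]]
            block["text"] += length
            if self.link_depth:
                block["link_text"] += length


def pick_main_block(page_html):
    """Return the container with the most text after a link-density penalty."""
    scorer = BlockScorer()
    scorer.feed(page_html)
    scorer.close()

    def score(block):
        if block["text"] == 0:
            return 0.0
        link_ratio = block["link_text"] / block["text"]
        return block["text"] * (1.0 - link_ratio)

    return max(scorer.blocks, key=score, default=None)
```

Navigation bars and footers score near zero because almost all of their text sits inside links, while an article body keeps most of its score. Production extractors add many more signals, such as class-name hints and score propagation to parent elements.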
Strip remaining inline styles, tracking attributes (data-analytics-*, aria-* where not semantically required), empty elements, and any script tags that survived extraction. This step is where the converter decides what counts as semantic content versus what counts as presentation noise. Bad decisions here either lose meaning (dropping useful <aside> content) or keep clutter (preserving "share on Twitter" buttons).
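A minimal sketch of this cleaning pass, using only the standard library, might look like the following. The rule set is an assumption made for illustration: it drops `style`, `on*` event handlers, and every `data-*` attribute, removes script, style, and noscript elements, and leaves `aria-*` attributes and empty elements alone because deciding which of those are safe to remove needs more context.

```python
import html
from html.parser import HTMLParser

DROPPED_TAGS = {"script", "style", "noscript"}


def keep_attribute(name):
    """Drop inline styles, event handlers, and data-* attributes."""
    return not (name == "style" or name.startswith("on") or name.startswith("data-"))


class HtmlCleaner(HTMLParser):
    """Re-serializes HTML while omitting dropped tags and attributes."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def _open_tag(self, tag, attrs, self_closing=False):
        kept = "".join(
            f' {name}="{html.escape(value or "", quote=True)}"'
            for name, value in attrs
            if keep_attribute(name)
        )
        return f"<{tag}{kept}{' /' if self_closing else ''}>"

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROPPED_TAGS and not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))


def clean_html(markup):
    cleaner = HtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.out)
```

HTML comments are discarded as a side effect, since the parser's default comment handler does nothing.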
Walk the cleaned DOM tree and map each HTML element to its Markdown equivalent. Here's a quick reference for how common HTML maps to Markdown:
| HTML Element | Markdown Equivalent |
|---|---|
| <h1>, <h2>, <h3> | #, ##, ### |
| <strong>, <b> | **bold** |
| <em>, <i> | *italic* |
| <a href="..."> | [text](url) |
| <code> | `code` |
| <pre><code> | ```language |
| <ul>, <ol> | - item, 1. item |
| <blockquote> | > quote |
| <img src="..."> | ![alt](src) |
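The mappings in the table can be sketched as a small event-driven converter. This is an illustration of the technique only, far from a complete converter: it handles flat lists, single-paragraph blockquotes, and unlabeled code blocks, and all class and function names are hypothetical. Libraries such as Turndown or markdownify cover the many edge cases it ignores.

```python
import re
from html.parser import HTMLParser

FENCE = "`" * 3  # built at runtime so this listing contains no literal fence


class MiniMarkdown(HTMLParser):
    """Maps a small subset of HTML tags to Markdown."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href = ""
        self.lists = []        # stack of [kind, item_count] for open lists
        self.in_pre = False
        self.in_quote = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if re.fullmatch(r"h[1-6]", tag):
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p" and not self.in_quote:
            self.parts.append("\n\n")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.href = attrs.get("href") or ""
            self.parts.append("[")
        elif tag == "pre":
            self.in_pre = True
            self.parts.append("\n\n" + FENCE + "\n")
        elif tag == "code" and not self.in_pre:
            self.parts.append("`")
        elif tag in ("ul", "ol"):
            if not self.lists:
                self.parts.append("\n")  # blank line before a top-level list
            self.lists.append([tag, 0])
        elif tag == "li":
            current = self.lists[-1] if self.lists else ["ul", 0]
            current[1] += 1
            marker = f"{current[1]}. " if current[0] == "ol" else "- "
            self.parts.append("\n" + marker)
        elif tag == "blockquote":
            self.in_quote = True
            self.parts.append("\n\n> ")
        elif tag == "img":
            self.parts.append(f"![{attrs.get('alt') or ''}]({attrs.get('src') or ''})")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.parts.append(f"]({self.href})")
        elif tag == "pre":
            self.in_pre = False
            self.parts.append("\n" + FENCE + "\n\n")
        elif tag == "code" and not self.in_pre:
            self.parts.append("`")
        elif tag in ("ul", "ol"):
            if self.lists:
                self.lists.pop()
            self.parts.append("\n")
        elif tag == "blockquote":
            self.in_quote = False
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Keep code verbatim; collapse whitespace runs everywhere else
        self.parts.append(data if self.in_pre else re.sub(r"\s+", " ", data))


def html_to_markdown(markup):
    converter = MiniMarkdown()
    converter.feed(markup)
    converter.close()
    text = "".join(converter.parts)
    text = re.sub(r"[ \t]+\n", "\n", text)   # trim trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()
```

The final cleanup also touches code blocks (it trims trailing spaces and collapses repeated blank lines), which is acceptable for a sketch but would need to be excluded in a real converter.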
ScreenshotAPI is primarily known as a screenshot and webpage capture service, but it ships with a content extraction feature that returns structured Markdown from any URL via the extract_markdown parameter. The feature uses a real Chromium browser to render the page, then walks the rendered DOM and returns a downloadable .md file alongside the screenshot.
This matters when you need both a visual record and the underlying content in the same request (compliance archiving, content audits, RAG pipelines that store both an image preview and the searchable text), or when you want to add Markdown extraction to an existing screenshot workflow with one extra parameter rather than introducing a second tool.
Base endpoint:
https://shot.screenshotapi.net/v3/screenshot
Key parameters for content extraction:
| Parameter | Value | Description |
|---|---|---|
| token | Your API key | Required for authentication |
| url | Target URL (percent-encoded in the query string) | The webpage to process |
| output | json | Returns JSON with text content and screenshot URL |
| extract_markdown | true | Generates a .md file of the readable content and returns its download URL in the JSON response |
| block_ads | true | Strips ads before extraction |
| fresh | true | Bypasses cache, fetches live content |
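The target URL has to be percent-encoded exactly once when it is placed in the query string. As a small sketch using only the standard library, with the endpoint and parameter names taken from the table above and a hypothetical helper name, the request URL could be assembled like this:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://shot.screenshotapi.net/v3/screenshot"


def build_request_url(token, target_url):
    """Return the full request URL with every parameter encoded once."""
    params = {
        "token": token,
        "url": target_url,  # raw URL; urlencode percent-encodes it for us
        "output": "json",
        "extract_markdown": "true",
        "block_ads": "true",
        "fresh": "true",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

HTTP clients that accept a params mapping, such as `requests`, perform the same encoding themselves, so pre-encoding the target URL before handing it to them would encode it twice.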
Python example:

```python
import requests

API_TOKEN = "your_api_token_here"
TARGET_URL = "https://example.com/article"

params = {
    "token": API_TOKEN,
    # Pass the raw URL: requests percent-encodes query parameters itself,
    # so quoting it first would double-encode it.
    "url": TARGET_URL,
    "output": "json",
    "extract_markdown": "true",
    "block_ads": "true",
    "fresh": "true",
}

# Ask the API to render the page and generate the Markdown file
response = requests.get(
    "https://shot.screenshotapi.net/v3/screenshot",
    params=params,
    timeout=60,
)
response.raise_for_status()
data = response.json()

# The JSON response carries a download URL for the generated .md file
markdown_file = data.get("markdown_file")
if not markdown_file:
    raise RuntimeError("markdown_file not found in response")

# Download the Markdown content
markdown_response = requests.get(markdown_file, timeout=60)
markdown_response.raise_for_status()
markdown_content = markdown_response.text

with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

print(f"Extracted {len(markdown_content)} characters of Markdown")
```

Node.js example:

```javascript
const axios = require('axios');
const fs = require('fs');

const API_TOKEN = 'your_api_token_here';
const TARGET_URL = 'https://example.com/article';

async function extractMarkdown(url) {
  // URLSearchParams percent-encodes values itself, so pass the raw URL
  // (wrapping it in encodeURIComponent would double-encode it).
  const params = new URLSearchParams({
    token: API_TOKEN,
    url,
    output: 'json',
    extract_markdown: 'true',
    block_ads: 'true',
    fresh: 'true'
  });

  const response = await axios.get(
    `https://shot.screenshotapi.net/v3/screenshot?${params}`
  );

  const markdownFile = response.data.markdown_file;
  if (!markdownFile) {
    throw new Error('markdown_file not found in response');
  }

  // Fetch markdown content
  const markdownResponse = await axios.get(markdownFile);
  const markdown = markdownResponse.data || '';

  fs.writeFileSync('output.md', markdown, 'utf8');
  console.log(`Extracted ${markdown.length} characters of Markdown`);
  return markdown;
}

extractMarkdown(TARGET_URL).catch(console.error);
```

Bulk Python example:

```python
import requests
import time

API_TOKEN = "your_api_token_here"
URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
]
API_ENDPOINT = "https://shot.screenshotapi.net/v3/screenshot"


def extract_markdown(url):
    params = {
        "token": API_TOKEN,
        "url": url,  # raw URL; requests handles the percent-encoding
        "output": "json",
        "extract_markdown": "true",
        "block_ads": "true",
        "fresh": "true",
    }

    # Request ScreenshotAPI
    response = requests.get(API_ENDPOINT, params=params, timeout=60)
    response.raise_for_status()
    data = response.json()

    # Get markdown file URL
    markdown_file = data.get("markdown_file")
    if not markdown_file:
        raise RuntimeError(f"markdown_file not found for {url}")

    # Download markdown content
    markdown_response = requests.get(markdown_file, timeout=60)
    markdown_response.raise_for_status()
    return markdown_response.text


for url in URLS:
    try:
        # Use the last path segment as the file name
        slug = url.rstrip("/").split("/")[-1] or "index"

        # Extract markdown content
        markdown_content = extract_markdown(url)

        # Save markdown file
        with open(f"{slug}.md", "w", encoding="utf-8") as f:
            f.write(markdown_content)
        print(f"Saved {slug}.md ({len(markdown_content)} characters)")

        # Respect rate limits
        time.sleep(1)
    except Exception as e:
        print(f"Failed to process {url}: {e}")
```

Problem: You have a documentation site or a set of reference URLs you want to make searchable by an LLM.
Workflow:
Why Markdown? HTML to Markdown conversion reduces token count significantly while preserving the semantic hierarchy (headings, lists, code blocks) that improves retrieval accuracy in RAG systems.
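Because the heading hierarchy survives conversion, a RAG ingestion step can split the Markdown at headings and keep the heading trail as context for each chunk. The sketch below is one possible approach under simple assumptions (ATX-style `#` headings, one chunk per section, no size limit); the function name and the output shape are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # built at runtime so this listing contains no literal fence


def chunk_by_heading(markdown):
    """Split Markdown into one chunk per section, tagged with its heading path."""
    chunks = []
    path = []        # current heading trail, e.g. ["Guide", "Install"]
    lines = []       # body lines collected for the current section
    in_fence = False

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"path": " > ".join(path), "text": body})
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with "#" inside a code fence are not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded together with its `path`, so a passage from an "Install" section still carries the name of the guide it belongs to.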
Problem: You want to detect when competitors update their pricing page, feature list, or documentation.
Workflow:
Result: A clean, readable delta, not a messy HTML diff full of class name changes.
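As an illustration of how such a delta could be produced, the sketch below compares two stored Markdown snapshots with the standard library's `difflib`. The function name and labels are hypothetical, and how the snapshots are stored and scheduled is left out.

```python
import difflib


def markdown_delta(old_md, new_md, label="page"):
    """Return a unified diff between two Markdown snapshots ('' if unchanged)."""
    diff = difflib.unified_diff(
        old_md.splitlines(),
        new_md.splitlines(),
        fromfile=f"{label} (previous)",
        tofile=f"{label} (current)",
        lineterm="",
        n=1,  # one line of context around each change
    )
    return "\n".join(diff)
```

An empty string means the extracted content did not change, which makes the function easy to use as a trigger for alerts.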
Problem: Migrating a WordPress blog with 300+ posts to a Markdown-based static site generator (Hugo, Astro).
Workflow:
Extract and store third-party API documentation, changelog pages, or reference material as versioned Markdown files for offline use or internal knowledge bases, especially useful when you can't guarantee a vendor's docs will remain available.
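A simple way to keep such an archive versioned is to write a new dated file only when the extracted Markdown actually changed. The following is a sketch under stated assumptions: the file naming scheme, the function name, and the use of ISO-style date stamps (so that file names sort chronologically) are all illustrative choices.

```python
from pathlib import Path


def save_snapshot(markdown, directory, name, date_stamp):
    """Write <name>-<date_stamp>.md unless the latest snapshot is identical.

    Returns the new file's Path, or None when nothing changed.
    """
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)

    # ISO date stamps sort lexicographically, so the last file is the newest
    previous = sorted(folder.glob(f"{name}-*.md"))
    if previous and previous[-1].read_text(encoding="utf-8") == markdown:
        return None

    target = folder / f"{name}-{date_stamp}.md"
    target.write_text(markdown, encoding="utf-8")
    return target
```

Committing the resulting folder to a Git repository would give the same history with diffs for free; the dated-file approach is just the option with the fewest moving parts.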
There are several practical ways to convert a URL to Markdown. The right choice depends on whether you need automation, whether the target sites use JavaScript rendering, and how much infrastructure you want to operate.
| Approach | Good For | JavaScript Rendering | Automation | Cost |
|---|---|---|---|---|
| Browser extension (MarkDownload, etc.) | Quick one-off grabs | Supported | Not Supported | Free |
| Online converter tool | Manual, occasional use | Limited / Varies | Not Supported | Free / Freemium |
| Pandoc (CLI) | Local HTML files, scripted conversion | Not Supported | Supported | Free |
| Turndown.js | In-browser or Node.js projects | Not Supported (requires pre-fetched HTML) | Supported | Free |
| html2text (Python) | Server-side pipeline, simple pages | Not Supported | Supported | Free |
| markdownify (Python) | Python pipelines, static pages | Not Supported | Supported | Free |
| Playwright + Turndown | Full control, custom pipelines | Supported | Supported | Free (infrastructure cost) |
| ScreenshotAPI | Automated pipelines, screenshot + extraction combined | Supported | Supported | Paid (free tier available) |
How to choose:
Converting webpages to Markdown is no longer a niche developer trick. With the growth of RAG systems, local knowledge bases, and LLM-powered applications, it's become a standard data preparation step across the industry.
The HTML-to-Markdown path strips away everything that doesn't matter (the layout, the tracking, the navigation chrome) and gives you content that is portable, version-controllable, and readable by both humans and machines.
For production pipelines at scale, the combination of headless browser rendering (to handle modern SPAs) and clean Markdown output is the pattern that holds up. Whether you self-host that with Playwright or use a managed service like ScreenshotAPI depends on how much infrastructure you want to own.
The working code examples in this guide are a starting point. The hardest part is usually deciding what to build once you have clean, structured content flowing reliably. That's a good problem to have.
A webpage to Markdown converter is a tool or API that fetches a URL, renders the page in a browser (to handle JavaScript), extracts the main readable content, and returns it formatted as Markdown, stripping out HTML tags, scripts, ads, navigation, and other non-content elements.
The most common reasons are: building RAG pipelines (clean Markdown reduces token usage and improves retrieval quality), migrating content to Markdown-based static site generators, automating documentation extraction, monitoring competitor content changes, and feeding structured content into LLMs.
Yes. ScreenshotAPI's extract_markdown=true parameter generates a .md file alongside the screenshot output and returns a download URL in the JSON response. The feature renders the page in a real Chromium browser before extraction, so JavaScript-rendered SPAs are handled correctly. See the scraping documentation for the complete parameter reference.
Only if the extraction tool uses a real headless browser. Simple HTTP fetchers return the initial HTML shell without executing JavaScript, so SPA-built pages will be empty or incomplete. Services like ScreenshotAPI use full Chromium rendering, so dynamically loaded content is captured correctly.
HTML to Markdown converts a raw HTML string or a local file. Webpage to Markdown fetches a live URL, renders it in a browser (executing JavaScript), extracts the main content block, and then converts it. The live URL path is more complex and requires network access, JS execution, and content isolation before the actual conversion step.
Yes. Markdown is significantly more token-efficient than raw HTML and preserves semantic structure (heading hierarchy, lists, code blocks) that helps language models reason over content more accurately. This is why it's become the standard pre-processing format for RAG systems.
The most commonly used Python libraries are markdownify (pip install markdownify) for simple HTML strings, and html2text for a more configurable conversion. For live URLs with JavaScript rendering, you'd combine Playwright for fetching with one of these libraries for conversion.
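A minimal sketch of that combination is shown below, assuming the third-party packages `playwright` (with a Chromium build installed via `playwright install chromium`) and `markdownify` are available. The function name, the 30-second timeout, and the choice to wait for network idle are assumptions; some pages never go idle and need a different wait condition.

```python
def fetch_rendered_markdown(url, timeout_ms=30_000):
    """Render a URL in headless Chromium and convert the full DOM to Markdown."""
    # Imported inside the function so this module still loads when the
    # optional third-party packages are not installed.
    from markdownify import markdownify
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so client-side content loads
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            rendered_html = page.content()
        finally:
            browser.close()

    # Converts the whole page, including navigation and footers
    return markdownify(rendered_html, heading_style="ATX")
```

Calling `fetch_rendered_markdown("https://example.com/article")` returns the Markdown as a string. Note that this sketch skips the content-isolation step, so menus and footers end up in the output unless the HTML is filtered first.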
Crawl the sitemap (usually at /sitemap.xml) to collect all URLs, then run a batch extraction script like the bulk Python example above against each URL. Process pages in sequence with rate limiting tuned to your API plan, save each as an individual .md file using the URL slug as the filename, and optionally add YAML frontmatter with metadata (title, url, extraction_date, tags). For sites without a sitemap, crawl from the homepage and follow internal links breadth-first.
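The sitemap and frontmatter steps can be sketched with the standard library alone. The helper names are hypothetical, the sitemap is assumed to be a plain `urlset` (not a sitemap index), and the slug is built from the full URL path rather than the last segment so that pages in different folders do not overwrite each other.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> URL listed in a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def slug_for(url):
    """Turn a URL path into a file-name slug, e.g. /blog/post/ -> blog-post."""
    path = urlsplit(url).path.strip("/")
    return path.replace("/", "-") or "index"


def with_frontmatter(markdown, title, url, extraction_date, tags):
    """Prepend YAML frontmatter; values are JSON-quoted, which YAML accepts."""
    meta = {
        "title": title,
        "url": url,
        "extraction_date": extraction_date,
        "tags": list(tags),
    }
    lines = ["---"]
    lines += [f"{key}: {json.dumps(value)}" for key, value in meta.items()]
    lines.append("---")
    return "\n".join(lines) + "\n\n" + markdown
```

In a full run, each URL from `urls_from_sitemap` would be passed to the extraction call, and the result written to `slug_for(url) + ".md"` after `with_frontmatter` adds the metadata.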