Website Scraping & Content Extraction

With ScreenshotAPI, you can pull structured data directly from web pages. The result can include the page text, the complete HTML markup, a Markdown version, and image links, so you can automate tasks that depend on structured page data.

It also supports extracting page content in a consistent format regardless of layout complexity or dynamic rendering. You can use it to convert web pages into machine-readable formats, build content pipelines, or collect assets like images and embedded resources for further processing.

Note: Web Scraping is only available when the output type is set to JSON and is not supported for Scrolling Screenshots or other visual-only render modes.
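
As a rough illustration of the note above, the following Python sketch assembles a request URL with the extraction flags enabled. The helper name build_extraction_url is hypothetical, and the output=json query parameter is an assumption about how the JSON output type is selected; check it against the output options for your account before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = "https://shot.screenshotapi.net/v3/screenshot"


def build_extraction_url(token, page_url, *, html=False, text=False,
                         markdown=False, image_urls=False):
    """Return a request URL with the selected extraction flags enabled."""
    params = {
        "token": token,
        "url": page_url,
        # Assumption: "output=json" selects the JSON output type.
        "output": "json",
    }
    flags = {
        "extract_html": html,
        "extract_text": text,
        "extract_markdown": markdown,
        "get_image_urls": image_urls,
    }
    # Every flag defaults to false, so only the enabled ones are sent.
    params.update({name: "true" for name, enabled in flags.items() if enabled})
    # urlencode percent-encodes the target page URL.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_extraction_url("TOKEN", "https://www.apple.com/",
                               text=True, markdown=True))
```

Building the query string from a dictionary avoids hand-encoding the target URL and keeps disabled flags out of the request.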


const axios = require("axios");

let config = {
	method: "get",
	maxBodyLength: Infinity,
	url: "https://shot.screenshotapi.net/v3/screenshot?token=TOKEN&url=https%3A%2F%2Fwww.apple.com%2F&extract_html=true&extract_text=true&extract_markdown=true&get_image_urls=true&[OPTIONS]",
	headers: { }
};

axios.request(config).then((response) => {
	// Print the JSON response returned by the API.
	console.log(JSON.stringify(response.data));
}).catch((error) => {
	console.log(error);
});


<?php
	// pecl_http classes live in the http namespace, which PHP separates with backslashes.
	$client = new http\Client;
	$request = new http\Client\Request;
	$request->setRequestUrl("https://shot.screenshotapi.net/v3/screenshot?token=TOKEN&url=https%3A%2F%2Fwww.apple.com%2F&extract_html=true&extract_text=true&extract_markdown=true&get_image_urls=true&[OPTIONS]");
	$request->setRequestMethod("GET");
	$request->setOptions(array());

	$client->enqueue($request)->send();
	$response = $client->getResponse();
	echo $response->getBody();
?>


package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	url := "https://shot.screenshotapi.net/v3/screenshot?token=TOKEN&url=https%3A%2F%2Fwww.apple.com%2F&extract_html=true&extract_text=true&extract_markdown=true&get_image_urls=true&[OPTIONS]"
	method := "GET"

	client := &http.Client { }
	req, err := http.NewRequest(method, url, nil)

	if err != nil {
		fmt.Println(err)
		return
	}

	res, err := client.Do(req)
	if err != nil {
		fmt.Println(err)
		return
	}

	defer res.Body.Close()
	body, err := io.ReadAll(res.Body)
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(string(body))
}


OkHttpClient client = new OkHttpClient().newBuilder().build();
// OkHttp rejects a GET request that carries a body, so none is passed here.
Request request = new Request.Builder()
	.url("https://shot.screenshotapi.net/v3/screenshot?token=TOKEN&url=https%3A%2F%2Fwww.apple.com%2F&extract_html=true&extract_text=true&extract_markdown=true&get_image_urls=true&[OPTIONS]")
	.method("GET", null)
	.build();
Response response = client.newCall(request)
	.execute();


import requests

url = "https://shot.screenshotapi.net/v3/screenshot?token=TOKEN&url=https%3A%2F%2Fwww.apple.com%2F&extract_html=true&extract_text=true&extract_markdown=true&get_image_urls=true&[OPTIONS]"
payload = {}
headers = {}

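# headers must be passed by keyword; requests.request accepts only method and url positionally.
response = requests.request("GET", url, headers=headers, data=payload)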
print(response.text)


require "uri"
require "net/http"

url = URI("https://shot.screenshotapi.net/v3/screenshot?token=TOKEN&url=https%3A%2F%2Fwww.apple.com%2F&extract_html=true&extract_text=true&extract_markdown=true&get_image_urls=true&[OPTIONS]")

https = Net::HTTP.new(url.host, url.port)
https.use_ssl = true

request = Net::HTTP::Get.new(url)

response = https.request(request)
puts response.read_body

Extract HTML

Parameter Name : extract_html

The extract_html parameter allows you to retrieve the HTML source of the rendered webpage after it has been processed by the browser. The output is returned as a .html file containing the final DOM structure of the page at the time of capture.

This includes the fully rendered HTML after JavaScript execution, meaning dynamic content, API-loaded data, and DOM modifications are reflected in the output. However, it does not include external assets such as CSS files, images, or other linked resources.

Options

true: Extracts and returns the rendered HTML content as a .html file.
false: Skips HTML extraction and only returns the visual output (screenshot/PDF/etc.).

When to use

Use this parameter when you need access to the underlying HTML structure of a webpage for analysis, debugging, SEO auditing, or content extraction workflows.
It is especially useful for developers and automation systems that require both visual and structural representations of a page.

Default value: false.
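
For the SEO auditing and analysis use cases mentioned above, a minimal sketch of inspecting the extracted HTML might look like the following. It assumes the .html file has already been downloaded into a string, and it uses only Python's standard html.parser module. The names PageSummary and summarize_html are hypothetical and not part of the API.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the <title>, the meta description, and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only accumulated while inside <title>.
        if self._in_title:
            self.title += data


def summarize_html(html):
    """Return the title, meta description, and links found in an HTML string."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "links": parser.links,
    }
```

Because the extracted file reflects the DOM after JavaScript execution, a summary like this also covers titles and links that scripts inject at load time.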

Extract Text

Parameter Name : extract_text

This parameter enables extraction of the plain text content from the rendered webpage after processing. It strips away all HTML tags, CSS styles, scripts, and other non-text resources, returning only the readable textual content of the page.

When enabled, the system generates a .txt file containing the extracted content and returns a URL to access it. This makes it suitable for downstream processing such as text analysis, indexing, summarization, or feeding into search and AI systems.

Options

true: Extracts all visible text from the rendered page and provides a downloadable .txt file containing the cleaned content.
false: Skips text extraction and only returns the visual render output.

When to use

Use this parameter when you need structured or unstructured text data from a webpage for analysis, SEO processing, content indexing, or AI-based workflows.
It is especially useful for converting web pages into machine-readable text formats for further processing.

Default value: false.

Extract Markdown

Parameter Name : extract_markdown

The extract_markdown parameter allows you to extract the rendered webpage content as a structured .md (Markdown) file after the page has been fully processed. It converts the visible content of the page into a clean, structured format while preserving key semantic elements such as headings, lists, links, and basic hierarchy.

Unlike raw HTML extraction, Markdown output removes unnecessary tags, scripts, and layout noise, resulting in a lightweight and readable representation of the page content that is easier to process and reuse.

When enabled, the system generates a separate .md file alongside the screenshot output and returns a downloadable URL, allowing you to access both the visual capture and structured text version of the same page.

Options

true: Extracts the webpage content as a structured Markdown file (.md) and returns a file URL.
false: Skips Markdown extraction and only returns the visual render output.

When to use

Use this parameter when you need clean, structured content from a webpage for documentation, data processing, or content transformation workflows.
It is especially useful for AI pipelines, RAG systems, knowledge base creation, and converting web content into reusable documentation formats.

Default value: false.
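
The sections above describe the extracted .html, .txt, and .md files as being returned through URLs, but the response field names are not listed here. As a hedged sketch that avoids guessing those names, the hypothetical helper below walks a parsed JSON response and groups any URL strings by file extension. The sample field names in the usage note are invented for illustration only.

```python
from urllib.parse import urlsplit

EXTENSIONS = (".html", ".txt", ".md")


def find_extracted_files(payload):
    """Walk a parsed JSON response and group URL strings by file extension."""
    found = {ext: [] for ext in EXTENSIONS}

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node.startswith(("http://", "https://")):
            # Compare against the URL path so query strings are ignored.
            path = urlsplit(node).path.lower()
            for ext in EXTENSIONS:
                if path.endswith(ext):
                    found[ext].append(node)

    walk(payload)
    return found
```

For example, find_extracted_files({"a": "https://cdn.example.com/page.md"}) places that URL under the ".md" key. Note that the scan is purely extension-based, so a target page whose own address ends in .html would also be picked up if the response echoes it back.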

Get Image URLs

Parameter Name : get_image_urls

This parameter enables extraction of all image URLs present on the rendered webpage. When activated, the system scans the final rendered DOM and collects references to all detected images, returning them in a structured format.

This is useful when you need to analyze media assets used on a page, collect image resources for indexing, or process visual content separately from the main HTML structure.

Options

true: Extracts and returns a structured list of all image URLs found on the webpage.
false: Skips image extraction and only returns the standard render output.

When to use

Use this parameter when you need to collect or analyze images from a webpage, such as for content auditing, media tracking, or building image datasets.
It is especially useful when image resources need to be indexed or processed separately from the main HTML structure of the page.

Default value: false.
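
For content auditing or dataset building, the returned list usually needs some cleanup before use. The sketch below assumes the image URLs have already been read from the response as a plain list of strings, and that the list may contain relative paths, inline data: URIs, or duplicates. Neither assumption is confirmed by this page, and normalize_image_urls is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlsplit


def normalize_image_urls(page_url, image_urls):
    """Resolve image URLs against the page URL, dropping duplicates and inline data."""
    seen = set()
    result = []
    for raw in image_urls:
        raw = raw.strip()
        # Skip empty entries and inline data: URIs, which are not downloadable links.
        if not raw or raw.startswith("data:"):
            continue
        absolute = urljoin(page_url, raw)
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

The function preserves the order in which images first appear, which keeps the output stable between runs on the same page.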