Core Features

Website Crawling AI

Learn how to index your public marketing pages and support documentation automatically. Sentrup crawls sitemaps, scrapes content dynamically, checks hashes, and generates high-density vector embeddings.


Sitemap & URL Discovery

Sentrup starts a crawl by fetching the host's sitemap.xml file. If a sitemap is found, its URLs are extracted with regex-based scanners. If no sitemap is present, the discovery module falls back to extracting links from the site's homepage.
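
The discovery flow described above can be sketched roughly as follows. This is a minimal illustration and not Sentrup's actual implementation: the function names (`urls_from_sitemap`, `urls_from_homepage`, `discover`) and both regular expressions are assumptions made for the example, and the same-host filter in the fallback is a guess at sensible behavior.

```python
import re
from html import unescape
from typing import Optional
from urllib.parse import urljoin, urlparse

# Hypothetical patterns; the real scanners are not documented here.
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r"""<a\s[^>]*?href=["']([^"'#]+)""", re.IGNORECASE)


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> entry in the sitemap, in document order."""
    # Sitemap URLs are XML-escaped, so "&amp;" must become "&".
    return [unescape(loc) for loc in LOC_RE.findall(sitemap_xml)]


def urls_from_homepage(html: str, base_url: str) -> list[str]:
    """Fallback: collect de-duplicated same-host links from the homepage."""
    host = urlparse(base_url).netloc
    seen: dict[str, None] = {}  # dict keeps insertion order
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host:
            seen.setdefault(absolute, None)
    return list(seen)


def discover(sitemap_xml: Optional[str], homepage_html: str, base_url: str) -> list[str]:
    """Prefer sitemap URLs; fall back to homepage links if there are none."""
    if sitemap_xml:
        urls = urls_from_sitemap(sitemap_xml)
        if urls:
            return urls
    return urls_from_homepage(homepage_html, base_url)
```

Fetching the sitemap and homepage is left out so the sketch stays independent of any particular HTTP client.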

Waterfall Scraper Pipeline

To optimize budget and parsing accuracy, Sentrup implements a three-tier **waterfall scraper architecture**:

| Scraper Tier | Service | Target Use Case | Latency | Cost |
| --- | --- | --- | --- | --- |
| Tier 1 (Cheap Static) | Trafilatura / Local HTML Fetch | Static pages, plain markdown, articles | <500ms | Free |
| Tier 2 (Medium Dynamic) | Jina Reader API | Documentation sites, guides, API hubs | ~1.2s | Low |
| Tier 3 (Heavy SPA/CF) | Firecrawl Scrape API | React/Next SPAs, Cloudflare-protected domains | ~3.5s | Pay-per-scrape |
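
A waterfall of this kind can be sketched as a loop that tries the cheapest tier first and escalates only when a tier fails. The escalation rule used here (an exception, or extracted text shorter than `MIN_CHARS`) is an assumption for illustration; the criteria Sentrup actually uses to move between tiers are not described above. The scrapers are passed in as plain callables, so no real service API is modeled.

```python
from typing import Callable, Optional, Sequence

# A scraper takes a URL and returns extracted text, or None on failure.
Scraper = Callable[[str], Optional[str]]

MIN_CHARS = 200  # assumed threshold for "usable" extracted text


def scrape_waterfall(
    url: str,
    tiers: Sequence[tuple[str, Scraper]],
    min_chars: int = MIN_CHARS,
) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, text) from the first usable result."""
    for name, scraper in tiers:
        try:
            text = scraper(url)
        except Exception:
            continue  # a failing tier falls through to the next, costlier one
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    raise RuntimeError(f"all scraper tiers failed for {url}")
```

In practice the first callable might wrap a local Trafilatura extraction, with the second and third wrapping the Jina Reader and Firecrawl services. Ordering the list from cheapest to most expensive keeps paid calls limited to pages the cheaper tiers cannot handle.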

Incremental Hashing & Vectorization

Sentrup avoids duplicate work by checking **SHA-256 hashes** on every crawl cycle. After text is extracted from a page, Sentrup hashes the cleaned text and compares the result against the hash recorded for that page's existing vectors in Qdrant. If the hashes match, the stored vectors are left untouched. If the text has changed, the old vectors are deleted, the new content is split into **300-word semantic chunks**, and new embeddings are generated with `gemini-embedding-2`.
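
The change-detection step can be illustrated with a short sketch. Several simplifications are assumed: a plain dictionary (`stored_hashes`) stands in for the hashes kept in Qdrant, the chunker uses fixed 300-word windows instead of semantic boundaries, and the embedding call is omitted entirely. The function names are hypothetical.

```python
import hashlib
from typing import Optional

CHUNK_WORDS = 300  # chunk size stated in the text above


def content_hash(cleaned_text: str) -> str:
    """SHA-256 hex digest of the cleaned page text."""
    return hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()


def chunk_words(text: str, size: int = CHUNK_WORDS) -> list[str]:
    """Split text into consecutive windows of at most `size` words.

    Simplification: real semantic chunking would respect sentence or
    section boundaries; this only counts words.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def plan_update(url: str, cleaned_text: str, stored_hashes: dict[str, str]) -> Optional[list[str]]:
    """Return chunks to re-embed, or None if the page is unchanged."""
    new_hash = content_hash(cleaned_text)
    if stored_hashes.get(url) == new_hash:
        return None  # hash matches: leave existing vectors alone
    # A real pipeline would delete the old vectors here and record the
    # new hash only after the new embeddings are stored successfully.
    stored_hashes[url] = new_hash
    return chunk_words(cleaned_text)
```

Hashing the cleaned text, not the raw HTML, means cosmetic markup changes such as a new script tag or a rotated CSRF token do not trigger re-embedding.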

Crawl Request API Payload

The dashboard initiates crawls by sending a POST request to the administrative documents crawler endpoint:

POST /api/v1/admin/documents/crawl
Content-Type: application/json
Authorization: Bearer <your-admin-token>

{
  "url": "https://www.sentrup.com",
  "max_pages": 15,
  "source_priority": "high",
  "allowed_paths": ["/docs", "/faq", "/pricing"]
}
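
For scripted use, the same request could be built in Python along these lines. This is a sketch using only the standard library; the base URL `https://api.example.com` is a placeholder and the helper name `build_crawl_request` is hypothetical. The response format is not documented above, so none is assumed.

```python
import json
import urllib.request

CRAWL_PATH = "/api/v1/admin/documents/crawl"


def build_crawl_request(
    base_url: str,
    admin_token: str,
    url: str,
    max_pages: int,
    source_priority: str,
    allowed_paths: list[str],
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown above."""
    payload = {
        "url": url,
        "max_pages": max_pages,
        "source_priority": source_priority,
        "allowed_paths": allowed_paths,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + CRAWL_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {admin_token}",
        },
        method="POST",
    )


# Placeholder host and token; substitute your own deployment's values.
request = build_crawl_request(
    "https://api.example.com",
    "<your-admin-token>",
    url="https://www.sentrup.com",
    max_pages=15,
    source_priority="high",
    allowed_paths=["/docs", "/faq", "/pricing"],
)
```

Passing `request` to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect or log before a crawl starts.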