Core Features
Website Crawling AI
Learn how to index your public marketing pages and support documentation automatically. Sentrup crawls sitemaps, scrapes page content, hashes the extracted text to detect changes, and generates vector embeddings.
Sitemap & URL Discovery
Sentrup starts a crawl by fetching the host's `sitemap.xml` file. If a sitemap is found, its URLs are parsed using high-performance regex scanners. If no sitemap is present, the discovery module falls back to extracting links from the site's homepage.
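
The discovery flow above can be sketched in a few lines of Python. This is a minimal illustration, not Sentrup's actual implementation: the function names (`urls_from_sitemap`, `urls_from_homepage`, `discover_urls`), the regex patterns, and the same-host filter are all assumptions made for the example, and the sitemap and homepage bodies are passed in as strings so the sketch stays independent of any HTTP client.

```python
from __future__ import annotations

import re
from urllib.parse import urljoin, urlparse

# Matches the contents of each <loc> element in a sitemap document.
LOC_PATTERN = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)
# Matches the href value of anchor tags, stopping before any #fragment.
HREF_PATTERN = re.compile(r"<a\s[^>]*?href=[\"']([^\"'#]+)", re.IGNORECASE)


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in the sitemap text."""
    return [match.strip() for match in LOC_PATTERN.findall(sitemap_xml)]


def urls_from_homepage(html: str, base_url: str) -> list[str]:
    """Fallback: collect unique same-host links from the homepage HTML."""
    host = urlparse(base_url).netloc
    found: list[str] = []
    for href in HREF_PATTERN.findall(html):
        absolute = urljoin(base_url, href)
        # Keep only links on the crawled host (an assumption of this sketch).
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def discover_urls(sitemap_xml: str | None, homepage_html: str, base_url: str) -> list[str]:
    """Prefer sitemap URLs; fall back to homepage links if none are available."""
    if sitemap_xml:
        urls = urls_from_sitemap(sitemap_xml)
        if urls:
            return urls
    return urls_from_homepage(homepage_html, base_url)
```

Passing `None` as `sitemap_xml` models the "no sitemap present" case and triggers the homepage fallback.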
Waterfall Scraper Pipeline
To balance scraping cost against parsing accuracy, Sentrup implements a three-tier **waterfall scraper architecture**:
| Scraper Tier | Service | Target Use Case | Latency | Cost |
|---|---|---|---|---|
| Tier 1 (Cheap Static) | Trafilatura / Local HTML Fetch | Static pages, plain markdown, articles | <500ms | Free |
| Tier 2 (Medium Dynamic) | Jina Reader API | Documentation sites, guides, API hubs | ~1.2s | Low |
| Tier 3 (Heavy SPA/CF) | Firecrawl Scrape API | React/Next SPAs, Cloudflare-protected domains | ~3.5s | Pay-per-scrape |
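
One way to picture the waterfall is as an ordered list of scrapers tried from cheapest to most expensive. The sketch below is illustrative only: the escalation rule (move to the next tier when a scraper raises or returns fewer than `min_chars` characters) is an assumption, as are the names `scrape_with_waterfall` and `ScrapeResult`. Each tier is a plain callable so that no real Trafilatura, Jina, or Firecrawl call is implied.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A scraper takes a URL and returns extracted text, or None/"" when it fails.
Scraper = Callable[[str], Optional[str]]


@dataclass
class ScrapeResult:
    tier: str  # name of the tier that produced the text
    text: str  # extracted page text, stripped of surrounding whitespace


def scrape_with_waterfall(
    url: str,
    tiers: Sequence[Tuple[str, Scraper]],
    min_chars: int = 200,  # hypothetical threshold for "usable" content
) -> Optional[ScrapeResult]:
    """Try each tier in order and return the first usable result."""
    for name, scraper in tiers:
        try:
            text = scraper(url)
        except Exception:
            # Treat any scraper error as a miss and escalate to the next tier.
            continue
        if text and len(text.strip()) >= min_chars:
            return ScrapeResult(tier=name, text=text.strip())
    # Every tier failed or returned too little content.
    return None
```

Because the cheapest tier is tried first, the paid tier is only reached for pages the earlier tiers could not extract.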
Incremental Hashing & Vectorization
Sentrup avoids duplicate work by checking **SHA-256 hashes** on every crawl cycle. When text is extracted from a page, a hash of the cleaned text is calculated and compared against the hash stored with that page's existing vectors in Qdrant. If the hashes match, the existing vectors are left untouched. If the text has changed, the old vectors are deleted, the new content is split into **300-word semantic chunks**, and new embeddings are generated using `gemini-embedding-2`.
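
The hash check and chunking step might look like the following sketch. Several simplifications are assumed here: a plain dictionary (`stored_hashes`) stands in for the hash lookup against Qdrant, the split is a fixed 300-word window rather than true semantic chunking, and the deletion and embedding calls are left out. The function names are hypothetical.

```python
from __future__ import annotations

import hashlib

CHUNK_WORDS = 300  # chunk size stated in the text above


def content_hash(cleaned_text: str) -> str:
    """Return the SHA-256 hex digest of the cleaned page text."""
    return hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()


def split_into_chunks(text: str, chunk_words: int = CHUNK_WORDS) -> list[str]:
    """Split text into consecutive windows of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_words])
        for start in range(0, len(words), chunk_words)
    ]


def plan_update(url: str, cleaned_text: str, stored_hashes: dict[str, str]) -> list[str]:
    """Return the chunks that need new embeddings, or [] if the page is unchanged.

    stored_hashes is an in-memory stand-in for the hashes kept in the vector store.
    """
    new_hash = content_hash(cleaned_text)
    if stored_hashes.get(url) == new_hash:
        return []  # unchanged: leave the existing vectors alone
    # Changed or new page: record the new hash and hand back chunks to embed.
    stored_hashes[url] = new_hash
    return split_into_chunks(cleaned_text)
```

A caller would delete the page's old vectors and embed the returned chunks only when the list is non-empty.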
Crawl Request API Payload
The dashboard initiates a crawl by sending a POST request to the admin document crawl endpoint:
```http
POST /api/v1/admin/documents/crawl
Content-Type: application/json
Authorization: Bearer <your-admin-token>

{
  "url": "https://www.sentrup.com",
  "max_pages": 15,
  "source_priority": "high",
  "allowed_paths": ["/docs", "/faq", "/pricing"]
}
```
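
For scripted use, the same request could be assembled with Python's standard library, as in the sketch below. The helper name `build_crawl_request` and the `base_url` argument are assumptions, since the API host is not stated here. The response format is likewise not documented in this section, so the example stops at building the request.

```python
from __future__ import annotations

import json
import urllib.request

CRAWL_PATH = "/api/v1/admin/documents/crawl"  # endpoint path from the example above


def build_crawl_request(
    base_url: str,      # API host; an assumption, not given in this section
    admin_token: str,
    url: str,
    allowed_paths: list[str],
    max_pages: int = 15,
    source_priority: str = "high",
) -> urllib.request.Request:
    """Build (but do not send) the POST request for the crawl endpoint."""
    payload = {
        "url": url,
        "max_pages": max_pages,
        "source_priority": source_priority,
        "allowed_paths": allowed_paths,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + CRAWL_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {admin_token}",
        },
        method="POST",
    )
```

Sending it is then a matter of passing the result to `urllib.request.urlopen`, ideally with a timeout and error handling around the call.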