Multi-Seed Domain Discovery
Problem
When an entity owns multiple domains, a single seed only discovers domains that share certificates with that one seed. Real-world testing proved this:
| | `--seed generali.it` | `--seed generali.com` |
|---|---|---|
| Total domains | 59 | 57 |
| In common | 43 | 43 |
| Unique to this seed | 16 | 14 |
Neither seed alone sees the full picture. The .it seed finds Czech/Italian/Slovak domains, while .com finds investment-related domains.
Solution: Tagged Sources + Cross-Seed Boost
How it works
- Seed-tagged sources: Instead of an anonymous `ct_san_expansion`, each source is tagged with the seed that produced it, for example `ct_san_expansion:walmart.com`. This preserves provenance.
- Parallel per-seed expansion: Each seed runs its own Strategy B (SAN expansion) in parallel, all rate-limited by the existing crt.sh semaphore (max 5 concurrent queries).
- Cross-seed detection: After all strategies complete, domains with seed-tagged sources from 2+ independent seeds receive a `cross_seed_verified` source with 0.90 base confidence.
- Seed-to-seed validation: During validation, each seed's CT records are checked for the other seeds as SANs. If seeds share a certificate, that is strong evidence of common ownership.
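The tagging and parallel-expansion steps above can be sketched as follows. This is a minimal illustration, not the project's code: `fake_ct_lookup`, `expand_seed`, and `expand_all` are hypothetical names, and the lookup returns canned data instead of querying crt.sh. Only the semaphore limit of 5 and the `ct_san_expansion:{seed}` tag format come from the text.

```python
import asyncio

MAX_CONCURRENT_CT_QUERIES = 5  # the limit quoted above


async def fake_ct_lookup(seed: str) -> set[str]:
    """Hypothetical stand-in for a crt.sh SAN query; returns canned data."""
    canned = {
        "walmart.com": {"walmartlabs.com", "asda.com"},
        "samsclub.com": {"walmartlabs.com"},
    }
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return canned.get(seed, set())


async def expand_seed(seed: str, sem: asyncio.Semaphore) -> dict[str, set[str]]:
    """Run SAN expansion for one seed and tag every hit with that seed."""
    async with sem:  # at most MAX_CONCURRENT_CT_QUERIES lookups in flight
        domains = await fake_ct_lookup(seed)
    return {domain: {f"ct_san_expansion:{seed}"} for domain in domains}


async def expand_all(seeds: list[str]) -> dict[str, set[str]]:
    """Expand all seeds concurrently and merge tagged sources per domain."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_CT_QUERIES)
    per_seed = await asyncio.gather(*(expand_seed(s, sem) for s in seeds))
    merged: dict[str, set[str]] = {}
    for result in per_seed:
        for domain, sources in result.items():
            merged.setdefault(domain, set()).update(sources)
    return merged


merged = asyncio.run(expand_all(["walmart.com", "samsclub.com"]))
```

Because each tag carries its seed, a domain found from two seeds ends up with two distinct sources after merging, which is what makes the later cross-seed check possible.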
Usage
# Single seed (backward compatible)
domain-scout --name "Walmart" --seed walmart.com
# Multiple seeds
domain-scout --name "Walmart" --seed walmart.com --seed samsclub.com
# Three seeds with deep mode
domain-scout --name "Generali" --seed generali.it --seed generali.com --seed generali.de --deep
The --seed flag is repeatable. Using 3+ seeds auto-bumps timeout to 150s.
Confidence scoring
| Source | Base Score | Notes |
|---|---|---|
| `cross_seed_verified` | 0.90 | Found from 2+ independent seeds |
| `ct_org_match` | 0.85 | Cert O= matches company name |
| `ct_san_expansion:{seed}` | 0.80 | SAN on same cert as a seed |
| `ct_seed_subdomain:{seed}` | 0.75 | Subdomain of a seed |
| `rdap_match` | 0.70 | RDAP registrant matches |
| `ct_seed_related:{seed}` | 0.40 | Found in CT search for a seed |
| `dns_guess` | 0.30 | Only guessed + resolved |
Boosts: multi-source (+0.05/+0.10), resolves (+0.05), org similarity (+0.05), shared infra (+0.05).
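One way to read the table and boosts together is sketched below. The base scores and boost sizes come from the table. The rest is assumed for illustration: the function name `score`, the rule that the strongest source sets the base, and the multi-source thresholds (+0.05 for two sources, +0.10 for three or more). The real scorer may combine these differently.

```python
BASE_SCORES = {
    "cross_seed_verified": 0.90,
    "ct_org_match": 0.85,
    "ct_san_expansion": 0.80,
    "ct_seed_subdomain": 0.75,
    "rdap_match": 0.70,
    "ct_seed_related": 0.40,
    "dns_guess": 0.30,
}


def source_type(source: str) -> str:
    """Strip the ':{seed}' tag, e.g. 'ct_san_expansion:a.com' -> 'ct_san_expansion'."""
    return source.split(":", 1)[0]


def score(
    sources: set[str],
    resolves: bool = False,
    org_similar: bool = False,
    shared_infra: bool = False,
) -> float:
    """Hypothetical scorer: strongest base score plus boosts, capped at 1.0."""
    total = max(BASE_SCORES[source_type(s)] for s in sources)
    # Assumed thresholds for the multi-source boost.
    if len(sources) >= 3:
        total += 0.10
    elif len(sources) == 2:
        total += 0.05
    if resolves:
        total += 0.05
    if org_similar:
        total += 0.05
    if shared_infra:
        total += 0.05
    return round(min(1.0, total), 2)


# A single SAN-expansion source on a resolving domain.
single = score({"ct_san_expansion:walmart.com"}, resolves=True)
# Two tagged sources plus the cross-seed source.
cross = score(
    {
        "ct_san_expansion:walmart.com",
        "ct_san_expansion:samsclub.com",
        "cross_seed_verified",
    }
)
```

Under these assumptions `single` is 0.85 and `cross` is 1.0, which matches the figures quoted in the scenarios in this document.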
Scenario analysis
Walmart + Sam's Club (cross-verification works)
Seeds walmart.com and samsclub.com both independently discover walmartlabs.com through separate CT searches. Two independent discovery paths converging on the same domain is hard to produce by coincidence.
- walmartlabs.com: `ct_san_expansion:walmart.com` + `ct_san_expansion:samsclub.com` + `cross_seed_verified` = 1.00
Generali (overlapping ccTLD seeds)
Seeds generali.it and generali.com share 43 domains in common. The overlapping domains get cross-verification boost, while seed-unique domains retain their single-seed scores.
- generali.de: found from both seeds + org match = 1.00 (cross-verified)
- ceskapojistovna.cz: found only from the `.it` seed = 0.80 (SAN expansion only)
M&A / sold subsidiary (no false cross-verification)
If Walmart sells ASDA, asda.com might only appear from walmart.com's historical certs, not from samsclub.com. Since it's only found from one seed, no cross-verification boost is applied.
- asda.com: `ct_san_expansion:walmart.com` only = 0.85 (no cross-seed boost)
CDN false positives
CDN domains on multi-tenant certs are filtered by the CDN detection filter (10+ unrelated base domains + low org match). If a CDN domain appears as ct_seed_related from multiple seeds, cross-verification is not applied because there are no strong sources (ct_san_expansion or ct_seed_subdomain). The score stays at 0.40 base + minor boosts = 0.50, well below the 0.60 inclusion threshold.
Code review findings
Code-simplifier improvements applied
- Extracted `_extract_contributing_seeds()` helper: eliminates duplicate seed extraction logic in `_apply_cross_seed_boost()` and `_build_output()`.
- Extracted `_collect_cert_names()` helper: deduplicates the SAN + CN collection pattern used in `_strategy_org_search()` and `_strategy_seed_expansion()`.
- Optimized `_validate_seed()` co-hosted detection: replaced the O(n*m) nested loop with a `base_to_seed` reverse lookup dict and set intersection.
- Cleaned up backward compat tests: replaced `try/except/pass` with a `_STUB_RESULT` shared fixture, removing fragile exception suppression.
- Removed stale `type: ignore` comments: the code-simplifier fixed type annotations properly instead of suppressing warnings, reducing total mypy errors from 16 to 13.
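The reverse-lookup idea behind the co-hosted detection change can be sketched like this. Only the name `base_to_seed` comes from the notes above. `base_domain` and `co_hosted_seeds` are hypothetical, and the base-domain rule is a naive "last two labels" shortcut that would mis-handle suffixes such as `.co.uk`.

```python
def base_domain(name: str) -> str:
    """Naive base domain: last two labels (assumption; ignores the Public Suffix List)."""
    return ".".join(name.lower().removeprefix("*.").split(".")[-2:])


def co_hosted_seeds(seed: str, cert_names: set[str], seeds: list[str]) -> set[str]:
    """Return the other seeds whose base domain appears on this seed's certificates."""
    # Reverse lookup: base domain -> seed, built once instead of per cert name.
    base_to_seed = {base_domain(s): s for s in seeds if s != seed}
    cert_bases = {base_domain(n) for n in cert_names}
    # Set intersection replaces the nested loop over seeds x cert names.
    return {base_to_seed[b] for b in base_to_seed.keys() & cert_bases}


found = co_hosted_seeds(
    "walmart.com",
    {"www.walmart.com", "*.samsclub.com", "cdn.example.net"},
    ["walmart.com", "samsclub.com", "generali.it"],
)
```

Building the dict and the set costs O(n + m), and the intersection then touches each base domain once.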
Review findings (no action needed)
- Shared `errors` list across parallel tasks (low risk): Multiple async tasks append to the same `errors` list. This is safe because asyncio tasks run on a single event-loop thread and only switch at `await` points, so a `list.append` call can never be interleaved with another task. No action needed.
- `_apply_cross_seed_boost` unused `seeds` parameter (nit): The `seeds` parameter is passed but not used in the method body. Kept for API consistency, since future enhancements may need it for filtering (e.g., only count verified seeds).
- Source count inflation with tagged sources (acceptable): A domain with `ct_san_expansion:a.com` + `ct_san_expansion:b.com` + `cross_seed_verified` has 3 sources, triggering the +0.10 multi-source boost. This is intentional, because being found from multiple seeds is stronger evidence.
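The shared `errors` list finding can be illustrated with a small sketch. The task names and the failure message are invented. The point is only that several coroutines gathered on one event loop can append to the same list without a lock, because control changes hands only at `await`.

```python
import asyncio


async def run_strategy(name: str, should_fail: bool, errors: list[str]) -> None:
    """Hypothetical strategy task that records failures in a shared list."""
    await asyncio.sleep(0)  # the only point where another task can run
    if should_fail:
        # No await between the check and the append, so nothing interleaves here.
        errors.append(f"{name}: simulated failure")


async def run_all() -> list[str]:
    errors: list[str] = []
    await asyncio.gather(
        run_strategy("org_search", False, errors),
        run_strategy("seed_expansion:walmart.com", True, errors),
        run_strategy("seed_expansion:samsclub.com", True, errors),
    )
    return errors


collected = asyncio.run(run_all())
```

If the strategies were ever moved to threads or processes, this reasoning would no longer hold and the list would need explicit synchronization.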
Data model changes
- `EntityInput.seed_domain`: `str | None` -> `list[str]`. Default: `[]`. The `discover()` method accepts `str | None | list[str]` for backward compatibility and coerces to a list internally.
- `ScoutResult.seed_domain_assessment`: `str | None` -> `dict[str, str]`. Maps each seed domain to its assessment (`confirmed`, `suspicious`, `invalid`, `timeout`, `error`).
- `ScoutResult.seed_cross_verification`: `dict[str, list[str]]` (new). Maps each seed to the list of other seeds that share certificates with it.
- `DiscoveredDomain.seed_sources`: `list[str]` (new). Which seed domains contributed to discovering this domain.
- `DiscoveredDomain.evidence`: `list[str]` -> `list[EvidenceRecord]` (v0.2.0). Each evidence entry is now a structured record with `source_type`, `cert_id`, `cert_org`, `similarity_score`, and `seed_domain`. See API Reference.
- `ScoutResult.run_metadata`: replaces `search_metadata: dict` (v0.2.0). Typed `RunMetadata` with `schema_version`, `tool_version`, `timestamp`, and config snapshot. See API Reference.
Test coverage
103 unit tests covering:
- Cross-seed detection (6 tests): single seed, two seeds, mixed types, same-seed-different-types, three seeds, non-seed sources
- Scoring (6 tests): cross_seed_verified base, tagged source parity with old untagged, no-seeds compat, combined scoring
- Build output (3 tests): seed_sources population, multi-seed is_seed, empty seeds
- Backward compat (3 tests): string seed, None seed, list seed
- Model changes (5 tests): defaults, constructors, serialization
- Simulated scenarios (5 tests): Walmart cross-verification, Generali overlap, M&A no-false-cross, CDN false positive, unrelated domains
- Post-M&A edge cases (3 tests): pre-integration brand (no cross-verify), divested subsidiary with mismatched org, cross-verify across different source types
- Post-spin-off scenarios (4 tests): shared legacy domain (HP/HPE), child-only domain (PayPal/eBay), non-resolving transition domain excluded from output, single-seed-only domain
- Look-alike entities (3 tests): independent domains no cross-verify (Delta Air/Faucet), weak-only shared infrastructure correctly rejected, completely isolated seeds (Apple Inc/Hospitality)
- Cross-verification edge cases (8 tests): empty evidence, 5-seed domain, duplicate seed no cross-verify, boost idempotency, score capping at 1.0, seed domain own-tag-only, seed cross-verified from other seed, `_extract_contributing_seeds` direct test
- Build output edge cases (3 tests): non-resolving excluded despite high confidence, below-threshold excluded, descending sort order
Known limitations documented by tests
- Shared infrastructure with strong sources: If two unrelated companies share a cert with `ct_san_expansion` (not just `ct_seed_related`), the domain still gets cross-verified. The CDN filter catches large multi-tenant certs (10+ base domains), but smaller shared certs could still produce false positives.
- Boost idempotency gap: Calling `_apply_cross_seed_boost` twice is idempotent for sources (a set) but appends duplicate evidence entries (a list). Documented in `test_boost_idempotency`.
Fixed in this PR
- Weak-evidence escalation (fixed): Previously, two `ct_seed_related` tags from different seeds triggered `cross_seed_verified` (0.90 base), jumping to 1.0. Now, `_apply_cross_seed_boost` requires at least one strong source (`ct_san_expansion` or `ct_seed_subdomain`) to apply the boost. Weak-only cross-seed signals keep their 0.40 base score (0.50 with boosts, below the 0.60 inclusion threshold).
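The strong-source rule can be sketched as follows. This is a simplified, hypothetical version that works on a plain set of source strings. The real `_apply_cross_seed_boost` and `_extract_contributing_seeds` operate on the project's own domain objects and also record evidence, which is omitted here.

```python
STRONG_TYPES = {"ct_san_expansion", "ct_seed_subdomain"}


def extract_contributing_seeds(sources: set[str]) -> set[str]:
    """Seeds named in tagged sources; untagged sources contribute none."""
    return {s.split(":", 1)[1] for s in sources if ":" in s}


def apply_cross_seed_boost(sources: set[str]) -> set[str]:
    """Add cross_seed_verified only for 2+ seeds with at least one strong source."""
    seeds = extract_contributing_seeds(sources)
    has_strong = any(s.split(":", 1)[0] in STRONG_TYPES for s in sources)
    if len(seeds) >= 2 and has_strong:
        return sources | {"cross_seed_verified"}
    return set(sources)


strong = apply_cross_seed_boost(
    {"ct_san_expansion:walmart.com", "ct_seed_related:samsclub.com"}
)
weak_only = apply_cross_seed_boost(
    {"ct_seed_related:walmart.com", "ct_seed_related:samsclub.com"}
)
```

Here `strong` gains `cross_seed_verified` while `weak_only` is returned unchanged. Because sources are a set, applying the function twice adds nothing new, which mirrors the source-level idempotency noted under known limitations.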