Koretex
VERIFIED USER-GENERATED DATA · IN PRIVATE BETA

The human layer behind every AI dataset.

A globally distributed network of verified contributors capturing whatever your model needs: language, voice, video, edge cases, anything else. Tell us what you need; our team runs the campaign through our app and our network fulfils it. Every record attested at source, every contributor KYC’d, delivered in days.

Every capture: cryptographically attested
Every contributor: KYC-verified human
Every record: provenance receipt
Every campaign: built to your rubric
// why this matters

Better models start with data you can stand behind.

A Koretex contributor captures a workshop task with a phone mounted on a chest harness — every record provable end-to-end.

Every team building AI today has figured out the same thing: the ceiling on model performance isn’t compute — it’s the data behind it. The cleanest, most rigorous, most provable data wins. That’s true whether you’re training a vision model on long-tail objects, a voice model on accents and dialects, a fraud system on adversarial documents, or a humanoid on first-person task demonstrations.

The dominant crowdsourced model wasn’t built for that ceiling. Pay-per-upload, checkbox attribution, validate-after-the-fact — those choices optimise for upload velocity, not data integrity. They produce volume, not provability. At equilibrium they produce something worse: marketplaces that have to pull out of entire countries after 99% of submissions turn out to be scams.

Koretex is built around a different default. Every record provable, every contributor verified, every capture attested — not as a premium tier, but as the baseline.

// what we deliver

Built for the data your model actually needs.

A range of dataset SKUs we’ve scoped, plus custom bounties for anything else. Every record captured through our app by a KYC-verified contributor — attested at source, delivered to your spec.

LANGUAGE & VOICE

Localised speech datasets

Multilingual recordings across accents, dialects, code-switching contexts, and low-resource languages. Transcribed, segmented, and labelled to your spec. Critical for any voice or LLM team shipping outside English.

EGOCENTRIC VIDEO

First-person task capture

POV recordings of hands at work, environments, interactions, demonstrations. Captured in real kitchens, garages, warehouses, fields. For vision, robotics, and embodied-AI teams that need scenes a studio rig can’t reach.

KYC & RED-TEAM

Adversarial identity datasets

Real verified humans across demographics producing edge-case identity documents, liveness samples, and adversarial submissions your fraud system needs to be hardened against. Compliance-aware workflows on request.

OBJECT & SCENE

Long-tail visual data

Diverse object and scene photography across 80+ countries (tools, foods, packaging, vehicles, signage, environments). The cultural and geographic breadth public datasets simply don’t cover.

DEMOGRAPHIC PANELS

Targeted contributor cohorts

Custom panels by geography, age, language, occupation, lived experience. For preference data, RLHF, eval scoring, or any workload where who produced the data matters as much as the data itself.

↳ CUSTOM CAMPAIGNS

Anything else, scoped with you

Bring us a rubric and a volume target. Our team scopes the campaign with you, routes it through our network, and delivers to spec. Anything a phone can capture, including new workflows we’ll design with you.

// how it works

You bring the spec. We run the campaign.

Campaigns are designed, scoped, and operated by our team. You don’t need to learn a portal or wrangle contributors yourself: we handle the routing, quality control, and delivery end-to-end. Every campaign is bespoke; every record is already attested by the time it lands in your inbox.

STEP 01 · BRIEF

Bring us the spec

Rubric, volume target, geographic mix, demographic filters, device or sensor requirements. As precise or as exploratory as you have it — we sharpen it with you on a call. Most campaigns get scoped in a single 30-minute conversation.

STEP 02 · RUN

We launch and operate

Our team routes the campaign through our network to qualifying contributors across 80+ countries. They capture in our app on their own phones — full sensor suite available where the spec requires it. An on-device LLM handles the first pass of labelling; humans review, correct, and escalate.

STEP 03 · DELIVER

You get attested data

Per-record provenance, capture-time signatures, contributor identity hashes, gold-set agreement scores. Delivered in your format. Pay only for records that pass your tolerance. Most campaigns close in days, not months.
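
For teams that want to picture how "pay only for records that pass your tolerance" might look on the receiving end, here is a minimal sketch. It assumes each delivered record carries a gold-set agreement score between 0 and 1; the field names (`record_id`, `gold_agreement`) and the 0.9 threshold are illustrative assumptions, not the Koretex delivery schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveredRecord:
    record_id: str          # hypothetical identifier
    gold_agreement: float   # hypothetical agreement score in [0, 1]


def split_by_tolerance(records, min_agreement):
    """Return (accepted, rejected) lists using a minimum agreement score."""
    accepted = [r for r in records if r.gold_agreement >= min_agreement]
    rejected = [r for r in records if r.gold_agreement < min_agreement]
    return accepted, rejected


# Made-up batch for illustration only.
batch = [
    DeliveredRecord("rec-001", 0.97),
    DeliveredRecord("rec-002", 0.82),
    DeliveredRecord("rec-003", 0.91),
]
accepted, rejected = split_by_tolerance(batch, min_agreement=0.9)
billable_ids = [r.record_id for r in accepted]
```

In this sketch only the accepted records would be billable; the tolerance itself is whatever you agree at the brief stage.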

// the network behind every campaign

A supply side no studio or in-house ops team can match.

GLOBAL
80+ countries

Contributors across every continent. All walks of life, demographics, and lived experiences — the geographic diversity production AI actually needs.

VERIFIED
KYC-bound identity

One verified human, one account. Government ID + biometric match at onboarding, bound to the payout wallet. No sybils, no scrapers, no laundered uploads.

MOBILE-FIRST
Full sensor suite

Camera, microphone, IMU, GPS, depth where available. Modern phones are calibrated capture rigs — we use everything they ship with.

APP-CAPTURED
No external uploads

Contributors capture through our app, not by uploading existing files. Pre-existing or scraped content fails at the door — provenance is built in from the first frame.

// how verification works

Five things that make Koretex data provable.

Every architectural choice is downstream of one assumption: the team buying the data will eventually be asked where it came from. We’ve built the answer in from day one.

01 · CAPTURE

Cryptographic attestation

Every recording is signed at the moment of capture by hardware-attested keys. C2PA-compliant. Pre-existing content can’t be uploaded because pre-existing content has no valid signature.
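
To make the idea concrete, the sketch below shows why content that was never signed at capture fails verification. It is a simplification under stated assumptions: a production system would use an asymmetric key held in the phone's secure hardware and a C2PA manifest, whereas this example swaps in an HMAC from the Python standard library as a stand-in. The function names and the key handling are hypothetical.

```python
import hashlib
import hmac
import secrets

# Assumption: stands in for a key that would live in the phone's secure hardware.
device_key = secrets.token_bytes(32)


def sign_capture(key: bytes, payload: bytes) -> str:
    """Sign the SHA-256 digest of the captured bytes (HMAC stand-in)."""
    digest = hashlib.sha256(payload).digest()
    return hmac.new(key, digest, hashlib.sha256).hexdigest()


def verify_capture(key: bytes, payload: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_capture(key, payload)
    return hmac.compare_digest(expected, signature)


fresh_capture = b"frame bytes recorded in the app"
signature = sign_capture(device_key, fresh_capture)

# The bytes that were signed at capture verify.
capture_ok = verify_capture(device_key, fresh_capture, signature)
# Different bytes cannot reuse that signature.
old_file_ok = verify_capture(device_key, b"pre-existing file", signature)
```

The design point is that the signature is bound to the exact bytes recorded, so swapping in other content invalidates it.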

02 · VALIDATION

On-device live review

A small model watches the capture in real time. Bad framing, missing hands, incomplete tasks — caught while the contributor can still re-shoot. Validation happens before upload, not after.

03 · IDENTITY

KYC-verified contributors

One verified human, one account. Government ID + biometric match at onboarding, linked to the USDC payout wallet. Sybil attacks become architecturally hard.

04 · TASK

Active-prompt sessions

Task scripts are generated server-side at session start, with random prompts injected mid-capture. You can’t pre-record. You can’t fake it. You have to actually be there, doing the thing.
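
One way to picture an active-prompt session is sketched below. It assumes the server draws surprise prompts and their timings from OS-level randomness at session start, so a contributor cannot know them in advance. The function name, the prompt texts, and the script layout are hypothetical and only illustrate the idea.

```python
import secrets


def build_session_script(base_steps, surprise_pool, duration_s, n_surprises=2):
    """Build a session script with surprise prompts at unpredictable times."""
    rng = secrets.SystemRandom()  # OS entropy, not reproducible from a seed
    prompts = rng.sample(surprise_pool, n_surprises)
    # Distinct whole-second offsets, skipping the first and last 5 seconds.
    offsets = sorted(rng.sample(range(5, duration_s - 5), n_surprises))
    return {
        "session_id": secrets.token_hex(8),
        "steps": list(base_steps),
        "surprises": [{"at_s": t, "prompt": p} for t, p in zip(offsets, prompts)],
    }


# Illustrative inputs only.
pool = ["show your left palm", "pan to the workbench", "say the word on screen"]
script = build_session_script(
    base_steps=["pick up the screwdriver", "tighten the hinge"],
    surprise_pool=pool,
    duration_s=60,
)
```

Because the prompts and offsets differ on every session, a pre-recorded clip would not contain the right responses at the right moments.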

05 · AUDIT

Per-record provenance

Every dataset ships with per-record receipts: capture timestamp, device class, operator hash, validation log, consent record, payment record. When legal asks, you have an answer.
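
As an illustration of what a per-record receipt could look like in code, the sketch below models the six fields listed above and derives a stable fingerprint from them. The class name, field types, timestamp format, and the fingerprint step are assumptions for illustration, not the actual Koretex receipt format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ProvenanceReceipt:
    capture_timestamp: str   # assumed ISO 8601 string
    device_class: str
    operator_hash: str       # hash of the contributor identity, never raw PII
    validation_log: tuple    # assumed: ordered validation events
    consent_record: str
    payment_record: str

    def fingerprint(self) -> str:
        """Stable SHA-256 over the receipt's canonical JSON form."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up values for illustration only.
receipt = ProvenanceReceipt(
    capture_timestamp="2025-01-01T12:00:00Z",
    device_class="phone",
    operator_hash=hashlib.sha256(b"contributor-123").hexdigest(),
    validation_log=("framing_ok", "task_complete"),
    consent_record="consent-ref-001",
    payment_record="payment-ref-001",
)
tampered = replace(receipt, device_class="tablet")
```

Any change to a field changes the fingerprint, which is what lets an auditor detect a receipt that was edited after the fact.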

↳ THE COMPOUND

All five, by default

Most data marketplaces ship one or two of these as an upgrade tier. Koretex ships all five on every record — because anything less is the version that doesn’t hold up when it matters.

Built for: Humanoid & embodied AI · On-device & edge models · Defence & dual-use · Regulated industries · Foundation-model RLHF · Eval & safety scoring

// what cheap data actually costs

Most crowdsourced training data is a checkbox. Koretex is verified end-to-end.

The first wave of crowdsourced training-data marketplaces optimised for upload velocity. Pay-per-upload, checkbox attribution, validate-after-the-fact. It’s how you get to a billion uploads fast. It’s also how you get to data your legal team can’t defend, marketplaces pulling out of countries after 99% of submissions turn out to be scams, and training corpora that look fine in a demo and toxic in a deposition. Koretex made the opposite architectural choices.

                      | Cheap crowdsourced data    | Koretex
Identity verification | Email + checkbox           | Government ID + biometric
Capture provenance    | “I own this” checkbox      | Hardware-attested signing
Quality validation    | 50-of-50 voting (post-hoc) | On-device model (live)
Fraud surface         | Architecturally massive    | Architecturally minimal
Audit trail           | None                       | Per-record receipts
When legal asks…      | “We hope it’s clean”       | “Here’s the audit log”

// founder note

I’m Sean. I’m building Koretex because the teams training AI today deserve training data they can actually trust, whatever shape it takes. Whether you need 50,000 voice recordings in Tagalog, 10,000 attested first-person task videos, or an adversarial dataset for your fraud model, you should be able to brief us in one conversation and trust every record that comes back. That’s what we’re building. If a dataset is on your critical path, I’d love a conversation.

Prefer to write it out? Drop a one-liner over email — what you’re training, rough volume, why the data audit matters — and you’ll get a reply within a day.