
Member of Technical Staff - Data
Job Description
Posted on: March 3, 2026
We are one of Australia's two private foundation model labs, still in stealth. We build AI forecasting systems; our reasoning models beat human superforecasters at prediction tasks. We’re backed by Blackbird Ventures and notable angels, including Balaji Srinivasan and the founders of Synthesia and Supabase. The salary range for this role is $200,000 - $300,000 p.a.
Our founders include the founder of one of the largest DevOps infrastructure companies in the world, Forbes 30 Under 30 alumni, and the creator of core infrastructure for many quant funds.
The Role
You will own the evaluation and learning loop in our Python research stack and decision-surface workflow.
This is not a "build a model and hand it off" role. There is no data team cleaning features for you. There is no ML platform abstracting away infrastructure. You will be the person who diagnoses why the Brier score spiked last week and whether the cause was calibration drift or resolution loss; who determines whether a surface optimization actually improved out-of-sample performance or just overfit; and who builds the leakage-prevention infrastructure because nobody else will. You will spend more time asking "is this result real?" than "how do I build a fancier model?"
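To give a flavor of that diagnosis, here is a minimal sketch (illustrative only; the bin count and names are hypothetical, not our production code) of the standard Murphy decomposition, which splits the Brier score into the reliability and resolution terms mentioned above:

```python
import numpy as np

def brier_decomposition(p, y, n_bins=10):
    """Murphy decomposition: BS ~= reliability - resolution + uncertainty.

    p: forecast probabilities in [0, 1]; y: binary outcomes in {0, 1}.
    The identity is exact when forecasts are grouped by distinct values;
    with fixed-width bins a small within-bin term remains.
    """
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    base_rate = y.mean()
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()                  # fraction of forecasts in this bin
        p_bar, y_bar = p[mask].mean(), y[mask].mean()
        reliability += weight * (p_bar - y_bar) ** 2     # calibration error
        resolution += weight * (y_bar - base_rate) ** 2  # discrimination vs base rate
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty
```

A week-over-week rise in the reliability term points to calibration drift; a fall in the resolution term means the forecasts are losing discriminative power. The two failure modes call for different fixes, which is why the decomposition matters.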
You will work at the boundary of research and production, with direct handoff into our TypeScript strategy runtime.
What You’ll Do
- Build and own point-in-time, leakage-resistant evaluation pipelines for forecasting and trading decisions.
- Define and enforce rigorous validation: temporal splits, walk-forward tests, segment/regime slicing, and regression benchmarks (see the sketch after this list).
- Improve calibration quality using robust methods (for example, isotonic regression or Platt scaling) with reproducible model selection and monitoring.
- Connect forecast quality to execution quality: measure where edge is created, where it is lost, and why.
- Improve decision-surface optimization pipelines (entry/exit thresholds, exit rules, policy knobs) using robust objective design and stability checks.
- Maintain Python-to-TypeScript parity for exported artifacts so research outputs are production-faithful.
- Convert postmortems into systematic improvements across prompts, models, features, and policy logic.
- Prototype quickly, then harden into tested, documented components used by the broader team.
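As a simplified illustration of the validation style we mean (hypothetical code; the fold count, names, and choice of isotonic calibration are placeholders, not our pipeline), here is an expanding-window walk-forward loop that fits a calibrator on the past and scores it strictly out of sample:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def walk_forward_brier(timestamps, raw_p, outcomes, n_folds=5):
    """Expanding-window walk-forward evaluation of an isotonic calibrator.

    Sorts by time, then for each fold calibrates on everything before the
    fold boundary and scores on the next contiguous block, so no future
    information leaks into the calibrator.
    """
    order = np.argsort(timestamps)
    p = np.asarray(raw_p, dtype=float)[order]
    y = np.asarray(outcomes, dtype=float)[order]
    edges = np.linspace(0, len(p), n_folds + 1).astype(int)
    scores = []
    for k in range(1, n_folds):
        train = slice(0, edges[k])            # all data before the boundary
        test = slice(edges[k], edges[k + 1])  # next contiguous block only
        iso = IsotonicRegression(out_of_bounds="clip").fit(p[train], y[train])
        calibrated = iso.predict(p[test])
        scores.append(float(np.mean((calibrated - y[test]) ** 2)))  # Brier score
    return scores  # per-fold out-of-sample Brier scores
```

Point-in-time discipline applies the same rule to every feature, not just the calibrator: nothing computed after a test window opens may appear on the training side of that fold.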
Requirements
- Proven track record building robust ML/forecasting evaluation systems end-to-end, not just metric dashboards.
- Deep understanding of temporal leakage modes and practical defenses in real data systems.
- Strong applied statistics: calibration, uncertainty, proper scoring rules, experiment design, and significance under noise.
- Strong Python engineering skills for production-grade research tooling and reproducible pipelines.
- Solid SQL/Supabase/Postgres fluency for large-scale, time-dependent analytical workflows.
- Experience with optimization workflows (for example, Optuna; see the sketch after this list) and model/policy tuning under realistic constraints.
- Ability to work with TypeScript engineers to keep runtime behavior aligned with research assumptions.
- Strong product judgment: prioritize changes that move both forecast reliability and trade performance.
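To make the optimization expectation concrete, here is a minimal sketch of an Optuna study tuning an entry threshold; the data and the `backtest_pnl` objective are toy stand-ins for a real point-in-time backtest:

```python
import numpy as np
import optuna

rng = np.random.default_rng(0)
# Toy stand-in data: calibrated forecast probabilities and the signed
# PnL of taking each bet (outcomes correlate with the forecast).
p = rng.uniform(0.0, 1.0, 5_000)
pnl = np.where(rng.uniform(0.0, 1.0, 5_000) < p, 1.0, -1.0)

def backtest_pnl(entry_threshold: float) -> float:
    """Hypothetical objective: trade only when the forecast clears the
    threshold; return mean PnL per bet, penalizing a no-trade policy."""
    taken = p >= entry_threshold
    return float(pnl[taken].mean()) if taken.any() else -1.0

def objective(trial: optuna.Trial) -> float:
    entry = trial.suggest_float("entry_threshold", 0.50, 0.95)
    return backtest_pnl(entry)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```

In practice, "realistic constraints" and the stability checks mentioned above mean preferring a parameter whose objective is flat in its neighborhood over a knife-edge optimum that scores marginally higher, and re-scoring the winner on held-out periods.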
Nice To Have
- Experience with prediction markets, market microstructure, DeFi, or other on-chain datasets.
- Experience tying calibration quality directly to capital allocation and risk controls.
- Familiarity with parity-contract testing across languages and artifact schemas (sketched after this list).
- Experience in high-velocity research environments where correctness and speed both matter.
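As a sketch of what we mean by parity-contract testing (the file name, artifact name, and tolerance are hypothetical), the Python side exports golden input/output fixtures that the TypeScript runtime replays in CI:

```python
import json

def export_parity_fixture(model, inputs, path="fixtures/calibrator_v3.json"):
    """Write input/output pairs from a Python research artifact so the
    TypeScript port can replay them and assert agreement within a
    tolerance. The fixture, not either implementation, is the contract."""
    fixture = {
        "artifact": "calibrator_v3",  # hypothetical artifact identifier
        "tolerance": 1e-9,            # max allowed |python - typescript| per case
        "cases": [
            {"input": float(x), "expected": float(model.predict([x])[0])}
            for x in inputs
        ],
    }
    with open(path, "w") as f:
        json.dump(fixture, f, indent=2)
```

The TypeScript side loads the same file, evaluates its port of the artifact on each `input`, and fails CI if any `expected` value drifts past `tolerance`.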
Why Us
- Real traction: our live system already outperforms human superforecasters.
- Frontier technical problem: AI reasoning + forecasting + market execution.
- Small, technical founding team with high ownership and fast iteration.
- Backed by top-tier investors and operators.
- Remote-friendly with Sydney and San Francisco presence.
How To Apply
Send your resume and a brief note covering:
- The most rigorous leakage-resistant evaluation system you have built.
- A concrete case where improved calibration or decision policy changed real business/trading outcomes.
- How you would structure a point-in-time evaluation + optimization loop that is production-faithful across Python research and TypeScript execution.
Optionally, answer a quick-fire question. Given this Brier score decomposition (BS = Reliability - Resolution + Uncertainty):
- Period A: BS = 0.180, Reliability = 0.040, Resolution = 0.100, Uncertainty = 0.240
- Period B: BS = 0.200, Reliability = 0.020, Resolution = 0.060, Uncertainty = 0.240
Which period had better forecasts, and why?