Built to give you an honest score
Most free IQ tests are designed to flatter — short question sets, generous scoring, no confidence intervals. High scores get shared. AurorIQ is built differently. This page explains exactly how.
- Item Response Theory (IRT) scoring — same framework as clinical assessments
- Adaptive question selection — difficulty adjusts in real time to your responses
- Normed on a weighted adult sample — mean 100, SD 15, clinical standard scale
- Confidence intervals reported with every score — not a single inflated number
- Explicit limitations — we state clearly what this test cannot measure
Methodology
AurorIQ uses Item Response Theory — the same mathematical framework as the WAIS-IV and Stanford-Binet 5 — to estimate cognitive ability from a short sequence of adaptively selected questions.
The first question is calibrated to IQ 100. Your response gives the algorithm an initial signal about your ability level (θ).
Using maximum likelihood estimation, the algorithm re-estimates θ and its uncertainty after each answer. It selects the next item with the highest Fisher information at that θ — the question that most reduces uncertainty.
After 25 questions the algorithm has accumulated enough information to produce a stable θ. This is converted to the IQ scale (IQ = 100 + 15θ) and reported with a 95% confidence interval.
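The loop described above can be sketched in a few lines. This is an illustrative simulation under a two-parameter logistic (2PL) model with a hypothetical item bank — AurorIQ's operational item parameters are not public — but the structure follows the steps described: maximum-likelihood estimation of θ after each answer, selection of the unused item with the highest Fisher information at the current θ, and conversion of the final estimate to the IQ scale with a 95% confidence interval.

```python
import math
import random

def prob_correct(theta, a, b):
    """2PL item response function: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    """Maximum-likelihood theta over a coarse grid.
    responses: list of (a, b, correct) tuples."""
    grid = [g / 10.0 for g in range(-40, 41)]  # theta in [-4, 4]
    def log_lik(theta):
        total = 0.0
        for a, b, correct in responses:
            p = prob_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total
    return max(grid, key=log_lik)

def run_adaptive_test(bank, true_theta, n_items=25, seed=0):
    """Simulate one adaptive session; returns (iq, (ci_low, ci_high))."""
    rng = random.Random(seed)
    responses, theta_hat = [], 0.0  # first item targeted at IQ 100 (theta = 0)
    remaining = list(bank)
    for _ in range(n_items):
        # next item = unused item with max Fisher information at current theta
        a, b = max(remaining, key=lambda it: item_information(theta_hat, *it))
        remaining.remove((a, b))
        correct = rng.random() < prob_correct(true_theta, a, b)
        responses.append((a, b, correct))
        theta_hat = estimate_theta(responses)
    # standard error from total test information; CI mapped to the IQ scale
    info = sum(item_information(theta_hat, a, b) for a, b, _ in responses)
    se = 1.0 / math.sqrt(info)
    iq = 100 + 15 * theta_hat
    return iq, (iq - 1.96 * 15 * se, iq + 1.96 * 15 * se)
```

For example, `run_adaptive_test([(1.2, -3 + 0.1 * k) for k in range(61)], true_theta=1.0)` simulates a test-taker one SD above the mean against a 61-item bank with difficulties spread over [−3, 3].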
Questions are drawn from five cognitive domains. Per-domain performance generates the breakdown shown in your results — though domain-level estimates carry wider confidence intervals than the full-scale IQ.
The key advantage of IRT over classical test theory (CTT): an IRT ability estimate is not tied to the specific questions asked. A different 25-item adaptive test from the same item bank would produce a comparable score. CTT scores depend on the items administered, which is why easy-question tests produce inflated scores.
AurorIQ's item bank was calibrated on a representative adult sample. Item parameters (difficulty b, discrimination a, guessing c) were estimated via marginal maximum likelihood. Items with poor fit or low discrimination were excluded from the operational bank.
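The parameterisation named here (difficulty b, discrimination a, guessing c) is the standard three-parameter logistic (3PL) model. As a sketch — this is the textbook form, not AurorIQ's calibration code:

```python
import math

def prob_correct_3pl(theta, a, b, c):
    """3PL item response function.
    a = discrimination (slope), b = difficulty (location),
    c = guessing floor (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

A four-option multiple-choice item might have c ≈ 0.25: even a test-taker far below the item's difficulty answers correctly about a quarter of the time, which is why ignoring guessing inflates ability estimates.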
Cognitive Domains
Five domains drawn from the Cattell–Horn–Carroll (CHC) model of cognitive abilities. Together they provide broad coverage of the abilities that load on general intelligence (g).
Pattern recognition items assess fluid intelligence most directly. You are shown a visual or abstract matrix and must identify the rule governing the sequence to choose the missing element — closely resembling Raven's Progressive Matrices, the most widely used culture-reduced measure of fluid IQ.
These items have minimal verbal load and relatively high cross-cultural fairness compared with verbal or numeric items.
Numeric reasoning items assess quantitative reasoning (Gq) and fluid reasoning applied to numerical material. Items are designed to minimise the role of memorised procedures — a person with primary-school maths can solve most items by reasoning rather than formula.
Difficulty ranges from simple number sequences to multi-step word problems requiring quantitative modelling.
Verbal items assess both fluid reasoning applied to linguistic material and crystallised intelligence (Gc) — vocabulary and semantic knowledge. They carry the highest weight because verbal ability is the strongest single predictor of g in English-language populations.
Non-native English speakers may receive slightly lower verbal scores; this effect is partially mitigated by down-weighting the verbal domain relative to the pattern and spatial domains.
Spatial items assess visual-spatial processing (Gv) — the ability to mentally manipulate 2D and 3D shapes, identify perspectives, and reason about geometric relationships. Entirely non-verbal; low cultural loading.
Spatial ability is a significant independent predictor of STEM performance and shows distinct heritability from verbal ability, suggesting a partially separable cognitive system.
Working memory items assess holding and manipulating information simultaneously — the central executive of Baddeley's model. WM capacity correlates with g at r≈0.6–0.7, among the highest single-construct correlates of general intelligence.
These items receive the lowest weight because WM is most sensitive to testing conditions — fatigue, distraction, and anxiety affect WM disproportionately compared with other domains.
Validity
How closely does AurorIQ measure what it claims to measure? We assess validity by comparing scores against established criterion measures.
These figures come from our internal validation study. Because participants are self-selected, estimates should be treated as approximate. The WAIS-IV criterion validity (r=0.82) reflects participants who took both assessments — not a fully representative adult sample.
For comparison: the WAIS-IV has published test-retest reliability of 0.94–0.96. AurorIQ's 0.89 is lower — reflecting the reduced precision of a 25-item unproctored online test versus a 90-minute clinically administered battery. We consider this an honest trade-off for accessibility.
Limitations
A platform willing to state what it cannot do is more trustworthy than one that claims perfection. This section is deliberate.
IQ tests — including AurorIQ — measure a specific and limited set of cognitive abilities under specific conditions. These are genuine limitations you should understand before interpreting your result.
- Not a clinical assessment — AurorIQ scores cannot be used for Mensa applications, educational placement, disability assessments, employment screening, or any clinical or legal purpose. Only a proctored assessment by a licensed psychologist using a validated instrument (WAIS-IV, Stanford-Binet 5) qualifies for those purposes.
- Condition sensitivity — your score is sensitive to testing conditions. Fatigue, distraction, anxiety, time of day, and recent illness all affect performance. A single result on a single day is not definitive. If you took the test in poor conditions, retake it — your results are not stored on our servers.
- Language and cultural assumptions — the verbal domain has material cultural loading. The test was developed in English and normed on English-speaking adults. Non-native English speakers may receive scores that underestimate their true fluid intelligence. We partially mitigate this by down-weighting verbal items relative to pattern and spatial items.
- What IQ doesn't measure at all — creativity, emotional intelligence, practical wisdom, character, motivation, domain expertise, and most of what determines whether a person lives a good life are not measured by IQ tests. A high score is an advantage in specific contexts — it is not a measure of your worth or your ceiling.
- Extreme score reliability — scores below 80 or above 130 have wider confidence intervals than scores near the mean. The item bank has fewer highly discriminating items at the extremes. Treat extreme scores as directional indicators, not precise measurements.
Scoring
How your raw responses are converted into an IQ score on the standard mean-100, SD-15 scale.
The IRT ability estimate (θ) is a z-score on the latent ability scale, converted to IQ via IQ = 100 + (15 × θ). The confidence interval around θ is derived from the test information function and transformed identically.
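The conversion can be written compactly. Here `test_information` stands in for the information accumulated over the session (the exact value depends on the items administered); the linear map to the IQ scale multiplies the standard error by 15 as well.

```python
import math

def iq_from_theta(theta, test_information, z=1.96):
    """Map a latent ability estimate to the IQ scale with a 95% CI.
    SE(theta) = 1 / sqrt(I(theta)); IQ = 100 + 15 * theta scales
    the standard error by the same factor of 15."""
    se_theta = 1.0 / math.sqrt(test_information)
    iq = 100 + 15 * theta
    half_width = z * 15 * se_theta
    return iq, (iq - half_width, iq + half_width)
```

For instance, θ = 1.0 with total information 10 gives IQ 115 with a CI of roughly 106–124.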
Norms are based on an adult sample aged 18–65 from English-speaking countries, with approximately equal representation across age deciles. Scores are interpreted against the full adult population, not age-specific subgroups — unlike the WAIS-IV, which uses age-stratified norms.
AurorIQ vs Other Tests
How does AurorIQ compare to typical free online tests and to a clinical WAIS-IV assessment?
| Feature | AurorIQ | Typical free test | Clinical (WAIS-IV) |
|---|---|---|---|
| IRT scoring | ✓ | ✗ | ✓ |
| Adaptive questions | ✓ | ✗ | ✓ |
| Confidence interval reported | ✓ | ✗ | ✓ |
| Representative norms | ~ | ✗ | ✓ |
| Inflation-free scoring | ✓ | ✗ | ✓ |
| Cognitive domains | 5 domains | 1–2 domains | 5+ domains |
| Validity vs WAIS-IV | r = 0.82 | r ≈ 0.38 | — (criterion) |
| Test-retest reliability | 0.89 | ~0.60–0.70 | 0.94–0.96 |
| Valid for Mensa / clinical use | ✗ | ✗ | ✓ |
| Free, no account required | ✓ | ~ | ✗ |
| Typical cost | Free | Free–£20 | £200–£600 |
Technical FAQ
Questions from sceptical users about our methodology, scoring, and claims.
In adaptive testing, question count is less important than question selection quality. A well-calibrated adaptive test of 25 items focused near your ability level provides more measurement information than a 50-item non-adaptive test where half the questions are too easy or too hard to discriminate effectively.
The Fisher information accumulated by 25 adaptively selected items in AurorIQ typically exceeds the information from 35–40 items of a fixed-difficulty test — which is why most major computerised adaptive tests use 20–35 items rather than 50+.
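An illustrative comparison under assumed 2PL parameters (discrimination a = 1.5 and the difficulty spacings in the comments — chosen for illustration, not AurorIQ's actual bank): items targeted near the test-taker's ability each contribute far more information than items scattered across the whole scale.

```python
import math

def item_info(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

theta, a = 0.0, 1.5  # test-taker ability and an assumed discrimination

# 25 adaptive items: difficulties clustered within ~1.2 of the ability estimate
adaptive_info = sum(item_info(theta, a, theta + 0.1 * k) for k in range(-12, 13))

# 40 fixed items: difficulties spread uniformly over [-3, 3]
fixed_info = sum(item_info(theta, a, -3.0 + 6.0 * k / 39) for k in range(40))
```

Under these assumptions the 25 targeted items accumulate more total information at θ than the 40 uniformly spread items — the measurement-efficiency argument made above.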
The honest answer: we don't know with certainty. Our norming sample is self-selected — voluntary participants who chose to take the test — and self-selected samples are rarely fully representative of the adult population.
We mitigate this by: using IRT rather than classical test theory (making ability estimates less sample-dependent); applying post-stratification weights based on age and education; and anchoring norms to a subset of participants who also completed Raven's Matrices as a reference. We report this limitation in every score interpretation.
All IQ scores have measurement error — typically ±5–10 points even for clinical instruments. Two scores within that range are not contradictory; they are consistent with the same underlying ability.
If the difference is larger (15+ points), the most likely explanations are: the other test gave an inflated score (most free tests do); testing conditions differed significantly; or practice effects from multiple test exposures. To compare scores meaningfully, both tests would need documented representative norms on the same mean-100, SD-15 scale.
Mensa requires proctored testing under controlled conditions, administered by a licensed professional or at a supervised Mensa testing event. This prevents cheating and ensures the score reflects unassisted genuine performance.
No self-administered online test qualifies — including well-designed ones like AurorIQ. This is not a criticism of online test accuracy; it is a quality-control requirement that testing conditions be verifiable. If you believe you qualify, take the supervised Mensa Admission Test.
Taking more time does not increase your score — only correct answers do. The IRT model scores based on the pattern of right and wrong answers weighted by item parameters, not response time.
You can look up answers, but doing so defeats the purpose entirely. An inflated score that doesn't reflect genuine ability is useless information — and results are stored only in your browser, not our servers, so there's no credential to show anyone. The only person you'd be deceiving is yourself.
Take the most honest free IQ test online
25 adaptive questions. IRT scoring. A confidence interval with your result. No email. No paywall. No inflated score.