TestPrep Istanbul

How does the TOEFL iBT calculate your section scores?

May 18, 2026 · 11 min read

The TOEFL iBT (Test of English as a Foreign Language, Internet-based Test) is a standardised academic English proficiency assessment used by universities and institutions worldwide to evaluate the English language readiness of non-native speakers. The scoring system transforms candidate performance into four section scores (Reading, Listening, Speaking, and Writing) on a 0–30 scale, plus a total score of 0–120. Understanding how these scores are calculated—through raw-to-scaled conversion, automated evaluation, human rating, and statistical equating—enables candidates to set evidence-based targets and allocate preparation time with precision.

The difference between raw scores and scaled scores on the TOEFL iBT

The TOEFL iBT does not report raw scores directly. A raw score is simply the count of questions answered correctly within a section; the TOEFL converts these into scaled scores to account for differences in difficulty across test forms. This conversion ensures that a score of 25 on the Reading section carries the same meaning regardless of which specific set of passages and questions appeared in a candidate's test.

The conversion uses item response theory (IRT), a psychometric model that estimates candidate ability based on both the number of correct responses and the difficulty of the items answered. An adaptive algorithm calibrates each test form before it is operationalised, assigning difficulty parameters to every question. When a candidate completes the test, the IRT model maps the pattern of correct and incorrect responses to an ability estimate, which is then translated onto the familiar 0–30 scale.

Candidates frequently ask whether all questions carry equal weight. In the Reading and Listening sections, questions are not uniformly weighted; items assessing higher-difficulty skills contribute more to the final scaled score than easier items. This is why two candidates who answer the same number of questions correctly may receive different scaled scores.
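
The idea that the response pattern matters, not only the raw count, can be illustrated with a minimal sketch of a two-parameter logistic (2PL) IRT model. The item parameters below are hypothetical, and ETS does not publish its operational parameters or its exact estimation procedure, so this is an illustration of the general technique rather than the actual TOEFL algorithm. In a 2PL model the pattern influences the estimate through each item's discrimination parameter; the sketch assumes that the harder items are also the more discriminating ones.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that a candidate of ability theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given ability."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def estimate_ability(items, responses):
    """Grid-search maximum-likelihood estimate of theta on [-4, 4]."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (discrimination, difficulty) pairs, ordered easy to hard.
ITEMS = [(0.6, -1.5), (0.8, -0.5), (1.0, 0.0), (1.4, 0.8), (1.8, 1.5)]

easy_pattern = [1, 1, 1, 0, 0]  # three easiest items correct
hard_pattern = [0, 0, 1, 1, 1]  # three hardest items correct

theta_easy = estimate_ability(ITEMS, easy_pattern)
theta_hard = estimate_ability(ITEMS, hard_pattern)
print(theta_easy, theta_hard)
```

Both patterns contain three correct answers, yet under these assumed parameters the second pattern produces the higher ability estimate.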

Reading section: scoring mechanics and question weighting

The Reading section comprises 2 passages with approximately 10 questions each, drawn from academic texts in disciplines such as natural sciences, social sciences, and humanities. The total raw score achievable in Reading is the sum of correct responses across all questions, but the conversion to a scaled score is non-linear.

Questions in the Reading section fall into several families: basic comprehension questions (factual information, negative factual information, and sentence simplification), inferential questions (rhetorical purpose, tone, and implication), and synthesising questions (prose summary and insert-a-sentence). Each family carries a different weight in the scoring model. Questions that require integration of information across paragraphs—such as prose summary items—are assigned greater influence because they reflect higher-order reading ability.

The Reading section scaled score is calculated by applying the IRT conversion table specific to that test form. Because test forms vary slightly in difficulty, the raw-score-to-scaled-score mapping shifts accordingly. A candidate who answers 17 of 20 Reading questions correctly might receive a scaled score ranging from 24 to 28, depending on the difficulty profile of those specific questions. This underscores the importance of targeting high-difficulty comprehension skills rather than merely maximising the number of correct answers.
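
A form-specific conversion can be pictured as a lookup table. The sketch below uses two invented tables for a 20-question section (they are not official ETS tables) and interpolates linearly between anchor points. The names `FORM_HARD`, `FORM_EASY`, and `raw_to_scaled` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical (raw, scaled) anchor points; NOT official conversion tables.
FORM_HARD = [(0, 0), (8, 9), (14, 20), (17, 26), (20, 30)]
FORM_EASY = [(0, 0), (8, 7), (14, 17), (17, 24), (20, 30)]

def raw_to_scaled(raw, table):
    """Interpolate linearly between anchor points and round to an integer."""
    if not table[0][0] <= raw <= table[-1][0]:
        raise ValueError("raw score outside the table's range")
    raws = [r for r, _ in table]
    i = bisect_right(raws, raw) - 1
    if i == len(table) - 1:          # raw equals the maximum raw score
        return table[-1][1]
    (r0, s0), (r1, s1) = table[i], table[i + 1]
    return round(s0 + (s1 - s0) * (raw - r0) / (r1 - r0))

print(raw_to_scaled(17, FORM_HARD))  # same raw score on the harder form
print(raw_to_scaled(17, FORM_EASY))  # same raw score on the easier form
```

With these assumed tables, the same raw score maps to a higher scaled score on the harder form, which is the behaviour the form-specific conversion is designed to produce.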

Listening section: scoring mechanics and question weighting

The Listening section tests the ability to understand academic lectures and conversations in campus settings. Candidates hear 3–4 lectures and 2–3 conversations, answering questions that assess comprehension of stated information, attitude, and purpose. Like Reading, the Listening section uses a non-linear conversion from raw to scaled scores.

Question types in Listening include basic comprehension (main idea, detail), pragmatic comprehension (function, attitude, tone), and connecting information (organization, inference, synthesis). Questions demanding synthesis of multiple speakers or ideas carry higher difficulty weights in the IRT model, reflecting their greater cognitive demand.

One important scoring consideration for Listening is that partial credit is not awarded. Each correct response contributes one raw point, and there is no penalty for incorrect answers. Candidates should therefore answer every question, even when unsure, because expected-value reasoning favours leaving no question unanswered.
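
The expected-value argument can be written out as a short calculation. The sketch assumes four answer options per question and purely random guessing; both are simplifying assumptions, and `expected_raw_points` is a hypothetical helper.

```python
from fractions import Fraction

def expected_raw_points(n_unsure, n_options=4, penalty=Fraction(0)):
    """Expected raw points from guessing uniformly at random on n_unsure
    single-answer questions, given a per-question penalty for wrong answers."""
    p = Fraction(1, n_options)
    return n_unsure * (p - (1 - p) * penalty)

guessing = expected_raw_points(4)   # guess on four uncertain questions
leaving_blank = 0                   # blank answers earn nothing
print(guessing, leaving_blank)
```

With no penalty, guessing has a positive expected value for any number of options, whereas a blank answer is worth zero, so guessing is never the worse choice.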

Speaking section: the dual scoring system of AI and human raters

The Speaking section consists of four tasks: one independent task (personal experience or opinion) and three integrated tasks (reading–listening–speaking or listening–speaking). Each task is scored on a 0–4 rubric scale, and the raw rubric scores are converted to a 0–30 scaled score.

Every Speaking response undergoes dual evaluation: an automated speech scoring system (AS3) and at least one human rater, both operating independently. The automated system analyses acoustic features (pronunciation, fluency, rhythm) and linguistic features (vocabulary range, grammatical accuracy, discourse coherence). Human raters apply the same analytical rubric, evaluating delivery, language use, and topic development.

The final score for each task is determined by averaging the human and automated scores, rounded to the nearest 0.5 increment. If the human and automated scores diverge significantly (a gap of 2 or more points on the rubric scale), a second human rater is brought in to resolve the discrepancy. This quality-control mechanism ensures that neither human bias nor automated-system error unduly influences the outcome.
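
The averaging and discrepancy rule described above can be sketched as follows. The function names are hypothetical, the automated score is assumed to be a decimal on the same 0–4 scale, and the sketch only flags a response for a second human rater because the way that rater's score is combined is not specified here.

```python
import math

def round_half(x):
    """Round to the nearest 0.5, with exact ties rounded up."""
    return math.floor(x * 2 + 0.5) / 2

def score_speaking_task(human, automated, gap_threshold=2.0):
    """Return (task_score, needs_second_rater) for one Speaking task."""
    for s in (human, automated):
        if not 0 <= s <= 4:
            raise ValueError("rubric scores must lie between 0 and 4")
    if abs(human - automated) >= gap_threshold:
        return None, True            # escalate instead of averaging
    return round_half((human + automated) / 2), False

print(score_speaking_task(3, 3.4))   # close agreement: averaged and rounded
print(score_speaking_task(4, 2))     # gap of 2 points: flagged
```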

The Speaking rubric evaluates three dimensions: delivery (clarity of speech, pronunciation, and fluency), language use (grammatical accuracy and lexical appropriateness), and topic development (completeness and coherence of ideas). The independent task weights topic development more heavily, while integrated tasks require candidates to synthesise information from multiple sources, making language use and coherence proportionally important.

Writing section: human rating and e-rater evaluation

The Writing section comprises two tasks: an integrated Writing task (reading, listening, and writing a response) and an independent Writing task (essay based on a prompt). Each is scored on a 0–5 rubric scale and converted to a 0–30 scaled score for the section.

As with the Speaking section, Writing responses receive dual evaluation: a trained human rater and the ETS e-rater automated scoring engine. The e-rater analyses dozens of linguistic features, including vocabulary sophistication, syntactic complexity, grammatical accuracy, and organisational patterns. Human raters evaluate the same criteria through the lens of professional judgement.

The Writing rubric for the integrated task assesses reading–listening comprehension (accuracy of content), quality of writing (organisation, clarity, coherence), and English language proficiency. The independent essay rubric focuses on development (ideas and supporting examples), organisation, and language use. Human and e-rater scores are averaged; a third evaluator is introduced if the discrepancy exceeds one point on the rubric scale.
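
A minimal sketch of the Writing workflow follows. The discrepancy check mirrors the rule above; the conversion to the 0–30 scale is an ASSUMED proportional rescaling used only for illustration, since the actual rubric-to-scaled conversion is not described here. Both function names are hypothetical.

```python
import math

def average_or_flag(human, e_rater, max_gap=1.0):
    """Mean of the two ratings, or None when the gap exceeds max_gap
    and a further evaluator would be needed."""
    if abs(human - e_rater) > max_gap:
        return None
    return (human + e_rater) / 2

def writing_section_score(task_scores, rubric_max=5, scale_max=30):
    """ASSUMPTION: rescale the mean rubric score proportionally to 0-30,
    rounding halves up. This is not an official conversion."""
    mean = sum(task_scores) / len(task_scores)
    return math.floor(mean * scale_max / rubric_max + 0.5)

integrated = average_or_flag(4, 3)     # gap of exactly 1: averaged
independent = average_or_flag(4, 4)
print(writing_section_score([integrated, independent]))
```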

One common misconception is that longer essays automatically score higher. While adequate development requires sufficient length, essays that are wordy without substance can score poorly. The rubric explicitly rewards concision, logical progression, and relevant supporting details over mere word count.

Score equating across TOEFL iBT test forms

To maintain score consistency across different test administrations, the TOEFL iBT employs statistical equating. Because no two test forms are identical, equating adjusts for minor differences in difficulty so that a score of 25 represents the same ability level regardless of which test form was taken.

Equating is accomplished through common-item equating and concurrent equating. In common-item equating, a subset of questions that appear in both an operational form and a new form is used to calibrate the difficulty of the new form. In concurrent equating, the IRT model simultaneously estimates ability and item difficulty across all test takers, producing a unified scale.
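
One standard way to use common items is mean-sigma linking, sketched below. It is named here as an illustrative technique; the text above does not state which linking method ETS applies, and the difficulty values are hypothetical. The common items' difficulties, estimated separately on each form, determine a linear transformation that places the new form on the reference scale.

```python
from statistics import mean, pstdev

def mean_sigma_link(b_new, b_ref):
    """Return (A, B) such that b_ref is approximately A * b_new + B
    for the common (anchor) items."""
    A = pstdev(b_ref) / pstdev(b_new)
    B = mean(b_ref) - A * mean(b_new)
    return A, B

def to_reference_scale(theta_new, A, B):
    """Place an ability estimate from the new form on the reference scale."""
    return A * theta_new + B

# Hypothetical difficulties of the same three anchor items on each form.
anchor_new = [-1.0, 0.0, 1.0]
anchor_ref = [-0.8, 0.4, 1.6]

A, B = mean_sigma_link(anchor_new, anchor_ref)
print(A, B, to_reference_scale(0.5, A, B))
```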

This process has practical implications for candidates. Minor fluctuations in section scores between test attempts do not necessarily indicate improvement or regression; they may reflect slight differences in the difficulty of the specific questions encountered. MyBest scores (the highest section scores achieved across all valid TOEFL iBT test attempts within a two-year window) provide a more reliable indicator of ability by aggregating the most favourable performance across administrations.

Understanding your TOEFL score report and MyBest scores

The TOEFL iBT score report presents section scores, total score, and performance descriptors indicating whether a candidate falls into the advanced, high-intermediate, intermediate, or low-intermediate proficiency band. For each section, a vertical performance bar graphically represents the score relative to the full 0–30 range.

MyBest scores represent the highest section scores achieved across all valid TOEFL iBT tests taken in the past two years. These are displayed alongside the most recent test scores on the score report. Institutions vary in how they interpret MyBest scores; some consider them as supplementary information, while others use them as the primary basis for admission decisions. Candidates should verify the score-use policy of each target institution.
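
The selection logic can be sketched in a few lines. The attempt records are invented, the two-year window is approximated as 730 days, and summing the best section scores into a total is an assumption of this sketch.

```python
from datetime import date, timedelta

SECTIONS = ("reading", "listening", "speaking", "writing")

def mybest(attempts, today, window_days=730):
    """Highest score per section across attempts inside the window."""
    cutoff = today - timedelta(days=window_days)
    valid = [a for a in attempts if cutoff <= a["date"] <= today]
    if not valid:
        return None
    best = {s: max(a[s] for a in valid) for s in SECTIONS}
    best["total"] = sum(best[s] for s in SECTIONS)  # assumed aggregation
    return best

# Hypothetical attempts; the 2021 sitting falls outside the window.
attempts = [
    {"date": date(2021, 6, 5), "reading": 29, "listening": 29,
     "speaking": 29, "writing": 29},
    {"date": date(2024, 3, 2), "reading": 24, "listening": 26,
     "speaking": 22, "writing": 23},
    {"date": date(2025, 1, 18), "reading": 27, "listening": 25,
     "speaking": 23, "writing": 22},
]

print(mybest(attempts, today=date(2025, 6, 1)))
```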

The score report also includes percentile rankings that contextualise performance relative to the global test-taking population. The percentile rank attached to a Reading score of 26, for example, is the proportion of test takers who scored below 26. Percentile ranks differ across sections because the distribution of scores varies; Listening scores tend to cluster higher than Reading scores across the global population.
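
A percentile rank under the "proportion scoring below" definition can be computed as in this sketch. The population is a tiny invented sample, and other definitions (for instance, counting half of the tied scores) give slightly different values.

```python
def percentile_rank(score, population):
    """Percentage of the population scoring strictly below the given score."""
    below = sum(1 for s in population if s < score)
    return 100.0 * below / len(population)

# Hypothetical Reading scores for ten test takers.
reading_scores = [14, 18, 20, 22, 23, 25, 26, 26, 28, 30]

print(percentile_rank(26, reading_scores))
```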

Common pitfalls and how to avoid them

Candidates often misunderstand the relationship between raw accuracy and scaled scores. Assuming that a fixed number of correct answers always yields a specific scaled score leads to misguided preparation strategies. The non-linear conversion means that marginal improvements at the high end of raw-score performance yield disproportionately large gains in scaled scores. Conversely, improving from a very low raw-score baseline produces modest scaled-score increases until a threshold is reached.

Another pitfall is neglecting Speaking and Writing preparation in favour of Reading and Listening drills. Because the integrated tasks require simultaneous processing of multiple input sources, developing the specific skills tested by those tasks—note-taking, summarisation, and source integration—requires targeted practice distinct from general language exposure.

Candidates who rely solely on automated scoring tools for Speaking and Writing practice may also underperform. Automated tools provide useful feedback on surface-level features, but they do not replicate the nuanced judgement of human raters on discourse coherence, argument quality, or pragmatic appropriateness. Pairing automated feedback with self-review against the official rubrics and, where possible, expert human evaluation produces more reliable score gains.

Conclusion and next steps

The TOEFL iBT scoring system is a sophisticated integration of psychometric modelling, automated evaluation, and human professional judgement. Understanding the mechanics of raw-to-scaled conversion, the weighting of higher-difficulty items, and the dual-evaluation process for productive skills equips candidates with the knowledge needed to set precise targets and design focused preparation plans. By targeting the specific competencies evaluated at the high end of the rubric and using the official scoring rubrics as the primary reference, candidates can systematically close the gap between their current performance and their target score. TestPrep's complimentary diagnostic assessment offers a natural starting point for candidates seeking a sharper preparation plan.

Frequently asked questions

How is the TOEFL iBT total score calculated?
The total TOEFL iBT score is the sum of the four section scores (Reading, Listening, Speaking, and Writing), each reported on a 0–30 scale. The total therefore ranges from 0 to 120. Each section score undergoes a conversion from raw correct responses to a scaled score using item response theory, which accounts for the difficulty of specific questions encountered.

Do all questions in the TOEFL Reading and Listening sections count equally towards the final score?
No. Questions that assess higher-order skills, such as inference, synthesis, and integration of information, carry greater weight in the item response theory model than factual-detail questions. This means two candidates with the same raw count of correct answers may receive different scaled scores depending on the difficulty profile of the questions answered correctly.

How does ETS evaluate TOEFL Speaking responses?
Every Speaking response is evaluated by both an automated speech scoring system (AS3) and at least one trained human rater. Both evaluators apply the same 0–4 rubric across delivery, language use, and topic development. The final score averages the automated and human scores, rounded to the nearest half-point. A second human rater is introduced if the two scores diverge significantly.

What is the TOEFL MyBest score and how is it calculated?
MyBest scores display the highest section scores achieved across all valid TOEFL iBT test attempts within a two-year window. They are calculated by selecting the highest scaled score from each section across all recorded attempts and reporting those alongside the most recent test scores. Institutions apply MyBest scores according to their individual score-use policies.

Can I improve my TOEFL score simply by answering more questions correctly?
Improving raw accuracy is necessary but not sufficient. The non-linear scaling means that performance on higher-difficulty questions has a disproportionate impact on the scaled score. Preparation strategies should therefore prioritise developing the skills assessed by the most heavily weighted question types (integrated synthesis, critical inference, and extended discourse production) rather than focusing exclusively on increasing the raw number of correct responses.