Understand the Statistics
Statistics enable you to improve items that will be used again in tests, determine the validity of a test's score in measuring student aptitude, and identify specific areas of instruction that need greater focus.
Item Statistics
Item statistics assess the items that made up a test.
Item Difficulty (P)
P Value refers to the degree of challenge for items based on the percentage of students who chose the correct answer. Item difficulty is relevant for determining whether students have learned the concept being tested.
 Also Called: P Value, Difficulty Index
 Score Range: 0.00 to 1.00
The higher the value, the easier the question. In other words, if no one answers an item correctly, the value would be 0.00. An item that everyone answers correctly would have a value of 1.00.
Desired Score for Classroom Tests
A classroom test ideally is comprised of items with a range of difficulties that average around .5 (generally, item difficulties between .3 and .7).
Score Band Interpretation
P ≥ .70  Easy  The correct answer is chosen by 70 percent or more of the students 
.30 ≤ P < .70  Average  The correct answer is chosen by 30 to 70 percent of the students 
P < .30  Challenging  The correct answer is chosen by fewer than 30 percent of the students 
Corrective Actions
Items that are too easy (P > .80) or too difficult (P < .30) do not contribute to test reliability and should be used sparingly. It's best to review additional indicators, such as item discrimination and point biserial, to determine what action may be needed for these items.
Formula
This value is determined simply by calculating the proportion of students who answered the item correctly using the following formula.
P = N_{p} / N
N_{p} = number of students who answered correctly
N = number of students who answered
The correlation calculations include both dichotomous and non-dichotomous items (see the table below for which item types are dichotomous or non-dichotomous). If an item is non-dichotomous, the points earned may be partial points.
Dichotomous  Non-Dichotomous 
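As a quick sketch, the P value can be computed from a list of 0/1 item responses; the class size and responses below are hypothetical.

```python
# Item difficulty (P): proportion of students who answered the item correctly.
# responses: 1 = correct, 0 = incorrect (dichotomous scoring).
def item_difficulty(responses):
    return sum(responses) / len(responses)

# Hypothetical class of 10 students, 7 of whom answered correctly.
p = item_difficulty([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
# p = 0.7, an "average" item on the easy end of the band
```

For a non-dichotomous item, the same idea applies with partial credit: divide the points earned by the points possible before averaging.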
Discrimination Index (D)
Item discrimination is a correlation value (similar to point biserial) that relates the item performance of students who have mastered the material to the students who have not. It serves as an indicator of how well the question can tell the difference between high and low performers.
 Also Called: Item Discrimination
 Score Range: −1.00 to 1.00
Items with higher values are more discriminating. Items with lower values are typically too easy or too hard. This matrix provides a simplified view of how items are determined to have high or low values.
Student Performs Well on Test  Student Performs Poorly on Test  
Student Gets Item Right  High D  Low D 
Student Gets Item Wrong  Low D  High D 
Desired Score for Classroom Tests
.20 or higher
Score Band Interpretation and Color Coding
D ≥ .70  Excellent  Best for determining top performers from bottom performers 
.60 ≤ D < .70  Good  Item discriminates well in favor of top performers 
.40 ≤ D < .60  Acceptable  Item discriminates reasonably well 
.20 ≤ D < .40  Needs Review  May need corrective action unless it is a mastery level question 
D < .20  Unacceptable  Needs corrective action 
Corrective Actions
Items with low discrimination should be reviewed to determine whether they are ambiguously worded or whether classroom instruction needs improvement. Items with negative values should be scrutinized for errors or discarded. For example, a negative value may indicate that the item was miskeyed, is ambiguous, or is misleading.
Formula
When calculating item discrimination, all students taking the test are first ranked by total score; then the top 27 percent (high performers) and the bottom 27 percent (low performers) are identified. Finally, item difficulty is calculated for each group and the two values are subtracted using the following formula.
D = P_{H} − P_{L}
P_{H} = item difficulty score for high performers
P_{L} = item difficulty score for low performers
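The steps above can be sketched in Python; the scores below and the way the 27 percent group size is rounded are illustrative assumptions, not a prescribed implementation.

```python
def discrimination_index(results, fraction=0.27):
    # results: (total_test_score, item_correct) pairs, item_correct is 0 or 1.
    ranked = sorted(results, key=lambda pair: pair[0], reverse=True)
    n = max(1, round(len(ranked) * fraction))  # size of each 27% group
    high, low = ranked[:n], ranked[-n:]
    p_high = sum(correct for _, correct in high) / n  # P_H
    p_low = sum(correct for _, correct in low) / n    # P_L
    return p_high - p_low                             # D = P_H - P_L

# Hypothetical results: the top scorers got the item right, the bottom scorers did not.
results = [(95, 1), (90, 1), (88, 1), (80, 1), (75, 0),
           (70, 1), (65, 0), (60, 0), (55, 0), (40, 0)]
d = discrimination_index(results)
# d = 1.0: the item perfectly separates high from low performers
```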
Point Biserial Correlation (r_{pb})
Point biserial is a correlation value (similar to item discrimination) that relates student item performance to overall test performance. It serves as an indicator of how well the question can tell the difference between high and low performers. The main difference between point biserial and item discrimination is that every person taking the test is used to compute the point biserial score, while only 54% of students (27% upper + 27% lower) are used to compute the item discrimination score.
 Also Called: Item Discrimination II, Discrimination Coefficient
 Score Range: −1.00 to 1.00
A high point biserial value means that students selecting the correct response are students with higher total scores, and students selecting incorrect responses to an item are associated with lower total scores. Very low or negative point biserial values can help identify items that are flawed.
Desired Score for Classroom Tests
.20 or higher
Score Band Interpretation
r_{pb} ≥ .30  Excellent  Best for determining top performers from bottom performers 
.20 ≤ r_{pb} < .30  Good  Reasonably good, but subject to improvement 
.10 ≤ r_{pb} < .20  Acceptable  Usually needs improvement 
r_{pb} < .10  Poor  Needs corrective action 
Corrective Actions
Items with low discrimination should be reviewed to determine whether they are ambiguously worded or whether classroom instruction needs improvement. Items with negative values should be scrutinized for errors or discarded. For example, a negative value may indicate that the item was miskeyed, is ambiguous, or is misleading.
Formula
Point biserial identifies items that correctly discriminate between high and low groups, as defined by the test as a whole, using the following formula.
r_{pb} = ((M_{p} − M_{q}) / S_{t}) × √(pq)
M_{p} = mean score for students answering the item correctly
M_{q} = mean score for students answering the item incorrectly
S_{t} = standard deviation for the whole test
p = proportion of students answering correctly
q = proportion of students answering incorrectly
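A minimal sketch of this formula, assuming dichotomous scoring and the population standard deviation for S_{t}; the four-student data set is hypothetical.

```python
from statistics import pstdev

def point_biserial(item_correct, total_scores):
    # item_correct: 0/1 per student; total_scores: whole-test score per student.
    n = len(total_scores)
    p = sum(item_correct) / n                      # proportion correct
    q = 1 - p                                      # proportion incorrect
    right = [s for c, s in zip(item_correct, total_scores) if c]
    wrong = [s for c, s in zip(item_correct, total_scores) if not c]
    m_p = sum(right) / len(right)                  # M_p
    m_q = sum(wrong) / len(wrong)                  # M_q
    s_t = pstdev(total_scores)                     # S_t (population SD)
    return (m_p - m_q) / s_t * (p * q) ** 0.5

# Hypothetical four-student test: the two highest scorers got the item right.
r_pb = point_biserial([1, 1, 0, 0], [10, 8, 4, 2])
# r_pb ≈ 0.95, well above the .20 target
```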
Test Statistics
Test statistics assess the performance of the test as a whole.
Cronbach's Alpha Reliability (α)
Cronbach's alpha measures the internal consistency reliability of the test based on the composite scores of its items. It serves as an indicator of the extent to which the test is likely to produce consistent scores.
 Also Called: Internal Consistency Reliability, Coefficient Alpha
 Score Range: 0.00 to 1.00
High reliability means that students who answered a given question correctly were more likely to answer other questions correctly. Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly.
Desired Score for Classroom Tests
.70 or higher
Score Band Interpretation and Color Coding
α ≥ .90  Excellent  In the range of the best standardized tests 
.70 ≤ α < .90  Good  In the desired range of most classroom tests 
.60 ≤ α < .70  Acceptable  There are some items that could be improved 
.50 ≤ α < .60  Poor  Suggests need for corrective action, unless it purposely contains very few items 
α < .50  Unacceptable  Should not contribute heavily to the course grade and needs corrective action 
Corrective Actions
There are a few ways to improve test reliability.
 Increase the number of items on the test.
 Use items with higher item discrimination values.
 Include items that measure higher, more complex levels of learning, and include items with a range of difficulty with most questions in the middle range.
 If one or more essay questions are included on the test, grade them as objectively as possible.
Formula
The standardized Cronbach's alpha formula estimates the ratio of true score variance to total variance in the composite scores from the number of items and their mean inter-item correlation, using the following formula.
α = (k × r) / (1 + (k − 1) × r)
k = total number of items on the test
r = mean inter-item correlation
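The standardized formula can be sketched directly; the item count and mean inter-item correlation below are hypothetical.

```python
def standardized_alpha(k, r):
    # Standardized Cronbach's alpha from the item count (k)
    # and the mean inter-item correlation (r).
    return (k * r) / (1 + (k - 1) * r)

# Hypothetical 20-item test whose items correlate .15 on average.
alpha = standardized_alpha(20, 0.15)
# alpha ≈ 0.78, in the "Good" band for classroom tests
```

Note how alpha grows with k: adding items of similar quality raises reliability, which is why increasing the number of items is the first corrective action listed above.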