Score Report Interpretation
There are four standard types of reports that you can receive after your exam has been scored:
- Class Response Report
- Item Analysis Report
- Roster Report
- Score Distribution Reports
The Class Response, Item Analysis and Roster reports are available either electronically or in hard copy format. The Score Distribution reports are available in hard copy only. If you request that your results be sent electronically, an email with a OneDrive link will be sent to you immediately after scoring is complete. The link will take you to a OneDrive folder containing the reports that you requested. Hard copies of reports that you request will be returned to you with your key and student scan forms.
The Class Response Report provides a printout of student test responses sorted by student ID number. This report is a valuable tool when a student questions the accuracy of his or her exam score. The Response Description box lists the following:
- A dash (-) indicates a correct response
- A pound sign (#) indicates the scanner read more than one response
- A space indicates no response was given
- A letter of the alphabet indicates the student’s incorrect response
- An asterisk (*) indicates a bonus test item
Students who do not fill in their bubbles darkly enough on the scan form may receive an inaccurate score because the scanner cannot read the response, which appears as a space (no response). This is why it is important for students to press firmly when bubbling in their answers. Similarly, if a student does not erase an answer completely before bubbling in a different answer for that question, the scanner may pick up two responses, which appear as “multiple marks” on the report. The Test Scoring Unit enhances the accuracy of student scores by flagging forms with no-response or multiple-response items and making the necessary corrections when needed. However, it is first the responsibility of students to fill in their scan forms correctly. Instructors also should check forms when they are handed in to ensure that they have been filled in correctly.
Below the Response Description box is the key box, which lists each test item number and its correct answer.
The Exam # column will provide you with essay points for each student, when essay points are recorded in the instructor use column on the blue scan form or in the exam number column on the green scan form.
To write effective items, it is necessary to examine whether they are measuring the fact, idea, or concept for which they were intended. This is done by studying students’ responses to each item. When formalized, the procedure is called “item analysis.” It is a scientific way of improving the quality of tests and test items in an item bank.
An item analysis provides three kinds of important information about the quality of test items.
- Item difficulty: A measure of whether an item was too easy or too hard.
- Item discrimination: A measure of whether an item discriminated between students who knew the material well and students who did not.
- Effectiveness of alternatives: Determination of whether distractors (incorrect but plausible answers) tend to be marked by the less able students and not by the more able students.
Item difficulty, item discrimination and the effectiveness of distractors on a multiple-choice test are automatically available with ParScore’s item analysis. An illustration of ParScore’s “Standard Item Analysis Report” printout can be found by clicking on the link below.
Additional Item Analysis
Optimal Item Difficulty
Item difficulty is important because it reveals whether an item is too easy or too hard. In either case, the item may add to the unreliability of the test because it does not aid in differentiating between those students who know the material and those who do not. For example, an item answered correctly by everyone does nothing to aid in the assignment of grades. The same is true for items that no one answers correctly.
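In practice, item difficulty is simply the proportion of students who answer the item correctly. A minimal sketch of the calculation (function and variable names here are illustrative, not ParScore’s):

```python
def item_difficulty(responses, key):
    """Proportion of students who answered the item correctly (0.0 to 1.0)."""
    correct = sum(1 for r in responses if r == key)
    return correct / len(responses)

# Ten students' responses to a single item keyed "B":
answers = ["B", "B", "A", "B", "C", "B", "B", "D", "B", "B"]
print(item_difficulty(answers, "B"))  # 0.7
```

A difficulty of 1.0 (everyone correct) or 0.0 (no one correct) contributes nothing to separating students, as noted above.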
The optimal item difficulty depends on the question-type and on the number of possible distractors. Many test experts believe that for a maximum discrimination between high and low achievers, the optimal levels (adjusting for guessing) are:
- 2 alternatives (true-false) = .75
- 3 alternatives (multiple-choice) = .67
- 4 alternatives (multiple-choice) = .63
- 5 alternatives (multiple-choice) = .60
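These optimal values follow a common rule of thumb: the ideal difficulty sits midway between the chance-guessing rate (1/k for k alternatives) and 1.0. A quick sketch of that adjustment (the table’s .63 is .625 rounded):

```python
def optimal_difficulty(k):
    """Optimal difficulty for k alternatives: midway between chance (1/k) and 1.0."""
    return (1 + 1 / k) / 2

for k in (2, 3, 4, 5):
    print(k, optimal_difficulty(k))
```

This yields .75, .67, .625, and .60, matching the levels listed above.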
Items with difficulties less than 30 percent or more than 90 percent definitely need attention. Such items should either be revised or replaced. An exception might be at the beginning of a test where easier items (90 percent or higher) may be desirable.
Item Discrimination I
The single best measure of the effectiveness of an item is its ability to separate students who vary in their degree of knowledge of the material tested and their ability to use it. If one group of students has mastered the material and the other group has not, a larger proportion of the former group should be expected to answer a test item correctly. Item discrimination is the difference between the percentage correct for these two groups.
Item discrimination can be calculated by ranking the students according to total score and then selecting the top 27 percent and the lowest 27 percent in terms of total score. For each item, the percentage of students in the upper and lower groups answering correctly is calculated. The difference is one measure of item discrimination (IDis). The formula is:
IDis = (Upper Group Percent Correct) – (Lower Group Percent Correct)
Item #1 in the attached report would have an IDis of 100% – 62.5% = 37.5% (or .375 as a decimal).
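The upper/lower 27 percent procedure can be sketched in a few lines. This is an illustrative implementation, not ParScore’s:

```python
def item_discrimination(scores, item_correct, fraction=0.27):
    """IDis = (upper-group percent correct) - (lower-group percent correct).

    scores:       total test score for each student
    item_correct: 1 if that student answered this item correctly, else 0
    """
    n = max(1, round(len(scores) * fraction))
    # Rank student indices by total score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    def pct_correct(group):
        return 100 * sum(item_correct[i] for i in group) / len(group)

    return pct_correct(ranked[:n]) - pct_correct(ranked[-n:])

# Ten students: total scores, and their right/wrong record on one item
scores = [95, 90, 88, 75, 70, 62, 60, 55, 50, 40]
item_correct = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
print(item_discrimination(scores, item_correct))  # 100.0
```

Here the top three students all answered the item correctly and the bottom three all missed it, giving the maximum IDis of 100 percent.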
The maximum item discrimination difference is 100 percent. This would occur if all those in the upper group answered correctly and all those in the lower group answered incorrectly. Zero discrimination occurs when equal numbers in both groups answer correctly. Negative discrimination, a highly undesirable condition, occurs when more students in the lower group than in the upper group answer correctly. Item #6 on the attached report has a negative IDis.
The following levels may be used as a guideline for acceptable items.
| IDis | Evaluation |
| --- | --- |
| Negative | Unacceptable – check item for error |
| 0% – 24% | Usually unacceptable – might be approved |
| 25% – 39% | Good item |
| 40% – 100% | Excellent item |
Item Discrimination II
The point biserial correlation (PBC) measures the correlation between the correct answer (viewed as 1 = right and 0 = wrong) on an item and the total test score of all students. The PBC is sometimes preferred because it identifies items that correctly discriminate between high and low groups, as defined by the test as a whole instead of the upper and lower 27 percent of a group. ParScore reports this value as item discrimination in column 5 on the “Standard Item Analysis Report” attached.
Inspection of the attached report shows that the PBC can generate a substantially different measurement of item discrimination than the simple item discrimination difference described above. Often, however, the measures are in close agreement.
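Since the PBC is the Pearson correlation between the 0/1 item score and the total test score, it can be computed directly. A sketch using population standard deviations (ParScore’s exact conventions may differ):

```python
import statistics as st

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mi, mt = st.mean(item_scores), st.mean(total_scores)
    cov = sum((i - mi) * (t - mt) for i, t in zip(item_scores, total_scores)) / n
    return cov / (st.pstdev(item_scores) * st.pstdev(total_scores))

# Five students: right/wrong on one item, and their total scores
item = [1, 1, 1, 0, 0]
totals = [90, 85, 80, 60, 55]
print(round(point_biserial(item, totals), 2))  # 0.97
```

In this made-up example the students who got the item right also have the highest totals, so the item discriminates strongly.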
Generally, the higher the PBC the better the item discrimination, and thus, the effectiveness of the item. The following criteria may be used to evaluate test items.
| PBC | Evaluation |
| --- | --- |
| .30 and above | Very good items |
| .20 to .29 | Reasonably good items, but subject to improvement |
| .10 to .19 | Marginal items, usually needing improvement |
| .00 to .09 | Poor items, to be rejected or revised |
Distractors and Effectiveness
Although Item Discrimination statistics measure important characteristics about test item effectiveness, they don’t reveal much about the appropriateness of item distractors. By looking at the pattern of responses to distractors, teachers can often determine how to improve the test.
The effectiveness of a multiple-choice question is heavily dependent on its distractors. If two distractors in a four-choice item are implausible, the question becomes, in effect, a true-false item. It is, therefore, important for teachers to observe how many students select each distractor and to revise those that draw little or no attention. Use of “all of the above” and “none of the above” is generally discouraged.
Reliability and Validity
The importance of a test achieving a reasonable level of reliability and validity cannot be overemphasized. To the extent a test lacks reliability, the meaning of individual scores is ambiguous. A score of 80, say, may be no different than a score of 70 or 90 in terms of what a student knows, as measured by the test. If a test is not reliable, it is not valid.
Reliability of a Test
Despite differences between the format and construction of various tests, there are two standards by which tests (as compared to items) are assessed. These two standards are reliability and validity.
Reliability refers to the consistency of test scores; how consistent a particular student’s test scores are from one testing to another. In theory, if test A is administered to Class X, and one week later is administered again to the same class, individual scores should be about the same both times (assuming unchanging conditions for both sessions, including familiarity with the test). If the students received radically different scores the second time, the test would have low reliability. Seldom, however, does a teacher administer a test to the same students more than once, so the reliability coefficient must be calculated a different way. Conceptually, this is done by dividing a homogeneous test into two parts (usually even and odd items) and treating them as two tests administered at one sitting. The calculation of the reliability coefficient, in effect, compares all possible halves of the test to all other possible halves.
One of the best estimates of reliability of test scores from a single administration of a test is provided by the Kuder-Richardson Formula 20 (KR20). On the “Standard Item Analysis Report” attached, it is found in the top center area. For example, in this report the reliability coefficient is .87. For good classroom tests, the reliability coefficients should be .70 or higher.
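For reference, KR20 combines the number of items k, each item’s difficulty p, and the variance of the total scores: KR20 = (k / (k − 1)) · (1 − Σpq / σ²). A small sketch on made-up 0/1 data (not the attached report’s data):

```python
import statistics as st

def kr20(item_matrix):
    """KR20 reliability from a 0/1 matrix: one row per student, one column per item."""
    k = len(item_matrix[0])                        # number of items
    totals = [sum(row) for row in item_matrix]     # each student's total score
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / len(item_matrix)
        sum_pq += p * (1 - p)                      # item variance p*q
    return (k / (k - 1)) * (1 - sum_pq / st.pvariance(totals))

# Four students, three items (1 = correct, 0 = incorrect)
data = [[1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
print(kr20(data))  # 0.75
```

A real classroom test has far more items and students; longer tests tend to push KR20 upward, which is why lengthening a test is the first suggestion below.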
To increase the likelihood of obtaining higher reliability, a teacher can:
- increase the length of the test;
- include questions that measure higher, more complex levels of learning, and include questions with a range of difficulty with most questions in the middle range; and
- if one or more essay questions are included on the test, grade them as objectively as possible.
Validity of a Test
Content or curricular validity is generally used to assess whether a classroom test is measuring what it is supposed to measure. For example, a test is said to have content validity if it closely parallels the material which has been taught and the thinking skills that have been important in the course. Whereas reliability is expressed as a quantitative measure (e.g., .87 reliability), content validity is obtained through a rational or logical analysis of the test. That is, one logically compares the test content with the course content and determines how well the former represents the latter.
A quantitative method of assessing test validity is to examine each test item. This is accomplished by reviewing the discrimination (IDis) of each item. If an item has a discrimination measure of 25 percent or higher, it is said to have validity; that is, it is doing what it is supposed to be doing – discriminating between those who are knowledgeable and those who are not.
The standard roster report that you will receive includes the student ID number, name and scores. The roster is available in raw, percent and t-score formats.
You will receive a roster report for each exam submitted for scoring. In addition to scores for the current exam, the roster includes scores for previously scored exams. At the end of the semester or course period, you may request your roster with scores from every test scored, including a total column with the cumulative total scores. Each score column lists the total possible points (from the answer key) at the top of the column.
Rosters with student names are sorted alphabetically by last name.
See sample Roster Report:
There are two different standard score distribution reports available to you – a Score Distribution Percentile Report and a Score Distribution Histogram Report. These reports provide an analysis of how student scores are distributed for the current test, combining all versions of the test. They illustrate the relationship of percentages to points, which is helpful when deciding which percentage cut-offs to use for grade criteria.
The Score Distribution Percentile Report lists each raw score that occurs on the test and the percentage of possible points it represents. It reports the number of students (frequency) achieving that score, and what percent of the class they represent. The report also calculates a cumulative percentage and a percentile. The percentile reported is the percentage of tests scores that are less than the score listed.
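The mechanics of that report can be sketched as follows; the row layout and function name here are illustrative, not the report’s exact format:

```python
from collections import Counter

def percentile_rows(raw_scores, points_possible):
    """Rows of (raw score, % of points, frequency, cumulative %, percentile).

    The percentile is the percentage of scores strictly below the listed
    score, matching the report's definition.
    """
    n = len(raw_scores)
    counts = Counter(raw_scores)
    rows, below, seen = [], 0, 0
    for score in sorted(counts):
        freq = counts[score]
        seen += freq
        rows.append((score,
                     round(100 * score / points_possible, 1),  # % of possible points
                     freq,                                     # students at this score
                     round(100 * seen / n, 1),                 # cumulative %
                     round(100 * below / n, 1)))               # percentile
        below += freq
    return rows

# Five students on a 50-point test
for row in percentile_rows([45, 40, 40, 35, 30], points_possible=50):
    print(row)
```

For example, the top score of 45 (90 percent of the points) sits at the 80th percentile here, because four of the five scores fall below it.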
The Histogram Report allows you to view the data graphically. Scores are converted to a percentage (the horizontal axis) and the frequency (number of students) is plotted on the vertical axis. This graph provides an easy-to-understand visual of each student’s standing relative to the other students taking the test.
View sample distribution reports:
Polk Library, Lower Level - Room 4
801 Elmwood Avenue, Oshkosh, WI 54901
Phone: (920) 424-1432