#BTColumn – Reliability, validity and fairness at CXC

by Barbados Today Traffic

Disclaimer: The views and opinions expressed by this author are their own and do not represent the official position of Barbados Today.

by Michael A. Clarke

I was invited to offer an independent assessment of this situation as I perceive it. I was invited as an educator with an interest in assessment and grading and as a Caribbean national with a demonstrated interest in advancing education quality, equity, and student achievement in the Caribbean and the wider world.

Because this is an education assessment issue, let me go to the Bible of Standards and Assessment. The most recent version is the 2014 edition of “Standards for Educational and Psychological Testing” published by the American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME]. This book will be referred to as “the Standards.”

As the issue here is one of testing, I must frame the discussion by clarifying three key aspects of testing, three key concepts that must be understood before this discussion can move forward. These are the concepts of Reliability, Validity, and Fairness. The definitions are taken from “the Standards.”

Reliability refers to “the consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported” (the Standards, p 33).

What does reliability mean in this context? It means that if Jane Doe had a time machine and could go back and forth in time to prepare for and take different versions of a particular assessment, say the CAPE Chemistry assessment, then her scores on those different versions in different years would be essentially the same. Statisticians would say that, over 95 per cent of the time, the score difference would be less than two standard errors of measurement. A score difference greater than two standard errors of measurement is said to be statistically significant at the 5 per cent level. Reliability is lost if score differences are statistically significant.
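For readers who want to see the arithmetic, the two-standard-errors-of-measurement check can be sketched in a few lines of Python. The standard deviation, reliability coefficient, and scores below are hypothetical, chosen purely for illustration; the formula SEM = SD × √(1 − reliability) is the standard psychometric one.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def scores_consistent(score_a, score_b, sd, reliability):
    """Treat two scores on parallel test forms as consistent when they
    differ by less than two standard errors of measurement."""
    return abs(score_a - score_b) < 2 * sem(sd, reliability)

# Hypothetical scale: SD = 10, reliability = 0.91, so SEM is about 3.
# A difference of 5 points is within two SEMs; a difference of 7 is not.
print(sem(10, 0.91))
print(scores_consistent(72, 77, 10, 0.91))
print(scores_consistent(72, 79, 10, 0.91))
```

On this invented scale, Jane Doe scoring 72 one year and 77 another would still count as consistent; 72 versus 79 would not.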

Validity refers to the degree to which evidence and theory support interpretations of test scores for proposed use of tests (the Standards, p 11).

In this context, these scores are used for promotion; college, professional and higher education matriculation; career and employment entry; scholarship awards, and possibly in other ways that I may not be aware of.

Validity is linked to reliability. Over the years, employers and educators have developed expectations associated with specific grade profiles. Data and evidence of student performance have informed expectations of what it means for a student to have a particular grade in a particular discipline. If the scores are not reliable, that is, not consistent with historic scores, then those expectations are no longer valid. Without score reliability, there cannot be test validity.

Fairness: this term is challenging. Let me quote from “the Standards”. “The term fairness has no single meaning and is used in different ways in public discourse” (the Standards, p 49).

Yet, we all have an intuitive understanding of this concept of fairness. Taking the words from a famous phrase used in 1964 by United States Supreme Court Justice Potter Stewart in a completely different context: “Fairness is hard to define, but ‘I know it when I see it.’”

Reliability, Validity, and Fairness. Let us now bring this home to the current situation. Let us start with what is without question. CXC made two significant changes.

Change 1: CXC changed its test administration protocol. Instead of three papers (Paper 1, Paper 2, and Paper 3), CXC decided to base its assessment on only two: Paper 1 and Paper 3. This changed its test composition protocol, the effective weights and contributing factors of its assessment, and the structure of its grade award protocol.

Dr Juliet Melville has written a precise and concise analysis where she clearly points out the fallacy behind CXC insisting “the weight of various papers in the determination of the final grade remained the same.”

Clearly the weight of Paper 2 went from whatever it was to zero, so no further discussion is required on that point. Thank you, Dr Melville.
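Dr Melville's point can be made concrete with a short Python sketch. The weights below are hypothetical placeholders, not CXC's actual scheme; the only point is that dropping a paper necessarily sets its effective weight to zero and changes the effective weights of the remaining papers.

```python
def effective_weights(weights, dropped):
    """Renormalize paper weights after dropping papers; each dropped
    paper's effective weight becomes zero."""
    kept = {p: w for p, w in weights.items() if p not in dropped}
    total = sum(kept.values())
    return {p: (kept[p] / total if p in kept else 0.0) for p in weights}

# Hypothetical original weights (illustrative only, not CXC's scheme).
original = {"Paper 1": 0.30, "Paper 2": 0.50, "Paper 3": 0.20}
print(effective_weights(original, {"Paper 2"}))
```

Under these invented numbers, Paper 1's effective weight rises from 30 per cent to 60 per cent, Paper 3's from 20 per cent to 40 per cent, and Paper 2's falls to zero; the weights cannot all have "remained the same".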

Change 2: CXC changed the grading protocol for Paper 3. The moderation process was extended from “a random sample selection as in the past” to “the moderation of all centers and subjects.” This change revealed that “several uncertainties exist in school communities about subject profiles and SBA moderation.” (Executive Summary, p 3).

I must say that I take issue with the next sentence in the Executive Summary: “The Committee is of the view that the requirement to moderate all Paper 03’s from all schools and for all subjects served to increase the thoroughness and improve the reliability of the process in 2020 compared to previous years.” What is the basis for my position?

Page 1 of the Executive Summary states: “The Report from the Technical Advisory Committee, TAC for short, indicated that students’ scores from the expanded moderation process were generally lower than in previous years.”

This strongly indicates a lack of the historic consistency that is the hallmark of reliability.

I would agree that the extended moderation process increased thoroughness. I would even agree that it may have improved precision and accuracy in grading. What it did not do was to improve reliability.

Clearly, since the TAC indicated that “students’ scores from the expanded moderation process were generally lower than in previous years” [page 1 of IRT Report] these scores would no longer be reliable from a historical context: they were not consistent with previous iterations of the testing protocol. Hence, the results of the extended moderation process are not, by the definition, reliable. Without reliability, there is no validity. Assumptions made about the interpretation and meaning of assessment scores lose their meaning if those scores are not consistent, if those scores are not reliable.

Allow me to reiterate what I have said so far: the report of the Independent Review Team clearly indicates the results were neither Reliable nor Valid.

Were the results Fair? The concept of Fairness is complicated and is linked to Educational Philosophy. There is a Philosophy of “holding students harmless” for events and activities over which they have no responsibility. Many jurisdictions adopted this philosophy in the wake of this pandemic. In her remarks, Mrs Moore-Dent indicated some jurisdictions that articulated this philosophy. Many of the jurisdictions with which I work have also adopted this philosophy.

It is not clear what philosophy CXC has adopted. However, once the Technical Advisory Committee had reported that the expanded moderation process resulted in lowering students’ grades, this should have been addressed.

Students should have been held harmless to the consequences of the pandemic, and the consequences of a sudden change in the moderation protocol.

There is another thing I find very concerning. In the closing paragraphs of the executive summary, it was indicated that “a simulation exercise (with and without Paper 2), using data from previous years highlighted limitations in the model, namely, shifting in distribution of grades (reduction in Grades I to IV).”

It is clear that the exclusion of Paper 2 had a deleterious effect on those students who would have excelled in that paper. A possible effect would be changing a potential Grade I, the highest possible grade, to a Grade IV. This phenomenon is well known; it is referred to as Grade Compression.

In Paper 2, the written or long-answer paper, students are required to demonstrate their reasoning. Therefore, students who have a deep understanding of the tested construct but are lured by good distractors on the multiple-choice paper have a chance to demonstrate their knowledge in this paper and improve their overall scores.

Conversely, students without a deep understanding of the tested construct who were lucky in their choices on the multiple-choice assessment will see their scores eroded here. It is for this reason that the College Board used only the Paper 2 equivalent for their AP tests this year.

So, to clarify, without Paper 2, there is the potential for strong distractors to depress the scores of students with a deep understanding of the tested construct, and for luck to enhance the scores of students without a deep understanding of the tested construct. The possibility of going from a Grade I to a Grade IV by excluding Paper 2, Grade Compression referenced by Professor Guy Phillip Nason, is something that must be looked at very closely.
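As a purely illustrative sketch (the scores, weights, and grade cut-offs below are invented, not CXC's actual bands or scheme), one can see how removing Paper 2 can push a student who excels on the written paper into a lower grade band:

```python
def grade(score):
    """Hypothetical grade bands, for illustration only."""
    for cut, g in [(80, "I"), (70, "II"), (60, "III"), (50, "IV")]:
        if score >= cut:
            return g
    return "V"

# A student strong on the written Paper 2, weaker on multiple choice.
papers = {"Paper 1": 62, "Paper 2": 90, "Paper 3": 75}
weights_full = {"Paper 1": 0.30, "Paper 2": 0.50, "Paper 3": 0.20}
weights_no_p2 = {"Paper 1": 0.60, "Paper 2": 0.00, "Paper 3": 0.40}

full = sum(papers[p] * w for p, w in weights_full.items())
no_p2 = sum(papers[p] * w for p, w in weights_no_p2.items())
print(grade(full), grade(no_p2))
```

With these invented numbers, the same student's composite falls from 78.6 to 67.2 once Paper 2 is excluded, dropping the award from Grade II to Grade III; real cut scores and weights would differ, but the mechanism is the same.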

When the potential for Grade Compression by elimination of Paper 2 is compounded by a protocol change [extended moderation] that results in students earning lower grades in one of the two remaining papers used to determine grade awards, I cannot accept that this process was fair.

In conclusion, the Independent Review Report indicates that the resulting scores based on the changes in assessment structure and grading were neither reliable, valid, nor fair, and as such, major redress is required.

The most reasonable redress would be to regrade Paper 3 in a manner consistent with the historic grading to recapture what reliability of scores there might be.

This will not address all ills, but it is something that is doable. An additional option would be to offer any students who are willing the opportunity to take a Paper 2. If students can demonstrate a deep understanding of the tested construct, it would be unconscionable to allow an error in assessment to derail their academic progression.

In all this, students should be held harmless. Therefore, those students who are happy with their grades should get to keep those grades.

Michael A. Clarke PhD, EdM, MS, DipEd, ATCL, CAGO delivered this presentation on December 10 at a virtual press conference hosted by the Caribbean Coalition for CXC 2020 Redress.
