Computer Adaptive Testing (CAT) has been around for almost thirty years. It offers a lot of advantages, but, as with anything else, there are areas where it just doesn’t make much sense.
The foundation of CAT rests on Item Response Theory (IRT). With IRT, test developers generate stable item parameters for each test item, including difficulty level and degree of discriminability. IRT lets us estimate the likelihood that people at a given ability level will answer an item correctly. So, while the average pass rate on a particular item might be 50%, with IRT we know that people at the 70th percentile in ability have an 85% chance of getting it correct. Discriminability takes it one step further by looking at the slope of the item’s response curve. For instance, two items may be at exactly the same difficulty level, but one item’s slope may be very steep, so that virtually everyone at or above the 70th percentile in ability gets it correct, while another’s may be shallower, with the percentage correct rising more gradually. That’s the basis of IRT, and it can be applied to dichotomous items as well as polytomous personality scales.
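To make the difficulty-versus-discrimination distinction concrete, here is a minimal sketch of the two-parameter logistic (2PL) model that underlies this kind of IRT analysis. The item parameters below are invented for illustration; real parameters are estimated from response data.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability that a test-taker with ability theta
    answers the item correctly. b = difficulty, a = discrimination
    (the slope of the item's response curve)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items with identical difficulty (b = 0) but different
# discrimination. At average ability (theta = 0) both are a coin flip;
# away from average, the steep item separates test-takers much faster.
def steep(theta):
    return p_correct(theta, a=2.5, b=0.0)

def shallow(theta):
    return p_correct(theta, a=0.5, b=0.0)

for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  steep={steep(theta):.2f}  shallow={shallow(theta):.2f}")
```

At one standard deviation above average, the steep item is answered correctly about 92% of the time versus about 62% for the shallow one, which is exactly the “gradual versus abrupt” slope difference described above.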
The advantage CAT brings is that it can present different candidates with different items based on their general ability level and then home in on that level in less time than a traditional test. It doesn’t make much sense to give average first graders advanced calculus problems, because they will get them all wrong. In the same vein, you don’t learn much from giving graduate-level mathematicians simple algebra questions, because they will get them all right. For those reasons, CAT really shines when you’re measuring things like logical reasoning, mathematical skills, or content knowledge. In these situations, CAT not only reduces testing time but also reduces the potential for cheating or item piracy by limiting exposure to the items.
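The matching of items to ability is typically done by picking, at each step, the item that is most informative at the candidate’s current estimated ability. A common criterion is Fisher information, which for a 2PL item peaks where difficulty matches ability. The item pool below is a hypothetical sketch, not a description of any particular commercial CAT engine.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I = a^2 * P * (1 - P). It peaks when difficulty b is near theta,
    so very easy or very hard items contribute almost nothing."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item pool: (discrimination a, difficulty b),
# ranging from very easy (b = -2.0) to very hard (b = 2.5).
pool = [(1.2, -2.0), (1.0, -0.5), (1.5, 0.0), (1.1, 1.0), (0.9, 2.5)]

def next_item(theta_hat, pool, used):
    """Pick the unused item with the most information at the current
    ability estimate theta_hat."""
    candidates = (i for i in range(len(pool)) if i not in used)
    return max(candidates, key=lambda i: information(theta_hat, *pool[i]))

# A candidate currently estimated at average ability (theta_hat = 0)
# is served the item whose difficulty sits at 0, not the calculus-level
# or first-grade-level extremes.
print(next_item(0.0, pool, used=set()))  # -> 2, the (1.5, 0.0) item
```

After each response, the ability estimate is updated and the selection repeats, which is how CAT converges on a candidate’s level with fewer items than a fixed-form test.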
So where does CAT fall short? There are a couple of applications where it has significant limitations. The first is Situational Judgment Tests (SJTs), which have grown dramatically in popularity because they are flexible, face valid, offer a solid measure of ability, and provide an alternative measure of personality. The problem is that CATs can’t accurately measure SJT items. The issue is that IRT, and therefore CAT, assumes unidimensional items, and SJT items are typically multidimensional. To work properly, IRT needs to measure a unidimensional construct, even a relatively messy one like knowledge of American history. It doesn’t work at all if you’re measuring logical reasoning, conscientiousness, and customer service at the same time, as you often are in SJTs. Most, if not all, SJTs gather information on multiple competency areas, such as honesty and customer service, with a single robust item. That’s one of SJTs’ greatest strengths. Therefore, CAT, or any approach that mechanically decides whether an item is presented to a candidate based on an assumption of unidimensionality, is inherently flawed here. Worse, it will lead to skewed, if not downright incorrect, information about the candidate.
Another problem with CAT relates to cross-cultural applications. I’m a big fan of SJTs. I think they’re one of the best ways to measure a broad range of traits in a user-friendly manner. When you use them internationally, it’s not unusual for the content of a scenario to change from one culture to another. That’s not a big deal and is handled by a cultural review and sound translations. However, when you throw CAT into the mix, it becomes a mess. Imagine developing a CAT based on an SJT built here in the U.S., which as I mentioned earlier is problematic in and of itself, and then trying to apply it to a new culture and language. There is simply no reasonable way for the underlying item parameters to remain invariant across cultures.
As it currently stands, CAT is a solid tool when your goal is to shorten a long test of specific content knowledge or cognitive ability while increasing item relevance. But CAT becomes counterproductive when applied to testing methods outside that realm, such as personality measures, SJTs, or even biodata, where there is not only inherent multidimensionality but also significant cross-cultural issues in play.