Abstract: This research explored the effectiveness of a hierarchical two-parameter logistic (2-PL) item response theory (IRT) model for explaining differential item functioning (DIF) in terms of item-level features. Explaining DIF through variance attributable to construct-irrelevant item-level features would allow testing programs to improve item-writing and item-review processes to account for the features shown to predict DIF. Whereas previous research in this area has used classical test theory for scaling and logistic regression for DIF detection, this study modeled DIF directly within a hierarchical IRT framework. Latent trait models are more widely used in operational testing programs than classical test theory; additionally, simultaneous estimation allows uncertainty in parameter estimates to propagate into the estimate of the item-level features’ relationship with DIF and is more parsimonious than a two-stage model. This simulation study assessed the parameter recovery and stability of the proposed model across 36 conditions created by varying four factors: the strength of the correlation between the amount of DIF and the item-level features, the proportion of examinees in the reference group, and the mean and mixture probability of the mixture distribution used to sample items’ DIF. The model successfully recovered person and item parameters, differences in groups’ mean ability, and the relationship between the amount of DIF observed in an item and the presence of DIF-related item-level features. Model performance varied with the values of the four factors, especially the proportion of examinees in the reference group, which exhibited meaningful effect sizes in ANOVAs assessing the factors’ impact on mean squared error (MSE) and affected the model’s power to detect DIF. When the reference and focal groups contained equal numbers of examinees, power to detect DIF increased, but at the expense of higher false-positive rates and poorer precision.