DRASTICALLY different sample sizes with categorical data

I am trying to compare the rate of occurrence of a few variables between two datasets. Dataset A has about 3,000 observations while Dataset B has 180,000. Is it an issue to simply run chi-square tests with such large and differently sized datasets? I feel like any minute difference will come out statistically significant. I tried to use propensity scores to match observations at a 5:1 ratio, but all of the scores were either approximately 0 or 1. Any help would be appreciated.

asked Jul 14, 2021 at 20:46 by JonKetchup91

What's wrong with a tiny difference giving a small p-value?

Commented Jul 14, 2021 at 20:54

1 Answer


There is nothing wrong with size disparities for the chi-square test, especially when the cell sizes are large. This is a different issue from a large overall sample size yielding statistically significant results even for small effect sizes. There is no benefit to discarding data to make your sample smaller (e.g., using propensity score matching) unless you are also reducing confounding in doing so, which you did not mention was a problem in your analysis. A statistically significant result doesn't bind you to anything or force you to do anything, so why does observing one worry you? It's up to you and your readers how to interpret such a result. If the effect is tiny, then interpret it as a tiny effect. Statistical significance just means the effect is unlikely to have arisen by chance alone and instead reflects some structural difference between the datasets, however small that difference might be.
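To make this point concrete, here is a small sketch (the counts are made up for illustration, not the asker's data) showing that a chi-square test on a 2x2 table can give a small p-value even when the effect size, measured by Cramér's V, is tiny:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = dataset, columns = Female / Male.
table = np.array([
    [1500,  1500],    # "Dataset A": n = 3,000, 50% female
    [95400, 84600],   # "Dataset B": n = 180,000, 53% female
])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
cramers_v = np.sqrt(chi2 / n)  # effect size for a 2x2 table

# p comes out well below .05, yet Cramér's V is under 0.01 --
# a "significant" but practically negligible difference.
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, Cramér's V = {cramers_v:.4f}")
```

The test flags the 3-point difference as significant because the total n is huge, while the effect-size measure makes clear how small the association actually is.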

answered Jul 15, 2021 at 2:29

Really appreciate your input. The issue I am having is that although there shouldn't be confounders in the datasets, variables like Sex come out statistically significant even though the difference between the groups is less than 0.5%. Thus, my logic with propensity scores was to a) minimize confounding effects and b) level the playing field (so to speak) between the sample sizes. That being said, propensity score matching didn't work because 99.9%+ of the scores were almost zero and SAS output the rest as exactly 1. So I am kind of at a loss on how to proceed.

Commented Jul 15, 2021 at 17:07

Are you trying to formally test whether there are covariate differences between the samples, or just trying to diagnose potential imbalance? If the former, you should be using corrections for multiple testing, multivariate analyses, etc., and ideally those would reveal nonsignificant results if there is truly no difference. If you are just diagnosing imbalance, you shouldn't use hypothesis tests for exactly this reason. The cutoff of p

Commented Jul 15, 2021 at 18:31
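One common sample-size-free balance diagnostic is the standardized mean difference (SMD), which scales the difference in means by the pooled standard deviation rather than by a standard error. A sketch (the proportions are hypothetical, chosen to mimic a sub-0.5% difference in Sex):

```python
import numpy as np

def smd(x_a, x_b):
    """Standardized mean difference for a binary or continuous covariate."""
    pooled_sd = np.sqrt((np.var(x_a, ddof=1) + np.var(x_b, ddof=1)) / 2)
    return (np.mean(x_a) - np.mean(x_b)) / pooled_sd

rng = np.random.default_rng(0)
sex_a = rng.binomial(1, 0.500, size=3000)    # simulated "Dataset A"
sex_b = rng.binomial(1, 0.504, size=180000)  # simulated "Dataset B"

d = smd(sex_a, sex_b)
# |SMD| < 0.1 is a common rule of thumb for acceptable balance;
# unlike a p-value, this does not shrink toward "significance" as n grows.
print(f"SMD = {d:.4f}")
```

A covariate can easily have a significant chi-square p-value and an SMD well inside the conventional 0.1 threshold at the same time, which is exactly the situation described in the question.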

Also, if your propensity scores are all 0 or 1, it seems like you have a severe lack of overlap, meaning the samples are fundamentally incomparable, even though you claim the differences between them are small. Try diagnosing the sample by examining the propensity score model itself; you will likely find very high standardized regression coefficients, indicating severe differences between the groups. Things aren't adding up based on your description.
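The lack-of-overlap diagnosis can be illustrated with a small simulation (everything here is made up: a single hypothetical covariate with almost no overlap between groups, and a tiny ridge penalty added only so the logistic fit stays finite under near-complete separation — this is an illustrative device, not part of any standard propensity-score workflow):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# One hypothetical covariate whose distributions barely overlap.
x = np.concatenate([rng.normal(0, 1, 300),   # simulated "Dataset A"
                    rng.normal(8, 1, 300)])  # simulated "Dataset B"
y = np.concatenate([np.zeros(300), np.ones(300)])
X = np.column_stack([np.ones_like(x), x])    # intercept + covariate

def nll(beta, ridge=1e-4):
    """Negative log-likelihood of logistic regression, plus a tiny ridge
    term so the optimum is finite even with (near-)complete separation."""
    z = X @ beta
    # log(1 + exp(z)) computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, z) - y * z) + ridge * beta @ beta / 2

beta = minimize(nll, np.zeros(2)).x
ps = 1 / (1 + np.exp(-X @ beta))             # fitted propensity scores

extreme = np.mean((ps < 0.01) | (ps > 0.99))
print(f"slope = {beta[1]:.2f}, "
      f"share of scores outside [0.01, 0.99] = {extreme:.1%}")
```

The huge slope coefficient and the pile-up of fitted scores at 0 and 1 are exactly the symptoms the comment describes: the model can separate the groups almost perfectly, so matching has essentially no common support to work with.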