I have six questions, each answered on a five-point Likert scale. We are interested in whether participants, overall, "agree" or "disagree" with the questions (or are neutral toward them).
To do this, we currently calculate the average of the six items, which gives a score from 1 to 5, and then collapse that score into three bins: 4 and 5 are "Agree", 3 is "Neutral", and 1 and 2 are "Disagree".
We then compare the observed bin counts to expected counts using a $\chi^2$ goodness-of-fit test. We assume responses are uniformly random over the five options, so we expect 40% agree, 20% neutral, and 40% disagree, since on a five-point scale two options are agree, one is neutral, and two are disagree.
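For concreteness, here is a minimal sketch of our current pipeline in Python (the data are simulated, and the cut points for non-integer averages are my own choice here, since we never pinned them down):

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
# Hypothetical data: 100 participants x 6 items, each rated 1-5.
responses = rng.integers(1, 6, size=(100, 6))

# Step 1: average the six items per participant (score in [1, 5]).
means = responses.mean(axis=1)

# Step 2: collapse each average into three bins. Non-integer averages
# force a choice of cut points; here >= 3.5 is Agree, <= 2.5 is Disagree.
agree = np.sum(means >= 3.5)
disagree = np.sum(means <= 2.5)
neutral = len(means) - agree - disagree

# Step 3: chi-square goodness of fit against the assumed 40% / 20% / 40%.
observed = np.array([agree, neutral, disagree])
expected = np.array([0.4, 0.2, 0.4]) * len(means)
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(observed, round(stat, 2), round(p, 3))
```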
However, I doubt that we can average on a five-point scale, then collapse to a three-point scale, and still claim the same expected proportions. I'm wondering whether the expected-values claim would hold if we instead collapsed each item into three categories before averaging. My colleagues don't see the problem. Is there a mathematical argument that shows whether these approaches (collapse before averaging, or collapse after averaging) are equivalent?
I have doubts that your approach is optimal. I think a more appropriate approach to your question is a two-tailed one-sample t-test with the null-hypothesis population mean set to 3. That will let you detect a statistically significant deviation from the middle of your Likert scale, and you can then look at the actual mean difference and decide whether it is meaningful to you. Moreover, framing the question in this way assumes that you know the true value of the anchors and that a 3 really is the center of psychological space on the variable you are measuring. That is sort of a big assumption (this is one of many reasons why control groups and experimental manipulations are nice).
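A minimal sketch of that test in Python, using simulated per-participant mean scores (the data and sample size are placeholders, not your data):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
# Hypothetical per-participant mean scores on the 1-5 scale
# (100 participants x 6 items).
scores = rng.integers(1, 6, size=(100, 6)).mean(axis=1)

# Two-tailed one-sample t-test against the scale midpoint of 3.
stat, p = ttest_1samp(scores, popmean=3)
print(f"mean = {scores.mean():.2f}, t = {stat:.2f}, p = {p:.3f}")
```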
Regarding your question itself: binning then averaging will in general give different results than averaging then binning. I started to construct a mathematical example of this, or at the very least a simulation... then I noticed that a lot depends on precisely how you set up your bin boundaries, both when binning the average of the raw items and when binning the average of already-binned items. Another relevant warning is that your bins (as stated) are not of equal size: under a uniform null, the average of six items concentrates near the midpoint, so a central "Neutral" bin will capture far more than 20% of the probability. The amount of the inequality depends on your bin cut points and on where in the pipeline you do your binning. So do be sure your $\chi^2$ goodness-of-fit null hypothesis matches your binning; given that you set the expected proportions up to appropriately match your binning, you will be able to make some sort of 'expected values' claim.
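Here is a quick simulation sketch demonstrating both points (the cut points are assumptions on my part, since you did not specify how non-integer averages are binned):

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulate the uniform null: many participants, 6 items, options 1-5.
responses = rng.integers(1, 6, size=(100_000, 6))

def bin3(x, lo=2.5, hi=3.5):
    """Collapse a score to -1 (Disagree), 0 (Neutral), or +1 (Agree)."""
    return np.where(x >= hi, 1, np.where(x <= lo, -1, 0))

# Pipeline A: average the raw 1-5 items, then bin the average.
avg_then_bin = bin3(responses.mean(axis=1))

# Pipeline B: bin each item first (1-2 -> -1, 3 -> 0, 4-5 -> +1),
# then average the binned items and bin that average around 0.
item_bins = np.where(responses >= 4, 1, np.where(responses <= 2, -1, 0))
bin_then_avg = bin3(item_bins.mean(axis=1), lo=-0.5, hi=0.5)

# The two pipelines classify some participants differently...
print("disagreement rate:", np.mean(avg_then_bin != bin_then_avg))

# ...and under the uniform null neither is close to 40% / 20% / 40%.
for name, b in [("avg then bin", avg_then_bin), ("bin then avg", bin_then_avg)]:
    props = [np.mean(b == k) for k in (-1, 0, 1)]
    print(name, "D/N/A:", np.round(props, 3))
```

For instance, a participant answering (2, 2, 2, 5, 5, 5) averages 3.5 ("Agree" under pipeline A) but bins to (-1, -1, -1, +1, +1, +1), which averages to 0 ("Neutral" under pipeline B).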
As a matter of pure theory, binning loses information, so you are (informationally) better off doing the averaging and then the binning, rather than vice versa. However, that assumes an answer of 1 really does mean something different to you than an answer of 2. If, per your hypothesis, you genuinely treat them as the same, then binning before averaging makes some kind of sense.
Thank you; your suggestion regarding the t-test was enlightening and appropriate.