A new posting on the Library Impact Data Project blog discusses an interesting dilemma that is all too common when analyzing raw data: how detailed can and should you get when dissecting that data? In this case, the problem concerns user groups and the potential for revealing identities. The author is concerned that presenting data on all possible user groups might inadvertently reveal individuals' identities because of the very small numbers in certain groups. In this day and age of IRBs and academic privacy concerns, that would be "a BIG data protection no-no." The issue then becomes how to aggregate the groups in a logical and useful manner. Simply combining two groups (in the author's example, say, Black and Chinese ethnicities) would not necessarily be either logical or useful. Indeed, the author argues that there should be some commonality among the groups combined. In the public health setting where I came from, the ethnic groups were based largely on outcomes - the groups that had the worst outcomes would be compared with the groups that had the best outcomes. Another issue to consider, which the author notes later in the posting, is the impact of the results - what could be done to change the outcomes? While librarians cannot change a person's ethnicity, they could design programs that serve these groups more effectively. Qualitative research could be conducted to determine the reasons for differences in the measured factors that led to differences in outcomes.
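To make the small-numbers problem concrete, here is a minimal sketch of how an analyst might flag groups that are too small to report and then check whether a proposed merge fixes the problem. Everything in it is an assumption for illustration: the threshold of 5, the group names, the counts, and the function names are invented, not taken from the project's data or methods.

```python
from collections import Counter

# Hypothetical minimum reportable group size; the posting does not state one.
MIN_CELL_SIZE = 5


def find_small_groups(counts, min_size=MIN_CELL_SIZE):
    """Return the names of groups whose count falls below min_size, sorted."""
    return sorted(name for name, n in counts.items() if n < min_size)


def aggregate(counts, mapping):
    """Merge groups using an analyst-supplied mapping of group -> broader category.

    Groups missing from the mapping are kept unchanged.
    """
    merged = Counter()
    for name, n in counts.items():
        merged[mapping.get(name, name)] += n
    return dict(merged)


# Made-up counts, for illustration only.
counts = {"Group A": 120, "Group B": 3, "Group C": 2, "Group D": 40}

# The mapping is the analyst's judgment call: the code cannot tell whether
# combining B and C is logical or useful, only whether it is large enough.
mapping = {"Group B": "Groups B+C", "Group C": "Groups B+C"}

small = find_small_groups(counts)
merged = aggregate(counts, mapping)
still_small = find_small_groups(merged)

print("Too small to report:", small)
print("After merging:", merged)
print("Still too small:", still_small)
```

The mapping is deliberately supplied by hand. Deciding which groups share enough commonality to be combined is the hard part the author describes, and it is not something a size check can answer.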
In this case, the factors were the number of EZproxy logins and the number of downloads, and the outcomes were graduation levels (based on grades). I am not sure whether the project team looked, but I could not find any publication that examined factors such as race and home country against these outcomes. That would be important to consider for aggregation, in addition to similarities among the groups. In this case, the White students used the electronic resources the least of all ethnic groups. If this group of students also had higher rates of the top graduation levels, then perhaps ethnicity is a confounder, distorting the apparent relationship between e-resource usage and grades. Those from other countries in the European Union were using the e-resources the most - did they have a harder time using the resources? Was there a language barrier? Were there cultural differences in how the libraries' resources are organized?
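A small worked example may help show why a possible confounder matters. The sketch below compares the rate of high grades by usage level, first pooled and then within each stratum of a third variable. The numbers are entirely made up to illustrate the effect and say nothing about the project's actual results; the stratum labels "X" and "Y" and the function name are hypothetical.

```python
# Each record: (stratum, usage_level, n_students, n_with_high_grades).
# All counts are invented for illustration.
records = [
    ("X", "low", 100, 80),
    ("X", "high", 20, 18),
    ("Y", "low", 20, 4),
    ("Y", "high", 100, 30),
]


def rate_by_usage(rows):
    """Return the proportion of students with high grades for each usage level."""
    totals = {}
    for _, usage, n_students, n_high in rows:
        running = totals.setdefault(usage, [0, 0])
        running[0] += n_students
        running[1] += n_high
    return {usage: high / total for usage, (total, high) in totals.items()}


# Ignoring the stratum: all students lumped together.
pooled = rate_by_usage(records)

# Holding the stratum fixed: one comparison per stratum.
strata = sorted({row[0] for row in records})
stratified = {s: rate_by_usage([r for r in records if r[0] == s]) for s in strata}

print("Pooled:", pooled)
for s in strata:
    print("Stratum", s, stratified[s])
```

With these invented counts, heavy users appear to do worse when everyone is pooled, yet they do better than light users inside each stratum. The pooled comparison is misleading because the strata differ in both usage and baseline outcomes, which is exactly the pattern to look for before aggregating groups.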
The author does a good job of describing the issues associated with analyzing data - it's not simply a matter of comparing the averages of two groups. You need to consider aspects of the data set (number in each group, number in the entire set, the distributions of the values, the kind of measurement used), the relationships among the variables and groups (are the groups independent, are the variables distinct, are there similarities among the groups), as well as the context (the environment, the language, the culture, the population, etc.). In other words, you need to know your data well.
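As a rough illustration of why averages alone are not enough, the sketch below summarizes two groups by size, mean, median, and spread before any comparison is made. The download counts and group names are invented, and `describe_groups` is a hypothetical helper, not something from the project.

```python
import statistics


def describe_groups(values_by_group):
    """Summarize each group's size, center, and spread."""
    summary = {}
    for group, values in values_by_group.items():
        summary[group] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # The sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else None,
        }
    return summary


# Made-up download counts per student, for illustration only.
downloads = {
    "Group A": [0, 1, 1, 2, 46],
    "Group B": [8, 9, 10, 11, 12],
}

summary = describe_groups(downloads)
for group, stats in summary.items():
    print(group, stats)
```

Both invented groups have the same mean, but one is driven by a single heavy user while the other is evenly spread. A comparison of averages would call them identical, and only the medians and spreads reveal the difference.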