Main Point: Choices made in data cleaning have ethical implications.
Contents: instructor guide, student handout, suggested assessment questions
Summary: The goal of this lab is to show that data cleaning, while necessary, can be conducted with differing degrees of ethicality. The process of correcting data (changing location, imputing a new value, removing typos etc.), removing incomplete records, or leaving questionable data as is can affect the results of the analysis. Consequently, methods must be scrutinized. Students will evaluate the ethical ramifications of data cleaning choices on a fictional data set description.
Ethics background required: It is helpful if students have learned about the virtue ethics and utilitarian frameworks for making ethical decisions, though not absolutely required. These frameworks are covered in the first year curriculum.
Subject matter referred to in this lab: data cleaning, simple statistics
Placement in overall ethics curriculum:
Academic year: Years two through four – a class in data management or data analysis
Recommended previous lab: First year ethics curriculum (specifically virtue ethics and utilitarian lab)
Recommended follow-up lab: No direct follow-up lab
- In class: 20 minutes
- Out of class: none
Introduce the idea of and need for data cleaning if it has not already been covered in class
Practice recognizing ethical issues in data cleaning
Assess different methods of data cleaning for ethicality
Practice articulating and supporting the degree to which an approach to cleaning data is ethical in a particular situation
Ethical issues to be considered: Transparency, Diversity, Dignity, Privacy, Data Integrity, Reliability
Read through the entire lab. Print a student handout for each student. The handout includes a dataset description, questions for the students to consider, and definitions of potential ethical issues.
Professor introduces the concept of data cleaning
Professor reminds students of potential ethical issues to be considered along with introducing a description of the dataset to be worked on.
Students work individually and then in groups to decide if each of several suggestions for cleaning the described dataset would result in ethical problems
Depending on your class, you may wish to read/summarise this introductory information.
Bad/incorrect data leads to bad/incorrect decisions. Data can be “bad” (also called dirty) for a variety of reasons. Here are few:
- Information can be missing from a record because someone forgot to record information, or did not want to make an entry, or because the information isn’t available.
- The person entering the data could have introduced a typo.
- A tool used for measuring data may not have been calibrated correctly.
- The data may have been entered with non-uniform units.
One way to detect dirty data is to look for outliers – values that stick out because they do not fit the pattern of the rest of the data. (But note that not all outliers are incorrect and not all incorrect values will be outliers.)
GPAs of 5 (on a 4-point scale), human ages of 150 years, or other values that are beyond the feasible bounds for a measurement are clear indicators of a problem. Other outliers may be large (or small) but not impossible. And some incorrect values may not “stand out in the crowd” at all.
Suppose student GPA is recorded as 5.0 This would be unusual (impossible, really) at an institution where students are graded on a 4 point scale. But what should the data analyst working on checking the validity of the data do with such a value?
(Van den Broeck et al. 2005) defines data cleaning as
[the] process of detecting, diagnosing, and editing faulty data.
It is often said that data analysts spend 80% of their time cleaning and 20% analyzing (Browne-Anderson 2018). If the time is not spent to clean the data properly, the results could be incorrect, and the process fraught with ethical issues.
In our GPA scenario, the first step of cleaning would be to check to see if the university posting the GPA did, in fact, adhere to a 4 point scale. If so, the data point is erroneous and the issue needs to be addressed. But how?
The first course of action will typically be to see if the value can be reliably corrected. Is there another source for the missing or incorrect values? Might the error have been introduced somewhere in our data pipeline? Is there an earlier version of the data where the value appears to be correct?
The importance of fixing a missing/incorrect data problem depends both on the data and the purpose to which they are put. If the dataset was used to get the average GPA for a certain major, for example, and there are many students with that major in the data set, and only very few with missing/incorrect GPAs, then the estimate would change very little, even if we had the correct data, so we might choose to compute our estimate using only those records with complete data.
It is tempting to replace an obviously incorrect value with some estimate for that value – perhaps the mean or median computed from (a subset) of the data. But this is frought with potential problems – we don’t really know that the missing values should be “average.” A better way to do this is called imputation and (a) uses other values in the data to inform the imputed value and (b) takes into account the added uncertainty of this imputation when drawing conclusions. This latter step is important – we cannot treat unknown/estimated/imputed values the same way we treat known values. Imputation methods are typically not introduced in introductory statistics courses, and you may need to consult someone with more expertise to do this well. This may require the researcher to exhibit a certain amount of humility.
Dealing with missing data is especially challenging if the individuals with missing data are systematically different in some way from those with complete data.
Whatever approach is taken, it is important that a clear record is kept of the decisions made, in such a way that the entire analysis (including the data cleaning) could be reproduced by another person. Furthermore, these decisions and their potential impact on the conclusions drawn need to be clearly communicated to those relying on the analysis. Repeating the analysis usign a different set of decisions is a form of sensitivity analysis. If the results depend greatly on the data cleaning and analysis decisions, they it is crucial to know how and why those decisions were made. If several competing analyses yeild very similar results, then the stakes are lower.
Pass out the student handout that describes a dataset, the type of analysis for which it will be used, and some potential data cleaning transformations. The definitions of various ethical issues is also included.
Have the students consider the questions by themselves first and then with the class or in small groups.
Regarding confendiality and personally identifiable data: In a large university with many students of all types in each program, it may not be possible to identify individuals with these data. However, for a small university, a small program, or for students with unusual data values, it may be much easier. If the major was computer science, for example, and the gender was female, the student could be easily recognized. Similarly, if there were few veterans at the school, or few non-traditional (older) students, they could be easily recognized.
Since this survey also asks about things like mental illness and the possibility of pregnancy, dignity could be an ethical issue at stake here.
The data must be kept secure and only those with legitimate purposes should have access. Any summaries or analyses of the data should be sure to protect the individuals involved. Additionally, researchers may be in a position to advocate on behalf of marginalized groups based on data like these.
This could be a place to start a conversation about things like IRB approval, informed consent, etc.
Regarding missingness for age
Fill with the mean of the student population
Bad option as this is skewed data, so the mean age is likely an over-estimate for most students.
Fill with the median of the student population
Only slightly better. A median is often a better summary of a skewed distribution, but that doesn’t make the median a good estimate for an individual with missing data. Furthermore, treating “filled in” data as if it were correct can lead to incorrect conclusions.
Note: this is somewhat like some imputation methods, but those methods also take into account the greater uncertainty associated with imputed values.
Remove records where age is missing and make a note of this in the data cleaning documentation
If your sample size is sufficiently large and the amount of missing data small, you might go with this option.
But if individuals for whom age is missing are systematically different from those with age recorded (perhaps older students are reluctant to reveal their age), then our analysis won’t accurately reflect this difference (since we aren’t using any data for one of the groups). This could create bias and be a diversity issue.
Regarding missingness for parental college experience:
More likely to happen for parents with no college experience or students who are adopted. Removing these records could affect diversity.
Regarding missingness for physical disability:
Some female students might not want to indicate that they are pregnant or suffer from mental illness. Again, a diversity issue.
Regarding reliability and hospitality:
If there is no standardized way to fix errors in the study, it could not be replicated with other data (reliability). Even if the data cleaner documented every change, this would be hard to apply to a new case and would make a very cumbersome document for someone trying to find errors after cleaning or to replicate the study (hospitality).
Regarding the GPA outlier, this is simply a data integrity issue. If a rough semester caused a student to leave, the GPA could indeed have been 1.35. You cannot honestly make another assumption.
Ask the class to reflect on this additional example based on what they learned from the exercise:
A survey-taker has indicated a major that does not exist at the university. What possible choices are there to handle this, and what ethical issues would have to be considered with the choice?
Ask students to share anything that surprised them from this exercise, or changed how they might make a data cleaning decision.
Possible ideas for professor if students are struggling with reflection
If the analyst replaced the major with one that he/she thought was closest (perhaps replacing exercise science (doesn’t exist) with kinesiology (does exist)) but doesn’t record this change, this could be a violation of transparency or reliability. While this approach is reasonable, the cleaner should record the cleaning method so the results could be reproduced. It might also be wise to produce the final analysis both with the change made, and with the offending record removed and see if the results differed significantly.
Ask students to define data cleaning.
Ask students to provide an example of a data cleaning need along with an ethical and unethical way to address it. Ask them to provide the ethical issue raised by the poor choice.
Applying some the ethics frameworks learned previously to data cleaning:
Refer back to the original GPA example in the introduction (an outlier of 5 for the GPA) and have the students suggest 3 ways to deal with the outlier (assuming it really is an outlier). Have the students evaluate the cleaning solutions using both the utilitarian approach and the virtue ethics approach. Recall that in order to evaluate using the utilitarian approach they would ask questions like the following:
Who will be affected by this decision?
Who benefits from this decision?
Who will be harmed by this decision?
Do the benefits outweigh the harm?
When using the virtue ethics approach students should ask questions like:
- Does this solution violate my moral values (virtues) or those of the society to which I belong? (students should recall that virtues include such principles as adaptability, compassion, dignity, excellence, honesty, inclusivity, justice, patience, transparency)
Browne-Anderson, Hugo. 2018. “What Data Scientists Really Do, According to 35 Data Scientists.” Harvard Business Review. https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists.
Van den Broeck, Jan, Solveig Argeseanu Cunningham, Roger Eeckels, and Kobus Herbst. 2005. “Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities.” PLoS Med 2 (10): e267. https://doi.org/10.1371/journal.pmed.0020267.