Perhaps the most critical piece comes next: validation and cleaning. This is where domain expertise matters most, since only an expert can judge whether the data at hand can be transformed into something that will actually support the desired analyses. For example, I was once asked to help oversee a project compiling unemployment data by country going back several hundred years. The problem is that every country defines “unemployed” differently. Some lump all unemployed persons together, while others separate those looking for work from those who are not, or include or exclude the disabled, those who work at home, social welfare recipients, college students and so on. These definitions also change over time, so that for a given year “unemployed” might refer only to unemployed bricklayers in one country, exclude state-supported welfare recipients in a second, and include everyone, even full-time college students, in a third, and then shift again the following year in some countries but not others. The result was strange seesawing and stair-stepping effects in the data when comparing countries over time, which required extensive research and patching to repair.
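One way such definitional breaks can be surfaced for review is a simple screening pass over each country's series: flag any year whose value jumps sharply relative to the prior year, then hand the flagged years to a domain expert. The sketch below is a minimal, hypothetical illustration of that idea; the function name, the toy data, and the 50%-relative-change threshold are all assumptions for demonstration, not part of the original project.

```python
def flag_definition_breaks(series, threshold=0.5):
    """Return the years in which the rate changes by more than
    `threshold` (relative change) versus the prior year -- candidates
    for a change in how 'unemployed' was defined, not a real shift.

    `series` maps year -> unemployment rate. The threshold is an
    arbitrary assumption chosen for illustration."""
    flagged = []
    years = sorted(series)
    for prev, curr in zip(years, years[1:]):
        prior, now = series[prev], series[curr]
        # A large relative jump is suspicious; real unemployment
        # rarely doubles in a single year without a known crisis.
        if prior > 0 and abs(now - prior) / prior > threshold:
            flagged.append(curr)
    return flagged

# Toy data: a plausible series with a definitional break in 1952,
# when (say) full-time students were suddenly counted as unemployed.
rates = {1950: 4.1, 1951: 4.3, 1952: 7.9, 1953: 8.0, 1954: 7.8}
print(flag_definition_breaks(rates))  # -> [1952]
```

A heuristic like this cannot distinguish a definitional change from a genuine economic shock, which is exactly why the flagged years still need the kind of per-country research described above.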