Normalizing data
Data that comes from the government is often generated across multiple departments by hand. This can result in inconsistencies in what kinds of values or formats are used to describe the same meaning. Normalizing values to be consistent across a dataset is therefore a common activity.
Step 1: Find all distinct values
First, you want to start by finding all of the distinct ranges of values for the different columns in your dataset. You can accomplish this by using a database query language (such as SQL’s DISTINCT), or by simply using the ‘filter’ property on a spreadsheet program.
For example, if you have a spreadsheet with contracting data, and one column is ‘Competed?’, you would expect the values to be ‘yes’ or ‘no’. But if this spreadsheet is an amalgam of spreadsheet data from multiple users and departments, your values could vary among the following: ‘Y’, ‘YES’, ‘yes’, 1, ‘True’, ‘T’, ‘t’, ‘N’, ‘NO’, ‘no’, 0, ‘False’, ‘F’, ‘f’, etc. Limiting all of these potential values to two clear options will make it easier to analyse the data, and also easier for those who follow in your footsteps.
Step 2: Sanity Check
Especially with financial data, numbers can be formatted several different ways. For example, are your negative values represented with a ‘-‘ or placed inside ‘( )’ or possibly even highlighted in red? Not all of these values will be easily read by a computer program (especially the color), so you’ll want to pick something clear and consistent to convert all your negative values to (probably the negative sign).
Is all your numerical data measured out in ones or is abbreviated in thousands? Especially with budget data, order of magnitude errors are not uncommon when one department thinks they’re reporting in thousands or millions by default but others expand their data all the way to the ones place. Are some values in scientific notation (e.g. 10e3938)? Make sure all your values are consistent, otherwise your analysis could contain serious errors.
A column of data requiring name normalization:

