Quartz: #bad_data guide
►https://github.com/Quartz/bad-data-guide
An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.
Examples:
– Lots and lots of garbage study results make it into major publications because journalists don’t understand p-values.
– Benford’s Law is a theory which states that small digits (1, 2, 3) appear at the beginning of numbers much more frequently than large digits (7, 8, 9). (...) Benford’s Law is an excellent first test
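A quick Benford first-digit test can be sketched in a few lines of Python. This is a minimal illustration (function names are my own, not from the guide): compare the observed frequency of each leading digit against Benford's expected log10(1 + 1/d).

```python
import math
from collections import Counter

def benford_expected(d):
    # Benford's Law: P(leading digit = d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)

def leading_digit_freq(values):
    # Observed frequency of the first significant digit of each nonzero value.
    # Naive string approach; assumes plain decimal notation, no scientific form.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    total = len(digits)
    return {d: digits.count(d) / total for d in range(1, 10)}
```

A large gap between `leading_digit_freq(data)` and `benford_expected(d)` across digits is a reason to look closer, not proof of fraud.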
Basic checklist:
Issues that your source should solve
Values are missing
Zeros replace missing values
Data are missing you know should be there
Rows or values are duplicated
Spelling is inconsistent
Name order is inconsistent
Date formats are inconsistent
Units are not specified
Categories are badly chosen
Field names are ambiguous
Provenance is not documented
Suspicious numbers are present
Data are too coarse
Totals differ from published aggregates
Spreadsheet has 65536 rows
Spreadsheet has dates in 1900 or 1904
Text has been converted to numbers
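Several of the source-level issues above (missing values, zeros standing in for missing values, duplicated rows) can be screened mechanically. A minimal sketch, assuming rows are dicts of strings and that the field names and missing-value markers are adapted to your own dataset:

```python
def basic_checks(rows, key_fields):
    # Flag common source-data problems: missing values, suspicious zeros,
    # and rows duplicated on the given key fields.
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        for field, value in row.items():
            if value in ("", None, "NA", "N/A"):
                issues.append((i, field, "missing value"))
            elif value == "0":
                issues.append((i, field, "zero: check whether it means missing"))
        key = tuple(row.get(f) for f in key_fields)
        if key in seen:
            issues.append((i, tuple(key_fields), "duplicate row"))
        seen.add(key)
    return issues
```

This catches only the mechanical cases; inconsistent spelling, name order, or units still need human review.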
Issues that you should solve
Text is garbled
Data are in a PDF
Data are too granular
Data were entered by humans
Aggregations were computed on missing values
Sample is not random
Margin-of-error is too large
Margin-of-error is unknown
Sample is biased
Data have been manually edited
Inflation skews the data
Natural/seasonal variation skews the data
Timeframe has been manipulated
Frame of reference has been manipulated
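For the margin-of-error items, the standard check for a sample proportion is short enough to sketch here (my own formulation, not code from the guide): the 95% half-width is roughly z * sqrt(p(1-p)/n).

```python
import math

def margin_of_error(p, n, z=1.96):
    # 95% confidence half-width for a sample proportion p
    # from a simple random sample of size n.
    return z * math.sqrt(p * (1 - p) / n)
```

For example, a 50/50 split in a poll of 1,000 respondents carries a margin of error of about ±3 percentage points; note this assumes a random sample, so it does not rescue a biased one.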
Issues a third-party expert should help you solve
Author is untrustworthy
Collection process is opaque
Data assert unrealistic precision
There are inexplicable outliers
An index masks underlying variation
Results have been p-hacked
Benford’s Law fails
It’s too good to be true
Issues a programmer should help you solve
Data are aggregated to the wrong categories or geographies
Data are in scanned documents
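Re-aggregating data to the right categories or geographies is often just a crosswalk join. A minimal sketch, assuming you have a mapping from the source geography to the target one (the ZIP-to-county names here are hypothetical):

```python
def reaggregate(values_by_zip, zip_to_county):
    # Roll unit-level values up to a coarser geography via a crosswalk.
    # Units missing from the crosswalk are silently dropped; in practice
    # you should count and report them.
    totals = {}
    for zip_code, value in values_by_zip.items():
        county = zip_to_county.get(zip_code)
        if county is not None:
            totals[county] = totals.get(county, 0) + value
    return totals
```

The hard part is rarely the code: it is finding a crosswalk whose vintage matches the data, since boundaries change over time.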