• Quartz: #bad_data guide
    https://github.com/Quartz/bad-data-guide

    An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.

    Examples:

    – Lots and lots of garbage study results make it into major publications because journalists don’t understand p-values.
    – Benford’s Law is a theory which states that small digits (1, 2, 3) appear at the beginning of numbers much more frequently than large digits (7, 8, 9). (...) Benford’s Law is an excellent first test
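
    The Benford's Law test quoted above can be sketched in a few lines of Python. This is a minimal illustration, not code from the Quartz guide: the function names are hypothetical, and the expected frequencies come from the standard formula P(d) = log10(1 + 1/d).

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """Return the first significant digit of a number, or None for zero."""
    # Scientific notation puts the first significant digit at position 0.
    digit = int(f"{abs(value):.15e}"[0])
    return digit if digit != 0 else None


def first_digit_frequencies(values):
    """Observed share of each leading digit 1-9 (zeros are ignored)."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def max_deviation(values):
    """Largest absolute gap between observed and expected shares."""
    expected = benford_expected()
    observed = first_digit_frequencies(values)
    return max(abs(observed[d] - expected[d]) for d in range(1, 10))


# Powers of two are a classic sequence that follows Benford's Law closely.
powers_of_two = [2 ** k for k in range(1, 200)]
print(round(max_deviation(powers_of_two), 3))
```

    A large deviation is a reason to ask questions, not proof of fraud: the law mainly applies to data spanning several orders of magnitude.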

    Basic checklist:

    Issues that your source should solve

    Values are missing
    Zeros replace missing values
    Data are missing you know should be there
    Rows or values are duplicated
    Spelling is inconsistent
    Name order is inconsistent
    Date formats are inconsistent
    Units are not specified
    Categories are badly chosen
    Field names are ambiguous
    Provenance is not documented
    Suspicious numbers are present
    Data are too coarse
    Totals differ from published aggregates
    Spreadsheet has 65536 rows
    Spreadsheet has dates in 1900 or 1904
    Text has been converted to numbers
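
    Two of the spreadsheet items above lend themselves to an automated check. The sketch below is an illustration under stated assumptions, not code from the guide: `spreadsheet_warnings` is a hypothetical helper, `row_count` is assumed to include the header row, and the newer Excel limit of 1,048,576 rows is added alongside the 65,536 mentioned in the list.

```python
from datetime import date

# Row limits of old (.xls) and newer (.xlsx) Excel worksheets.
EXCEL_ROW_LIMITS = {65_536, 1_048_576}
# Years in which Excel's two default date systems start.
EXCEL_EPOCH_YEARS = {1900, 1904}


def spreadsheet_warnings(row_count, dates):
    """Return human-readable warnings for a sheet's row count and dates.

    row_count: total number of rows in the sheet, header included (assumption).
    dates: iterable of datetime.date values taken from a date column.
    """
    warnings = []
    if row_count in EXCEL_ROW_LIMITS:
        warnings.append(
            f"Row count is exactly {row_count}: the export may be truncated."
        )
    suspicious = [d for d in dates if d.year in EXCEL_EPOCH_YEARS]
    if suspicious:
        warnings.append(
            f"{len(suspicious)} date(s) fall in 1900 or 1904: "
            "they may be blank or mis-converted cells."
        )
    return warnings


for message in spreadsheet_warnings(65_536, [date(1904, 1, 1), date(2015, 6, 1)]):
    print(message)
```

    A warning here only flags something to raise with the source; a dataset can legitimately contain dates from 1900.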

    Issues that you should solve

    Text is garbled
    Data are in a PDF
    Data are too granular
    Data was entered by humans
    Aggregations were computed on missing values
    Sample is not random
    Margin-of-error is too large
    Margin-of-error is unknown
    Sample is biased
    Data has been manually edited
    Inflation skews the data
    Natural/seasonal variation skews the data
    Timeframe has been manipulated
    Frame of reference has been manipulated
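
    To see why "Zeros replace missing values" and "Aggregations were computed on missing values" matter, compare two ways of averaging the same column. This is a minimal sketch with made-up numbers; both function names are hypothetical, and missing values are assumed to be represented as `None`.

```python
def mean_skipping_missing(values):
    """Average of the values that are actually present (None = missing)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def mean_zero_filled(values):
    """Average after silently treating every missing value as zero."""
    return sum(0 if v is None else v for v in values) / len(values)


# Made-up example column: two real readings, two missing ones.
readings = [10, None, 20, None]

print(mean_skipping_missing(readings))  # 15.0
print(mean_zero_filled(readings))       # 7.5
```

    Neither number is automatically right; the point is that the choice changes the result and should be documented.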

    Issues a third-party expert should help you solve

    Author is untrustworthy
    Collection process is opaque
    Data asserts unrealistic precision
    There are inexplicable outliers
    An index masks underlying variation
    Results have been p-hacked
    Benford’s Law fails
    It’s too good to be true

    Issues a programmer should help you solve

    Data are aggregated to the wrong categories or geographies
    Data are in scanned documents

    #data-journalisme