Dataset Nutrition Label

The Dataset Nutrition Label aims to create a standard for interrogating datasets for measures that will ultimately drive the creation of better, more inclusive machine learning models. Our current prototype includes several ‘modules’ covering a variety of qualitative and quantitative information that we believe is useful for exploring a dataset before models are developed on it.



We developed this Label on ProPublica’s Dollars for Docs (2013-2015) dataset, which details payments made from pharmaceutical companies to doctors. You can navigate through the modules using the links on the left.

To learn more, please visit our website, read our paper abstract, or email us at nutrition@media.mit.edu.
Dataset Facts
ProPublica’s Dollars for Docs Data
Metadata
Filename: 201612v1-docdollars-product_payments
Format: csv
Domain: healthcare
Keywords: Physicians, drugs, medicine, pharmaceutical, transactions
Type: tabular
Rows: 500
Columns: 18
Missing: 5.2%
License: cc
Released: JAN 2017
Range: AUG 2013 to DEC 2015
Description
This is the data used in ProPublica’s Dollars for Docs news application. It is primarily based on CMS’s Open Payments data, but we have added a few features. ProPublica has standardized drug, device and manufacturer names, and made a flattened table (product_payments) that makes it easier to aggregate the payments associated with each drug/device. In [1], one payment record can be attributed to up to five different drugs or medical devices. This table flattens the payments out so that each drug/device related to each payment gets its own line.
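The flattening described here can be sketched as follows. This is a minimal illustration in Python, not ProPublica's actual code: the wide-format field names follow the 'name_of_associated_covered_drug_or_biologicalX' pattern quoted in the variable descriptions, while `flatten_payment`, the sample record, and the assumption of up to five numbered slots per product type are hypothetical.

```python
# Field-name prefixes follow the pattern quoted in the variable descriptions.
DRUG_PREFIX = "name_of_associated_covered_drug_or_biological"
DEVICE_PREFIX = "name_of_associated_covered_device_or_medical_supply"
MAX_PRODUCTS = 5  # assumption: numbered slots 1..5 for each product type


def flatten_payment(record):
    """Return one output row per product named in a wide payment record."""
    products = []
    for prefix, is_drug in ((DRUG_PREFIX, "t"), (DEVICE_PREFIX, "f")):
        for i in range(1, MAX_PRODUCTS + 1):
            name = (record.get(f"{prefix}{i}") or "").strip()
            if name:
                products.append((name, is_drug))
    # 't' when the original record named more than one drug or device
    has_many = "t" if len(products) > 1 else "f"
    return [
        {
            "general_transaction_id": record["general_transaction_id"],
            "original_product_name": name,
            "product_is_drug": is_drug,
            "payment_has_many": has_many,
        }
        for name, is_drug in products
    ]


# Hypothetical wide record; the values are made up for illustration.
wide = {
    "general_transaction_id": "12345",
    DRUG_PREFIX + "1": "Xarelto",
    DRUG_PREFIX + "2": "Aciphex",
    DEVICE_PREFIX + "1": "",
}
rows = flatten_payment(wide)
for row in rows:
    print(row)
```

Each product gets its own row, so summing or counting payments per drug/device becomes a plain group-by on the product column.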
Provenance
Source name: U.S. Centers for Medicare & Medicaid Services
Variables
Id
A unique ID number for this payment & product combination. This is assigned by ProPublica for internal use
Applicable_manufacturer_or_applicable_gpo_making_payment_id
ID of the applicable manufacturer or submitting applicable GPO making the payment or other transfer of value
Date_of_payment
If a singular payment, then this is the actual date the payment was issued; if a series of payments or an aggregated set of payments, this is the date of the first payment to the covered recipient in this program year
General_transaction_id
System-assigned identifier to the general transaction at the time of submission
Program_year
The calendar year for which the payment is reported in Open Payments
Product_name
Derived from the 'name_of_associated_covered_drug_or_biologicalX' field (for drugs) or 'name_of_associated_covered_device_or_medical_supplyX' field (for medical devices). Where possible, multiple versions of the same product are converted to the same product_name (e.g. records for 'Zorvolex 65mg' and 'Zorvolex 35mg' will be converted to 'Zorvolex'). The original value is contained in original_product_name
Original_product_name
A copy of the original 'name_of_associated_covered_drug_or_biologicalX' field (for drugs) or 'name_of_associated_covered_device_or_medical_supplyX' field (for medical devices)
Product_ndc
If the product is a drug, this is a copy of the original 'ndc_of_associated_covered_drug_or_biologicalX' field
Product_is_drug
't' if the product is a drug (contained in a 'name_of_associated_covered_drug_or_biologicalX' field). 'f' if the product is a medical device (contained in a 'name_of_associated_covered_device_or_medical_supplyX' field)
Payment_has_many
't' if the original payment record included data on more than one drug or device, e.g. 'name_of_associated_covered_drug_or_biological1' and 'name_of_associated_covered_drug_or_biological2', 'name_of_associated_covered_device_or_medical_supply1' and 'name_of_associated_covered_device_or_medical_supply2', etc.
Teaching_hospital_id
Open Payments system-generated unique identifier of the teaching hospital receiving the payment or other transfer of value
Physician_profile_id
ID of the physician receiving the payment or other transfer of value
Recipient_state
The state or territory abbreviation of the primary business address of the physician or teaching hospital or non-covered recipient entity receiving the payment or other transfer of value if the primary business address is in the United States
Applicable_manufacturer_or_applicable_gpo_making_payment_name
Textual proper name of the applicable manufacturer or applicable GPO making the payment or other transfer of value. This field has been standardized to eliminate different names attributable solely to punctuation
Teaching_hospital_ccn
A unique identifying number (CMS Certification Number) of the Teaching Hospital receiving the payment or other transfer of value
Product_slug
Used internally at ProPublica for web display on the Dollars for Docs app. You can pull up the corresponding Dollars for Docs page for a product by appending product_slug to https://projects.propublica.org/docdollars/products/, e.g. https://projects.propublica.org/docdollars/products/device-dental-cabinetry
Total_amount_of_payment_usdollars
U.S. dollar amount of payment or other transfer of value to recipient (manufacturer must convert to dollar currency if necessary)
Number_of_payments_included_in_total_amount
The number of discrete payments being reported in the 'Total Amount of Payment' data element
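Two of the variable descriptions above translate directly into small helpers: building a product page URL from product_slug, and reading the 't'/'f' flags used by product_is_drug and payment_has_many. The sketch below is illustrative only; the function names `product_url` and `parse_flag` are hypothetical, and treating any value other than 't' or 'f' as missing is an assumption.

```python
from urllib.parse import quote

# Base URL quoted in the product_slug description.
PRODUCT_BASE_URL = "https://projects.propublica.org/docdollars/products/"


def product_url(product_slug):
    """Append a product_slug to the Dollars for Docs product base URL."""
    return PRODUCT_BASE_URL + quote(product_slug)


def parse_flag(value):
    """Map 't'/'f' to True/False; anything else (e.g. empty) becomes None."""
    if not value:
        return None
    return {"t": True, "f": False}.get(value.strip().lower())


print(product_url("device-dental-cabinetry"))
print(parse_flag("t"), parse_flag("f"), parse_flag(""))
```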
Statistics
Ordinal
name | type | count | uniqueEntries | mostFrequent | leastFrequent | missing
id | number | 500 | 488 including missing | missing value (13) | multiple detected | 2.60%
applicable_manufacturer_or_applicable_gpo_making_payment_id | number | 500 | 4 | 100000000232 (417) | multiple detected | 0%
date_of_payment | date | 500 | 213 including missing | missing value (27) | multiple detected | 5.40%
general_transaction_id | number | 500 | 467 including missing | missing value (34) | multiple detected | 6.80%
program_year | number | 500 | 2 including missing | 2014 (495) | missing value (5) | 1.00%
Nominal
name | type | count | uniqueEntries | mostFrequent | leastFrequent | missing
product_name | string | 500 | 16 including missing | Xarelto (200) | Aciphex (1) | 3.20%
original_product_name | string | 500 | 15 | Xarelto (212) | Aciphex (1) | 0%
product_ndc | number | 500 | 21 including missing | 5045857810 (201) | multiple detected | 5.00%
product_is_drug | boolean | 500 | 2 including missing | t (492) | missing value (8) | 1.60%
payment_has_many | boolean | 500 | 3 including missing | f (267) | missing value (29) | 5.80%
teaching_hospital_id | number | 500 | 2 including missing | 0 (464) | missing value (36) | 7.20%
physician_profile_id | number | 500 | 230 including missing | missing value (32) | multiple detected | 6.40%
recipient_state | string | 500 | 40 | CA (56) | multiple detected | 0%
applicable_manufacturer_or_applicable_gpo_making_payment_name | string | 500 | 5 including missing | Janssen Pharmaceuticals, Inc (386) | multiple detected | 7.00%
teaching_hospital_ccn | number | 500 | 2 including missing | 0 (481) | missing value (19) | 3.80%
product_slug | string | 500 | 15 including missing | drug-xarelto (196) | drug-aciphex (1) | 8.20%
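The fields reported for the Ordinal and Nominal variables (count, uniqueEntries, mostFrequent, leastFrequent, missing) can be computed along the following lines. This is a minimal sketch, not the Label's own implementation: the function name `summarize_categorical` is hypothetical, and it assumes that empty strings and None are the missing markers, that missing values count as one entry, and that ties are reported as 'multiple detected'.

```python
from collections import Counter

MISSING = "missing value"


def summarize_categorical(values):
    """Summarize one non-empty column of categorical values."""
    labels = [MISSING if v is None or v == "" else v for v in values]
    counts = Counter(labels)
    n_missing = counts.get(MISSING, 0)

    def pick(target):
        # Report the single label with this frequency, or flag a tie.
        tied = [label for label, c in counts.items() if c == target]
        return f"{tied[0]} ({target})" if len(tied) == 1 else "multiple detected"

    return {
        "count": len(labels),
        "uniqueEntries": len(counts),  # missing counts as one entry
        "mostFrequent": pick(max(counts.values())),
        "leastFrequent": pick(min(counts.values())),
        "missing": f"{100 * n_missing / len(labels):.2f}%",
    }


# Synthetic column shaped like program_year: 495 values of 2014, 5 missing.
column = ["2014"] * 495 + [""] * 5
summary = summarize_categorical(column)
print(summary)
```

With this synthetic input the summary reports 2 unique entries, '2014 (495)' as most frequent, 'missing value (5)' as least frequent, and 1.00% missing.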
Continuous
name | type | count | min | median | max | mean | standardDeviation | missing | zeros
total_amount_of_payment_usdollars | number | 500 | 0.14 | 14.00 | 5000 | 134.21 | 501.99 | 9.40% | 0%
Discrete
name | type | count | min | median | max | mean | standardDeviation | missing | zeros
number_of_payments_included_in_total_amount | number | 500 | 1 | 1.00 | 1 | 1.00 | 0.00 | 4.80% | 0%
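The fields reported for the Continuous and Discrete variables can be computed with Python's standard library, as in the sketch below. Several choices here are assumptions rather than facts about the Label: the function name `summarize_numeric` is hypothetical, missing values are taken to be empty strings or None, the standard deviation is the population form, and the missing and zeros percentages are taken over all rows.

```python
import statistics


def summarize_numeric(values):
    """Summarize one numeric column; '' and None count as missing."""
    present = [float(v) for v in values if v is not None and v != ""]
    total = len(values)
    n_zero = sum(1 for x in present if x == 0)
    return {
        "count": total,
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        # Assumption: population standard deviation (pstdev), not sample.
        "standardDeviation": statistics.pstdev(present),
        "missing": f"{100 * (total - len(present)) / total:.2f}%",
        "zeros": f"{100 * n_zero / total:.2f}%",
    }


# Synthetic column shaped like number_of_payments_included_in_total_amount:
# 476 rows with the value 1 and 24 missing rows.
column = ["1"] * 476 + [""] * 24
summary = summarize_numeric(column)
print(summary)
```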
Pair Plot
[Figure: pair plot]
Probabilistic Model
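[Figure: panel "eliquis". One axis lists states and territories: CA, TX, FL, NC, MI, GA, AZ, MD, OK, KY, MO, AR, LA, WA, KS, HI, NM, PR, AK, WI, ME, MN, NE, SD. The value axis runs from 0 to 0.1.]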
Ground Truth Correlations
[Figure: correlations between total_amount_of_payment_usdollars and the following variables: total_population, White_alone, Black_or_African_American_alone, American_Indian_and_Alaska_Native_alone, Asian_alone, Native_Hawaiian_and_Other_Pacific_Islander_alone, Some_Other_Race_alone, Two_or_More_Races, Hispanic_or_Latino, Urban, Rural, White_alone_fraction, Black_or_African_American_alone_fraction, American_Indian_and_Alaska_Native_alone_fraction, Asian_alone_fraction, Native_Hawaiian_and_Other_Pacific_Islander_alone_fraction, Some_Other_Race_alone_fraction, Two_or_More_Races_fraction, Hispanic_or_Latino_fraction, Urban_fraction, Rural_fraction, total_amount_of_payment_usdollars. Scale from 0 to 1.]
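A correlation between payment amounts and one of the demographic variables named in this module could be computed as in the sketch below. The module does not state which coefficient it uses, so Pearson's r is an assumption here; the function name `pearson` and the per-state numbers are hypothetical and exist only to make the example runnable.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-state values, for illustration only.
total_amount_of_payment_usdollars = [10.0, 20.0, 30.0, 40.0]
urban_fraction = [0.2, 0.4, 0.6, 0.8]

r = pearson(total_amount_of_payment_usdollars, urban_fraction)
print(round(r, 3))
```

Because the plotted scale runs from 0 to 1, the module may display absolute values or a different measure altogether; taking `abs(r)` would map this sketch onto that range.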