Breast Cancer

Breast cancer is the most common cancer and the second leading cause of cancer-related death among women in the U.S. Known breast cancer risk factors include age, race/ethnicity, reproductive factors, and benign breast disease. Family history of breast cancer and hereditary cancer syndromes, such as BRCA1/BRCA2 mutations, confer the strongest risk for this disease. Although a number of genome-wide association studies (GWAS) have sought genetic predictors of breast cancer, most have focused on high-risk cohorts of women with a strong family history rather than population-based cohorts, and few have examined genetic predictors by breast cancer subtype. For example, BRCA1 mutation carriers tend to develop estrogen receptor (ER)-negative breast cancers, whereas the majority of breast tumors in BRCA2 mutation carriers are ER-positive. The purpose of this algorithm is to identify breast cancer subtypes based upon tumor hormone receptor (HR) status. FDA-approved drugs such as tamoxifen have been shown to reduce the incidence of ER-positive breast cancer by 50-65% among high-risk women. Identifying cohorts of women who are more likely to benefit from anti-estrogen therapy may enable a precision medicine approach to breast cancer prevention.

Phenotype ID: 
Type of Phenotype: 
Authors: Ning Shang, George Hripcsak, Chunhua Weng, Wendy K. Chung, Katherine Crew
Contact Author: 
Date Created: Thursday, June 28, 2018
Owner Phenotyping Groups: 
Data Model: 

Suggested Citation

Ning Shang, George Hripcsak, Chunhua Weng, Wendy K. Chung, Katherine Crew. Breast Cancer. PheKB; 2018 Available from:



FYI, KP's pathology database captures both left and right breast densities. However, BreastCancerDd6_V1_breastDensity.csv does not include a variable to indicate which breast was measured (LEFT or RIGHT). How would you like us to handle cases where the densities of the right and left breasts are coded differently in our path database?



Hello Arvind, 

Thanks for pointing out this problem. We have added a "location" column to capture this information. Please check the new data dictionary file, BreastCancerDd6_V1b_breastDensity.csv. Again, we really appreciate your finding.

Best regards,
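For sites that want to see how often the two sides disagree before submitting, here is a minimal sketch in Python. Only the "location" column comes from the updated data dictionary; the other field names (patient_id, age_days, density) and the density labels are hypothetical placeholders, not the dictionary's actual variables or codes.

```python
from collections import defaultdict


def find_discordant_density(rows):
    """Return exams whose LEFT and RIGHT density values differ.

    Each row is a dict with hypothetical keys patient_id, age_days,
    location ("LEFT"/"RIGHT"), and density. Exams are grouped by
    (patient_id, age_days).
    """
    by_exam = defaultdict(dict)
    for row in rows:
        key = (row["patient_id"], row["age_days"])
        by_exam[key][row["location"].upper()] = row["density"]

    # Keep only exams that have both sides recorded with different values.
    return {
        key: sides
        for key, sides in by_exam.items()
        if "LEFT" in sides
        and "RIGHT" in sides
        and sides["LEFT"] != sides["RIGHT"]
    }


# Placeholder records; density labels are illustrative only.
rows = [
    {"patient_id": "P1", "age_days": 20000, "location": "LEFT", "density": "scattered"},
    {"patient_id": "P1", "age_days": 20000, "location": "RIGHT", "density": "dense"},
    {"patient_id": "P2", "age_days": 18000, "location": "LEFT", "density": "fatty"},
    {"patient_id": "P2", "age_days": 18000, "location": "RIGHT", "density": "fatty"},
]
discordant = find_discordant_density(rows)
```

With one row per side, discordant exams need no special handling in the submission itself; the check is only useful for reporting how common the situation is.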


In the BMI data dictionary (BreastCancerDd2_V1_BMI.csv) you have height listed in centimeters (cm) and weight in pounds (lbs). Did you want us to submit weight measures in kilograms (kg) or pounds (lbs)?
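Whichever unit is chosen, converting at the site is straightforward. The sketch below is not part of the phenotype definition; the function names and accepted unit strings are assumptions made for illustration. It normalizes a weight to kilograms and computes BMI from height in centimeters.

```python
LB_TO_KG = 0.45359237  # one pound in kilograms (exact by definition)


def weight_to_kg(value, unit):
    """Convert a weight to kilograms. Unit strings are hypothetical."""
    unit = unit.strip().lower()
    if unit in ("kg", "kgs"):
        return float(value)
    if unit in ("lb", "lbs"):
        return float(value) * LB_TO_KG
    raise ValueError(f"unsupported weight unit: {unit!r}")


def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by height in meters squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Example: a weight recorded in pounds and a height recorded in centimeters.
example_bmi = bmi(weight_to_kg(150, "lbs"), 170)
```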

Regarding the hormone_receptor_status field in the demographics data dictionary (BreastCancerDd1_V1_Demo_phenotyping):

If a patient has an ER UNKNOWN and a PR NEG at age 21388 days, and subsequently has an ER NEG and a PR NEG at age 24042 days, would their hormone_receptor_status be classified as NEG or UNKNOWN?
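To make the ambiguity concrete, here is a small Python sketch of two candidate readings. It is not the authors' rule: the function names, the record layout, and the convention that a result is POS if either receptor is positive and NEG only if both are negative are all assumptions for illustration.

```python
def combine_er_pr(er, pr):
    """Combine one ER and one PR result into a single HR status.

    Assumed convention (not from the data dictionary): POS if either
    receptor is POS, NEG only if both are NEG, otherwise UNKNOWN.
    """
    values = {er.upper(), pr.upper()}
    if "POS" in values:
        return "POS"
    if values == {"NEG"}:
        return "NEG"
    return "UNKNOWN"


def status_by_policy(observations, policy):
    """Pick a patient-level status from dated ER/PR results.

    policy "first" uses the earliest result; "latest" uses the most recent.
    Both policies are hypothetical candidates.
    """
    ordered = sorted(observations, key=lambda o: o["age_days"])
    statuses = [combine_er_pr(o["er"], o["pr"]) for o in ordered]
    if policy == "first":
        return statuses[0]
    if policy == "latest":
        return statuses[-1]
    raise ValueError(f"unknown policy: {policy!r}")


# The case described in the question above.
observations = [
    {"age_days": 21388, "er": "UNKNOWN", "pr": "NEG"},
    {"age_days": 24042, "er": "NEG", "pr": "NEG"},
]
first_status = status_by_policy(observations, "first")
latest_status = status_by_policy(observations, "latest")
```

Under these assumptions the earliest result gives UNKNOWN and the most recent gives NEG, so the answer depends on which time point the phenotype intends.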

I am seeing values 0-6, which don't match your DD. Is it possible some of these are from a different scale than the one you are using? These are sometimes identified by Roman numerals, in case that is meaningful to you.

Thank you for finding this. Can you provide a textual description of values 0-6? Then I will compare both scales and figure out how to combine this information in the data dictionary.

The "BI-RADS 0-6 categories" are for reporting abnormal results on the mammogram, not for describing breast density. Our dictionary is meant to query the breast density description.

Hi, 2 questions: 

  1) I see that tables 1-5 contain OMOP concept IDs. Do you have OMOP queries you can share that pull these data?

  2) For family history of breast cancer, do you just need a yes/no for any family history of breast cancer, or were you limiting it to 1st degree relatives? And were you using NLP to pull this data from notes, or just pulling from tumor registry data and other sources?