Predictive Model for Breast Cancer Incidence

Bharat Solanky and Aadya Chawla

Link to GitHub page

Project Background:

More than 200,000 women are diagnosed with breast cancer every year. It is the most common form of cancer for women and is responsible for the second highest number of deaths for females in the United States. We want to better understand the properties of tumor cells and how age and tumor size affect the likelihood of being diagnosed with breast cancer. This is important because a study by Orlando Health showed that 22% of women aged 35 to 44 have never had a mammogram and “have no plans to get one”. Our work can hopefully give us a better understanding of how severe this issue is and potentially provide a reason to spread more awareness on the subject.

Project Goals:

The goals of this project were to produce a predictive model to determine the likelihood of breast cancer incidence in female patients using physical metrics, such as the concavity, radius, and texture of perceived benign tumors. We then wanted to understand how certain hormonal protein levels can be used to indicate what type of surgery is required for women who have malignant tumors.

Project DataSet and Plan:

The datasets we worked with are from Kaggle.com 1 2 3 . As background, medical journals report that the accuracy of visually diagnosed breast fine needle aspiration (FNA) is about 94.3%, with a mean sensitivity of 91% and a specificity of 87%. The dataset is randomly divided into two disjoint subgroups: the first to train the prediction model and the other to test the accuracy of the developed models. Based on preliminary analysis, the three features most highly associated with the diagnosis of breast cancer are the following (the description of variables is below):

  1. Concave_points_worst: The average value for malignant cases is 0.18, whereas it is 0.07 for benign cases;
  2. Radius_worst: The average value for malignant cases is 21.13, whereas it is 13.38 for benign cases; and
  3. Texture_worst: The average value for malignant cases is 29.32, whereas it is 23.52 for benign cases.
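
The per-class averages listed above can be computed with a single groupby. The following is a minimal sketch on a tiny made-up sample (the values and the name `sample` are illustrative assumptions, not rows from the Kaggle file); in the notebook the same call would be applied to the loaded dataframe.

```python
import pandas as pd

# Hypothetical mini-sample; the real analysis would use the frame read from data.csv.
sample = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B"],
    "radius_worst": [25.0, 21.0, 13.0, 14.0],
    "texture_worst": [30.0, 28.0, 22.0, 24.0],
})

# Mean of every numeric feature within each diagnosis class (B = benign, M = malignant).
class_means = sample.groupby("diagnosis").mean(numeric_only=True)
print(class_means)
```

Comparing the two rows of `class_means` feature by feature is one simple way to spot which features separate the classes most strongly.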

Our goal was also to provide guidelines for medical professionals to assist them with the process of visual diagnosis (prediction) by identifying potentially extreme values above or below which the likelihood of breast cancer changes significantly. For example, above what value of Radius_worst does the likelihood of cancer reach 95% or 100%?
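
One way to approach the threshold question above is to scan the observed values of a feature and report the first one above which the malignant share reaches a target. This is only a sketch under stated assumptions: the helper names `malignant_share_above` and `smallest_cutoff` and the sample values are hypothetical, and the diagnosis column is assumed to use the labels "M" and "B".

```python
import pandas as pd

def malignant_share_above(frame, column, threshold):
    """Fraction of rows with frame[column] > threshold that are malignant ('M')."""
    subset = frame[frame[column] > threshold]
    if subset.empty:
        return float("nan")
    return (subset["diagnosis"] == "M").mean()

def smallest_cutoff(frame, column, target):
    """Smallest observed value above which the malignant share is at least `target`."""
    for value in sorted(frame[column].unique()):
        if malignant_share_above(frame, column, value) >= target:
            return value
    return None  # no observed value reaches the target

# Hypothetical sample, not taken from the dataset.
sample = pd.DataFrame({
    "radius_worst": [12.0, 13.0, 15.0, 16.0, 18.0, 22.0, 25.0],
    "diagnosis": ["B", "B", "M", "B", "M", "M", "M"],
})
cutoff_95 = smallest_cutoff(sample, "radius_worst", 0.95)
print(cutoff_95)
```

The malignant share is not guaranteed to rise monotonically with the threshold, so a cutoff found this way should be checked against the shares at neighboring values before it is treated as a guideline.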

Collaboration:

We created a shared Google Colab notebook on Google Drive so that we could share files and easily edit and write code live. We met weekly leading up to the final deadline to divide the work and refine our project objectives. Throughout the semester, we used text messages and met in person to communicate with each other.

ETL (Extract, Transform, and Load):

We loaded three datasets, all of which are .csv files available on the Kaggle.com website. The first dataset contains the 32 features described above for 569 patients. The second dataset covers confirmed malignant tumor patients, both alive and dead, and has data for 4,534 patients, including tumor stage, tumor size, and patient hormone levels. The third dataset also covers confirmed malignant tumor patients, both alive and dead, and has data for 317 patients, including tumor stage and the type of surgery required.

The first dataset does not have any missing data for any of the features. It does, however, contain an empty unnamed column ("Unnamed: 32"), which is removed right after the file is loaded below.
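
A quick way to confirm both statements is to count missing values per column and drop any column that is entirely empty. The sketch below uses a small made-up frame in place of the real file; the name `raw` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded file, with one entirely empty column.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = raw.isna().sum()

# Drop only the columns in which every value is missing.
cleaned = raw.dropna(axis=1, how="all")
print(missing_per_column)
print(list(cleaned.columns))
```

Using `how="all"` leaves columns with only a few missing values untouched, so real features are not dropped by accident.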

Filtering, Sorting, and Plotting Data Points

Below we have filtered data by malignant patient type, sorted to only show concavity points of their cells, and plotted it as a histogram.
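
The filter, sort, and histogram steps described above can be sketched as follows. The sample values are made up, and the column name "concave points_mean" is taken from the dataset header shown later in this notebook; the bin count is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with the two columns needed for this step.
sample = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "M", "B"],
    "concave points_mean": [0.147, 0.020, 0.070, 0.128, 0.031],
})

# Keep malignant rows only, then sort their concave-point values.
malignant = sample[sample["diagnosis"] == "M"]
sorted_points = malignant["concave points_mean"].sort_values()

# Histogram counts per bin (two equal-width bins here).
counts, bin_edges = np.histogram(sorted_points, bins=2)
print(counts, bin_edges)
```

In a notebook, `sorted_points.plot(kind="hist")` would draw the same histogram instead of returning the counts.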

In [ ]:
# Mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
! pwd

# Change to the Drive folder that contains the csv files
%cd /content/drive/My Drive/DataScienceProject


import pandas as pd
from matplotlib import pyplot as plt
from itertools import cycle, islice
pd.options.display.max_rows = 8

# Read the first dataset, stored as data.csv
df = pd.read_csv("data.csv")

# Drop the empty, unused column
df = df.drop(columns=['Unnamed: 32'])

# Display the first rows of the first dataset
df.head()
/content/drive/My Drive/DataScienceProject
/content/drive/My Drive/DataScienceProject
Out[ ]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 32 columns

Below are the dtypes for certain variables within the dataset. Later on, the diagnosis is converted from an object data type to float64 so that it becomes a numerical value, which is needed for the correlation calculation.

In [ ]:
df.dtypes
Out[ ]:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
                            ...   
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Length: 32, dtype: object

Below are the columns and first rows of the second dataset. The grade formerly recorded as "anaplastic; Grade IV" was cleaned up to match the other grades by making it simply a number (i.e. 4).

In [ ]:
# Read the second dataset, stored as Breast_Cancer.csv
df2 = pd.read_csv("Breast_Cancer.csv")

# Recode " anaplastic; Grade IV" as "4" so that every grade is a plain number string
df2['Grade'] = df2['Grade'].astype(str).str.replace(" anaplastic; Grade IV", "4", regex=False)
df2.head()
Out[ ]:
Age Race Marital Status T Stage N Stage 6th Stage differentiate Grade A Stage Tumor Size Estrogen Status Progesterone Status Regional Node Examined Reginol Node Positive Survival Months Status
0 68 White Married T1 N1 IIA Poorly differentiated 3 Regional 4 Positive Positive 24 1 60 Alive
1 50 White Married T2 N2 IIIA Moderately differentiated 2 Regional 35 Positive Positive 14 5 62 Alive
2 58 White Divorced T3 N3 IIIC Moderately differentiated 2 Regional 63 Positive Positive 14 7 75 Alive
3 58 White Married T1 N1 IIA Poorly differentiated 3 Regional 18 Positive Positive 2 1 84 Alive
4 47 White Married T2 N1 IIB Poorly differentiated 3 Regional 41 Positive Positive 3 1 50 Alive

Exploratory Data Analysis (EDA)

Below is a bar chart showing the 4 grades of breast cancer in the second dataset, with patient outcomes plotted in green and red. Green represents patients who were alive when the data was collected, and red represents patients who had died. As can be seen, the higher a patient's cancer grade, the greater the proportion of patients who died.

In [ ]:
# Proportion of alive and dead patients within each grade
status_by_grade = df2.groupby('Grade')["Status"].value_counts(normalize=True)

# Alternate green (Alive) and red (Dead), one color per bar
my_colors = list(islice(cycle(['g', 'r']), None, len(status_by_grade)))

chart1 = status_by_grade.plot(kind='bar', legend=True, color=my_colors, title='Patient Outcomes by Cancer Grade')
plt.legend(['Alive', 'Dead'],
            loc='upper right', title='Status')

# The bars show proportions, not counts
chart1.set_ylabel("Proportion of patients")
chart1.set_xlabel("Cancer Grade & Status")
Out[ ]:
Text(0.5, 0, 'Cancer Grade & Status')

The histogram below shows the age distribution of the patients in the second dataset. The number of patients rises with age and spikes at age 45 in this graph. This may be seen as "general knowledge", but it is consistent with the statement that tumors become more common with age.

In [ ]:
chart2 = df2.hist(column='Age', legend = True)
plt.title('Frequency of tumors by age')
plt.xlabel('Age')
plt.ylabel('Frequency')
Out[ ]:
Text(0, 0.5, 'Frequency')

In the dataframe printed below, we can see a clear association between tumor size and the grade of cancer: as the grade (or "stage") of cancer increases, the mean tumor size also increases, from 26.4 for grade 1 to 44.2 for grade 4. The chart doesn't clearly show this, as there are too few data points in the grade 4 category for them to be visible; however, the averages in the dataframe make it clear.

In [ ]:
# Copy Tumor Size to a column with a shorter name for convenience
df2["Size"] = df2["Tumor Size"]

# Overlaid histograms of tumor size, one per grade
chart3 = df2.groupby("Grade").Size.plot.hist(alpha=.5, density=False, legend=True)
plt.title('Tumor Size by Cancer Grade')
plt.xlabel('Tumor Size')
# Mean of each numeric column per grade
print(df2.groupby("Grade").mean(numeric_only=True))
             Age  Tumor Size  Regional Node Examined  Reginol Node Positive  \
Grade                                                                         
1      55.289134   26.364641               12.675875               3.068140   
2      54.322416   29.729051               14.387920               3.922586   
3      52.615662   33.823582               15.111611               5.154815   
4      52.315789   44.157895               14.473684               6.157895   

       Survival Months       Size  
Grade                              
1            72.937385  26.364641  
2            72.179073  29.729051  
3            68.749775  33.823582  
4            64.421053  44.157895  
In [ ]:
df3 = pd.read_csv("BRCA.csv") 
#This reads the data file which is named BRCA.csv

df3.head()
#Third dataset --> contains data on surgery type and protein concentration, both of which are new data columns
Out[ ]:
Patient_ID Age Gender Protein1 Protein2 Protein3 Protein4 Tumour_Stage Histology ER status PR status HER2 status Surgery_type Date_of_Surgery Date_of_Last_Visit Patient_Status
0 TCGA-D8-A1XD 36.0 FEMALE 0.080353 0.42638 0.54715 0.273680 III Infiltrating Ductal Carcinoma Positive Positive Negative Modified Radical Mastectomy 15-Jan-17 19-Jun-17 Alive
1 TCGA-EW-A1OX 43.0 FEMALE -0.420320 0.57807 0.61447 -0.031505 II Mucinous Carcinoma Positive Positive Negative Lumpectomy 26-Apr-17 09-Nov-18 Dead
2 TCGA-A8-A079 69.0 FEMALE 0.213980 1.31140 -0.32747 -0.234260 III Infiltrating Ductal Carcinoma Positive Positive Negative Other 08-Sep-17 09-Jun-18 Alive
3 TCGA-D8-A1XR 56.0 FEMALE 0.345090 -0.21147 -0.19304 0.124270 II Infiltrating Ductal Carcinoma Positive Positive Negative Modified Radical Mastectomy 25-Jan-17 12-Jul-17 Alive
4 TCGA-BH-A0BF 56.0 FEMALE 0.221550 1.90680 0.52045 -0.311990 II Infiltrating Ductal Carcinoma Positive Positive Negative Other 06-May-17 27-Jun-19 Dead
In [ ]:
# Display only the rows without null values (df3 itself is left unchanged)
df3.dropna()
Out[ ]:
Patient_ID Age Gender Protein1 Protein2 Protein3 Protein4 Tumour_Stage Histology ER status PR status HER2 status Surgery_type Date_of_Surgery Date_of_Last_Visit Patient_Status
0 TCGA-D8-A1XD 36.0 FEMALE 0.080353 0.42638 0.54715 0.273680 III Infiltrating Ductal Carcinoma Positive Positive Negative Modified Radical Mastectomy 15-Jan-17 19-Jun-17 Alive
1 TCGA-EW-A1OX 43.0 FEMALE -0.420320 0.57807 0.61447 -0.031505 II Mucinous Carcinoma Positive Positive Negative Lumpectomy 26-Apr-17 09-Nov-18 Dead
2 TCGA-A8-A079 69.0 FEMALE 0.213980 1.31140 -0.32747 -0.234260 III Infiltrating Ductal Carcinoma Positive Positive Negative Other 08-Sep-17 09-Jun-18 Alive
3 TCGA-D8-A1XR 56.0 FEMALE 0.345090 -0.21147 -0.19304 0.124270 II Infiltrating Ductal Carcinoma Positive Positive Negative Modified Radical Mastectomy 25-Jan-17 12-Jul-17 Alive
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
330 TCGA-A8-A085 44.0 MALE 0.732720 1.11170 -0.26952 -0.354920 II Infiltrating Lobular Carcinoma Positive Positive Negative Other 01-Nov-19 04-Mar-20 Dead
331 TCGA-A1-A0SG 61.0 FEMALE -0.719470 2.54850 -0.15024 0.339680 II Infiltrating Ductal Carcinoma Positive Positive Negative Lumpectomy 11-Nov-19 18-Jan-21 Dead
332 TCGA-A2-A0EU 79.0 FEMALE 0.479400 2.05590 -0.53136 -0.188480 I Infiltrating Ductal Carcinoma Positive Positive Positive Lumpectomy 21-Nov-19 19-Feb-21 Dead
333 TCGA-B6-A40B 76.0 FEMALE -0.244270 0.92556 -0.41823 -0.067848 I Infiltrating Ductal Carcinoma Positive Positive Negative Lumpectomy 11-Nov-19 05-Jan-21 Dead

317 rows × 16 columns

Below are two grouped summaries that display the status (i.e. dead or alive) of patients at the time the data was collected, grouped by their tumor stage at surgery: first as counts, then as proportions. As tumor stage increased, the proportion of patients alive after surgery decreased, from 83.6% at stage I to 79.1% at stage II and 76.9% at stage III.

In [ ]:
df3.groupby('Tumour_Stage')["Patient_Status"].value_counts()
Out[ ]:
Tumour_Stage  Patient_Status
I             Alive              51
              Dead               10
II            Alive             144
              Dead               38
III           Alive              60
              Dead               18
Name: Patient_Status, dtype: int64
In [ ]:
df3.groupby('Tumour_Stage')["Patient_Status"].value_counts(normalize=True)
Out[ ]:
Tumour_Stage  Patient_Status
I             Alive             0.836066
              Dead              0.163934
II            Alive             0.791209
              Dead              0.208791
III           Alive             0.769231
              Dead              0.230769
Name: Patient_Status, dtype: float64
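
The same proportions can be laid out as a stage-by-status table with `pd.crosstab`. The sketch below rebuilds a patient-level frame from the counts printed above, so it runs without the csv file; the names `counts`, `patients`, and `survival` are illustrative.

```python
import pandas as pd

# Counts copied from the grouped output above.
counts = {
    ("I", "Alive"): 51, ("I", "Dead"): 10,
    ("II", "Alive"): 144, ("II", "Dead"): 38,
    ("III", "Alive"): 60, ("III", "Dead"): 18,
}

# Expand the counts into one row per patient.
rows = [
    {"Tumour_Stage": stage, "Patient_Status": status}
    for (stage, status), n in counts.items()
    for _ in range(n)
]
patients = pd.DataFrame(rows)

# Row-normalized table: each row (stage) sums to 1.
survival = pd.crosstab(patients["Tumour_Stage"], patients["Patient_Status"], normalize="index")
print(survival)
```

With the real dataframe, the equivalent call would pass `df3["Tumour_Stage"]` and `df3["Patient_Status"]` directly.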

This bar chart further shows that as the stage of cancer got worse, a more invasive, more costly surgery was more often required. For context, a simple mastectomy (plotted in yellow) requires removal of significant tissue from a patient's breast, and a modified radical mastectomy (plotted in red) requires complete removal of the breast and the skin that surrounds it.

In [ ]:
# One hard-coded color per bar: yellow = simple mastectomy,
# red = modified radical mastectomy, black = the remaining surgery types
surgery_colors = ['black', 'black', 'yellow', 'red',
                  'black', 'red', 'yellow', 'black',
                  'red', 'black', 'yellow', 'black']
chart4 = df3.groupby('Tumour_Stage')["Surgery_type"].value_counts(normalize=True).plot(
    kind='bar', legend=True, color=surgery_colors,
    title='Surgery Type According to Tumor Stage')
chart4.set_xlabel("Tumor Stage and Surgery Type")
# The bars show proportions within each stage, not counts
chart4.set_ylabel("Proportion of surgeries")
Out[ ]:
Text(0, 0.5, 'Proportion of surgeries')

Below, we checked whether the mean protein concentration values of patients had any relation to tumor stage. Although the mean Protein2 and Protein4 concentrations both decline consistently with increasing tumor stage, it is unclear whether this association is useful for predicting tumor stage.

In [ ]:
# Mean of each numeric column per tumor stage
df3.groupby(by=["Tumour_Stage"]).mean(numeric_only=True)
Out[ ]:
                    Age   Protein1   Protein2   Protein3   Protein4
Tumour_Stage
I             62.359375  -0.014430   1.001318  -0.165147   0.037828
II            59.052910  -0.007734   0.964763  -0.065409   0.018023
III           55.753086  -0.094220   0.862207  -0.088845  -0.031453

From the first dataframe df (shown again below), we established a relationship between "radius_worst" (the largest radius value recorded for a patient), "radius_mean" (the mean radius for that patient), and "radius_se" (the standard error of that patient's radius measurements). The aim is to find how many standard errors above the mean the largest radius must be before nearly all of the diagnoses are malignant, which is a step that helps guide the questions we ask next. Each cell below keeps the patients for whom radius_worst > radius_mean + k * radius_se, for a multiplier k of 11, 9, 7, and 5. As the multiplier increases (5 to 7 to 9 to 11), the percentage of diagnoses that are malignant increases. This may show that when a patient's largest radius is significantly larger than the mean, a malignant diagnosis is likely.
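
The four multiplier checks can also be written as one loop over k. This is a minimal sketch on a made-up sample: the helper name `malignant_share` and all values are hypothetical, and the diagnosis column is assumed to still hold the labels "M" and "B".

```python
import pandas as pd

def malignant_share(frame, k):
    """Share of malignant rows among those with radius_worst > radius_mean + k * radius_se."""
    cutoff = frame["radius_mean"] + k * frame["radius_se"]
    flagged = frame[frame["radius_worst"] > cutoff]
    if flagged.empty:
        return float("nan")
    return (flagged["diagnosis"] == "M").mean()

# Hypothetical sample, not taken from the dataset.
sample = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B", "M"],
    "radius_mean": [18.0, 20.0, 12.0, 11.0, 15.0],
    "radius_se": [0.5, 0.4, 0.3, 0.2, 0.5],
    "radius_worst": [25.0, 24.0, 14.0, 12.5, 18.0],
})

# Malignant share for each multiplier used in the notebook.
shares = {k: malignant_share(sample, k) for k in (5, 7, 9, 11)}
print(shares)
```

Collecting the shares in one dictionary makes it easy to see the trend across multipliers at a glance, and the number of flagged rows should be reported alongside each share, since large multipliers can leave very few patients.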

In [ ]:
df.head()
Out[ ]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 32 columns

In [ ]:
my_colors = list(islice(cycle(['y', 'g']), None, len(df)))


rd = df[df["radius_worst"] > (df["radius_mean"] + 11 * df["radius_se"])]
rd["diagnosis"].value_counts()
rd["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad36e9610>
In [ ]:
rd2 = df[df["radius_worst"] > (df["radius_mean"] + 9 * df["radius_se"])]
rd2["diagnosis"].value_counts()
rd2["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad36b1fd0>
In [ ]:
rd3 = df[df["radius_worst"] > (df["radius_mean"] + 7 * df["radius_se"])]
rd3["diagnosis"].value_counts()
rd3["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad36efdf0>
In [ ]:
rd4 = df[df["radius_worst"] > (df["radius_mean"] + 5 * df["radius_se"])]
rd4["diagnosis"].value_counts()
rd4["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad3606880>

K-Nearest Neighbors Method

The code below uses the K-Nearest Neighbors (knn) classifier to predict whether the diagnosis of the tumor is malignant or benign. The end result is the accuracy of the predictions (the fraction of diagnoses predicted correctly), averaged over a two-fold cross-validation in which each half of the data is used once for training and once for validation. We would like to perform more analysis to see how we can use k-nearest neighbors to more fully understand the relationship between the training variables, including mean radius, texture, perimeter, smoothness, etc., and the target value, the diagnosis.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Randomly split the data into two disjoint halves
train = df.sample(frac=.5)
val = df.drop(train.index)

# Features used for prediction
features = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean",
            "smoothness_mean", "compactness_mean", "concavity_mean"]

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["diagnosis"]
y_val = val["diagnosis"]

def get_val_error(X_train_dict, y_train, X_val_dict, y_val):
    """Fit a 7-nearest-neighbors classifier and return its accuracy on the validation data."""

    # convert the feature dictionaries to numeric arrays
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)

    # standardize the data, using statistics from the training half only
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)

    # Fit a 7-nearest neighbors model.
    model = KNeighborsClassifier(n_neighbors=7)
    model.fit(X_train_sc, y_train)

    # Despite the function's name, this is the accuracy
    # (fraction of correct predictions), not an error.
    return model.score(X_val_sc, y_val)
In [ ]:
val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val)
val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train)

print("Test Score:",val1)
print("Test Score:",val2)

(val1+val2)/2
Test Score: 0.9157894736842105
Test Score: 0.9295774647887324
Out[ ]:
0.9226834692364714

As can be seen above, our knn classifier appears to be a strong predictor. Averaged over the two splits, it correctly predicts about 92% of diagnoses based on the 7 independent variables: radius, texture, perimeter, area, smoothness, compactness, and concavity. With such a high accuracy, breast cancer detection from these measurements appears feasible. Moreover, with the Canadian Medical Association Journal citing a combined ultrasound and mammography detection rate of 37.5%, it appears that taking simple measurements of breast tumor metrics (via an MRI scan) may offer a higher detection rate.
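
The two-fold evaluation above can also be expressed with a scikit-learn pipeline, which refits the scaler on each training fold automatically. The sketch below is an alternative formulation, not the notebook's own code: it runs on synthetic data (two Gaussian clusters standing in for the 7 features), and the fold setup and random seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7 features: two shifted Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 7)),   # "benign" cluster
    rng.normal(2.0, 1.0, size=(60, 7)),   # "malignant" cluster
])
y = np.array(["B"] * 60 + ["M"] * 60)

# Standardize, then classify with 7 nearest neighbors.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

# Two stratified folds: each half is used once for training, once for validation.
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds)
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Stratified folds keep the benign-to-malignant ratio the same in both halves, which a plain random split does not guarantee.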

Below, the correlation between mean radius and diagnosis is found to be 0.730, indicating a strong positive relationship between the two. This is consistent with the above findings on diagnosis.

In [ ]:
# Encode the diagnosis numerically: benign -> 0.0, malignant -> 1.0
df['diagnosis'] = df['diagnosis'].map({"B": 0.0, "M": 1.0})

x = df["radius_mean"]
y = df['diagnosis']
# Pearson correlation matrix between mean radius and diagnosis
r = np.corrcoef(x, y)
r
Out[ ]:
array([[1.        , 0.73002851],
       [0.73002851, 1.        ]])
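
The off-diagonal entry of the matrix above is the Pearson correlation coefficient. As a sketch of what `np.corrcoef` computes, the function below evaluates the same quantity from its definition on made-up values; the name `pearson_r` and the sample numbers are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centered x
    yc = y - y.mean()   # centered y
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Hypothetical radii with a 0/1 diagnosis label (1 = malignant).
radius = [18.0, 20.5, 11.5, 12.0, 13.5]
label = [1, 1, 0, 0, 0]

r_manual = pearson_r(radius, label)
r_numpy = np.corrcoef(radius, label)[0, 1]
print(r_manual, r_numpy)
```

Because the diagnosis takes only the values 0 and 1, this coefficient mainly reflects how far apart the mean radii of the two classes are relative to the overall spread.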

Project Overview

We created 2 major predictors to find out how age, tumor size, and other common tumor variables can impact tumor diagnosis and determine the likelihood of extensive surgery types.

  1. First, using the mean radius and the standard error of the radius of a person's tumor, we predicted whether the largest tumor is malignant or benign. Mean radius and standard error are the independent variables, and the simple inequality shown above indicates that the likelihood of a malignant tumor increases as the radius of the largest tumor significantly exceeds the mean.
  2. Second, using a k-nearest neighbors classifier, we examined the relationship between the chosen training variables, including mean radius, texture, perimeter, smoothness, etc., and the predicted value, the diagnosis. Our model predicted the diagnosis accurately, with an accuracy of about 92%.

Project Conclusion:

Tumor size shows a strong correlation with a malignant diagnosis. Moreover, malignant tumors generally progress steadily to higher stages of cancer, which, if not detected in time, require more extensive, costly surgeries and have lower survival rates.
