Predictive Model for Breast Cancer Incidence

Bharat Solanky and Aadya Chawla

Project Background:

More than 200,000 women are diagnosed with breast cancer every year. It is the most common form of cancer among women and is responsible for the second-highest number of cancer deaths among women in the United States. We want to better understand the properties of tumor cells and how age and tumor size affect the likelihood of being diagnosed with breast cancer. This is important because a study by Orlando Health showed that 22% of women aged 35 to 44 have never had a mammogram and “have no plans to get one”. Our work can hopefully give a clearer picture of how serious this issue is and potentially provide a reason to spread more awareness of the subject.

Project Goals:

The goals of this project were to produce a predictive model that estimates the likelihood of breast cancer in female patients from physical metrics, such as concavity, radius, and texture, of tumors perceived to be benign. We also wanted to understand how certain hormonal protein levels can indicate what type of surgery is required for women who have malignant tumors.

Project DataSet and Plan:

The datasets we worked with are from Kaggle.com [1][2][3]. As background, medical journals report that the accuracy of visually diagnosed breast FNA is about 94.3%, with a mean sensitivity of 91% and a specificity of 87%. The dataset is randomly divided into two disjoint subgroups: one to train the prediction model and the other to test the accuracy of the developed models. Based on preliminary analysis, the three features most strongly associated with a breast cancer diagnosis are the following (the description of variables is below):

Concave_points_worst: The average value is 0.18 for malignant tumors, whereas it is 0.07 for benign ones;
Radius_worst: The average value is 21.13 for malignant tumors, whereas it is 13.38 for benign ones; and
Texture_worst: The average value is 29.32 for malignant tumors, whereas it is 23.52 for benign ones.
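A minimal sketch of how per-class averages like these can be computed with pandas. The four rows in `toy` are made-up illustration values, not rows from the Kaggle file; only the column names follow the dataset.

```python
import pandas as pd

# Made-up illustration rows (NOT taken from data.csv); column names follow the dataset.
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B"],
    "radius_worst": [22.0, 20.0, 13.0, 14.0],
    "texture_worst": [30.0, 28.0, 24.0, 22.0],
})

# Average of each feature within each diagnosis class (B = benign, M = malignant).
class_means = toy.groupby("diagnosis")[["radius_worst", "texture_worst"]].mean()
print(class_means)
```

Running the same `groupby` on the full dataframe would give one row of averages per diagnosis class.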

Our goal was also to provide guidelines that assist medical professionals with visual diagnosis (prediction) by identifying extreme values above or below which the likelihood of breast cancer changes significantly. For example, above what value of Radius_worst does the likelihood of cancer reach 95% or 100%?
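One way to look for such a cutoff is sketched below, under stated assumptions: the helper names `malignant_share_above` and `smallest_threshold` are hypothetical, and the eight rows in `toy` are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def malignant_share_above(frame, feature, threshold):
    """Share of malignant ('M') diagnoses among rows whose feature exceeds threshold."""
    above = frame[frame[feature] > threshold]
    if above.empty:
        return float("nan")
    return (above["diagnosis"] == "M").mean()

def smallest_threshold(frame, feature, target_share):
    """Smallest observed value above which the malignant share reaches target_share."""
    for value in sorted(frame[feature].unique()):
        if malignant_share_above(frame, feature, value) >= target_share:
            return value
    return None  # no observed value reaches the target share

# Made-up illustration rows (NOT taken from data.csv).
toy = pd.DataFrame({
    "radius_worst": [11.0, 12.5, 14.0, 15.5, 17.0, 19.0, 22.0, 25.0],
    "diagnosis":    ["B",  "B",  "B",  "M",  "B",  "M",  "M",  "M"],
})
cutoff = smallest_threshold(toy, "radius_worst", 0.95)
print(cutoff)
```

A cutoff found this way describes the sample it was computed on; checking it on held-out rows would show whether it generalizes.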

Collaboration:

We set up a shared Google Drive folder with Google Colab so that we could share files and easily edit and write code live. We met weekly leading up to the final deadline to divide the work and refine our project objectives. Throughout the semester, we used text messages and met in person to communicate with each other.

ETL (Extraction, Transform, and Load):

We loaded three datasets, all of which are .csv files available on the Kaggle.com website. The first dataset has information on the 32 features described above for 569 patients. The second dataset covers confirmed malignant tumor patients, both living and deceased, and has data for 4,534 patients, including tumor stage, tumor size, and patient hormone levels. The third dataset also covers confirmed malignant tumor patients, both living and deceased, and has data for 317 patients, including tumor stage and type of surgery required.

The first dataset does not have missing data for any of the features. It does, however, contain an empty "Unnamed: 32" column, which is removed right after loading below.

Filtering, Sorting, and Plotting Data Points

Below we filter the data to malignant patients, select the concave-points values of their cells, and plot them as a histogram.

#Mounts Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

! pwd

#directory contains csv file
%cd /content/drive/My Drive/DataScienceProject
!git pull


import pandas as pd
from matplotlib import pyplot as plt
from itertools import cycle, islice
pd.options.display.max_rows = 8

df = pd.read_csv("data.csv") 
#This reads the data file which is named data.csv

#Deletes unused column
del df['Unnamed: 32']

#Displays head for first dataset
df.head()

/content/drive/My Drive/DataScienceProject
/content/drive/My Drive/DataScienceProject
fatal: not a git repository (or any parent up to mount point /content)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
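The filter, sort, and histogram steps described at the start of this section can be sketched as follows. The rows in `toy` are made-up illustration values; "concave points_mean" is a column name from the dataset. NumPy's `histogram` is used here only to show the binning, and `concave_sorted.plot.hist(bins=4)` would draw the same bins when matplotlib is available.

```python
import numpy as np
import pandas as pd

# Made-up illustration rows (NOT taken from data.csv).
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "B", "M", "M"],
    "concave points_mean": [0.147, 0.020, 0.070, 0.035, 0.128, 0.105],
})

# Keep malignant rows only, then sort their concave-point values.
malignant = toy[toy["diagnosis"] == "M"]
concave_sorted = malignant["concave points_mean"].sort_values()

# Bin the sorted values the way a histogram plot would.
counts, bin_edges = np.histogram(concave_sorted, bins=4)
print(counts, bin_edges)
```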

Below are the dtypes for some of the variables in the dataset. Later on, the diagnosis column is converted from an object dtype to float64 so that it can be used as a numerical value in the correlation calculation.

df.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
                            ...   
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Length: 32, dtype: object

Below are the columns and first rows of the second dataset. The value formerly recorded as " anaplastic; Grade IV" was cleaned up to a plain number (i.e. 4) to match the other grades.

df2 = pd.read_csv("Breast_Cancer.csv") 
#This reads the second data file which is named Breast_Cancer.csv

#Replaces the one text-valued grade with its number so that all grades read "1" to "4"
def clean_grade(x):
    x = x.replace(" anaplastic; Grade IV", "4")
    return str(x)
df2['Grade'] = df2['Grade'].apply(clean_grade)
df2['Grade'].value_counts()
df2.head()

Exploratory Data Analysis (EDA)

Below is a bar chart showing the 4 grades of breast cancer, with patient outcomes plotted in green and red. Green represents patients who were alive when the data were collected, and red represents patients who had died. As can be seen, the higher a patient's grade of breast cancer prior to surgery, the greater their likelihood of dying.

#Share of patients alive/dead within each cancer grade
status_by_grade = df2.groupby('Grade')["Status"].value_counts(normalize=True)

#Colours each bar by its status (second index level): green = Alive, red = Dead
bar_colors = ['g' if status == 'Alive' else 'r' for _, status in status_by_grade.index]
chart1 = status_by_grade.plot(kind='bar', color=bar_colors, title='Patient Outcomes by Cancer Grade')

#Builds the legend by hand, since all bars belong to a single plotted series
from matplotlib.patches import Patch
chart1.legend(handles=[Patch(color='g', label='Alive'), Patch(color='r', label='Dead')],
              loc='upper right', title='Status')

chart1.set_ylabel("Proportion of patients")
chart1.set_xlabel("Cancer Grade & Status")

Text(0.5, 0, 'Cancer Grade & Status')

The histogram below shows how the number of tumors in this dataset changes with patient age. The count spikes at around age 45. This may be seen as "general knowledge", but it is consistent with the statement that tumors become more common with age.

chart2 = df2.hist(column='Age', legend = True)
plt.title('Frequency of tumors by age')
plt.xlabel('Age')
plt.ylabel('Frequency')

Text(0, 0.5, 'Frequency')

In the dataframe displayed below, we can see a clear association between tumor size and grade (or "stage") of cancer: as the grade increases, the average tumor size also increases. The chart doesn't clearly show this, as there are too few data points in grade 4 to make it visible; however, the averages in the dataframe printed below the code make it clear.

#Copying Tumor Size to another column with shorter name for ease
df2["Size"] = df2["Tumor Size"]

chart3 = df2.groupby("Grade").Size.plot.hist(alpha=.5, density=False, legend=True)
plt.title('Tumor Size by Cancer Grade')
plt.xlabel('Tumor Size')
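#numeric_only=True keeps the text columns out of the mean (newer pandas versions raise a TypeError otherwise)
print(df2.groupby("Grade").mean(numeric_only=True))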

             Age  Tumor Size  Regional Node Examined  Reginol Node Positive  \
Grade                                                                         
1      55.289134   26.364641               12.675875               3.068140   
2      54.322416   29.729051               14.387920               3.922586   
3      52.615662   33.823582               15.111611               5.154815   
4      52.315789   44.157895               14.473684               6.157895   

       Survival Months       Size  
Grade                              
1            72.937385  26.364641  
2            72.179073  29.729051  
3            68.749775  33.823582  
4            64.421053  44.157895

df3 = pd.read_csv("BRCA.csv") 
#This reads the data file which is named BRCA.csv

df3.head()
#Third dataset --> contains data on surgery type and protein concentration, both of which are new data columns

df3["Protein2"].value_counts()
#Displays the rows without null values; df3 itself is unchanged because the result is not assigned
df3.dropna()

Below are two grouped dataframes that display the status (i.e. dead or alive) of patients at the time the data were collected, based on their tumor stage at surgery. As tumor stage increased, the share of patients alive decreased, from about 84% at stage I to about 77% at stage III.

df3.groupby('Tumour_Stage')["Patient_Status"].value_counts()

Tumour_Stage  Patient_Status
I             Alive              51
              Dead               10
II            Alive             144
              Dead               38
III           Alive              60
              Dead               18
Name: Patient_Status, dtype: int64

df3.groupby('Tumour_Stage')["Patient_Status"].value_counts(normalize=True)

Tumour_Stage  Patient_Status
I             Alive             0.836066
              Dead              0.163934
II            Alive             0.791209
              Dead              0.208791
III           Alive             0.769231
              Dead              0.230769
Name: Patient_Status, dtype: float64
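The same proportions can be laid out as a stage-by-status table, which makes the death share per stage easier to compare. This is a minimal sketch that rebuilds the table from the counts printed above instead of reading BRCA.csv; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the grouped output above.
counts = {
    ("I", "Alive"): 51, ("I", "Dead"): 10,
    ("II", "Alive"): 144, ("II", "Dead"): 38,
    ("III", "Alive"): 60, ("III", "Dead"): 18,
}
table = pd.Series(counts).unstack()  # rows: tumor stage, columns: patient status
table.index.name = "Tumour_Stage"

# Row-normalise so each stage sums to 1, then pull out the share of deaths.
share = table.div(table.sum(axis=1), axis=0)
dead_share = share["Dead"].round(3)
print(dead_share)
```

On the full dataframe, `pd.crosstab(df3["Tumour_Stage"], df3["Patient_Status"], normalize="index")` should produce the same layout in one call.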

This bar chart further shows that as the stage of cancer got worse, a more invasive, more costly surgery was required. For context, a simple mastectomy (plotted in yellow) removes significant tissue from a patient's breast, and a modified radical mastectomy (plotted in red) requires complete removal of the breast and the skin that surrounds it.

#Share of each surgery type within each tumor stage
#(yellow = simple mastectomy, red = modified radical mastectomy, black = remaining types)
chart4 = df3.groupby('Tumour_Stage')["Surgery_type"].value_counts(normalize=True).plot(
    kind='bar', legend=True,
    color=['black', 'black', 'yellow', 'red',
           'black', 'red', 'yellow', 'black',
           'red', 'black', 'yellow', 'black'],
    title='Surgery Type According to Tumor Stage')
chart4.set_xlabel("Tumor Stage and Surgery Type")
chart4.set_ylabel("Proportion of surgeries")

Text(0, 0.5, 'Proportion of surgeries')

Below, we checked whether the mean protein concentrations of patients had any relation to tumor stage. Although the mean Protein2 and Protein4 concentrations both decline consistently with increasing tumor stage, it is unclear whether this association is strong enough to help predict tumor stage.

#numeric_only=True restricts the means to numeric columns and avoids the pandas nuisance-column FutureWarning
df3.mean(numeric_only=True)
df3.groupby(by=["Tumour_Stage"]).mean(numeric_only=True)
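One way to quantify such an association is a rank (Spearman) correlation between the ordered tumor stage and a protein level. The sketch below uses only the five rows shown by df3.head(), which are far too few to support a conclusion; it illustrates the calculation, which would need to be run on the full dataframe. The mapping `stage_order` is an assumption about how to encode the stages.

```python
import pandas as pd

# The five rows shown by df3.head() (Tumour_Stage and Protein2 only).
sample = pd.DataFrame({
    "Tumour_Stage": ["III", "II", "III", "II", "II"],
    "Protein2": [0.42638, 0.57807, 1.31140, -0.21147, 1.90680],
})

# Assumed encoding: treat the stage as an ordered number so it can be ranked.
stage_order = {"I": 1, "II": 2, "III": 3}
stage_num = sample["Tumour_Stage"].map(stage_order)

# Spearman correlation is the Pearson correlation of the ranks (ties get average ranks).
rho = stage_num.rank().corr(sample["Protein2"].rank())
print(rho)
```

A value near -1 on the full data would support the declining pattern seen in the group means, while a value near 0 would suggest the pattern is weak at the level of individual patients.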

From the first dataframe df (shown again below), we related "radius_worst" (the largest radius values recorded for a patient; in this dataset, the mean of the three largest measurements) to "radius_mean" (the mean radius for that patient) and "radius_se" (the standard error of the radius measurements) to find how many standard errors above the mean radius_worst must lie before nearly all of the diagnoses are malignant. This step helps guide our search for questions. Below, it can be seen that as the multiplier (5, 7, 9, 11) increases, the percentage of malignant diagnoses increases. This may show that if a patient's largest measurements are significantly larger than the mean, the tumor is likely malignant.

df.head()

my_colors = list(islice(cycle(['y', 'g']), None, len(df)))


rd = df[df["radius_worst"] > (df["radius_mean"] + 11 * df["radius_se"])]
rd["diagnosis"].value_counts()
rd["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad36e9610>

rd2 = df[df["radius_worst"] > (df["radius_mean"] + 9 * df["radius_se"])]
rd2["diagnosis"].value_counts()
rd2["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad36b1fd0>

rd3 = df[df["radius_worst"] > (df["radius_mean"] + 7 * df["radius_se"])]
rd3["diagnosis"].value_counts()
rd3["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad36efdf0>

rd4 = df[df["radius_worst"] > (df["radius_mean"] + 5 * df["radius_se"])]
rd4["diagnosis"].value_counts()
rd4["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4ad3606880>
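The four filters above differ only in the multiplier, so the comparison can be sketched as one helper applied in a loop. The name `malignant_share` is hypothetical and the rows in `toy` are made up for illustration; the sketch assumes diagnosis is still coded as "M"/"B".

```python
import pandas as pd

def malignant_share(frame, multiplier):
    """Share of 'M' rows among those with radius_worst > radius_mean + multiplier * radius_se."""
    mask = frame["radius_worst"] > frame["radius_mean"] + multiplier * frame["radius_se"]
    selected = frame.loc[mask, "diagnosis"]
    return (selected == "M").mean() if len(selected) else float("nan")

# Made-up illustration rows (NOT taken from data.csv).
toy = pd.DataFrame({
    "diagnosis":    ["M",  "M",  "B",  "B",  "M"],
    "radius_mean":  [18.0, 20.0, 12.0, 13.0, 15.0],
    "radius_se":    [1.0,  0.8,  0.3,  0.4,  0.5],
    "radius_worst": [30.0, 26.0, 14.0, 15.5, 19.0],
})

# Malignant share for each multiplier used in the cells above.
shares = {k: malignant_share(toy, k) for k in (5, 7, 9, 11)}
print(shares)
```

Reporting the share alongside the number of rows that pass each filter would show how much data each percentage rests on.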

K-Nearest Neighbors Method

The code below uses the K-Nearest Neighbors (kNN) classifier to predict whether the diagnosis of the tumor is malignant or benign. The end result is the prediction accuracy averaged over a two-fold cross-validation. We would like to perform more analysis to see how we can use k-nearest neighbors to fully understand the relationship between the training variables, including mean radius, texture, perimeter, smoothness, etc., and the target value, the diagnosis.

import numpy as np

# Split the rows into two disjoint halves (no random seed is set, so the split and the scores vary between runs)
train = df.sample(frac=.5)
val = df.drop(train.index)

# Features used for prediction
features = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean",
            "smoothness_mean", "compactness_mean", "concavity_mean"]

X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")

y_train = train["diagnosis"]
y_val = val["diagnosis"]

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

def get_val_accuracy(X_train_dict, y_train, X_val_dict, y_val):
    
    # turn the list of feature dicts into a numeric array (all features here are already numeric)
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)

    # standardize the data using statistics from the training half only
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    
    # Fit a 7-nearest neighbors model.
    model = KNeighborsClassifier(n_neighbors=7)
    model.fit(X_train_sc, y_train)

    # Accuracy (share of correct predictions) on the validation set.
    return model.score(X_val_sc, y_val)

# Two-fold cross-validation: train on one half and score on the other, then swap
val1 = get_val_accuracy(X_train_dict, y_train, X_val_dict, y_val)
val2 = get_val_accuracy(X_val_dict, y_val, X_train_dict, y_train)

print("Test Score:",val1)
print("Test Score:",val2)

(val1+val2)/2

Test Score: 0.9157894736842105
Test Score: 0.9295774647887324

0.9226834692364714

As can be seen above, our kNN classifier appears to be a strong predictor. Averaged over the two folds, it correctly predicts about 92% of diagnoses from the 7 independent variables: radius, texture, perimeter, area, smoothness, compactness, and concavity. With such a high value, breast cancer detection from these measurements appears possible with great accuracy. Moreover, with the Canadian Medical Association Journal citing a combined ultrasound and mammography detection rate of 37.5%, it appears that taking simple measurements of breast tumor metrics (via an MRI scan) may offer a higher detection possibility.
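The two-fold evaluation can also be written with a scikit-learn pipeline, which keeps the scaling step inside each fold. This is a sketch under a stated assumption: scikit-learn's bundled Wisconsin diagnostic dataset (`load_breast_cancer`) stands in for data.csv, and its first seven columns are taken to correspond to the seven mean features used above. The fixed `random_state` is added here only to make the sketch repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data.csv.
data = load_breast_cancer()
X = data.data[:, :7]  # mean radius, texture, perimeter, area, smoothness, compactness, concavity
y = data.target

# The scaler is refit on the training half of every fold before the 7-NN model is trained.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds)  # one accuracy value per fold
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```

Raising `n_splits` to 5 or 10 would give a steadier accuracy estimate than a single pair of halves.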

Below, the correlation between mean radius and diagnosis is found to be 0.730, indicating a strong positive relationship between the two. This supports the above findings on diagnosis.

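# Encode the diagnosis numerically: benign (B) -> 0.0, malignant (M) -> 1.0
def transform(x):
    x = x.replace("B", "0").replace("M", "1")
    return float(x)
df['diagnosis'] = df['diagnosis'].apply(transform)

# Pearson correlation between mean radius and the encoded diagnosis
x = df["radius_mean"]
y = df['diagnosis']
r = np.corrcoef(x, y)
r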

array([[1.        , 0.73002851],
       [0.73002851, 1.        ]])

Project Overview

We created two major predictors to find out how age, tumor size, and other common tumor variables relate to tumor diagnosis and to the likelihood of more extensive surgery types.

Firstly, using the mean radius and the standard error of the radius as independent variables, we predicted whether a tumor is malignant or benign: the simple rule shown above indicates that the likelihood of a malignant diagnosis increases as radius_worst exceeds the mean radius by more standard errors.
Secondly, using a k-nearest neighbors classifier, we found the relationship between the chosen training variables (mean radius, texture, perimeter, smoothness, etc.) and the predicted value, the diagnosis. Our model accurately predicted the diagnosis, with an average accuracy of about 92%.

Project Conclusion:

Tumor size shows a strong correlation with a malignant diagnosis. Moreover, malignant tumors generally progress steadily to higher stages of cancer, which, if not detected in time, require more extensive, costly surgeries and have significantly lower survival rates.

%%shell
#pwd
jupyter nbconvert --to html ///content/drive/MyDrive/DataScienceProject/FinalMilestone.ipynb

[NbConvertApp] Converting notebook ///content/drive/MyDrive/DataScienceProject/FinalMilestone.ipynb to html
[NbConvertApp] Writing 461956 bytes to ///content/drive/MyDrive/DataScienceProject/FinalMilestone.html

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
0	842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	...	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	...	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	...	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	...	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	...	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

	Patient_ID	Age	Gender	Protein1	Protein2	Protein3	Protein4	Tumour_Stage	Histology	ER status	PR status	HER2 status	Surgery_type	Date_of_Surgery	Date_of_Last_Visit	Patient_Status
0	TCGA-D8-A1XD	36.0	FEMALE	0.080353	0.42638	0.54715	0.273680	III	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Modified Radical Mastectomy	15-Jan-17	19-Jun-17	Alive
1	TCGA-EW-A1OX	43.0	FEMALE	-0.420320	0.57807	0.61447	-0.031505	II	Mucinous Carcinoma	Positive	Positive	Negative	Lumpectomy	26-Apr-17	09-Nov-18	Dead
2	TCGA-A8-A079	69.0	FEMALE	0.213980	1.31140	-0.32747	-0.234260	III	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Other	08-Sep-17	09-Jun-18	Alive
3	TCGA-D8-A1XR	56.0	FEMALE	0.345090	-0.21147	-0.19304	0.124270	II	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Modified Radical Mastectomy	25-Jan-17	12-Jul-17	Alive
4	TCGA-BH-A0BF	56.0	FEMALE	0.221550	1.90680	0.52045	-0.311990	II	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Other	06-May-17	27-Jun-19	Dead

	Patient_ID	Age	Gender	Protein1	Protein2	Protein3	Protein4	Tumour_Stage	Histology	ER status	PR status	HER2 status	Surgery_type	Date_of_Surgery	Date_of_Last_Visit	Patient_Status
0	TCGA-D8-A1XD	36.0	FEMALE	0.080353	0.42638	0.54715	0.273680	III	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Modified Radical Mastectomy	15-Jan-17	19-Jun-17	Alive
1	TCGA-EW-A1OX	43.0	FEMALE	-0.420320	0.57807	0.61447	-0.031505	II	Mucinous Carcinoma	Positive	Positive	Negative	Lumpectomy	26-Apr-17	09-Nov-18	Dead
2	TCGA-A8-A079	69.0	FEMALE	0.213980	1.31140	-0.32747	-0.234260	III	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Other	08-Sep-17	09-Jun-18	Alive
3	TCGA-D8-A1XR	56.0	FEMALE	0.345090	-0.21147	-0.19304	0.124270	II	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Modified Radical Mastectomy	25-Jan-17	12-Jul-17	Alive
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
330	TCGA-A8-A085	44.0	MALE	0.732720	1.11170	-0.26952	-0.354920	II	Infiltrating Lobular Carcinoma	Positive	Positive	Negative	Other	01-Nov-19	04-Mar-20	Dead
331	TCGA-A1-A0SG	61.0	FEMALE	-0.719470	2.54850	-0.15024	0.339680	II	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Lumpectomy	11-Nov-19	18-Jan-21	Dead
332	TCGA-A2-A0EU	79.0	FEMALE	0.479400	2.05590	-0.53136	-0.188480	I	Infiltrating Ductal Carcinoma	Positive	Positive	Positive	Lumpectomy	21-Nov-19	19-Feb-21	Dead
333	TCGA-B6-A40B	76.0	FEMALE	-0.244270	0.92556	-0.41823	-0.067848	I	Infiltrating Ductal Carcinoma	Positive	Positive	Negative	Lumpectomy	11-Nov-19	05-Jan-21	Dead

Tumour_Stage	Age	Protein1	Protein2	Protein3	Protein4
I	62.359375	-0.014430	1.001318	-0.165147	0.037828
II	59.052910	-0.007734	0.964763	-0.065409	0.018023
III	55.753086	-0.094220	0.862207	-0.088845	-0.031453

	Age	Race	Marital Status	T Stage	N Stage	6th Stage	differentiate	Grade	A Stage	Tumor Size	Estrogen Status	Progesterone Status	Regional Node Examined	Reginol Node Positive	Survival Months	Status
0	68	White	Married	T1	N1	IIA	Poorly differentiated	3	Regional	4	Positive	Positive	24	1	60	Alive
1	50	White	Married	T2	N2	IIIA	Moderately differentiated	2	Regional	35	Positive	Positive	14	5	62	Alive
2	58	White	Divorced	T3	N3	IIIC	Moderately differentiated	2	Regional	63	Positive	Positive	14	7	75	Alive
3	58	White	Married	T1	N1	IIA	Poorly differentiated	3	Regional	18	Positive	Positive	2	1	84	Alive
4	47	White	Married	T2	N1	IIB	Poorly differentiated	3	Regional	41	Positive	Positive	3	1	50	Alive