Bharat Solanky and Aadya Chawla
More than 200,000 women are diagnosed with breast cancer every year. It is the most common form of cancer for women and is responsible for the second highest number of cancer deaths among women in the United States. We want to better understand the properties of tumor cells and how age and tumor size affect the likelihood of being diagnosed with breast cancer. This matters because a study by Orlando Health showed that 22% of women aged 35 to 44 have never had a mammogram and “have no plans to get one”. We hope our work gives a clearer picture of how serious this issue is and provides a reason to spread more awareness of the subject.
The goals of this project were to produce a predictive model that determines the likelihood of breast cancer incidence in female patients using physical metrics, such as concavity, radius, and texture of perceived benign tumors. We then wanted to understand how certain hormonal protein levels can be used to indicate what type of surgery is required for women who have malignant tumors.
The datasets we worked with are from Kaggle.com [1, 2, 3]. As background, medical journals report that the accuracy of visually diagnosed breast FNA is about 94.3%, with a mean sensitivity of 91% and a specificity of 87%. The dataset is randomly divided into two disjoint subgroups, one to train the prediction model and the other to test the accuracy of the developed models. Based on preliminary analysis, the three features most strongly associated with the diagnosis of breast cancer are the following (the description of variables is below):
Our goal was also to provide guidelines that assist medical professionals with visual diagnosis (prediction) by identifying extreme values above or below which the likelihood of breast cancer changes significantly. For example, at what value of radius_worst does the likelihood of cancer reach 95% or 100%?
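One way to look for such a cutoff is to scan the observed values and, for each candidate, compute the share of malignant cases among patients whose measurement exceeds it. The following is a minimal sketch of that idea, not the method used later in this notebook. The helper names `malignant_share_above` and `smallest_threshold` and the toy table are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

def malignant_share_above(frame, column, threshold, label_col="diagnosis"):
    """Share of rows labelled 'M' among rows where `column` > threshold."""
    subset = frame[frame[column] > threshold]
    if subset.empty:
        return float("nan")
    return (subset[label_col] == "M").mean()

def smallest_threshold(frame, column, target, label_col="diagnosis"):
    """Smallest observed value t whose malignant share above t is >= target."""
    for t in sorted(frame[column].unique()):
        share = malignant_share_above(frame, column, t, label_col)
        # share == share is False only for NaN (no rows above t)
        if share == share and share >= target:
            return t
    return None

# Toy data for illustration only (not the Kaggle dataset).
toy = pd.DataFrame({
    "radius_worst": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0],
    "diagnosis":    ["B",  "B",  "B",  "M",  "B",  "M",  "M",  "M"],
})

cutoff = smallest_threshold(toy, "radius_worst", 0.95)
print(cutoff)
```

On a real dataset the result is sensitive to how few patients lie above the cutoff, so it is worth reporting the group size alongside the share.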
We created a shared Google Colab drive so that we could share files and edit and write code live. We met weekly leading up to the final deadline to divide the work and refine our project objectives. Throughout the semester, we communicated by text message and in person.
We loaded three datasets, all .csv files available on the Kaggle.com website. The first dataset has information on the 32 features described above for 569 patients. The second dataset covers confirmed malignant tumor patients, both living and deceased, and has data for 4,534 patients, including tumor stage, tumor size, and patient hormone levels. The third dataset also covers confirmed malignant tumor patients, both living and deceased, and has data for 317 patients, including tumor stage and type of surgery required.
The first download does not have any missing data for any of the features. It does however have "unnamed data", which was promptly removed after loading it below.
Below we have filtered data by malignant patient type, sorted to only show concavity points of their cells, and plotted it as a histogram.
from google.colab import drive
drive.mount('/content/drive')
#mounting Google Drive in Google Colab
! pwd
#directory contains csv file
%cd /content/drive/My Drive/DataScienceProject
!git pull
import pandas as pd
from matplotlib import pyplot as plt
from itertools import cycle, islice
pd.options.display.max_rows = 8
df = pd.read_csv("data.csv")
#This reads the data file which is named data.csv
#Deletes unused column
del df['Unnamed: 32']
#Displays head for first dataset
df.head()
Below are the dtypes of the variables in the dataset. Later on, diagnosis is converted from an object dtype to float64 so that it can be used as a numerical value in the correlation calculation.
df.dtypes
The columns and first rows of the second dataset are shown below. The grade formerly recorded as " anaplastic; Grade IV" was cleaned up to a plain number (i.e. 4) to match the other grades.
df2 = pd.read_csv("Breast_Cancer.csv")
#This reads the second data file which is named Breast_Cancer.csv
def transform(x):
    # Map the one non-numeric grade label onto its number
    x = x.replace(" anaplastic; Grade IV", "4")
    return str(x)
df2['Grade'] = df2['Grade'].apply(transform)
df2['Grade'].value_counts()
df2.head()
Below is a bar chart showing the 4 grades of breast cancer, with patient outcomes plotted in green and red. Green marks patients who were alive when the data was collected and red marks patients who had died. As can be seen, the higher a patient's grade of breast cancer prior to surgery, the greater their likelihood of dying.
#Alternating green/red colors, one per row of the second dataset
my_colors = list(islice(cycle(['g', 'r']), None, len(df2)))
df2.groupby('Grade')["Status"].value_counts(normalize=True)
df2.groupby('Grade')["Status"].value_counts(normalize=True).plot(kind='bar', legend = True, color=my_colors)
plt.legend(my_colors, loc='upper right', title='Status')
chart1 = df2.groupby('Grade')["Status"].value_counts(normalize=True).plot(kind='bar', legend = True, color=my_colors, title = 'Patient Outcomes by Cancer Grade')
plt.legend(['Alive', 'Dead'],
loc='upper right', title='Status')
#value_counts(normalize=True) yields proportions within each grade, not totals
chart1.set_ylabel("Proportion of patients")
chart1.set_xlabel("Cancer Grade & Status")
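The same within-grade proportions can also be read from a normalized cross-tabulation, which keeps one row per grade and one column per status. Below is a small sketch using `pandas.crosstab`. The rows are made up for illustration, and only the column names `Grade` and `Status` mirror the dataset.

```python
import pandas as pd

# Toy rows for illustration only (not the Kaggle data).
toy = pd.DataFrame({
    "Grade":  ["1", "1", "1", "1", "2", "2", "2", "2", "3", "3"],
    "Status": ["Alive", "Alive", "Alive", "Dead",
               "Alive", "Alive", "Dead", "Dead",
               "Alive", "Dead"],
})

# normalize="index" divides each row by its total, so every grade sums to 1.
outcome_share = pd.crosstab(toy["Grade"], toy["Status"], normalize="index")
print(outcome_share)
```

Calling `outcome_share.plot(kind="bar", stacked=True)` on such a table gives one bar per grade, with a legend entry per status taken from the column names.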
The histogram below shows how the number of tumors increases with age. The number of tumors spikes at age 45 in this graph. This may be seen as "general knowledge", but it supports the statement that tumors become more common with age.
chart2 = df2.hist(column='Age', legend = True)
plt.title('Frequency of tumors by age')
plt.xlabel('Age')
plt.ylabel('Frequency')
In the dataframe displayed below, we can see a clear association between tumor size and grade of cancer. As the grade or "stage" of cancer increases, tumor size tends to increase as well. The chart does not show this clearly, as there are too few data points in the grade 4 category to be visible. However, the averages in the dataframe make it clear.
#Copying Tumor Size to another column with shorter name for ease
df2["Size"] = df2["Tumor Size"]
chart3 = df2.groupby("Grade").Size.plot.hist(alpha=.5, density=False, legend=True)
plt.title('Tumor Size by Cancer Grade')
plt.xlabel('Tumor Size')
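#numeric_only=True skips text columns, which would otherwise raise an error in recent pandas
print(df2.groupby("Grade").mean(numeric_only=True))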
df3 = pd.read_csv("BRCA.csv")
#This reads the data file which is named BRCA.csv
df3.head()
#Third dataset --> contains data on surgery type and protein concentration, both of which are new data columns
df3["Protein2"].value_counts()
#Drops rows with null values (dropna returns a new dataframe, so the result is assigned back)
df3 = df3.dropna()
Below are two grouped dataframes that display the status (i.e. dead or alive) of patients at the time the data was collected, based on their tumor stage during surgery. As tumor stage increased, the chance of survival after surgery decreased.
df3.groupby('Tumour_Stage')["Patient_Status"].value_counts()
df3.groupby('Tumour_Stage')["Patient_Status"].value_counts(normalize=True)
This bar chart further shows that as the stage of cancer got worse, a more invasive, more costly surgery was required. For context, a simple mastectomy (plotted in yellow) requires removal of significant tissue from a patient's breast, and a modified radical mastectomy (plotted in red) requires complete removal of the breast and the skin that surrounds it.
tumor_c = list(islice(cycle(['b', 'y', 'r', 'g']), None, len(df)))
chart4 = df3.groupby('Tumour_Stage')["Surgery_type"].value_counts(normalize=True).plot(kind='bar', legend=True,
color = ['black', 'black', 'yellow', 'red',
'black', 'red', 'yellow','black',
'red','black','yellow','black'],
title = 'Surgery Type According to Tumor Stage')
chart4.set_xlabel("Tumor Stage and Surgery Type")
#value_counts(normalize=True) yields proportions within each stage, not counts
chart4.set_ylabel("Proportion of surgeries")
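The hand-written color list above depends on the order in which the grouped index happens to list surgery types. One alternative is to pivot the surgery types into columns, so that each type is a single column and gets a single color when plotted. This is a sketch on made-up rows. The values are hypothetical, and only the column names follow the dataset.

```python
import pandas as pd

# Toy rows for illustration only (not the Kaggle data).
toy = pd.DataFrame({
    "Tumour_Stage": ["I", "I", "I", "II", "II", "II", "II", "III", "III"],
    "Surgery_type": ["Lumpectomy", "Lumpectomy", "Other",
                     "Simple Mastectomy", "Lumpectomy", "Other",
                     "Modified Radical Mastectomy",
                     "Modified Radical Mastectomy",
                     "Modified Radical Mastectomy"],
})

# Proportion of each surgery type within a stage, one column per surgery type.
surgery_share = (
    toy.groupby("Tumour_Stage")["Surgery_type"]
       .value_counts(normalize=True)
       .unstack(fill_value=0)
)
print(surgery_share)
```

Plotting `surgery_share` with `kind="bar"` then assigns colors per column, which avoids keeping a color list in sync with the index order.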
Below, we checked whether the mean protein concentrations of patients had any relation to tumor stage. Although protein 2 and protein 4 concentrations both decline consistently with increasing tumor stage, it is unclear whether this association is useful for predicting tumor stage.
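#numeric_only=True restricts the means to numeric columns such as the protein levels
df3.mean(numeric_only=True)
df3.groupby(by=["Tumour_Stage"]).mean(numeric_only=True)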
From the first dataframe df (shown again below), we related "radius_worst" (a variable describing the radius of the largest tumor present in a breast cancer patient) to "radius_mean" (the mean radius for that patient) and "radius_se" (the standard error of the radius for that patient) to find how many standard errors above the mean radius_worst must lie before nearly all diagnoses are malignant. This step helps guide our later questions. Below, as the multiplier (5, 7, 9, 11) increases, the percentage of malignant diagnoses increases. This may show that if one of a patient's tumors is significantly larger than the mean, that tumor is likely malignant.
df.head()
my_colors = list(islice(cycle(['y', 'g']), None, len(df)))
rd = df[df["radius_worst"] > (df["radius_mean"] + 11 * df["radius_se"])]
rd["diagnosis"].value_counts()
rd["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
rd2 = df[df["radius_worst"] > (df["radius_mean"] + 9 * df["radius_se"])]
rd2["diagnosis"].value_counts()
rd2["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
rd3 = df[df["radius_worst"] > (df["radius_mean"] + 7 * df["radius_se"])]
rd3["diagnosis"].value_counts()
rd3["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
rd4 = df[df["radius_worst"] > (df["radius_mean"] + 5 * df["radius_se"])]
rd4["diagnosis"].value_counts()
rd4["diagnosis"].value_counts().plot(kind='bar', legend = True, color=my_colors)
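The four filters above differ only in the multiplier, so the comparison can also be written as a loop that reports the group size and the malignant share for each value. This is a sketch on toy data. The numbers are made up, the helper `malignant_share` is hypothetical, and only the column names come from the dataset.

```python
import pandas as pd

# Toy data for illustration only (not the Kaggle dataset).
toy = pd.DataFrame({
    "radius_mean":  [12.0, 13.0, 14.0, 18.0, 20.0, 21.0],
    "radius_se":    [0.3,  0.4,  0.5,  0.5,  0.6,  0.7],
    "radius_worst": [13.0, 15.5, 18.0, 23.0, 27.0, 29.0],
    "diagnosis":    ["B",  "B",  "M",  "M",  "M",  "M"],
})

def malignant_share(frame, multiplier):
    """Rows whose radius_worst exceeds radius_mean + multiplier * radius_se."""
    mask = frame["radius_worst"] > frame["radius_mean"] + multiplier * frame["radius_se"]
    selected = frame[mask]
    # Return the group size with the share, since small groups are less reliable.
    return len(selected), (selected["diagnosis"] == "M").mean()

shares = {k: malignant_share(toy, k) for k in (5, 7, 9, 11)}
for k, (n_rows, share) in shares.items():
    print(k, n_rows, share)
```

Reporting the group size next to the share makes it easier to tell whether a 100% malignant share rests on many patients or only a handful.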
K-Nearest Neighbors Method
The code below uses the K-Nearest Neighbors (KNN) classifier to predict whether a tumor's diagnosis is malignant or benign. The result is the classification accuracy on the held-out half of the data, averaged over the two ways of swapping the training and validation halves. We would like to perform more analysis to see how k-nearest neighbors can help us understand the relationship between the training variables, including mean radius, texture, perimeter, smoothness, etc., and the target value, the diagnosis.
import numpy as np
import matplotlib.pyplot as plt
train = df.sample(frac=.5)
val = df.drop(train.index)
# Features used for prediction
features = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean"]
X_train_dict = train[features].to_dict(orient="records")
X_val_dict = val[features].to_dict(orient="records")
y_train = train["diagnosis"]
y_val = val["diagnosis"]
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
def get_val_error(X_train_dict, y_train, X_val_dict, y_val):
    # convert the feature dicts to a numeric array (categorical values would become dummy variables)
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_val = vec.transform(X_val_dict)
    # standardize the data using statistics from the training half only
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_val_sc = scaler.transform(X_val)
    # Fit a 7-nearest neighbors model.
    model = KNeighborsClassifier(n_neighbors=7)
    model.fit(X_train_sc, y_train)
    # Make predictions on the validation set.
    y_pred = model.predict(X_val_sc)
    # Return the accuracy (despite the function name, this is a score, not an error)
    return model.score(X_val_sc, y_val)
val1 = get_val_error(X_train_dict, y_train, X_val_dict, y_val)
val2 = get_val_error(X_val_dict, y_val, X_train_dict, y_train)
print("Test Score (train on first half, validate on second):",val1)
print("Test Score (train on second half, validate on first):",val2)
(val1+val2)/2
As can be seen above, our KNN classifier appears to be an excellent predictor. It correctly predicts about 93% of diagnoses from the 7 independent variables: radius, texture, perimeter, area, smoothness, compactness, and concavity. With such a high value, it appears that breast cancer detection is possible with great accuracy. Moreover, with the Canadian Medical Association Journal citing a combined ultrasound and mammography detection rate of 37.5%, taking simple measurements of breast tumor metrics (via an MRI scan) appears to offer a higher detection rate.
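The two-way swap used above is a manual form of 2-fold cross-validation. The same idea can be sketched with scikit-learn's `make_pipeline` and `cross_val_score`, which refit the scaler inside each fold. This sketch is an alternative formulation, not the code that produced the score above. It uses synthetic data from `make_classification` as a stand-in for the seven features, so the number it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7 tumor features (assumption, not real data).
X, y = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)

# Scaling and the 7-nearest-neighbors classifier are bundled so that
# each fold standardizes with its own training statistics only.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

# 5-fold cross-validated accuracy, one score per fold.
scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Using more folds than two gives each model more training data and a less noisy average, at the cost of fitting the model more times.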
Below, the correlation between mean radius and diagnosis is found to be .730, indicating a strong positive relationship between the two. This supports the findings on diagnosis above.
def transform(x):
    # Encode benign as 0 and malignant as 1 so diagnosis can be used numerically
    x = x.replace("B", "0").replace("M", "1")
    return float(x)
df['diagnosis'] = df['diagnosis'].apply(transform)
x = df["radius_mean"]
y = df['diagnosis']
a = df["area_mean"]
r = np.corrcoef(x, y)
r
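With a 0/1 outcome, the Pearson coefficient returned by `np.corrcoef` is the point-biserial correlation. The function returns a 2x2 matrix, and the value of interest is the off-diagonal entry. A short sketch with made-up numbers (the arrays are hypothetical, not taken from the dataset):

```python
import numpy as np

# Toy values for illustration only: a measurement and a 0/1 label.
radius = np.array([11.0, 12.5, 13.0, 14.0, 17.5, 19.0, 20.5, 22.0])
label = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

corr_matrix = np.corrcoef(radius, label)

# Diagonal entries are 1 (each variable with itself);
# the off-diagonal entry is the correlation between the two.
r_value = corr_matrix[0, 1]
print(r_value)
```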
Project Overview
We created 2 major predictors to find out how age, tumor size, and other common tumor variables affect tumor diagnosis and the likelihood of needing more extensive types of surgery.
Tumor size shows a strong correlation with a malignant diagnosis. Moreover, malignant tumors generally progress steadily to higher stages of cancer, which, if not detected in time, require more extensive, costly surgeries and have significantly lower survival rates.
%%shell
#pwd
jupyter nbconvert --to html ///content/drive/MyDrive/DataScienceProject/FinalMilestone.ipynb