Methods
Gene Fusions identification and differential expression analysis
We employed two Gene Fusions(GFs) detection tools, including arriba, STAR-Fusion, to identify, parse, and annotate GFs junctions within each sample, using the human genome version hg38 as the reference. In our pursuit of biologically robust GFs identification within each cohort, we undertook an approach metafusion that amalgamated results from two GFs detection tools. Firstly, we exclusively retained Fusions located on chromosomes 1 to 22, as well as the X and Y chromosomes. Secondly, GFs were mandated to exhibit genomic overlap with gene regions as defined in the reference genome annotation file, specifically “gencode.v34.annotation.gtf.”. Thirdly to ensure the precision and recall in GFs prediction with different tools, one can either rely on consensus predictions or all predctions and choosing junction-crossing reads and reads mates which flank (span) the breakpoint. Fourthly, we offer FusionAnnotator to identify and prioritize GFs that have been previously reported in cancer or normal samples. Lastly, the detected data can be divided into Cis-SAGes and other categories to further explore the debate on what should be considered true fusion RNAs. This meticulous methodological approach was instrumental in ensuring the acquisition of biologically sound and reliable GFs datasets, thereby fortifying the integrity of our analysis within each cohort. This analysis facilitated comparisons between patient groups, such as those who responded to checkpoint immunotherapy versus non-responders, as well as between samples collected before and during ICB treatment.
Response and Response2
We gathered and processed patient ‘Response’ data from various cohorts. However, due to inconsistencies among these cohorts, such as the usage of different terms like CR, PR, SD, PD, R, NR, DCB, and NDB, we harmonized the response variable based on the timing of treatment and treatment response. As a result, we introduced a new variable called ‘Response2’ with four distinct categories: Pre-R, Pre-NR, On-R, and On-NR. In cases where response data was not available, we assigned the category NE.
Cancer Type Abbreviation
Abbr. | Description |
---|---|
BLCA | Bladder urothelial carcinoma |
SKCM | Skin cutaneous melanoma |
KIRC & CCRCC | Kidney renal clear cell carcinoma |
HNSC & HNSCC | Head and neck squamous cell carcinoma |
NSCLC | Non-small cell lung cancer, including lung adenocarcinoma and lung squamous cell carcinoma |
SCLC | Small cell lung cancer |
SARC | Sarcoma |
BRCA | Breast cancer |
GBM | Glioblastoma Multiforme |
STAD | Stomach Adenocarcinoma |
SGC | Salivary gland cancer |
LUSC | Lung Squamous Cell Carcinoma |
MESO | Mesothelioma |
PCPG | Pheochromocytoma and Paraganglioma |
LIHC | Liver Hepatocellular Carcinoma |
PRAD | Prostate Adenocarcinoma |
ACC | Adrenocortical Carcinoma |
TGCT | Testicular Germ Cell Tumors |
PAAD | Pancreatic Adenocarcinoma |
UCEC | Uterine Corpus Endometrial Carcinoma |
THCA | Thyroid carcinoma |
CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma |
ESCA | Esophageal carcinoma |
READ | Rectum adenocarcinoma |
OV | Ovarian Serous Cystadenocarcinoma |
CHOL | Cholangiocarcinoma |
UCS | Uterine Carcinosarcoma |
LUAD | Lung Adenocarcinoma |
LGG | Brain Lower Grade Glioma |
HGSC | High-Grade Serous Carcinoma |
LSCC & LUSC | Lung Squamous Cell Carcinoma |
PDAC | Pancreatic Ductal Adenocarcinoma |
THYM | Thymoma |
KICH | Kidney Chromophobe |
KIRP | Kidney Renal Papillary Cell Carcinoma |
COAD | Colon Adenocarcinoma |
LAML & AML | Acute Myeloid Leukemia |
NBL | Neuroblastoma |
GNB | Ganglioneuroblastoma |
ImmunoFusion cohort naming maping
Cancer | Sample_type | Project_id | Immunofusion-cohort |
---|---|---|---|
BRCA | Primary Tumor | CPTAC-2 | CPTAC-BRCA |
COAD | Primary Tumor | CPTAC-2 | CPTAC-COAD |
HGSC | Primary Tumor | CPTAC-2 | CPTAC-OV |
LSCC | Primary Tumor | CPTAC-3 | CPTAC-LUSC |
UCEC | Primary Tumor | CPTAC-3 | CPTAC-UCEC |
LUAD | Primary Tumor | CPTAC-3 | CPTAC-LUAD |
GBM | Primary Tumor | CPTAC-3 | CPTAC-GBM |
HNSCC | Primary Tumor | CPTAC-3 | CPTAC-HNSC |
CCRCC | Primary Tumor | CPTAC-3 | CPTAC-KIRC |
PDAC | Primary Tumor | CPTAC-3 | CPTAC-PAAD |
PDAC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
LUAD | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
UCEC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
CCRCC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
HNSCC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
LSCC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
GTEX-Brain | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
– | Primary Blood Derived Cancer - Peripheral Blood/Recurrent Blood Derived Cancer - Peripheral Blood/Next Generation Cancer Model/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Bone Marrow | TARGET-AML | TARGET-LAML |
Acute myeloid leukemia, NOS | Primary Blood Derived Cancer - Bone Marrow/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Bone Marrow/Post-treatment/Recurrent Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Peripheral Blood | TARGET-AML | TARGET-LAML |
Neuroblastoma, NOS | Primary Tumor/Recurrent Tumor/Recurrent Blood Derived Cancer - Bone Marrow | TARGET-NBL | TARGET-NBL |
Ganglioneuroblastoma | Primary Tumor/Recurrent Tumor | TARGET-NBL | TARGET-GNB |
– | Cell Lines | TARGET-AML | TARGET-CELL |
– | Bone Marrow Normal | TARGET-AML | TARGET-NORMAL |
Acute myeloid leukemia, NOS | Blood Derived Normal/Bone Marrow Normal | TARGET-AML | TARGET-NORMAL |
ACC | Primary Tumor | TCGA-ACC | TCGA-ACC |
BLCA | Primary Tumor | TCGA-BLCA | TCGA-BLCA |
BRCA | Primary Tumor | TCGA-BRCA | TCGA-BRCA |
BRCA | Metastatic | TCGA-BRCA | TCGA-BRCA |
CESC | Primary Tumor | TCGA-CESC | TCGA-CESC |
CESC | Metastatic | TCGA-CESC | TCGA-CESC |
CHOL | Primary Tumor | TCGA-CHOL | TCGA-CHOL |
COAD | Primary Tumor | TCGA-COAD | TCGA-COAD |
COAD | Recurrent Tumor | TCGA-COAD | TCGA-COAD |
COAD | Metastatic | TCGA-COAD | TCGA-COAD |
DLBC | Primary Tumor | TCGA-DLBC | TCGA-DLBC |
ESCA | Primary Tumor | TCGA-ESCA | TCGA-ESCA |
ESCA | Metastatic | TCGA-ESCA | TCGA-ESCA |
GBM | Primary Tumor | TCGA-GBM | TCGA-GBM |
GBM | Recurrent Tumor | TCGA-GBM | TCGA-GBM |
HNSC | Primary Tumor | TCGA-HNSC | TCGA-HNSC |
HNSC | Metastatic | TCGA-HNSC | TCGA-HNSC |
KICH | Primary Tumor | TCGA-KICH | TCGA-KICH |
KIRC | Primary Tumor | TCGA-KIRC | TCGA-KIRC |
KIRC | Additional - New Primary | TCGA-KIRC | TCGA-KIRC |
KIRP | Primary Tumor | TCGA-KIRP | TCGA-KIRP |
KIRP | Additional - New Primary | TCGA-KIRP | TCGA-KIRP |
LAML | Primary Blood Derived Cancer - Peripheral Blood | TCGA-LAML | TCGA-LAML |
LGG | Primary Tumor | TCGA-LGG | TCGA-LGG |
LGG | Recurrent Tumor | TCGA-LGG | TCGA-LGG |
LIHC | Primary Tumor | TCGA-LIHC | TCGA-LIHC |
LIHC | Recurrent Tumor | TCGA-LIHC | TCGA-LIHC |
LUAD | Primary Tumor | TCGA-LUAD | TCGA-LUAD |
LUAD | Recurrent Tumor | TCGA-LUAD | TCGA-LUAD |
LUSC | Primary Tumor | TCGA-LUSC | TCGA-LUSC |
MESO | Primary Tumor | TCGA-MESO | TCGA-MESO |
OV | Primary Tumor | TCGA-OV | TCGA-OV |
OV | Recurrent Tumor | TCGA-OV | TCGA-OV |
PAAD | Primary Tumor | TCGA-PAAD | TCGA-PAAD |
PAAD | Metastatic | TCGA-PAAD | TCGA-PAAD |
PCPG | Primary Tumor | TCGA-PCPG | TCGA-PCPG |
PCPG | Additional - New Primary | TCGA-PCPG | TCGA-PCPG |
PCPG | Metastatic | TCGA-PCPG | TCGA-PCPG |
PRAD | Primary Tumor | TCGA-PRAD | TCGA-PRAD |
PRAD | Metastatic | TCGA-PRAD | TCGA-PRAD |
READ | Primary Tumor | TCGA-READ | TCGA-READ |
READ | Recurrent Tumor | TCGA-READ | TCGA-READ |
SARC | Primary Tumor | TCGA-SARC | TCGA-SARC |
SARC | Recurrent Tumor | TCGA-SARC | TCGA-SARC |
SARC | Metastatic | TCGA-SARC | TCGA-SARC |
SKCM | Metastatic | TCGA-SKCM | TCGA-SKCM |
SKCM | Primary Tumor | TCGA-SKCM | TCGA-SKCM |
SKCM | Additional Metastatic | TCGA-SKCM | TCGA-SKCM |
STAD | Primary Tumor | TCGA-STAD | TCGA-STAD |
TGCT | Primary Tumor | TCGA-TGCT | TCGA-TGCT |
TGCT | Additional - New Primary | TCGA-TGCT | TCGA-TGCT |
THCA | Primary Tumor | TCGA-THCA | TCGA-THCA |
THCA | Metastatic | TCGA-THCA | TCGA-THCA |
THYM | Primary Tumor | TCGA-THYM | TCGA-THYM |
UCEC | Primary Tumor | TCGA-UCEC | TCGA-UCEC |
UCEC | Recurrent Tumor | TCGA-UCEC | TCGA-UCEC |
UCS | Primary Tumor | TCGA-UCS | TCGA-UCS |
UVM | Primary Tumor | TCGA-UVM | TCGA-UVM |
BLCA | Solid Tissue Normal | TCGA-BLCA | TCGA-NORMAL |
BRCA | Solid Tissue Normal | TCGA-BRCA | TCGA-NORMAL |
CESC | Solid Tissue Normal | TCGA-CESC | TCGA-NORMAL |
CHOL | Solid Tissue Normal | TCGA-CHOL | TCGA-NORMAL |
COAD | Solid Tissue Normal | TCGA-COAD | TCGA-NORMAL |
ESCA | Solid Tissue Normal | TCGA-ESCA | TCGA-NORMAL |
HNSC | Solid Tissue Normal | TCGA-HNSC | TCGA-NORMAL |
KICH | Solid Tissue Normal | TCGA-KICH | TCGA-NORMAL |
KIRC | Solid Tissue Normal | TCGA-KIRC | TCGA-NORMAL |
KIRP | Solid Tissue Normal | TCGA-KIRP | TCGA-NORMAL |
LIHC | Solid Tissue Normal | TCGA-LIHC | TCGA-NORMAL |
LUAD | Solid Tissue Normal | TCGA-LUAD | TCGA-NORMAL |
LUSC | Solid Tissue Normal | TCGA-LUSC | TCGA-NORMAL |
PAAD | Solid Tissue Normal | TCGA-PAAD | TCGA-NORMAL |
PCPG | Solid Tissue Normal | TCGA-PCPG | TCGA-NORMAL |
PRAD | Solid Tissue Normal | TCGA-PRAD | TCGA-NORMAL |
READ | Solid Tissue Normal | TCGA-READ | TCGA-NORMAL |
SARC | Solid Tissue Normal | TCGA-SARC | TCGA-NORMAL |
SKCM | Solid Tissue Normal | TCGA-SKCM | TCGA-NORMAL |
STAD | Solid Tissue Normal | TCGA-STAD | TCGA-NORMAL |
THCA | Solid Tissue Normal | TCGA-THCA | TCGA-NORMAL |
THYM | Solid Tissue Normal | TCGA-THYM | TCGA-NORMAL |
UCEC | Solid Tissue Normal | TCGA-UCEC | TCGA-NORMAL |
Clinical data preprocessing
TCGA
library(data.table)
library(dplyr)
# load("/home/data2/Projects/Fusion/unclean/TCGA_ALL_clinical.rdata")
#
# TCGA_ALL_clinical |>
# rename(Patient_ID = bcr_patient_barcode,
# Age = age_at_initial_pathologic_diagnosis,
# Sex = gender,
# Race = race,
# )
# [Discrepancy] [Not Applicable] [Not Available]
# [Discrepancy] [Not Available] [Unknown]
# # [Discrepancy] [Not Applicable]
# 29 105
# [Not Available] [Not Evaluated]
# 5140 4
# [Unknown]
library(UCSCXenaShiny)
table(tcga_clinical_fine$Stage_ajcc)
tcga_clinical
tcga_clinical_fine
tcga_genome_instability
tcga_purity
tcga_subtypes
tcga_surv
nrow(tcga_clinical_fine)
nrow(unique(tcga_clinical_fine))
= unique(tcga_clinical_fine) |>
tcga_clinical left_join(tcga_surv, by = c("Sample"="sample")) |>
left_join(tcga_purity, by = c("Sample"="sample"))
551, ]$Sample
tcga_clinical[
sum(startsWith(tcga_genome_instability$sample, "TCGA"))
= tcga_clinical |>
tcga_clinical left_join(tcga_genome_instability[startsWith(tcga_genome_instability$sample, "TCGA") & !duplicated(tcga_genome_instability$sample), ], by = c("Sample"="sample")) |>
#left_join(tcga_genome_instability[duplicated(tcga_genome_instability$sample), ], by = c("Sample"="sample")) |>
left_join(tcga_subtypes, by = c("Sample"="sampleID"))
sum(is.na(tcga_clinical$Sample))
sum(duplicated(tcga_clinical$Sample))
duplicated(tcga_clinical$Sample), ]$Sample
tcga_clinical[
= tcga_clinical[!duplicated(tcga_clinical$Sample), ] |>
tcga_clinical.final rename(Sample_ID = Sample, Sex = Gender, OS_Time = OS.time, OS_Status = OS, DSS_Time = DSS.time, DSS_Status = DSS, DFS_Time = DFI.time, DFS_Status = DFI, PFS_Time = PFI.time, PFS_Status = PFI) |>
select(-cancer_type) |>
mutate(Sex = ifelse(Sex == "FEMALE", "F", "M"),
Patient_ID = substr(Sample_ID, 1, 12))
sum(is.na(tcga_clinical.final$Sample_ID))
sum(is.na(tcga_clinical.final$Cancer))
sum(duplicated(tcga_clinical.final$Sample_ID))
= tcga_clinical.final |>
tcga_clinical.final select(Patient_ID, Sample_ID, everything()) |>
as.data.table()
tcga_clinical.final
fwrite(tcga_clinical.final, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")
CPTAC
library(data.table)
# Clinical data from CPTAC paper ------------------------------------------
= fs::dir_ls("CPTAC_Clinical_meta_data_v1/")
cli_list
= purrr::map(cli_list, fread) |> rbindlist(fill = TRUE, use.names = TRUE)
cli colnames(cli)
# Problem: Need to skip the second row
# This also includes samples from CPTAC-2/3, but the total number is fewer
= purrr::map(cli_list, function(x) {
cli2 # Since the column names are also different, read them twice—once for the column names and once for the data
= fread(x)
data1 = fread(x, skip = 2, header = FALSE)
data2 # Update the column names
colnames(data2) = colnames(data1)
$Cancer = stringr::str_remove(basename(x), "_meta.txt")
data2|> as.data.table()
data2 |> rbindlist(fill = TRUE, use.names = TRUE)
}) #colnames(cli2) = colnames(cli)
colnames(cli2)
== "Yes" & Normal == "Yes"] # This is to mark whether the patient has both tumor and normal samples
cli2[Tumor
= fs::dir_ls("meta_data_BCM/")
cli_list2
= purrr::map(cli_list2, function(x) {
cli3 # Since the column names are also different, read them twice—once for the column names and once for the data
= fread(x)
data1 = fread(x, skip = 2, header = FALSE)
data2 # 更新列名
colnames(data2) = colnames(data1)
$Cancer = stringr::str_remove(basename(x), "_meta.txt")
data2|> as.data.table()
data2 |> rbindlist(fill = TRUE, use.names = TRUE)
}) colnames(cli3)
all.equal(cli2, cli3)
all.equal(sort(colnames(cli2)), sort(colnames(cli3)))
= cli3[, colnames(cli2), with = FALSE]
cli4 all.equal(cli2, cli4)
identical(cli2, cli4)
::setdiff(cli2, cli4) |> View()
dplyr::setdiff(cli4, cli2) |> nrow()
dplyr
# rbind(cli2[idx == "01OV007"],
# cli4[idx == "01OV007"]) |> View()
# The files in the directories of the two data sources are basically identical
= cli2 |>
cli.final ::rename(
dplyrPatient_ID = idx,
OS_Status = OS_event,
OS_Time = OS_days,
PFS_Status = PFS_event,
PFS_Time = PFS_days
|>
) ::mutate(
dplyrSample_ID = Patient_ID,
Sex = ifelse(Sex == "Female", "F", "M")
|>
) ::select(
dplyr::everything()
Patient_ID, Sample_ID, Cancer, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status, dplyr
)# dplyr::select(
# - dplyr::starts_with("CIBERSORT"), - dplyr::starts_with("xCell")
# )
fwrite(cli.final, file = "CPTAC_cli_paper_combined.tsv", sep = "\t")
# Clinical data from GDC --------------------------------------------------
system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/CPTAC
mkdir CPTAC-3 CPTAC-2
tar zxvf clinical.project-cptac-2.2024-08-07.tar.gz -C CPTAC-2
tar zxvf clinical.project-cptac-3.2024-08-07.tar.gz -C CPTAC-3
")
= fread("gdc_cli/CPTAC/CPTAC-2/clinical.tsv")
cptac2 = fread("gdc_cli/CPTAC/CPTAC-3/clinical.tsv")
cptac3
which(colnames(cptac2) %in% "residual_disease")
which(colnames(cptac3) %in% "residual_disease")
all.equal(cptac2[[119]], cptac2[[192]])
all.equal(cptac3[[119]], cptac3[[192]])
192]] = NULL
cptac2[[192]] = NULL
cptac3[[
setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# > setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
# [1] "01OV049" "26OV010" "100004028" "05BR058" "05BR055" "17OV019" "11BR069" "02OV042" "100004012" "05BR051" "03BR012" "02OV045" "04OV041"
# [14] "05BR052" "05BR031" "1488" "11OV009" "02OV040" "01OV002" "100002921" "22OV001" "13OV004" "100003304" "02OV035" "01OV046" "01OV045"
# [27] "17OV034"
# > setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# [1] "C3N-02672" "C3L-01929" "C3N-02682" "C3N-02012" "C3N-02070"
# [6] "C3N-02696" "C3L-02354" "C3N-03026" "C3L-01739" "C3N-02088"
# [11] "C3L-04213" "C3N-03205" "C3L-04081" "C3N-03791" "C3L-03679"
# [16] "C3L-02643" "C3N-04686" "C3L-01558" "C3L-03268" "C3L-01633"
# [21] "C3L-02220" "C3N-02439" "C3N-02028" "C3N-03800" "C3L-04392"
# [26] "C3L-02556" "C3N-02721" "C3L-01672" "C3L-00938" "C3L-02201"
# [31] "C3N-02146" "C3N-02639" "C3N-03788" "C3L-03733" "C3L-04037"
# [36] "C3N-01091" "C3L-02747" "C3N-01879" "C3N-02996" "C3N-02296"
# [41] "C3L-02553" "C3L-03632" "C3N-03755" "GTEX-NPJ7-0011-R10A-SM-HAKXW" "C3L-02746"
length(setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)) # 586
View(cptac2[case_submitter_id %in% setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)])
View(cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)])
::fwrite(
data.table%in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)],
cptac3[case_submitter_id file = "CPTAC3_non_included.tsv", sep = "\t"
)
cptac3
# Supplement the missing sample information in the paper based on the existing information in the GDC. ---------------------------------------------
# might be columns that have labeled cancer types.
# primary_diagnosis site_of_resection_or_biopsy tissue_or_organ_of_origin
= rbind(cptac2, cptac3)[, .(case_submitter_id, primary_diagnosis, site_of_resection_or_biopsy, tissue_or_organ_of_origin)] |> unique()
cptac = dplyr::left_join(cptac, cli.final[, .(Sample_ID, Cancer)], by = c("case_submitter_id"="Sample_ID"))
cptac
setdiff(cli.final$Sample_ID, cptac$case_submitter_id)
# 29 cases are not in the GDC data.
table(cptac$Cancer, cptac$primary_diagnosis) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$site_of_resection_or_biopsy) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$tissue_or_organ_of_origin) |> pheatmap::pheatmap()
table(cptac$Cancer,
paste(
$primary_diagnosis,
cptac$site_of_resection_or_biopsy,
cptac$tissue_or_organ_of_origin, sep = "_"
cptac|> pheatmap::pheatmap(show_colnames = FALSE)
))
# Solve them one by one.
cat(sub("_meta.txt", "", basename(names(cli_list))))
= sub("_meta.txt", "", basename(names(cli_list)))
typelist # BRCA CCRCC COAD GBM HGSC HNSCC LSCC LUAD PDAC UCEC
for (i in typelist) {
message("check type ", i)
= cptac |> dplyr::filter(Cancer == i)
cptac_part message(" primary_diagnosis")
table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
message(" site_of_resection_or_biopsy")
table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
message(" tissue_or_organ_of_origin")
table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}# BRCA (122): tissue_or_organ_of_origin "Breast, NOS"
# CCRCC (103): primary_diagnosis "Renal cell carcinoma, NOS"
# tissue_or_organ_of_origin "Kidney, NOS"
# COAD (106): tissue_or_organ_of_origin "Colon, NOS", "Rectum, NOS"
# GBM (99): primary_diagnosis "Glioblastoma"
# HGSC (87): primary_diagnosis "Serous adenocarcinoma, NOS" AND
# tissue_or_organ_of_origin Fallopian tube Ovary Peritoneum, NOS
# HNSCC (108): Other situation: primary_diagnosis "Squamous cell carcinoma, NOS"
# Base of tongue, NOS Cheek mucosa Floor of mouth, NOS Gum, NOS
# HNSCC 1 2 18 3
#
# Head, face or neck, NOS Larynx, NOS Lip, NOS Oropharynx, NOS
# HNSCC 2 47 4 4
#
# Overlapping lesion of lip, oral cavity and pharynx Tongue, NOS Tonsil, NOS
# HNSCC
# LSCC (108): primary_diagnosis "Squamous cell carcinoma, NOS" AND
# tissue_or_organ_of_origin "lung" pattern
# LUAD (110): primary_diagnosis "Adenocarcinoma, NOS" AND
# tissue_or_organ_of_origin "lung" pattern
# PDAC (105): tissue_or_organ_of_origin "pancreas" pattern
# UCEC (95): primary_diagnosis "Endometrioid adenocarcinoma, NOS"
sum(!is.na(cptac$Cancer))
# rbind(cptac2, cptac3)[,192] # There are duplicate columns, need to go back and handle this earlier (fixed).
# Actual testing and updating
= rbind(cptac2, cptac3) |>
cptac_updated unique() |>
::mutate(
dplyrCancer = dplyr::case_when(
%in% "Breast, NOS" ~ "BRCA",
tissue_or_organ_of_origin %in% "Renal cell carcinoma, NOS" ~ "CCRCC",
primary_diagnosis %in% c("Colon, NOS", "Rectum, NOS") ~ "COAD",
tissue_or_organ_of_origin grepl("Glioblastoma", primary_diagnosis, ignore.case = TRUE) | grepl("Oligodendroglioma", primary_diagnosis, ignore.case = TRUE) ~ "GBM",
grepl("GTEX", case_submitter_id, ignore.case = TRUE) ~ "GTEX-Brain",
%in% c("C3L-06912", "C3L-07212") ~ "GBM",
case_submitter_id #primary_diagnosis %in% "Glioblastoma" ~ "GBM",
%in% "Serous adenocarcinoma, NOS" & tissue_or_organ_of_origin %in% c("Fallopian tube", "Ovary", "Peritoneum, NOS") ~ "HGSC",
primary_diagnosis %in% "Squamous cell carcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LSCC",
primary_diagnosis %in% "Adenocarcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LUAD",
primary_diagnosis grepl("pancreas", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "PDAC",
%in% "Endometrioid adenocarcinoma, NOS" ~ "UCEC",
primary_diagnosis TRUE ~ "HNSCC"
)|> as.data.table()
)
for (i in "HNSCC") { # typelist
message("check type ", i)
= cptac_updated |> dplyr::filter(Cancer == i)
cptac_part message(" primary_diagnosis")
table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
message(" site_of_resection_or_biopsy")
table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
message(" tissue_or_organ_of_origin")
table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}
table(cptac$Cancer)
table(cptac_updated$Cancer)
|>
cptac_updated ::filter(
dplyr== "'--" | site_of_resection_or_biopsy == "'--" |
primary_diagnosis %in% c("'--", "Unknown")
tissue_or_organ_of_origin
)# 1-9 from GTEX-Brain
# C3L-06912 C3L-07212 Gliomas # https://portal.gdc.cancer.gov/cases/e6027623-a5b2-4d82-a9ca-f37c948dd3ed
$case_submitter_id[grepl("GTEX", cptac_updated$case_submitter_id)]
cptac_updated
nrow(cptac_updated)
fwrite(cptac_updated, file = "CPTAC_combined_with_cancer_annotated.tsv", sep = "\t")
= cptac_updated[! case_submitter_id %in% cli.final$Sample_ID]
cptac_unmatched colnames(cptac_unmatched)
colnames(cli.final)[1:20]
= cptac_unmatched |>
cptac_unmatched2 ::select(
dplyr
case_submitter_id, age_at_diagnosis, gender,
days_to_death, vital_status,::starts_with("ajcc"),
dplyr::starts_with("days"),
dplyr
last_known_disease_status, progression_or_recurrence, residual_disease,
tumor_grade, Cancer
)
sum(cptac_unmatched2$days_to_last_follow_up != "'--")
sum(cptac_unmatched2$days_to_death != "'--")
sum(cptac_unmatched2$days_to_recurrence != "'--")
sum(cptac_unmatched2$days_to_last_known_disease_status != "'--")
sum(cptac_unmatched2$progression_or_recurrence != "'--")
sum(!is.na(cli.final$OS_Time))
sum(!is.na(cli.final$PFS_Time))
sum(cli.final$OS_Time >= cli.final$PFS_Time, na.rm = TRUE)
$OS_Time < cli.final$PFS_Time] |> View()
cli.final[cli.final
sum(as.integer(cptac_unmatched2$days_to_last_follow_up) >= as.integer(cptac_unmatched2$days_to_last_known_disease_status), na.rm = TRUE)
= cptac_unmatched2 |>
cptac_unmatched3 ::rename(
dplyrSample_ID = case_submitter_id,
Age = age_at_diagnosis,
Sex = gender,
OS_Time = days_to_last_follow_up,
OS_Status = vital_status,
PFS_Time = days_to_last_known_disease_status,
PFS_Status = progression_or_recurrence,
Histologic_Grade = tumor_grade,
Path_Stage_pT = ajcc_pathologic_t,
Path_Stage_pN = ajcc_pathologic_n,
Stage = ajcc_pathologic_stage
|>
) ::select(
dplyr
Sample_ID, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status,:Path_Stage_pT,
Path_Stage_pN
Histologic_Grade, Cancer|>
) ::mutate(
dplyrPatient_ID = Sample_ID,
Age = as.integer(round(as.integer(Age) / 365)),
Sex = fcase(Sex %in% "female", "F",
%in% "male", "M", default = NA_character_),
Sex OS_Time = as.integer(OS_Time),
OS_Status = fcase(OS_Status %in% "Alive", 0L,
%in% "Dead", 1L, default = NA_integer_),
OS_Status PFS_Time = as.integer(PFS_Time),
PFS_Status = fcase(PFS_Status %in% "no", 0L,
%in% "yes", 1L, default = NA_integer_),
PFS_Status OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time),
Path_Stage_pN = fcase(
%in% "N0", "pN0",
Path_Stage_pN %in% c("N1", "N1a"), "pN1",
Path_Stage_pN %in% c("N2", "N2a"), "pN2",
Path_Stage_pN default = NA_character_
),Path_Stage_pT = fcase(
%in% "T0", "pT0",
Path_Stage_pT %in% c("T1", "T1a", "T1b", "T1c"), "pT1",
Path_Stage_pT %in% c("T2", "T2a", "T2b"), "pT2",
Path_Stage_pT %in% c("T3", "T3a", "T3b"), "pT3",
Path_Stage_pT %in% "T4", "pT4",
Path_Stage_pT default = NA_character_
),Stage = fcase(
%in% c("Stage I", "Stage IA", "Stage IA1", "Stage IA2", "Stage IA3", "Stage IB"), "Stage I",
Stage %in% c("Stage II", "Stage IIA", "Stage IIB"), "Stage II",
Stage %in% c("Stage III", "Stage IIIA", "Stage IIIB", "Stage IIIC", "Stage IIIC1", "Stage IIIC2"), "Stage III",
Stage %in% c("Stage IV", "Stage IVB"), "Stage IV",
Stage default = NA_character_
),Histologic_Grade = fcase(
%in% "G1", "G1 Well differentiated",
Histologic_Grade %in% "G2", "G2 Moderately differentiated",
Histologic_Grade %in% "G3", "G3 Poorly differentiated",
Histologic_Grade %in% "G4", "G4 Undifferentiated",
Histologic_Grade default = NA_character_
)
)
cptac_unmatched3
# X: cannot be assessed
table(cptac_unmatched3$Path_Stage_pN)
table(cli.final$Path_Stage_pN)
table(cptac_unmatched3$Path_Stage_pT)
table(cli.final$Path_Stage_pT)
table(cptac_unmatched3$Stage)
table(cli.final$Stage)
table(cptac_unmatched3$Histologic_Grade)
table(cli.final$Histologic_Grade)
# Final output ------------------------------------------------------------
.2 = dplyr::bind_rows(
cli.final
cli.final, cptac_unmatched3|> as.data.table()
)
table(cli.final.2$Cancer)
sum(table(cli.final.2$Cancer))
# Include all Race information?
.2 = dplyr::left_join(
cli.final.2, unique(rbind(cptac2, cptac3)[, .(case_submitter_id, race)]),
cli.finalby = c("Sample_ID"="case_submitter_id")
|>
) ::mutate(
dplyrrace = ifelse(race %in% c("Unknown", "not reported"), NA_character_,
race)|>
) as.data.table()
sum(is.na(cli.final.2$race))
fwrite(cli.final.2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")
TARGET
library(data.table)
= fread("TARGET_donor_allprojects_transfer_to_sample.gz")
cli1 = fread("TARGET_phenotype.gz")
cli2
= cli2 |>
target_cli ::select(-`_cohort`) |>
dplyr::rename(Patient_ID = `_PATIENT`,
dplyrSample_ID = sample_id,
disease_code = primary_disease_code,
disease = `_primary_disease`,
sample_type = `_sample_type`) |>
::full_join(
dplyr|> dplyr::select(
cli1 -`_OS_UNIT`, -`_EVENT`, -`_TIME_TO_EVENT`
|>
) ::rename(
dplyrSample_ID = xena_sample,
Sex = `_gender`,
Age = `_age_at_diagnosis`,
OS_Status = `_OS_IND`,
OS_Time = `_OS`
by = "Sample_ID"
), |>
) ::mutate(
dplyrSex = ifelse(Sex == "Male", "M", "F"),
OS_Status = ifelse(OS_Status == "Alive", 0L, 1L)
|>
) ::select(Patient_ID, Sample_ID, Age, Sex, OS_Time, OS_Status, dplyr::everything()) |>
dplyras.data.table()
cli1
target_cli
fwrite(target_cli, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t")
# Using cli from GDC ------------------------------------------------------
system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/TARGET
mkdir ALL_P1 ALL_P2 ALL_P3 AML CCSK NBL OS RT WT
tar zxvf clinical.project-target-all-p1.2024-08-07.tar.gz -C ALL_P1
tar zxvf clinical.project-target-all-p2.2024-08-07.tar.gz -C ALL_P2
tar zxvf clinical.project-target-all-p3.2024-08-07.tar.gz -C ALL_P3
tar zxvf clinical.project-target-aml.2024-08-07.tar.gz -C AML
tar zxvf clinical.project-target-ccsk.2024-08-07.tar.gz -C CCSK
tar zxvf clinical.project-target-nbl.2024-08-07.tar.gz -C NBL
tar zxvf clinical.project-target-os.2024-08-07.tar.gz -C OS
tar zxvf clinical.project-target-rt.2024-08-07.tar.gz -C RT
tar zxvf clinical.project-target-wt.2024-08-07.tar.gz -C WT
")
= rbindlist(
gdc_target list(
fread("gdc_cli/TARGET/ALL_P1/clinical.tsv"),
fread("gdc_cli/TARGET/ALL_P2/clinical.tsv"),
fread("gdc_cli/TARGET/ALL_P3/clinical.tsv"),
fread("gdc_cli/TARGET/AML/clinical.tsv"),
fread("gdc_cli/TARGET/CCSK/clinical.tsv"),
fread("gdc_cli/TARGET/NBL/clinical.tsv"),
fread("gdc_cli/TARGET/OS/clinical.tsv"),
fread("gdc_cli/TARGET/RT/clinical.tsv"),
fread("gdc_cli/TARGET/WT/clinical.tsv")
),use.names = TRUE, fill = TRUE
)
table(gdc_target$project_id)
any(duplicated(gdc_target))
= gdc_target[!duplicated(gdc_target)]
gdc_target
all.equal(gdc_target[[119]], gdc_target[[192]])
192]] = NULL
gdc_target[[
sum(gdc_target$days_to_last_follow_up != "'--")
sum(gdc_target$days_to_death != "'--")
= gdc_target |>
gdc_target2 ::rename(
dplyrSample_ID = case_submitter_id,
Age = age_at_diagnosis,
Sex = gender,
OS_Time = days_to_death,
OS_Status = vital_status
|>
) ::select(
dplyr
Sample_ID, Age, Sex, OS_Time, OS_Status,
project_id, race, primary_diagnosis|>
) ::mutate(
dplyrPatient_ID = Sample_ID,
Age = as.integer(round(as.integer(Age) / 365)),
Sex = fcase(Sex %in% "female", "F",
%in% "male", "M", default = NA_character_),
Sex OS_Time = as.integer(OS_Time),
OS_Status = fcase(OS_Status %in% "Alive", 0L,
%in% "Dead", 1L, default = NA_integer_),
OS_Status OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time)
|>
) as.data.table()
#unique()
# exist a large number of case level replicate
duplicated(gdc_target2), ]
gdc_target2[duplicated(gdc_target[case_submitter_id == "TARGET-20-PANTIV"])
== "TARGET-20-PANTIV"] |> View() # A few records that have been updated?
gdc_target[case_submitter_id
= gdc_target2[!duplicated(gdc_target2)]
gdc_target2
#table(gdc_target2$PFS_Time)
any(duplicated(gdc_target2))
any(duplicated(gdc_target2$Sample_ID))
duplicated(gdc_target2$Sample_ID)]
gdc_target2[# They may exist in different ALL projects.
= gdc_target2[Sample_ID %in% Sample_ID[duplicated(Sample_ID)]]
gdc_dup table(gdc_dup$project_id)
# TARGET-ALL-P1 TARGET-ALL-P2 TARGET-ALL-P3
# 10 57 47
#
# For now, we won't address this issue. If we conduct a combined analysis of ALL, we will consider removing duplicates.
colnames(gdc_target2)
= gdc_target2[, c("Patient_ID", "Sample_ID", "Age", "Sex", "race", "primary_diagnosis", "OS_Time", "OS_Status", "project_id")]
gdc_target2
fwrite(gdc_target2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t")
sum(!is.na(gdc_target2$OS_Time))
sum(!is.na(cli1$`_OS`))
nrow(na.omit(unique(target_cli[, .(Patient_ID, OS_Time, OS_Status)])))
nrow(na.omit(unique(gdc_target2[, .(Patient_ID, OS_Time, OS_Status)])))
TCGA/TARGET/CPTAC Clinical/IOBR/TME/GENE data divide and anotate preprocessing
TARGET
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
<- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
file_meta <- fread(file_meta, sep = "\t")
metafusion
# TARGET
## info
### Because some projects are not included in the target_sheet, we should abandon them.
<- fread("/home/data2/Projects/Fusion/unclean/TARGET_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")
TARGET_sheet <- substr(TARGET_sheet$`Sample ID`, 1, 9) |> unique()
project_choose
project_choose### "TARGET-20" "TARGET-00" "TARGET-21" "TARGET-30"
### TARGET_sheet pretreatment
### Composite Sample
###
### Composite Sample use comma to separate like TARGET-30-PAPEAV, TARGET-30-PAPTFZ
<- TARGET_sheet |>
TARGET_Composite filter(grepl(", ", `Sample ID`)) |>
select(`Sample ID`) |>
unique()
TARGET_Composite### 1: TARGET-30-PAPEAV-01A, TARGET-30-PAPTFZ-01A
### 2: TARGET-30-PANUKV-01A, TARGET-30-PASUML-01A
### 3: TARGET-30-PANKFE-01A, TARGET-30-PAPUAR-01A
### 4: TARGET-30-PASYPX-01A, TARGET-30-PAIXIF-01A
### For the CPTAC databases, check whether there is any overlap between composite and normal Sample_ID in the info data.
<- TARGET_sheet |>
check_TARGET_overlap select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
unique() |>
separate_rows(`Sample ID`, sep = ",\\s*")
<- check_TARGET_overlap |>
check_TARGET_overlap mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))
<- check_TARGET_overlap |>
overlap_samples group_by(`Sample ID`) |>
summarise(Count = n(), .groups = 'drop') |>
filter(Count > 1)
overlap_samples### A tibble: 4 × 2
### `Sample ID` Count
### <chr> <int>
### 1 TARGET-30-PANKFE-01A 2
### 2 TARGET-30-PAPTFZ-01A 2
### 3 TARGET-30-PASUML-01A 2
### 4 TARGET-30-PASYPX-01A 2
### So we choose abandon them.
<- TARGET_sheet |>
TARGET_pretreatment select(c(`Case ID`, `Sample ID`, `Sample Type`)) |>
rename("Patient_ID" = `Case ID`,
"Sample_ID" = `Sample ID`,
"Sample_type" = `Sample Type`) |>
filter(!grepl(",", Sample_ID)) |>
filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
unique()
### TARGET_xena_info pretreatment
### This data use
<- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t")
TARGET_xena_info <- TARGET_xena_info |>
TARGET_xena_pretreatment mutate(Patient_ID = str_extract(Sample_ID, "^[^-]+-[^-]+-[^-]+")) |>
filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
mutate(Age = round(Age, digits = 0),
across(c(Age, OS_Time, OS_Status), ~ as.integer(.))) |>
rename(Cancer = disease_code)
### TARGET_info pretreatment
### In the TARGET database, rows with the info column marked as '--' are annotated as NA.
<- function(x) {
filter_lose is.na(x) | x == "" | x == "--" | x == "Unknown" | x == "'--"
}<- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t")
TARGET_info <- c(
replacement_table2 "Acute lymphocytic leukemia" = "TARGET-ALL",
"'--" = "TARGET-Unkown",
"Precursor B-cell lymphoblastic leukemia" = "TARGET-B-ALL",
"Not Reported" = "TARGET-Unkown",
"Acute myeloid leukemia, NOS" = "TARGET-LAML",
"Clear cell sarcoma of kidney" = "TARGET-CCSK",
"Neuroblastoma, NOS" = "TARGET-NBL",
"Ganglioneuroblastoma" = "TARGET-GNB",
"Osteosarcoma, NOS" = "TARGET-OS",
"Malignant rhabdoid tumor" = "TARGET-MRT",
"Wilms tumor" = "TARGET-WT"
)<- TARGET_info$primary_diagnosis %in% names(replacement_table2)
matched2 <- TARGET_info |>
TARGET_info_pretreatment mutate(cohort = case_when(
~ replacement_table2[primary_diagnosis],
matched2 TRUE ~ as.character(NA)
|>
)) rename(Cancer = primary_diagnosis,
Project_id = project_id) |>
filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
mutate(across(c("Age", "Sex", "OS_Time", "OS_Status", "race"), ~ ifelse(filter_lose(.x), NA, .x))) |>
mutate(Cancer = ifelse(Cancer == "'--", NA, Cancer))
### Check if the valid rows in TARGET_info match the valid rows in TARGET_xena_info.
### 1.Filter the valid rows.
<- function(x) {
filter_valid !is.na(x) & x != "" & x != "--" & x != "Unknown" & x == "'--"
}
### Filter the valid rows in TARGET_info.
<- TARGET_info_pretreatment %>%
TARGET_info_valid filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))
### Filter the valid rows in TARGET_xena_info.
<- TARGET_xena_pretreatment %>%
TARGET_xena_info_valid filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))
### 2. Merge the two data frames based on Patient_ID.
<- TARGET_info_valid %>%
merged_data inner_join(TARGET_xena_info_valid, by = "Patient_ID", suffix = c("_info", "_xena"))
### 3. Check whether the valid values in the merged data frame are consistent.
<- merged_data %>%
is_consistent rowwise() %>%
mutate(
Age_consistent = Age_info == Age_xena,
Sex_consistent = Sex_info == Sex_xena,
OS_Time_consistent = OS_Time_info == OS_Time_xena,
OS_Status_consistent = OS_Status_info == OS_Status_xena
%>%
) ungroup()
### 4. Age not onsistent
<- is_consistent |>
Age_not_consistent filter(Age_consistent == "FALSE")
### 5. Sex not onsistent
<- is_consistent |>
Sex_not_consistent filter(Sex_consistent == "FALSE")
### Sex_not_consistent
### A tibble: 0 × 13
### 6. OS_Time not onsistent
<- is_consistent |>
OS_Time_not_consistent filter(OS_Time_consistent == "FALSE") |>
mutate(
OS_Time_info_greater = OS_Time_info > OS_Time_xena,
OS_Time_xena_greater = OS_Time_xena > OS_Time_info,
OS_Time_equal = OS_Time_info == OS_Time_xena
)nrow(OS_Time_not_consistent |> filter(OS_Time_info_greater == "FALSE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_xena_greater == "TRUE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_equal == "TRUE"))
### [1]0
### 7. OS_Status not onsistent
<- is_consistent |>
OS_Status_not_consistent filter(OS_Status_consistent == "FALSE")
unique(OS_Status_not_consistent$OS_Status_info)
### [1] 1
unique(OS_Status_not_consistent$OS_Status_xena)
### [1] 0
### All evidence shows that the GDC data has been updated more recently than the Xena data.
### Therefore, if there is a conflict between GDC data and Xena data, we should use the GDC data.
### Use the valid values from TARGET_xena_info to fill the missing values in TARGET_info.
### 1: Determine the matching Patient_ID.
<- intersect(TARGET_xena_pretreatment$Patient_ID, TARGET_info_pretreatment$Patient_ID)
matched_ids setDT(TARGET_xena_pretreatment)
setDT(TARGET_info_pretreatment)
### 2: Find the columns that require updates.
<- setdiff(names(TARGET_info_pretreatment), c("Sample_ID", "Patient_ID", "Cancer","project_id","cohort","race"))
cols_to_update
### 3.Iterate through each matching Patient_ID and update TARGET_info_pretreatment using TARGET_xena_pretreatment.
### update Age
<- TARGET_xena_pretreatment |>
TARGET_xena_can filter(if_any("Age", ~ !filter_lose(.x)))
<- TARGET_info_pretreatment |>
TARGET_info_match mutate(Age = case_when(
filter_lose(Age) & Patient_ID %in% TARGET_xena_can$Patient_ID ~ TARGET_xena_can$Age[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ Age))
### update Sex
<- TARGET_xena_pretreatment |>
TARGET_xena_can filter(if_any("Sex", ~ !filter_lose(.x)))
<- TARGET_info_match |>
TARGET_info_match1 mutate(Sex = case_when(
filter_lose(Sex) & Patient_ID %in% TARGET_xena_can$Patient_ID ~ TARGET_xena_can$Sex[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ Sex))
### update OS_Time
<- TARGET_xena_pretreatment |>
TARGET_xena_can filter(if_any("OS_Time", ~ !filter_lose(.x)))
<- TARGET_info_match1 |>
TARGET_info_match2 mutate(OS_Time = case_when(
filter_lose(OS_Time) & Patient_ID %in% TARGET_xena_can$Patient_ID & OS_Status %in% TARGET_xena_can$OS_Status ~ TARGET_xena_can$OS_Time[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ OS_Time))
### update OS_Status
<- TARGET_xena_pretreatment |>
TARGET_xena_can filter(if_any("OS_Status", ~ !filter_lose(.x)))
<- TARGET_info_match2 |>
TARGET_info_after_match mutate(OS_Status = case_when(
filter_lose(OS_Status) & Patient_ID %in% TARGET_xena_can$Patient_ID & OS_Time %in% TARGET_xena_can$OS_Time ~ TARGET_xena_can$OS_Status[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ OS_Status))
### Use TARGET_sheet to perform a left join, ensuring that the Sample_ID is correct.
<- TARGET_info_after_match |> select(- Sample_ID)
TARGET_info_after_match <- TARGET_pretreatment |>
TARGET_correct left_join(TARGET_info_after_match, by = c("Patient_ID" = "Patient_ID")) |>
mutate(Run = Sample_ID)
### Replace the cohort column values with Project_ID where the row is TARGET-Unkown.
### TARGET_correct$project_id |> unique()
### [1] "TARGET-AML" "TARGET-NBL"
<- TARGET_correct %>%
TARGET_correct_after_pro mutate(
cohort = case_when(
== "TARGET-Unkown" ~ Project_id,
cohort TRUE ~ cohort),
cohort = case_when(
== "TARGET-AML" ~ "TARGET-LAML",
cohort TRUE ~ cohort)
)
### Divide a part of data into a group based on whether the Sample_type is cell
<- TARGET_correct_after_pro |> filter(`Sample_type` == "Cell Lines")
target_cell <- target_cell$Sample_ID |> unique()
target_cell_sample <- target_cell$Patient_ID |> unique()
target_cell_patient ### target_cell_sample
### [1] "TARGET-20-MUTZ3-50A" "TARGET-20-ML1-50A" "TARGET-20-MV411D1-50A" "TARGET-20-KasumiD1-50A" "TARGET-20-HEK293-50A" "TARGET-20-MOLM14CBFGLIS-50A"
### [7] "TARGET-20-TF1-50A" "TARGET-20-IGROV1-50A" "TARGET-20-CMS-50A" "TARGET-20-WSUAML-50A" "TARGET-20-KasumiAZAD5-50A" "TARGET-20-KasumiAZAD11-50A"
### [13]"TARGET-20-REH-50A" "TARGET-20-MV411AZAD11-50A" "TARGET-20-MV411AZAD5-50A" "TARGET-20-THP1-50A" "TARGET-20-OCIAML2-50A" "TARGET-20-HL60-50A"
### target_cell_patient
### [1] "TARGET-20-MUTZ3" "TARGET-20-ML1" "TARGET-20-MV411D1" "TARGET-20-KasumiD1" "TARGET-20-HEK293" "TARGET-20-MOLM14CBFGLIS"
### [7] "TARGET-20-TF1" "TARGET-20-IGROV1" "TARGET-20-CMS" "TARGET-20-WSUAML" "TARGET-20-KasumiAZAD5" "TARGET-20-KasumiAZAD11"
### [13]"TARGET-20-REH" "TARGET-20-MV411AZAD11" "TARGET-20-MV411AZAD5" "TARGET-20-THP1" "TARGET-20-OCIAML2" "TARGET-20-HL60"
### reference : https://www.cellosaurus.org/
### MUTZ3:Acute myeloid leukemia
### ML1: Adult acute myeloid leukemia
### MV411D1:biphenotypic B-myelomonocytic leukemia
### KasumiD1:Acute myeloblastic leukemia
### TF1:Acute erythroid leukemia
### IGROV1: Ovarian endometrioid adenocarcinoma
### WSUAML:Acute myeloid leukemia。
### KasumiAZAD5:can't find, so KasumiD1:Acute myeloblastic leukemia
### KasumiAZAD11:can't find, so KasumiD1:Acute myeloblastic leukemia
### REH:Childhood B acute lymphoblastic leukemia
### MV411AZAD11:can't find, so MV411D1:biphenotypic B-myelomonocytic leukemia
### MV411AZAD5:can't find, so MV411D1:biphenotypic B-myelomonocytic leukemia
### THP1:Childhood acute monocytic leukemia
### OCIAML2:Adult acute myeloid leukemia
### HL60:Adult acute myeloid leukemia
### HEK293:Vaccine production cell line.
### CML:Childhood acute megakaryoblastic leukemia
### MOLM14CBFGLIS:can't find, so MOLM14:Adult acute myeloid leukemia
<- metafusion[metafusion$Sample_ID %in% target_cell_sample]
cell_data ### > nrow(cell_data)
### [1] 1853
<- TARGET_info_pretreatment[TARGET_info_pretreatment$Patient_ID %in% target_cell_patient]
cell_info ### nrow(cell_info)
### [1] 18
### no os.etc
<- TARGET_xena_pretreatment[TARGET_xena_pretreatment$Patient_ID %in% target_cell_patient]
cell_info2 ### nrow(cell_info2)
### [1] 0
<- TARGET_correct_after_pro |>
TARGET_info_after_cell mutate(cohort = case_when(
`Sample_type` %in% "Cell Lines" ~ "TARGET-CELL",
TRUE ~ cohort
))
### The criteria for distinguishing normal samples based on Sample_type column in the TARGET databases are as follows:
$Sample_type |> unique()
TARGET_info_after_cell### [1] "Primary Blood Derived Cancer - Bone Marrow" "Recurrent Blood Derived Cancer - Bone Marrow"
### [3] "Blood Derived Normal" "Primary Blood Derived Cancer - Peripheral Blood"
### [5] "Bone Marrow Normal" "Cell Lines"
### [7] "Next Generation Cancer Model" "Blood Derived Cancer - Bone Marrow, Post-treatment"
### [9] "Primary Tumor" "Recurrent Tumor"
### [11] "Recurrent Blood Derived Cancer - Peripheral Blood" "Blood Derived Cancer - Peripheral Blood, Post-treatment"
###
### If the sample type is "Blood Derived Normal" or "Bone Marrow Normal", then it is a normal sample. Otherwise, it is a tumor sample.
<- TARGET_info_after_cell |>
TARGET_info_after_normal mutate(cohort = case_when(
`Sample_type` %in% c("Blood Derived Normal", "Bone Marrow Normal") ~ "TARGET-NORMAL",
TRUE ~ cohort
))
### Adjust the column order.
<- c("Patient_ID", "Sample_ID", "Run", "Sample_type", "Age", "Sex", "race", "Cancer",
new_order "OS_Time", "OS_Status", "Project_id", "cohort")
<- TARGET_info_after_normal |>
TARGET_info_after_processing select(all_of(new_order))
### make sure nrow is same between Before and After Processing.
|> nrow()
TARGET_pretreatment ### [1] 3222
|> nrow()
TARGET_info_after_processing ### [1] 3222
### make sure IOBR/TME data just contain sample which target_sheet gave.
<- TARGET_info_after_processing$Run
detected_id_TARGET
### save info
<- unique(TARGET_info_after_processing$cohort)
cohorts
for (cohort1 in cohorts) {
<- subset(TARGET_info_after_processing, cohort == cohort1)
cohort_data <- gsub("-", "_", cohort1)
name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
file_name fwrite(cohort_data , file = file_name, sep = "\t")
}
##Gene
<- metafusion |>
metafusion1 mutate(
cohort = case_when(
%in% TARGET_info_after_processing$Run & grepl("TARGET", cohort) ~ TARGET_info_after_processing$cohort[match(Sample_ID, TARGET_info_after_processing$Run)],
Sample_ID TRUE ~ cohort
)
) <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
cp_file_name1 fwrite(metafusion1, file = cp_file_name1, sep = "\t")
##TME
<- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
NBL_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
AML_TME <- rbind(NBL_TME,AML_TME) |>
TARGET_TME mutate(dash_count = str_count(ID, "-"))
### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
|>
TARGET_TME filter(dash_count == 9) |>
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R
<- TARGET_TME |>
TARGET_TME_pretreatment filter(dash_count != 9) |>
select(-dash_count)
### Filter TARGET Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
###
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
###
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
###
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
### <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>
<- TARGET_TME_pretreatment |>
TARGET_TME_after_replicate mutate(
prefixes = str_remove(ID, "-[^-]+$"),
suffixes = str_extract(ID, "[^-]+$"),
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars)) |>
group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(ID = prefixes) |>
filter(ID %in% detected_id_TARGET) |> ### Ensure that all IDs in the TME data can be found in the TARGET info data.
select(-c(prefixes,suffixes,second_chars))
### Using TARGET info data cohort column to annotate TARGET TME data
<- TARGET_info_after_processing |>
cohort_data select(Run, cohort) |>
unique()
<- TARGET_TME_after_replicate |>
TARGET_TME_after_cohort mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
### save TME
<- unique(TARGET_TME_after_cohort$cohort)
cohorts_TME
for (cohort1 in cohorts_TME) {
<- subset(TARGET_TME_after_cohort, cohort == cohort1) |> select(-cohort)
cohort_data <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
file_name write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
## IOBR
<- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
NBL_IOBR_signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
AML_IOBR_signature <- rbind(NBL_IOBR_signature,AML_IOBR_signature) |>
TARGET_IOBR mutate(dash_count = str_count(ID, "-"))
### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
|>
TARGET_IOBR filter(dash_count == 9) |>
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R
<- TARGET_IOBR |>
TARGET_IOBR_pretreatment filter(dash_count != 9) |>
select(-dash_count)
<- TARGET_IOBR_pretreatment |>
TARGET_IOBR_after_replicate mutate(
prefixes = str_remove(ID, "-[^-]+$"),
suffixes = str_extract(ID, "[^-]+$"),
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars)) |>
group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(ID = prefixes) |>
filter(ID %in% detected_id_TARGET) |> ### Ensure that all IDs in the TME data can be found in the TARGET info data.
select(-c(prefixes,suffixes,second_chars))
### Using TARGET info data cohort column to annotate TARGET TME data
<- TARGET_info_after_processing |>
cohort_data select(Run, cohort) |>
unique()
<- TARGET_IOBR_after_replicate |>
TARGET_IOBR_after_cohort mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
### save IOBR
<- unique(TARGET_IOBR_after_cohort$cohort)
cohorts_IOBR
for (cohort1 in cohorts_IOBR) {
<- subset(TARGET_IOBR_after_cohort, cohort == cohort1) |> select(-cohort)
cohort_data <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
file_name write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
CPTAC
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
<- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.tsv"
file_meta <- fread(file_meta, sep = "\t")
metafusion
# CPTAC
## info
<- fread("/home/data2/Projects/Fusion/unclean/CPTAC_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")
CPTAC_sheet
### Composite Sample
###
### Composite Sample use comma to separate like C3L-03400-05, C3L-03400-02
###
### It leads one question is whether these composite samples using comma to separate from identical patient?
<- grepl(", ", CPTAC_sheet$`Sample ID`)
rows_with_comma <- CPTAC_sheet[rows_with_comma, ]
rows_with_comma_data
<- function(x) {
check_same_chars <- unlist(strsplit(x, ", "))
parts return(parts[1] == parts[2])
}<- sapply(rows_with_comma_data$`Case ID`, check_same_chars)
same_chars if (all(same_chars)) {
cat("TRUE\n")
else {
} cat("FALSE\n")
}### They are from identical patient.
### For the TARGET databases, check whether there is any overlap between composite and normal Sample_ID in the info data. If there is an overlap, retain only the normal Sample_ID.
<- CPTAC_sheet |>
check_cptac_overlap select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
unique() |>
separate_rows(`Sample ID`, sep = ",\\s*")
<- check_cptac_overlap |>
check_cptac_overlap mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))
<- check_cptac_overlap |>
overlap_samples group_by(`Sample ID`) |>
summarise(Count = n(), .groups = 'drop') |>
filter(Count > 1)
### > overlap_samples
### A tibble: 0 × 2
### ℹ 2 variables: Sample ID <chr>, Count <int>
### no this situation
### Due to the less rigorous patient naming conventions(like 604) in the CPTAC2 project, check for any overlapping patient names between CPTAC2 and CPTAC3.
<- CPTAC_sheet |>
CPTAC_sheet_rigorous select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
unique()
<- CPTAC_sheet_rigorous |>
duplicates group_by(`Case ID`, `Sample ID`) |>
summarise(n_projects = n(), projects = paste(`Project ID`, collapse = ", ")) |>
filter(n_projects > 1)
### nrow(duplicates)
### [1] 0
### no same patient names between CPTAC2 and CPTAC3
### CPTAC_sheet pretreatment
<- CPTAC_sheet |>
matched_CPTAC_pretreatment select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
rename("Patient_ID" = `Case ID`,
"Sample_ID" = `Sample ID`,
"Sample_type" = `Sample Type`,
"Project_id" = `Project ID`) |>
separate_rows(Patient_ID, sep = ",\\s*") |> #because they are from identical patient.
unique()
### CPTAC_info pretreatment
<- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")
CPTAC_info # > CPTAC_info$Cancer |> unique()
# [1] "BRCA" "CCRCC" "COAD" "GBM" "HGSC" "HNSCC" "LSCC" "LUAD" "PDAC" "UCEC" "GTEX-Brain"
### We use our naming standard mutate a new column cohort.
<- c(
cptac_replacement_table2 "BRCA" = "CPTAC-BRCA",
"CCRCC" = "CPTAC-KIRC",
"COAD" = "CPTAC-COAD",
"GBM" = "CPTAC-GBM",
"HGSC" = "CPTAC-OV",
"HNSCC" = "CPTAC-HNSC",
"LSCC" = "CPTAC-LUSC" ,
"LUAD" = "CPTAC-LUAD",
"PDAC" = "CPTAC-PAAD",
"UCEC" = "CPTAC-UCEC",
"GTEX-Brain" = "CPTAC-GTEX-Brain"
)
<- CPTAC_info$Cancer %in% names(cptac_replacement_table2)
matched_CPTAC_info <- CPTAC_info %>%
CPTAC_info_pretreatment mutate(cohort = case_when(
~ cptac_replacement_table2[Cancer],
matched_CPTAC_info TRUE ~ as.character(NA)
|>
)) select(-Sample_ID)
### CPTAC_info_pretreatment |> filter(is.na(cohort)) |> nrow()
### [1] 0
<- matched_CPTAC_pretreatment |>
matched_CPTAC left_join(CPTAC_info_pretreatment, by = c("Patient_ID" = "Patient_ID")) |>
mutate(Run = str_replace_all(Sample_ID, ", ", "-"))
### CPTAC Sample type
### matched_CPTAC_pretreatment$Sample_type |> unique()
### [1] "Primary Tumor" "Solid Tissue Normal"
### [3] "Primary Tumor, Primary Tumor" "Primary Tumor, Primary Tumor, Primary Tumor"
### [5] "Solid Tissue Normal, Solid Tissue Normal, Solid Tissue Normal" "Solid Tissue Normal, Solid Tissue Normal"
### [7] "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor" "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor"
###
### Just two sample type so if grepling "Solid Tissue Normal" is normal sample, grepling "Primary Tumor" is tumor sample
<- matched_CPTAC |>
matched_CPTAC_after_processing mutate(cohort = case_when(
grepl("Solid Tissue Normal", Sample_type) ~ "CPTAC-NORMAL",
TRUE ~ cohort
))
### make sure nrow is same between Before and After Processing.
|> nrow()
matched_CPTAC_after_processing ### [1] 2558
|> nrow()
matched_CPTAC_pretreatment ### [1] 2558
### make sure IOBR/TME data just contain sample which cptac_sheet gave.
<- matched_CPTAC_after_processing$Run |> unique()
detected_id_CPTAC
### save info
<- unique(matched_CPTAC_after_processing$cohort)
cohorts
for (cohort1 in cohorts) {
<- subset(matched_CPTAC_after_processing, cohort == cohort1)
cohort_data <- gsub("-", "_", cohort1)
name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
file_name fwrite(cohort_data , file = file_name, sep = "\t")
}
## TME
<- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC2_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC3_TME <- rbind(CPTAC2_TME, CPTAC3_TME)
CPTAC_TME
### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
###
### CPTAC_TME can match
<- CPTAC_TME %>%
matched_can filter(ID %in% matched_CPTAC_after_processing$Run)
###
### CPTAC_TME can't match
<- CPTAC_TME %>%
matched_not filter(!(ID %in% matched_CPTAC_after_processing$Run))
###
### situtation 1: Due to the barcode naming convention and etc, some portion character are added to the endning of the IOBR/TME sample name.
### str_count(matched_last$ID, "_")
### [1] 1 2 0
### 1.1 "CPT000817-0008" and "CPT000817". If the value of str_count(matched_last$ID, "_") for a row is 0, then the value of the prefixes column for that row should be the characters before the last -, and the value of the suffixes column should be the characters after the last -.
### 1.2 "f6a7f764-2495-4a1d-89a2-e12f51_D7_1" and "f6a7f764-2495-4a1d-89a2-e12f51". If the value of str_count(matched_last$ID, "_") for a row is 1, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.
### 1.3 "CPT008756_0001" and "CPT008756". f6a7f764-2495-4a1d-89a2-e12f51_D7_1, .If the value of str_count(matched_last$ID, "_") for a row is 2, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.
<- matched_not %>%
matched_not rowwise() %>%
mutate(
prefixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1],
TRUE ~ NA_character_
),suffixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],
TRUE ~ NA_character_
),name1 = prefixes
%>%
) ungroup()
### situtation 1 after processing can match
<- matched_not %>%
matched_in_situtation1 filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
mutate(ID = name1) %>%
select(-c(suffixes,prefixes,name1) )
<- rbind(matched_can, matched_in_situtation1)
matched_can
### not in situtation 1
<- matched_not %>%
matched_not_situtation1 filter(!(name1 %in% matched_CPTAC_after_processing$Run))
### situtation 2: Due to IOBR package analyze code, an additional "X" is added to the beginning of the IOBR/TME sample name.
### "X1104806" and "1104806", "X9f905736-f662-41d6-b3ac-16758d" and "9f905736-f662-41d6-b3ac-16758d".
<- matched_not_situtation1 %>%
matched_not_situtation1 mutate(name2 = str_remove(ID, "^X"))
### situtation 2 after processing can match
<- matched_not_situtation1 %>%
matched_in_situtation2 filter((name2 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name2 ) %>%
select(-c(name1,prefixes,suffixes,name2))
<- rbind(matched_can, matched_in_situtation2)
matched_can
### situtation 2 after processing can't match
<- matched_not_situtation1 %>%
matched_not_situtation2 filter(!(name2 %in% matched_CPTAC_after_processing$Run))
### situtation 3: Mixed situation1 and situation2.
### X161317f3-1a39-438e-af45-cd2502_D7 and 161317f3-1a39-438e-af45-cd2502
<- matched_not_situtation2 %>%
matched_not_situtation2 mutate(name3 = str_remove(prefixes, "^X"))
### situtation 3 after processing can match
<- matched_not_situtation2 %>%
matched_in_situtation3 filter((name3 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name3) %>%
select(-c(name1,name2,prefixes,suffixes, name3))
<- rbind(matched_can, matched_in_situtation3)
matched_can
### situtation 3 after processing can't match
<- matched_not_situtation2 %>%
matched_not_situtation3 filter(!(name3 %in% matched_CPTAC_after_processing$Run))
nrow(matched_not_situtation3 )
### [1] 81
### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.
### Check if CPTAC TME data ID are any samples that have been divided into multiple portions.
|> select(ID) |> nrow()
matched_can ### [1] 2558
|> select(ID) |> unique() |> nrow()
matched_can ### [1] 2558
### No this situation.
### Check if CPTAC TME data has same norw with CPTAC info data.
|> select(ID) |> unique() |> nrow()
matched_can ### [1] 2558
|> select(Run) |> unique() |> nrow()
matched_CPTAC_after_processing ### [1] 2558
### Sure.
<- matched_CPTAC_after_processing |>
cohort_data select(Run, cohort) |>
unique()
<- matched_can |>
matched_can mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
<- unique(matched_can$cohort)
cohorts_TME
for (cohort1 in cohorts_TME) {
<- subset(matched_can, cohort == cohort1) |> select(-cohort)
cohort_data <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
file_name write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
#IOBR
<- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC2_Signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC3_Signature <- rbind(CPTAC2_Signature, CPTAC3_Signature)
CPTAC_Signature
### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
###
### CPTAC_TME can match
<- CPTAC_Signature %>%
matched_can filter(ID %in% matched_CPTAC_after_processing$Run)
###
### CPTAC_TME can't match
<- CPTAC_Signature %>%
matched_not filter(!(ID %in% matched_CPTAC_after_processing$Run))
###
### situtation 1:like TME
<- matched_not %>%
matched_not rowwise() %>%
mutate(
prefixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1],
TRUE ~ NA_character_
),suffixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],
TRUE ~ NA_character_
),name1 = prefixes
%>%
) ungroup()
### situtation 1 after processing can match
<- matched_not %>%
matched_in_situtation1 filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
mutate(ID = name1) %>%
select(-c(suffixes,prefixes,name1) )
<- rbind(matched_can, matched_in_situtation1)
matched_can
### situtation 1 after processing can't match
<- matched_not %>%
matched_not_situtation1 filter(!(name1 %in% matched_CPTAC_after_processing$Run))
### situtation 2:like TME
<- matched_not_situtation1 %>%
matched_not_situtation1 mutate(name2 = str_remove(ID, "^X"))
### situtation 2 after processing can match
<- matched_not_situtation1 %>%
matched_in_situtation2 filter((name2 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name2 ) %>%
select(-c(name1,prefixes,suffixes,name2))
<- rbind(matched_can, matched_in_situtation2)
matched_can
### situtation 2 after processing can't match
<- matched_not_situtation1 %>%
matched_not_situtation2 filter(!(name2 %in% matched_CPTAC_after_processing$Run))
### situtation 3:like TME
<- matched_not_situtation2 %>%
matched_not_situtation2 mutate(name3 = str_remove(prefixes, "^X"))
### situtation 3 after processing can match
<- matched_not_situtation2 %>%
matched_in_situtation3 filter((name3 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name3) %>%
select(-c(name1,name2,prefixes,suffixes, name3))
<- rbind(matched_can, matched_in_situtation3)
matched_can
### situtation 3 after processing can't match
<- matched_not_situtation2 %>%
matched_not_situtation3 filter(!(name3 %in% matched_CPTAC_after_processing$Run))
nrow(matched_not_situtation3 )
### [1] 81
### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.
### Check if CPTAC IOBR data ID are any samples that have been divided into multiple portions.
|> select(ID) |> nrow()
matched_can ### [1] 2558
|> select(ID) |> unique() |> nrow()
matched_can ### [1] 2558
### No this situation.
### Check if CPTAC IOBR data has same norw with CPTAC info data.
|> select(ID) |> unique() |> nrow()
matched_can ### [1] 2558
|> select(Run) |> unique() |> nrow()
matched_CPTAC_after_processing ### [1] 2558
### Sure.
<- matched_CPTAC_after_processing |>
cohort_data select(Run, cohort) |>
unique()
<- matched_can |>
matched_can mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
<- unique(matched_can$cohort)
cohorts_IOBR
for (cohort1 in cohorts_IOBR) {
<- subset(matched_can, cohort == cohort1) |> select(-cohort)
cohort_data <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
file_name write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
## Gene
<- metafusion |>
metafusion1 mutate(
cohort = case_when(
%in% matched_CPTAC_after_processing$Run & grepl("CPTAC", cohort) ~ matched_CPTAC_after_processing$cohort[match(Sample_ID, matched_CPTAC_after_processing$Run)],
Sample_ID TRUE ~ cohort
)
)
### In the past, the metafusion file had an issue with the sample_type being incorrectly labeled.
### After further investigation, it was found that only the composite Sample_ID in CPTAC had this issue, where samples labeled as 'Normal' were mistakenly identified as 'Tumor' in the sample_type column.
<- matched_CPTAC_after_processing |>
normal_id filter(cohort == "CPTAC-NORMAL") |>
select(Run) |>
unique() |>
pull(Run)
<- metafusion1 |>
meta_normal filter(Sample_ID %in% normal_id)
unique(meta_normal$sample_type)
#[1] "Normal" "Tumor"
<- metafusion1 |>
meta_processing mutate(
sample_type = ifelse(
!= "Normal" & Sample_ID %in% normal_id,
sample_type "Normal",
sample_type
)
)
<- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
cp_file_name1 fwrite(meta_processing, file = cp_file_name1, sep = "\t")
TCGA
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
<- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
file_meta <- fread(file_meta , sep = "\t")
metafusion
# TCGA
## info
<- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")
TCGA_info ### TCGA_sheet pretreatment
<- fread("/home/data2/Projects/Fusion/unclean/TCGA_sample_sheet.2024-05-14.tsv", sep = "\t") |>
TCGA_sheet select(c(`Sample ID`, `Sample Type`,`Project ID`))
colnames(TCGA_sheet) <- c("Sample_ID", "Sample_type","Project_id")
<- TCGA_sheet |>
TCGA_sheet_pretreatment mutate(Sample_ID = substr(Sample_ID, 1, 15)) |>
unique()
### We use our naming standard mutate a new column cohort.
<- TCGA_sheet_pretreatment |>
TCGA_after_processing left_join(TCGA_info, by = c("Sample_ID" = "Sample_ID")) |>
mutate(
cohort = Project_id,
Run = Sample_ID,
Patient_ID = substr(Sample_ID, 1, 12),
)### TCGA Sample type
### TCGA_after_processing$Sample_type |> unique()
### [1] "Primary Tumor" "Solid Tissue Normal" "Metastatic"
### [4] "Additional - New Primary" "Primary Blood Derived Cancer - Peripheral Blood" "Additional Metastatic"
### [7] "Recurrent Tumor"
###
### "Solid Tissue Normal" is normal sample, others is tumor sample
<- TCGA_after_processing |>
TCGA_NORMAL filter(Sample_type == "Solid Tissue Normal")
<- TCGA_NORMAL$Run |> unique()
NORMAL_id
<- TCGA_after_processing |>
TCGA_after_processing mutate(
Project_id = cohort,
cohort = case_when(
%in% NORMAL_id ~ "TCGA-NORMAL",
Run TRUE ~ cohort))
### make sure nrow is same between Before and After Processing.
|> nrow()
TCGA_after_processing ### [1] 10545
|> nrow()
TCGA_sheet_pretreatment ### [1] 10545
### make sure IOBR/TME data just contain sample which tcga_sheet gave.
<- TCGA_after_processing$Run |> unique()
detected_id_TCGA
### save info
<- unique(TCGA_after_processing$cohort)
cohorts
<- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/")
file_name1
for (cohort1 in cohorts) {
<- subset(TCGA_after_processing, cohort == cohort1)
cohort_data <- gsub("-", "_", cohort1)
name <- paste0(file_name1, name ,"_info.tsv")
file_name fwrite(cohort_data , file = file_name, sep = "\t")
} ## Gene
<- metafusion |>
metafusion1 mutate(
cohort = case_when(
%in% TCGA_after_processing$Run & grepl("TCGA", cohort) ~ TCGA_after_processing$cohort[match(Sample_ID, TCGA_after_processing$Run)],
Sample_ID TRUE ~ cohort
)
) <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
cp_file_name1 fwrite(metafusion1, file = cp_file_name1, sep = "\t")
## IOBR
<- "/home/data2/Projects/Fusion/fusiondb/IOBR"
folder_path
<- "^TCGA.*_IOBR_signature.tsv.gz$"
file_pattern <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)
file_list
<- data.table()
combined_data
for (file in file_list) {
<- fread(file, sep = "\t")
file_data <- basename(file)
file_name <- sub("^(TCGA-[^_]+).*", "\\1", file_name)
project_id <- file_data %>%
file_data mutate(Project_id = project_id)
<- rbind(combined_data, file_data)
combined_data
}
### Filter TCGA Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
###
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
###
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
###
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
### <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>
<- combined_data |>
result_data mutate(
prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars))
<- result_data |>
result_IOBR group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(cohort = Project_id,
ID = substr(prefixes, 1, 15),
cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
TRUE ~ cohort)
|>
) filter(ID %in% detected_id_TCGA) |>
select(-c("suffixes", "second_chars","prefixes")) |>
unique()
<- unique(result_IOBR$cohort)
cohorts_IOBR
for (cohort1 in cohorts_IOBR) {
<- subset(result_IOBR, cohort == cohort1) |> select(-c(cohort,Project_id) )
cohort_data <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
file_name write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
## TME
<- "/home/data2/Projects/Fusion/fusiondb/IOBR"
folder_path <- "^TCGA.*_TME.tsv.gz$"
file_pattern <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)
file_list
<- data.table()
combined_data1
for (file in file_list) {
<- fread(file, sep = "\t")
file_data
<- basename(file)
file_name <- sub("^(TCGA-[^_]+).*", "\\1", file_name)
project_id
<- file_data %>%
file_data mutate(Project_id = project_id)
<- rbind(combined_data1, file_data)
combined_data1
}
<- combined_data1 |>
result_data1 mutate(
prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars))
<- result_data1 |>
result_TME group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(cohort = Project_id,
ID = substr(prefixes, 1, 15),
cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
TRUE ~ cohort)
|>
) filter(ID %in% detected_id_TCGA) |>
select(-c("suffixes", "second_chars","prefixes")) |>
unique()
<- unique(result_TME$cohort)
cohorts_TME
for (cohort1 in cohorts_TME) {
<- subset(result_TME, cohort == cohort1) |> select(-c(cohort,Project_id) )
cohort_data <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
file_name write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
Fusion Event Scoring and Classification
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
metafusion <- fread(file_meta , sep = "\t")
# Stratification Strategy
## Normal data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If one fusion event's max_split_cnt >= 1 and max_span_cnt >= 1, +1 score.
### 3. If Arriba and STAR-Fusion detected identical fusion event, +1 score.
### 4. If past FusionAnnotator annotated one normal fusion event, +2 score(Because it is a strong evidence)
Normal_data <- metafusion |>
filter(sample_type == "Normal") |>
mutate(Normal_score = 0)
### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
blocklist == TRUE ~ 0,
TRUE ~ Normal_score
),
score0 = case_when(
blocklist == TRUE ~ TRUE,
TRUE ~ FALSE
)
)
### 2. Reads counts
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
max_split_cnt >= 1 & max_span_cnt >= 1 & score0 == FALSE ~ Normal_score + 1 ,
TRUE ~ Normal_score
)
)
### 3. Tools detected
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
tools == "arriba,starfusion" & score0 == FALSE ~ Normal_score + 1,
TRUE ~ Normal_score
)
)
### 4. Fusionanntator annotate
normal_markers <- c("GTEx_recurrent_StarF2019", "DGD_PARALOGS", "HGNC_GENEFAM", "BodyMap", "Greger_Normal", "ConjoinG", "Babiceanu_Normal")
#### Past Red Herrings summary didn't contain TCGA, TARGET, CPTAC(except for GTEx-Brain) normal sample data.
#### reference:https://github.com/FusionAnnotator/CTAT_HumanFusionLib/wiki
#### BodyMap 40
#### Babiceanu_Normal 239
#### ConjoinG 698
#### DGD_PARALOGS 14463
#### Greger_Normal 113
#### GTEx_recurrent_StarF2019 4047
#### HGNC_GENEFAM Unknown
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
rowSums(sapply(normal_markers, grepl, x = cancer_db_hits)) >= 1 & score0 == FALSE ~ Normal_score + 2,
TRUE ~ Normal_score )
)
### Nomral true
Normal_data_true <- Normal_data |>
group_by(gene1, gene2) %>%
slice_max(Normal_score) %>%
ungroup() %>%
filter(Normal_score >= 2)
## Tumor data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If fusion event was in our detcted normal fusion(score>=2), it will be given a score of 0 and will not be changed.
### 3. If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 1, and at least one of them is less than 5, +1 score.
### If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 5, +2 score.
### 4. If Arriba detected exclusive fusion event, +0 score;
### If STAR-Fusion detected exclusive fusion event, +0 score;
### If Arriba and STAR-Fusion detected identical fusion event, +1 score;
### 5. If past FusionAnnotator annotated one tumor fusion event, +2 score(Because it is a strong evidence)
Tumor_data <- metafusion |>
filter(sample_type == "Tumor" | sample_type == "Blood") |>
mutate(Tumor_score = 0)
### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
blocklist == TRUE ~ 0,
TRUE ~ Tumor_score
),
score0 = case_when(
blocklist == TRUE ~ TRUE,
TRUE ~ FALSE
)
)
### 2. Filter not in normal data.
Tumor_data <- Tumor_data %>%
mutate(gene_pair = paste0(gene1, "::", gene2))
Normal_data_true <- Normal_data_true %>%
mutate(gene_pair = paste0(gene1, "::", gene2))
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ 0,
TRUE ~ Tumor_score
),
score0 = case_when(
gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ TRUE,
TRUE ~ FALSE
)
)
### 3. Reads counts
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
max_split_cnt >= 5 & max_span_cnt >= 5 & score0 == FALSE ~ Tumor_score + 2,
(max_split_cnt >= 1 & max_span_cnt >= 1) & ((max_split_cnt < 5 & max_span_cnt >= 5)| (max_split_cnt >= 5 & max_span_cnt < 5)| (max_split_cnt < 5 & max_span_cnt < 5))~ Tumor_score + 1,
TRUE ~ Tumor_score
)
)
### 4. Tools detected
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
tools == "arriba,starfusion" & score0 == FALSE ~ Tumor_score + 1,
tools == "arriba" & score0 == FALSE ~ Tumor_score,
tools == "starfusion" & score0 == FALSE ~ Tumor_score,
TRUE ~ Tumor_score
)
)
### 5. Fusionanntator annotate
tumor_markers <- c("Mitelman", "chimerdb_omim", "chimerdb_pubmed", "Cosmic", "YOSHIHARA_TCGA", "ChimerKB", "ChimerPub", "ChimerSeq", "Klijn_CellLines", "Larsson_TCGA", "CCLE_StarF2019", "HaasMedCancer", "TCGA_StarF2019", "TumorFusionsNAR2018", "DEEPEST2019", "GUO2018CR_TCGA", "DepMap2023")
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
rowSums(sapply(tumor_markers, grepl, x = cancer_db_hits)) >= 1 ~ Tumor_score + 2,
TRUE ~ Tumor_score)
)
### New metafusion
Tumor_data1 <- Tumor_data |>
select(-c(gene_pair, score0)) |>
rename(gene_score = Tumor_score)
Normal_data1 <- Normal_data |>
select(-c( score0)) |>
rename(gene_score = Normal_score)
metafusion_new <- rbind(Normal_data1,Tumor_data1)
file_name <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls.star.tsv"
write.table(metafusion_new,
file = file_name,
sep = "\t",
row.names = FALSE,
quote = FALSE)