Methods
Gene Fusions identification and differential expression analysis
We employed two Gene Fusions(GFs) detection tools, including arriba, STAR-Fusion, to identify, parse, and annotate GFs junctions within each sample, using the human genome version hg38 as the reference. In our pursuit of biologically robust GFs identification within each cohort, we undertook an approach metafusion that amalgamated results from two GFs detection tools. Firstly, we exclusively retained Fusions located on chromosomes 1 to 22, as well as the X and Y chromosomes. Secondly, GFs were mandated to exhibit genomic overlap with gene regions as defined in the reference genome annotation file, specifically “gencode.v34.annotation.gtf.”. Thirdly to ensure the precision and recall in GFs prediction with different tools, one can either rely on consensus predictions or all predctions and choosing junction-crossing reads and reads mates which flank (span) the breakpoint. Fourthly, we offer FusionAnnotator to identify and prioritize GFs that have been previously reported in cancer or normal samples. Lastly, the detected data can be divided into Cis-SAGes and other categories to further explore the debate on what should be considered true fusion RNAs. This meticulous methodological approach was instrumental in ensuring the acquisition of biologically sound and reliable GFs datasets, thereby fortifying the integrity of our analysis within each cohort. This analysis facilitated comparisons between patient groups, such as those who responded to checkpoint immunotherapy versus non-responders, as well as between samples collected before and during ICB treatment.
Response and Response2
We gathered and processed patient ‘Response’ data from various cohorts. However, due to inconsistencies among these cohorts, such as the usage of different terms like CR, PR, SD, PD, R, NR, DCB, and NDB, we harmonized the response variable based on the timing of treatment and treatment response. As a result, we introduced a new variable called ‘Response2’ with four distinct categories: Pre-R, Pre-NR, On-R, and On-NR. In cases where response data was not available, we assigned the category NE.
Cancer Type Abbreviation
| Abbr. | Description |
|---|---|
| BLCA | Bladder urothelial carcinoma |
| SKCM | Skin cutaneous melanoma |
| KIRC & CCRCC | Kidney renal clear cell carcinoma |
| HNSC & HNSCC | Head and neck squamous cell carcinoma |
| NSCLC | Non-small cell lung cancer, including lung adenocarcinoma and lung squamous cell carcinoma |
| SCLC | Small cell lung cancer |
| SARC | Sarcoma |
| BRCA | Breast cancer |
| GBM | Glioblastoma Multiforme |
| STAD | Stomach Adenocarcinoma |
| SGC | Salivary gland cancer |
| LUSC | Lung Squamous Cell Carcinoma |
| MESO | Mesothelioma |
| PCPG | Pheochromocytoma and Paraganglioma |
| LIHC | Liver Hepatocellular Carcinoma |
| PRAD | Prostate Adenocarcinoma |
| ACC | Adrenocortical Carcinoma |
| TGCT | Testicular Germ Cell Tumors |
| PAAD | Pancreatic Adenocarcinoma |
| UCEC | Uterine Corpus Endometrial Carcinoma |
| THCA | Thyroid carcinoma |
| CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma |
| ESCA | Esophageal carcinoma |
| READ | Rectum adenocarcinoma |
| OV | Ovarian Serous Cystadenocarcinoma |
| CHOL | Cholangiocarcinoma |
| UCS | Uterine Carcinosarcoma |
| LUAD | Lung Adenocarcinoma |
| LGG | Brain Lower Grade Glioma |
| HGSC | High-Grade Serous Carcinoma |
| LSCC & LUSC | Lung Squamous Cell Carcinoma |
| PDAC | Pancreatic Ductal Adenocarcinoma |
| THYM | Thymoma |
| KICH | Kidney Chromophobe |
| KIRP | Kidney Renal Papillary Cell Carcinoma |
| COAD | Colon Adenocarcinoma |
| LAML & AML | Acute Myeloid Leukemia |
| NBL | Neuroblastoma |
| GNB | Ganglioneuroblastoma |
ImmunoFusion cohort naming maping
| Cancer | Sample_type | Project_id | Immunofusion-cohort |
|---|---|---|---|
| BRCA | Primary Tumor | CPTAC-2 | CPTAC-BRCA |
| COAD | Primary Tumor | CPTAC-2 | CPTAC-COAD |
| HGSC | Primary Tumor | CPTAC-2 | CPTAC-OV |
| LSCC | Primary Tumor | CPTAC-3 | CPTAC-LUSC |
| UCEC | Primary Tumor | CPTAC-3 | CPTAC-UCEC |
| LUAD | Primary Tumor | CPTAC-3 | CPTAC-LUAD |
| GBM | Primary Tumor | CPTAC-3 | CPTAC-GBM |
| HNSCC | Primary Tumor | CPTAC-3 | CPTAC-HNSC |
| CCRCC | Primary Tumor | CPTAC-3 | CPTAC-KIRC |
| PDAC | Primary Tumor | CPTAC-3 | CPTAC-PAAD |
| PDAC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
| LUAD | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
| UCEC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
| CCRCC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
| HNSCC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
| LSCC | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
| GTEX-Brain | Solid Tissue Normal | CPTAC-3 | CPTAC-NORMAL |
| – | Primary Blood Derived Cancer - Peripheral Blood/Recurrent Blood Derived Cancer - Peripheral Blood/Next Generation Cancer Model/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Bone Marrow | TARGET-AML | TARGET-LAML |
| Acute myeloid leukemia, NOS | Primary Blood Derived Cancer - Bone Marrow/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Bone Marrow/Post-treatment/Recurrent Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Peripheral Blood | TARGET-AML | TARGET-LAML |
| Neuroblastoma, NOS | Primary Tumor/Recurrent Tumor/Recurrent Blood Derived Cancer - Bone Marrow | TARGET-NBL | TARGET-NBL |
| Ganglioneuroblastoma | Primary Tumor/Recurrent Tumor | TARGET-NBL | TARGET-GNB |
| – | Cell Lines | TARGET-AML | TARGET-CELL |
| – | Bone Marrow Normal | TARGET-AML | TARGET-NORMAL |
| Acute myeloid leukemia, NOS | Blood Derived Normal/Bone Marrow Normal | TARGET-AML | TARGET-NORMAL |
| ACC | Primary Tumor | TCGA-ACC | TCGA-ACC |
| BLCA | Primary Tumor | TCGA-BLCA | TCGA-BLCA |
| BRCA | Primary Tumor | TCGA-BRCA | TCGA-BRCA |
| BRCA | Metastatic | TCGA-BRCA | TCGA-BRCA |
| CESC | Primary Tumor | TCGA-CESC | TCGA-CESC |
| CESC | Metastatic | TCGA-CESC | TCGA-CESC |
| CHOL | Primary Tumor | TCGA-CHOL | TCGA-CHOL |
| COAD | Primary Tumor | TCGA-COAD | TCGA-COAD |
| COAD | Recurrent Tumor | TCGA-COAD | TCGA-COAD |
| COAD | Metastatic | TCGA-COAD | TCGA-COAD |
| DLBC | Primary Tumor | TCGA-DLBC | TCGA-DLBC |
| ESCA | Primary Tumor | TCGA-ESCA | TCGA-ESCA |
| ESCA | Metastatic | TCGA-ESCA | TCGA-ESCA |
| GBM | Primary Tumor | TCGA-GBM | TCGA-GBM |
| GBM | Recurrent Tumor | TCGA-GBM | TCGA-GBM |
| HNSC | Primary Tumor | TCGA-HNSC | TCGA-HNSC |
| HNSC | Metastatic | TCGA-HNSC | TCGA-HNSC |
| KICH | Primary Tumor | TCGA-KICH | TCGA-KICH |
| KIRC | Primary Tumor | TCGA-KIRC | TCGA-KIRC |
| KIRC | Additional - New Primary | TCGA-KIRC | TCGA-KIRC |
| KIRP | Primary Tumor | TCGA-KIRP | TCGA-KIRP |
| KIRP | Additional - New Primary | TCGA-KIRP | TCGA-KIRP |
| LAML | Primary Blood Derived Cancer - Peripheral Blood | TCGA-LAML | TCGA-LAML |
| LGG | Primary Tumor | TCGA-LGG | TCGA-LGG |
| LGG | Recurrent Tumor | TCGA-LGG | TCGA-LGG |
| LIHC | Primary Tumor | TCGA-LIHC | TCGA-LIHC |
| LIHC | Recurrent Tumor | TCGA-LIHC | TCGA-LIHC |
| LUAD | Primary Tumor | TCGA-LUAD | TCGA-LUAD |
| LUAD | Recurrent Tumor | TCGA-LUAD | TCGA-LUAD |
| LUSC | Primary Tumor | TCGA-LUSC | TCGA-LUSC |
| MESO | Primary Tumor | TCGA-MESO | TCGA-MESO |
| OV | Primary Tumor | TCGA-OV | TCGA-OV |
| OV | Recurrent Tumor | TCGA-OV | TCGA-OV |
| PAAD | Primary Tumor | TCGA-PAAD | TCGA-PAAD |
| PAAD | Metastatic | TCGA-PAAD | TCGA-PAAD |
| PCPG | Primary Tumor | TCGA-PCPG | TCGA-PCPG |
| PCPG | Additional - New Primary | TCGA-PCPG | TCGA-PCPG |
| PCPG | Metastatic | TCGA-PCPG | TCGA-PCPG |
| PRAD | Primary Tumor | TCGA-PRAD | TCGA-PRAD |
| PRAD | Metastatic | TCGA-PRAD | TCGA-PRAD |
| READ | Primary Tumor | TCGA-READ | TCGA-READ |
| READ | Recurrent Tumor | TCGA-READ | TCGA-READ |
| SARC | Primary Tumor | TCGA-SARC | TCGA-SARC |
| SARC | Recurrent Tumor | TCGA-SARC | TCGA-SARC |
| SARC | Metastatic | TCGA-SARC | TCGA-SARC |
| SKCM | Metastatic | TCGA-SKCM | TCGA-SKCM |
| SKCM | Primary Tumor | TCGA-SKCM | TCGA-SKCM |
| SKCM | Additional Metastatic | TCGA-SKCM | TCGA-SKCM |
| STAD | Primary Tumor | TCGA-STAD | TCGA-STAD |
| TGCT | Primary Tumor | TCGA-TGCT | TCGA-TGCT |
| TGCT | Additional - New Primary | TCGA-TGCT | TCGA-TGCT |
| THCA | Primary Tumor | TCGA-THCA | TCGA-THCA |
| THCA | Metastatic | TCGA-THCA | TCGA-THCA |
| THYM | Primary Tumor | TCGA-THYM | TCGA-THYM |
| UCEC | Primary Tumor | TCGA-UCEC | TCGA-UCEC |
| UCEC | Recurrent Tumor | TCGA-UCEC | TCGA-UCEC |
| UCS | Primary Tumor | TCGA-UCS | TCGA-UCS |
| UVM | Primary Tumor | TCGA-UVM | TCGA-UVM |
| BLCA | Solid Tissue Normal | TCGA-BLCA | TCGA-NORMAL |
| BRCA | Solid Tissue Normal | TCGA-BRCA | TCGA-NORMAL |
| CESC | Solid Tissue Normal | TCGA-CESC | TCGA-NORMAL |
| CHOL | Solid Tissue Normal | TCGA-CHOL | TCGA-NORMAL |
| COAD | Solid Tissue Normal | TCGA-COAD | TCGA-NORMAL |
| ESCA | Solid Tissue Normal | TCGA-ESCA | TCGA-NORMAL |
| HNSC | Solid Tissue Normal | TCGA-HNSC | TCGA-NORMAL |
| KICH | Solid Tissue Normal | TCGA-KICH | TCGA-NORMAL |
| KIRC | Solid Tissue Normal | TCGA-KIRC | TCGA-NORMAL |
| KIRP | Solid Tissue Normal | TCGA-KIRP | TCGA-NORMAL |
| LIHC | Solid Tissue Normal | TCGA-LIHC | TCGA-NORMAL |
| LUAD | Solid Tissue Normal | TCGA-LUAD | TCGA-NORMAL |
| LUSC | Solid Tissue Normal | TCGA-LUSC | TCGA-NORMAL |
| PAAD | Solid Tissue Normal | TCGA-PAAD | TCGA-NORMAL |
| PCPG | Solid Tissue Normal | TCGA-PCPG | TCGA-NORMAL |
| PRAD | Solid Tissue Normal | TCGA-PRAD | TCGA-NORMAL |
| READ | Solid Tissue Normal | TCGA-READ | TCGA-NORMAL |
| SARC | Solid Tissue Normal | TCGA-SARC | TCGA-NORMAL |
| SKCM | Solid Tissue Normal | TCGA-SKCM | TCGA-NORMAL |
| STAD | Solid Tissue Normal | TCGA-STAD | TCGA-NORMAL |
| THCA | Solid Tissue Normal | TCGA-THCA | TCGA-NORMAL |
| THYM | Solid Tissue Normal | TCGA-THYM | TCGA-NORMAL |
| UCEC | Solid Tissue Normal | TCGA-UCEC | TCGA-NORMAL |
Clinical data preprocessing
TCGA
library(data.table)
library(dplyr)
# load("/home/data2/Projects/Fusion/unclean/TCGA_ALL_clinical.rdata")
#
# TCGA_ALL_clinical |>
# rename(Patient_ID = bcr_patient_barcode,
# Age = age_at_initial_pathologic_diagnosis,
# Sex = gender,
# Race = race,
# )
# [Discrepancy] [Not Applicable] [Not Available]
# [Discrepancy] [Not Available] [Unknown]
# # [Discrepancy] [Not Applicable]
# 29 105
# [Not Available] [Not Evaluated]
# 5140 4
# [Unknown]
library(UCSCXenaShiny)
table(tcga_clinical_fine$Stage_ajcc)
tcga_clinical
tcga_clinical_fine
tcga_genome_instability
tcga_purity
tcga_subtypes
tcga_surv
nrow(tcga_clinical_fine)
nrow(unique(tcga_clinical_fine))
tcga_clinical = unique(tcga_clinical_fine) |>
left_join(tcga_surv, by = c("Sample"="sample")) |>
left_join(tcga_purity, by = c("Sample"="sample"))
tcga_clinical[551, ]$Sample
sum(startsWith(tcga_genome_instability$sample, "TCGA"))
tcga_clinical = tcga_clinical |>
left_join(tcga_genome_instability[startsWith(tcga_genome_instability$sample, "TCGA") & !duplicated(tcga_genome_instability$sample), ], by = c("Sample"="sample")) |>
#left_join(tcga_genome_instability[duplicated(tcga_genome_instability$sample), ], by = c("Sample"="sample")) |>
left_join(tcga_subtypes, by = c("Sample"="sampleID"))
sum(is.na(tcga_clinical$Sample))
sum(duplicated(tcga_clinical$Sample))
tcga_clinical[duplicated(tcga_clinical$Sample), ]$Sample
tcga_clinical.final = tcga_clinical[!duplicated(tcga_clinical$Sample), ] |>
rename(Sample_ID = Sample, Sex = Gender, OS_Time = OS.time, OS_Status = OS, DSS_Time = DSS.time, DSS_Status = DSS, DFS_Time = DFI.time, DFS_Status = DFI, PFS_Time = PFI.time, PFS_Status = PFI) |>
select(-cancer_type) |>
mutate(Sex = ifelse(Sex == "FEMALE", "F", "M"),
Patient_ID = substr(Sample_ID, 1, 12))
sum(is.na(tcga_clinical.final$Sample_ID))
sum(is.na(tcga_clinical.final$Cancer))
sum(duplicated(tcga_clinical.final$Sample_ID))
tcga_clinical.final = tcga_clinical.final |>
select(Patient_ID, Sample_ID, everything()) |>
as.data.table()
tcga_clinical.final
fwrite(tcga_clinical.final, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")CPTAC
library(data.table)
# Clinical data from CPTAC paper ------------------------------------------
cli_list = fs::dir_ls("CPTAC_Clinical_meta_data_v1/")
cli = purrr::map(cli_list, fread) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli)
# Problem: Need to skip the second row
# This also includes samples from CPTAC-2/3, but the total number is fewer
cli2 = purrr::map(cli_list, function(x) {
# Since the column names are also different, read them twice—once for the column names and once for the data
data1 = fread(x)
data2 = fread(x, skip = 2, header = FALSE)
# Update the column names
colnames(data2) = colnames(data1)
data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
#colnames(cli2) = colnames(cli)
colnames(cli2)
cli2[Tumor == "Yes" & Normal == "Yes"] # This is to mark whether the patient has both tumor and normal samples
cli_list2 = fs::dir_ls("meta_data_BCM/")
cli3 = purrr::map(cli_list2, function(x) {
# Since the column names are also different, read them twice—once for the column names and once for the data
data1 = fread(x)
data2 = fread(x, skip = 2, header = FALSE)
# 更新列名
colnames(data2) = colnames(data1)
data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli3)
all.equal(cli2, cli3)
all.equal(sort(colnames(cli2)), sort(colnames(cli3)))
cli4 = cli3[, colnames(cli2), with = FALSE]
all.equal(cli2, cli4)
identical(cli2, cli4)
dplyr::setdiff(cli2, cli4) |> View()
dplyr::setdiff(cli4, cli2) |> nrow()
# rbind(cli2[idx == "01OV007"],
# cli4[idx == "01OV007"]) |> View()
# The files in the directories of the two data sources are basically identical
cli.final = cli2 |>
dplyr::rename(
Patient_ID = idx,
OS_Status = OS_event,
OS_Time = OS_days,
PFS_Status = PFS_event,
PFS_Time = PFS_days
) |>
dplyr::mutate(
Sample_ID = Patient_ID,
Sex = ifelse(Sex == "Female", "F", "M")
) |>
dplyr::select(
Patient_ID, Sample_ID, Cancer, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status, dplyr::everything()
)
# dplyr::select(
# - dplyr::starts_with("CIBERSORT"), - dplyr::starts_with("xCell")
# )
fwrite(cli.final, file = "CPTAC_cli_paper_combined.tsv", sep = "\t")
# Clinical data from GDC --------------------------------------------------
system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/CPTAC
mkdir CPTAC-3 CPTAC-2
tar zxvf clinical.project-cptac-2.2024-08-07.tar.gz -C CPTAC-2
tar zxvf clinical.project-cptac-3.2024-08-07.tar.gz -C CPTAC-3
")
cptac2 = fread("gdc_cli/CPTAC/CPTAC-2/clinical.tsv")
cptac3 = fread("gdc_cli/CPTAC/CPTAC-3/clinical.tsv")
which(colnames(cptac2) %in% "residual_disease")
which(colnames(cptac3) %in% "residual_disease")
all.equal(cptac2[[119]], cptac2[[192]])
all.equal(cptac3[[119]], cptac3[[192]])
cptac2[[192]] = NULL
cptac3[[192]] = NULL
setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# > setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
# [1] "01OV049" "26OV010" "100004028" "05BR058" "05BR055" "17OV019" "11BR069" "02OV042" "100004012" "05BR051" "03BR012" "02OV045" "04OV041"
# [14] "05BR052" "05BR031" "1488" "11OV009" "02OV040" "01OV002" "100002921" "22OV001" "13OV004" "100003304" "02OV035" "01OV046" "01OV045"
# [27] "17OV034"
# > setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# [1] "C3N-02672" "C3L-01929" "C3N-02682" "C3N-02012" "C3N-02070"
# [6] "C3N-02696" "C3L-02354" "C3N-03026" "C3L-01739" "C3N-02088"
# [11] "C3L-04213" "C3N-03205" "C3L-04081" "C3N-03791" "C3L-03679"
# [16] "C3L-02643" "C3N-04686" "C3L-01558" "C3L-03268" "C3L-01633"
# [21] "C3L-02220" "C3N-02439" "C3N-02028" "C3N-03800" "C3L-04392"
# [26] "C3L-02556" "C3N-02721" "C3L-01672" "C3L-00938" "C3L-02201"
# [31] "C3N-02146" "C3N-02639" "C3N-03788" "C3L-03733" "C3L-04037"
# [36] "C3N-01091" "C3L-02747" "C3N-01879" "C3N-02996" "C3N-02296"
# [41] "C3L-02553" "C3L-03632" "C3N-03755" "GTEX-NPJ7-0011-R10A-SM-HAKXW" "C3L-02746"
length(setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)) # 586
View(cptac2[case_submitter_id %in% setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)])
View(cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)])
data.table::fwrite(
cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)],
file = "CPTAC3_non_included.tsv", sep = "\t"
)
cptac3
# Supplement the missing sample information in the paper based on the existing information in the GDC. ---------------------------------------------
# might be columns that have labeled cancer types.
# primary_diagnosis site_of_resection_or_biopsy tissue_or_organ_of_origin
cptac = rbind(cptac2, cptac3)[, .(case_submitter_id, primary_diagnosis, site_of_resection_or_biopsy, tissue_or_organ_of_origin)] |> unique()
cptac = dplyr::left_join(cptac, cli.final[, .(Sample_ID, Cancer)], by = c("case_submitter_id"="Sample_ID"))
setdiff(cli.final$Sample_ID, cptac$case_submitter_id)
# 29 cases are not in the GDC data.
table(cptac$Cancer, cptac$primary_diagnosis) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$site_of_resection_or_biopsy) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$tissue_or_organ_of_origin) |> pheatmap::pheatmap()
table(cptac$Cancer,
paste(
cptac$primary_diagnosis,
cptac$site_of_resection_or_biopsy,
cptac$tissue_or_organ_of_origin, sep = "_"
)) |> pheatmap::pheatmap(show_colnames = FALSE)
# Solve them one by one.
cat(sub("_meta.txt", "", basename(names(cli_list))))
typelist = sub("_meta.txt", "", basename(names(cli_list)))
# BRCA CCRCC COAD GBM HGSC HNSCC LSCC LUAD PDAC UCEC
for (i in typelist) {
message("check type ", i)
cptac_part = cptac |> dplyr::filter(Cancer == i)
message(" primary_diagnosis")
table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
message(" site_of_resection_or_biopsy")
table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
message(" tissue_or_organ_of_origin")
table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}
# BRCA (122): tissue_or_organ_of_origin "Breast, NOS"
# CCRCC (103): primary_diagnosis "Renal cell carcinoma, NOS"
# tissue_or_organ_of_origin "Kidney, NOS"
# COAD (106): tissue_or_organ_of_origin "Colon, NOS", "Rectum, NOS"
# GBM (99): primary_diagnosis "Glioblastoma"
# HGSC (87): primary_diagnosis "Serous adenocarcinoma, NOS" AND
# tissue_or_organ_of_origin Fallopian tube Ovary Peritoneum, NOS
# HNSCC (108): Other situation: primary_diagnosis "Squamous cell carcinoma, NOS"
# Base of tongue, NOS Cheek mucosa Floor of mouth, NOS Gum, NOS
# HNSCC 1 2 18 3
#
# Head, face or neck, NOS Larynx, NOS Lip, NOS Oropharynx, NOS
# HNSCC 2 47 4 4
#
# Overlapping lesion of lip, oral cavity and pharynx Tongue, NOS Tonsil, NOS
# HNSCC
# LSCC (108): primary_diagnosis "Squamous cell carcinoma, NOS" AND
# tissue_or_organ_of_origin "lung" pattern
# LUAD (110): primary_diagnosis "Adenocarcinoma, NOS" AND
# tissue_or_organ_of_origin "lung" pattern
# PDAC (105): tissue_or_organ_of_origin "pancreas" pattern
# UCEC (95): primary_diagnosis "Endometrioid adenocarcinoma, NOS"
sum(!is.na(cptac$Cancer))
# rbind(cptac2, cptac3)[,192] # There are duplicate columns, need to go back and handle this earlier (fixed).
# Actual testing and updating
cptac_updated = rbind(cptac2, cptac3) |>
unique() |>
dplyr::mutate(
Cancer = dplyr::case_when(
tissue_or_organ_of_origin %in% "Breast, NOS" ~ "BRCA",
primary_diagnosis %in% "Renal cell carcinoma, NOS" ~ "CCRCC",
tissue_or_organ_of_origin %in% c("Colon, NOS", "Rectum, NOS") ~ "COAD",
grepl("Glioblastoma", primary_diagnosis, ignore.case = TRUE) | grepl("Oligodendroglioma", primary_diagnosis, ignore.case = TRUE) ~ "GBM",
grepl("GTEX", case_submitter_id, ignore.case = TRUE) ~ "GTEX-Brain",
case_submitter_id %in% c("C3L-06912", "C3L-07212") ~ "GBM",
#primary_diagnosis %in% "Glioblastoma" ~ "GBM",
primary_diagnosis %in% "Serous adenocarcinoma, NOS" & tissue_or_organ_of_origin %in% c("Fallopian tube", "Ovary", "Peritoneum, NOS") ~ "HGSC",
primary_diagnosis %in% "Squamous cell carcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LSCC",
primary_diagnosis %in% "Adenocarcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LUAD",
grepl("pancreas", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "PDAC",
primary_diagnosis %in% "Endometrioid adenocarcinoma, NOS" ~ "UCEC",
TRUE ~ "HNSCC"
)
) |> as.data.table()
for (i in "HNSCC") { # typelist
message("check type ", i)
cptac_part = cptac_updated |> dplyr::filter(Cancer == i)
message(" primary_diagnosis")
table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
message(" site_of_resection_or_biopsy")
table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
message(" tissue_or_organ_of_origin")
table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}
table(cptac$Cancer)
table(cptac_updated$Cancer)
cptac_updated |>
dplyr::filter(
primary_diagnosis == "'--" | site_of_resection_or_biopsy == "'--" |
tissue_or_organ_of_origin %in% c("'--", "Unknown")
)
# 1-9 from GTEX-Brain
# C3L-06912 C3L-07212 Gliomas # https://portal.gdc.cancer.gov/cases/e6027623-a5b2-4d82-a9ca-f37c948dd3ed
cptac_updated$case_submitter_id[grepl("GTEX", cptac_updated$case_submitter_id)]
nrow(cptac_updated)
fwrite(cptac_updated, file = "CPTAC_combined_with_cancer_annotated.tsv", sep = "\t")
cptac_unmatched = cptac_updated[! case_submitter_id %in% cli.final$Sample_ID]
colnames(cptac_unmatched)
colnames(cli.final)[1:20]
cptac_unmatched2 = cptac_unmatched |>
dplyr::select(
case_submitter_id, age_at_diagnosis, gender,
days_to_death, vital_status,
dplyr::starts_with("ajcc"),
dplyr::starts_with("days"),
last_known_disease_status, progression_or_recurrence, residual_disease,
tumor_grade, Cancer
)
sum(cptac_unmatched2$days_to_last_follow_up != "'--")
sum(cptac_unmatched2$days_to_death != "'--")
sum(cptac_unmatched2$days_to_recurrence != "'--")
sum(cptac_unmatched2$days_to_last_known_disease_status != "'--")
sum(cptac_unmatched2$progression_or_recurrence != "'--")
sum(!is.na(cli.final$OS_Time))
sum(!is.na(cli.final$PFS_Time))
sum(cli.final$OS_Time >= cli.final$PFS_Time, na.rm = TRUE)
cli.final[cli.final$OS_Time < cli.final$PFS_Time] |> View()
sum(as.integer(cptac_unmatched2$days_to_last_follow_up) >= as.integer(cptac_unmatched2$days_to_last_known_disease_status), na.rm = TRUE)
cptac_unmatched3 = cptac_unmatched2 |>
dplyr::rename(
Sample_ID = case_submitter_id,
Age = age_at_diagnosis,
Sex = gender,
OS_Time = days_to_last_follow_up,
OS_Status = vital_status,
PFS_Time = days_to_last_known_disease_status,
PFS_Status = progression_or_recurrence,
Histologic_Grade = tumor_grade,
Path_Stage_pT = ajcc_pathologic_t,
Path_Stage_pN = ajcc_pathologic_n,
Stage = ajcc_pathologic_stage
) |>
dplyr::select(
Sample_ID, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status,
Path_Stage_pN:Path_Stage_pT,
Histologic_Grade, Cancer
) |>
dplyr::mutate(
Patient_ID = Sample_ID,
Age = as.integer(round(as.integer(Age) / 365)),
Sex = fcase(Sex %in% "female", "F",
Sex %in% "male", "M", default = NA_character_),
OS_Time = as.integer(OS_Time),
OS_Status = fcase(OS_Status %in% "Alive", 0L,
OS_Status %in% "Dead", 1L, default = NA_integer_),
PFS_Time = as.integer(PFS_Time),
PFS_Status = fcase(PFS_Status %in% "no", 0L,
PFS_Status %in% "yes", 1L, default = NA_integer_),
OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time),
Path_Stage_pN = fcase(
Path_Stage_pN %in% "N0", "pN0",
Path_Stage_pN %in% c("N1", "N1a"), "pN1",
Path_Stage_pN %in% c("N2", "N2a"), "pN2",
default = NA_character_
),
Path_Stage_pT = fcase(
Path_Stage_pT %in% "T0", "pT0",
Path_Stage_pT %in% c("T1", "T1a", "T1b", "T1c"), "pT1",
Path_Stage_pT %in% c("T2", "T2a", "T2b"), "pT2",
Path_Stage_pT %in% c("T3", "T3a", "T3b"), "pT3",
Path_Stage_pT %in% "T4", "pT4",
default = NA_character_
),
Stage = fcase(
Stage %in% c("Stage I", "Stage IA", "Stage IA1", "Stage IA2", "Stage IA3", "Stage IB"), "Stage I",
Stage %in% c("Stage II", "Stage IIA", "Stage IIB"), "Stage II",
Stage %in% c("Stage III", "Stage IIIA", "Stage IIIB", "Stage IIIC", "Stage IIIC1", "Stage IIIC2"), "Stage III",
Stage %in% c("Stage IV", "Stage IVB"), "Stage IV",
default = NA_character_
),
Histologic_Grade = fcase(
Histologic_Grade %in% "G1", "G1 Well differentiated",
Histologic_Grade %in% "G2", "G2 Moderately differentiated",
Histologic_Grade %in% "G3", "G3 Poorly differentiated",
Histologic_Grade %in% "G4", "G4 Undifferentiated",
default = NA_character_
)
)
cptac_unmatched3
# X: cannot be assessed
table(cptac_unmatched3$Path_Stage_pN)
table(cli.final$Path_Stage_pN)
table(cptac_unmatched3$Path_Stage_pT)
table(cli.final$Path_Stage_pT)
table(cptac_unmatched3$Stage)
table(cli.final$Stage)
table(cptac_unmatched3$Histologic_Grade)
table(cli.final$Histologic_Grade)
# Final output ------------------------------------------------------------
cli.final.2 = dplyr::bind_rows(
cli.final, cptac_unmatched3
) |> as.data.table()
table(cli.final.2$Cancer)
sum(table(cli.final.2$Cancer))
# Include all Race information?
cli.final.2 = dplyr::left_join(
cli.final.2, unique(rbind(cptac2, cptac3)[, .(case_submitter_id, race)]),
by = c("Sample_ID"="case_submitter_id")
) |>
dplyr::mutate(
race = ifelse(race %in% c("Unknown", "not reported"), NA_character_,
race)
) |>
as.data.table()
sum(is.na(cli.final.2$race))
fwrite(cli.final.2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")TARGET
library(data.table)
cli1 = fread("TARGET_donor_allprojects_transfer_to_sample.gz")
cli2 = fread("TARGET_phenotype.gz")
target_cli = cli2 |>
dplyr::select(-`_cohort`) |>
dplyr::rename(Patient_ID = `_PATIENT`,
Sample_ID = sample_id,
disease_code = primary_disease_code,
disease = `_primary_disease`,
sample_type = `_sample_type`) |>
dplyr::full_join(
cli1 |> dplyr::select(
-`_OS_UNIT`, -`_EVENT`, -`_TIME_TO_EVENT`
) |>
dplyr::rename(
Sample_ID = xena_sample,
Sex = `_gender`,
Age = `_age_at_diagnosis`,
OS_Status = `_OS_IND`,
OS_Time = `_OS`
), by = "Sample_ID"
) |>
dplyr::mutate(
Sex = ifelse(Sex == "Male", "M", "F"),
OS_Status = ifelse(OS_Status == "Alive", 0L, 1L)
) |>
dplyr::select(Patient_ID, Sample_ID, Age, Sex, OS_Time, OS_Status, dplyr::everything()) |>
as.data.table()
cli1
target_cli
fwrite(target_cli, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t")
# Using cli from GDC ------------------------------------------------------
system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/TARGET
mkdir ALL_P1 ALL_P2 ALL_P3 AML CCSK NBL OS RT WT
tar zxvf clinical.project-target-all-p1.2024-08-07.tar.gz -C ALL_P1
tar zxvf clinical.project-target-all-p2.2024-08-07.tar.gz -C ALL_P2
tar zxvf clinical.project-target-all-p3.2024-08-07.tar.gz -C ALL_P3
tar zxvf clinical.project-target-aml.2024-08-07.tar.gz -C AML
tar zxvf clinical.project-target-ccsk.2024-08-07.tar.gz -C CCSK
tar zxvf clinical.project-target-nbl.2024-08-07.tar.gz -C NBL
tar zxvf clinical.project-target-os.2024-08-07.tar.gz -C OS
tar zxvf clinical.project-target-rt.2024-08-07.tar.gz -C RT
tar zxvf clinical.project-target-wt.2024-08-07.tar.gz -C WT
")
gdc_target = rbindlist(
list(
fread("gdc_cli/TARGET/ALL_P1/clinical.tsv"),
fread("gdc_cli/TARGET/ALL_P2/clinical.tsv"),
fread("gdc_cli/TARGET/ALL_P3/clinical.tsv"),
fread("gdc_cli/TARGET/AML/clinical.tsv"),
fread("gdc_cli/TARGET/CCSK/clinical.tsv"),
fread("gdc_cli/TARGET/NBL/clinical.tsv"),
fread("gdc_cli/TARGET/OS/clinical.tsv"),
fread("gdc_cli/TARGET/RT/clinical.tsv"),
fread("gdc_cli/TARGET/WT/clinical.tsv")
),
use.names = TRUE, fill = TRUE
)
table(gdc_target$project_id)
any(duplicated(gdc_target))
gdc_target = gdc_target[!duplicated(gdc_target)]
all.equal(gdc_target[[119]], gdc_target[[192]])
gdc_target[[192]] = NULL
sum(gdc_target$days_to_last_follow_up != "'--")
sum(gdc_target$days_to_death != "'--")
gdc_target2 = gdc_target |>
dplyr::rename(
Sample_ID = case_submitter_id,
Age = age_at_diagnosis,
Sex = gender,
OS_Time = days_to_death,
OS_Status = vital_status
) |>
dplyr::select(
Sample_ID, Age, Sex, OS_Time, OS_Status,
project_id, race, primary_diagnosis
) |>
dplyr::mutate(
Patient_ID = Sample_ID,
Age = as.integer(round(as.integer(Age) / 365)),
Sex = fcase(Sex %in% "female", "F",
Sex %in% "male", "M", default = NA_character_),
OS_Time = as.integer(OS_Time),
OS_Status = fcase(OS_Status %in% "Alive", 0L,
OS_Status %in% "Dead", 1L, default = NA_integer_),
OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time)
) |>
as.data.table()
#unique()
# exist a large number of case level replicate
gdc_target2[duplicated(gdc_target2), ]
duplicated(gdc_target[case_submitter_id == "TARGET-20-PANTIV"])
gdc_target[case_submitter_id == "TARGET-20-PANTIV"] |> View() # A few records that have been updated?
gdc_target2 = gdc_target2[!duplicated(gdc_target2)]
#table(gdc_target2$PFS_Time)
any(duplicated(gdc_target2))
any(duplicated(gdc_target2$Sample_ID))
gdc_target2[duplicated(gdc_target2$Sample_ID)]
# They may exist in different ALL projects.
gdc_dup = gdc_target2[Sample_ID %in% Sample_ID[duplicated(Sample_ID)]]
table(gdc_dup$project_id)
# TARGET-ALL-P1 TARGET-ALL-P2 TARGET-ALL-P3
# 10 57 47
#
# For now, we won't address this issue. If we conduct a combined analysis of ALL, we will consider removing duplicates.
colnames(gdc_target2)
gdc_target2 = gdc_target2[, c("Patient_ID", "Sample_ID", "Age", "Sex", "race", "primary_diagnosis", "OS_Time", "OS_Status", "project_id")]
fwrite(gdc_target2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t")
sum(!is.na(gdc_target2$OS_Time))
sum(!is.na(cli1$`_OS`))
nrow(na.omit(unique(target_cli[, .(Patient_ID, OS_Time, OS_Status)])))
nrow(na.omit(unique(gdc_target2[, .(Patient_ID, OS_Time, OS_Status)])))TCGA/TARGET/CPTAC Clinical/IOBR/TME/GENE data divide and anotate preprocessing
TARGET
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
metafusion <- fread(file_meta, sep = "\t")
# TARGET
## info
### Because some projects are not included in the target_sheet, we should abandon them.
TARGET_sheet <- fread("/home/data2/Projects/Fusion/unclean/TARGET_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")
project_choose <- substr(TARGET_sheet$`Sample ID`, 1, 9) |> unique()
project_choose
### "TARGET-20" "TARGET-00" "TARGET-21" "TARGET-30"
### TARGET_sheet pretreatment
### Composite Sample
###
### Composite Sample use comma to separate like TARGET-30-PAPEAV, TARGET-30-PAPTFZ
TARGET_Composite <- TARGET_sheet |>
filter(grepl(", ", `Sample ID`)) |>
select(`Sample ID`) |>
unique()
TARGET_Composite
### 1: TARGET-30-PAPEAV-01A, TARGET-30-PAPTFZ-01A
### 2: TARGET-30-PANUKV-01A, TARGET-30-PASUML-01A
### 3: TARGET-30-PANKFE-01A, TARGET-30-PAPUAR-01A
### 4: TARGET-30-PASYPX-01A, TARGET-30-PAIXIF-01A
### For the CPTAC databases, check whether there is any overlap between composite and normal Sample_ID in the info data.
check_TARGET_overlap <- TARGET_sheet |>
select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
unique() |>
separate_rows(`Sample ID`, sep = ",\\s*")
check_TARGET_overlap <- check_TARGET_overlap |>
mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))
overlap_samples <- check_TARGET_overlap |>
group_by(`Sample ID`) |>
summarise(Count = n(), .groups = 'drop') |>
filter(Count > 1)
overlap_samples
### A tibble: 4 × 2
### `Sample ID` Count
### <chr> <int>
### 1 TARGET-30-PANKFE-01A 2
### 2 TARGET-30-PAPTFZ-01A 2
### 3 TARGET-30-PASUML-01A 2
### 4 TARGET-30-PASYPX-01A 2
### So we choose abandon them.
TARGET_pretreatment <- TARGET_sheet |>
select(c(`Case ID`, `Sample ID`, `Sample Type`)) |>
rename("Patient_ID" = `Case ID`,
"Sample_ID" = `Sample ID`,
"Sample_type" = `Sample Type`) |>
filter(!grepl(",", Sample_ID)) |>
filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
unique()
### TARGET_xena_info pretreatment
### This data use
TARGET_xena_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t")
TARGET_xena_pretreatment <- TARGET_xena_info |>
mutate(Patient_ID = str_extract(Sample_ID, "^[^-]+-[^-]+-[^-]+")) |>
filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
mutate(Age = round(Age, digits = 0),
across(c(Age, OS_Time, OS_Status), ~ as.integer(.))) |>
rename(Cancer = disease_code)
### TARGET_info pretreatment
### In the TARGET database, rows with the info column marked as '--' are annotated as NA.
filter_lose <- function(x) {
is.na(x) | x == "" | x == "--" | x == "Unknown" | x == "'--"
}
TARGET_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t")
replacement_table2 <- c(
"Acute lymphocytic leukemia" = "TARGET-ALL",
"'--" = "TARGET-Unkown",
"Precursor B-cell lymphoblastic leukemia" = "TARGET-B-ALL",
"Not Reported" = "TARGET-Unkown",
"Acute myeloid leukemia, NOS" = "TARGET-LAML",
"Clear cell sarcoma of kidney" = "TARGET-CCSK",
"Neuroblastoma, NOS" = "TARGET-NBL",
"Ganglioneuroblastoma" = "TARGET-GNB",
"Osteosarcoma, NOS" = "TARGET-OS",
"Malignant rhabdoid tumor" = "TARGET-MRT",
"Wilms tumor" = "TARGET-WT"
)
matched2 <- TARGET_info$primary_diagnosis %in% names(replacement_table2)
TARGET_info_pretreatment <- TARGET_info |>
mutate(cohort = case_when(
matched2 ~ replacement_table2[primary_diagnosis],
TRUE ~ as.character(NA)
)) |>
rename(Cancer = primary_diagnosis,
Project_id = project_id) |>
filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
mutate(across(c("Age", "Sex", "OS_Time", "OS_Status", "race"), ~ ifelse(filter_lose(.x), NA, .x))) |>
mutate(Cancer = ifelse(Cancer == "'--", NA, Cancer))
### Check if the valid rows in TARGET_info match the valid rows in TARGET_xena_info.
### 1.Filter the valid rows.
filter_valid <- function(x) {
!is.na(x) & x != "" & x != "--" & x != "Unknown" & x == "'--"
}
### Filter the valid rows in TARGET_info.
TARGET_info_valid <- TARGET_info_pretreatment %>%
filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))
### Filter the valid rows in TARGET_xena_info.
TARGET_xena_info_valid <- TARGET_xena_pretreatment %>%
filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))
### 2. Merge the two data frames based on Patient_ID.
merged_data <- TARGET_info_valid %>%
inner_join(TARGET_xena_info_valid, by = "Patient_ID", suffix = c("_info", "_xena"))
### 3. Check whether the valid values in the merged data frame are consistent.
is_consistent <- merged_data %>%
rowwise() %>%
mutate(
Age_consistent = Age_info == Age_xena,
Sex_consistent = Sex_info == Sex_xena,
OS_Time_consistent = OS_Time_info == OS_Time_xena,
OS_Status_consistent = OS_Status_info == OS_Status_xena
) %>%
ungroup()
### 4. Age not onsistent
Age_not_consistent <- is_consistent |>
filter(Age_consistent == "FALSE")
### 5. Sex not onsistent
Sex_not_consistent <- is_consistent |>
filter(Sex_consistent == "FALSE")
### Sex_not_consistent
### A tibble: 0 × 13
### 6. OS_Time not onsistent
OS_Time_not_consistent <- is_consistent |>
filter(OS_Time_consistent == "FALSE") |>
mutate(
OS_Time_info_greater = OS_Time_info > OS_Time_xena,
OS_Time_xena_greater = OS_Time_xena > OS_Time_info,
OS_Time_equal = OS_Time_info == OS_Time_xena
)
nrow(OS_Time_not_consistent |> filter(OS_Time_info_greater == "FALSE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_xena_greater == "TRUE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_equal == "TRUE"))
### [1]0
### 7. OS_Status not onsistent
OS_Status_not_consistent <- is_consistent |>
filter(OS_Status_consistent == "FALSE")
unique(OS_Status_not_consistent$OS_Status_info)
### [1] 1
unique(OS_Status_not_consistent$OS_Status_xena)
### [1] 0
### All evidence shows that the GDC data has been updated more recently than the Xena data.
### Therefore, if there is a conflict between GDC data and Xena data, we should use the GDC data.
### Use the valid values from TARGET_xena_info to fill the missing values in TARGET_info.
### 1: Determine the matching Patient_ID.
matched_ids <- intersect(TARGET_xena_pretreatment$Patient_ID, TARGET_info_pretreatment$Patient_ID)
setDT(TARGET_xena_pretreatment)
setDT(TARGET_info_pretreatment)
### 2: Find the columns that require updates.
cols_to_update <- setdiff(names(TARGET_info_pretreatment), c("Sample_ID", "Patient_ID", "Cancer","project_id","cohort","race"))
### 3.Iterate through each matching Patient_ID and update TARGET_info_pretreatment using TARGET_xena_pretreatment.
### update Age
TARGET_xena_can <- TARGET_xena_pretreatment |>
filter(if_any("Age", ~ !filter_lose(.x)))
TARGET_info_match <- TARGET_info_pretreatment |>
mutate(Age = case_when(
filter_lose(Age) & Patient_ID %in% TARGET_xena_can$Patient_ID ~ TARGET_xena_can$Age[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ Age))
### update Sex
TARGET_xena_can <- TARGET_xena_pretreatment |>
filter(if_any("Sex", ~ !filter_lose(.x)))
TARGET_info_match1 <- TARGET_info_match |>
mutate(Sex = case_when(
filter_lose(Sex) & Patient_ID %in% TARGET_xena_can$Patient_ID ~ TARGET_xena_can$Sex[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ Sex))
### update OS_Time
TARGET_xena_can <- TARGET_xena_pretreatment |>
filter(if_any("OS_Time", ~ !filter_lose(.x)))
TARGET_info_match2 <- TARGET_info_match1 |>
mutate(OS_Time = case_when(
filter_lose(OS_Time) & Patient_ID %in% TARGET_xena_can$Patient_ID & OS_Status %in% TARGET_xena_can$OS_Status ~ TARGET_xena_can$OS_Time[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ OS_Time))
### update OS_Status
TARGET_xena_can <- TARGET_xena_pretreatment |>
filter(if_any("OS_Status", ~ !filter_lose(.x)))
TARGET_info_after_match <- TARGET_info_match2 |>
mutate(OS_Status = case_when(
filter_lose(OS_Status) & Patient_ID %in% TARGET_xena_can$Patient_ID & OS_Time %in% TARGET_xena_can$OS_Time ~ TARGET_xena_can$OS_Status[match(Patient_ID, TARGET_xena_can$Patient_ID)],
TRUE ~ OS_Status))
### Use TARGET_sheet to perform a left join, ensuring that the Sample_ID is correct.
TARGET_info_after_match <- TARGET_info_after_match |> select(- Sample_ID)
TARGET_correct <- TARGET_pretreatment |>
left_join(TARGET_info_after_match, by = c("Patient_ID" = "Patient_ID")) |>
mutate(Run = Sample_ID)
### Replace the cohort column values with Project_ID where the row is TARGET-Unkown.
### TARGET_correct$project_id |> unique()
### [1] "TARGET-AML" "TARGET-NBL"
TARGET_correct_after_pro <- TARGET_correct %>%
mutate(
cohort = case_when(
cohort == "TARGET-Unkown" ~ Project_id,
TRUE ~ cohort),
cohort = case_when(
cohort == "TARGET-AML" ~ "TARGET-LAML",
TRUE ~ cohort)
)
### Divide a part of data into a group based on whether the Sample_type is cell
target_cell <- TARGET_correct_after_pro |> filter(`Sample_type` == "Cell Lines")
target_cell_sample <- target_cell$Sample_ID |> unique()
target_cell_patient <- target_cell$Patient_ID |> unique()
### target_cell_sample
### [1] "TARGET-20-MUTZ3-50A" "TARGET-20-ML1-50A" "TARGET-20-MV411D1-50A" "TARGET-20-KasumiD1-50A" "TARGET-20-HEK293-50A" "TARGET-20-MOLM14CBFGLIS-50A"
### [7] "TARGET-20-TF1-50A" "TARGET-20-IGROV1-50A" "TARGET-20-CMS-50A" "TARGET-20-WSUAML-50A" "TARGET-20-KasumiAZAD5-50A" "TARGET-20-KasumiAZAD11-50A"
### [13]"TARGET-20-REH-50A" "TARGET-20-MV411AZAD11-50A" "TARGET-20-MV411AZAD5-50A" "TARGET-20-THP1-50A" "TARGET-20-OCIAML2-50A" "TARGET-20-HL60-50A"
### target_cell_patient
### [1] "TARGET-20-MUTZ3" "TARGET-20-ML1" "TARGET-20-MV411D1" "TARGET-20-KasumiD1" "TARGET-20-HEK293" "TARGET-20-MOLM14CBFGLIS"
### [7] "TARGET-20-TF1" "TARGET-20-IGROV1" "TARGET-20-CMS" "TARGET-20-WSUAML" "TARGET-20-KasumiAZAD5" "TARGET-20-KasumiAZAD11"
### [13]"TARGET-20-REH" "TARGET-20-MV411AZAD11" "TARGET-20-MV411AZAD5" "TARGET-20-THP1" "TARGET-20-OCIAML2" "TARGET-20-HL60"
### reference : https://www.cellosaurus.org/
### MUTZ3:Acute myeloid leukemia
### ML1: Adult acute myeloid leukemia
### MV411D1:biphenotypic B-myelomonocytic leukemia
### KasumiD1:Acute myeloblastic leukemia
### TF1:Acute erythroid leukemia
### IGROV1: Ovarian endometrioid adenocarcinoma
### WSUAML:Acute myeloid leukemia。
### KasumiAZAD5:can't find, so KasumiD1:Acute myeloblastic leukemia
### KasumiAZAD11:can't find, so KasumiD1:Acute myeloblastic leukemia
### REH:Childhood B acute lymphoblastic leukemia
### MV411AZAD11:can't find, so MV411D1:biphenotypic B-myelomonocytic leukemia
### MV411AZAD5:can't find, so MV411D1:biphenotypic B-myelomonocytic leukemia
### THP1:Childhood acute monocytic leukemia
### OCIAML2:Adult acute myeloid leukemia
### HL60:Adult acute myeloid leukemia
### HEK293:Vaccine production cell line.
### CML:Childhood acute megakaryoblastic leukemia
### MOLM14CBFGLIS:can't find, so MOLM14:Adult acute myeloid leukemia
cell_data <- metafusion[metafusion$Sample_ID %in% target_cell_sample]
### > nrow(cell_data)
### [1] 1853
cell_info <- TARGET_info_pretreatment[TARGET_info_pretreatment$Patient_ID %in% target_cell_patient]
### nrow(cell_info)
### [1] 18
### no os.etc
cell_info2 <- TARGET_xena_pretreatment[TARGET_xena_pretreatment$Patient_ID %in% target_cell_patient]
### nrow(cell_info2)
### [1] 0
TARGET_info_after_cell <- TARGET_correct_after_pro |>
mutate(cohort = case_when(
`Sample_type` %in% "Cell Lines" ~ "TARGET-CELL",
TRUE ~ cohort
))
### The criteria for distinguishing normal samples based on Sample_type column in the TARGET databases are as follows:
TARGET_info_after_cell$Sample_type |> unique()
### [1] "Primary Blood Derived Cancer - Bone Marrow" "Recurrent Blood Derived Cancer - Bone Marrow"
### [3] "Blood Derived Normal" "Primary Blood Derived Cancer - Peripheral Blood"
### [5] "Bone Marrow Normal" "Cell Lines"
### [7] "Next Generation Cancer Model" "Blood Derived Cancer - Bone Marrow, Post-treatment"
### [9] "Primary Tumor" "Recurrent Tumor"
### [11] "Recurrent Blood Derived Cancer - Peripheral Blood" "Blood Derived Cancer - Peripheral Blood, Post-treatment"
###
### If the sample type is "Blood Derived Normal" or "Bone Marrow Normal", then it is a normal sample. Otherwise, it is a tumor sample.
TARGET_info_after_normal <- TARGET_info_after_cell |>
mutate(cohort = case_when(
`Sample_type` %in% c("Blood Derived Normal", "Bone Marrow Normal") ~ "TARGET-NORMAL",
TRUE ~ cohort
))
### Adjust the column order.
new_order <- c("Patient_ID", "Sample_ID", "Run", "Sample_type", "Age", "Sex", "race", "Cancer",
"OS_Time", "OS_Status", "Project_id", "cohort")
TARGET_info_after_processing <- TARGET_info_after_normal |>
select(all_of(new_order))
### make sure nrow is same between Before and After Processing.
TARGET_pretreatment |> nrow()
### [1] 3222
TARGET_info_after_processing |> nrow()
### [1] 3222
### make sure IOBR/TME data just contain sample which target_sheet gave.
detected_id_TARGET <- TARGET_info_after_processing$Run
### save info
cohorts <- unique(TARGET_info_after_processing$cohort)
for (cohort1 in cohorts) {
cohort_data <- subset(TARGET_info_after_processing, cohort == cohort1)
name <- gsub("-", "_", cohort1)
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
fwrite(cohort_data , file = file_name, sep = "\t")
}
##Gene
metafusion1 <- metafusion |>
mutate(
cohort = case_when(
Sample_ID %in% TARGET_info_after_processing$Run & grepl("TARGET", cohort) ~ TARGET_info_after_processing$cohort[match(Sample_ID, TARGET_info_after_processing$Run)],
TRUE ~ cohort
)
)
cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
fwrite(metafusion1, file = cp_file_name1, sep = "\t")
##TME
NBL_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
AML_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
TARGET_TME <- rbind(NBL_TME,AML_TME) |>
mutate(dash_count = str_count(ID, "-"))
### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
TARGET_TME |>
filter(dash_count == 9) |>
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R
TARGET_TME_pretreatment <- TARGET_TME |>
filter(dash_count != 9) |>
select(-dash_count)
### Filter TARGET Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
###
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
###
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
###
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
### <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>
TARGET_TME_after_replicate <- TARGET_TME_pretreatment |>
mutate(
prefixes = str_remove(ID, "-[^-]+$"),
suffixes = str_extract(ID, "[^-]+$"),
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars)) |>
group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(ID = prefixes) |>
filter(ID %in% detected_id_TARGET) |> ### Ensure that all IDs in the TME data can be found in the TARGET info data.
select(-c(prefixes,suffixes,second_chars))
### Using TARGET info data cohort column to annotate TARGET TME data
cohort_data <- TARGET_info_after_processing |>
select(Run, cohort) |>
unique()
TARGET_TME_after_cohort <- TARGET_TME_after_replicate |>
mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
### save TME
cohorts_TME <- unique(TARGET_TME_after_cohort$cohort)
for (cohort1 in cohorts_TME) {
cohort_data <- subset(TARGET_TME_after_cohort, cohort == cohort1) |> select(-cohort)
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
## IOBR
NBL_IOBR_signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
AML_IOBR_signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
TARGET_IOBR <- rbind(NBL_IOBR_signature,AML_IOBR_signature) |>
mutate(dash_count = str_count(ID, "-"))
### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
TARGET_IOBR |>
filter(dash_count == 9) |>
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R
TARGET_IOBR_pretreatment <- TARGET_IOBR |>
filter(dash_count != 9) |>
select(-dash_count)
TARGET_IOBR_after_replicate <- TARGET_IOBR_pretreatment |>
mutate(
prefixes = str_remove(ID, "-[^-]+$"),
suffixes = str_extract(ID, "[^-]+$"),
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars)) |>
group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(ID = prefixes) |>
filter(ID %in% detected_id_TARGET) |> ### Ensure that all IDs in the TME data can be found in the TARGET info data.
select(-c(prefixes,suffixes,second_chars))
### Using TARGET info data cohort column to annotate TARGET TME data
cohort_data <- TARGET_info_after_processing |>
select(Run, cohort) |>
unique()
TARGET_IOBR_after_cohort <- TARGET_IOBR_after_replicate |>
mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
### save IOBR
cohorts_IOBR <- unique(TARGET_IOBR_after_cohort$cohort)
for (cohort1 in cohorts_IOBR) {
cohort_data <- subset(TARGET_IOBR_after_cohort, cohort == cohort1) |> select(-cohort)
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
CPTAC
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.tsv"
metafusion <- fread(file_meta, sep = "\t")
# CPTAC
## info
CPTAC_sheet <- fread("/home/data2/Projects/Fusion/unclean/CPTAC_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")
### Composite Sample
###
### Composite Sample use comma to separate like C3L-03400-05, C3L-03400-02
###
### It leads one question is whether these composite samples using comma to separate from identical patient?
rows_with_comma <- grepl(", ", CPTAC_sheet$`Sample ID`)
rows_with_comma_data <- CPTAC_sheet[rows_with_comma, ]
check_same_chars <- function(x) {
parts <- unlist(strsplit(x, ", "))
return(parts[1] == parts[2])
}
same_chars <- sapply(rows_with_comma_data$`Case ID`, check_same_chars)
if (all(same_chars)) {
cat("TRUE\n")
} else {
cat("FALSE\n")
}
### They are from identical patient.
### For the TARGET databases, check whether there is any overlap between composite and normal Sample_ID in the info data. If there is an overlap, retain only the normal Sample_ID.
check_cptac_overlap <- CPTAC_sheet |>
select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
unique() |>
separate_rows(`Sample ID`, sep = ",\\s*")
check_cptac_overlap <- check_cptac_overlap |>
mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))
overlap_samples <- check_cptac_overlap |>
group_by(`Sample ID`) |>
summarise(Count = n(), .groups = 'drop') |>
filter(Count > 1)
### > overlap_samples
### A tibble: 0 × 2
### ℹ 2 variables: Sample ID <chr>, Count <int>
### no this situation
### Due to the less rigorous patient naming conventions(like 604) in the CPTAC2 project, check for any overlapping patient names between CPTAC2 and CPTAC3.
CPTAC_sheet_rigorous <- CPTAC_sheet |>
select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
unique()
duplicates <- CPTAC_sheet_rigorous |>
group_by(`Case ID`, `Sample ID`) |>
summarise(n_projects = n(), projects = paste(`Project ID`, collapse = ", ")) |>
filter(n_projects > 1)
### nrow(duplicates)
### [1] 0
### no same patient names between CPTAC2 and CPTAC3
### CPTAC_sheet pretreatment
matched_CPTAC_pretreatment <- CPTAC_sheet |>
select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
rename("Patient_ID" = `Case ID`,
"Sample_ID" = `Sample ID`,
"Sample_type" = `Sample Type`,
"Project_id" = `Project ID`) |>
separate_rows(Patient_ID, sep = ",\\s*") |> #because they are from identical patient.
unique()
### CPTAC_info pretreatment
CPTAC_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")
# > CPTAC_info$Cancer |> unique()
# [1] "BRCA" "CCRCC" "COAD" "GBM" "HGSC" "HNSCC" "LSCC" "LUAD" "PDAC" "UCEC" "GTEX-Brain"
### We use our naming standard mutate a new column cohort.
cptac_replacement_table2 <- c(
"BRCA" = "CPTAC-BRCA",
"CCRCC" = "CPTAC-KIRC",
"COAD" = "CPTAC-COAD",
"GBM" = "CPTAC-GBM",
"HGSC" = "CPTAC-OV",
"HNSCC" = "CPTAC-HNSC",
"LSCC" = "CPTAC-LUSC" ,
"LUAD" = "CPTAC-LUAD",
"PDAC" = "CPTAC-PAAD",
"UCEC" = "CPTAC-UCEC",
"GTEX-Brain" = "CPTAC-GTEX-Brain"
)
matched_CPTAC_info <- CPTAC_info$Cancer %in% names(cptac_replacement_table2)
CPTAC_info_pretreatment <- CPTAC_info %>%
mutate(cohort = case_when(
matched_CPTAC_info ~ cptac_replacement_table2[Cancer],
TRUE ~ as.character(NA)
)) |>
select(-Sample_ID)
### CPTAC_info_pretreatment |> filter(is.na(cohort)) |> nrow()
### [1] 0
matched_CPTAC <- matched_CPTAC_pretreatment |>
left_join(CPTAC_info_pretreatment, by = c("Patient_ID" = "Patient_ID")) |>
mutate(Run = str_replace_all(Sample_ID, ", ", "-"))
### CPTAC Sample type
### matched_CPTAC_pretreatment$Sample_type |> unique()
### [1] "Primary Tumor" "Solid Tissue Normal"
### [3] "Primary Tumor, Primary Tumor" "Primary Tumor, Primary Tumor, Primary Tumor"
### [5] "Solid Tissue Normal, Solid Tissue Normal, Solid Tissue Normal" "Solid Tissue Normal, Solid Tissue Normal"
### [7] "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor" "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor"
###
### Just two sample type so if grepling "Solid Tissue Normal" is normal sample, grepling "Primary Tumor" is tumor sample
matched_CPTAC_after_processing <- matched_CPTAC |>
mutate(cohort = case_when(
grepl("Solid Tissue Normal", Sample_type) ~ "CPTAC-NORMAL",
TRUE ~ cohort
))
### make sure nrow is same between Before and After Processing.
matched_CPTAC_after_processing |> nrow()
### [1] 2558
matched_CPTAC_pretreatment |> nrow()
### [1] 2558
### make sure IOBR/TME data just contain sample which cptac_sheet gave.
detected_id_CPTAC <- matched_CPTAC_after_processing$Run |> unique()
### save info
cohorts <- unique(matched_CPTAC_after_processing$cohort)
for (cohort1 in cohorts) {
cohort_data <- subset(matched_CPTAC_after_processing, cohort == cohort1)
name <- gsub("-", "_", cohort1)
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
fwrite(cohort_data , file = file_name, sep = "\t")
}
## TME
CPTAC2_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC3_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC_TME <- rbind(CPTAC2_TME, CPTAC3_TME)
### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
###
### CPTAC_TME can match
matched_can <- CPTAC_TME %>%
filter(ID %in% matched_CPTAC_after_processing$Run)
###
### CPTAC_TME can't match
matched_not <- CPTAC_TME %>%
filter(!(ID %in% matched_CPTAC_after_processing$Run))
###
### situtation 1: Due to the barcode naming convention and etc, some portion character are added to the endning of the IOBR/TME sample name.
### str_count(matched_last$ID, "_")
### [1] 1 2 0
### 1.1 "CPT000817-0008" and "CPT000817". If the value of str_count(matched_last$ID, "_") for a row is 0, then the value of the prefixes column for that row should be the characters before the last -, and the value of the suffixes column should be the characters after the last -.
### 1.2 "f6a7f764-2495-4a1d-89a2-e12f51_D7_1" and "f6a7f764-2495-4a1d-89a2-e12f51". If the value of str_count(matched_last$ID, "_") for a row is 1, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.
### 1.3 "CPT008756_0001" and "CPT008756". f6a7f764-2495-4a1d-89a2-e12f51_D7_1, .If the value of str_count(matched_last$ID, "_") for a row is 2, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.
matched_not <- matched_not %>%
rowwise() %>%
mutate(
prefixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1],
TRUE ~ NA_character_
),
suffixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],
TRUE ~ NA_character_
),
name1 = prefixes
) %>%
ungroup()
### situtation 1 after processing can match
matched_in_situtation1 <- matched_not %>%
filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
mutate(ID = name1) %>%
select(-c(suffixes,prefixes,name1) )
matched_can <- rbind(matched_can, matched_in_situtation1)
### not in situtation 1
matched_not_situtation1 <- matched_not %>%
filter(!(name1 %in% matched_CPTAC_after_processing$Run))
### situtation 2: Due to IOBR package analyze code, an additional "X" is added to the beginning of the IOBR/TME sample name.
### "X1104806" and "1104806", "X9f905736-f662-41d6-b3ac-16758d" and "9f905736-f662-41d6-b3ac-16758d".
matched_not_situtation1 <- matched_not_situtation1 %>%
mutate(name2 = str_remove(ID, "^X"))
### situtation 2 after processing can match
matched_in_situtation2 <- matched_not_situtation1 %>%
filter((name2 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name2 ) %>%
select(-c(name1,prefixes,suffixes,name2))
matched_can <- rbind(matched_can, matched_in_situtation2)
### situtation 2 after processing can't match
matched_not_situtation2 <- matched_not_situtation1 %>%
filter(!(name2 %in% matched_CPTAC_after_processing$Run))
### situtation 3: Mixed situation1 and situation2.
### X161317f3-1a39-438e-af45-cd2502_D7 and 161317f3-1a39-438e-af45-cd2502
matched_not_situtation2 <- matched_not_situtation2 %>%
mutate(name3 = str_remove(prefixes, "^X"))
### situtation 3 after processing can match
matched_in_situtation3 <- matched_not_situtation2 %>%
filter((name3 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name3) %>%
select(-c(name1,name2,prefixes,suffixes, name3))
matched_can <- rbind(matched_can, matched_in_situtation3)
### situtation 3 after processing can't match
matched_not_situtation3 <- matched_not_situtation2 %>%
filter(!(name3 %in% matched_CPTAC_after_processing$Run))
nrow(matched_not_situtation3 )
### [1] 81
### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.
### Check if CPTAC TME data ID are any samples that have been divided into multiple portions.
matched_can |> select(ID) |> nrow()
### [1] 2558
matched_can |> select(ID) |> unique() |> nrow()
### [1] 2558
### No this situation.
### Check if CPTAC TME data has same norw with CPTAC info data.
matched_can |> select(ID) |> unique() |> nrow()
### [1] 2558
matched_CPTAC_after_processing |> select(Run) |> unique() |> nrow()
### [1] 2558
### Sure.
cohort_data <- matched_CPTAC_after_processing |>
select(Run, cohort) |>
unique()
matched_can <- matched_can |>
mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
cohorts_TME <- unique(matched_can$cohort)
for (cohort1 in cohorts_TME) {
cohort_data <- subset(matched_can, cohort == cohort1) |> select(-cohort)
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
#IOBR
CPTAC2_Signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC3_Signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC_Signature <- rbind(CPTAC2_Signature, CPTAC3_Signature)
### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
###
### CPTAC_TME can match
matched_can <- CPTAC_Signature %>%
filter(ID %in% matched_CPTAC_after_processing$Run)
###
### CPTAC_TME can't match
matched_not <- CPTAC_Signature %>%
filter(!(ID %in% matched_CPTAC_after_processing$Run))
###
### situtation 1:like TME
matched_not <- matched_not %>%
rowwise() %>%
mutate(
prefixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1],
TRUE ~ NA_character_
),
suffixes = case_when(
str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],
str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],
str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],
TRUE ~ NA_character_
),
name1 = prefixes
) %>%
ungroup()
### situtation 1 after processing can match
matched_in_situtation1 <- matched_not %>%
filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
mutate(ID = name1) %>%
select(-c(suffixes,prefixes,name1) )
matched_can <- rbind(matched_can, matched_in_situtation1)
### situtation 1 after processing can't match
matched_not_situtation1 <- matched_not %>%
filter(!(name1 %in% matched_CPTAC_after_processing$Run))
### situtation 2:like TME
matched_not_situtation1 <- matched_not_situtation1 %>%
mutate(name2 = str_remove(ID, "^X"))
### situtation 2 after processing can match
matched_in_situtation2 <- matched_not_situtation1 %>%
filter((name2 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name2 ) %>%
select(-c(name1,prefixes,suffixes,name2))
matched_can <- rbind(matched_can, matched_in_situtation2)
### situtation 2 after processing can't match
matched_not_situtation2 <- matched_not_situtation1 %>%
filter(!(name2 %in% matched_CPTAC_after_processing$Run))
### situtation 3:like TME
matched_not_situtation2 <- matched_not_situtation2 %>%
mutate(name3 = str_remove(prefixes, "^X"))
### situtation 3 after processing can match
matched_in_situtation3 <- matched_not_situtation2 %>%
filter((name3 %in% matched_CPTAC_after_processing$Run)) %>%
mutate(ID = name3) %>%
select(-c(name1,name2,prefixes,suffixes, name3))
matched_can <- rbind(matched_can, matched_in_situtation3)
### situtation 3 after processing can't match
matched_not_situtation3 <- matched_not_situtation2 %>%
filter(!(name3 %in% matched_CPTAC_after_processing$Run))
nrow(matched_not_situtation3 )
### [1] 81
### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.
### Check if CPTAC IOBR data ID are any samples that have been divided into multiple portions.
matched_can |> select(ID) |> nrow()
### [1] 2558
matched_can |> select(ID) |> unique() |> nrow()
### [1] 2558
### No this situation.
### Check if CPTAC IOBR data has same norw with CPTAC info data.
matched_can |> select(ID) |> unique() |> nrow()
### [1] 2558
matched_CPTAC_after_processing |> select(Run) |> unique() |> nrow()
### [1] 2558
### Sure.
cohort_data <- matched_CPTAC_after_processing |>
select(Run, cohort) |>
unique()
matched_can <- matched_can |>
mutate(Run = ID) |>
left_join(cohort_data, by = "Run") |>
select(-Run)
cohorts_IOBR <- unique(matched_can$cohort)
for (cohort1 in cohorts_IOBR) {
cohort_data <- subset(matched_can, cohort == cohort1) |> select(-cohort)
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
## Gene
metafusion1 <- metafusion |>
mutate(
cohort = case_when(
Sample_ID %in% matched_CPTAC_after_processing$Run & grepl("CPTAC", cohort) ~ matched_CPTAC_after_processing$cohort[match(Sample_ID, matched_CPTAC_after_processing$Run)],
TRUE ~ cohort
)
)
### In the past, the metafusion file had an issue with the sample_type being incorrectly labeled.
### After further investigation, it was found that only the composite Sample_ID in CPTAC had this issue, where samples labeled as 'Normal' were mistakenly identified as 'Tumor' in the sample_type column.
normal_id <- matched_CPTAC_after_processing |>
filter(cohort == "CPTAC-NORMAL") |>
select(Run) |>
unique() |>
pull(Run)
meta_normal <- metafusion1 |>
filter(Sample_ID %in% normal_id)
unique(meta_normal$sample_type)
#[1] "Normal" "Tumor"
meta_processing <- metafusion1 |>
mutate(
sample_type = ifelse(
sample_type != "Normal" & Sample_ID %in% normal_id,
"Normal",
sample_type
)
)
cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
fwrite(meta_processing, file = cp_file_name1, sep = "\t") TCGA
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
metafusion <- fread(file_meta , sep = "\t")
# TCGA
## info
TCGA_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")
### TCGA_sheet pretreatment
TCGA_sheet <- fread("/home/data2/Projects/Fusion/unclean/TCGA_sample_sheet.2024-05-14.tsv", sep = "\t") |>
select(c(`Sample ID`, `Sample Type`,`Project ID`))
colnames(TCGA_sheet) <- c("Sample_ID", "Sample_type","Project_id")
TCGA_sheet_pretreatment <- TCGA_sheet |>
mutate(Sample_ID = substr(Sample_ID, 1, 15)) |>
unique()
### We use our naming standard mutate a new column cohort.
TCGA_after_processing <- TCGA_sheet_pretreatment |>
left_join(TCGA_info, by = c("Sample_ID" = "Sample_ID")) |>
mutate(
cohort = Project_id,
Run = Sample_ID,
Patient_ID = substr(Sample_ID, 1, 12),
)
### TCGA Sample type
### TCGA_after_processing$Sample_type |> unique()
### [1] "Primary Tumor" "Solid Tissue Normal" "Metastatic"
### [4] "Additional - New Primary" "Primary Blood Derived Cancer - Peripheral Blood" "Additional Metastatic"
### [7] "Recurrent Tumor"
###
### "Solid Tissue Normal" is normal sample, others is tumor sample
TCGA_NORMAL <- TCGA_after_processing |>
filter(Sample_type == "Solid Tissue Normal")
NORMAL_id <- TCGA_NORMAL$Run |> unique()
TCGA_after_processing <- TCGA_after_processing |>
mutate(
Project_id = cohort,
cohort = case_when(
Run %in% NORMAL_id ~ "TCGA-NORMAL",
TRUE ~ cohort))
### make sure nrow is same between Before and After Processing.
TCGA_after_processing |> nrow()
### [1] 10545
TCGA_sheet_pretreatment |> nrow()
### [1] 10545
### make sure IOBR/TME data just contain sample which tcga_sheet gave.
detected_id_TCGA <- TCGA_after_processing$Run |> unique()
### save info
cohorts <- unique(TCGA_after_processing$cohort)
file_name1 <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/")
for (cohort1 in cohorts) {
cohort_data <- subset(TCGA_after_processing, cohort == cohort1)
name <- gsub("-", "_", cohort1)
file_name <- paste0(file_name1, name ,"_info.tsv")
fwrite(cohort_data , file = file_name, sep = "\t")
}
## Gene
metafusion1 <- metafusion |>
mutate(
cohort = case_when(
Sample_ID %in% TCGA_after_processing$Run & grepl("TCGA", cohort) ~ TCGA_after_processing$cohort[match(Sample_ID, TCGA_after_processing$Run)],
TRUE ~ cohort
)
)
cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
fwrite(metafusion1, file = cp_file_name1, sep = "\t")
## IOBR
folder_path <- "/home/data2/Projects/Fusion/fusiondb/IOBR"
file_pattern <- "^TCGA.*_IOBR_signature.tsv.gz$"
file_list <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)
combined_data <- data.table()
for (file in file_list) {
file_data <- fread(file, sep = "\t")
file_name <- basename(file)
project_id <- sub("^(TCGA-[^_]+).*", "\\1", file_name)
file_data <- file_data %>%
mutate(Project_id = project_id)
combined_data <- rbind(combined_data, file_data)
}
### Filter TCGA Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
###
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
###
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
###
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
### <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>
result_data <- combined_data |>
mutate(
prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars))
result_IOBR <- result_data |>
group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(cohort = Project_id,
ID = substr(prefixes, 1, 15),
cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
TRUE ~ cohort)
) |>
filter(ID %in% detected_id_TCGA) |>
select(-c("suffixes", "second_chars","prefixes")) |>
unique()
cohorts_IOBR <- unique(result_IOBR$cohort)
for (cohort1 in cohorts_IOBR) {
cohort_data <- subset(result_IOBR, cohort == cohort1) |> select(-c(cohort,Project_id) )
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
}
## TME
folder_path <- "/home/data2/Projects/Fusion/fusiondb/IOBR"
file_pattern <- "^TCGA.*_TME.tsv.gz$"
file_list <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)
combined_data1 <- data.table()
for (file in file_list) {
file_data <- fread(file, sep = "\t")
file_name <- basename(file)
project_id <- sub("^(TCGA-[^_]+).*", "\\1", file_name)
file_data <- file_data %>%
mutate(Project_id = project_id)
combined_data1 <- rbind(combined_data1, file_data)
}
result_data1 <- combined_data1 |>
mutate(
prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
second_chars = str_extract(suffixes, "[0-9]+"),
second_chars = as.integer(second_chars))
result_TME <- result_data1 |>
group_by(prefixes) |>
slice(which.max(second_chars)) |>
ungroup() |>
mutate(cohort = Project_id,
ID = substr(prefixes, 1, 15),
cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
TRUE ~ cohort)
) |>
filter(ID %in% detected_id_TCGA) |>
select(-c("suffixes", "second_chars","prefixes")) |>
unique()
cohorts_TME <- unique(result_TME$cohort)
for (cohort1 in cohorts_TME) {
cohort_data <- subset(result_TME, cohort == cohort1) |> select(-c(cohort,Project_id) )
file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
write.table(cohort_data,
file = gzfile(file_name),
sep = "\t",
row.names = FALSE,
quote = FALSE)
} Fusion Event Scoring and Classification
library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
metafusion <- fread(file_meta , sep = "\t")
# Stratification Strategy
## Normal data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If one fusion event's max_split_cnt >= 1 and max_span_cnt >= 1, +1 score.
### 3. If Arriba and STAR-Fusion detected identical fusion event, +1 score.
### 4. If past FusionAnnotator annotated one normal fusion event, +2 score(Because it is a strong evidence)
Normal_data <- metafusion |>
filter(sample_type == "Normal") |>
mutate(Normal_score = 0)
### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
blocklist == TRUE ~ 0,
TRUE ~ Normal_score
),
score0 = case_when(
blocklist == TRUE ~ TRUE,
TRUE ~ FALSE
)
)
### 2. Reads counts
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
max_split_cnt >= 1 & max_span_cnt >= 1 & score0 == FALSE ~ Normal_score + 1 ,
TRUE ~ Normal_score
)
)
### 3. Tools detected
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
tools == "arriba,starfusion" & score0 == FALSE ~ Normal_score + 1,
TRUE ~ Normal_score
)
)
### 4. Fusionanntator annotate
normal_markers <- c("GTEx_recurrent_StarF2019", "DGD_PARALOGS", "HGNC_GENEFAM", "BodyMap", "Greger_Normal", "ConjoinG", "Babiceanu_Normal")
#### Past Red Herrings summary didn't contain TCGA, TARGET, CPTAC(except for GTEx-Brain) normal sample data.
#### reference:https://github.com/FusionAnnotator/CTAT_HumanFusionLib/wiki
#### BodyMap 40
#### Babiceanu_Normal 239
#### ConjoinG 698
#### DGD_PARALOGS 14463
#### Greger_Normal 113
#### GTEx_recurrent_StarF2019 4047
#### HGNC_GENEFAM Unknown
Normal_data <- Normal_data %>%
mutate(
Normal_score = case_when(
rowSums(sapply(normal_markers, grepl, x = cancer_db_hits)) >= 1 & score0 == FALSE ~ Normal_score + 2,
TRUE ~ Normal_score )
)
### Nomral true
Normal_data_true <- Normal_data |>
group_by(gene1, gene2) %>%
slice_max(Normal_score) %>%
ungroup() %>%
filter(Normal_score >= 2)
## Tumor data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If fusion event was in our detcted normal fusion(score>=2), it will be given a score of 0 and will not be changed.
### 3. If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 1, and at least one of them is less than 5, +1 score.
### If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 5, +2 score.
### 4. If Arriba detected exclusive fusion event, +0 score;
### If STAR-Fusion detected exclusive fusion event, +0 score;
### If Arriba and STAR-Fusion detected identical fusion event, +1 score;
### 5. If past FusionAnnotator annotated one tumor fusion event, +2 score(Because it is a strong evidence)
Tumor_data <- metafusion |>
filter(sample_type == "Tumor" | sample_type == "Blood") |>
mutate(Tumor_score = 0)
### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
blocklist == TRUE ~ 0,
TRUE ~ Tumor_score
),
score0 = case_when(
blocklist == TRUE ~ TRUE,
TRUE ~ FALSE
)
)
### 2. Filter not in normal data.
Tumor_data <- Tumor_data %>%
mutate(gene_pair = paste0(gene1, "::", gene2))
Normal_data_true <- Normal_data_true %>%
mutate(gene_pair = paste0(gene1, "::", gene2))
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ 0,
TRUE ~ Tumor_score
),
score0 = case_when(
gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ TRUE,
TRUE ~ FALSE
)
)
### 3. Reads counts
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
max_split_cnt >= 5 & max_span_cnt >= 5 & score0 == FALSE ~ Tumor_score + 2,
(max_split_cnt >= 1 & max_span_cnt >= 1) & ((max_split_cnt < 5 & max_span_cnt >= 5)| (max_split_cnt >= 5 & max_span_cnt < 5)| (max_split_cnt < 5 & max_span_cnt < 5))~ Tumor_score + 1,
TRUE ~ Tumor_score
)
)
### 4. Tools detected
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
tools == "arriba,starfusion" & score0 == FALSE ~ Tumor_score + 1,
tools == "arriba" & score0 == FALSE ~ Tumor_score,
tools == "starfusion" & score0 == FALSE ~ Tumor_score,
TRUE ~ Tumor_score
)
)
### 5. Fusionanntator annotate
tumor_markers <- c("Mitelman", "chimerdb_omim", "chimerdb_pubmed", "Cosmic", "YOSHIHARA_TCGA", "ChimerKB", "ChimerPub", "ChimerSeq", "Klijn_CellLines", "Larsson_TCGA", "CCLE_StarF2019", "HaasMedCancer", "TCGA_StarF2019", "TumorFusionsNAR2018", "DEEPEST2019", "GUO2018CR_TCGA", "DepMap2023")
Tumor_data <- Tumor_data %>%
mutate(
Tumor_score = case_when(
rowSums(sapply(tumor_markers, grepl, x = cancer_db_hits)) >= 1 ~ Tumor_score + 2,
TRUE ~ Tumor_score)
)
### New metafusion
Tumor_data1 <- Tumor_data |>
select(-c(gene_pair, score0)) |>
rename(gene_score = Tumor_score)
Normal_data1 <- Normal_data |>
select(-c( score0)) |>
rename(gene_score = Normal_score)
metafusion_new <- rbind(Normal_data1,Tumor_data1)
file_name <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls.star.tsv"
write.table(metafusion_new,
file = file_name,
sep = "\t",
row.names = FALSE,
quote = FALSE)