Methods

Authors

Shixiang Wang

Yankun Zhao

Published

March 16, 2024

Gene Fusions identification and differential expression analysis

We employed two Gene Fusions(GFs) detection tools, including arriba, STAR-Fusion, to identify, parse, and annotate GFs junctions within each sample, using the human genome version hg38 as the reference. In our pursuit of biologically robust GFs identification within each cohort, we undertook an approach metafusion that amalgamated results from two GFs detection tools. Firstly, we exclusively retained Fusions located on chromosomes 1 to 22, as well as the X and Y chromosomes. Secondly, GFs were mandated to exhibit genomic overlap with gene regions as defined in the reference genome annotation file, specifically “gencode.v34.annotation.gtf.”. Thirdly to ensure the precision and recall in GFs prediction with different tools, one can either rely on consensus predictions or all predctions and choosing junction-crossing reads and reads mates which flank (span) the breakpoint. Fourthly, we offer FusionAnnotator to identify and prioritize GFs that have been previously reported in cancer or normal samples. Lastly, the detected data can be divided into Cis-SAGes and other categories to further explore the debate on what should be considered true fusion RNAs. This meticulous methodological approach was instrumental in ensuring the acquisition of biologically sound and reliable GFs datasets, thereby fortifying the integrity of our analysis within each cohort. This analysis facilitated comparisons between patient groups, such as those who responded to checkpoint immunotherapy versus non-responders, as well as between samples collected before and during ICB treatment.

Response and Response2

We gathered and processed patient ‘Response’ data from various cohorts. However, due to inconsistencies among these cohorts, such as the usage of different terms like CR, PR, SD, PD, R, NR, DCB, and NDB, we harmonized the response variable based on the timing of treatment and treatment response. As a result, we introduced a new variable called ‘Response2’ with four distinct categories: Pre-R, Pre-NR, On-R, and On-NR. In cases where response data was not available, we assigned the category NE.

Cancer Type Abbreviation

Abbr.	Description
BLCA	Bladder urothelial carcinoma
SKCM	Skin cutaneous melanoma
KIRC & CCRCC	Kidney renal clear cell carcinoma
HNSC & HNSCC	Head and neck squamous cell carcinoma
NSCLC	Non-small cell lung cancer, including lung adenocarcinoma and lung squamous cell carcinoma
SCLC	Small cell lung cancer
SARC	Sarcoma
BRCA	Breast cancer
GBM	Glioblastoma Multiforme
STAD	Stomach Adenocarcinoma
SGC	Salivary gland cancer
LUSC	Lung Squamous Cell Carcinoma
MESO	Mesothelioma
PCPG	Pheochromocytoma and Paraganglioma
LIHC	Liver Hepatocellular Carcinoma
PRAD	Prostate Adenocarcinoma
ACC	Adrenocortical Carcinoma
TGCT	Testicular Germ Cell Tumors
PAAD	Pancreatic Adenocarcinoma
UCEC	Uterine Corpus Endometrial Carcinoma
THCA	Thyroid carcinoma
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma
ESCA	Esophageal carcinoma
READ	Rectum adenocarcinoma
OV	Ovarian Serous Cystadenocarcinoma
CHOL	Cholangiocarcinoma
UCS	Uterine Carcinosarcoma
LUAD	Lung Adenocarcinoma
LGG	Brain Lower Grade Glioma
HGSC	High-Grade Serous Carcinoma
LSCC & LUSC	Lung Squamous Cell Carcinoma
PDAC	Pancreatic Ductal Adenocarcinoma
THYM	Thymoma
KICH	Kidney Chromophobe
KIRP	Kidney Renal Papillary Cell Carcinoma
COAD	Colon Adenocarcinoma
LAML & AML	Acute Myeloid Leukemia
NBL	Neuroblastoma
GNB	Ganglioneuroblastoma

ImmunoFusion cohort naming maping

Cancer	Sample_type	Project_id	Immunofusion-cohort
BRCA	Primary Tumor	CPTAC-2	CPTAC-BRCA
COAD	Primary Tumor	CPTAC-2	CPTAC-COAD
HGSC	Primary Tumor	CPTAC-2	CPTAC-OV
LSCC	Primary Tumor	CPTAC-3	CPTAC-LUSC
UCEC	Primary Tumor	CPTAC-3	CPTAC-UCEC
LUAD	Primary Tumor	CPTAC-3	CPTAC-LUAD
GBM	Primary Tumor	CPTAC-3	CPTAC-GBM
HNSCC	Primary Tumor	CPTAC-3	CPTAC-HNSC
CCRCC	Primary Tumor	CPTAC-3	CPTAC-KIRC
PDAC	Primary Tumor	CPTAC-3	CPTAC-PAAD
PDAC	Solid Tissue Normal	CPTAC-3	CPTAC-NORMAL
LUAD	Solid Tissue Normal	CPTAC-3	CPTAC-NORMAL
UCEC	Solid Tissue Normal	CPTAC-3	CPTAC-NORMAL
CCRCC	Solid Tissue Normal	CPTAC-3	CPTAC-NORMAL
HNSCC	Solid Tissue Normal	CPTAC-3	CPTAC-NORMAL
LSCC	Solid Tissue Normal	CPTAC-3	CPTAC-NORMAL
GTEX-Brain	Solid Tissue Normal	CPTAC-3	CPTAC-NORMAL
–	Primary Blood Derived Cancer - Peripheral Blood/Recurrent Blood Derived Cancer - Peripheral Blood/Next Generation Cancer Model/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Bone Marrow	TARGET-AML	TARGET-LAML
Acute myeloid leukemia, NOS	Primary Blood Derived Cancer - Bone Marrow/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Bone Marrow/Post-treatment/Recurrent Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Peripheral Blood	TARGET-AML	TARGET-LAML
Neuroblastoma, NOS	Primary Tumor/Recurrent Tumor/Recurrent Blood Derived Cancer - Bone Marrow	TARGET-NBL	TARGET-NBL
Ganglioneuroblastoma	Primary Tumor/Recurrent Tumor	TARGET-NBL	TARGET-GNB
–	Cell Lines	TARGET-AML	TARGET-CELL
–	Bone Marrow Normal	TARGET-AML	TARGET-NORMAL
Acute myeloid leukemia, NOS	Blood Derived Normal/Bone Marrow Normal	TARGET-AML	TARGET-NORMAL
ACC	Primary Tumor	TCGA-ACC	TCGA-ACC
BLCA	Primary Tumor	TCGA-BLCA	TCGA-BLCA
BRCA	Primary Tumor	TCGA-BRCA	TCGA-BRCA
BRCA	Metastatic	TCGA-BRCA	TCGA-BRCA
CESC	Primary Tumor	TCGA-CESC	TCGA-CESC
CESC	Metastatic	TCGA-CESC	TCGA-CESC
CHOL	Primary Tumor	TCGA-CHOL	TCGA-CHOL
COAD	Primary Tumor	TCGA-COAD	TCGA-COAD
COAD	Recurrent Tumor	TCGA-COAD	TCGA-COAD
COAD	Metastatic	TCGA-COAD	TCGA-COAD
DLBC	Primary Tumor	TCGA-DLBC	TCGA-DLBC
ESCA	Primary Tumor	TCGA-ESCA	TCGA-ESCA
ESCA	Metastatic	TCGA-ESCA	TCGA-ESCA
GBM	Primary Tumor	TCGA-GBM	TCGA-GBM
GBM	Recurrent Tumor	TCGA-GBM	TCGA-GBM
HNSC	Primary Tumor	TCGA-HNSC	TCGA-HNSC
HNSC	Metastatic	TCGA-HNSC	TCGA-HNSC
KICH	Primary Tumor	TCGA-KICH	TCGA-KICH
KIRC	Primary Tumor	TCGA-KIRC	TCGA-KIRC
KIRC	Additional - New Primary	TCGA-KIRC	TCGA-KIRC
KIRP	Primary Tumor	TCGA-KIRP	TCGA-KIRP
KIRP	Additional - New Primary	TCGA-KIRP	TCGA-KIRP
LAML	Primary Blood Derived Cancer - Peripheral Blood	TCGA-LAML	TCGA-LAML
LGG	Primary Tumor	TCGA-LGG	TCGA-LGG
LGG	Recurrent Tumor	TCGA-LGG	TCGA-LGG
LIHC	Primary Tumor	TCGA-LIHC	TCGA-LIHC
LIHC	Recurrent Tumor	TCGA-LIHC	TCGA-LIHC
LUAD	Primary Tumor	TCGA-LUAD	TCGA-LUAD
LUAD	Recurrent Tumor	TCGA-LUAD	TCGA-LUAD
LUSC	Primary Tumor	TCGA-LUSC	TCGA-LUSC
MESO	Primary Tumor	TCGA-MESO	TCGA-MESO
OV	Primary Tumor	TCGA-OV	TCGA-OV
OV	Recurrent Tumor	TCGA-OV	TCGA-OV
PAAD	Primary Tumor	TCGA-PAAD	TCGA-PAAD
PAAD	Metastatic	TCGA-PAAD	TCGA-PAAD
PCPG	Primary Tumor	TCGA-PCPG	TCGA-PCPG
PCPG	Additional - New Primary	TCGA-PCPG	TCGA-PCPG
PCPG	Metastatic	TCGA-PCPG	TCGA-PCPG
PRAD	Primary Tumor	TCGA-PRAD	TCGA-PRAD
PRAD	Metastatic	TCGA-PRAD	TCGA-PRAD
READ	Primary Tumor	TCGA-READ	TCGA-READ
READ	Recurrent Tumor	TCGA-READ	TCGA-READ
SARC	Primary Tumor	TCGA-SARC	TCGA-SARC
SARC	Recurrent Tumor	TCGA-SARC	TCGA-SARC
SARC	Metastatic	TCGA-SARC	TCGA-SARC
SKCM	Metastatic	TCGA-SKCM	TCGA-SKCM
SKCM	Primary Tumor	TCGA-SKCM	TCGA-SKCM
SKCM	Additional Metastatic	TCGA-SKCM	TCGA-SKCM
STAD	Primary Tumor	TCGA-STAD	TCGA-STAD
TGCT	Primary Tumor	TCGA-TGCT	TCGA-TGCT
TGCT	Additional - New Primary	TCGA-TGCT	TCGA-TGCT
THCA	Primary Tumor	TCGA-THCA	TCGA-THCA
THCA	Metastatic	TCGA-THCA	TCGA-THCA
THYM	Primary Tumor	TCGA-THYM	TCGA-THYM
UCEC	Primary Tumor	TCGA-UCEC	TCGA-UCEC
UCEC	Recurrent Tumor	TCGA-UCEC	TCGA-UCEC
UCS	Primary Tumor	TCGA-UCS	TCGA-UCS
UVM	Primary Tumor	TCGA-UVM	TCGA-UVM
BLCA	Solid Tissue Normal	TCGA-BLCA	TCGA-NORMAL
BRCA	Solid Tissue Normal	TCGA-BRCA	TCGA-NORMAL
CESC	Solid Tissue Normal	TCGA-CESC	TCGA-NORMAL
CHOL	Solid Tissue Normal	TCGA-CHOL	TCGA-NORMAL
COAD	Solid Tissue Normal	TCGA-COAD	TCGA-NORMAL
ESCA	Solid Tissue Normal	TCGA-ESCA	TCGA-NORMAL
HNSC	Solid Tissue Normal	TCGA-HNSC	TCGA-NORMAL
KICH	Solid Tissue Normal	TCGA-KICH	TCGA-NORMAL
KIRC	Solid Tissue Normal	TCGA-KIRC	TCGA-NORMAL
KIRP	Solid Tissue Normal	TCGA-KIRP	TCGA-NORMAL
LIHC	Solid Tissue Normal	TCGA-LIHC	TCGA-NORMAL
LUAD	Solid Tissue Normal	TCGA-LUAD	TCGA-NORMAL
LUSC	Solid Tissue Normal	TCGA-LUSC	TCGA-NORMAL
PAAD	Solid Tissue Normal	TCGA-PAAD	TCGA-NORMAL
PCPG	Solid Tissue Normal	TCGA-PCPG	TCGA-NORMAL
PRAD	Solid Tissue Normal	TCGA-PRAD	TCGA-NORMAL
READ	Solid Tissue Normal	TCGA-READ	TCGA-NORMAL
SARC	Solid Tissue Normal	TCGA-SARC	TCGA-NORMAL
SKCM	Solid Tissue Normal	TCGA-SKCM	TCGA-NORMAL
STAD	Solid Tissue Normal	TCGA-STAD	TCGA-NORMAL
THCA	Solid Tissue Normal	TCGA-THCA	TCGA-NORMAL
THYM	Solid Tissue Normal	TCGA-THYM	TCGA-NORMAL
UCEC	Solid Tissue Normal	TCGA-UCEC	TCGA-NORMAL

Clinical data preprocessing

TCGA

library(data.table)
library(dplyr)

# load("/home/data2/Projects/Fusion/unclean/TCGA_ALL_clinical.rdata")
#
# TCGA_ALL_clinical |>
#   rename(Patient_ID = bcr_patient_barcode,
#          Age = age_at_initial_pathologic_diagnosis,
#          Sex = gender,
#          Race = race,
#          )

# [Discrepancy] [Not Applicable]  [Not Available]
# [Discrepancy] [Not Available]       [Unknown]
# # [Discrepancy]                                        [Not Applicable]
# 29                                                          105
# [Not Available]                                         [Not Evaluated]
# 5140                                                       4
# [Unknown]


library(UCSCXenaShiny)

table(tcga_clinical_fine$Stage_ajcc)
tcga_clinical

tcga_clinical_fine
tcga_genome_instability
tcga_purity
tcga_subtypes
tcga_surv

nrow(tcga_clinical_fine)
nrow(unique(tcga_clinical_fine))

tcga_clinical = unique(tcga_clinical_fine) |>
  left_join(tcga_surv, by = c("Sample"="sample")) |>
  left_join(tcga_purity, by = c("Sample"="sample"))

tcga_clinical[551, ]$Sample

sum(startsWith(tcga_genome_instability$sample, "TCGA"))

tcga_clinical = tcga_clinical |>
  left_join(tcga_genome_instability[startsWith(tcga_genome_instability$sample, "TCGA") & !duplicated(tcga_genome_instability$sample), ],  by = c("Sample"="sample")) |>
  #left_join(tcga_genome_instability[duplicated(tcga_genome_instability$sample), ],  by = c("Sample"="sample")) |>
  left_join(tcga_subtypes,  by = c("Sample"="sampleID"))


sum(is.na(tcga_clinical$Sample))
sum(duplicated(tcga_clinical$Sample))

tcga_clinical[duplicated(tcga_clinical$Sample), ]$Sample

tcga_clinical.final = tcga_clinical[!duplicated(tcga_clinical$Sample), ] |>
  rename(Sample_ID = Sample, Sex = Gender, OS_Time = OS.time, OS_Status = OS, DSS_Time = DSS.time, DSS_Status = DSS, DFS_Time = DFI.time, DFS_Status = DFI, PFS_Time = PFI.time, PFS_Status = PFI) |>
  select(-cancer_type) |>
  mutate(Sex = ifelse(Sex == "FEMALE", "F", "M"),
         Patient_ID = substr(Sample_ID, 1, 12))

sum(is.na(tcga_clinical.final$Sample_ID))
sum(is.na(tcga_clinical.final$Cancer))
sum(duplicated(tcga_clinical.final$Sample_ID))

tcga_clinical.final = tcga_clinical.final |>
  select(Patient_ID, Sample_ID, everything()) |>
  as.data.table()

tcga_clinical.final

fwrite(tcga_clinical.final, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")

CPTAC

library(data.table)


# Clinical data from CPTAC paper ------------------------------------------


cli_list = fs::dir_ls("CPTAC_Clinical_meta_data_v1/")

cli = purrr::map(cli_list, fread) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli)
# Problem: Need to skip the second row
# This also includes samples from CPTAC-2/3, but the total number is fewer

cli2 = purrr::map(cli_list, function(x) {
  # Since the column names are also different, read them twice—once for the column names and once for the data
  data1 = fread(x)
  data2 = fread(x, skip = 2, header = FALSE)
  # Update the column names
  colnames(data2) = colnames(data1)
  data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
  data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
#colnames(cli2) = colnames(cli)
colnames(cli2)

cli2[Tumor == "Yes" & Normal == "Yes"] # This is to mark whether the patient has both tumor and normal samples


cli_list2 = fs::dir_ls("meta_data_BCM/")

cli3 = purrr::map(cli_list2, function(x) {
  # Since the column names are also different, read them twice—once for the column names and once for the data
  data1 = fread(x)
  data2 = fread(x, skip = 2, header = FALSE)
  # 更新列名
  colnames(data2) = colnames(data1)
  data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
  data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli3)

all.equal(cli2, cli3)
all.equal(sort(colnames(cli2)), sort(colnames(cli3)))

cli4 = cli3[, colnames(cli2), with = FALSE]
all.equal(cli2, cli4)
identical(cli2, cli4)

dplyr::setdiff(cli2, cli4) |> View()
dplyr::setdiff(cli4, cli2) |> nrow()

# rbind(cli2[idx == "01OV007"],
#       cli4[idx == "01OV007"]) |> View()

# The files in the directories of the two data sources are basically identical

cli.final = cli2 |>
  dplyr::rename(
    Patient_ID = idx,
    OS_Status = OS_event,
    OS_Time = OS_days,
    PFS_Status = PFS_event,
    PFS_Time = PFS_days
  ) |>
  dplyr::mutate(
    Sample_ID = Patient_ID,
    Sex = ifelse(Sex == "Female", "F", "M")
  ) |>
  dplyr::select(
    Patient_ID, Sample_ID, Cancer, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status, dplyr::everything()
  )
  # dplyr::select(
  #   - dplyr::starts_with("CIBERSORT"), - dplyr::starts_with("xCell")
  # )


fwrite(cli.final, file = "CPTAC_cli_paper_combined.tsv", sep = "\t")

# Clinical data from GDC --------------------------------------------------


system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/CPTAC
mkdir CPTAC-3 CPTAC-2
tar zxvf clinical.project-cptac-2.2024-08-07.tar.gz -C CPTAC-2
tar zxvf clinical.project-cptac-3.2024-08-07.tar.gz -C CPTAC-3
       ")

cptac2 = fread("gdc_cli/CPTAC/CPTAC-2/clinical.tsv")
cptac3 = fread("gdc_cli/CPTAC/CPTAC-3/clinical.tsv")

which(colnames(cptac2) %in% "residual_disease")
which(colnames(cptac3) %in% "residual_disease")
all.equal(cptac2[[119]], cptac2[[192]])
all.equal(cptac3[[119]], cptac3[[192]])
cptac2[[192]] = NULL
cptac3[[192]] = NULL

setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# > setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
# [1] "01OV049"   "26OV010"   "100004028" "05BR058"   "05BR055"   "17OV019"   "11BR069"   "02OV042"   "100004012" "05BR051"   "03BR012"   "02OV045"   "04OV041"
# [14] "05BR052"   "05BR031"   "1488"      "11OV009"   "02OV040"   "01OV002"   "100002921" "22OV001"   "13OV004"   "100003304" "02OV035"   "01OV046"   "01OV045"
# [27] "17OV034"
# > setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# [1] "C3N-02672"                    "C3L-01929"                    "C3N-02682"                    "C3N-02012"                    "C3N-02070"
# [6] "C3N-02696"                    "C3L-02354"                    "C3N-03026"                    "C3L-01739"                    "C3N-02088"
# [11] "C3L-04213"                    "C3N-03205"                    "C3L-04081"                    "C3N-03791"                    "C3L-03679"
# [16] "C3L-02643"                    "C3N-04686"                    "C3L-01558"                    "C3L-03268"                    "C3L-01633"
# [21] "C3L-02220"                    "C3N-02439"                    "C3N-02028"                    "C3N-03800"                    "C3L-04392"
# [26] "C3L-02556"                    "C3N-02721"                    "C3L-01672"                    "C3L-00938"                    "C3L-02201"
# [31] "C3N-02146"                    "C3N-02639"                    "C3N-03788"                    "C3L-03733"                    "C3L-04037"
# [36] "C3N-01091"                    "C3L-02747"                    "C3N-01879"                    "C3N-02996"                    "C3N-02296"
# [41] "C3L-02553"                    "C3L-03632"                    "C3N-03755"                    "GTEX-NPJ7-0011-R10A-SM-HAKXW" "C3L-02746"
length(setdiff(cptac3$case_submitter_id, cli.final$Sample_ID))  # 586


View(cptac2[case_submitter_id %in% setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)])
View(cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)])

data.table::fwrite(
  cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)],
  file = "CPTAC3_non_included.tsv", sep = "\t"
)

cptac3


# Supplement the missing sample information in the paper based on the existing information in the GDC. ---------------------------------------------

# might be columns that have labeled cancer types.
# primary_diagnosis site_of_resection_or_biopsy tissue_or_organ_of_origin

cptac = rbind(cptac2, cptac3)[, .(case_submitter_id, primary_diagnosis, site_of_resection_or_biopsy, tissue_or_organ_of_origin)] |> unique()
cptac = dplyr::left_join(cptac, cli.final[, .(Sample_ID, Cancer)], by = c("case_submitter_id"="Sample_ID"))

setdiff(cli.final$Sample_ID, cptac$case_submitter_id)
# 29 cases are not in the GDC data.

table(cptac$Cancer, cptac$primary_diagnosis) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$site_of_resection_or_biopsy) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$tissue_or_organ_of_origin) |> pheatmap::pheatmap()

table(cptac$Cancer,
      paste(
        cptac$primary_diagnosis,
        cptac$site_of_resection_or_biopsy,
        cptac$tissue_or_organ_of_origin, sep = "_"
      )) |> pheatmap::pheatmap(show_colnames = FALSE)

# Solve them one by one.
cat(sub("_meta.txt", "", basename(names(cli_list))))

typelist = sub("_meta.txt", "", basename(names(cli_list)))
# BRCA CCRCC COAD GBM HGSC HNSCC LSCC LUAD PDAC UCEC
for (i in typelist) {
  message("check type ", i)
  cptac_part = cptac |> dplyr::filter(Cancer == i)
  message("  primary_diagnosis")
  table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
  message("  site_of_resection_or_biopsy")
  table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
  message("  tissue_or_organ_of_origin")
  table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}
# BRCA (122): tissue_or_organ_of_origin "Breast, NOS"
# CCRCC (103): primary_diagnosis "Renal cell carcinoma, NOS"
#        tissue_or_organ_of_origin "Kidney, NOS"
# COAD (106): tissue_or_organ_of_origin "Colon, NOS", "Rectum, NOS"
# GBM (99): primary_diagnosis "Glioblastoma"
# HGSC (87): primary_diagnosis "Serous adenocarcinoma, NOS" AND
#  tissue_or_organ_of_origin Fallopian tube Ovary Peritoneum, NOS
# HNSCC (108): Other situation: primary_diagnosis "Squamous cell carcinoma, NOS"
# Base of tongue, NOS Cheek mucosa Floor of mouth, NOS Gum, NOS
# HNSCC                   1            2                  18        3
#
# Head, face or neck, NOS Larynx, NOS Lip, NOS Oropharynx, NOS
# HNSCC                       2          47        4               4
#
# Overlapping lesion of lip, oral cavity and pharynx Tongue, NOS Tonsil, NOS
# HNSCC
# LSCC (108): primary_diagnosis "Squamous cell carcinoma, NOS" AND
#   tissue_or_organ_of_origin "lung" pattern
# LUAD (110): primary_diagnosis "Adenocarcinoma, NOS" AND
#   tissue_or_organ_of_origin "lung" pattern
# PDAC (105): tissue_or_organ_of_origin "pancreas" pattern
# UCEC (95): primary_diagnosis "Endometrioid adenocarcinoma, NOS"
sum(!is.na(cptac$Cancer))

# rbind(cptac2, cptac3)[,192] # There are duplicate columns, need to go back and handle this earlier (fixed).

# Actual testing and updating
cptac_updated = rbind(cptac2, cptac3) |>
  unique() |>
  dplyr::mutate(
    Cancer = dplyr::case_when(
      tissue_or_organ_of_origin %in% "Breast, NOS" ~ "BRCA",
      primary_diagnosis %in% "Renal cell carcinoma, NOS" ~ "CCRCC",
      tissue_or_organ_of_origin %in% c("Colon, NOS", "Rectum, NOS") ~ "COAD",
      grepl("Glioblastoma", primary_diagnosis, ignore.case = TRUE) | grepl("Oligodendroglioma", primary_diagnosis, ignore.case = TRUE) ~ "GBM",
      grepl("GTEX", case_submitter_id, ignore.case = TRUE) ~ "GTEX-Brain",
      case_submitter_id %in% c("C3L-06912", "C3L-07212") ~ "GBM",
      #primary_diagnosis %in% "Glioblastoma" ~ "GBM",
      primary_diagnosis %in% "Serous adenocarcinoma, NOS" & tissue_or_organ_of_origin %in% c("Fallopian tube", "Ovary", "Peritoneum, NOS") ~ "HGSC",
      primary_diagnosis %in% "Squamous cell carcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LSCC",
      primary_diagnosis %in% "Adenocarcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LUAD",
      grepl("pancreas", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "PDAC",
      primary_diagnosis %in% "Endometrioid adenocarcinoma, NOS" ~ "UCEC",
      TRUE ~ "HNSCC"
    )
  ) |> as.data.table()

for (i in "HNSCC") { # typelist
  message("check type ", i)
  cptac_part = cptac_updated |> dplyr::filter(Cancer == i)
  message("  primary_diagnosis")
  table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
  message("  site_of_resection_or_biopsy")
  table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
  message("  tissue_or_organ_of_origin")
  table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}

table(cptac$Cancer)
table(cptac_updated$Cancer)

cptac_updated |>
  dplyr::filter(
    primary_diagnosis == "'--" | site_of_resection_or_biopsy == "'--" |
    tissue_or_organ_of_origin %in% c("'--", "Unknown")
  )
# 1-9 from GTEX-Brain
# C3L-06912 C3L-07212 Gliomas # https://portal.gdc.cancer.gov/cases/e6027623-a5b2-4d82-a9ca-f37c948dd3ed
cptac_updated$case_submitter_id[grepl("GTEX", cptac_updated$case_submitter_id)]

nrow(cptac_updated)

fwrite(cptac_updated, file = "CPTAC_combined_with_cancer_annotated.tsv", sep = "\t")

cptac_unmatched = cptac_updated[! case_submitter_id %in% cli.final$Sample_ID]
colnames(cptac_unmatched)
colnames(cli.final)[1:20]

cptac_unmatched2 = cptac_unmatched |>
  dplyr::select(
    case_submitter_id, age_at_diagnosis, gender,
    days_to_death, vital_status,
    dplyr::starts_with("ajcc"),
    dplyr::starts_with("days"),
    last_known_disease_status, progression_or_recurrence, residual_disease,
    tumor_grade, Cancer
  )

sum(cptac_unmatched2$days_to_last_follow_up != "'--")
sum(cptac_unmatched2$days_to_death != "'--")
sum(cptac_unmatched2$days_to_recurrence != "'--")
sum(cptac_unmatched2$days_to_last_known_disease_status != "'--")

sum(cptac_unmatched2$progression_or_recurrence != "'--")


sum(!is.na(cli.final$OS_Time))
sum(!is.na(cli.final$PFS_Time))
sum(cli.final$OS_Time >= cli.final$PFS_Time, na.rm = TRUE)
cli.final[cli.final$OS_Time < cli.final$PFS_Time] |> View()

sum(as.integer(cptac_unmatched2$days_to_last_follow_up) >= as.integer(cptac_unmatched2$days_to_last_known_disease_status), na.rm = TRUE)

cptac_unmatched3 = cptac_unmatched2 |>
  dplyr::rename(
    Sample_ID = case_submitter_id,
    Age = age_at_diagnosis,
    Sex = gender,
    OS_Time = days_to_last_follow_up,
    OS_Status = vital_status,
    PFS_Time = days_to_last_known_disease_status,
    PFS_Status = progression_or_recurrence,
    Histologic_Grade = tumor_grade,
    Path_Stage_pT = ajcc_pathologic_t,
    Path_Stage_pN = ajcc_pathologic_n,
    Stage = ajcc_pathologic_stage
  ) |>
  dplyr::select(
    Sample_ID, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status,
    Path_Stage_pN:Path_Stage_pT,
    Histologic_Grade, Cancer
  ) |>
  dplyr::mutate(
    Patient_ID = Sample_ID,
    Age = as.integer(round(as.integer(Age) / 365)),
    Sex = fcase(Sex %in% "female", "F",
                Sex %in% "male", "M", default = NA_character_),
    OS_Time = as.integer(OS_Time),
    OS_Status = fcase(OS_Status %in% "Alive", 0L,
                      OS_Status %in% "Dead", 1L, default = NA_integer_),
    PFS_Time = as.integer(PFS_Time),
    PFS_Status = fcase(PFS_Status %in% "no", 0L,
                       PFS_Status %in% "yes", 1L, default = NA_integer_),
    OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time),
    Path_Stage_pN = fcase(
      Path_Stage_pN %in% "N0", "pN0",
      Path_Stage_pN %in% c("N1", "N1a"), "pN1",
      Path_Stage_pN %in% c("N2", "N2a"), "pN2",
      default = NA_character_
    ),
    Path_Stage_pT = fcase(
      Path_Stage_pT %in% "T0", "pT0",
      Path_Stage_pT %in% c("T1", "T1a", "T1b", "T1c"), "pT1",
      Path_Stage_pT %in% c("T2", "T2a", "T2b"), "pT2",
      Path_Stage_pT %in% c("T3", "T3a", "T3b"), "pT3",
      Path_Stage_pT %in% "T4", "pT4",
      default = NA_character_
    ),
    Stage = fcase(
      Stage %in% c("Stage I", "Stage IA", "Stage IA1", "Stage IA2", "Stage IA3", "Stage IB"), "Stage I",
      Stage %in% c("Stage II", "Stage IIA", "Stage IIB"), "Stage II",
      Stage %in% c("Stage III", "Stage IIIA", "Stage IIIB", "Stage IIIC", "Stage IIIC1", "Stage IIIC2"), "Stage III",
      Stage %in% c("Stage IV", "Stage IVB"), "Stage IV",
      default = NA_character_
    ),
    Histologic_Grade = fcase(
      Histologic_Grade %in% "G1", "G1 Well differentiated",
      Histologic_Grade %in% "G2", "G2 Moderately differentiated",
      Histologic_Grade %in% "G3", "G3 Poorly differentiated",
      Histologic_Grade %in% "G4", "G4 Undifferentiated",
      default = NA_character_
    )
  )

cptac_unmatched3

# X: cannot be assessed

table(cptac_unmatched3$Path_Stage_pN)
table(cli.final$Path_Stage_pN)

table(cptac_unmatched3$Path_Stage_pT)
table(cli.final$Path_Stage_pT)

table(cptac_unmatched3$Stage)
table(cli.final$Stage)

table(cptac_unmatched3$Histologic_Grade)
table(cli.final$Histologic_Grade)



# Final output ------------------------------------------------------------

cli.final.2 = dplyr::bind_rows(
  cli.final, cptac_unmatched3
) |> as.data.table()

table(cli.final.2$Cancer)
sum(table(cli.final.2$Cancer))

# Include all Race information?
cli.final.2 = dplyr::left_join(
  cli.final.2, unique(rbind(cptac2, cptac3)[, .(case_submitter_id, race)]),
  by = c("Sample_ID"="case_submitter_id")
) |>
  dplyr::mutate(
    race = ifelse(race %in% c("Unknown", "not reported"), NA_character_,
                  race)
  ) |>
  as.data.table()

sum(is.na(cli.final.2$race))

fwrite(cli.final.2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")

TARGET

library(data.table)

cli1 = fread("TARGET_donor_allprojects_transfer_to_sample.gz")
cli2 = fread("TARGET_phenotype.gz")

target_cli = cli2 |>
  dplyr::select(-`_cohort`) |>
  dplyr::rename(Patient_ID = `_PATIENT`,
                Sample_ID = sample_id,
                disease_code = primary_disease_code,
                disease = `_primary_disease`,
                sample_type = `_sample_type`) |>
  dplyr::full_join(
    cli1 |> dplyr::select(
      -`_OS_UNIT`, -`_EVENT`, -`_TIME_TO_EVENT`
    ) |>
      dplyr::rename(
        Sample_ID = xena_sample,
        Sex = `_gender`,
        Age = `_age_at_diagnosis`,
        OS_Status = `_OS_IND`,
        OS_Time = `_OS`
      ), by = "Sample_ID"
  ) |>
  dplyr::mutate(
    Sex = ifelse(Sex == "Male", "M", "F"),
    OS_Status = ifelse(OS_Status == "Alive", 0L, 1L)
  ) |>
  dplyr::select(Patient_ID, Sample_ID, Age, Sex, OS_Time, OS_Status, dplyr::everything()) |>
  as.data.table()

cli1

target_cli

fwrite(target_cli, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t")


# Using cli from GDC ------------------------------------------------------

system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/TARGET
mkdir ALL_P1 ALL_P2 ALL_P3 AML CCSK NBL OS RT WT
tar zxvf clinical.project-target-all-p1.2024-08-07.tar.gz -C ALL_P1
tar zxvf clinical.project-target-all-p2.2024-08-07.tar.gz -C ALL_P2
tar zxvf clinical.project-target-all-p3.2024-08-07.tar.gz -C ALL_P3
tar zxvf clinical.project-target-aml.2024-08-07.tar.gz -C AML
tar zxvf clinical.project-target-ccsk.2024-08-07.tar.gz -C CCSK
tar zxvf clinical.project-target-nbl.2024-08-07.tar.gz -C NBL
tar zxvf clinical.project-target-os.2024-08-07.tar.gz -C OS
tar zxvf clinical.project-target-rt.2024-08-07.tar.gz -C RT
tar zxvf clinical.project-target-wt.2024-08-07.tar.gz -C WT
       ")

gdc_target = rbindlist(
  list(
    fread("gdc_cli/TARGET/ALL_P1/clinical.tsv"),
    fread("gdc_cli/TARGET/ALL_P2/clinical.tsv"),
    fread("gdc_cli/TARGET/ALL_P3/clinical.tsv"),
    fread("gdc_cli/TARGET/AML/clinical.tsv"),
    fread("gdc_cli/TARGET/CCSK/clinical.tsv"),
    fread("gdc_cli/TARGET/NBL/clinical.tsv"),
    fread("gdc_cli/TARGET/OS/clinical.tsv"),
    fread("gdc_cli/TARGET/RT/clinical.tsv"),
    fread("gdc_cli/TARGET/WT/clinical.tsv")
  ),
  use.names = TRUE, fill = TRUE
)

table(gdc_target$project_id)
any(duplicated(gdc_target))
gdc_target = gdc_target[!duplicated(gdc_target)]

all.equal(gdc_target[[119]], gdc_target[[192]])
gdc_target[[192]] = NULL

sum(gdc_target$days_to_last_follow_up != "'--")
sum(gdc_target$days_to_death != "'--")

gdc_target2 = gdc_target |>
  dplyr::rename(
    Sample_ID = case_submitter_id,
    Age = age_at_diagnosis,
    Sex = gender,
    OS_Time = days_to_death,
    OS_Status = vital_status
  ) |>
  dplyr::select(
    Sample_ID, Age, Sex, OS_Time, OS_Status,
    project_id, race, primary_diagnosis
  ) |>
  dplyr::mutate(
    Patient_ID = Sample_ID,
    Age = as.integer(round(as.integer(Age) / 365)),
    Sex = fcase(Sex %in% "female", "F",
                Sex %in% "male", "M", default = NA_character_),
    OS_Time = as.integer(OS_Time),
    OS_Status = fcase(OS_Status %in% "Alive", 0L,
                      OS_Status %in% "Dead", 1L, default = NA_integer_),
    OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time)
  ) |>
  as.data.table()

#unique()

# exist  a large number of case level  replicate
gdc_target2[duplicated(gdc_target2), ]
duplicated(gdc_target[case_submitter_id == "TARGET-20-PANTIV"])
gdc_target[case_submitter_id == "TARGET-20-PANTIV"] |> View() # A few records that have been updated？

gdc_target2 = gdc_target2[!duplicated(gdc_target2)]

#table(gdc_target2$PFS_Time)
any(duplicated(gdc_target2))
any(duplicated(gdc_target2$Sample_ID))

gdc_target2[duplicated(gdc_target2$Sample_ID)]
# They may exist in different ALL projects.
gdc_dup = gdc_target2[Sample_ID %in% Sample_ID[duplicated(Sample_ID)]]
table(gdc_dup$project_id)
# TARGET-ALL-P1 TARGET-ALL-P2 TARGET-ALL-P3
# 10            57            47
#
# For now, we won't address this issue. If we conduct a combined analysis of ALL, we will consider removing duplicates.

colnames(gdc_target2)
gdc_target2 = gdc_target2[, c("Patient_ID", "Sample_ID", "Age", "Sex", "race", "primary_diagnosis", "OS_Time", "OS_Status", "project_id")]

fwrite(gdc_target2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t")

sum(!is.na(gdc_target2$OS_Time))
sum(!is.na(cli1$`_OS`))

nrow(na.omit(unique(target_cli[, .(Patient_ID, OS_Time, OS_Status)])))
nrow(na.omit(unique(gdc_target2[, .(Patient_ID, OS_Time, OS_Status)])))

TCGA/TARGET/CPTAC Clinical/IOBR/TME/GENE data divide and anotate preprocessing

TARGET

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)

file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
metafusion <- fread(file_meta, sep = "\t")

# TARGET
## info
### Because some projects are not included in the target_sheet, we should abandon them.
TARGET_sheet <- fread("/home/data2/Projects/Fusion/unclean/TARGET_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")
project_choose <- substr(TARGET_sheet$`Sample ID`, 1, 9) |> unique()
project_choose
### "TARGET-20" "TARGET-00" "TARGET-21" "TARGET-30"

### TARGET_sheet pretreatment

### Composite Sample 
###
### Composite Sample use comma to separate like TARGET-30-PAPEAV, TARGET-30-PAPTFZ
TARGET_Composite <- TARGET_sheet |> 
  filter(grepl(", ", `Sample ID`)) |>  
  select(`Sample ID`) |>
  unique()
TARGET_Composite
### 1: TARGET-30-PAPEAV-01A, TARGET-30-PAPTFZ-01A
### 2: TARGET-30-PANUKV-01A, TARGET-30-PASUML-01A
### 3: TARGET-30-PANKFE-01A, TARGET-30-PAPUAR-01A
### 4: TARGET-30-PASYPX-01A, TARGET-30-PAIXIF-01A

### For the CPTAC databases, check whether there is any overlap between composite and normal Sample_ID in the info data. 
check_TARGET_overlap <- TARGET_sheet |>  
  select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
  unique() |> 
  separate_rows(`Sample ID`, sep = ",\\s*") 

check_TARGET_overlap <- check_TARGET_overlap |>
  mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))

overlap_samples <- check_TARGET_overlap |>
  group_by(`Sample ID`) |>
  summarise(Count = n(), .groups = 'drop') |>
  filter(Count > 1)

overlap_samples
###  A tibble: 4 × 2
###  `Sample ID`          Count
###  <chr>                <int>
###  1 TARGET-30-PANKFE-01A     2
###  2 TARGET-30-PAPTFZ-01A     2
###  3 TARGET-30-PASUML-01A     2
###  4 TARGET-30-PASYPX-01A     2

### So we choose abandon them.

TARGET_pretreatment <- TARGET_sheet |> 
  select(c(`Case ID`, `Sample ID`, `Sample Type`)) |>
  rename("Patient_ID" = `Case ID`,
         "Sample_ID" = `Sample ID`,
         "Sample_type"  =  `Sample Type`) |> 
  filter(!grepl(",", Sample_ID)) |> 
  filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |> 
  unique()

### TARGET_xena_info pretreatment
### This data use
TARGET_xena_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t") 
TARGET_xena_pretreatment   <- TARGET_xena_info |>
                              mutate(Patient_ID = str_extract(Sample_ID, "^[^-]+-[^-]+-[^-]+")) |>
                              filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
                              mutate(Age = round(Age, digits = 0),
                                     across(c(Age, OS_Time, OS_Status), ~ as.integer(.))) |>
                              rename(Cancer = disease_code)

### TARGET_info pretreatment
### In the TARGET database, rows with the info column marked as '--' are annotated as NA.
filter_lose <- function(x) {
  is.na(x) | x == "" | x == "--" | x == "Unknown" | x == "'--"
 }
TARGET_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t") 
replacement_table2 <- c(
  "Acute lymphocytic leukemia" =  "TARGET-ALL",             
  "'--"   = "TARGET-Unkown",                                  
  "Precursor B-cell lymphoblastic leukemia" = "TARGET-B-ALL", 
  "Not Reported"  = "TARGET-Unkown",                         
  "Acute myeloid leukemia, NOS" = "TARGET-LAML",             
  "Clear cell sarcoma of kidney" = "TARGET-CCSK",           
  "Neuroblastoma, NOS"  = "TARGET-NBL",                     
  "Ganglioneuroblastoma" = "TARGET-GNB",                   
  "Osteosarcoma, NOS"   = "TARGET-OS",                    
  "Malignant rhabdoid tumor" = "TARGET-MRT",               
  "Wilms tumor" = "TARGET-WT"  
)
matched2 <- TARGET_info$primary_diagnosis %in% names(replacement_table2)
TARGET_info_pretreatment <- TARGET_info |>
                            mutate(cohort = case_when(
                                   matched2 ~ replacement_table2[primary_diagnosis],
                                   TRUE ~ as.character(NA)  
                                   )) |>
                            rename(Cancer = primary_diagnosis,
                                   Project_id = project_id) |> 
                            filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |> 
                            mutate(across(c("Age", "Sex", "OS_Time", "OS_Status", "race"), ~ ifelse(filter_lose(.x), NA, .x))) |> 
                            mutate(Cancer = ifelse(Cancer == "'--", NA, Cancer))

### Check if the valid rows in TARGET_info match the valid rows in TARGET_xena_info.

### 1.Filter the valid rows.
filter_valid <- function(x) {
  !is.na(x) & x != "" & x != "--" & x != "Unknown" & x == "'--"
}

### Filter the valid rows in TARGET_info.
TARGET_info_valid <- TARGET_info_pretreatment %>%
                     filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
                     select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))

### Filter the valid rows in TARGET_xena_info.
TARGET_xena_info_valid <- TARGET_xena_pretreatment %>%
                          filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
                          select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))

### 2. Merge the two data frames based on Patient_ID.
merged_data <- TARGET_info_valid %>%
               inner_join(TARGET_xena_info_valid, by = "Patient_ID", suffix = c("_info", "_xena"))

### 3. Check whether the valid values in the merged data frame are consistent.
is_consistent <- merged_data %>%
  rowwise() %>%
  mutate(
    Age_consistent = Age_info == Age_xena,
    Sex_consistent = Sex_info == Sex_xena,
    OS_Time_consistent = OS_Time_info == OS_Time_xena,
    OS_Status_consistent = OS_Status_info == OS_Status_xena
  ) %>%
  ungroup()

### 4. Age not onsistent
Age_not_consistent <- is_consistent |> 
                      filter(Age_consistent == "FALSE") 

### 5. Sex not onsistent
Sex_not_consistent <- is_consistent |> 
                      filter(Sex_consistent == "FALSE") 
### Sex_not_consistent

###  A tibble: 0 × 13

### 6. OS_Time not onsistent
OS_Time_not_consistent <- is_consistent |> 
                          filter(OS_Time_consistent == "FALSE") |>
                          mutate(
                          OS_Time_info_greater = OS_Time_info > OS_Time_xena,  
                          OS_Time_xena_greater = OS_Time_xena > OS_Time_info,  
                          OS_Time_equal = OS_Time_info == OS_Time_xena         
                          )
nrow(OS_Time_not_consistent |> filter(OS_Time_info_greater == "FALSE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_xena_greater == "TRUE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_equal == "TRUE"))
### [1]0

### 7. OS_Status not onsistent
OS_Status_not_consistent <- is_consistent |> 
                            filter(OS_Status_consistent == "FALSE") 

unique(OS_Status_not_consistent$OS_Status_info)
### [1] 1
unique(OS_Status_not_consistent$OS_Status_xena)
### [1] 0

### All evidence shows that the GDC data has been updated more recently than the Xena data.
### Therefore, if there is a conflict between GDC data and Xena data, we should use the GDC data.

### Use the valid values from TARGET_xena_info to fill the missing values in TARGET_info.
### 1: Determine the matching Patient_ID.
matched_ids <- intersect(TARGET_xena_pretreatment$Patient_ID, TARGET_info_pretreatment$Patient_ID)
setDT(TARGET_xena_pretreatment)
setDT(TARGET_info_pretreatment)

### 2: Find the columns that require updates.
cols_to_update <- setdiff(names(TARGET_info_pretreatment), c("Sample_ID", "Patient_ID", "Cancer","project_id","cohort","race"))

### 3.Iterate through each matching Patient_ID and update TARGET_info_pretreatment using TARGET_xena_pretreatment.
### update Age
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("Age", ~ !filter_lose(.x))) 

TARGET_info_match <- TARGET_info_pretreatment |>
                            mutate(Age = case_when(
                                                   filter_lose(Age) & Patient_ID %in% TARGET_xena_can$Patient_ID ~  TARGET_xena_can$Age[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ Age))

### update Sex
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("Sex", ~ !filter_lose(.x))) 

TARGET_info_match1 <- TARGET_info_match |>
                            mutate(Sex = case_when(
                                                   filter_lose(Sex) & Patient_ID %in% TARGET_xena_can$Patient_ID ~  TARGET_xena_can$Sex[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ Sex))

### update OS_Time
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("OS_Time", ~ !filter_lose(.x))) 

TARGET_info_match2 <- TARGET_info_match1 |>
                      mutate(OS_Time = case_when(
                                                   filter_lose(OS_Time) & Patient_ID %in% TARGET_xena_can$Patient_ID  & OS_Status %in% TARGET_xena_can$OS_Status ~  TARGET_xena_can$OS_Time[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ OS_Time))

### update OS_Status
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("OS_Status", ~ !filter_lose(.x))) 

TARGET_info_after_match <- TARGET_info_match2 |>
                            mutate(OS_Status = case_when(
                                                   filter_lose(OS_Status) & Patient_ID %in% TARGET_xena_can$Patient_ID  & OS_Time %in% TARGET_xena_can$OS_Time ~  TARGET_xena_can$OS_Status[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ OS_Status)) 
  
### Use TARGET_sheet to perform a left join, ensuring that the Sample_ID is correct.
TARGET_info_after_match <- TARGET_info_after_match |> select(- Sample_ID)
TARGET_correct      <- TARGET_pretreatment |>
                       left_join(TARGET_info_after_match, by = c("Patient_ID" = "Patient_ID")) |>
                       mutate(Run = Sample_ID)

### Replace the cohort column values with Project_ID where the row is TARGET-Unkown.
### TARGET_correct$project_id |> unique()
### [1] "TARGET-AML" "TARGET-NBL"
TARGET_correct_after_pro <- TARGET_correct  %>%
  mutate(
    cohort = case_when(
             cohort == "TARGET-Unkown" ~ Project_id,
             TRUE ~ cohort),
    cohort = case_when(
             cohort == "TARGET-AML" ~ "TARGET-LAML",
             TRUE ~ cohort)
  ) 

### Divide a part of data into a group based on whether the Sample_type is cell

target_cell <- TARGET_correct_after_pro |> filter(`Sample_type` == "Cell Lines")
target_cell_sample <- target_cell$Sample_ID |> unique()
target_cell_patient <- target_cell$Patient_ID |> unique()
### target_cell_sample
### [1] "TARGET-20-MUTZ3-50A"         "TARGET-20-ML1-50A"           "TARGET-20-MV411D1-50A"       "TARGET-20-KasumiD1-50A"    "TARGET-20-HEK293-50A"      "TARGET-20-MOLM14CBFGLIS-50A"     
### [7] "TARGET-20-TF1-50A"           "TARGET-20-IGROV1-50A"        "TARGET-20-CMS-50A"           "TARGET-20-WSUAML-50A"      "TARGET-20-KasumiAZAD5-50A" "TARGET-20-KasumiAZAD11-50A"
### [13]"TARGET-20-REH-50A"           "TARGET-20-MV411AZAD11-50A"   "TARGET-20-MV411AZAD5-50A"    "TARGET-20-THP1-50A"        "TARGET-20-OCIAML2-50A"     "TARGET-20-HL60-50A"  
### target_cell_patient
### [1] "TARGET-20-MUTZ3"         "TARGET-20-ML1"           "TARGET-20-MV411D1"       "TARGET-20-KasumiD1"      "TARGET-20-HEK293"        "TARGET-20-MOLM14CBFGLIS"
### [7] "TARGET-20-TF1"           "TARGET-20-IGROV1"        "TARGET-20-CMS"           "TARGET-20-WSUAML"        "TARGET-20-KasumiAZAD5"   "TARGET-20-KasumiAZAD11" 
### [13]"TARGET-20-REH"           "TARGET-20-MV411AZAD11"   "TARGET-20-MV411AZAD5"    "TARGET-20-THP1"          "TARGET-20-OCIAML2"       "TARGET-20-HL60"  
### reference : https://www.cellosaurus.org/
### MUTZ3：Acute myeloid leukemia
### ML1：    Adult acute myeloid leukemia
### MV411D1：biphenotypic B-myelomonocytic leukemia
### KasumiD1：Acute myeloblastic leukemia
### TF1：Acute erythroid leukemia 
### IGROV1： Ovarian endometrioid adenocarcinoma
### WSUAML：Acute myeloid leukemia。
### KasumiAZAD5：can't find, so KasumiD1：Acute myeloblastic leukemia
### KasumiAZAD11：can't find, so KasumiD1：Acute myeloblastic leukemia
### REH：Childhood B acute lymphoblastic leukemia
### MV411AZAD11：can't find, so MV411D1：biphenotypic B-myelomonocytic leukemia
### MV411AZAD5：can't find, so MV411D1：biphenotypic B-myelomonocytic leukemia
### THP1：Childhood acute monocytic leukemia
### OCIAML2：Adult acute myeloid leukemia
### HL60：Adult acute myeloid leukemia
### HEK293：Vaccine production cell line.
### CML：Childhood acute megakaryoblastic leukemia
### MOLM14CBFGLIS：can't find, so MOLM14：Adult acute myeloid leukemia
cell_data <- metafusion[metafusion$Sample_ID %in% target_cell_sample]
### > nrow(cell_data)
###  [1] 1853
cell_info <- TARGET_info_pretreatment[TARGET_info_pretreatment$Patient_ID %in% target_cell_patient]
###  nrow(cell_info)
###  [1] 18
### no os.etc
cell_info2 <- TARGET_xena_pretreatment[TARGET_xena_pretreatment$Patient_ID %in% target_cell_patient]
### nrow(cell_info2)
### [1] 0

TARGET_info_after_cell <- TARGET_correct_after_pro |>
  mutate(cohort = case_when(
    `Sample_type` %in% "Cell Lines" ~ "TARGET-CELL",
    TRUE ~ cohort  
  ))

### The criteria for distinguishing normal samples based on Sample_type column in the TARGET databases are as follows:
TARGET_info_after_cell$Sample_type |> unique()
### [1] "Primary Blood Derived Cancer - Bone Marrow"              "Recurrent Blood Derived Cancer - Bone Marrow"           
### [3] "Blood Derived Normal"                                    "Primary Blood Derived Cancer - Peripheral Blood"        
### [5] "Bone Marrow Normal"                                      "Cell Lines"                                             
### [7] "Next Generation Cancer Model"                            "Blood Derived Cancer - Bone Marrow, Post-treatment"     
### [9] "Primary Tumor"                                           "Recurrent Tumor"                                        
### [11] "Recurrent Blood Derived Cancer - Peripheral Blood"       "Blood Derived Cancer - Peripheral Blood, Post-treatment"
### 
### If the sample type is "Blood Derived Normal" or "Bone Marrow Normal", then it is a normal sample. Otherwise, it is a tumor sample.
TARGET_info_after_normal <- TARGET_info_after_cell |>
  mutate(cohort = case_when(
    `Sample_type` %in% c("Blood Derived Normal", "Bone Marrow Normal") ~ "TARGET-NORMAL",
    TRUE ~ cohort  
  ))

### Adjust the column order.
new_order <- c("Patient_ID", "Sample_ID", "Run", "Sample_type", "Age", "Sex", "race", "Cancer", 
               "OS_Time", "OS_Status", "Project_id", "cohort")

TARGET_info_after_processing <- TARGET_info_after_normal |>
                                select(all_of(new_order)) 

### make sure nrow is same between Before and After Processing. 
TARGET_pretreatment |> nrow()
### [1]  3222
TARGET_info_after_processing |> nrow() 
### [1] 3222

### make sure IOBR/TME data just contain sample which target_sheet gave. 
detected_id_TARGET <- TARGET_info_after_processing$Run

### save info
cohorts <- unique(TARGET_info_after_processing$cohort)

for (cohort1 in cohorts) {
  cohort_data <- subset(TARGET_info_after_processing, cohort == cohort1)
  name <- gsub("-", "_", cohort1)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
  fwrite(cohort_data , file = file_name, sep = "\t")  
}

##Gene
metafusion1 <- metafusion |> 
  mutate(
    cohort = case_when(
      Sample_ID %in% TARGET_info_after_processing$Run & grepl("TARGET", cohort) ~ TARGET_info_after_processing$cohort[match(Sample_ID, TARGET_info_after_processing$Run)],
      TRUE ~ cohort
    )
  )  
  cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
  fwrite(metafusion1, file = cp_file_name1, sep = "\t") 

##TME
NBL_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
AML_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
TARGET_TME <- rbind(NBL_TME,AML_TME) |>
              mutate(dash_count = str_count(ID, "-"))
### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
TARGET_TME |> 
filter(dash_count == 9) |> 
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R

TARGET_TME_pretreatment <- TARGET_TME |>
                           filter(dash_count != 9) |>
                           select(-dash_count)

### Filter TARGET Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
### 
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
### 
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
### 
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
###           <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>
TARGET_TME_after_replicate <- TARGET_TME_pretreatment |>
                              mutate(
                              prefixes = str_remove(ID, "-[^-]+$"),  
                              suffixes = str_extract(ID, "[^-]+$"), 
                              second_chars = str_extract(suffixes, "[0-9]+"),
                              second_chars = as.integer(second_chars)) |>
                              group_by(prefixes) |>
                              slice(which.max(second_chars)) |>
                              ungroup() |>
                              mutate(ID = prefixes) |> 
                              filter(ID %in% detected_id_TARGET)  |> ### Ensure that all IDs in the TME data can be found in the TARGET info data. 
                              select(-c(prefixes,suffixes,second_chars))

### Using TARGET info data cohort column to annotate TARGET TME data 
cohort_data   <- TARGET_info_after_processing |> 
                 select(Run, cohort)   |>
                 unique()   

TARGET_TME_after_cohort <- TARGET_TME_after_replicate |>
                           mutate(Run = ID)  |>
                           left_join(cohort_data, by = "Run")  |>
                           select(-Run)

### save TME
cohorts_TME <- unique(TARGET_TME_after_cohort$cohort)

for (cohort1 in cohorts_TME) {
  cohort_data <- subset(TARGET_TME_after_cohort, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  

## IOBR
NBL_IOBR_signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
AML_IOBR_signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
TARGET_IOBR <- rbind(NBL_IOBR_signature,AML_IOBR_signature) |>
               mutate(dash_count = str_count(ID, "-"))

### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
TARGET_IOBR |> 
filter(dash_count == 9) |> 
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R

TARGET_IOBR_pretreatment <- TARGET_IOBR |>
                            filter(dash_count != 9) |>
                            select(-dash_count)

TARGET_IOBR_after_replicate <- TARGET_IOBR_pretreatment |>
                            mutate(
                            prefixes = str_remove(ID, "-[^-]+$"),  
                            suffixes = str_extract(ID, "[^-]+$"), 
                            second_chars = str_extract(suffixes, "[0-9]+"),
                            second_chars = as.integer(second_chars)) |>
                            group_by(prefixes) |>
                            slice(which.max(second_chars)) |>
                            ungroup() |>
                            mutate(ID = prefixes) |> 
                            filter(ID %in% detected_id_TARGET)  |> ### Ensure that all IDs in the TME data can be found in the TARGET info data. 
                            select(-c(prefixes,suffixes,second_chars))

### Using TARGET info data cohort column to annotate TARGET TME data 
cohort_data   <- TARGET_info_after_processing |> 
                 select(Run, cohort)   |>
                 unique()   

TARGET_IOBR_after_cohort <- TARGET_IOBR_after_replicate |>
                            mutate(Run = ID)  |>
                            left_join(cohort_data, by = "Run")  |>
                            select(-Run)
### save IOBR
cohorts_IOBR <- unique(TARGET_IOBR_after_cohort$cohort)

for (cohort1 in cohorts_IOBR) {
  cohort_data <- subset(TARGET_IOBR_after_cohort, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}

CPTAC

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)

file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.tsv"
metafusion <- fread(file_meta, sep = "\t")

# CPTAC
## info
CPTAC_sheet <- fread("/home/data2/Projects/Fusion/unclean/CPTAC_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")

### Composite Sample 
###
### Composite Sample use comma to separate like C3L-03400-05, C3L-03400-02
### 
### It leads one question is whether these composite samples using comma to separate from identical patient?
rows_with_comma <- grepl(", ", CPTAC_sheet$`Sample ID`)
rows_with_comma_data <- CPTAC_sheet[rows_with_comma, ]

check_same_chars <- function(x) {
  parts <- unlist(strsplit(x, ", "))
  return(parts[1] == parts[2])
}
same_chars <- sapply(rows_with_comma_data$`Case ID`, check_same_chars)
if (all(same_chars)) {
  cat("TRUE\n")
} else {
  cat("FALSE\n")
}
### They are from identical patient.

### For the TARGET databases, check whether there is any overlap between composite and normal Sample_ID in the info data. If there is an overlap, retain only the normal Sample_ID. 
check_cptac_overlap <- CPTAC_sheet |>  
                       select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
                       unique() |> 
                       separate_rows(`Sample ID`, sep = ",\\s*") 
 
check_cptac_overlap <- check_cptac_overlap |>
                        mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))
                      
overlap_samples <- check_cptac_overlap |>
                        group_by(`Sample ID`) |>
                        summarise(Count = n(), .groups = 'drop') |>
                        filter(Count > 1)
### > overlap_samples 
### A tibble: 0 × 2
### ℹ 2 variables: Sample ID <chr>, Count <int> 
### no this situation

### Due to the less rigorous patient naming conventions(like 604) in the CPTAC2 project, check for any overlapping patient names between CPTAC2 and CPTAC3.
CPTAC_sheet_rigorous  <- CPTAC_sheet |> 
                         select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>  
                         unique() 
duplicates <- CPTAC_sheet_rigorous |>
                          group_by(`Case ID`, `Sample ID`) |>
                          summarise(n_projects = n(), projects = paste(`Project ID`, collapse = ", ")) |>
                          filter(n_projects > 1)
### nrow(duplicates)
### [1] 0
### no same patient names between CPTAC2 and CPTAC3

### CPTAC_sheet pretreatment
matched_CPTAC_pretreatment <- CPTAC_sheet |> 
  select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
  rename("Patient_ID" = `Case ID`,
         "Sample_ID" = `Sample ID`,
         "Sample_type"  =  `Sample Type`,
         "Project_id" = `Project ID`) |> 
  separate_rows(Patient_ID, sep = ",\\s*") |>  #because they are from identical patient.
  unique() 

### CPTAC_info pretreatment
CPTAC_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")
# > CPTAC_info$Cancer |> unique()
# [1] "BRCA"       "CCRCC"      "COAD"       "GBM"        "HGSC"       "HNSCC"      "LSCC"       "LUAD"       "PDAC"       "UCEC"       "GTEX-Brain"
### We use our naming standard mutate a new column cohort.
cptac_replacement_table2 <- c(
  "BRCA" = "CPTAC-BRCA",
  "CCRCC" = "CPTAC-KIRC",
  "COAD" =  "CPTAC-COAD",
  "GBM" = "CPTAC-GBM",
  "HGSC" = "CPTAC-OV",
  "HNSCC" = "CPTAC-HNSC",
  "LSCC" = "CPTAC-LUSC" ,
  "LUAD" = "CPTAC-LUAD",
  "PDAC" = "CPTAC-PAAD",
  "UCEC" = "CPTAC-UCEC",
  "GTEX-Brain" = "CPTAC-GTEX-Brain"
)     

matched_CPTAC_info <- CPTAC_info$Cancer %in% names(cptac_replacement_table2)
CPTAC_info_pretreatment <- CPTAC_info %>%
  mutate(cohort = case_when(
    matched_CPTAC_info ~ cptac_replacement_table2[Cancer],
    TRUE ~ as.character(NA)  
  )) |> 
  select(-Sample_ID)
### CPTAC_info_pretreatment  |> filter(is.na(cohort)) |> nrow()
### [1] 0

matched_CPTAC <- matched_CPTAC_pretreatment |>
                 left_join(CPTAC_info_pretreatment, by = c("Patient_ID" = "Patient_ID")) |>
                 mutate(Run = str_replace_all(Sample_ID, ", ", "-"))

### CPTAC Sample type
### matched_CPTAC_pretreatment$Sample_type |> unique()
### [1] "Primary Tumor"                                                             "Solid Tissue Normal"                                                      
### [3] "Primary Tumor, Primary Tumor"                                              "Primary Tumor, Primary Tumor, Primary Tumor"                              
### [5] "Solid Tissue Normal, Solid Tissue Normal, Solid Tissue Normal"             "Solid Tissue Normal, Solid Tissue Normal"                                 
### [7] "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor" "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor"   
### 
### Just two sample type so if grepling "Solid Tissue Normal" is normal sample, grepling "Primary Tumor" is tumor sample
matched_CPTAC_after_processing <- matched_CPTAC |>
  mutate(cohort = case_when(
    grepl("Solid Tissue Normal", Sample_type) ~ "CPTAC-NORMAL",
    TRUE ~ cohort  
  )) 

### make sure nrow is same between Before and After Processing. 
matched_CPTAC_after_processing |> nrow()
### [1] 2558
matched_CPTAC_pretreatment |> nrow() 
### [1] 2558

### make sure IOBR/TME data just contain sample which cptac_sheet gave. 
detected_id_CPTAC <- matched_CPTAC_after_processing$Run |> unique()

### save info
cohorts <- unique(matched_CPTAC_after_processing$cohort)

for (cohort1 in cohorts) {
  cohort_data <- subset(matched_CPTAC_after_processing, cohort == cohort1)
  name <- gsub("-", "_", cohort1)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
  fwrite(cohort_data , file = file_name, sep = "\t")  
}

## TME
CPTAC2_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC3_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC_TME <- rbind(CPTAC2_TME, CPTAC3_TME)

### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
### 
### CPTAC_TME can match
matched_can <- CPTAC_TME %>%
               filter(ID %in% matched_CPTAC_after_processing$Run) 

### 
### CPTAC_TME can't match
matched_not <- CPTAC_TME %>%
               filter(!(ID %in% matched_CPTAC_after_processing$Run))

### 
### situtation 1: Due to the barcode naming convention and etc, some portion character are added to the endning of the IOBR/TME sample name.
### str_count(matched_last$ID, "_")
### [1] 1 2 0
### 1.1 "CPT000817-0008" and "CPT000817". If the value of str_count(matched_last$ID, "_") for a row is 0, then the value of the prefixes column for that row should be the characters before the last -, and the value of the suffixes column should be the characters after the last -.
### 1.2 "f6a7f764-2495-4a1d-89a2-e12f51_D7_1" and "f6a7f764-2495-4a1d-89a2-e12f51". If the value of str_count(matched_last$ID, "_") for a row is 1, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.
### 1.3 "CPT008756_0001" and "CPT008756". f6a7f764-2495-4a1d-89a2-e12f51_D7_1, .If the value of str_count(matched_last$ID, "_") for a row is 2, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.

matched_not <- matched_not %>%
  rowwise() %>%  
  mutate(
    prefixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1],  
      TRUE ~ NA_character_
    ),
    suffixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],  
      TRUE ~ NA_character_
    ),
    name1 = prefixes
  ) %>%
  ungroup() 

### situtation 1 after processing can match
matched_in_situtation1 <-  matched_not %>%
                           filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
                           mutate(ID = name1)  %>%
                           select(-c(suffixes,prefixes,name1) )     
matched_can <- rbind(matched_can, matched_in_situtation1)

### not in situtation 1 
matched_not_situtation1 <- matched_not %>%
                           filter(!(name1 %in% matched_CPTAC_after_processing$Run))

### situtation 2: Due to IOBR package analyze code, an additional "X" is added to the beginning of the IOBR/TME sample name.
### "X1104806" and "1104806", "X9f905736-f662-41d6-b3ac-16758d" and "9f905736-f662-41d6-b3ac-16758d". 
matched_not_situtation1  <- matched_not_situtation1 %>%
                            mutate(name2 = str_remove(ID, "^X"))  

### situtation 2 after processing can match
matched_in_situtation2 <- matched_not_situtation1 %>%
                          filter((name2 %in% matched_CPTAC_after_processing$Run))    %>%
                          mutate(ID = name2 )  %>%
                          select(-c(name1,prefixes,suffixes,name2)) 
matched_can <- rbind(matched_can, matched_in_situtation2)

### situtation 2 after processing can't match
matched_not_situtation2 <- matched_not_situtation1 %>%
                           filter(!(name2 %in% matched_CPTAC_after_processing$Run))

### situtation 3: Mixed situation1 and situation2.
### X161317f3-1a39-438e-af45-cd2502_D7 and 161317f3-1a39-438e-af45-cd2502
matched_not_situtation2  <- matched_not_situtation2 %>%
                            mutate(name3 = str_remove(prefixes, "^X")) 

### situtation 3 after processing can match    
matched_in_situtation3 <- matched_not_situtation2  %>%
                          filter((name3 %in% matched_CPTAC_after_processing$Run))  %>%
                          mutate(ID = name3) %>%
                          select(-c(name1,name2,prefixes,suffixes, name3))           
matched_can <- rbind(matched_can, matched_in_situtation3)

### situtation 3 after processing can't match
matched_not_situtation3 <- matched_not_situtation2  %>%
                           filter(!(name3 %in% matched_CPTAC_after_processing$Run))
nrow(matched_not_situtation3 )
### [1] 81

### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.

### Check if CPTAC TME data ID are any samples that have been divided into multiple portions.
matched_can |> select(ID) |>  nrow()
### [1] 2558
matched_can |> select(ID) |>  unique() |> nrow()
### [1] 2558
### No this situation.

### Check if CPTAC TME data has same norw with CPTAC info data.
matched_can |> select(ID) |>   unique() |> nrow()
### [1] 2558
matched_CPTAC_after_processing |> select(Run) |>  unique() |> nrow()
### [1] 2558
### Sure.

cohort_data <- matched_CPTAC_after_processing |> 
               select(Run, cohort)   |>
               unique()   

matched_can <- matched_can  |>
               mutate(Run = ID)  |>
               left_join(cohort_data, by = "Run")  |>
               select(-Run)

cohorts_TME <- unique(matched_can$cohort)

for (cohort1 in cohorts_TME) {
  cohort_data <- subset(matched_can, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  

#IOBR
CPTAC2_Signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC3_Signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC_Signature <- rbind(CPTAC2_Signature, CPTAC3_Signature)

### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
### 
### CPTAC_TME can match
matched_can <- CPTAC_Signature %>%
               filter(ID %in% matched_CPTAC_after_processing$Run) 


### 
### CPTAC_TME can't match
matched_not <-  CPTAC_Signature %>%
                filter(!(ID %in% matched_CPTAC_after_processing$Run))

### 
### situtation 1:like TME 
matched_not <- matched_not %>%
  rowwise() %>%  
  mutate(
    prefixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1], 
      TRUE ~ NA_character_
    ),
    suffixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],  
      TRUE ~ NA_character_
    ),
    name1 = prefixes
  ) %>%
  ungroup()  

### situtation 1 after processing can match
matched_in_situtation1 <- matched_not %>%
                          filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
                          mutate(ID = name1)  %>%
                          select(-c(suffixes,prefixes,name1) )    

matched_can <- rbind(matched_can, matched_in_situtation1)

### situtation 1 after processing can't match
matched_not_situtation1 <- matched_not %>%
                           filter(!(name1 %in% matched_CPTAC_after_processing$Run))

### situtation 2:like TME 
matched_not_situtation1  <- matched_not_situtation1 %>%
                            mutate(name2 = str_remove(ID, "^X"))  

### situtation 2 after processing can match
matched_in_situtation2 <- matched_not_situtation1 %>%
                          filter((name2 %in% matched_CPTAC_after_processing$Run))    %>%
                          mutate(ID = name2 )  %>%
                          select(-c(name1,prefixes,suffixes,name2)) 

matched_can <- rbind(matched_can, matched_in_situtation2)

### situtation 2 after processing can't match
matched_not_situtation2 <- matched_not_situtation1 %>%
                           filter(!(name2 %in% matched_CPTAC_after_processing$Run))

### situtation 3:like TME 
matched_not_situtation2   <- matched_not_situtation2  %>%
                             mutate(name3 = str_remove(prefixes, "^X")) 

### situtation 3 after processing can match   
matched_in_situtation3 <- matched_not_situtation2  %>%
                          filter((name3 %in% matched_CPTAC_after_processing$Run))  %>%
                          mutate(ID = name3) %>%
                          select(-c(name1,name2,prefixes,suffixes, name3))     

matched_can <- rbind(matched_can, matched_in_situtation3)

### situtation 3 after processing can't match  
matched_not_situtation3 <- matched_not_situtation2  %>%
                           filter(!(name3 %in% matched_CPTAC_after_processing$Run))

nrow(matched_not_situtation3 )
### [1] 81
                          
### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.

### Check if CPTAC IOBR data ID are any samples that have been divided into multiple portions.
matched_can |> select(ID) |>  nrow()
### [1] 2558
matched_can |> select(ID) |>  unique() |> nrow()
### [1] 2558
### No this situation.

### Check if CPTAC IOBR data has same norw with CPTAC info data.
matched_can |> select(ID) |>   unique() |> nrow()
### [1] 2558
matched_CPTAC_after_processing |> select(Run) |>  unique() |> nrow()
### [1] 2558
### Sure.

cohort_data   <- matched_CPTAC_after_processing |> 
                 select(Run, cohort)   |>
                 unique()   

matched_can <- matched_can |>
               mutate(Run = ID)  |>
               left_join(cohort_data, by = "Run")  |>
               select(-Run)

cohorts_IOBR <- unique(matched_can$cohort)

for (cohort1 in cohorts_IOBR) {
  cohort_data <- subset(matched_can, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  
 
## Gene

metafusion1 <- metafusion |> 
  mutate(
    cohort = case_when(
      Sample_ID %in% matched_CPTAC_after_processing$Run  & grepl("CPTAC", cohort) ~ matched_CPTAC_after_processing$cohort[match(Sample_ID, matched_CPTAC_after_processing$Run)],
      TRUE ~ cohort
    )
  )  

### In the past, the metafusion file had an issue with the sample_type being incorrectly labeled.
### After further investigation, it was found that only the composite Sample_ID in CPTAC had this issue, where samples labeled as 'Normal' were mistakenly identified as 'Tumor' in the sample_type column.

normal_id <-   matched_CPTAC_after_processing |> 
               filter(cohort == "CPTAC-NORMAL") |> 
               select(Run) |>
               unique() |>
               pull(Run)

meta_normal <- metafusion1 |>
               filter(Sample_ID %in% normal_id)

unique(meta_normal$sample_type)
#[1] "Normal" "Tumor" 

meta_processing  <- metafusion1 |>
                    mutate(
                      sample_type = ifelse(
                        sample_type != "Normal" & Sample_ID %in% normal_id,
                        "Normal",
                        sample_type
                      )
                    )

cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
fwrite(meta_processing, file = cp_file_name1, sep = "\t")

TCGA

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
metafusion <- fread(file_meta  , sep = "\t")

# TCGA
## info
TCGA_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")
### TCGA_sheet pretreatment
TCGA_sheet <- fread("/home/data2/Projects/Fusion/unclean/TCGA_sample_sheet.2024-05-14.tsv", sep = "\t") |> 
              select(c(`Sample ID`, `Sample Type`,`Project ID`)) 

colnames(TCGA_sheet) <- c("Sample_ID", "Sample_type","Project_id")

TCGA_sheet_pretreatment <- TCGA_sheet  |>
                           mutate(Sample_ID = substr(Sample_ID, 1, 15)) |> 
                           unique()
### We use our naming standard mutate a new column cohort. 
TCGA_after_processing <- TCGA_sheet_pretreatment |>
  left_join(TCGA_info, by = c("Sample_ID" = "Sample_ID")) |>
  mutate(
    cohort = Project_id,
    Run = Sample_ID,
    Patient_ID = substr(Sample_ID, 1, 12),  
  )
### TCGA Sample type
### TCGA_after_processing$Sample_type |> unique()
###   [1] "Primary Tumor"                                   "Solid Tissue Normal"                             "Metastatic"                                     
###   [4] "Additional - New Primary"                        "Primary Blood Derived Cancer - Peripheral Blood" "Additional Metastatic"                          
###   [7] "Recurrent Tumor"   
### 
### "Solid Tissue Normal" is normal sample, others is tumor sample

TCGA_NORMAL <- TCGA_after_processing |> 
               filter(Sample_type == "Solid Tissue Normal") 

NORMAL_id <- TCGA_NORMAL$Run |> unique()

TCGA_after_processing <- TCGA_after_processing |> 
  mutate(
    Project_id = cohort,
    cohort = case_when(
      Run %in% NORMAL_id ~ "TCGA-NORMAL",
      TRUE ~ cohort)) 

### make sure nrow is same between Before and After Processing. 
TCGA_after_processing |> nrow()
### [1] 10545
TCGA_sheet_pretreatment |> nrow() 
### [1] 10545

### make sure IOBR/TME data just contain sample which tcga_sheet gave. 
detected_id_TCGA <- TCGA_after_processing$Run |> unique()

### save info
cohorts <- unique(TCGA_after_processing$cohort)

file_name1 <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/")

for (cohort1 in cohorts) {
  cohort_data <- subset(TCGA_after_processing, cohort == cohort1)
  name <- gsub("-", "_", cohort1)
  file_name <- paste0(file_name1, name ,"_info.tsv")
  fwrite(cohort_data , file = file_name, sep = "\t")  
}  
## Gene
metafusion1 <- metafusion |> 
  mutate(
    cohort = case_when(
      Sample_ID %in% TCGA_after_processing$Run & grepl("TCGA", cohort) ~ TCGA_after_processing$cohort[match(Sample_ID, TCGA_after_processing$Run)],
      TRUE ~ cohort
    )
  ) 
cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
fwrite(metafusion1, file = cp_file_name1, sep = "\t") 

## IOBR
folder_path <- "/home/data2/Projects/Fusion/fusiondb/IOBR"

file_pattern <- "^TCGA.*_IOBR_signature.tsv.gz$"
file_list <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)

combined_data <- data.table()

for (file in file_list) {
  
  file_data <- fread(file, sep = "\t")
  file_name <- basename(file)  
  project_id <- sub("^(TCGA-[^_]+).*", "\\1", file_name)  
  file_data <- file_data %>%
               mutate(Project_id = project_id)
  combined_data <- rbind(combined_data, file_data)
}

### Filter TCGA Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
### 
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
### 
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
### 
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
###           <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>


result_data <- combined_data |>
  mutate(
    prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
    suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
    second_chars = str_extract(suffixes, "[0-9]+"),
    second_chars = as.integer(second_chars))
result_IOBR <- result_data  |>
  group_by(prefixes) |>
  slice(which.max(second_chars)) |>
  ungroup() |>
  mutate(cohort = Project_id,
         ID = substr(prefixes, 1, 15),
         cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
                            TRUE ~ cohort)
  ) |>
  filter(ID %in% detected_id_TCGA)  |> 
  select(-c("suffixes", "second_chars","prefixes")) |>
  unique() 

cohorts_IOBR <- unique(result_IOBR$cohort)

for (cohort1 in cohorts_IOBR) {
  cohort_data <- subset(result_IOBR, cohort == cohort1) |> select(-c(cohort,Project_id) )
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  

## TME

folder_path <- "/home/data2/Projects/Fusion/fusiondb/IOBR"
file_pattern <- "^TCGA.*_TME.tsv.gz$"
file_list <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)

combined_data1 <- data.table()

for (file in file_list) {

  file_data <- fread(file, sep = "\t")
  
  file_name <- basename(file)  
  project_id <- sub("^(TCGA-[^_]+).*", "\\1", file_name)  
  
  file_data <- file_data %>%
    mutate(Project_id = project_id)
  
  combined_data1 <- rbind(combined_data1, file_data)
}

result_data1 <- combined_data1 |>
  mutate(
    prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
    suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
    second_chars = str_extract(suffixes, "[0-9]+"),
    second_chars = as.integer(second_chars))
result_TME <- result_data1  |>
  group_by(prefixes) |>
  slice(which.max(second_chars)) |>
    ungroup() |>
      mutate(cohort = Project_id,
             ID = substr(prefixes, 1, 15),
             cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
                                TRUE ~ cohort)
      ) |>
      filter(ID %in% detected_id_TCGA)  |> 
      select(-c("suffixes", "second_chars","prefixes")) |>
      unique() 

cohorts_TME <- unique(result_TME$cohort)

for (cohort1 in cohorts_TME) {
  cohort_data <- subset(result_TME, cohort == cohort1) |> select(-c(cohort,Project_id) )
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}

Fusion Event Scoring and Classification

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)

file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
metafusion <- fread(file_meta , sep = "\t")
# Stratification Strategy
## Normal data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If one fusion event's max_split_cnt >= 1 and max_span_cnt >= 1, +1 score.
### 3. If Arriba and STAR-Fusion detected identical fusion event, +1 score.
### 4. If past FusionAnnotator annotated one normal fusion event, +2 score(Because it is a strong evidence)
Normal_data <- metafusion |>
               filter(sample_type == "Normal") |>
               mutate(Normal_score = 0) 

### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Normal_data <- Normal_data %>%
  mutate(
    Normal_score = case_when(
      blocklist == TRUE ~ 0,  
      TRUE ~ Normal_score     
    ),
    score0 = case_when(
      blocklist == TRUE ~ TRUE,  
      TRUE ~ FALSE    
    )
  )

###  2. Reads counts
Normal_data <- Normal_data %>%
  mutate(
    Normal_score = case_when(
         max_split_cnt >= 1 & max_span_cnt >= 1  & score0 == FALSE ~ Normal_score + 1 ,
         TRUE ~ Normal_score 
                            )
        )

###  3. Tools detected
Normal_data <- Normal_data %>%
    mutate(
    Normal_score = case_when(
           tools == "arriba,starfusion" & score0 == FALSE ~ Normal_score + 1,
           TRUE ~ Normal_score
                            )
          )
            
### 4. Fusionanntator annotate 
normal_markers <- c("GTEx_recurrent_StarF2019", "DGD_PARALOGS", "HGNC_GENEFAM", "BodyMap", "Greger_Normal", "ConjoinG", "Babiceanu_Normal")

#### Past Red Herrings summary didn't contain TCGA, TARGET, CPTAC(except for GTEx-Brain) normal sample data.
#### reference:https://github.com/FusionAnnotator/CTAT_HumanFusionLib/wiki
#### BodyMap    40
#### Babiceanu_Normal   239
#### ConjoinG   698
#### DGD_PARALOGS   14463
#### Greger_Normal  113
#### GTEx_recurrent_StarF2019   4047
#### HGNC_GENEFAM   Unknown

Normal_data <- Normal_data %>%
  mutate(
    Normal_score = case_when(
      rowSums(sapply(normal_markers, grepl, x = cancer_db_hits)) >= 1 & score0 == FALSE ~ Normal_score + 2,
      TRUE ~ Normal_score   )
        )

### Nomral true 
Normal_data_true <- Normal_data |>
                          group_by(gene1, gene2) %>%
                          slice_max(Normal_score) %>%
                          ungroup() %>%
                          filter(Normal_score >= 2) 

## Tumor data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If fusion event was in our detcted normal fusion（score>=2), it will be given a score of 0 and will not be changed.
### 3. If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 1, and at least one of them is less than 5, +1 score.
###    If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 5, +2 score.
### 4. If Arriba detected exclusive fusion event, +0 score; 
###    If STAR-Fusion detected exclusive fusion event, +0 score; 
###    If Arriba and STAR-Fusion detected identical fusion event, +1 score; 
### 5. If past FusionAnnotator annotated one tumor fusion event, +2 score(Because it is a strong evidence)
Tumor_data <-  metafusion |>
               filter(sample_type == "Tumor" | sample_type == "Blood") |>
               mutate(Tumor_score = 0) 

### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Tumor_data <- Tumor_data %>%
  mutate(
    Tumor_score = case_when(
      blocklist == TRUE ~ 0,  
      TRUE ~ Tumor_score   
    ),
    score0 = case_when(
      blocklist == TRUE ~ TRUE,  
      TRUE ~ FALSE    
    )
  )

### 2. Filter not in normal data.
Tumor_data <- Tumor_data %>%
              mutate(gene_pair = paste0(gene1, "::", gene2))

Normal_data_true <- Normal_data_true %>%
                    mutate(gene_pair = paste0(gene1, "::", gene2))

Tumor_data <- Tumor_data %>%
  mutate(
    Tumor_score = case_when(
      gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ 0,
      TRUE ~ Tumor_score   
    ),
    score0 = case_when(
      gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ TRUE,
      TRUE ~ FALSE    
    )
  )

###  3. Reads counts

Tumor_data <- Tumor_data %>%
  mutate(
    Tumor_score = case_when(
      max_split_cnt >= 5 & max_span_cnt >= 5  & score0 == FALSE ~ Tumor_score + 2,
      (max_split_cnt >= 1 & max_span_cnt >= 1) & ((max_split_cnt < 5 & max_span_cnt >= 5)| (max_split_cnt >= 5 & max_span_cnt < 5)| (max_split_cnt < 5 & max_span_cnt < 5))~ Tumor_score + 1,
      TRUE ~ Tumor_score
    )
  )

###  4. Tools detected

Tumor_data <- Tumor_data %>%
    mutate(
      Tumor_score = case_when(
           tools == "arriba,starfusion" & score0 == FALSE ~ Tumor_score + 1,
           tools == "arriba" & score0 == FALSE ~ Tumor_score,
           tools == "starfusion" & score0 == FALSE ~ Tumor_score,
           TRUE ~ Tumor_score
                            )
          )

### 5. Fusionanntator annotate 
tumor_markers <- c("Mitelman", "chimerdb_omim", "chimerdb_pubmed", "Cosmic", "YOSHIHARA_TCGA", "ChimerKB", "ChimerPub", "ChimerSeq", "Klijn_CellLines", "Larsson_TCGA", "CCLE_StarF2019", "HaasMedCancer", "TCGA_StarF2019", "TumorFusionsNAR2018", "DEEPEST2019", "GUO2018CR_TCGA", "DepMap2023")

Tumor_data <- Tumor_data %>%
    mutate(
      Tumor_score = case_when(
        rowSums(sapply(tumor_markers, grepl, x = cancer_db_hits)) >= 1 ~ Tumor_score + 2,
        TRUE ~ Tumor_score)
          ) 

### New metafusion
Tumor_data1 <- Tumor_data  |> 
               select(-c(gene_pair, score0)) |>
               rename(gene_score = Tumor_score)
Normal_data1 <- Normal_data |> 
               select(-c( score0)) |>
               rename(gene_score = Normal_score)

metafusion_new <- rbind(Normal_data1,Tumor_data1)
file_name <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls.star.tsv"
write.table(metafusion_new, 
                      file = file_name, 
                      sep = "\t", 
                      row.names = FALSE, 
                      quote = FALSE)