Methods

Authors

Shixiang Wang

Yankun Zhao

Published

March 16, 2024

Gene Fusions identification and differential expression analysis

We employed two Gene Fusions(GFs) detection tools, including arriba, STAR-Fusion, to identify, parse, and annotate GFs junctions within each sample, using the human genome version hg38 as the reference. In our pursuit of biologically robust GFs identification within each cohort, we undertook an approach metafusion that amalgamated results from two GFs detection tools. Firstly, we exclusively retained Fusions located on chromosomes 1 to 22, as well as the X and Y chromosomes. Secondly, GFs were mandated to exhibit genomic overlap with gene regions as defined in the reference genome annotation file, specifically “gencode.v34.annotation.gtf.”. Thirdly to ensure the precision and recall in GFs prediction with different tools, one can either rely on consensus predictions or all predctions and choosing junction-crossing reads and reads mates which flank (span) the breakpoint. Fourthly, we offer FusionAnnotator to identify and prioritize GFs that have been previously reported in cancer or normal samples. Lastly, the detected data can be divided into Cis-SAGes and other categories to further explore the debate on what should be considered true fusion RNAs. This meticulous methodological approach was instrumental in ensuring the acquisition of biologically sound and reliable GFs datasets, thereby fortifying the integrity of our analysis within each cohort. This analysis facilitated comparisons between patient groups, such as those who responded to checkpoint immunotherapy versus non-responders, as well as between samples collected before and during ICB treatment.

Response and Response2

We gathered and processed patient ‘Response’ data from various cohorts. However, due to inconsistencies among these cohorts, such as the usage of different terms like CR, PR, SD, PD, R, NR, DCB, and NDB, we harmonized the response variable based on the timing of treatment and treatment response. As a result, we introduced a new variable called ‘Response2’ with four distinct categories: Pre-R, Pre-NR, On-R, and On-NR. In cases where response data was not available, we assigned the category NE.

Cancer Type Abbreviation

Abbr. Description
BLCA Bladder urothelial carcinoma
SKCM Skin cutaneous melanoma
KIRC & CCRCC Kidney renal clear cell carcinoma
HNSC & HNSCC Head and neck squamous cell carcinoma
NSCLC Non-small cell lung cancer, including lung adenocarcinoma and lung squamous cell carcinoma
SCLC Small cell lung cancer
SARC Sarcoma
BRCA Breast cancer
GBM Glioblastoma Multiforme
STAD Stomach Adenocarcinoma
SGC Salivary gland cancer
LUSC Lung Squamous Cell Carcinoma
MESO Mesothelioma
PCPG Pheochromocytoma and Paraganglioma
LIHC Liver Hepatocellular Carcinoma
PRAD Prostate Adenocarcinoma
ACC Adrenocortical Carcinoma
TGCT Testicular Germ Cell Tumors
PAAD Pancreatic Adenocarcinoma
UCEC Uterine Corpus Endometrial Carcinoma
THCA Thyroid carcinoma
CESC Cervical squamous cell carcinoma and endocervical adenocarcinoma
ESCA Esophageal carcinoma
READ Rectum adenocarcinoma
OV Ovarian Serous Cystadenocarcinoma
CHOL Cholangiocarcinoma
UCS Uterine Carcinosarcoma
LUAD Lung Adenocarcinoma
LGG Brain Lower Grade Glioma
HGSC High-Grade Serous Carcinoma
LSCC & LUSC Lung Squamous Cell Carcinoma
PDAC Pancreatic Ductal Adenocarcinoma
THYM Thymoma
KICH Kidney Chromophobe
KIRP Kidney Renal Papillary Cell Carcinoma
COAD Colon Adenocarcinoma
LAML & AML Acute Myeloid Leukemia
NBL Neuroblastoma
GNB Ganglioneuroblastoma

ImmunoFusion cohort naming maping

Cancer Sample_type Project_id Immunofusion-cohort
BRCA Primary Tumor CPTAC-2 CPTAC-BRCA
COAD Primary Tumor CPTAC-2 CPTAC-COAD
HGSC Primary Tumor CPTAC-2 CPTAC-OV
LSCC Primary Tumor CPTAC-3 CPTAC-LUSC
UCEC Primary Tumor CPTAC-3 CPTAC-UCEC
LUAD Primary Tumor CPTAC-3 CPTAC-LUAD
GBM Primary Tumor CPTAC-3 CPTAC-GBM
HNSCC Primary Tumor CPTAC-3 CPTAC-HNSC
CCRCC Primary Tumor CPTAC-3 CPTAC-KIRC
PDAC Primary Tumor CPTAC-3 CPTAC-PAAD
PDAC Solid Tissue Normal CPTAC-3 CPTAC-NORMAL
LUAD Solid Tissue Normal CPTAC-3 CPTAC-NORMAL
UCEC Solid Tissue Normal CPTAC-3 CPTAC-NORMAL
CCRCC Solid Tissue Normal CPTAC-3 CPTAC-NORMAL
HNSCC Solid Tissue Normal CPTAC-3 CPTAC-NORMAL
LSCC Solid Tissue Normal CPTAC-3 CPTAC-NORMAL
GTEX-Brain Solid Tissue Normal CPTAC-3 CPTAC-NORMAL
Primary Blood Derived Cancer - Peripheral Blood/Recurrent Blood Derived Cancer - Peripheral Blood/Next Generation Cancer Model/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Bone Marrow TARGET-AML TARGET-LAML
Acute myeloid leukemia, NOS Primary Blood Derived Cancer - Bone Marrow/Recurrent Blood Derived Cancer - Bone Marrow/Primary Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Bone Marrow/Post-treatment/Recurrent Blood Derived Cancer - Peripheral Blood/ Blood Derived Cancer - Peripheral Blood TARGET-AML TARGET-LAML
Neuroblastoma, NOS Primary Tumor/Recurrent Tumor/Recurrent Blood Derived Cancer - Bone Marrow TARGET-NBL TARGET-NBL
Ganglioneuroblastoma Primary Tumor/Recurrent Tumor TARGET-NBL TARGET-GNB
Cell Lines TARGET-AML TARGET-CELL
Bone Marrow Normal TARGET-AML TARGET-NORMAL
Acute myeloid leukemia, NOS Blood Derived Normal/Bone Marrow Normal TARGET-AML TARGET-NORMAL
ACC Primary Tumor TCGA-ACC TCGA-ACC
BLCA Primary Tumor TCGA-BLCA TCGA-BLCA
BRCA Primary Tumor TCGA-BRCA TCGA-BRCA
BRCA Metastatic TCGA-BRCA TCGA-BRCA
CESC Primary Tumor TCGA-CESC TCGA-CESC
CESC Metastatic TCGA-CESC TCGA-CESC
CHOL Primary Tumor TCGA-CHOL TCGA-CHOL
COAD Primary Tumor TCGA-COAD TCGA-COAD
COAD Recurrent Tumor TCGA-COAD TCGA-COAD
COAD Metastatic TCGA-COAD TCGA-COAD
DLBC Primary Tumor TCGA-DLBC TCGA-DLBC
ESCA Primary Tumor TCGA-ESCA TCGA-ESCA
ESCA Metastatic TCGA-ESCA TCGA-ESCA
GBM Primary Tumor TCGA-GBM TCGA-GBM
GBM Recurrent Tumor TCGA-GBM TCGA-GBM
HNSC Primary Tumor TCGA-HNSC TCGA-HNSC
HNSC Metastatic TCGA-HNSC TCGA-HNSC
KICH Primary Tumor TCGA-KICH TCGA-KICH
KIRC Primary Tumor TCGA-KIRC TCGA-KIRC
KIRC Additional - New Primary TCGA-KIRC TCGA-KIRC
KIRP Primary Tumor TCGA-KIRP TCGA-KIRP
KIRP Additional - New Primary TCGA-KIRP TCGA-KIRP
LAML Primary Blood Derived Cancer - Peripheral Blood TCGA-LAML TCGA-LAML
LGG Primary Tumor TCGA-LGG TCGA-LGG
LGG Recurrent Tumor TCGA-LGG TCGA-LGG
LIHC Primary Tumor TCGA-LIHC TCGA-LIHC
LIHC Recurrent Tumor TCGA-LIHC TCGA-LIHC
LUAD Primary Tumor TCGA-LUAD TCGA-LUAD
LUAD Recurrent Tumor TCGA-LUAD TCGA-LUAD
LUSC Primary Tumor TCGA-LUSC TCGA-LUSC
MESO Primary Tumor TCGA-MESO TCGA-MESO
OV Primary Tumor TCGA-OV TCGA-OV
OV Recurrent Tumor TCGA-OV TCGA-OV
PAAD Primary Tumor TCGA-PAAD TCGA-PAAD
PAAD Metastatic TCGA-PAAD TCGA-PAAD
PCPG Primary Tumor TCGA-PCPG TCGA-PCPG
PCPG Additional - New Primary TCGA-PCPG TCGA-PCPG
PCPG Metastatic TCGA-PCPG TCGA-PCPG
PRAD Primary Tumor TCGA-PRAD TCGA-PRAD
PRAD Metastatic TCGA-PRAD TCGA-PRAD
READ Primary Tumor TCGA-READ TCGA-READ
READ Recurrent Tumor TCGA-READ TCGA-READ
SARC Primary Tumor TCGA-SARC TCGA-SARC
SARC Recurrent Tumor TCGA-SARC TCGA-SARC
SARC Metastatic TCGA-SARC TCGA-SARC
SKCM Metastatic TCGA-SKCM TCGA-SKCM
SKCM Primary Tumor TCGA-SKCM TCGA-SKCM
SKCM Additional Metastatic TCGA-SKCM TCGA-SKCM
STAD Primary Tumor TCGA-STAD TCGA-STAD
TGCT Primary Tumor TCGA-TGCT TCGA-TGCT
TGCT Additional - New Primary TCGA-TGCT TCGA-TGCT
THCA Primary Tumor TCGA-THCA TCGA-THCA
THCA Metastatic TCGA-THCA TCGA-THCA
THYM Primary Tumor TCGA-THYM TCGA-THYM
UCEC Primary Tumor TCGA-UCEC TCGA-UCEC
UCEC Recurrent Tumor TCGA-UCEC TCGA-UCEC
UCS Primary Tumor TCGA-UCS TCGA-UCS
UVM Primary Tumor TCGA-UVM TCGA-UVM
BLCA Solid Tissue Normal TCGA-BLCA TCGA-NORMAL
BRCA Solid Tissue Normal TCGA-BRCA TCGA-NORMAL
CESC Solid Tissue Normal TCGA-CESC TCGA-NORMAL
CHOL Solid Tissue Normal TCGA-CHOL TCGA-NORMAL
COAD Solid Tissue Normal TCGA-COAD TCGA-NORMAL
ESCA Solid Tissue Normal TCGA-ESCA TCGA-NORMAL
HNSC Solid Tissue Normal TCGA-HNSC TCGA-NORMAL
KICH Solid Tissue Normal TCGA-KICH TCGA-NORMAL
KIRC Solid Tissue Normal TCGA-KIRC TCGA-NORMAL
KIRP Solid Tissue Normal TCGA-KIRP TCGA-NORMAL
LIHC Solid Tissue Normal TCGA-LIHC TCGA-NORMAL
LUAD Solid Tissue Normal TCGA-LUAD TCGA-NORMAL
LUSC Solid Tissue Normal TCGA-LUSC TCGA-NORMAL
PAAD Solid Tissue Normal TCGA-PAAD TCGA-NORMAL
PCPG Solid Tissue Normal TCGA-PCPG TCGA-NORMAL
PRAD Solid Tissue Normal TCGA-PRAD TCGA-NORMAL
READ Solid Tissue Normal TCGA-READ TCGA-NORMAL
SARC Solid Tissue Normal TCGA-SARC TCGA-NORMAL
SKCM Solid Tissue Normal TCGA-SKCM TCGA-NORMAL
STAD Solid Tissue Normal TCGA-STAD TCGA-NORMAL
THCA Solid Tissue Normal TCGA-THCA TCGA-NORMAL
THYM Solid Tissue Normal TCGA-THYM TCGA-NORMAL
UCEC Solid Tissue Normal TCGA-UCEC TCGA-NORMAL

Clinical data preprocessing

TCGA

library(data.table)
library(dplyr)

# load("/home/data2/Projects/Fusion/unclean/TCGA_ALL_clinical.rdata")
#
# TCGA_ALL_clinical |>
#   rename(Patient_ID = bcr_patient_barcode,
#          Age = age_at_initial_pathologic_diagnosis,
#          Sex = gender,
#          Race = race,
#          )

# [Discrepancy] [Not Applicable]  [Not Available]
# [Discrepancy] [Not Available]       [Unknown]
# # [Discrepancy]                                        [Not Applicable]
# 29                                                          105
# [Not Available]                                         [Not Evaluated]
# 5140                                                       4
# [Unknown]


library(UCSCXenaShiny)

table(tcga_clinical_fine$Stage_ajcc)
tcga_clinical

tcga_clinical_fine
tcga_genome_instability
tcga_purity
tcga_subtypes
tcga_surv

nrow(tcga_clinical_fine)
nrow(unique(tcga_clinical_fine))

tcga_clinical = unique(tcga_clinical_fine) |>
  left_join(tcga_surv, by = c("Sample"="sample")) |>
  left_join(tcga_purity, by = c("Sample"="sample"))

tcga_clinical[551, ]$Sample

sum(startsWith(tcga_genome_instability$sample, "TCGA"))

tcga_clinical = tcga_clinical |>
  left_join(tcga_genome_instability[startsWith(tcga_genome_instability$sample, "TCGA") & !duplicated(tcga_genome_instability$sample), ],  by = c("Sample"="sample")) |>
  #left_join(tcga_genome_instability[duplicated(tcga_genome_instability$sample), ],  by = c("Sample"="sample")) |>
  left_join(tcga_subtypes,  by = c("Sample"="sampleID"))


sum(is.na(tcga_clinical$Sample))
sum(duplicated(tcga_clinical$Sample))

tcga_clinical[duplicated(tcga_clinical$Sample), ]$Sample

tcga_clinical.final = tcga_clinical[!duplicated(tcga_clinical$Sample), ] |>
  rename(Sample_ID = Sample, Sex = Gender, OS_Time = OS.time, OS_Status = OS, DSS_Time = DSS.time, DSS_Status = DSS, DFS_Time = DFI.time, DFS_Status = DFI, PFS_Time = PFI.time, PFS_Status = PFI) |>
  select(-cancer_type) |>
  mutate(Sex = ifelse(Sex == "FEMALE", "F", "M"),
         Patient_ID = substr(Sample_ID, 1, 12))

sum(is.na(tcga_clinical.final$Sample_ID))
sum(is.na(tcga_clinical.final$Cancer))
sum(duplicated(tcga_clinical.final$Sample_ID))

tcga_clinical.final = tcga_clinical.final |>
  select(Patient_ID, Sample_ID, everything()) |>
  as.data.table()

tcga_clinical.final

fwrite(tcga_clinical.final, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")

CPTAC

library(data.table)


# Clinical data from CPTAC paper ------------------------------------------


cli_list = fs::dir_ls("CPTAC_Clinical_meta_data_v1/")

cli = purrr::map(cli_list, fread) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli)
# Problem: Need to skip the second row
# This also includes samples from CPTAC-2/3, but the total number is fewer

cli2 = purrr::map(cli_list, function(x) {
  # Since the column names are also different, read them twice—once for the column names and once for the data
  data1 = fread(x)
  data2 = fread(x, skip = 2, header = FALSE)
  # Update the column names
  colnames(data2) = colnames(data1)
  data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
  data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
#colnames(cli2) = colnames(cli)
colnames(cli2)

cli2[Tumor == "Yes" & Normal == "Yes"] # This is to mark whether the patient has both tumor and normal samples


cli_list2 = fs::dir_ls("meta_data_BCM/")

cli3 = purrr::map(cli_list2, function(x) {
  # Since the column names are also different, read them twice—once for the column names and once for the data
  data1 = fread(x)
  data2 = fread(x, skip = 2, header = FALSE)
  # 更新列名
  colnames(data2) = colnames(data1)
  data2$Cancer = stringr::str_remove(basename(x), "_meta.txt")
  data2 |> as.data.table()
}) |> rbindlist(fill = TRUE, use.names = TRUE)
colnames(cli3)

all.equal(cli2, cli3)
all.equal(sort(colnames(cli2)), sort(colnames(cli3)))

cli4 = cli3[, colnames(cli2), with = FALSE]
all.equal(cli2, cli4)
identical(cli2, cli4)

dplyr::setdiff(cli2, cli4) |> View()
dplyr::setdiff(cli4, cli2) |> nrow()

# rbind(cli2[idx == "01OV007"],
#       cli4[idx == "01OV007"]) |> View()

# The files in the directories of the two data sources are basically identical

cli.final = cli2 |>
  dplyr::rename(
    Patient_ID = idx,
    OS_Status = OS_event,
    OS_Time = OS_days,
    PFS_Status = PFS_event,
    PFS_Time = PFS_days
  ) |>
  dplyr::mutate(
    Sample_ID = Patient_ID,
    Sex = ifelse(Sex == "Female", "F", "M")
  ) |>
  dplyr::select(
    Patient_ID, Sample_ID, Cancer, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status, dplyr::everything()
  )
  # dplyr::select(
  #   - dplyr::starts_with("CIBERSORT"), - dplyr::starts_with("xCell")
  # )


fwrite(cli.final, file = "CPTAC_cli_paper_combined.tsv", sep = "\t")

# Clinical data from GDC --------------------------------------------------


system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/CPTAC
mkdir CPTAC-3 CPTAC-2
tar zxvf clinical.project-cptac-2.2024-08-07.tar.gz -C CPTAC-2
tar zxvf clinical.project-cptac-3.2024-08-07.tar.gz -C CPTAC-3
       ")

cptac2 = fread("gdc_cli/CPTAC/CPTAC-2/clinical.tsv")
cptac3 = fread("gdc_cli/CPTAC/CPTAC-3/clinical.tsv")

which(colnames(cptac2) %in% "residual_disease")
which(colnames(cptac3) %in% "residual_disease")
all.equal(cptac2[[119]], cptac2[[192]])
all.equal(cptac3[[119]], cptac3[[192]])
cptac2[[192]] = NULL
cptac3[[192]] = NULL

setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# > setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)
# [1] "01OV049"   "26OV010"   "100004028" "05BR058"   "05BR055"   "17OV019"   "11BR069"   "02OV042"   "100004012" "05BR051"   "03BR012"   "02OV045"   "04OV041"
# [14] "05BR052"   "05BR031"   "1488"      "11OV009"   "02OV040"   "01OV002"   "100002921" "22OV001"   "13OV004"   "100003304" "02OV035"   "01OV046"   "01OV045"
# [27] "17OV034"
# > setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)
# [1] "C3N-02672"                    "C3L-01929"                    "C3N-02682"                    "C3N-02012"                    "C3N-02070"
# [6] "C3N-02696"                    "C3L-02354"                    "C3N-03026"                    "C3L-01739"                    "C3N-02088"
# [11] "C3L-04213"                    "C3N-03205"                    "C3L-04081"                    "C3N-03791"                    "C3L-03679"
# [16] "C3L-02643"                    "C3N-04686"                    "C3L-01558"                    "C3L-03268"                    "C3L-01633"
# [21] "C3L-02220"                    "C3N-02439"                    "C3N-02028"                    "C3N-03800"                    "C3L-04392"
# [26] "C3L-02556"                    "C3N-02721"                    "C3L-01672"                    "C3L-00938"                    "C3L-02201"
# [31] "C3N-02146"                    "C3N-02639"                    "C3N-03788"                    "C3L-03733"                    "C3L-04037"
# [36] "C3N-01091"                    "C3L-02747"                    "C3N-01879"                    "C3N-02996"                    "C3N-02296"
# [41] "C3L-02553"                    "C3L-03632"                    "C3N-03755"                    "GTEX-NPJ7-0011-R10A-SM-HAKXW" "C3L-02746"
length(setdiff(cptac3$case_submitter_id, cli.final$Sample_ID))  # 586


View(cptac2[case_submitter_id %in% setdiff(cptac2$case_submitter_id, cli.final$Sample_ID)])
View(cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)])

data.table::fwrite(
  cptac3[case_submitter_id %in% setdiff(cptac3$case_submitter_id, cli.final$Sample_ID)],
  file = "CPTAC3_non_included.tsv", sep = "\t"
)

cptac3


# Supplement the missing sample information in the paper based on the existing information in the GDC. ---------------------------------------------

# might be columns that have labeled cancer types.
# primary_diagnosis site_of_resection_or_biopsy tissue_or_organ_of_origin

cptac = rbind(cptac2, cptac3)[, .(case_submitter_id, primary_diagnosis, site_of_resection_or_biopsy, tissue_or_organ_of_origin)] |> unique()
cptac = dplyr::left_join(cptac, cli.final[, .(Sample_ID, Cancer)], by = c("case_submitter_id"="Sample_ID"))

setdiff(cli.final$Sample_ID, cptac$case_submitter_id)
# 29 cases are not in the GDC data.

table(cptac$Cancer, cptac$primary_diagnosis) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$site_of_resection_or_biopsy) |> pheatmap::pheatmap()
table(cptac$Cancer, cptac$tissue_or_organ_of_origin) |> pheatmap::pheatmap()

table(cptac$Cancer,
      paste(
        cptac$primary_diagnosis,
        cptac$site_of_resection_or_biopsy,
        cptac$tissue_or_organ_of_origin, sep = "_"
      )) |> pheatmap::pheatmap(show_colnames = FALSE)

# Solve them one by one.
cat(sub("_meta.txt", "", basename(names(cli_list))))

typelist = sub("_meta.txt", "", basename(names(cli_list)))
# BRCA CCRCC COAD GBM HGSC HNSCC LSCC LUAD PDAC UCEC
for (i in typelist) {
  message("check type ", i)
  cptac_part = cptac |> dplyr::filter(Cancer == i)
  message("  primary_diagnosis")
  table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
  message("  site_of_resection_or_biopsy")
  table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
  message("  tissue_or_organ_of_origin")
  table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}
# BRCA (122): tissue_or_organ_of_origin "Breast, NOS"
# CCRCC (103): primary_diagnosis "Renal cell carcinoma, NOS"
#        tissue_or_organ_of_origin "Kidney, NOS"
# COAD (106): tissue_or_organ_of_origin "Colon, NOS", "Rectum, NOS"
# GBM (99): primary_diagnosis "Glioblastoma"
# HGSC (87): primary_diagnosis "Serous adenocarcinoma, NOS" AND
#  tissue_or_organ_of_origin Fallopian tube Ovary Peritoneum, NOS
# HNSCC (108): Other situation: primary_diagnosis "Squamous cell carcinoma, NOS"
# Base of tongue, NOS Cheek mucosa Floor of mouth, NOS Gum, NOS
# HNSCC                   1            2                  18        3
#
# Head, face or neck, NOS Larynx, NOS Lip, NOS Oropharynx, NOS
# HNSCC                       2          47        4               4
#
# Overlapping lesion of lip, oral cavity and pharynx Tongue, NOS Tonsil, NOS
# HNSCC
# LSCC (108): primary_diagnosis "Squamous cell carcinoma, NOS" AND
#   tissue_or_organ_of_origin "lung" pattern
# LUAD (110): primary_diagnosis "Adenocarcinoma, NOS" AND
#   tissue_or_organ_of_origin "lung" pattern
# PDAC (105): tissue_or_organ_of_origin "pancreas" pattern
# UCEC (95): primary_diagnosis "Endometrioid adenocarcinoma, NOS"
sum(!is.na(cptac$Cancer))

# rbind(cptac2, cptac3)[,192] # There are duplicate columns, need to go back and handle this earlier (fixed).

# Actual testing and updating
cptac_updated = rbind(cptac2, cptac3) |>
  unique() |>
  dplyr::mutate(
    Cancer = dplyr::case_when(
      tissue_or_organ_of_origin %in% "Breast, NOS" ~ "BRCA",
      primary_diagnosis %in% "Renal cell carcinoma, NOS" ~ "CCRCC",
      tissue_or_organ_of_origin %in% c("Colon, NOS", "Rectum, NOS") ~ "COAD",
      grepl("Glioblastoma", primary_diagnosis, ignore.case = TRUE) | grepl("Oligodendroglioma", primary_diagnosis, ignore.case = TRUE) ~ "GBM",
      grepl("GTEX", case_submitter_id, ignore.case = TRUE) ~ "GTEX-Brain",
      case_submitter_id %in% c("C3L-06912", "C3L-07212") ~ "GBM",
      #primary_diagnosis %in% "Glioblastoma" ~ "GBM",
      primary_diagnosis %in% "Serous adenocarcinoma, NOS" & tissue_or_organ_of_origin %in% c("Fallopian tube", "Ovary", "Peritoneum, NOS") ~ "HGSC",
      primary_diagnosis %in% "Squamous cell carcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LSCC",
      primary_diagnosis %in% "Adenocarcinoma, NOS" & grepl("lung", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "LUAD",
      grepl("pancreas", tissue_or_organ_of_origin, ignore.case = TRUE) ~ "PDAC",
      primary_diagnosis %in% "Endometrioid adenocarcinoma, NOS" ~ "UCEC",
      TRUE ~ "HNSCC"
    )
  ) |> as.data.table()

for (i in "HNSCC") { # typelist
  message("check type ", i)
  cptac_part = cptac_updated |> dplyr::filter(Cancer == i)
  message("  primary_diagnosis")
  table(cptac_part$Cancer, cptac_part$primary_diagnosis) |> print()
  message("  site_of_resection_or_biopsy")
  table(cptac_part$Cancer, cptac_part$site_of_resection_or_biopsy) |> print()
  message("  tissue_or_organ_of_origin")
  table(cptac_part$Cancer, cptac_part$tissue_or_organ_of_origin) |> print()
}

table(cptac$Cancer)
table(cptac_updated$Cancer)

cptac_updated |>
  dplyr::filter(
    primary_diagnosis == "'--" | site_of_resection_or_biopsy == "'--" |
    tissue_or_organ_of_origin %in% c("'--", "Unknown")
  )
# 1-9 from GTEX-Brain
# C3L-06912 C3L-07212 Gliomas # https://portal.gdc.cancer.gov/cases/e6027623-a5b2-4d82-a9ca-f37c948dd3ed
cptac_updated$case_submitter_id[grepl("GTEX", cptac_updated$case_submitter_id)]

nrow(cptac_updated)

fwrite(cptac_updated, file = "CPTAC_combined_with_cancer_annotated.tsv", sep = "\t")

cptac_unmatched = cptac_updated[! case_submitter_id %in% cli.final$Sample_ID]
colnames(cptac_unmatched)
colnames(cli.final)[1:20]

cptac_unmatched2 = cptac_unmatched |>
  dplyr::select(
    case_submitter_id, age_at_diagnosis, gender,
    days_to_death, vital_status,
    dplyr::starts_with("ajcc"),
    dplyr::starts_with("days"),
    last_known_disease_status, progression_or_recurrence, residual_disease,
    tumor_grade, Cancer
  )

sum(cptac_unmatched2$days_to_last_follow_up != "'--")
sum(cptac_unmatched2$days_to_death != "'--")
sum(cptac_unmatched2$days_to_recurrence != "'--")
sum(cptac_unmatched2$days_to_last_known_disease_status != "'--")

sum(cptac_unmatched2$progression_or_recurrence != "'--")


sum(!is.na(cli.final$OS_Time))
sum(!is.na(cli.final$PFS_Time))
sum(cli.final$OS_Time >= cli.final$PFS_Time, na.rm = TRUE)
cli.final[cli.final$OS_Time < cli.final$PFS_Time] |> View()

sum(as.integer(cptac_unmatched2$days_to_last_follow_up) >= as.integer(cptac_unmatched2$days_to_last_known_disease_status), na.rm = TRUE)

cptac_unmatched3 = cptac_unmatched2 |>
  dplyr::rename(
    Sample_ID = case_submitter_id,
    Age = age_at_diagnosis,
    Sex = gender,
    OS_Time = days_to_last_follow_up,
    OS_Status = vital_status,
    PFS_Time = days_to_last_known_disease_status,
    PFS_Status = progression_or_recurrence,
    Histologic_Grade = tumor_grade,
    Path_Stage_pT = ajcc_pathologic_t,
    Path_Stage_pN = ajcc_pathologic_n,
    Stage = ajcc_pathologic_stage
  ) |>
  dplyr::select(
    Sample_ID, Age, Sex, OS_Time, OS_Status, PFS_Time, PFS_Status,
    Path_Stage_pN:Path_Stage_pT,
    Histologic_Grade, Cancer
  ) |>
  dplyr::mutate(
    Patient_ID = Sample_ID,
    Age = as.integer(round(as.integer(Age) / 365)),
    Sex = fcase(Sex %in% "female", "F",
                Sex %in% "male", "M", default = NA_character_),
    OS_Time = as.integer(OS_Time),
    OS_Status = fcase(OS_Status %in% "Alive", 0L,
                      OS_Status %in% "Dead", 1L, default = NA_integer_),
    PFS_Time = as.integer(PFS_Time),
    PFS_Status = fcase(PFS_Status %in% "no", 0L,
                       PFS_Status %in% "yes", 1L, default = NA_integer_),
    OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time),
    Path_Stage_pN = fcase(
      Path_Stage_pN %in% "N0", "pN0",
      Path_Stage_pN %in% c("N1", "N1a"), "pN1",
      Path_Stage_pN %in% c("N2", "N2a"), "pN2",
      default = NA_character_
    ),
    Path_Stage_pT = fcase(
      Path_Stage_pT %in% "T0", "pT0",
      Path_Stage_pT %in% c("T1", "T1a", "T1b", "T1c"), "pT1",
      Path_Stage_pT %in% c("T2", "T2a", "T2b"), "pT2",
      Path_Stage_pT %in% c("T3", "T3a", "T3b"), "pT3",
      Path_Stage_pT %in% "T4", "pT4",
      default = NA_character_
    ),
    Stage = fcase(
      Stage %in% c("Stage I", "Stage IA", "Stage IA1", "Stage IA2", "Stage IA3", "Stage IB"), "Stage I",
      Stage %in% c("Stage II", "Stage IIA", "Stage IIB"), "Stage II",
      Stage %in% c("Stage III", "Stage IIIA", "Stage IIIB", "Stage IIIC", "Stage IIIC1", "Stage IIIC2"), "Stage III",
      Stage %in% c("Stage IV", "Stage IVB"), "Stage IV",
      default = NA_character_
    ),
    Histologic_Grade = fcase(
      Histologic_Grade %in% "G1", "G1 Well differentiated",
      Histologic_Grade %in% "G2", "G2 Moderately differentiated",
      Histologic_Grade %in% "G3", "G3 Poorly differentiated",
      Histologic_Grade %in% "G4", "G4 Undifferentiated",
      default = NA_character_
    )
  )

cptac_unmatched3

# X: cannot be assessed

table(cptac_unmatched3$Path_Stage_pN)
table(cli.final$Path_Stage_pN)

table(cptac_unmatched3$Path_Stage_pT)
table(cli.final$Path_Stage_pT)

table(cptac_unmatched3$Stage)
table(cli.final$Stage)

table(cptac_unmatched3$Histologic_Grade)
table(cli.final$Histologic_Grade)



# Final output ------------------------------------------------------------

cli.final.2 = dplyr::bind_rows(
  cli.final, cptac_unmatched3
) |> as.data.table()

table(cli.final.2$Cancer)
sum(table(cli.final.2$Cancer))

# Include all Race information?
cli.final.2 = dplyr::left_join(
  cli.final.2, unique(rbind(cptac2, cptac3)[, .(case_submitter_id, race)]),
  by = c("Sample_ID"="case_submitter_id")
) |>
  dplyr::mutate(
    race = ifelse(race %in% c("Unknown", "not reported"), NA_character_,
                  race)
  ) |>
  as.data.table()

sum(is.na(cli.final.2$race))

fwrite(cli.final.2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")

TARGET

library(data.table)

cli1 = fread("TARGET_donor_allprojects_transfer_to_sample.gz")
cli2 = fread("TARGET_phenotype.gz")

target_cli = cli2 |>
  dplyr::select(-`_cohort`) |>
  dplyr::rename(Patient_ID = `_PATIENT`,
                Sample_ID = sample_id,
                disease_code = primary_disease_code,
                disease = `_primary_disease`,
                sample_type = `_sample_type`) |>
  dplyr::full_join(
    cli1 |> dplyr::select(
      -`_OS_UNIT`, -`_EVENT`, -`_TIME_TO_EVENT`
    ) |>
      dplyr::rename(
        Sample_ID = xena_sample,
        Sex = `_gender`,
        Age = `_age_at_diagnosis`,
        OS_Status = `_OS_IND`,
        OS_Time = `_OS`
      ), by = "Sample_ID"
  ) |>
  dplyr::mutate(
    Sex = ifelse(Sex == "Male", "M", "F"),
    OS_Status = ifelse(OS_Status == "Alive", 0L, 1L)
  ) |>
  dplyr::select(Patient_ID, Sample_ID, Age, Sex, OS_Time, OS_Status, dplyr::everything()) |>
  as.data.table()

cli1

target_cli

fwrite(target_cli, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t")


# Using cli from GDC ------------------------------------------------------

system("
cd /home/data2/Projects/Fusion/unclean/gdc_cli/TARGET
mkdir ALL_P1 ALL_P2 ALL_P3 AML CCSK NBL OS RT WT
tar zxvf clinical.project-target-all-p1.2024-08-07.tar.gz -C ALL_P1
tar zxvf clinical.project-target-all-p2.2024-08-07.tar.gz -C ALL_P2
tar zxvf clinical.project-target-all-p3.2024-08-07.tar.gz -C ALL_P3
tar zxvf clinical.project-target-aml.2024-08-07.tar.gz -C AML
tar zxvf clinical.project-target-ccsk.2024-08-07.tar.gz -C CCSK
tar zxvf clinical.project-target-nbl.2024-08-07.tar.gz -C NBL
tar zxvf clinical.project-target-os.2024-08-07.tar.gz -C OS
tar zxvf clinical.project-target-rt.2024-08-07.tar.gz -C RT
tar zxvf clinical.project-target-wt.2024-08-07.tar.gz -C WT
       ")

gdc_target = rbindlist(
  list(
    fread("gdc_cli/TARGET/ALL_P1/clinical.tsv"),
    fread("gdc_cli/TARGET/ALL_P2/clinical.tsv"),
    fread("gdc_cli/TARGET/ALL_P3/clinical.tsv"),
    fread("gdc_cli/TARGET/AML/clinical.tsv"),
    fread("gdc_cli/TARGET/CCSK/clinical.tsv"),
    fread("gdc_cli/TARGET/NBL/clinical.tsv"),
    fread("gdc_cli/TARGET/OS/clinical.tsv"),
    fread("gdc_cli/TARGET/RT/clinical.tsv"),
    fread("gdc_cli/TARGET/WT/clinical.tsv")
  ),
  use.names = TRUE, fill = TRUE
)

table(gdc_target$project_id)
any(duplicated(gdc_target))
gdc_target = gdc_target[!duplicated(gdc_target)]

all.equal(gdc_target[[119]], gdc_target[[192]])
gdc_target[[192]] = NULL

sum(gdc_target$days_to_last_follow_up != "'--")
sum(gdc_target$days_to_death != "'--")

gdc_target2 = gdc_target |>
  dplyr::rename(
    Sample_ID = case_submitter_id,
    Age = age_at_diagnosis,
    Sex = gender,
    OS_Time = days_to_death,
    OS_Status = vital_status
  ) |>
  dplyr::select(
    Sample_ID, Age, Sex, OS_Time, OS_Status,
    project_id, race, primary_diagnosis
  ) |>
  dplyr::mutate(
    Patient_ID = Sample_ID,
    Age = as.integer(round(as.integer(Age) / 365)),
    Sex = fcase(Sex %in% "female", "F",
                Sex %in% "male", "M", default = NA_character_),
    OS_Time = as.integer(OS_Time),
    OS_Status = fcase(OS_Status %in% "Alive", 0L,
                      OS_Status %in% "Dead", 1L, default = NA_integer_),
    OS_Time = ifelse(OS_Time < 0, NA_integer_, OS_Time)
  ) |>
  as.data.table()

#unique()

# exist  a large number of case level  replicate
gdc_target2[duplicated(gdc_target2), ]
duplicated(gdc_target[case_submitter_id == "TARGET-20-PANTIV"])
gdc_target[case_submitter_id == "TARGET-20-PANTIV"] |> View() # A few records that have been updated?

gdc_target2 = gdc_target2[!duplicated(gdc_target2)]

#table(gdc_target2$PFS_Time)
any(duplicated(gdc_target2))
any(duplicated(gdc_target2$Sample_ID))

gdc_target2[duplicated(gdc_target2$Sample_ID)]
# They may exist in different ALL projects.
gdc_dup = gdc_target2[Sample_ID %in% Sample_ID[duplicated(Sample_ID)]]
table(gdc_dup$project_id)
# TARGET-ALL-P1 TARGET-ALL-P2 TARGET-ALL-P3
# 10            57            47
#
# For now, we won't address this issue. If we conduct a combined analysis of ALL, we will consider removing duplicates.

colnames(gdc_target2)
gdc_target2 = gdc_target2[, c("Patient_ID", "Sample_ID", "Age", "Sex", "race", "primary_diagnosis", "OS_Time", "OS_Status", "project_id")]

fwrite(gdc_target2, file = "/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t")

sum(!is.na(gdc_target2$OS_Time))
sum(!is.na(cli1$`_OS`))

nrow(na.omit(unique(target_cli[, .(Patient_ID, OS_Time, OS_Status)])))
nrow(na.omit(unique(gdc_target2[, .(Patient_ID, OS_Time, OS_Status)])))

TCGA/TARGET/CPTAC Clinical/IOBR/TME/GENE data divide and anotate preprocessing

TARGET

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)

file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
metafusion <- fread(file_meta, sep = "\t")

# TARGET
## info
### Because some projects are not included in the target_sheet, we should abandon them.
TARGET_sheet <- fread("/home/data2/Projects/Fusion/unclean/TARGET_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")
project_choose <- substr(TARGET_sheet$`Sample ID`, 1, 9) |> unique()
project_choose
### "TARGET-20" "TARGET-00" "TARGET-21" "TARGET-30"

### TARGET_sheet pretreatment

### Composite Sample 
###
### Composite Sample use comma to separate like TARGET-30-PAPEAV, TARGET-30-PAPTFZ
TARGET_Composite <- TARGET_sheet |> 
  filter(grepl(", ", `Sample ID`)) |>  
  select(`Sample ID`) |>
  unique()
TARGET_Composite
### 1: TARGET-30-PAPEAV-01A, TARGET-30-PAPTFZ-01A
### 2: TARGET-30-PANUKV-01A, TARGET-30-PASUML-01A
### 3: TARGET-30-PANKFE-01A, TARGET-30-PAPUAR-01A
### 4: TARGET-30-PASYPX-01A, TARGET-30-PAIXIF-01A

### For the CPTAC databases, check whether there is any overlap between composite and normal Sample_ID in the info data. 
check_TARGET_overlap <- TARGET_sheet |>  
  select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
  unique() |> 
  separate_rows(`Sample ID`, sep = ",\\s*") 

check_TARGET_overlap <- check_TARGET_overlap |>
  mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))

overlap_samples <- check_TARGET_overlap |>
  group_by(`Sample ID`) |>
  summarise(Count = n(), .groups = 'drop') |>
  filter(Count > 1)

overlap_samples
###  A tibble: 4 × 2
###  `Sample ID`          Count
###  <chr>                <int>
###  1 TARGET-30-PANKFE-01A     2
###  2 TARGET-30-PAPTFZ-01A     2
###  3 TARGET-30-PASUML-01A     2
###  4 TARGET-30-PASYPX-01A     2

### So we choose abandon them.

TARGET_pretreatment <- TARGET_sheet |> 
  select(c(`Case ID`, `Sample ID`, `Sample Type`)) |>
  rename("Patient_ID" = `Case ID`,
         "Sample_ID" = `Sample ID`,
         "Sample_type"  =  `Sample Type`) |> 
  filter(!grepl(",", Sample_ID)) |> 
  filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |> 
  unique()

### TARGET_xena_info pretreatment
### This data use
TARGET_xena_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_xena_info.tsv", sep = "\t") 
TARGET_xena_pretreatment   <- TARGET_xena_info |>
                              mutate(Patient_ID = str_extract(Sample_ID, "^[^-]+-[^-]+-[^-]+")) |>
                              filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |>
                              mutate(Age = round(Age, digits = 0),
                                     across(c(Age, OS_Time, OS_Status), ~ as.integer(.))) |>
                              rename(Cancer = disease_code)

### TARGET_info pretreatment
### In the TARGET database, rows with the info column marked as '--' are annotated as NA.
filter_lose <- function(x) {
  is.na(x) | x == "" | x == "--" | x == "Unknown" | x == "'--"
 }
TARGET_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TARGET_info.tsv", sep = "\t") 
replacement_table2 <- c(
  "Acute lymphocytic leukemia" =  "TARGET-ALL",             
  "'--"   = "TARGET-Unkown",                                  
  "Precursor B-cell lymphoblastic leukemia" = "TARGET-B-ALL", 
  "Not Reported"  = "TARGET-Unkown",                         
  "Acute myeloid leukemia, NOS" = "TARGET-LAML",             
  "Clear cell sarcoma of kidney" = "TARGET-CCSK",           
  "Neuroblastoma, NOS"  = "TARGET-NBL",                     
  "Ganglioneuroblastoma" = "TARGET-GNB",                   
  "Osteosarcoma, NOS"   = "TARGET-OS",                    
  "Malignant rhabdoid tumor" = "TARGET-MRT",               
  "Wilms tumor" = "TARGET-WT"  
)
matched2 <- TARGET_info$primary_diagnosis %in% names(replacement_table2)
TARGET_info_pretreatment <- TARGET_info |>
                            mutate(cohort = case_when(
                                   matched2 ~ replacement_table2[primary_diagnosis],
                                   TRUE ~ as.character(NA)  
                                   )) |>
                            rename(Cancer = primary_diagnosis,
                                   Project_id = project_id) |> 
                            filter(grepl("^TARGET-(20|00|21|30)", Sample_ID)) |> 
                            mutate(across(c("Age", "Sex", "OS_Time", "OS_Status", "race"), ~ ifelse(filter_lose(.x), NA, .x))) |> 
                            mutate(Cancer = ifelse(Cancer == "'--", NA, Cancer))

### Check if the valid rows in TARGET_info match the valid rows in TARGET_xena_info.

### 1.Filter the valid rows.
filter_valid <- function(x) {
  !is.na(x) & x != "" & x != "--" & x != "Unknown" & x == "'--"
}

### Filter the valid rows in TARGET_info.
TARGET_info_valid <- TARGET_info_pretreatment %>%
                     filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
                     select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))

### Filter the valid rows in TARGET_xena_info.
TARGET_xena_info_valid <- TARGET_xena_pretreatment %>%
                          filter(filter_valid(Age) & filter_valid(OS_Time) & filter_valid(OS_Status) & filter_valid(Sex)) |>
                          select(c(Patient_ID, Age, OS_Time, OS_Status, Sex))

### 2. Merge the two data frames based on Patient_ID.
merged_data <- TARGET_info_valid %>%
               inner_join(TARGET_xena_info_valid, by = "Patient_ID", suffix = c("_info", "_xena"))

### 3. Check whether the valid values in the merged data frame are consistent.
is_consistent <- merged_data %>%
  rowwise() %>%
  mutate(
    Age_consistent = Age_info == Age_xena,
    Sex_consistent = Sex_info == Sex_xena,
    OS_Time_consistent = OS_Time_info == OS_Time_xena,
    OS_Status_consistent = OS_Status_info == OS_Status_xena
  ) %>%
  ungroup()

### 4. Age not onsistent
Age_not_consistent <- is_consistent |> 
                      filter(Age_consistent == "FALSE") 

### 5. Sex not onsistent
Sex_not_consistent <- is_consistent |> 
                      filter(Sex_consistent == "FALSE") 
### Sex_not_consistent

###  A tibble: 0 × 13

### 6. OS_Time not onsistent
OS_Time_not_consistent <- is_consistent |> 
                          filter(OS_Time_consistent == "FALSE") |>
                          mutate(
                          OS_Time_info_greater = OS_Time_info > OS_Time_xena,  
                          OS_Time_xena_greater = OS_Time_xena > OS_Time_info,  
                          OS_Time_equal = OS_Time_info == OS_Time_xena         
                          )
nrow(OS_Time_not_consistent |> filter(OS_Time_info_greater == "FALSE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_xena_greater == "TRUE"))
### [1]0
nrow(OS_Time_not_consistent |> filter(OS_Time_equal == "TRUE"))
### [1]0

### 7. OS_Status not onsistent
OS_Status_not_consistent <- is_consistent |> 
                            filter(OS_Status_consistent == "FALSE") 

unique(OS_Status_not_consistent$OS_Status_info)
### [1] 1
unique(OS_Status_not_consistent$OS_Status_xena)
### [1] 0

### All evidence shows that the GDC data has been updated more recently than the Xena data.
### Therefore, if there is a conflict between GDC data and Xena data, we should use the GDC data.

### Use the valid values from TARGET_xena_info to fill the missing values in TARGET_info.
### 1: Determine the matching Patient_ID.
matched_ids <- intersect(TARGET_xena_pretreatment$Patient_ID, TARGET_info_pretreatment$Patient_ID)
setDT(TARGET_xena_pretreatment)
setDT(TARGET_info_pretreatment)

### 2: Find the columns that require updates.
cols_to_update <- setdiff(names(TARGET_info_pretreatment), c("Sample_ID", "Patient_ID", "Cancer","project_id","cohort","race"))

### 3.Iterate through each matching Patient_ID and update TARGET_info_pretreatment using TARGET_xena_pretreatment.
### update Age
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("Age", ~ !filter_lose(.x))) 

TARGET_info_match <- TARGET_info_pretreatment |>
                            mutate(Age = case_when(
                                                   filter_lose(Age) & Patient_ID %in% TARGET_xena_can$Patient_ID ~  TARGET_xena_can$Age[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ Age))

### update Sex
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("Sex", ~ !filter_lose(.x))) 

TARGET_info_match1 <- TARGET_info_match |>
                            mutate(Sex = case_when(
                                                   filter_lose(Sex) & Patient_ID %in% TARGET_xena_can$Patient_ID ~  TARGET_xena_can$Sex[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ Sex))

### update OS_Time
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("OS_Time", ~ !filter_lose(.x))) 

TARGET_info_match2 <- TARGET_info_match1 |>
                      mutate(OS_Time = case_when(
                                                   filter_lose(OS_Time) & Patient_ID %in% TARGET_xena_can$Patient_ID  & OS_Status %in% TARGET_xena_can$OS_Status ~  TARGET_xena_can$OS_Time[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ OS_Time))

### update OS_Status
TARGET_xena_can  <-  TARGET_xena_pretreatment |> 
                     filter(if_any("OS_Status", ~ !filter_lose(.x))) 

TARGET_info_after_match <- TARGET_info_match2 |>
                            mutate(OS_Status = case_when(
                                                   filter_lose(OS_Status) & Patient_ID %in% TARGET_xena_can$Patient_ID  & OS_Time %in% TARGET_xena_can$OS_Time ~  TARGET_xena_can$OS_Status[match(Patient_ID, TARGET_xena_can$Patient_ID)],
                                                   TRUE ~ OS_Status)) 
  
### Use TARGET_sheet to perform a left join, ensuring that the Sample_ID is correct.
TARGET_info_after_match <- TARGET_info_after_match |> select(- Sample_ID)
TARGET_correct      <- TARGET_pretreatment |>
                       left_join(TARGET_info_after_match, by = c("Patient_ID" = "Patient_ID")) |>
                       mutate(Run = Sample_ID)

### Replace the cohort column values with Project_ID where the row is TARGET-Unkown.
### TARGET_correct$project_id |> unique()
### [1] "TARGET-AML" "TARGET-NBL"
TARGET_correct_after_pro <- TARGET_correct  %>%
  mutate(
    cohort = case_when(
             cohort == "TARGET-Unkown" ~ Project_id,
             TRUE ~ cohort),
    cohort = case_when(
             cohort == "TARGET-AML" ~ "TARGET-LAML",
             TRUE ~ cohort)
  ) 

### Divide a part of data into a group based on whether the Sample_type is cell

target_cell <- TARGET_correct_after_pro |> filter(`Sample_type` == "Cell Lines")
target_cell_sample <- target_cell$Sample_ID |> unique()
target_cell_patient <- target_cell$Patient_ID |> unique()
### target_cell_sample
### [1] "TARGET-20-MUTZ3-50A"         "TARGET-20-ML1-50A"           "TARGET-20-MV411D1-50A"       "TARGET-20-KasumiD1-50A"    "TARGET-20-HEK293-50A"      "TARGET-20-MOLM14CBFGLIS-50A"     
### [7] "TARGET-20-TF1-50A"           "TARGET-20-IGROV1-50A"        "TARGET-20-CMS-50A"           "TARGET-20-WSUAML-50A"      "TARGET-20-KasumiAZAD5-50A" "TARGET-20-KasumiAZAD11-50A"
### [13]"TARGET-20-REH-50A"           "TARGET-20-MV411AZAD11-50A"   "TARGET-20-MV411AZAD5-50A"    "TARGET-20-THP1-50A"        "TARGET-20-OCIAML2-50A"     "TARGET-20-HL60-50A"  
### target_cell_patient
### [1] "TARGET-20-MUTZ3"         "TARGET-20-ML1"           "TARGET-20-MV411D1"       "TARGET-20-KasumiD1"      "TARGET-20-HEK293"        "TARGET-20-MOLM14CBFGLIS"
### [7] "TARGET-20-TF1"           "TARGET-20-IGROV1"        "TARGET-20-CMS"           "TARGET-20-WSUAML"        "TARGET-20-KasumiAZAD5"   "TARGET-20-KasumiAZAD11" 
### [13]"TARGET-20-REH"           "TARGET-20-MV411AZAD11"   "TARGET-20-MV411AZAD5"    "TARGET-20-THP1"          "TARGET-20-OCIAML2"       "TARGET-20-HL60"  
### reference : https://www.cellosaurus.org/
### MUTZ3:Acute myeloid leukemia
### ML1:    Adult acute myeloid leukemia
### MV411D1:biphenotypic B-myelomonocytic leukemia
### KasumiD1:Acute myeloblastic leukemia
### TF1:Acute erythroid leukemia 
### IGROV1: Ovarian endometrioid adenocarcinoma
### WSUAML:Acute myeloid leukemia。
### KasumiAZAD5:can't find, so KasumiD1:Acute myeloblastic leukemia
### KasumiAZAD11:can't find, so KasumiD1:Acute myeloblastic leukemia
### REH:Childhood B acute lymphoblastic leukemia
### MV411AZAD11:can't find, so MV411D1:biphenotypic B-myelomonocytic leukemia
### MV411AZAD5:can't find, so MV411D1:biphenotypic B-myelomonocytic leukemia
### THP1:Childhood acute monocytic leukemia
### OCIAML2:Adult acute myeloid leukemia
### HL60:Adult acute myeloid leukemia
### HEK293:Vaccine production cell line.
### CML:Childhood acute megakaryoblastic leukemia
### MOLM14CBFGLIS:can't find, so MOLM14:Adult acute myeloid leukemia
cell_data <- metafusion[metafusion$Sample_ID %in% target_cell_sample]
### > nrow(cell_data)
###  [1] 1853
cell_info <- TARGET_info_pretreatment[TARGET_info_pretreatment$Patient_ID %in% target_cell_patient]
###  nrow(cell_info)
###  [1] 18
### no os.etc
cell_info2 <- TARGET_xena_pretreatment[TARGET_xena_pretreatment$Patient_ID %in% target_cell_patient]
### nrow(cell_info2)
### [1] 0

TARGET_info_after_cell <- TARGET_correct_after_pro |>
  mutate(cohort = case_when(
    `Sample_type` %in% "Cell Lines" ~ "TARGET-CELL",
    TRUE ~ cohort  
  ))

### The criteria for distinguishing normal samples based on Sample_type column in the TARGET databases are as follows:
TARGET_info_after_cell$Sample_type |> unique()
### [1] "Primary Blood Derived Cancer - Bone Marrow"              "Recurrent Blood Derived Cancer - Bone Marrow"           
### [3] "Blood Derived Normal"                                    "Primary Blood Derived Cancer - Peripheral Blood"        
### [5] "Bone Marrow Normal"                                      "Cell Lines"                                             
### [7] "Next Generation Cancer Model"                            "Blood Derived Cancer - Bone Marrow, Post-treatment"     
### [9] "Primary Tumor"                                           "Recurrent Tumor"                                        
### [11] "Recurrent Blood Derived Cancer - Peripheral Blood"       "Blood Derived Cancer - Peripheral Blood, Post-treatment"
### 
### If the sample type is "Blood Derived Normal" or "Bone Marrow Normal", then it is a normal sample. Otherwise, it is a tumor sample.
TARGET_info_after_normal <- TARGET_info_after_cell |>
  mutate(cohort = case_when(
    `Sample_type` %in% c("Blood Derived Normal", "Bone Marrow Normal") ~ "TARGET-NORMAL",
    TRUE ~ cohort  
  ))

### Adjust the column order.
new_order <- c("Patient_ID", "Sample_ID", "Run", "Sample_type", "Age", "Sex", "race", "Cancer", 
               "OS_Time", "OS_Status", "Project_id", "cohort")

TARGET_info_after_processing <- TARGET_info_after_normal |>
                                select(all_of(new_order)) 

### make sure nrow is same between Before and After Processing. 
TARGET_pretreatment |> nrow()
### [1]  3222
TARGET_info_after_processing |> nrow() 
### [1] 3222

### make sure IOBR/TME data just contain sample which target_sheet gave. 
detected_id_TARGET <- TARGET_info_after_processing$Run

### save info
cohorts <- unique(TARGET_info_after_processing$cohort)

for (cohort1 in cohorts) {
  cohort_data <- subset(TARGET_info_after_processing, cohort == cohort1)
  name <- gsub("-", "_", cohort1)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
  fwrite(cohort_data , file = file_name, sep = "\t")  
}

##Gene
metafusion1 <- metafusion |> 
  mutate(
    cohort = case_when(
      Sample_ID %in% TARGET_info_after_processing$Run & grepl("TARGET", cohort) ~ TARGET_info_after_processing$cohort[match(Sample_ID, TARGET_info_after_processing$Run)],
      TRUE ~ cohort
    )
  )  
  cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
  fwrite(metafusion1, file = cp_file_name1, sep = "\t") 

##TME
NBL_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
AML_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
TARGET_TME <- rbind(NBL_TME,AML_TME) |>
              mutate(dash_count = str_count(ID, "-"))
### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
TARGET_TME |> 
filter(dash_count == 9) |> 
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R

TARGET_TME_pretreatment <- TARGET_TME |>
                           filter(dash_count != 9) |>
                           select(-dash_count)

### Filter TARGET Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
### 
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
### 
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
### 
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
###           <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>
TARGET_TME_after_replicate <- TARGET_TME_pretreatment |>
                              mutate(
                              prefixes = str_remove(ID, "-[^-]+$"),  
                              suffixes = str_extract(ID, "[^-]+$"), 
                              second_chars = str_extract(suffixes, "[0-9]+"),
                              second_chars = as.integer(second_chars)) |>
                              group_by(prefixes) |>
                              slice(which.max(second_chars)) |>
                              ungroup() |>
                              mutate(ID = prefixes) |> 
                              filter(ID %in% detected_id_TARGET)  |> ### Ensure that all IDs in the TME data can be found in the TARGET info data. 
                              select(-c(prefixes,suffixes,second_chars))

### Using TARGET info data cohort column to annotate TARGET TME data 
cohort_data   <- TARGET_info_after_processing |> 
                 select(Run, cohort)   |>
                 unique()   

TARGET_TME_after_cohort <- TARGET_TME_after_replicate |>
                           mutate(Run = ID)  |>
                           left_join(cohort_data, by = "Run")  |>
                           select(-Run)

### save TME
cohorts_TME <- unique(TARGET_TME_after_cohort$cohort)

for (cohort1 in cohorts_TME) {
  cohort_data <- subset(TARGET_TME_after_cohort, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  

## IOBR
NBL_IOBR_signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-NBL_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-NBL")
AML_IOBR_signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/TARGET-AML_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "TARGET-AML")
TARGET_IOBR <- rbind(NBL_IOBR_signature,AML_IOBR_signature) |>
               mutate(dash_count = str_count(ID, "-"))

### Manually check and delete composite samples from the info dataset.(TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R)
TARGET_IOBR |> 
filter(dash_count == 9) |> 
select(ID) |>
unique()
### 1: TARGET-30-PASYPX-01A-01R-TARGET-30-PAIXIF-01A-01R
### 2: TARGET-30-PANKFE-01A-01R-TARGET-30-PAPUAR-01A-01R
### 3: TARGET-30-PANUKV-01A-01R-TARGET-30-PASUML-01A-01R
### 4: TARGET-30-PAPTFZ-01A-01R-TARGET-30-PAPEAV-01A-01R

TARGET_IOBR_pretreatment <- TARGET_IOBR |>
                            filter(dash_count != 9) |>
                            select(-dash_count)

TARGET_IOBR_after_replicate <- TARGET_IOBR_pretreatment |>
                            mutate(
                            prefixes = str_remove(ID, "-[^-]+$"),  
                            suffixes = str_extract(ID, "[^-]+$"), 
                            second_chars = str_extract(suffixes, "[0-9]+"),
                            second_chars = as.integer(second_chars)) |>
                            group_by(prefixes) |>
                            slice(which.max(second_chars)) |>
                            ungroup() |>
                            mutate(ID = prefixes) |> 
                            filter(ID %in% detected_id_TARGET)  |> ### Ensure that all IDs in the TME data can be found in the TARGET info data. 
                            select(-c(prefixes,suffixes,second_chars))

### Using TARGET info data cohort column to annotate TARGET TME data 
cohort_data   <- TARGET_info_after_processing |> 
                 select(Run, cohort)   |>
                 unique()   

TARGET_IOBR_after_cohort <- TARGET_IOBR_after_replicate |>
                            mutate(Run = ID)  |>
                            left_join(cohort_data, by = "Run")  |>
                            select(-Run)
### save IOBR
cohorts_IOBR <- unique(TARGET_IOBR_after_cohort$cohort)

for (cohort1 in cohorts_IOBR) {
  cohort_data <- subset(TARGET_IOBR_after_cohort, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  

CPTAC

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)

file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.tsv"
metafusion <- fread(file_meta, sep = "\t")

# CPTAC
## info
CPTAC_sheet <- fread("/home/data2/Projects/Fusion/unclean/CPTAC_gdc_sample_sheet.2023-11-15.tsv", sep = "\t")

### Composite Sample 
###
### Composite Sample use comma to separate like C3L-03400-05, C3L-03400-02
### 
### It leads one question is whether these composite samples using comma to separate from identical patient?
rows_with_comma <- grepl(", ", CPTAC_sheet$`Sample ID`)
rows_with_comma_data <- CPTAC_sheet[rows_with_comma, ]

check_same_chars <- function(x) {
  parts <- unlist(strsplit(x, ", "))
  return(parts[1] == parts[2])
}
same_chars <- sapply(rows_with_comma_data$`Case ID`, check_same_chars)
if (all(same_chars)) {
  cat("TRUE\n")
} else {
  cat("FALSE\n")
}
### They are from identical patient.

### For the TARGET databases, check whether there is any overlap between composite and normal Sample_ID in the info data. If there is an overlap, retain only the normal Sample_ID. 
check_cptac_overlap <- CPTAC_sheet |>  
                       select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
                       unique() |> 
                       separate_rows(`Sample ID`, sep = ",\\s*") 
 
check_cptac_overlap <- check_cptac_overlap |>
                        mutate(Sample_Type = ifelse(grepl(", ", `Sample ID`), "Composite", "Normal"))
                      
overlap_samples <- check_cptac_overlap |>
                        group_by(`Sample ID`) |>
                        summarise(Count = n(), .groups = 'drop') |>
                        filter(Count > 1)
### > overlap_samples 
### A tibble: 0 × 2
### ℹ 2 variables: Sample ID <chr>, Count <int> 
### no this situation

### Due to the less rigorous patient naming conventions(like 604) in the CPTAC2 project, check for any overlapping patient names between CPTAC2 and CPTAC3.
CPTAC_sheet_rigorous  <- CPTAC_sheet |> 
                         select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>  
                         unique() 
duplicates <- CPTAC_sheet_rigorous |>
                          group_by(`Case ID`, `Sample ID`) |>
                          summarise(n_projects = n(), projects = paste(`Project ID`, collapse = ", ")) |>
                          filter(n_projects > 1)
### nrow(duplicates)
### [1] 0
### no same patient names between CPTAC2 and CPTAC3

### CPTAC_sheet pretreatment
matched_CPTAC_pretreatment <- CPTAC_sheet |> 
  select(c(`Case ID`, `Sample ID`, `Sample Type`, `Project ID`)) |>
  rename("Patient_ID" = `Case ID`,
         "Sample_ID" = `Sample ID`,
         "Sample_type"  =  `Sample Type`,
         "Project_id" = `Project ID`) |> 
  separate_rows(Patient_ID, sep = ",\\s*") |>  #because they are from identical patient.
  unique() 

### CPTAC_info pretreatment
CPTAC_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/CPTAC_info.tsv", sep = "\t")
# > CPTAC_info$Cancer |> unique()
# [1] "BRCA"       "CCRCC"      "COAD"       "GBM"        "HGSC"       "HNSCC"      "LSCC"       "LUAD"       "PDAC"       "UCEC"       "GTEX-Brain"
### We use our naming standard mutate a new column cohort.
cptac_replacement_table2 <- c(
  "BRCA" = "CPTAC-BRCA",
  "CCRCC" = "CPTAC-KIRC",
  "COAD" =  "CPTAC-COAD",
  "GBM" = "CPTAC-GBM",
  "HGSC" = "CPTAC-OV",
  "HNSCC" = "CPTAC-HNSC",
  "LSCC" = "CPTAC-LUSC" ,
  "LUAD" = "CPTAC-LUAD",
  "PDAC" = "CPTAC-PAAD",
  "UCEC" = "CPTAC-UCEC",
  "GTEX-Brain" = "CPTAC-GTEX-Brain"
)     

matched_CPTAC_info <- CPTAC_info$Cancer %in% names(cptac_replacement_table2)
CPTAC_info_pretreatment <- CPTAC_info %>%
  mutate(cohort = case_when(
    matched_CPTAC_info ~ cptac_replacement_table2[Cancer],
    TRUE ~ as.character(NA)  
  )) |> 
  select(-Sample_ID)
### CPTAC_info_pretreatment  |> filter(is.na(cohort)) |> nrow()
### [1] 0

matched_CPTAC <- matched_CPTAC_pretreatment |>
                 left_join(CPTAC_info_pretreatment, by = c("Patient_ID" = "Patient_ID")) |>
                 mutate(Run = str_replace_all(Sample_ID, ", ", "-"))

### CPTAC Sample type
### matched_CPTAC_pretreatment$Sample_type |> unique()
### [1] "Primary Tumor"                                                             "Solid Tissue Normal"                                                      
### [3] "Primary Tumor, Primary Tumor"                                              "Primary Tumor, Primary Tumor, Primary Tumor"                              
### [5] "Solid Tissue Normal, Solid Tissue Normal, Solid Tissue Normal"             "Solid Tissue Normal, Solid Tissue Normal"                                 
### [7] "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor" "Primary Tumor, Primary Tumor, Primary Tumor, Primary Tumor"   
### 
### Just two sample type so if grepling "Solid Tissue Normal" is normal sample, grepling "Primary Tumor" is tumor sample
matched_CPTAC_after_processing <- matched_CPTAC |>
  mutate(cohort = case_when(
    grepl("Solid Tissue Normal", Sample_type) ~ "CPTAC-NORMAL",
    TRUE ~ cohort  
  )) 

### make sure nrow is same between Before and After Processing. 
matched_CPTAC_after_processing |> nrow()
### [1] 2558
matched_CPTAC_pretreatment |> nrow() 
### [1] 2558

### make sure IOBR/TME data just contain sample which cptac_sheet gave. 
detected_id_CPTAC <- matched_CPTAC_after_processing$Run |> unique()

### save info
cohorts <- unique(matched_CPTAC_after_processing$cohort)

for (cohort1 in cohorts) {
  cohort_data <- subset(matched_CPTAC_after_processing, cohort == cohort1)
  name <- gsub("-", "_", cohort1)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/", name ,"_info.tsv")
  fwrite(cohort_data , file = file_name, sep = "\t")  
}

## TME
CPTAC2_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC3_TME <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_TME.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC_TME <- rbind(CPTAC2_TME, CPTAC3_TME)

### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
### 
### CPTAC_TME can match
matched_can <- CPTAC_TME %>%
               filter(ID %in% matched_CPTAC_after_processing$Run) 

### 
### CPTAC_TME can't match
matched_not <- CPTAC_TME %>%
               filter(!(ID %in% matched_CPTAC_after_processing$Run))

### 
### situtation 1: Due to the barcode naming convention and etc, some portion character are added to the endning of the IOBR/TME sample name.
### str_count(matched_last$ID, "_")
### [1] 1 2 0
### 1.1 "CPT000817-0008" and "CPT000817". If the value of str_count(matched_last$ID, "_") for a row is 0, then the value of the prefixes column for that row should be the characters before the last -, and the value of the suffixes column should be the characters after the last -.
### 1.2 "f6a7f764-2495-4a1d-89a2-e12f51_D7_1" and "f6a7f764-2495-4a1d-89a2-e12f51". If the value of str_count(matched_last$ID, "_") for a row is 1, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.
### 1.3 "CPT008756_0001" and "CPT008756". f6a7f764-2495-4a1d-89a2-e12f51_D7_1, .If the value of str_count(matched_last$ID, "_") for a row is 2, then the value of the prefixes column for that row should be the characters before the first _, and the value of the suffixes column should be the characters after the first _.

matched_not <- matched_not %>%
  rowwise() %>%  
  mutate(
    prefixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1],  
      TRUE ~ NA_character_
    ),
    suffixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],  
      TRUE ~ NA_character_
    ),
    name1 = prefixes
  ) %>%
  ungroup() 

### situtation 1 after processing can match
matched_in_situtation1 <-  matched_not %>%
                           filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
                           mutate(ID = name1)  %>%
                           select(-c(suffixes,prefixes,name1) )     
matched_can <- rbind(matched_can, matched_in_situtation1)

### not in situtation 1 
matched_not_situtation1 <- matched_not %>%
                           filter(!(name1 %in% matched_CPTAC_after_processing$Run))

### situtation 2: Due to IOBR package analyze code, an additional "X" is added to the beginning of the IOBR/TME sample name.
### "X1104806" and "1104806", "X9f905736-f662-41d6-b3ac-16758d" and "9f905736-f662-41d6-b3ac-16758d". 
matched_not_situtation1  <- matched_not_situtation1 %>%
                            mutate(name2 = str_remove(ID, "^X"))  

### situtation 2 after processing can match
matched_in_situtation2 <- matched_not_situtation1 %>%
                          filter((name2 %in% matched_CPTAC_after_processing$Run))    %>%
                          mutate(ID = name2 )  %>%
                          select(-c(name1,prefixes,suffixes,name2)) 
matched_can <- rbind(matched_can, matched_in_situtation2)

### situtation 2 after processing can't match
matched_not_situtation2 <- matched_not_situtation1 %>%
                           filter(!(name2 %in% matched_CPTAC_after_processing$Run))

### situtation 3: Mixed situation1 and situation2.
### X161317f3-1a39-438e-af45-cd2502_D7 and 161317f3-1a39-438e-af45-cd2502
matched_not_situtation2  <- matched_not_situtation2 %>%
                            mutate(name3 = str_remove(prefixes, "^X")) 

### situtation 3 after processing can match    
matched_in_situtation3 <- matched_not_situtation2  %>%
                          filter((name3 %in% matched_CPTAC_after_processing$Run))  %>%
                          mutate(ID = name3) %>%
                          select(-c(name1,name2,prefixes,suffixes, name3))           
matched_can <- rbind(matched_can, matched_in_situtation3)

### situtation 3 after processing can't match
matched_not_situtation3 <- matched_not_situtation2  %>%
                           filter(!(name3 %in% matched_CPTAC_after_processing$Run))
nrow(matched_not_situtation3 )
### [1] 81

### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.

### Check if CPTAC TME data ID are any samples that have been divided into multiple portions.
matched_can |> select(ID) |>  nrow()
### [1] 2558
matched_can |> select(ID) |>  unique() |> nrow()
### [1] 2558
### No this situation.

### Check if CPTAC TME data has same norw with CPTAC info data.
matched_can |> select(ID) |>   unique() |> nrow()
### [1] 2558
matched_CPTAC_after_processing |> select(Run) |>  unique() |> nrow()
### [1] 2558
### Sure.

cohort_data <- matched_CPTAC_after_processing |> 
               select(Run, cohort)   |>
               unique()   

matched_can <- matched_can  |>
               mutate(Run = ID)  |>
               left_join(cohort_data, by = "Run")  |>
               select(-Run)

cohorts_TME <- unique(matched_can$cohort)

for (cohort1 in cohorts_TME) {
  cohort_data <- subset(matched_can, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  

#IOBR
CPTAC2_Signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-2_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC2")
CPTAC3_Signature <- fread("/home/data2/Projects/Fusion/fusiondb/IOBR/CPTAC-3_IOBR_signature.tsv.gz", sep = "\t") |> mutate(Project_id = "CPTAC3")
CPTAC_Signature <- rbind(CPTAC2_Signature, CPTAC3_Signature)

### Match matched_CPTAC_after_processing$Run to CPTAC_TME$ID
### 
### CPTAC_TME can match
matched_can <- CPTAC_Signature %>%
               filter(ID %in% matched_CPTAC_after_processing$Run) 


### 
### CPTAC_TME can't match
matched_not <-  CPTAC_Signature %>%
                filter(!(ID %in% matched_CPTAC_after_processing$Run))

### 
### situtation 1:like TME 
matched_not <- matched_not %>%
  rowwise() %>%  
  mutate(
    prefixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][1],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][1],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][1], 
      TRUE ~ NA_character_
    ),
    suffixes = case_when(
      str_count(ID, "_") == 0 ~ str_split(ID, "-")[[1]][length(str_split(ID, "-")[[1]])],  
      str_count(ID, "_") == 1 ~ str_split(ID, "_")[[1]][2],  
      str_count(ID, "_") == 2 ~ str_split(ID, "_")[[1]][2],  
      TRUE ~ NA_character_
    ),
    name1 = prefixes
  ) %>%
  ungroup()  

### situtation 1 after processing can match
matched_in_situtation1 <- matched_not %>%
                          filter((name1 %in% matched_CPTAC_after_processing$Run))%>%
                          mutate(ID = name1)  %>%
                          select(-c(suffixes,prefixes,name1) )    

matched_can <- rbind(matched_can, matched_in_situtation1)

### situtation 1 after processing can't match
matched_not_situtation1 <- matched_not %>%
                           filter(!(name1 %in% matched_CPTAC_after_processing$Run))

### situtation 2:like TME 
matched_not_situtation1  <- matched_not_situtation1 %>%
                            mutate(name2 = str_remove(ID, "^X"))  

### situtation 2 after processing can match
matched_in_situtation2 <- matched_not_situtation1 %>%
                          filter((name2 %in% matched_CPTAC_after_processing$Run))    %>%
                          mutate(ID = name2 )  %>%
                          select(-c(name1,prefixes,suffixes,name2)) 

matched_can <- rbind(matched_can, matched_in_situtation2)

### situtation 2 after processing can't match
matched_not_situtation2 <- matched_not_situtation1 %>%
                           filter(!(name2 %in% matched_CPTAC_after_processing$Run))

### situtation 3:like TME 
matched_not_situtation2   <- matched_not_situtation2  %>%
                             mutate(name3 = str_remove(prefixes, "^X")) 

### situtation 3 after processing can match   
matched_in_situtation3 <- matched_not_situtation2  %>%
                          filter((name3 %in% matched_CPTAC_after_processing$Run))  %>%
                          mutate(ID = name3) %>%
                          select(-c(name1,name2,prefixes,suffixes, name3))     

matched_can <- rbind(matched_can, matched_in_situtation3)

### situtation 3 after processing can't match  
matched_not_situtation3 <- matched_not_situtation2  %>%
                           filter(!(name3 %in% matched_CPTAC_after_processing$Run))

nrow(matched_not_situtation3 )
### [1] 81
                          
### Check the reason why it still can't match after Type 3 processing
### After manual review, these Sample_ID are not included in either the info dataset or the gdc_sheet.

### Check if CPTAC IOBR data ID are any samples that have been divided into multiple portions.
matched_can |> select(ID) |>  nrow()
### [1] 2558
matched_can |> select(ID) |>  unique() |> nrow()
### [1] 2558
### No this situation.

### Check if CPTAC IOBR data has same norw with CPTAC info data.
matched_can |> select(ID) |>   unique() |> nrow()
### [1] 2558
matched_CPTAC_after_processing |> select(Run) |>  unique() |> nrow()
### [1] 2558
### Sure.

cohort_data   <- matched_CPTAC_after_processing |> 
                 select(Run, cohort)   |>
                 unique()   

matched_can <- matched_can |>
               mutate(Run = ID)  |>
               left_join(cohort_data, by = "Run")  |>
               select(-Run)

cohorts_IOBR <- unique(matched_can$cohort)

for (cohort1 in cohorts_IOBR) {
  cohort_data <- subset(matched_can, cohort == cohort1) |> select(-cohort)
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  
 
## Gene

metafusion1 <- metafusion |> 
  mutate(
    cohort = case_when(
      Sample_ID %in% matched_CPTAC_after_processing$Run  & grepl("CPTAC", cohort) ~ matched_CPTAC_after_processing$cohort[match(Sample_ID, matched_CPTAC_after_processing$Run)],
      TRUE ~ cohort
    )
  )  

### In the past, the metafusion file had an issue with the sample_type being incorrectly labeled.
### After further investigation, it was found that only the composite Sample_ID in CPTAC had this issue, where samples labeled as 'Normal' were mistakenly identified as 'Tumor' in the sample_type column.

normal_id <-   matched_CPTAC_after_processing |> 
               filter(cohort == "CPTAC-NORMAL") |> 
               select(Run) |>
               unique() |>
               pull(Run)

meta_normal <- metafusion1 |>
               filter(Sample_ID %in% normal_id)

unique(meta_normal$sample_type)
#[1] "Normal" "Tumor" 

meta_processing  <- metafusion1 |>
                    mutate(
                      sample_type = ifelse(
                        sample_type != "Normal" & Sample_ID %in% normal_id,
                        "Normal",
                        sample_type
                      )
                    )

cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.tsv"
fwrite(meta_processing, file = cp_file_name1, sep = "\t") 

TCGA

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)
file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls_matched.cptac.target.tsv"
metafusion <- fread(file_meta  , sep = "\t")

# TCGA
## info
TCGA_info <- fread("/home/data2/Projects/Fusion/fusiondb/Clininfo/TCGA_info.tsv", sep = "\t")
### TCGA_sheet pretreatment
TCGA_sheet <- fread("/home/data2/Projects/Fusion/unclean/TCGA_sample_sheet.2024-05-14.tsv", sep = "\t") |> 
              select(c(`Sample ID`, `Sample Type`,`Project ID`)) 

colnames(TCGA_sheet) <- c("Sample_ID", "Sample_type","Project_id")

TCGA_sheet_pretreatment <- TCGA_sheet  |>
                           mutate(Sample_ID = substr(Sample_ID, 1, 15)) |> 
                           unique()
### We use our naming standard mutate a new column cohort. 
TCGA_after_processing <- TCGA_sheet_pretreatment |>
  left_join(TCGA_info, by = c("Sample_ID" = "Sample_ID")) |>
  mutate(
    cohort = Project_id,
    Run = Sample_ID,
    Patient_ID = substr(Sample_ID, 1, 12),  
  )
### TCGA Sample type
### TCGA_after_processing$Sample_type |> unique()
###   [1] "Primary Tumor"                                   "Solid Tissue Normal"                             "Metastatic"                                     
###   [4] "Additional - New Primary"                        "Primary Blood Derived Cancer - Peripheral Blood" "Additional Metastatic"                          
###   [7] "Recurrent Tumor"   
### 
### "Solid Tissue Normal" is normal sample, others is tumor sample

TCGA_NORMAL <- TCGA_after_processing |> 
               filter(Sample_type == "Solid Tissue Normal") 

NORMAL_id <- TCGA_NORMAL$Run |> unique()

TCGA_after_processing <- TCGA_after_processing |> 
  mutate(
    Project_id = cohort,
    cohort = case_when(
      Run %in% NORMAL_id ~ "TCGA-NORMAL",
      TRUE ~ cohort)) 

### make sure nrow is same between Before and After Processing. 
TCGA_after_processing |> nrow()
### [1] 10545
TCGA_sheet_pretreatment |> nrow() 
### [1] 10545

### make sure IOBR/TME data just contain sample which tcga_sheet gave. 
detected_id_TCGA <- TCGA_after_processing$Run |> unique()

### save info
cohorts <- unique(TCGA_after_processing$cohort)

file_name1 <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/Clininfo/")

for (cohort1 in cohorts) {
  cohort_data <- subset(TCGA_after_processing, cohort == cohort1)
  name <- gsub("-", "_", cohort1)
  file_name <- paste0(file_name1, name ,"_info.tsv")
  fwrite(cohort_data , file = file_name, sep = "\t")  
}  
## Gene
metafusion1 <- metafusion |> 
  mutate(
    cohort = case_when(
      Sample_ID %in% TCGA_after_processing$Run & grepl("TCGA", cohort) ~ TCGA_after_processing$cohort[match(Sample_ID, TCGA_after_processing$Run)],
      TRUE ~ cohort
    )
  ) 
cp_file_name1 <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
fwrite(metafusion1, file = cp_file_name1, sep = "\t") 

## IOBR
folder_path <- "/home/data2/Projects/Fusion/fusiondb/IOBR"

file_pattern <- "^TCGA.*_IOBR_signature.tsv.gz$"
file_list <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)

combined_data <- data.table()

for (file in file_list) {
  
  file_data <- fread(file, sep = "\t")
  file_name <- basename(file)  
  project_id <- sub("^(TCGA-[^_]+).*", "\\1", file_name)  
  file_data <- file_data %>%
               mutate(Project_id = project_id)
  combined_data <- rbind(combined_data, file_data)
}

### Filter TCGA Replicate Samples
### The filter rule following broad institute says:
###
### In many instances there is more than one aliquot for a given combination of individual, platform, and data type. However, only one aliquot may be ingested into Firehose. Therefore, a set of precedence rules are applied to select the most scientifically advantageous one among them. Two filters are applied to achieve this aim: an Analyte Replicate Filter and a Sort Replicate Filter.
### 
### Analyte Replicate Filter
### The following precedence rules are applied when the aliquots have differing analytes. For RNA aliquots, T analytes are dropped in preference to H and R analytes, since T is the inferior extraction protocol. If H and R are encountered, H is the chosen analyte. This is somewhat arbitrary and subject to change, since it is not clear at present whether H or R is the better protocol. If there are multiple aliquots associated with the chosen RNA analyte, the aliquot with the later plate number is chosen. For DNA aliquots, D analytes (native DNA) are preferred over G, W, or X (whole-genome amplified) analytes, unless the G, W, or X analyte sample has a higher plate number.
### 
### Sort Replicate Filter
### The following precedence rules are applied when the analyte filter still produces more than one sample. The sort filter chooses the aliquot with the highest lexicographical sort value, to ensure that the barcode with the highest portion and/or plate number is selected when all other barcode fields are identical.
### 
### Ref Link: <https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon>
###           <https://gist.github.com/ShixiangWang/33b2f9b49b77eaa8f773b428480f9101>


result_data <- combined_data |>
  mutate(
    prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
    suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
    second_chars = str_extract(suffixes, "[0-9]+"),
    second_chars = as.integer(second_chars))
result_IOBR <- result_data  |>
  group_by(prefixes) |>
  slice(which.max(second_chars)) |>
  ungroup() |>
  mutate(cohort = Project_id,
         ID = substr(prefixes, 1, 15),
         cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
                            TRUE ~ cohort)
  ) |>
  filter(ID %in% detected_id_TCGA)  |> 
  select(-c("suffixes", "second_chars","prefixes")) |>
  unique() 

cohorts_IOBR <- unique(result_IOBR$cohort)

for (cohort1 in cohorts_IOBR) {
  cohort_data <- subset(result_IOBR, cohort == cohort1) |> select(-c(cohort,Project_id) )
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_IOBR_signature.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
}  

## TME

folder_path <- "/home/data2/Projects/Fusion/fusiondb/IOBR"
file_pattern <- "^TCGA.*_TME.tsv.gz$"
file_list <- list.files(path = folder_path, pattern = file_pattern, full.names = TRUE)

combined_data1 <- data.table()

for (file in file_list) {

  file_data <- fread(file, sep = "\t")
  
  file_name <- basename(file)  
  project_id <- sub("^(TCGA-[^_]+).*", "\\1", file_name)  
  
  file_data <- file_data %>%
    mutate(Project_id = project_id)
  
  combined_data1 <- rbind(combined_data1, file_data)
}

result_data1 <- combined_data1 |>
  mutate(
    prefixes = str_split(ID, "-", simplify = TRUE)[, 1:4] %>% apply(1, paste, collapse = "-"),
    suffixes = str_split(ID, "-", simplify = TRUE)[, 5],
    second_chars = str_extract(suffixes, "[0-9]+"),
    second_chars = as.integer(second_chars))
result_TME <- result_data1  |>
  group_by(prefixes) |>
  slice(which.max(second_chars)) |>
    ungroup() |>
      mutate(cohort = Project_id,
             ID = substr(prefixes, 1, 15),
             cohort = case_when(ID %in% NORMAL_id ~ "TCGA-NORMAL",
                                TRUE ~ cohort)
      ) |>
      filter(ID %in% detected_id_TCGA)  |> 
      select(-c("suffixes", "second_chars","prefixes")) |>
      unique() 

cohorts_TME <- unique(result_TME$cohort)

for (cohort1 in cohorts_TME) {
  cohort_data <- subset(result_TME, cohort == cohort1) |> select(-c(cohort,Project_id) )
  file_name <- paste0("/home/data2/Projects/Fusion/fusiondb_v3/IOBR/", cohort1 ,"_TME.tsv.gz")
  write.table(cohort_data, 
              file = gzfile(file_name), 
              sep = "\t", 
              row.names = FALSE, 
              quote = FALSE) 
} 

Fusion Event Scoring and Classification

library(data.table)
library(dplyr)
library(stringr)
library(tidyr)

file_meta <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_gdc_divide.tsv"
metafusion <- fread(file_meta , sep = "\t")
# Stratification Strategy
## Normal data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If one fusion event's max_split_cnt >= 1 and max_span_cnt >= 1, +1 score.
### 3. If Arriba and STAR-Fusion detected identical fusion event, +1 score.
### 4. If past FusionAnnotator annotated one normal fusion event, +2 score(Because it is a strong evidence)
Normal_data <- metafusion |>
               filter(sample_type == "Normal") |>
               mutate(Normal_score = 0) 

### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Normal_data <- Normal_data %>%
  mutate(
    Normal_score = case_when(
      blocklist == TRUE ~ 0,  
      TRUE ~ Normal_score     
    ),
    score0 = case_when(
      blocklist == TRUE ~ TRUE,  
      TRUE ~ FALSE    
    )
  )

###  2. Reads counts
Normal_data <- Normal_data %>%
  mutate(
    Normal_score = case_when(
         max_split_cnt >= 1 & max_span_cnt >= 1  & score0 == FALSE ~ Normal_score + 1 ,
         TRUE ~ Normal_score 
                            )
        )

###  3. Tools detected
Normal_data <- Normal_data %>%
    mutate(
    Normal_score = case_when(
           tools == "arriba,starfusion" & score0 == FALSE ~ Normal_score + 1,
           TRUE ~ Normal_score
                            )
          )
            
### 4. Fusionanntator annotate 
normal_markers <- c("GTEx_recurrent_StarF2019", "DGD_PARALOGS", "HGNC_GENEFAM", "BodyMap", "Greger_Normal", "ConjoinG", "Babiceanu_Normal")

#### Past Red Herrings summary didn't contain TCGA, TARGET, CPTAC(except for GTEx-Brain) normal sample data.
#### reference:https://github.com/FusionAnnotator/CTAT_HumanFusionLib/wiki
#### BodyMap    40
#### Babiceanu_Normal   239
#### ConjoinG   698
#### DGD_PARALOGS   14463
#### Greger_Normal  113
#### GTEx_recurrent_StarF2019   4047
#### HGNC_GENEFAM   Unknown

Normal_data <- Normal_data %>%
  mutate(
    Normal_score = case_when(
      rowSums(sapply(normal_markers, grepl, x = cancer_db_hits)) >= 1 & score0 == FALSE ~ Normal_score + 2,
      TRUE ~ Normal_score   )
        )

### Nomral true 
Normal_data_true <- Normal_data |>
                          group_by(gene1, gene2) %>%
                          slice_max(Normal_score) %>%
                          ungroup() %>%
                          filter(Normal_score >= 2) 

## Tumor data
### 1. If one fusion event was found in the past Arriba blocklist, it will be given a score of 0 and will not be changed.
### 2. If fusion event was in our detcted normal fusion(score>=2), it will be given a score of 0 and will not be changed.
### 3. If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 1, and at least one of them is less than 5, +1 score.
###    If one fusion event's max_split_cnt and max_span_cnt are both greater than or equal to 5, +2 score.
### 4. If Arriba detected exclusive fusion event, +0 score; 
###    If STAR-Fusion detected exclusive fusion event, +0 score; 
###    If Arriba and STAR-Fusion detected identical fusion event, +1 score; 
### 5. If past FusionAnnotator annotated one tumor fusion event, +2 score(Because it is a strong evidence)
Tumor_data <-  metafusion |>
               filter(sample_type == "Tumor" | sample_type == "Blood") |>
               mutate(Tumor_score = 0) 

### 1. Arriba blocklist
#### reference:https://arriba.readthedocs.io/en/latest/input-files/#blacklist
Tumor_data <- Tumor_data %>%
  mutate(
    Tumor_score = case_when(
      blocklist == TRUE ~ 0,  
      TRUE ~ Tumor_score   
    ),
    score0 = case_when(
      blocklist == TRUE ~ TRUE,  
      TRUE ~ FALSE    
    )
  )

### 2. Filter not in normal data.
Tumor_data <- Tumor_data %>%
              mutate(gene_pair = paste0(gene1, "::", gene2))

Normal_data_true <- Normal_data_true %>%
                    mutate(gene_pair = paste0(gene1, "::", gene2))

Tumor_data <- Tumor_data %>%
  mutate(
    Tumor_score = case_when(
      gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ 0,
      TRUE ~ Tumor_score   
    ),
    score0 = case_when(
      gene_pair %in% Normal_data_true$gene_pair & score0 == FALSE ~ TRUE,
      TRUE ~ FALSE    
    )
  )

###  3. Reads counts

Tumor_data <- Tumor_data %>%
  mutate(
    Tumor_score = case_when(
      max_split_cnt >= 5 & max_span_cnt >= 5  & score0 == FALSE ~ Tumor_score + 2,
      (max_split_cnt >= 1 & max_span_cnt >= 1) & ((max_split_cnt < 5 & max_span_cnt >= 5)| (max_split_cnt >= 5 & max_span_cnt < 5)| (max_split_cnt < 5 & max_span_cnt < 5))~ Tumor_score + 1,
      TRUE ~ Tumor_score
    )
  )

###  4. Tools detected

Tumor_data <- Tumor_data %>%
    mutate(
      Tumor_score = case_when(
           tools == "arriba,starfusion" & score0 == FALSE ~ Tumor_score + 1,
           tools == "arriba" & score0 == FALSE ~ Tumor_score,
           tools == "starfusion" & score0 == FALSE ~ Tumor_score,
           TRUE ~ Tumor_score
                            )
          )

### 5. Fusionanntator annotate 
tumor_markers <- c("Mitelman", "chimerdb_omim", "chimerdb_pubmed", "Cosmic", "YOSHIHARA_TCGA", "ChimerKB", "ChimerPub", "ChimerSeq", "Klijn_CellLines", "Larsson_TCGA", "CCLE_StarF2019", "HaasMedCancer", "TCGA_StarF2019", "TumorFusionsNAR2018", "DEEPEST2019", "GUO2018CR_TCGA", "DepMap2023")

Tumor_data <- Tumor_data %>%
    mutate(
      Tumor_score = case_when(
        rowSums(sapply(tumor_markers, grepl, x = cancer_db_hits)) >= 1 ~ Tumor_score + 2,
        TRUE ~ Tumor_score)
          ) 

### New metafusion
Tumor_data1 <- Tumor_data  |> 
               select(-c(gene_pair, score0)) |>
               rename(gene_score = Tumor_score)
Normal_data1 <- Normal_data |> 
               select(-c( score0)) |>
               rename(gene_score = Normal_score)

metafusion_new <- rbind(Normal_data1,Tumor_data1)
file_name <- "/home/data2/Projects/Fusion/fusiondb_v3/metafusion_calls.star.tsv"
write.table(metafusion_new, 
                      file = file_name, 
                      sep = "\t", 
                      row.names = FALSE, 
                      quote = FALSE)