CRISPR/Cas9 screen uncovers functional translation of cryptic lncRNA-encoded open reading frames in human cancer

Emerging evidence suggests that cryptic translation within long noncoding RNAs (lncRNAs) may produce novel proteins with important developmental/physiological functions. However, the role of this cryptic translation in complex diseases (e.g., cancer) remains elusive. Here, we applied an integrative strategy combining ribosome profiling and CRISPR/Cas9 screening with large-scale analysis of molecular/clinical data for breast cancer (BC) and identified estrogen receptor α–positive (ER+) BC dependency on the cryptic ORFs encoded by lncRNA genes that were upregulated in luminal tumors. We confirmed the in vivo tumor-promoting function of an unannotated protein, GATA3-interacting cryptic protein (GT3-INCP) encoded by LINC00992, the expression of which was associated with poor prognosis in luminal tumors. GT3-INCP was upregulated by estrogen/ER and regulated estrogen-dependent cell growth. Mechanistically, GT3-INCP interacted with GATA3, a master transcription factor key to mammary gland development/BC cell proliferation, and coregulated a gene expression program that involved many BC susceptibility/risk genes and impacted estrogen response/cell proliferation. GT3-INCP/GATA3 bound to common cis-regulatory elements and upregulated the expression of the tumor-promoting and estrogen-regulated BC susceptibility/risk genes MYB and PDZK1. Our study indicates that cryptic lncRNA-encoded proteins can be an important integrated component of the master transcriptional regulatory network driving aberrant transcription in cancer, and suggests that the “hidden” lncRNA-encoded proteome might be a new space for therapeutic target discovery.

... three consecutive T; (3) with high level of GC content (>60%); (4) guide efficiency score <0.2; (5) being mapped to the annotated CDSs, were filtered out from the library. The 636 sgRNAs targeting 106 core essential genes were included as positive controls, and the 1,064 sgRNAs that target AAVS1 sites in the human genome or do not target the human genome were included as negative controls. The sgRNAs flanked by linker sequences (Table S10) were synthesized as a pooled library using CustomArray 12K chips (CustomArray, Inc.). The array-synthesized sgRNA library was amplified for 8 cycles using specific primers (Table S10) and Q5 High-Fidelity DNA Polymerase (New England Biolabs #M0491S). The PCR product was purified and assembled into a BsmBI (Thermo Fisher #ER0452)-digested lentiGuide-Puro vector (Addgene #52963) by Gibson assembly (Gibson Assembly® Master Mix, New England Biolabs #E2611L). A total of 2 µl of 10-50 ng/µl ligation products was transformed into 25 µl of electrocompetent cells (Lucigen) using a MicroPulser Electroporator (Bio-Rad) with the one-shot EC1 program (~3-4 reactions for one library). After recovery in recovery medium for 1 hour with rotation at 37 °C, the transformed cells were plated with a spreader on pre-made 24.5 cm2 bioassay plates (ampicillin). All plates were grown inverted for 14 hours at 32 °C. Finally, the colonies were scraped off and the plasmids were extracted with the NucleoBond Xtra Midi EF kit (Takara #740422.50) for downstream virus production.
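The two sequence-based criteria that survive in the list above (a run of consecutive T and GC content >60%) can be sketched as a simple filter. This is only an illustration, not the authors' pipeline: the function names are hypothetical, and because the start of the criteria list is missing here, the length of the T run is left as a required argument rather than assumed.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in an sgRNA spacer sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_t_run(seq: str, t_run: int) -> bool:
    """True if the sequence contains at least `t_run` consecutive T bases.

    `t_run` is deliberately not given a default: the exact threshold is
    not recoverable from the text above.
    """
    return re.search("T{%d,}" % t_run, seq.upper()) is not None


def passes_sequence_filters(seq: str, t_run: int, max_gc: float = 0.60) -> bool:
    """Keep an sgRNA only if GC content is <= max_gc and it has no long T run."""
    return gc_fraction(seq) <= max_gc and not has_t_run(seq, t_run)
```

The efficiency-score and CDS-overlap criteria depend on external scoring and annotation data and are therefore not sketched.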

CRISPR-Cas9 screen and data analysis
The MCF7 cells transduced with lentiCas9-EGFP (Addgene #63592) were sorted on a FACSAria cell sorter (BD Biosciences) and the cells with high EGFP expression were collected. These MCF7 cells with high expression of SpCas9 were plated into ten 10-cm dishes and infected with lentiviruses containing the sgRNA library at an MOI of 0.2-0.3. Following ... factors were applied to all sgRNAs. The cryptic ORFs that had at least 2 significantly depleted sgRNAs (log2(Fold-Change) < −log2(1.5) and p < 0.05) and whose expression was up-regulated in Luminal A BRCA relative to normal breast tissues (log2(Fold-Change) ≥ log2(1.2), FDR < 0.01) were selected as the final candidates of Luminal A BRCA dependency.
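The candidate-selection rule stated above can be written out as a short sketch. It is illustrative only: the input layout (tuples of ORF ID, log2 fold-change and p-value per sgRNA, plus a dictionary of tumor-vs-normal expression statistics keyed by the same ORF ID) and the function names are assumptions, not the authors' code.

```python
import math
from collections import defaultdict

# Thresholds taken from the text.
SGRNA_LFC_CUTOFF = -math.log2(1.5)   # sgRNA depletion cutoff
SGRNA_P_CUTOFF = 0.05
MIN_DEPLETED_SGRNAS = 2
EXPR_LFC_CUTOFF = math.log2(1.2)     # tumor vs. normal up-regulation cutoff
EXPR_FDR_CUTOFF = 0.01


def depleted_orfs(sgrna_rows):
    """Return ORFs with at least two significantly depleted sgRNAs.

    sgrna_rows: iterable of (orf_id, log2_fold_change, p_value) tuples
    (hypothetical layout).
    """
    counts = defaultdict(int)
    for orf_id, lfc, p_value in sgrna_rows:
        if lfc < SGRNA_LFC_CUTOFF and p_value < SGRNA_P_CUTOFF:
            counts[orf_id] += 1
    return {orf for orf, n in counts.items() if n >= MIN_DEPLETED_SGRNAS}


def final_candidates(sgrna_rows, expression):
    """Intersect depleted ORFs with those up-regulated in tumors.

    expression: dict mapping orf_id -> (log2_fold_change, fdr) for the
    tumor-vs-normal comparison (hypothetical layout).
    """
    upregulated = {
        orf
        for orf, (lfc, fdr) in expression.items()
        if lfc >= EXPR_LFC_CUTOFF and fdr < EXPR_FDR_CUTOFF
    }
    return depleted_orfs(sgrna_rows) & upregulated
```

For example, an ORF with two sgRNAs at log2 fold-changes of −1.0 and −0.8 (both p < 0.05) passes the depletion step, since −log2(1.5) ≈ −0.585, but it is retained only if its expression statistics also pass the second filter.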

5' and 3' RACE
The 5' and 3' RACE experiments were conducted using the SMARTer® RACE 5'/3' Kit (Clontech #634859) as described previously 9. Total RNA from MCF7 cells was extracted using the RNeasy Mini kit (QIAGEN #74104) according to the manufacturer's instructions. First-strand cDNA was synthesized using 5′-CDS and 3′-CDS primer A and SMARTer II A oligonucleotide as described in the user's manual. Touchdown nested PCR was used to amplify the cDNA ends. All the primers are listed in Supplementary Table 10. The PCR product was purified from a 2% agarose gel with the NucleoSpin Gel and PCR Clean-Up Kit (supplied with the SMARTer® RACE 5'/3' Kit) and was then cloned into the pRACE vector using In-Fusion HD Master Mix (both vector and mix were provided as SMARTer RACE 5'/3' Kit components) for Sanger sequencing.

RNA-seq
Total RNA was isolated from cells using the RNeasy Mini kit (QIAGEN #74104) and was treated with DNase I (QIAGEN #79254). RNA-seq libraries were prepared from 3 µg of total RNA using the TruSeq Stranded mRNA Library Prep kit (Illumina #20020594), according to the manufacturer's instructions. The libraries were sequenced on an Illumina NextSeq 500 (single-end 76-bp) at the Advanced Technology Genomics Core of MDACC.

RNA-seq/ChIP-seq data analysis, integrative analyses of TCGA data and breast cancer susceptibility/risk gene curation
The RNA-seq and ChIP-seq reads were first trimmed by Trim Galore (v0.6.5) (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), a wrapper around two tools: cutadapt v2.8 (https://github.com/marcelm/cutadapt/) and FastQC v0.11.5 (https://github.com/chgibb/FastQC0.11.5/; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and were then mapped to the human genome (GRCh38) using HISAT2 10 (v2.1.0) and Bowtie 2 11 (v2.4.1), respectively. For RNA-seq, the gene-level raw read counts were calculated using the htseq-count function of HTSeq 12 (0.11.0), based on the aligned and sorted BAM files. The normalization of read counts and differential gene expression analysis were performed using DESeq2 8 (1.22.2). The filters of basemean>1, |log2(Fold-Change)|≥log2(1.5) and FDR<0.05 were used to define differentially expressed genes for most downstream analyses. The GATA3 ChIP-seq data generated from MCF7 and T47D cells were obtained from ENCODE (GSE32465) and GSE128460. The sorted BAM files were converted into bedGraph and bigWig formats using BEDTools 13 (v2.24.0) and UCSC bedGraphToBigWig 14 (v4). The ChIP-seq peaks were identified by MACS2 15 (v2.1.2) with the parameters "macs2 callpeak -t ChIP.bam -c INPUT.bam -g hs --outdir output -n NAME 2> NAME.callpeak.log" for GATA3 and GT3-INCP ChIP-seq data. The genome-wide distribution of GT3-INCP binding sites was calculated using the CEAS tool in Cistrome 16, a web-based analysis platform for transcriptional regulation studies. The conservation plot of the GT3-INCP binding sites and their flanking sequences was made with a window size of +/-500 bp in Cistrome. The motif enrichment analysis was performed using the Cistrome SeqPos tool. BETA (1.0.7) 17 was used to find the target genes of the ChIP-seq peaks. The gene ontology enrichment analysis and KEGG pathway enrichment analysis were performed by DAVID 18, with the genes co-regulated by LINC00992/GT3-INCP and GATA3 (basemean>1, |log2(Fold-Change)|≥log2(1.2) and FDR<0.05) being the input gene list. Gene Set Enrichment Analysis 19 (GSEA) was performed using the hallmark gene sets from the Molecular Signatures Database 20, with the GSEA (v4.2.2) Java desktop program. TCGA breast cancer (BRCA) RNA-seq read count data were downloaded from TCGA. Raw count normalization and differential gene expression analysis between tumors and the corresponding normal breast tissues were performed using DESeq2. The genes with "basemean>1, log2(Fold-Change)≥log2(1.2) and FDR<0.01" were considered as the significantly up-regulated genes in tumor vs. normal. The Kaplan-Meier survival curves were used to show the survival distributions and the log-rank test was used to assess the corresponding statistical significance. The survival analysis was performed using the "survival" and "survminer" packages in R. Breast cancer susceptibility/risk genes were compiled from a total of 32 published studies based on GWAS, cis-QTL and/or integrative analysis of functional genomic data (Table S7).
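To make the Kaplan-Meier step mentioned above concrete, the product-limit estimate can be sketched in a few lines. The analysis in this study was run with the "survival" and "survminer" R packages; the plain-Python function below is a stand-in for illustration only, with a hypothetical name and input layout, and it does not implement the log-rank test.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times:  follow-up time for each patient
    events: 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= leaving
    return curve


# Four patients; the second is censored at t=2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Censored patients do not lower the curve at their own time point, but they shrink the risk set used for later events, which is why the drop at t=3 is computed over two patients rather than three.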

Generation of custom anti-GT3-INCP rabbit polyclonal antibody
The custom anti-GT3-INCP rabbit polyclonal antibody was generated by ABclonal Technology. Briefly, 1.5-year-old New Zealand rabbits weighing 2.5 kg, kept under specific-pathogen-free conditions, were injected subcutaneously with 300 µg of purified antigen protein (full-length 120 aa GT3-INCP) supplemented with Complete Freund's Adjuvant (CFA) for the primary injection and with 150 µg of antigen protein supplemented with Incomplete Freund's Adjuvant (IFA) for three subsequent boosting injections at 2-week intervals. Terminal bleeds were collected after immunization, and rabbit polyclonal antibodies were purified from the terminal bleeds by antigen affinity chromatography.

Three-dimensional structure prediction
The three-dimensional protein structure of the full-length GATA3 was obtained from the AlphaFold Database 21,22 or predicted by the I-TASSER-MTD webserver (https://zhanggroup.org/I-TASSER-MTD) 23,24 with default parameters. Five structural models were predicted by the I-TASSER-MTD webserver, among which the model that showed the best alignment with the experimentally determined structures of the ZF1 and ZF2 domains (PDB ID: 4hc7) was presented. The structures of the GT3-INCP protein were predicted by both the I-TASSER webserver (https://zhanggroup.org/I-TASSER/) 25,26 with default parameters and AlphaFold2 21,27.

ChIP-qPCR and ChIP-seq
ChIP was performed following the protocol of Duncan Odom's group, with some adaptations 28. In brief, at about 80-90% confluence, approximately 2×10^7 MCF7 or T47D cells were first crosslinked with 1% formaldehyde (methanol-free, 16%, Thermo Scientific #28908) at room temperature for 10 minutes and then quenched with 0.125 M glycine (final concentration) for 5 minutes. For GT3-INCP ChIP experiments, the MCF7 cells stably expressing FLAG-tagged GT3-INCP were used. After washing three times with cold PBS, the cells were harvested using a silicon scraper. Cell pellets were resuspended in 5 mL lysis buffer 1 (50 mM Hepes-KOH, pH 7.5, 140 mM NaCl, 1 mM EDTA, 10% glycerol, 0.5% NP-40, 0.25% Triton X-100) and rocked at 4°C for 5 minutes, followed by centrifugation at 2000 g for 4 minutes at 4°C. The cell pellets were then incubated with 5 mL LB2 buffer (10 mM Tris-HCl, pH 8.0, 200 mM NaCl, ... down by centrifugation of the cells at 2000 g for 5 min and were resuspended in 1 mL LB3 buffer (10 mM Tris-HCl, pH 8.0, 100 mM NaCl, 1 mM EDTA, 0.5 mM EGTA, 0.1% Na-Deoxycholate, 0.5% N-Lauroylsarcosine). All lysis buffers contained protease inhibitors (Roche #04693112001). Chromatin was sonicated to around 200 bp DNA fragments using a Diagenode Bioruptor (three rounds of 5 cycles, 30 s on, 30 s off). Lysates were cleared by addition of Triton X-100 to a final concentration of 1% and centrifugation at 2000 g for 10 minutes at 4°C. A total of 50 µl of lysate was saved from each sample for input and stored at -80°C until use. To prepare antibody-bound beads, 30 µl of magnetic beads (Invitrogen, Dynabeads) were washed three times with blocking buffer (1× PBS, 0.5% BSA) and incubated overnight with 5 µg of anti-GATA3 or anti-FLAG antibody at 4°C. For each ChIP, 900 µl of sonicated lysate from 2×10^7 cells was incubated with the antibody-bound beads overnight at 4°C.
Beads were washed 6 times with RIPA wash buffer (50 mM Hepes-KOH, pH 7.6, 500 mM LiCl, 1 mM EDTA, 1% NP-40, 0.7% Na-Deoxycholate) and once with TBS (20 mM Tris-HCl, pH 7.6, 150 mM NaCl) for 5 minutes each time at room temperature with gentle rocking. All washing buffers contained the protease inhibitors. The beads were eluted twice with 50 µl elution buffer (50 mM Tris-HCl pH 8, 10 mM EDTA, 1% SDS) for 10 minutes at 65°C with rocking. Crosslinking was reversed by adding 6 µl of 5 M NaCl to the eluates and incubating at 65°C overnight. RNAs were degraded by incubation with 1 µl of 10 mg/ml RNase at 37°C for 30 minutes and proteins were digested by incubation with 2 µl of 20 mg/ml Proteinase K (Thermo Fisher) at 56°C for 2 hours. DNAs were then purified using the QIAquick PCR purification kit (QIAGEN #28106). The samples were analyzed by qPCR or further processed for sequencing. ChIP-seq libraries were prepared from 10 ng of ChIP DNA using the TruSeq ChIP Library Preparation Kit (Illumina #IP-202-1012), according to the manufacturer's instructions. The libraries were sequenced on an Illumina NextSeq 500 (single-end 76-bp) at the Advanced Technology Genomics Core of MDACC.
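As a quick arithmetic check of the volumes quoted in the ChIP protocol, the helper below applies the usual dilution relation C1·V1 = C2·V2. It is a sketch under stated assumptions: the function names are hypothetical, the two 50 µl eluates are assumed to be pooled to 100 µl before NaCl is added (the text does not say so explicitly), and the 10 mL culture volume in the formaldehyde example is purely illustrative.

```python
def final_concentration(stock_conc, stock_vol, sample_vol):
    """Concentration after adding `stock_vol` of stock to `sample_vol`
    of a sample that contains none of the solute (same volume units)."""
    return stock_conc * stock_vol / (stock_vol + sample_vol)


def stock_volume_for(stock_conc, final_conc, final_vol):
    """Volume of stock needed to reach `final_conc` in `final_vol`
    (C1 * V1 = C2 * V2, solved for V1)."""
    return final_conc * final_vol / stock_conc


# 6 µl of 5 M NaCl added to an assumed 100 µl of pooled eluate:
nacl_molar = final_concentration(5.0, 6.0, 100.0)
print(round(nacl_molar, 3))  # 0.283 (M)

# 16% formaldehyde stock diluted to 1% in a hypothetical 10 mL final volume:
print(stock_volume_for(16.0, 1.0, 10.0))  # 0.625 (mL)
```

Under the pooling assumption, the reverse-crosslinking step therefore takes place at roughly 0.28 M NaCl.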

Immunoprecipitation, subcellular fractionation and western blotting
For immunoprecipitation assays, the cells were lysed in Pierce IP lysis buffer (Thermo Fisher #87787) with protease inhibitor and 10 mM PMSF (Thermo Fisher #36978). For immunoprecipitation of exogenous FLAG-tagged proteins, anti-FLAG M2 agarose beads (Sigma-Aldrich #A2220) were incubated with the whole-cell lysates overnight with gentle rotation at 4 °C. For immunoprecipitation of endogenous proteins, the specific antibodies were first coupled to protein G magnetic beads (Invitrogen #10004D) and then incubated with the cell lysates. After incubation, the beads were washed 5 times with washing buffer (10 mM Tris, pH 7.4, 1 mM EDTA, 1 mM EGTA, pH 8.0, 150 mM NaCl, 1% Triton X-100) and resuspended in SDS-PAGE sample buffer (Bio-Rad #1610747). For mass spectrometry, the precipitated proteins on the beads were eluted by competition with 3×FLAG peptides (Sigma-Aldrich #F4799).
Eluted proteins and 5% of the whole-cell extracts were analyzed by immunoblot. To confirm the interaction between GT3-INCP and GATA3 on chromatin, immunoprecipitation from chromatin extracts was performed as described previously 29,30, following a procedure similar to the ChIP experiment. After 6 RIPA washes, the beads were resuspended in SDS-PAGE sample buffer (Bio-Rad #1610747) for western blot analysis. To segregate and enrich nuclear and cytoplasmic proteins, the subcellular protein fractionation kit for cultured cells (Thermo Scientific™, 78840) was used for ER+ cell lines, according to the manufacturer's instructions. Whole-cell lysates were generated using RIPA lysis and extraction buffer (Thermo Fisher #89900) supplemented with Protease Inhibitor Cocktail (Sigma #11697498001) according to the manufacturer's instructions. Protein concentration was measured using the Bradford assay (Bio-Rad #5000006). Proteins were separated on 4-15% or 4-20% Mini-PROTEAN TGX precast polyacrylamide gels (Bio-Rad) and then transferred to PVDF membranes (Millipore #GVWP04700) in a transfer buffer (Invitrogen #LC3675) at 4 °C. Membranes were first blocked and incubated with specific antibodies overnight at 4 °C, and then incubated with Immobilon western chemiluminescent HRP substrate (Millipore #WBKLS0500), followed by analysis using ChemiDoc Touch Imaging Systems (Bio-Rad).

Immunofluorescence staining
The MCF7 cells stably expressing FLAG-tagged GT3-INCP were seeded into 4-well culture/chamber slides (Lab-Tek, 154917) at 30-50% confluency. Cells were washed with cold PBS and fixed using 4% paraformaldehyde for 15 min, followed by permeabilization in 0.25% Triton X-100 solution for 10 min at room temperature. The fixed cells were blocked with 10% normal goat serum (Life Technologies, PCN5000) in PBS for 30 min at room temperature, and then incubated with anti-FLAG antibody (Sigma, F1804) at 1:500 in PBS overnight at 4°C. After washing, the cells were incubated with fluorochrome-conjugated secondary antibody (Invitrogen, A32723) at 1:1000 in PBS for 1 hour at room temperature in the dark. The slips were mounted onto the microscope slide with Vectashield Mounting Medium containing DAPI (Vector Laboratories, H-1500). The images were captured with a ZEISS LSM 880 confocal microscope.

The histograms showing the distribution of log2(Fold-Change) between day 21 and day 0 for sgRNAs targeting the cryptic ORFs (grey), the positive (orange) and negative control sgRNAs (blue) in the CRISPR/Cas9 screen. The growth of the T47D cells transduced with the negative control sgRNA (sgNC)/gene-specific sgRNAs targeting the (E) ORF-LINC00992 or (F) ORF-GATA3-AS1 was monitored with the CCK-8 assay. The OD450 absorbance for WST-8 formazan was measured each day for 4 days. The representative pictures of clonogenic growth and the bar graph quantifying the colonies formed by the MCF7 cells that were transduced with the sgNC/sgRNAs targeting the (G) ORF-LINC00992 or (H) ORF-GATA3-AS1, after the cells were cultured for two weeks. The representative pictures of clonogenic growth and the bar graph quantifying the colonies formed by the T47D cells that were transduced with the sgNC/sgRNAs targeting the (I) ORF-LINC00992 or (J) ORF-GATA3-AS1.
(K) The wild-type FLAG-tagged ORF-GATA3-AS1 or the mutant one (AGG mutation in start codon) was stably expressed in MCF7 and T47D cells and the protein expression was determined by western blot with an anti-FLAG antibody, where β-actin was used as a loading control. (L) qRT-PCR was performed to determine the siRNA-mediated GATA3-AS1 knockdown efficiency in MCF7 and T47D cells, where GAPDH served as an internal control. The rescue experiment results for the cell growth defect caused by GATA3-AS1 knockdown are shown. The (M) MCF7 and (N) T47D cells stably transduced with the ORF-GATA3-AS1 that has a wild-type (ATG)/mutant (AGG) start codon or the empty vector control (EV) were transfected with the negative control siRNA (siNC) or the siRNAs targeting GATA3-AS1 outside the CDS region (siGATA3-AS1) and were cultured for 4 days. The cell growth was monitored each day with the CCK-8 assay. The rescue experiment results for the clonogenic growth defect caused by GATA3-AS1 knockdown are shown. The representative pictures of clonogenic growth and the bar graph quantifying the colonies formed by the (O) MCF7 or (P) T47D cells that were transduced with the wild-type/mutant (AGG mutation in start codon) ORF-GATA3-AS1 or the EV control, and were transfected with the siNC and the siRNAs targeting GATA3-AS1. Data (E-J and L-P) are shown as mean+/-standard deviation (SD), n=3. One-way ANOVA with Dunnett's multiple comparison test (*P<0.05; **P<0.01; ns: not significant, P>0.05). Data (K) are representative of 3 independent experiments.

Figure S2. (A) Higher LINC00992 RNA expression was associated with worse overall survival of the patients with luminal tumors, based on TCGA data. The Kaplan-Meier survival curves are plotted for patient groups with high (top 50%) and low (bottom 50%) LINC00992 RNA expression in luminal tumors. The p-value was calculated using the log-rank test. (B) The 5' and 3' ends of the LINC00992 transcript (ENST00000504107.1) were identified by 5' and 3' RACE.
An extension of the 5' end was identified compared with the transcript annotation from GENCODE v22, whereas the 3' end was the same as the annotated one. (C, D) The MS2 spectra of the two GT3-INCP-derived tryptic peptides that were detected by MS in the proteins co-IPed with an anti-FLAG antibody from the lysates of the MCF7 cells ectopically expressing FLAG-tagged GT3-INCP.

The ... qRT-PCR analysis of LINC00992 RNA expression in the T47D cells that were transfected with the negative control siRNA (siNC) or LINC00992-targeting siRNAs (siLINC00992), after E2 (30 nM) or ETOH vehicle treatment (ETOH). (B) After E2/ETOH treatment, the numbers of the T47D cells treated with the transfection reagent (control) or transfected with the siNC, GATA3-targeting siRNAs (siGATA3) or siLINC00992 were counted every 24 h for 72 h. (C) After E2/ETOH treatment, the number of the T47D cells that were treated with the transfection reagent (control) or the T47D cells that were transduced with the EV or the indicated ORFs and were transfected with the siNC/siLINC00992 (siL) was counted every 24 h for 72 h. (D) qRT-PCR analysis of the MYB and PDZK1 RNA expression in the MCF7 cells that were transfected with the siNC, siGATA3 or siLINC00992, after E2/ETOH treatment. qRT-PCR analysis of the (E) MYB and (F) PDZK1 RNA expression in the T47D cells that were treated with the transfection reagent (control) or the T47D cells that were transduced with the EV or the indicated ORFs and were transfected with the siNC/LINC00992-targeting siRNA (siL), after E2/ETOH treatment. (A-F) Data are shown as mean+/-standard deviation (SD), n=3. One-way ANOVA with Dunnett's multiple comparison test (*P<0.05; **P<0.01; ns: not significant, P>0.05).