Pipeline
Before getting started
Make sure that your SLKB pipeline package is appropriately installed. In order to run GEMINI Score and MAGeCK Score, you need to follow two additional steps:
- GEMINI Score: Make sure that you have an R environment with GEMINI installed (version >= 2.1.1), and ggplot2 (will be installed alongside GEMINI).
- MAGeCK Score: Make sure that MAGeCK is installed and available on your system path.
If you cannot change your system path and/or need to load your R environment separately (e.g., when working on HPC systems), this is covered later in the MAGeCK Score and GEMINI Score sections (see cmd_params).
You can check from Python whether your R environment and MAGeCK are accessible on your computer.
import shutil
shutil.which('R') ## should yield accessed R environment location
shutil.which('mageck') ## should yield MAGeCK location
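The same check can be wrapped in a small helper that reports which tools are missing before a long run starts. This is a minimal sketch and not part of SLKB; the function name `find_tools` is hypothetical.

```python
import shutil

def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

tools = find_tools(["R", "mageck"])
missing = [name for name, path in tools.items() if path is None]
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All tools found:", tools)
```

If either tool is reported as missing, GEMINI Score or MAGeCK Score will likely need the cmd_params setup described in their sections.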
SLKB Pipeline Template
A template has been prepared that follows this guide's steps. Feel free to download it from its GitHub link after installing the SLKB package.
Starting with a local database (SLKB Schema)
First, we start with creating a local database to store the CRISPR synthetic lethality data at hand. SLKB supports mysql and sqlite3 at this time. A URL object is passed that will then create the database via stored schemas.
import sqlalchemy

# Option 1: MySQL
url_object = sqlalchemy.URL.create(
    "mysql+mysqlconnector",
    username="root",
    password="password",  # plain (unescaped) text
    host="localhost",
    port=3306,
    database="SLKB_mysql",
)

# Option 2: sqlite3 (this overrides the MySQL URL above; keep only the option you need)
url_object = 'sqlite:///SLKB_sqlite3'

SLKB_engine = sqlalchemy.create_engine(url_object, pool_size = 0)

# create the database at the url_object
SLKB.create_SLKB(engine = SLKB_engine, db_type = 'sqlite3') # or 'mysql'
Preparing Data for Insert
sgRNA sequences
The sequence reference file consists of 3 columns:
- sgRNA_guide_name: Name of your sgRNA guide
- sgRNA_guide_sequence: Sequence for your sgRNA guide
- sgRNA_guide_target: Target gene of the sgRNA guide
Example:
sgRNA_guide_name | sgRNA_guide_sequence | sgRNA_guide_target |
---|---|---|
0Safe-safe-ACOC-204550.4522 | GTGTATTTGGCTTCCAAAA | control |
0Safe-safe-ACOC-204550.4525 | GCATGGCCTCCACTTGCAA | control |
AKT3-1 | GTAAGGTAAATCCACATCTTG | AKT3 |
Dual sgRNA counts
The counts file contains 8 columns. Make sure that the guide and gene annotations in your counts file match the sequence reference above.
- guide_1: Name of your sgRNA guide at first location
- gene_1: Target of your sgRNA guide at first location (CONTROL if targeting control)
- guide_2: Name of your sgRNA guide at second location
- gene_2: Target of your sgRNA guide at second location (CONTROL if targeting control)
- study_origin: Study identifier of the added data (e.g., PubMed ID, MYCDKO)
- cell_line_origin: Cell line identifier of the added data (e.g. 22Rv1)
- study_conditions: Replicate names combined, separated by ';'
- count_replicates: sgRNA counts from each replicate combined, separated by ';'
Example:
guide_1 | guide_2 | gene_1 | gene_2 | count_replicates | cell_line_origin | study_conditions | study_origin |
---|---|---|---|---|---|---|---|
0Safe-safe-ACOC-204550.4522 | 0Safe-safe-ACOC-204550.4522 | 0Safe-safe-ACOC | 0Safe-safe-ACOC | 22;17;36;43 | 22Rv1 | T0_1;T0_2;T12_1;T12_2 | 36060092 |
0Safe-safe-ACOC-204550.4522 | 0Safe-safe-ACOC-204550.4525 | 0Safe-safe-ACOC | 0Safe-safe-ACOC | 57;73;107;98 | 22Rv1 | T0_1;T0_2;T12_1;T12_2 | 36060092 |
0Safe-safe-ACOC-204550.4522 | AKT3-1 | 0Safe-safe-ACOC | AKT3 | 47;45;68;85 | 22Rv1 | T0_1;T0_2;T12_1;T12_2 | 36060092 |
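Before inserting, it can help to confirm that the two ';'-separated columns line up and that every guide in the counts file also appears in the sequence reference. The sketch below is an illustration only and not SLKB's own validation; the helper names `split_replicates` and `unknown_guides` are hypothetical, and the values are taken from the example tables above.

```python
def split_replicates(study_conditions, count_replicates, sep=";"):
    """Pair each replicate name with its integer count."""
    names = study_conditions.split(sep)
    counts = [int(value) for value in count_replicates.split(sep)]
    if len(names) != len(counts):
        raise ValueError(f"{len(names)} replicate names but {len(counts)} counts")
    return dict(zip(names, counts))

def unknown_guides(count_rows, known_guides):
    """Return guide names used in the counts rows that are absent from the sequence reference."""
    used = {row[key] for row in count_rows for key in ("guide_1", "guide_2")}
    return sorted(used - set(known_guides))

# Guide names from the sgRNA sequence example.
known = {"0Safe-safe-ACOC-204550.4522", "0Safe-safe-ACOC-204550.4525", "AKT3-1"}

# One row from the counts example.
row = {
    "guide_1": "0Safe-safe-ACOC-204550.4522",
    "guide_2": "AKT3-1",
    "study_conditions": "T0_1;T0_2;T12_1;T12_2",
    "count_replicates": "47;45;68;85",
}

per_replicate = split_replicates(row["study_conditions"], row["count_replicates"])
print(per_replicate)
print(unknown_guides([row], known))
```

A mismatch in the number of names and counts raises an error early, which is usually easier to diagnose than a failure during insertion.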
Calculated SL Scores
If you have calculated the gene level SL scores for your data already, you can add them here with the following columns. Otherwise, leave empty.
- gene_1
- gene_2
- study_origin
- cell_line_origin
- SL_score
- SL_score_cutoff (If NaN, no cutoff applied)
- statistical_score
- statistical_score_cutoff (If NaN, no cutoff applied)
Example:
gene_1 | gene_2 | study_origin | cell_line_origin | SL_score | SL_score_cutoff | statistical_score | statistical_score_cutoff |
---|---|---|---|---|---|---|---|
AKT3 | AR | 36060092 | 22Rv1 | -0.977141 | -0.361 | NaN | NaN |
Accessing Database
Database contents can be accessed externally, both to read and to insert records. In this pipeline, sqlalchemy is used to load the database we have just created. The following code needs to be run to ensure that score calculation is correct.
SLKB_engine = sqlalchemy.create_engine(db_engine) # db_engine: your database URL (presumably the url_object used when creating the database)
Inserting Data to Database
Following data preparation, the data is ready to be processed. We need to declare two additional variables before calling the processing function:
- control_list (study_controls in the example below): List of control targets. These names must appear in the gene_1 and gene_2 columns of the counts reference.
- timepoint_list (study_conditions in the example below): A list of two elements: the T0 replicates and the TEnd replicates. Make sure that the replicate names match those in the counts reference.
Example:
sequence_ref = ...
counts_ref = ...
scores_ref = ...
study_controls = ['CONTROL']
study_conditions = [['T0_rep1', 'T0_rep2', 'T0_rep3'],
                    ['TEnd_rep1', 'TEnd_rep2', 'TEnd_rep3']]
db_inserts = SLKB.prepare_study_for_export(sequence_ref = sequence_ref,
                                           counts_ref = counts_ref,
                                           scores_ref = scores_ref,
                                           study_controls = study_controls,
                                           study_conditions = study_conditions)
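Because the replicate names in study_conditions have to align with the study_conditions column of the counts file, a quick pre-check can catch naming mismatches. This is a minimal sketch under that assumption, not a function provided by SLKB; `missing_replicates` is a hypothetical helper.

```python
def missing_replicates(study_conditions, counts_conditions, sep=";"):
    """Return replicate names from the timepoint lists that are absent from the counts file."""
    available = set(counts_conditions.split(sep))
    requested = [name for timepoint in study_conditions for name in timepoint]
    return [name for name in requested if name not in available]

# The study_conditions value used in the counts example earlier in this guide.
counts_conditions = "T0_1;T0_2;T12_1;T12_2"

aligned = [["T0_1", "T0_2"], ["T12_1", "T12_2"]]
misaligned = [["T0_rep1", "T0_rep2"], ["T12_1", "T12_2"]]

print(missing_replicates(aligned, counts_conditions))     # nothing missing
print(missing_replicates(misaligned, counts_conditions))  # the two T0_rep* names
```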
If the checks pass, db_inserts will contain 3 items: the sequence, counts, and score references. If no scores reference is given, a dummy score of 0 is assigned to each possible gene pair. This ensures that gene pairs are unique to each study and cell line.
If SL scores are supplied, by default, gene pairs whose SL score and statistical score fall below the specified cutoffs are deemed SL (column SL_or_not); otherwise, they are not SL. You can customize this behavior by editing the resulting db_inserts['scores_ref'].
By default, control genes supplied in the scores file are removed.
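The cutoff rule can be pictured with a small function. This is only a sketch of one plausible reading, not SLKB's implementation: it assumes that a NaN cutoff means "no cutoff applied", that every cutoff that is set must be satisfied, and that a pair with no cutoffs at all is treated as not SL. The function name `is_sl` is hypothetical.

```python
import math

def is_sl(sl_score, sl_cutoff, stat_score=math.nan, stat_cutoff=math.nan):
    """Assumed rule: each cutoff that is set (not NaN) must be undercut by its score."""
    checks = []
    if not math.isnan(sl_cutoff):
        checks.append(sl_score < sl_cutoff)
    if not math.isnan(stat_cutoff):
        checks.append(stat_score < stat_cutoff)
    # Assumption: with no cutoff at all there is nothing to call, so the pair is not SL.
    return bool(checks) and all(checks)

# The AKT3 / AR row from the scores example: SL score below its cutoff, no statistical cutoff.
print(is_sl(-0.977141, -0.361))
# A score above the cutoff is not called SL under this rule.
print(is_sl(-0.2, -0.361))
```

If your study combines the two criteria differently, adjusting the SL_or_not column in db_inserts['scores_ref'] before insertion is the place to do it.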
Finally, data can be inserted to the database.
SLKB.insert_study_to_db(SLKB_engine, db_inserts)
Calculating SL Scores and Inserting to Database
Score calculation methods are independent of each other and can be run in any order. The details of each scoring method are given in the original paper. Each score is accompanied by two helper functions: one checks whether the scores have already been added to the database, and the other inserts the scores into the database.
- check
- insert
Initial steps
SL scores for the gene pairs are calculated individually for each cell line within each study. First, we filter the counts by study, and then by cell line.
import pandas as pd
import sqlalchemy

# read the data
# experiment design
experiment_design = pd.read_sql_query(con=SLKB_engine.connect(),
                                      sql=sqlalchemy.text('SELECT * from CDKO_EXPERIMENT_DESIGN'.lower()),
                                      index_col = 'sgRNA_id')
experiment_design.reset_index(drop = True, inplace = True)
experiment_design.index.rename('sgRNA_id', inplace = True)

# counts
counts = pd.read_sql_query(con=SLKB_engine.connect(),
                           sql=sqlalchemy.text('SELECT * from joined_counts'.lower()),
                           index_col = 'sgRNA_pair_id')

# scores
scores = pd.read_sql_query(con=SLKB_engine.connect(),
                           sql=sqlalchemy.text('SELECT * from CDKO_ORIGINAL_SL_RESULTS'.lower()),
                           index_col = 'id')
scores.reset_index(drop = True, inplace = True)
scores.index.rename('gene_pair_id', inplace = True)
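The scoring snippets in the following sections use curr_study, curr_cl, and curr_counts, which come from the filtering step described above. The sketch below shows one way to produce them with pandas. It assumes the counts table exposes study_origin and cell_line_origin columns (as in the counts reference file; the actual column names in your database may differ) and uses a small toy DataFrame in place of the real counts. The generator name `iter_study_cell_lines` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the `counts` DataFrame; column names are assumptions.
counts = pd.DataFrame({
    "guide_1": ["g1", "g1", "g2", "g3"],
    "guide_2": ["g2", "g3", "g3", "g4"],
    "study_origin": ["36060092", "36060092", "36060092", "MYCDKO"],
    "cell_line_origin": ["22Rv1", "22Rv1", "PC3", "HeLa"],
})

def iter_study_cell_lines(counts):
    """Yield (study, cell line, counts subset) for every study and cell line combination."""
    for curr_study, study_counts in counts.groupby("study_origin", sort=True):
        for curr_cl, curr_counts in study_counts.groupby("cell_line_origin", sort=True):
            yield curr_study, curr_cl, curr_counts.copy()

subsets = [(study, cl, len(subset)) for study, cl, subset in iter_study_cell_lines(counts)]
print(subsets)
```

Inside such a loop, each scoring call from the sections below would receive the current curr_counts, curr_study, and curr_cl.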
For all scores, files are created in the process. You can specify the location to save your files (default: current working directory). This enables quick loading into the database for repeated analyses. GEMINI Score and MAGeCK Score require file generation in order to run. If the counts file has been updated (e.g., additional counts were added), setting the parameter re_run=True will restart the analysis from scratch.
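The saved files act as a cache: results are reused when present and recomputed when a re-run is requested. The following sketch illustrates that general pattern with a JSON file. It is not SLKB's implementation; `load_or_compute`, `toy_score`, the file layout, and the toy score value are all hypothetical.

```python
import json
import os
import tempfile

def load_or_compute(path, compute, re_run=False):
    """Return cached results from `path` unless re_run is set or no cache exists."""
    if os.path.exists(path) and not re_run:
        with open(path) as handle:
            return json.load(handle)
    result = compute()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        json.dump(result, handle)
    return result

calls = []

def toy_score():
    """Stand-in for an expensive scoring run; records each invocation."""
    calls.append(1)
    return {"AKT3|AR": -0.98}

with tempfile.TemporaryDirectory() as store_loc:
    cache = os.path.join(store_loc, "TOY_Files", "scores.json")
    first = load_or_compute(cache, toy_score)               # computes and saves
    second = load_or_compute(cache, toy_score)              # loads the saved file
    third = load_or_compute(cache, toy_score, re_run=True)  # recomputes from scratch
    print(len(calls))
```

The second call never invokes the scoring function, which is why stale files need an explicit re-run after the counts change.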
Median-B/NB Score
import os

# run only if the scores have not been added to the database yet
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'median_nb_score', SLKB_engine):
    median_res = SLKB.run_median_scores(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'MEDIAN_Files')
    SLKB.add_table_to_db(curr_counts.copy(), median_res['MEDIAN_NB_SCORE'], 'median_nb_score', SLKB_engine)
    if median_res['MEDIAN_B_SCORE'] is not None:
        SLKB.add_table_to_db(curr_counts.copy(), median_res['MEDIAN_B_SCORE'], 'median_b_score', SLKB_engine)
sgRNA-Derived-B/NB Score
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'sgrna_derived_nb_score', SLKB_engine):
    sgRNA_res = SLKB.run_sgrna_scores(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'sgRNA-DERIVED_Files')
    SLKB.add_table_to_db(curr_counts.copy(), sgRNA_res['SGRNA_DERIVED_NB_SCORE'], 'sgrna_derived_nb_score', SLKB_engine)
    if sgRNA_res['SGRNA_DERIVED_B_SCORE'] is not None:
        SLKB.add_table_to_db(curr_counts.copy(), sgRNA_res['SGRNA_DERIVED_B_SCORE'], 'sgrna_derived_b_score', SLKB_engine)
MAGeCK Score
For MAGeCK Score, files are created in the process. You can specify the location to save your files (default: current working directory). If you wish to re-run the analysis and store new results in place of the old ones, set re_run to True.
MAGeCK is run through a script file at the designated location. If you need to load any packages or set the path to your MAGeCK installation, supply the corresponding commands in cmd_params.
cmd_params = []
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'mageck_score', SLKB_engine):
    mageck_res = SLKB.run_mageck_score(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'MAGECK_Files', command_line_params = cmd_params, re_run = False)
    SLKB.add_table_to_db(curr_counts.copy(), mageck_res['MAGECK_SCORE'], 'mageck_score', SLKB_engine)
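On HPC systems where the system path cannot be changed, cmd_params is where the environment setup goes. The sketch below builds such a list, assuming that each entry is a shell line executed before the tool is called (which the `module load R/4.1.0` entry in the GEMINI Score section suggests, but which is not confirmed here). The helper `build_cmd_params` and the directory `/opt/tools/mageck/bin` are hypothetical.

```python
import shlex

def build_cmd_params(modules=(), extra_path_dirs=()):
    """Build shell lines: `module load` calls first, then PATH additions."""
    lines = [f"module load {shlex.quote(module)}" for module in modules]
    lines += [f"export PATH={shlex.quote(directory)}:$PATH" for directory in extra_path_dirs]
    return lines

# Hypothetical install location; replace with the directory that contains `mageck`.
cmd_params = build_cmd_params(extra_path_dirs=["/opt/tools/mageck/bin"])
print(cmd_params)
```

The resulting list can then be passed as command_line_params in the call above.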
Horlbeck Score
For Horlbeck Score, files are created in the process. You can specify the location to save your files (default: current working directory). If you wish to re-run the analysis and store new results in place of the old ones, set re_run to True.
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'horlbeck_score', SLKB_engine):
    horlbeck_res = SLKB.run_horlbeck_score(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'HORLBECK_Files', do_preprocessing = True, re_run = False)
    SLKB.add_table_to_db(curr_counts.copy(), horlbeck_res['HORLBECK_SCORE'], 'horlbeck_score', SLKB_engine)
GEMINI Score
For GEMINI Score, files are created in the process. You can specify the location to save your files (default: current working directory). Scores are stored following the GEMINI analysis for quick inserts to the database. If you wish to re-run the analysis and store new results in place of the old ones, set re_run to True.
Similarly to MAGeCK, GEMINI is run through a script file at the designated location. If you need to load any packages or set the path to your R environment, supply the corresponding commands in cmd_params.
cmd_params = ['module load R/4.1.0']
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'gemini_score', SLKB_engine):
    gemini_res = SLKB.run_gemini_score(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'GEMINI_Files', command_line_params = cmd_params, re_run = False)
    SLKB.add_table_to_db(curr_counts.copy(), gemini_res['GEMINI_SCORE'], 'gemini_score', SLKB_engine)
Query Results (For one table)
Following the score calculations, querying the results is straightforward. This snippet retrieves the scores from one of the tables.
score = SLKB.query_result_table(curr_counts.copy(), 'median_b_score', curr_study, curr_cl, SLKB_engine)
Query Results (For all tables)
Here, we will query all scores using the view within the database.
all_scores = pd.read_sql_query(con=SLKB_engine.connect(),
sql=sqlalchemy.text('SELECT * from calculated_sl_table'))
Further Analyses
The SLKB web application is available for download to help analyze your generated data. You can access the website at the following link, and its code at the link. To use user-generated data, check server.r within the web app.