Published May 9, 2023 | Version v1
Software Open

LoGoFunc

Description

# LoGoFunc

This repository contains the code to train LoGoFunc, the trained LoGoFunc models, and the data and code to generate genome-wide gain-of-function (GOF), loss-of-function (LOF), and neutral predictions for missense variants.
The trained LoGoFunc models are included in the `models` directory. Training and testing data, along with functional labels, are included in the `data` directory.

# Requirements

To reproduce the LoGoFunc predictions, you will need the following Python libraries:
 - pandas
 - joblib
 - lightgbm
 - scikit-learn
 - imbalanced-learn
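
As a quick sanity check, the sketch below reports which of these libraries cannot be imported in the current environment. Note that scikit-learn and imbalanced-learn are imported as `sklearn` and `imblearn`. The `missing_modules` helper and the `REQUIRED_MODULES` list are illustrative names and are not part of the LoGoFunc repository.

```python
from importlib.util import find_spec

# Import names for the libraries listed above (hypothetical helper constant).
REQUIRED_MODULES = ["pandas", "joblib", "lightgbm", "sklearn", "imblearn"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Example use:
#   print(missing_modules())  # an empty list means everything is importable
```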

# Setup 

We recommend using a conda environment to manage the Python dependencies. Assuming conda is installed and on your path, you can prepare a conda environment for LoGoFunc with the following steps.
1. `conda create -n logofunc python=3 pandas=1.5.0 joblib lightgbm=3.2.1 scikit-learn=1.1.2 imbalanced-learn=0.8.0`
2. `conda activate logofunc`
3. `git clone https://gitlab.com/itan-lab/logofunc.git`
4. `cd logofunc`
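
After activating the environment, you may want to confirm that the installed versions match the ones pinned in step 1. The following is a minimal sketch using only the standard library; `PINNED`, `find_mismatches`, and `_installed_version` are illustrative names, not part of the LoGoFunc scripts. joblib is omitted because the command above does not pin it.

```python
from importlib import metadata

# Versions pinned in the `conda create` command above.
PINNED = {
    "pandas": "1.5.0",
    "lightgbm": "3.2.1",
    "scikit-learn": "1.1.2",
    "imbalanced-learn": "0.8.0",
}


def _installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, get_version=_installed_version):
    """Map each package whose version differs to a (wanted, found) tuple."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = get_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Example use:
#   print(find_mismatches())  # an empty dict means all pins are satisfied
```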

# Testing

The `test.py` script may be used to generate predictions for the test variants from the LoGoFunc manuscript using the pretrained LoGoFunc model as follows.
1. `python ./scripts/test.py predictions.csv`

Predictions for the test variants will be written to the `predictions.csv` file.
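
To take a first look at the output, the sketch below reads the header and the first few rows of a CSV file with the standard library. The column layout of `predictions.csv` is not documented here, so the helper makes no assumptions about column names; `preview_csv` is an illustrative name and not part of the repository.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return (header, rows): the header line and up to n_rows data rows."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, n_rows))
    return header, rows


# Example use, once test.py has produced predictions.csv:
#   header, rows = preview_csv("predictions.csv")
#   print(header)
#   for row in rows:
#       print(row)
```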

# Training models

LoGoFunc may be trained from scratch as follows.
1. `python ./scripts/train.py output_dir`

Trained models and the fit data preprocessor will be stored in `output_dir`. To generate predictions with the fit models, you may edit the paths in the `test.py` file to load the models and preprocessor from `output_dir`. 

# Generating predictions for all missense variants

To reproduce LoGoFunc's GOF, LOF, and neutral predictions for missense variants, first download the annotated missense variants, which are available here: https://zenodo.org/record/7562029/files/data.csv.gz?download=1

1. `wget -O data.csv.gz "https://zenodo.org/record/7562029/files/data.csv.gz?download=1"`
2. `python ./scripts/predict.py -f data.csv.gz -m ./models -o output.csv`

Due to the large set of annotations employed by LoGoFunc, most users will not be able to load the entire missense variant file into memory. As a result, missense variants are read and predicted 10,000 at a time by default. This batch size may be adjusted by passing the `-s` flag and an integer of your choosing to the predict script (see `python ./scripts/predict.py -h` for more details). The output file will contain the LoGoFunc neutral, GOF, and LOF predictions.
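
The general idea of batched reading can be sketched as follows. This is not the code used by `predict.py`, whose internals are not shown here; it is a standard-library illustration of streaming a gzipped CSV in fixed-size batches so that only one batch is held in memory at a time. The `iter_batches` helper is a hypothetical name. With pandas, `pandas.read_csv(path, chunksize=10000)` gives a comparable iterator of DataFrames.

```python
import csv
import gzip
from itertools import islice


def iter_batches(path, batch_size=10_000):
    """Yield lists of row dicts from a gzipped CSV, batch_size rows at a time."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls at most batch_size rows without reading the rest.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch


# Example use:
#   for batch in iter_batches("data.csv.gz"):
#       ...  # score the rows in this batch, then append results to the output
```

A smaller batch size lowers peak memory use at the cost of more iterations, which is the same trade-off the `-s` flag exposes.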

Files

logofunc.zip (144.4 MB)
md5:0c06f1888774e9338f767a8ec39b8aaa