Natural Language Processing
Molecular Sequence Analysis
Apache Spark (Databricks)
Cluster Computing (MPI/SNOW)
GPU-Based Processing (CUDA)
High Performance Computing
Machine Learning, Parallel Computing, Data Engineering
A collection of "cookbook-style" scripts for simplifying data engineering and machine learning in Apache Spark.
Apache Spark is a highly scalable, massively parallel computing platform well suited to machine learning and data engineering tasks. Using distributed processing with the Spark API, users can perform various tasks on huge amounts of data in their preferred language (Python, R, Scala, SQL, etc.). However, there is often a bit of a learning curve to the Spark functionality (PySpark or SparkR), even for users who are experts in the base language. Sparkitecture is an ebook collection of various scripts to help make this process a little easier.
Genomics, Machine Learning, Infectious Diseases
Parallel Processing and Ensemble Machine Learning Modeling for the Prediction of Artemisinin Resistance in Malaria (Malaria DREAM Challenge 2019 Submission).
The Malaria DREAM Challenge is open to anyone interested in contributing to the development of computational models that address important problems in advancing the fight against malaria. The overall goal of the first Malaria DREAM Challenge is to predict Artemisinin (Art) drug resistance level of a test set of malaria parasites using their in vitro transcription data and a training set consisting of published in vivo and unpublished in vitro transcriptomes. The in vivo dataset consists of ~1000 transcription samples from various geographic locations covering a wide range of life cycles and resistance levels, with other accompanying data such as patient age, geographic location, Art combination therapy used, etc. [Mok et al., (2015) Science]. The in vitro transcription dataset consists of 55 isolates, with transcription collected at two timepoints (6 and 24 hours post-invasion), in the absence or presence of an Art perturbation, for two biological replicates using a custom microarray at the Ferdig lab. Using these transcription datasets, participants will be asked to predict three different resistance states of a subset of the 55 in vitro isolate samples.
Genomics, Phylogenetics, Infectious Diseases
Shiny web application for visualizing disease transmission networks from phylogenetic trees.
Strainhub is a web-based tool that generates disease transmission networks and associated metrics from a phylogenetic tree combined with an associated metadata file. The software maps the metadata onto the tips of the tree and performs a parsimony ancestry reconstruction step to infer the metadata states of ancestral nodes. Changes of state along the branches then become the links from which the network is constructed.
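To make the parsimony ancestry reconstruction step more concrete, here is a minimal Python sketch of the general idea. It is not Strainhub's own code: it assumes a rooted, bifurcating tree, uses the classic Fitch parsimony algorithm as a stand-in for the reconstruction method, and breaks ties alphabetically, which is an arbitrary choice. The function name `fitch_transmission_network`, the toy tree, and the location labels are all hypothetical.

```python
from collections import Counter


def fitch_transmission_network(children, root, leaf_states):
    """Reconstruct ancestral metadata states with Fitch parsimony and
    count state changes along branches as directed network edges.

    children    -- dict mapping each internal node to a list of its children
    root        -- name of the root node
    leaf_states -- dict mapping each tip to its metadata state (e.g. a location)
    """
    state_sets = {}

    # Bottom-up pass: each internal node gets the intersection of its
    # children's state sets, or their union if the intersection is empty.
    def post_order(node):
        kids = children.get(node, [])
        if not kids:
            state_sets[node] = {leaf_states[node]}
            return
        for kid in kids:
            post_order(kid)
        kid_sets = [state_sets[kid] for kid in kids]
        common = set.intersection(*kid_sets)
        state_sets[node] = common if common else set.union(*kid_sets)

    # Top-down pass: keep the parent's state where possible; otherwise pick
    # one candidate (alphabetical tie-break, an assumption of this sketch).
    assigned = {}
    edges = Counter()

    def pre_order(node, parent_state):
        options = state_sets[node]
        state = parent_state if parent_state in options else min(options)
        assigned[node] = state
        if parent_state is not None and state != parent_state:
            edges[(parent_state, state)] += 1  # one inferred transition
        for kid in children.get(node, []):
            pre_order(kid, state)

    post_order(root)
    pre_order(root, None)
    return assigned, dict(edges)


# Hypothetical toy tree: ((A, B), (C, D)) with a location recorded per tip.
children = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
leaf_states = {"A": "USA", "B": "USA", "C": "Brazil", "D": "USA"}

assigned, edges = fitch_transmission_network(children, "root", leaf_states)
print(edges)  # {('USA', 'Brazil'): 1}
```

In this toy case the root is reconstructed as "USA", and the only state change occurs on the branch leading to tip C, giving a single directed link from "USA" to "Brazil". Network metrics could then be computed on the resulting edge counts.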