Creating Projects
Introduction
The MDV project structure contains all the information required to store data and views about the biological study in question. We will use the pbmc3k project as an example:
pbmc3k-mdv
├── datafile.h5
├── datasources.json
├── scanpy-pbmc3k.h5ad
├── state.json
├── tracks
└── views.json
Briefly:
- datafile.h5
  - HDF5 data file containing the actual dataset (in this case, single-cell RNA sequencing data). HDF5 is a binary format for efficient storage and access of large scientific datasets.
- datasources.json
  - Contains metadata and schema information for data columns/fields
  - Defines data types and statistical summaries (min/max, quantiles) for fields such as n_genes and percent_mito
  - Describes cell-level and gene-level data characteristics for visualization purposes
- scanpy-pbmc3k.h5ad
  - A scanpy AnnData file containing the single-cell data (in this example, PBMC = Peripheral Blood Mononuclear Cells, 3k = ~3000 cells). This file only needs to be included for ChatMDV projects, because ChatMDV uses scanpy for certain operations.
- state.json
  - Simple configuration file defining available views ("default") and user permissions ("edit")
  - Controls initial view settings and access levels
- tracks (only used in genomics-based projects)
  - Directory containing genomic track files (hg38.bed.gz and its index file)
  - Template files for genome browser functionality
- views.json
  - Defines visualization view configurations with empty initial charts for cells and genes
  - Sets up the default view structure for the data visualization interface
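The layout described above can be checked with a few lines of Python. This is a minimal sketch, not part of mdvtools: the helper name missing_project_files is hypothetical, and treating these four files as the required core is an assumption based on the pbmc3k example (the .h5ad file and the tracks directory are described above as optional).

```python
from pathlib import Path

# File names taken from the pbmc3k example. Treating these four as the
# required core is an assumption of this sketch, not a rule stated by MDV.
CORE_FILES = ("datafile.h5", "datasources.json", "state.json", "views.json")


def missing_project_files(project_dir):
    """Return the core file names that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in CORE_FILES if not (root / name).is_file()]
```

Calling missing_project_files("pbmc3k-mdv") on a complete project would return an empty list; any names it does return are the files to look for.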
This project structure can be generated with mdvtools, a lightweight version of MDV that omits the catalogue interface and infrastructure included in the full Docker installation. mdvtools is intended for advanced users and data processing workflows, and provides command-line tools for creating projects directly from common bioinformatics file formats.
Installation of mdvtools
To ensure a clean and isolated environment for your project, it is recommended to create a Python virtual environment. This helps manage dependencies and avoid conflicts with other projects. Follow these steps to set up a virtual environment and install mdvtools:
- Create a virtual environment: navigate to your project directory and run the following command to create a virtual environment named venv:
  python -m venv venv
- Activate the virtual environment:
  - On Windows, use:
    .\venv\Scripts\activate
  - On macOS and Linux, use:
    source venv/bin/activate
- Install mdvtools: with the virtual environment activated, install mdvtools using pip:
  pip install mdvtools
By using a virtual environment, you ensure that mdvtools and its dependencies are contained within your project, reducing the risk of version conflicts.
Once installed, the mdvtools command-line interface is available.
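To confirm that the installation steps took effect, a short check like the following may help. It is a sketch using only the Python standard library; the function names are hypothetical, and it assumes the pip install places an executable called mdvtools on the PATH of the active environment.

```python
import shutil
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def cli_available(name="mdvtools"):
    """Return True when an executable called `name` is found on PATH."""
    return shutil.which(name) is not None
```

If in_virtualenv() is False, the environment was probably not activated; if cli_available() is False after installation, the install likely went into a different environment.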
Supported Input Formats
Input Type | File Format | CLI Command | Supports Zip | Web Upload |
---|---|---|---|---|
Scanpy | .h5ad | convert-scanpy | ✅ | ✅ |
MuData | .h5mu | convert-mudata | ✅ | ✅ |
VCF | .vcf | convert-vcf | ✅ | ✅ |
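For scripted workflows, the table above can be turned into a small lookup that picks the subcommand from the file extension. The sketch below only builds the argument list and does not run anything; build_command is a hypothetical helper, and the argument order (output folder first, then input file) and the --zip flag follow the command examples in this document.

```python
from pathlib import Path

# Extension -> mdvtools subcommand, copied from the table above.
CONVERTERS = {
    ".h5ad": "convert-scanpy",
    ".h5mu": "convert-mudata",
    ".vcf": "convert-vcf",
}


def build_command(input_file, output_folder, zip_output=False):
    """Build the mdvtools argument list for a supported input file."""
    suffix = Path(input_file).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"Unsupported input format: {suffix or input_file}")
    # Output folder comes before the input file, as in the examples here.
    cmd = ["mdvtools", CONVERTERS[suffix], str(output_folder), str(input_file)]
    if zip_output:
        cmd.append("--zip")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) to perform the conversion. Building the list separately keeps the dispatch logic easy to inspect before anything is executed.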
Converting Scanpy Files (.h5ad)
Convert a Scanpy .h5ad file to an MDV project:
mdvtools convert-scanpy output_folder example.h5ad \
--chunk_data \
--add_layer_data=False \
--max_dims=3 \
--delete_existing
With ChatMDV support (enables natural language interaction):
mdvtools convert-scanpy output_folder example.h5ad \
--chatmdv \
--zip
Converting MuData Files (.h5mu)
Convert multi-modal data to MDV format:
mdvtools convert-mudata output_folder example.h5mu \
--chunk_data --max_dims=3 --delete_existing
Converting VCF Files
Convert genomic variant data:
mdvtools convert-vcf output_folder variants.vcf
Creating Zip Archives
Add --zip to any conversion command to create a zip file for easy web upload:
mdvtools convert-scanpy output_folder example.h5ad --zip