Creating Projects

Introduction

The MDV project structure contains all the information required to store data and views about the biological study in question. We will use the pbmc3k project as an example:

pbmc3k-mdv
├── datafile.h5
├── datasources.json
├── scanpy-pbmc3k.h5ad
├── state.json
├── tracks
└── views.json

Briefly:

  • datafile.h5
    • HDF5 file containing the actual dataset (in this example, single-cell RNA sequencing data). HDF5 is a binary format for efficient storage and access of large scientific datasets.
  • datasources.json
    • Contains metadata and schema information for data columns/fields
    • Defines data types, statistical summaries (min/max, quantiles) for fields like n_genes, percent_mito, etc.
    • Describes cell-level and gene-level data characteristics for visualization purposes
  • scanpy-pbmc3k.h5ad
    • A scanpy AnnData file containing the single-cell data (in this example, PBMC = Peripheral Blood Mononuclear Cells, 3k = ~3000 cells). This file only needs to be included in ChatMDV projects, because ChatMDV uses scanpy for certain operations.
  • state.json
    • Simple configuration file defining available views ("default") and user permissions ("edit")
    • Controls initial view settings and access levels
  • tracks (only used in genomics-based projects)
    • Directory containing genomic track files (hg38.bed.gz and index file)
    • Template files for genome browser functionality
  • views.json
    • Defines visualization view configurations with empty initial charts for cells and genes
    • Sets up the default view structure for the data visualization interface
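
As a quick sanity check on the layout described above, the following is a minimal sketch of a shell helper that reports which files are absent from a project folder. The function name check_mdv_project is hypothetical and not part of mdvtools. Treating datafile.h5, datasources.json, state.json and views.json as the core files is an assumption based on the pbmc3k example; the .h5ad file and the tracks directory are skipped because they are only needed for ChatMDV and genomics projects respectively.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of mdvtools): list core project files that
# are missing from a folder. The file list is an assumption taken from the
# pbmc3k example; the .h5ad file and tracks/ are optional and not checked.
check_mdv_project() {
    local project_dir="$1"
    local missing=0
    local name
    for name in datafile.h5 datasources.json state.json views.json; do
        if [ ! -f "$project_dir/$name" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    # Exit status 0 means every core file was found.
    return "$missing"
}
```

For example, `check_mdv_project pbmc3k-mdv && echo "layout looks complete"` prints the confirmation only when all four files are present.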

This project structure can be generated with mdvtools, a lightweight version of MDV without the catalogue interface and infrastructure that come with the full Docker installation. mdvtools is aimed at advanced users and data processing workflows, and provides command-line tools for creating projects directly from common bioinformatics file formats.

Installation of mdvtools

To ensure a clean and isolated environment for your project, it's recommended to create a Python virtual environment. This helps manage dependencies and avoid conflicts with other projects. Follow these steps to set up a virtual environment and install mdvtools:

  1. Create a Virtual Environment: Navigate to your project directory and run the following command to create a virtual environment named venv:

    python -m venv venv
  2. Activate the Virtual Environment:

    • On Windows, use:
      .\venv\Scripts\activate
    • On macOS and Linux, use:
      source venv/bin/activate
  3. Install mdvtools: With the virtual environment activated, install mdvtools using pip:

    pip install mdvtools

By using a virtual environment, you ensure that mdvtools and its dependencies are contained within your project, reducing the risk of version conflicts.

Once installed, the mdvtools command-line interface is available inside the activated environment.
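
To confirm that the command is actually reachable from the current shell, a small check along these lines may help. It is only a sketch: have_cmd is a hypothetical helper name, and the check relies on nothing beyond the standard `command -v` lookup, so it makes no assumptions about mdvtools' own options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd mdvtools; then
    echo "mdvtools CLI found at: $(command -v mdvtools)"
else
    echo "mdvtools not found; is the virtual environment activated?"
fi
```

If the second message appears, re-running the activation step for your platform and then `pip install mdvtools` is the most likely fix.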

Supported Input Formats

Input Type   File Format   CLI Command      Supports Zip   Web Upload
Scanpy       .h5ad         convert-scanpy
MuData       .h5mu         convert-mudata
VCF          .vcf          convert-vcf

Converting Scanpy Files (.h5ad)

Convert a Scanpy .h5ad file to an MDV project:

mdvtools convert-scanpy output_folder example.h5ad \
--chunk_data \
--add_layer_data=False \
--max_dims=3 \
--delete_existing

With ChatMDV support (enables natural language interaction):

mdvtools convert-scanpy output_folder example.h5ad \
--chatmdv \
--zip

Converting MuData Files (.h5mu)

Convert multi-modal data to MDV format:

mdvtools convert-mudata output_folder example.h5mu \
--chunk_data --max_dims=3 --delete_existing

Converting VCF Files

Convert genomic variant data:

mdvtools convert-vcf output_folder variants.vcf

Creating Zip Archives

Add --zip to any conversion command to create a zip file for easy web upload:

mdvtools convert-scanpy output_folder example.h5ad --zip
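
When several .h5ad files need converting, the command above can be wrapped in a loop. The sketch below is one possible approach, not an mdvtools feature: the function name convert_all_scanpy, the DRY_RUN variable, and the "-mdv" suffix for output folders (borrowed from the pbmc3k-mdv example) are all assumptions made for illustration. Only the `mdvtools convert-scanpy <output> <input> --zip` invocation comes from the documentation above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: convert every .h5ad file in a directory, creating
# one MDV project folder per input file under output_root.
# Set DRY_RUN=1 to print the commands instead of running them.
convert_all_scanpy() {
    local input_dir="$1"
    local output_root="$2"
    local h5ad name
    for h5ad in "$input_dir"/*.h5ad; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$h5ad" ] || continue
        name=$(basename "$h5ad" .h5ad)
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "mdvtools convert-scanpy $output_root/$name-mdv $h5ad --zip"
        else
            mdvtools convert-scanpy "$output_root/$name-mdv" "$h5ad" --zip
        fi
    done
}
```

Running `DRY_RUN=1 convert_all_scanpy ./data ./projects` first shows exactly which commands would be executed before any conversion starts.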