
HDF Files and filtered_feature_bc_matrix.h5

Recommended Post : 【Bioinformatics】 Bioinformatics Analysis Contents


1. HDF Files (hierarchical data format files)

2. filtered_feature_bc_matrix.h5



1. HDF Files (hierarchical data format files)

⑴ Overview

① Used for easy and fast I/O of large files without size limitations

② A platform-independent way to exchange data between different environments

③ Used in aerospace, physics, engineering, finance, academic research, genomics, astronomy, electronics, medicine, etc.

④ HDF4 files (.h4) and HDF5 files (.h5) are commonly used

⑵ Creating, reading, and grouping HDF files using Python

① Writing using numpy
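
A minimal sketch, assuming the h5py package (referenced in ⑤ below) is used to write a NumPy array; the file and dataset names are illustrative:

import h5py
import numpy as np

arr = np.random.rand(1000, 50)                      # example 2-D array

with h5py.File('example.h5', 'w') as f:             # 'w' creates/overwrites the file
    f.create_dataset('my_dataset', data=arr)        # store the array as a dataset
    grp = f.create_group('my_group')                # groups give the file its hierarchy
    grp.create_dataset('nested', data=arr, compression='gzip')  # optional compression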

② Reading using numpy
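
A matching sketch for reading the arrays back (assuming the example.h5 file written above):

import h5py

with h5py.File('example.h5', 'r') as f:
    print(list(f.keys()))                 # top-level groups and datasets
    arr = f['my_dataset'][:]              # load the whole dataset as a NumPy array
    part = f['my_group/nested'][:10]      # slicing reads only part of the data from disk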

③ Writing using pandas

○ Data writing seems to be possible only with 2D data

○ Trying to write 3D data raises an error (see the sketch below)
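
A minimal sketch of writing with pandas; this assumes the PyTables package is installed, and the file name and key are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 5))           # 2-D data: works
df.to_hdf('example_pandas.h5', key='df', mode='w')

# 3-D data fails because a DataFrame must be 2-D,
# e.g. pd.DataFrame(np.zeros((2, 3, 4))) raises a ValueError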

④ Reading using pandas
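
And reading it back (same assumptions as above):

import pandas as pd

df = pd.read_hdf('example_pandas.h5', key='df')      # returns the stored DataFrame
print(df.shape)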

⑤ References

h5py: reading and writing HDF5 files in Python

How to save a large dataset in an hdf5 file using python ? (Quick Guide)

⑶ HDF File Viewers

① Software for Windows : GDAL, Golden Software Surfer, Safe Software FME Desktop

② Software for Mac OS : NCSA HDFView, Basic ENVISAT Atmospheric Toolbox, WaveMetrics IGOR Pro

③ Software for Linux : GDAL, NCSA HDFView, Basic ENVISAT Atmospheric Toolbox

④ Others : HDFView, h5dump command (Linux)



2. filtered_feature_bc_matrix.h5

(Figure) HDF5 file hierarchy of a filtered_feature_bc_matrix.h5 file
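
The hierarchy of any .h5 file can be inspected in Python with h5py, for example (assuming the file sits in the working directory):

import h5py

def print_h5_tree(path):
    # print every group and dataset in the file
    with h5py.File(path, 'r') as f:
        f.visititems(lambda name, obj: print(name, obj))

print_h5_tree('filtered_feature_bc_matrix.h5')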

Issue 1. (ST) Most spatial transcriptomics pipelines proceed under the assumption that a .h5 file exists.

① Overview

○ Most datasets are geared toward downstream analysis under the assumption that a .h5 file is available

○ Functions such as scanpy.read_visium (Python) and Load10X_Spatial (R) accept only .h5 files as input (see the sketch after this list)

○ Some databases provide only derived files without exposing the .h5 file

○ barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz
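
For reference, a minimal sketch of the .h5-based entry point such pipelines expect; the directory path is hypothetical, and scanpy.read_visium looks for filtered_feature_bc_matrix.h5 together with the spatial/ folder inside it:

import scanpy as sc

# expects <dir>/filtered_feature_bc_matrix.h5 and <dir>/spatial/
adata = sc.read_visium('./Data/visium_dir', count_file='filtered_feature_bc_matrix.h5')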

Solution 1. R sceasy::convertFormat : Failure


library(Seurat)
library(reticulate)
library(sceasy)

pbmc.data <- Seurat::Read10X(data.dir = 
    "./Data/CID44971_spatial/CID44971_raw_feature_bc_matrix/")   # read the 10x matrix directory
pbmc <- Seurat::CreateSeuratObject(counts = pbmc.data)           # build a Seurat object from the counts
sceasy::convertFormat(pbmc, from="seurat", to="anndata",         # convert to an AnnData-format .h5 file
    outFile='./Data/CID44971_spatial/CID44971_raw_feature_bc_matrix/filtered_feature_bc_matrix.h5')


○ Cause : the way genome information is added when creating the .h5 file differs between the v2 and v3 formats, which leads to the problem above
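
A quick way to see which layout a given .h5 file follows is to check where the genome information lives; this is a sketch, assuming the usual Cell Ranger layouts (v3 stores the genome per feature under matrix/features/genome, v2 uses one top-level group per genome):

import h5py
import numpy as np

def h5_layout(path):
    with h5py.File(path, 'r') as f:
        if 'matrix' in f:   # v3-style layout
            genomes = set(np.array(f['matrix/features/genome']).astype(str))
            return 'v3', genomes
        return 'v2', set(f.keys())   # v2-style: one group per genome, e.g. 'GRCh38'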

Solution 2. R code to read Visium data when a tissue_dir directory contains matrix.mtx, barcodes.tsv, features.tsv, and the spatial folder : Success


library(Seurat)

b_data <- ReadMtx('./matrix.mtx.gz',
                  './barcodes.tsv.gz',
                  './features.tsv.gz',
                  feature.column = 1)
b_data <- CreateSeuratObject(b_data, assay = 'Spatial')
b_image <- Read10X_Image('./spatial/')
b_image_ <- b_image[Cells(b_data)]
DefaultAssay(object = b_image_) <- 'Spatial'
b_data[["slice1"]] <- b_image_
# you can change "slice1"

b_data <- SCTransform(b_data, assay = "Spatial", verbose = FALSE, variable.features.n = 1000)
# Some errors might occur when running SpatialFeaturePlot without SCTransform

SpatialFeaturePlot(b_data, rownames(b_data)[1])
# check whether it works properly


Solution 3. Python code to read Visium data when a tissue_dir directory contains matrix.mtx, barcodes.tsv, features.tsv, and the spatial folder : Success


import scanpy as sc
from scanpy import logging as logg   # used by the missing-image warning in to_spatial()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

from pathlib import Path
import json

def to_spatial(adata, path, library_id = 'TNBC1'):
    
    adata.uns["spatial"][library_id] = dict()
    path = Path(path)
    
    files = dict(
        tissue_positions_file=path / 'spatial/tissue_positions_list.csv',
        scalefactors_json_file=path /  'spatial/scalefactors_json.json',
        hires_image=path / 'spatial/tissue_hires_image.png',
        lowres_image=path / 'spatial/tissue_lowres_image.png',
    )

    # check if files exists, continue if images are missing
    for f in files.values():
        if not f.exists():
            if any(x in str(f) for x in ["hires_image", "lowres_image"]):
                logg.warning(
                    f"You seem to be missing an image file.\n"
                    f"Could not find '{f}'."
                )
            else:
                raise OSError(f"Could not find '{f}'")

    adata.uns["spatial"][library_id]['images'] = dict()
    for res in ['hires', 'lowres']:
        try:
            adata.uns["spatial"][library_id]['images'][res] = imread(
                str(files[f'{res}_image'])
            )
        except Exception:
            raise OSError(f"Could not find '{res}_image'")

    # read json scalefactors
    adata.uns["spatial"][library_id]['scalefactors'] = json.loads(
        files['scalefactors_json_file'].read_bytes()
    )

    #adata.uns["spatial"][library_id]["metadata"] = {
    #    k: (str(attrs[k], "utf-8") if isinstance(attrs[k], bytes) else attrs[k])
    #    for k in ("chemistry_description", "software_version")
    #    if k in attrs
    #}

    # read coordinates
    positions = pd.read_csv(files['tissue_positions_file'], header=None)
    positions.columns = [
        'barcode',
        'in_tissue',
        'array_row',
        'array_col',
        'pxl_col_in_fullres',
        'pxl_row_in_fullres',
    ]
    positions.index = positions['barcode']

    adata.obs = adata.obs.join(positions, how="left")
    
    adata.obsm['spatial'] = adata.obs[
        ['pxl_row_in_fullres', 'pxl_col_in_fullres']
    ].to_numpy()
    
    adata.obs.drop(
        columns=['barcode', 'pxl_row_in_fullres', 'pxl_col_in_fullres'],
        inplace=True,
    )

    return adata

def load_adata(tissue_dir):
    # matrix.mtx is stored as features x barcodes; transpose so barcodes become obs
    adata1 = sc.read_mtx(tissue_dir + 'matrix.mtx')
    adata1 = adata1.transpose()
    # attach barcode and feature metadata
    cellname = pd.read_csv(tissue_dir + 'barcodes.tsv', sep="\t", header=None, index_col=1, names=['idx', 'barcode'])
    cellname.index = cellname['idx']
    featurename = pd.read_csv(tissue_dir + 'features.tsv', sep='\t', header=None, index_col=1, names=['gene_ids', 'feature_types'])
    adata1.var = featurename
    adata1.obs = cellname
    # attach the Visium image, scale factors, and spot coordinates
    adata1.uns["spatial"] = dict()
    adata1 = to_spatial(adata1, tissue_dir, library_id='Library_1')
    # keep only spots that lie on the tissue
    adata1 = adata1[adata1.obs.in_tissue == 1,:]
    adata1.var_names_make_unique()
    # record total counts before and after normalization, then log-transform
    adata1.obs['unnormalized_counts'] = adata1.X.sum(axis=1).A1
    sc.pp.normalize_total(adata1, inplace=True)
    adata1.obs['normalized_counts'] = adata1.X.sum(axis=1).A1
    sc.pp.log1p(adata1)
    sc.pp.highly_variable_genes(adata1, flavor="seurat", n_top_genes=2000)

    return adata1
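
A usage sketch (the directory path is hypothetical; note the trailing slash, since load_adata concatenates file names directly onto tissue_dir):

adata = load_adata('./Data/CID44971_spatial/tissue_dir/')   # hypothetical path
print(adata)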


Trouble-shooting 1. Warning message: “Invalid name supplied, making object name syntactically valid. New object name is X1160920F; see ?make.names for more details on syntax validity”

○ Resolved by making the directory name start with a letter instead of a number

○ This does not apply to the code above, since no specific directory name is hard-coded there

Trouble-shooting 2. Exception: File is missing one or more required datasets. (Python)

○ This error appears when the .h5 file was not created with the dataset layout that the reader expects (a quick check is sketched below)
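
A sketch of such a check, assuming the Cell Ranger v3-style matrix layout (the dataset list below reflects that layout, and the path is hypothetical):

import h5py

REQUIRED = ['barcodes', 'data', 'indices', 'indptr', 'shape',
            'features/id', 'features/name', 'features/feature_type']

def missing_datasets(path):
    with h5py.File(path, 'r') as f:
        if 'matrix' not in f:
            return ['matrix']
        return [k for k in REQUIRED if k not in f['matrix']]

print(missing_datasets('filtered_feature_bc_matrix.h5'))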

Issue 2. (sc or ST) In case matrix.mtx.gz, barcodes.tsv, or features.tsv are corrupted:

① If the matrix.mtx.gz file is corrupted, read it in the R environment using readMM and restore the corrupted file using writeMM.

② For barcodes.tsv and features.tsv, read them in the R environment using read.table and restore the corrupted files using write.table (a Python alternative is sketched below).
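
The same repair can be sketched in Python with scipy and pandas, assuming the files are still parseable (file names are illustrative):

import pandas as pd
from scipy.io import mmread, mmwrite

# re-read and re-write the sparse matrix; mmread handles .mtx.gz directly
mat = mmread('matrix.mtx.gz')
mmwrite('matrix_fixed.mtx', mat)

# re-read and re-write barcodes/features as plain tab-separated files
bc = pd.read_csv('barcodes.tsv.gz', sep='\t', header=None)
bc.to_csv('barcodes_fixed.tsv', sep='\t', header=False, index=False)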

Issue 3. (sc or ST) Extracting raw data from R or Python objects

Solution 1. Code to extract matrix.mtx, barcodes.tsv, and features.tsv files from an R Seurat object : Success


library(Seurat)

counts <- seurat_obj@assays$RNA@counts       # raw count matrix (features x cells)
barcodes <- rownames(seurat_obj@meta.data)   # cell barcodes
features <- rownames(counts)                 # feature (gene) names

library(Matrix)
writeMM(counts, file = "~/Downloads/matrix.mtx")

write.table(barcodes, file = "~/Downloads/barcodes.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")

# features.tsv normally holds gene IDs in column 1 and gene names in column 2;
# here the same row names are written to both columns
gene_ids <- features
gene_names <- features
features_df <- data.frame(gene_ids, gene_names)
write.table(features_df, file = "~/Downloads/features.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
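
Since Issue 3 also covers Python objects, here is a minimal counterpart for an AnnData object; adata is a hypothetical variable assumed to hold raw counts in adata.X:

import pandas as pd
from scipy.io import mmwrite

# adata: an existing AnnData object (hypothetical); stored as barcodes x genes,
# so transpose to genes x barcodes as in 10x output
mmwrite('matrix.mtx', adata.X.T)
pd.Series(adata.obs_names).to_csv('barcodes.tsv', sep='\t', header=False, index=False)
pd.DataFrame({'gene_ids': adata.var_names, 'gene_names': adata.var_names}).to_csv(
    'features.tsv', sep='\t', header=False, index=False)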


Application 1. Python code that reports the read count for each barcode from an h5 file.


import h5py
import numpy as np

def get_mapped_reads(h5_file):
    with h5py.File(h5_file, 'r') as f:
        # Access the matrix group in the HDF5 file (Cell Ranger v3 layout: a CSC sparse matrix)
        matrix_group = f['matrix']

        # Read the barcode, gene, and count datasets
        barcodes = np.array(matrix_group['barcodes']).astype(str)
        genes = np.array(matrix_group['features/name']).astype(str)
        data = np.array(matrix_group['data'])
        indptr = np.array(matrix_group['indptr'])

        # Sum the counts in each barcode's column slice (indptr[i]:indptr[i+1])
        # to get the total number of mapped reads per barcode
        barcode_read_counts = np.array(
            [data[indptr[i]:indptr[i + 1]].sum() for i in range(len(indptr) - 1)]
        )

        return barcodes, genes, barcode_read_counts
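
A usage sketch (hypothetical file name):

barcodes, genes, counts = get_mapped_reads('filtered_feature_bc_matrix.h5')
print(barcodes[:5], counts[:5])   # total counts for the first five barcodes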



Posted : 2022.01.04 09:04
