HDF Files and filtered_feature_bc_matrix.h5
Recommended Post : 【Bioinformatics】 Bioinformatics Analysis Contents
1. HDF Files (hierarchical data format files)
2. filtered_feature_bc_matrix.h5
1. HDF Files (hierarchical data format files)
⑴ Overview
① Used for easy and fast I/O of large files without size limitations
② One of the methods for exchanging files independent of the environment
③ Used in aerospace, physics, engineering, finance, academic research, genomics, astronomy, electronics, medicine, etc.
④ HDF4 files (.h4) and HDF5 files (.h5) are commonly used
⑵ Creating, reading, and grouping HDF files using Python
① Writing using numpy
② Reading using numpy
③ Writing using pandas
○ Data writing seems to be possible only with 2D data
○ Trying to write 3D data results in the following error
④ Reading using pandas
⑤ References
○ How to save a large dataset in an hdf5 file using python ? (Quick Guide)
⑶ HDF File Viewers
① Software for Windows : GDAL, Golden Software Surfer, Safe Software FME Desktop
② Software for Mac OS : NCSA HDFView, Basic ENVISAT Atmospheric Toolbox, WaveMetrics IGOR Pro
③ Software for Linux : GDAL, NCSA HDFView, Basic ENVISAT Atmospheric Toolbox
④ Others : HDFView, h5dump command (Linux)
2. filtered_feature_bc_matrix.h5
⑵ Issue 1. (ST) Most spatial transcriptomics pipelines proceed under the assumption that a .h5 file exists.
① Overview
○ Most data is focused on downstream analysis assuming the existence of .h5 files
○ Cases like scanpy.read_visium (Python) and Load10X_Spatial (R) only accept .h5 files as input
○ Some databases provide only derived files without exposing the .h5 file
○ barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz
② Solution 1. R sceasy::convertFormat : Failure
library(Seurat)
library(reticulate)
library(sceasy)
pbmc.data <- Seurat::Read10X(data.dir =
"./Data/CID44971_spatial/CID44971_raw_feature_bc_matrix/")
pbmc <- Seurat::CreateSeuratObject(counts = pbmc.data)
sceasy::convertFormat(pbmc, from="seurat", to="anndata",
outFile='./Data/CID44971_spatial/CID44971_raw_feature_bc_matrix/filtered_feature_bc_matrix.h5')
○ Cause : Depending on v2 or v3 when creating .h5 files, the addition of genome information differs, leading to the aforementioned problem
③ Solution 2. When there is a tissue_dir directory containing matrix.mtx, barcodes.tsv, features.tsv, and the spatial folder, the code to read Visium data in R. : Success
library(Seurat)
b_data <-ReadMtx('./matrix.mtx.gz',
'./barcodes.tsv.gz',
'./features.tsv.gz',
feature.column=1)
b_data = CreateSeuratObject(b_data, assay='Spatial')
b_image = Read10X_Image(paste('./spatial/',sep=''))
b_image_ = b_image[Cells(b_data)]
DefaultAssay(object = b_image_) <- 'Spatial'
b_data[["slice1"]]=b_image_
# you can change "slice1"
b_data <- SCTransform(b_data, assay = "Spatial", verbose = FALSE, variable.features.n = 1000)
# Some errors might occur when running SpatialFeaturePlot without SCTransform
SpatialFeaturePlot(b_data, rownames(b_data)[1])
# check whether it works properly
④ Solution 3. When there is a tissue_dir directory containing matrix.mtx, barcodes.tsv, features.tsv, and the spatial folder, the code to read Visium data in Python. : Success
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.image import imread
import numpy as np
import cv2
import math
from pathlib import Path, PurePath
from typing import Union, Dict, Optional, Tuple, BinaryIO
import json
def to_spatial(adata, path, library_id = 'TNBC1'):
adata.uns["spatial"][library_id] = dict()
path = Path(path)
files = dict(
tissue_positions_file=path / 'spatial/tissue_positions_list.csv',
scalefactors_json_file=path / 'spatial/scalefactors_json.json',
hires_image=path / 'spatial/tissue_hires_image.png',
lowres_image=path / 'spatial/tissue_lowres_image.png',
)
# check if files exists, continue if images are missing
for f in files.values():
if not f.exists():
if any(x in str(f) for x in ["hires_image", "lowres_image"]):
logg.warning(
f"You seem to be missing an image file.\n"
f"Could not find '{f}'."
)
else:
raise OSError(f"Could not find '{f}'")
adata.uns["spatial"][library_id]['images'] = dict()
for res in ['hires', 'lowres']:
try:
adata.uns["spatial"][library_id]['images'][res] = imread(
str(files[f'{res}_image'])
)
except Exception:
raise OSError(f"Could not find '{res}_image'")
# read json scalefactors
adata.uns["spatial"][library_id]['scalefactors'] = json.loads(
files['scalefactors_json_file'].read_bytes()
)
#adata.uns["spatial"][library_id]["metadata"] = {
# k: (str(attrs[k], "utf-8") if isinstance(attrs[k], bytes) else attrs[k])
# for k in ("chemistry_description", "software_version")
# if k in attrs
#}
# read coordinates
positions = pd.read_csv(files['tissue_positions_file'], header=None)
positions.columns = [
'barcode',
'in_tissue',
'array_row',
'array_col',
'pxl_col_in_fullres',
'pxl_row_in_fullres',
]
positions.index = positions['barcode']
adata.obs = adata.obs.join(positions, how="left")
adata.obsm['spatial'] = adata.obs[
['pxl_row_in_fullres', 'pxl_col_in_fullres']
].to_numpy()
adata.obs.drop(
columns=['barcode', 'pxl_row_in_fullres', 'pxl_col_in_fullres'],
inplace=True,
)
return adata
def load_adata(tissue_dir):
adata1 = sc.read_mtx(tissue_dir + 'matrix.mtx')
adata1 = adata1.transpose()
cellname = pd.read_csv(tissue_dir + 'barcodes.tsv', sep="\t", header=None, index_col=1, names=['idx', 'barcode'])
cellname.index = cellname['idx']
featurename = pd.read_csv(tissue_dir + 'features.tsv', sep='\t', header=None, index_col=1, names=['gene_ids', 'feature_types'])
adata1.var = featurename
adata1.obs = cellname
adata1.uns["spatial"] = dict()
adata1 = to_spatial(adata1, tissue_dir, library_id='Library_1')
adata1 = adata1[adata1.obs.in_tissue == 1,:]
adata1.var_names_make_unique()
adata1.obs['unnormalized_counts'] = adata1.X.sum(axis=1).A1
sc.pp.normalize_total(adata1, inplace=True)
adata1.obs['normalized_counts'] = adata1.X.sum(axis=1).A1
sc.pp.log1p(adata1)
sc.pp.highly_variable_genes(adata1, flavor="seurat", n_top_genes=2000)
return adata1
⑤ Trouble-shooting 1. Warning message: “Invalid name supplied, making object name syntactically valid. New object name is X1160920F; see ?make.names for more details on syntax validity”
○ Resolves by making the directory name start with a letter instead of a number
○ This doesn’t apply to the given code since the directory name isn’t specifically mentioned
⑥ Trouble-shooting 2. Exception: File is missing one or more required datasets. (Python)
○ If the .h5 file is not created according to the specified syntax, the above error message is displayed.
⑶ Issue 2. (sc or ST) In case matrix.mtx.gz, barcodes.tsv, or features.tsv are corrupted:
① If the matrix.mtx.gz file is corrupted, read it in the R environment using readMM and restore the corrupted file using writeMM.
② For barcodes.tsv and features.tsv, read them in the R environment using read.table and restore the corrupted files using write.table.
⑷ Issue 3. (sc or ST) Extracting raw data from R or Python objects
① Solution 1. Code to extract matrix.mtx, barcodes.tsv, and features.tsv files from R Seurat object : Successful.
library(Seurat)
counts <- seurat_obj@assays$RNA@counts
barcodes <- rownames(seurat_obj@meta.data)
features <- rownames(counts)
library(Matrix)
writeMM(counts, file = "~/Downloads/matrix.mtx")
write.table(barcodes, file = "~/Downloads/barcodes.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
gene_ids <- features
gene_names <- features
features_df <- data.frame(gene_ids, gene_names)
write.table(features_df, file = "~/Downloads/features.tsv", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
⑸ Application 1. Python code that reports the read count for each barcode from an h5 file.
def get_mapped_reads(h5_file):
with h5py.File(h5_file, 'r') as f:
# Access the matrix group in the HDF5 file
matrix_group = f['matrix']
# Read the barcode, gene, and data datasets
barcodes = np.array(matrix_group['barcodes']).astype(str)
genes = np.array(matrix_group['features/name']).astype(str)
data = np.array(matrix_group['data'])
# Sum the data for each barcode to get the number of mapped reads per barcode
barcode_read_counts = np.array(matrix_group['indptr'][1:] - matrix_group['indptr'][:-1])
return barcodes, genes, barcode_read_counts
Input : 2022.01.04 09:04