Xarray 使用 pandas 索引/列标签作为默认元数据。当所有变量共享相同的维度时,您可以在单个函数调用中进行转换,但如果不同的变量具有不同的维度,则需要分别从 pandas 转换它们,然后将它们放在 xarray 端。例如:
import pandas as pd
import io
import xarray
# read your data
cell_metadata = pd.read_csv(io.StringIO(u"""\
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F"""))
counts = pd.read_csv(io.StringIO(u"""\
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65"""))
# build the output
xarray_counts = xarray.DataArray(counts.set_index('cell'), dims=['cell', 'gene'])
xarray_counts.coords.update(cell_metadata.set_index('cell').to_xarray())
print(xarray_counts)
这会产生一个漂亮、整洁xarray.DataArray
的计数:
<xarray.DataArray (cell: 5, gene: 20)>
array([[308, 289, 81, 0, 4, 88, 52, 0, 0, 104, 65, 0, 1, 0,
9, 8, 12, 283, 12, 37],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0],
[375, 325, 70, 0, 2, 72, 36, 13, 0, 60, 105, 0, 13, 0,
0, 29, 15, 264, 0, 65]])
Coordinates:
* cell (cell) object 'A1-MAA100140-3_57_F-1-1' ...
* gene (gene) object '0610005C13Rik' ...
Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717
Number of input reads (cell) int64 502312 360285 431800 446705 918
EXP_ID (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
TAXON (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
WELL_MAPPING (cell) object 'MAA100140' 'MAA100140' ...
Lysis Plate Batch (cell) float64 nan nan nan nan nan
dNTP.batch (cell) float64 nan nan nan nan nan
oligodT.order.no (cell) float64 nan nan nan nan nan
plate.type (cell) object 'Biorad 96well' ...
preparation.site (cell) object 'Stanford' 'Stanford' ...
date.prepared (cell) float64 nan nan nan nan nan
date.sorted (cell) int64 170720 170720 170720 170720 ...
tissue (cell) object 'Liver' 'Liver' 'Liver' ...
subtissue (cell) object 'Hepatocytes' 'Hepatocytes' ...
mouse.id (cell) object '3_57_F' '3_57_F' '3_57_F' ...
FACS.selection (cell) float64 nan nan nan nan nan
nozzle.size (cell) float64 nan nan nan nan nan
FACS.instument (cell) float64 nan nan nan nan nan
Experiment ID (cell) float64 nan nan nan nan nan
Columns sorted (cell) float64 nan nan nan nan nan
Double check (cell) float64 nan nan nan nan nan
Plate (cell) float64 nan nan nan nan nan
Location (cell) float64 nan nan nan nan nan
Comments (cell) float64 nan nan nan nan nan
mouse.age (cell) int64 3 3 3 3 3
mouse.number (cell) int64 57 57 57 57 57
mouse.sex (cell) object 'F' 'F' 'F' 'F' 'F'
如果您想要一个 Dataset,请将 DataArray 对象放入 Dataset 构造函数中,例如,
# shouldn't really need to use .data_vars here, that might be an xarray bug
>>> xarray.Dataset({'counts': xarray.DataArray(counts.set_index('cell'),
... dims=['cell', 'gene'])},
... coords=cell_metadata.set_index('cell').to_xarray().data_vars) <xarray.Dataset>
Dimensions: (cell: 5, gene: 20)
Coordinates:
* cell (cell) object 'A1-MAA100140-3_57_F-1-1' ...
* gene (gene) object '0610005C13Rik' ...
Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717
Number of input reads (cell) int64 502312 360285 431800 446705 918
EXP_ID (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
TAXON (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
WELL_MAPPING (cell) object 'MAA100140' 'MAA100140' ...
Lysis Plate Batch (cell) float64 nan nan nan nan nan
dNTP.batch (cell) float64 nan nan nan nan nan
oligodT.order.no (cell) float64 nan nan nan nan nan
plate.type (cell) object 'Biorad 96well' ...
preparation.site (cell) object 'Stanford' 'Stanford' ...
date.prepared (cell) float64 nan nan nan nan nan
date.sorted (cell) int64 170720 170720 170720 170720 ...
tissue (cell) object 'Liver' 'Liver' 'Liver' ...
subtissue (cell) object 'Hepatocytes' 'Hepatocytes' ...
mouse.id (cell) object '3_57_F' '3_57_F' '3_57_F' ...
FACS.selection (cell) float64 nan nan nan nan nan
nozzle.size (cell) float64 nan nan nan nan nan
FACS.instument (cell) float64 nan nan nan nan nan
Experiment ID (cell) float64 nan nan nan nan nan
Columns sorted (cell) float64 nan nan nan nan nan
Double check (cell) float64 nan nan nan nan nan
Plate (cell) float64 nan nan nan nan nan
Location (cell) float64 nan nan nan nan nan
Comments (cell) float64 nan nan nan nan nan
mouse.age (cell) int64 3 3 3 3 3
mouse.number (cell) int64 57 57 57 57 57
mouse.sex (cell) object 'F' 'F' 'F' 'F' 'F'
Data variables:
counts (cell, gene) int64 308 289 81 0 4 88 52 0 ...