1

我正在使用 Python 包ete3。我有树,例如:

((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);

我想查看与树中特定节点最密切相关的叶子(这里的树是Species1_order1)。在示例中,最密切相关的叶子是Species2_order2/Species3_order2Species4_order3/ Species5_order5

代码:

tree = ete3.Tree('((Species1_order1, \
                    (Species2_order2, Species3_order2)), \
                   Species4_order3, Species5_order5);')

新示例:

    tree=ete3.Tree('((((((A,B),C),D),(E,F)),G),(H,I));')

我得到的结果是:

         A    B    C    D    E    F    G    H    I
    A  0.0  2.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    B  2.0  0.0  3.0  4.0  6.0  6.0  6.0  8.0  8.0
    C  3.0  3.0  0.0  3.0  5.0  5.0  5.0  7.0  7.0
    D  4.0  4.0  3.0  0.0  4.0  4.0  4.0  6.0  6.0
    E  6.0  6.0  5.0  4.0  0.0  2.0  4.0  6.0  6.0
    F  6.0  6.0  5.0  4.0  2.0  0.0  4.0  6.0  6.0
    G  6.0  6.0  5.0  4.0  4.0  4.0  0.0  4.0  4.0
    H  8.0  8.0  7.0  6.0  6.0  6.0  4.0  0.0  2.0
    I  8.0  8.0  7.0  6.0  6.0  6.0  4.0  2.0  0.0

但是例如 E 和 F 在树中与 A、B、C 和 D 的距离相等,因此它们似乎比 D 更衣。

一个好的矩阵结果应该是:


        A   B   C   D   E   F   G   H   I
    A   0   1   2   3   4   4   5   6   6
    B   1   0   2   3   4   4   5   6   6
    C   2   2   0   3   4   4   5   6   6
    D   3   3   3   0   4   4   5   6   6
    E   4   4   4   4   0   1   5   6   6
    F   4   4   4   4   1   0   5   6   6
    G   5   5   5   5   5   5   0   6   6
    H   6   6   6   6   6   6   6   0   1
    I   6   6   6   6   6   6   6   1   0

是不是 ?

4

2 回答 2

2

正如评论中所讨论的,ete3给了我们一个名为 的函数Tree.get_closest_leaf,但它的输出不是预期的(我不确定这个值在这里代表什么):

>>> t=ete3.Tree('((Species1_order1,(Species2_order2,Species3_order2)),Species4_order3,Species5_order5);')
>>> t.get_closest_leaf('Species2_order2')
(Tree node 'Species4_order3' (0x115b2f29), 0.0)

相反,您可以像这样获得节点距离:

import ete3
import pandas as pd

def make_matrix(tree):
    def get_root_path(node):
        root_path = [node]
        if node.up:
            root_path.extend(get_root_path(node.up))
        return root_path
    leaves = tree.get_leaves()
    leaf_ct = len(leaves)
    paths = {node.name: set(get_root_path(node)) for node in leaves}
    col_lbls = [leaf.name for leaf in leaves]
    dist_matrix = pd.np.array([pd.np.zeros(leaf_ct)] * leaf_ct)
    df = pd.DataFrame(dist_matrix, index=col_lbls, columns=col_lbls)
    for node1_name, col in df.iteritems():
        for node2_name in col.keys():
            path = paths[node2_name].symmetric_difference(paths[node1_name])
            dist = sum(node.dist for node in path)
            df.at[node1_name, node2_name] = dist
            df.at[node2_name, node1_name] = dist
    return df

注意:由于几个原因,这是一个次优的解决方案,但这个问题并不是要求最有效的解决方案。有关系统发育距离矩阵方法的更多信息,请参阅此链接。

该解决方案还使用pandaswhich 是多余的,因为它实际上只是为了方便行/列标签。pandas删除依赖项并使用本机列表来代替并不难。

这是输出:

>>> tree=ete3.Tree('((Species1_order1, (Species2_order2, Species3_order2)), Species4_order3, Species5_order5);')
>>> make_matrix(tree)
                 Species1_order1  Species2_order2  Species3_order2  Species4_order3  Species5_order5
Species1_order1              0.0              3.0              3.0              3.0              3.0
Species2_order2              3.0              0.0              2.0              4.0              4.0
Species3_order2              3.0              2.0              0.0              4.0              4.0
Species4_order3              3.0              4.0              4.0              0.0              2.0
Species5_order5              3.0              4.0              4.0              2.0              0.0

对于发布的更新,我没有发现任何问题。它似乎给出了正确的结果。Interest_sequence这是由 ete3 渲染的树(我突出显示了从to的距离中计算的 4 个跃点Rhopalosiphum_maidis_Hemiptera):

Interest_sequence这是与之对应的矩阵列:

>>> m['Interest_sequence']
Rhopalosiphum_maidis__Hemiptera            4.0
Drosophila_novamexicana__Hemiptera         5.0
Drosophila_arizonae__Hemiptera             6.0
Drosophila_navojoa__Hemiptera              6.0
Interest_sequence                          0.0
Heliothis_virescens_droso_3a__nan          5.0
Mythimna_separata_droso__nan               6.0
Heliothis_virescens_droso_3i__nan          6.0
Scaptodrosophila_lebanonensis__Diptera     5.0
Mythimna_unipuncta_droso_A__nan            6.0
Xestia_c-nigrum_droso__nan                 8.0
Helicoverpa_armigera_droso__nan            8.0
Mocis_latipes_droso__nan                   7.0
Drosophila_busckii__Diptera                4.0
Drosophila_bipectinata__Diptera            5.0
Drosophila_mojavensis__Diptera             7.0
Drosophila_yakuba__Diptera                 7.0
Drosophila_hydei__Diptera                  7.0
Drosophila_serrata__Diptera                8.0
Drosophila_takahashii__Diptera             9.0
Drosophila_eugracilis__Diptera            11.0
Drosophila_ficusphila__Diptera            11.0
Drosophila_erecta__Diptera                12.0
Drosophila_melanogaster__Diptera          13.0
Sequence_A_nan__nan                       14.0
Drosophila_sechellia__Diptera             15.0
Drosophila_simulans__Diptera              15.0
Drosophila_suzukii__Diptera               12.0
Drosophila_biarmipes__Diptera             12.0
Name: Interest_sequence, dtype: float64
于 2020-02-27T13:49:09.507 回答
0

ete3 函数按预期工作,您只需要遍历树的节点,如下所示:

from ete3 import Tree t = Tree() # Creates an empty tree 
A = t.add_child(name="A") # Adds a new child to the current tree root and returns it 
B = t.add_child(name="B") # Adds a second child to the current tree root and returns it 
C = A.add_child(name="C") # Adds a new child to one of the branches D =
C.add_sister(name="D") # Adds a second child to same branch as before, but using a sister as the starting point 
R = A.add_child(name="R") # Adds a third child to the branch. Multifurcations are supported
# Next, I add 6 random leaves to the R branch names_library is an optional argument. If no names are provided, they will be generated randomly. 
R.populate(6, names_library=["r1","r2","r3","r4","r5","r6"])
for node in t.traverse():
    print(node.get_closest_leaf())

注意:如果内部节点是终端节点,它将有 2 个最近的叶子,使用 node.get_children() 获取它们

于 2021-03-22T15:12:16.213 回答