keras - String input/output representation in an RNN variational autoencoder

Question

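I'm looking at "Molecular autoencoder lets us interpolate and do gradient-based optimization of compounds": https://arxiv.org/pdf/1610.02415.pdf

The paper takes an input SMILES string (a text representation of a molecule) and maps it into a 2D latent space using a variational encoder.

Example SMILES string for hexan-3-ol: "CCCC(O)CC"

In the paper, short strings are padded to 120 characters with spaces.

The paper uses a stack of 1D convolutional networks to encode the SMILES string into a latent representation.

It then uses 3 gated recurrent units (GRUs) to map a position in the latent space back to a SMILES string.

The problem I have in understanding this paper is working out what the input and output structures look like.

The paper is a bit vague about both. From the use of 1D conv nets, I suspect the input is a vectorized representation similar to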

'C' = 1
'O' = 2
'(' = 3
')' = 4
' ' = 0 #for padding

#so the hexan-3-ol smiles above would be 

[1,1,1,1,3,2,4,1,1,0...padding to fixed length]
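
In Python, this guess would look roughly like the sketch below. The mapping table and the helper name `encode_as_ints` are my own illustrative assumptions, not something taken from the paper.

```python
# Hypothetical character-to-integer table; ' ' (index 0) is the padding character.
CHAR_TO_INT = {' ': 0, 'C': 1, 'O': 2, '(': 3, ')': 4}


def encode_as_ints(smiles, max_length=120):
    """Map each character to its integer code and right-pad with 0."""
    if len(smiles) > max_length:
        raise ValueError("SMILES string longer than max_length")
    padded = smiles.ljust(max_length)  # pad with spaces on the right
    return [CHAR_TO_INT[ch] for ch in padded]


# Shortened to 12 positions here so the result is easy to read.
print(encode_as_ints("CCCC(O)CC", max_length=12))
# [1, 1, 1, 1, 3, 2, 4, 1, 1, 0, 0, 0]
```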

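About the output, the paper says

"The last layer of the RNN decoder defines a probability distribution over all possible characters at each position in the SMILES string"

So for the maximum SMILES length of 120 used in the paper, with 35 possible SMILES characters, does that mean the output is a [120x35] array?

Carrying that logic forward, does it suggest that the input is instead a flattened [120*35] array, bearing in mind that it's an autoencoder?

My problem with that is the 1D conv, which uses a maximum filter length of 9. If the input is a flattened [120*35] array, that is not enough to reach the next atom in the sequence.

Thanks for your help...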

score 2 · Accepted Answer

The definition of SMILES is more complicated than you might think, since it is a linear representation of a graph.

https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system

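In short, a letter designates an atom, such as C = carbon and O = oxygen. The graph can be branched with parentheses, i.e. C(C)C forms a "Y" structure. Finally, cycles can be created with closures written as digits, i.e. "C1CCC1" forms a square (the atom labelled 1 is bonded to the other atom labelled 1).

Note that this description is not complete, but it should be a good grounding.

If a string is a valid SMILES string, simply appending it to another valid SMILES string most often produces another valid string, i.e. "C1CC1" + "C1CC1" => "C1CC1C1CC1" is valid.

Often, one can extract a linear portion of a SMILES string and "embed" it into another one, and the result is again a valid SMILES string.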
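
The branching and ring-closure rules above can be made concrete with a small sketch. The helper `looks_balanced` is a hypothetical name, and it only checks two necessary conditions: parentheses are balanced, and every single-digit ring label is opened and then closed. It is not a SMILES validator (it skips bracket atoms and does not handle `%nn` ring labels); for real validation a toolkit such as RDKit is the better choice.

```python
def looks_balanced(smiles):
    """Rough structural check: balanced parentheses and paired ring digits.

    Assumption: only single-digit ring labels are used; text inside [...] is skipped.
    """
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring labels that have been opened but not yet closed
    in_bracket = False   # True while inside a bracket atom such as [NH4+]
    for ch in smiles:
        if in_bracket:
            in_bracket = ch != ']'
        elif ch == '[':
            in_bracket = True
        elif ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False  # closing a branch that was never opened
        elif ch.isdigit():
            # first occurrence opens the ring, second occurrence closes it
            open_rings.symmetric_difference_update({ch})
    return depth == 0 and not open_rings and not in_bracket


print(looks_balanced("C1CC1" + "C1CC1"))  # True
print(looks_balanced("C1CCC"))            # False
```

Concatenating two balanced strings keeps both counters balanced, which is one way to see why the appended example stays well formed.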

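What I believe the autoencoder is learning is how to make these transformations. A silly example, swapping the halide (chlorine, bromine, iodine) attached to the ring from the example above, might be: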

C1CCC1Cl C1CCC1Br C1CCC1I

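The autoencoder learns the constant part and the variable part, but in linear string space. This isn't perfect: if you look at the paper, when exploring the continuously differentiable space they need to find the nearest valid SMILES string.

If you want to explore SMILES strings, all of the ones used in the paper were generated using rdkit: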

https://github.com/rdkit/

In full disclosure, I help maintain this. Hopefully this helps.

score 1 · Accepted Answer

You can find the source code here:

https://github.com/maxhodak/keras-molecules

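I have been playing around with it, and the input and output structures are MxN matrices, where M is the maximum length of the SMILES string (120 in this case) and N is the size of the character set. Each row of M is a vector of zeros except at the position where the character at position M_i matches character N_j. To decode an output matrix into a SMILES string, you go row by row and look up the matching character position in the character set.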
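
As a minimal sketch of that MxN layout (the five-character `CHARSET` and the helper names `one_hot_encode` and `one_hot_decode` are assumptions made for illustration, not code from the linked repository):

```python
import numpy as np

# Hypothetical small charset; the padding character ' ' comes first.
CHARSET = [' ', 'C', 'O', '(', ')']
MAX_LENGTH = 120


def one_hot_encode(smiles, charset=CHARSET, max_length=MAX_LENGTH):
    """Return a (max_length, len(charset)) matrix with a single 1 per row."""
    if len(smiles) > max_length:
        raise ValueError("SMILES string longer than max_length")
    matrix = np.zeros((max_length, len(charset)), dtype=np.float32)
    for i, ch in enumerate(smiles.ljust(max_length)):
        matrix[i, charset.index(ch)] = 1.0
    return matrix


def one_hot_decode(matrix, charset=CHARSET):
    """Pick the highest-scoring character in each row and strip the padding."""
    return ''.join(charset[j] for j in matrix.argmax(axis=1)).rstrip()


x = one_hot_encode("CCCC(O)CC")
print(x.shape)            # (120, 5)
print(one_hot_decode(x))  # CCCC(O)CC
```

With the 35 characters mentioned in the question the matrix would be (120, 35). A Keras 1D convolution slides along the first axis of each sample and treats the second axis as channels, so a filter of length 9 sees 9 consecutive characters rather than 9 entries of a flattened array.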

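A problem with this encoding is that it takes up a lot of memory. Following the approach of the keras image iterator, you can do the following:

First encode all SMILES strings into a "sparse" format, which is a list of character set positions for each SMILES string in your set.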
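
A rough sketch of that step, assuming a plain character-level charset with the padding space at index 0 (the helper names `build_charset` and `to_sparse` are hypothetical, and two-letter atoms such as Cl are split into two characters here):

```python
def build_charset(smiles_list):
    """Collect every character in the data set; the padding space goes first."""
    chars = set(''.join(smiles_list))
    chars.discard(' ')
    return [' '] + sorted(chars)


def to_sparse(smiles, charset):
    """Replace each character by its position in the charset."""
    index_of = {ch: i for i, ch in enumerate(charset)}
    return [index_of[ch] for ch in smiles]


smiles_list = ["CCCC(O)CC", "C1CCC1Cl", "C1CCC1Br"]
charset = build_charset(smiles_list)
X = [to_sparse(s, charset) for s in smiles_list]
print(charset)
print(X[0])
```

Each molecule is then stored as at most 120 small integers instead of a full 120 x N matrix.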

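Now you have a character set defined over all SMILES strings (charset), and each SMILES string is a list of numbers giving the position of each character in that character set. You can then use an iterator to do the one-hot encoding on the fly while training the keras model with the fit_generator function.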

import threading

import numpy as np


class SmilesIterator(object):
    """Yields one-hot encoded batches from SMILES stored as lists of charset indices."""

    def __init__(self, X, charset, max_length, batch_size=256, shuffle=False, seed=None):
        self.X = X
        self.charset = charset
        self.max_length = max_length
        self.N = len(X)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.batch_index = 0
        self.total_batches_seen = 0
        self.lock = threading.Lock()
        self.index_generator = self._flow_index(len(X), batch_size, shuffle, seed)

    def reset(self):
        self.batch_index = 0

    def __iter__(self):
        return self

    def _flow_index(self, N, batch_size, shuffle=False, seed=None):
        self.reset()
        while True:
            if self.batch_index == 0:
                # start of an epoch: rebuild (and optionally reshuffle) the index order
                index_array = np.arange(N)
                if shuffle:
                    if seed is not None:
                        np.random.seed(seed + self.total_batches_seen)
                    index_array = np.random.permutation(N)
            current_index = (self.batch_index * batch_size) % N
            if N >= current_index + batch_size:
                current_batch_size = batch_size
                self.batch_index += 1
            else:
                # last, smaller batch of the epoch
                current_batch_size = N - current_index
                self.batch_index = 0
            self.total_batches_seen += 1
            yield (index_array[current_index: current_index + current_batch_size],
                   current_index, current_batch_size)

    def next(self):
        with self.lock:
            index_array, current_index, current_batch_size = next(self.index_generator)
        # one-hot encoding is not under lock and can be done in parallel
        # reserve room for the one-hot encoded batch:
        # (batch, max_length, charset_length)
        batch_x = np.zeros((current_batch_size, self.max_length, len(self.charset)))
        for i, j in enumerate(index_array):
            batch_x[i] = self._one_hot(self.X[j])
        return (batch_x, batch_x)  # fit_generator expects (input, target)

    __next__ = next  # Python 3 iterator protocol

    def _one_hot(self, sparse_smile):
        ss = []
        counter = 0
        for s in sparse_smile:
            cur = [0] * len(self.charset)
            cur[s] = 1
            ss.append(cur)
            counter += 1
        # pad up to max_length with the padding character;
        # make sure space ' ' is first in the charset
        for i in range(counter, self.max_length):
            cur = [0] * len(self.charset)
            cur[0] = 1
            ss.append(cur)
        return np.array(ss)
