I have a PySpark UDF defined as follows:
from pyspark.sql.functions import udf
from rdkit import Chem

input_smile = 'CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'
converted_smile_in = Chem.MolToSmiles(Chem.MolFromSmiles(input_smile))

def convertSmile(smile):
    return Chem.MolToSmiles(Chem.MolFromSmiles(smile))

applyconvertSmileUdf = udf(convertSmile)
data_converted = data_converted.withColumn("converted_smile", applyconvertSmileUdf(data_filtered.smiles))

if __name__ == "__main__":
    # using the new approach
    data_converted.filter(data_converted.converted_smile == converted_smile_in).select("id", "smiles").show()
else:
    print("Cannot convert!")
The comparison between data_converted.converted_smile and converted_smile_in raises an error. I printed about 20 values of converted_smile and they look fine. Can't we do a string comparison this way?
Boost.Python.ArgumentError: Python argument types in rdkit.Chem.rdmolfiles.MolToSmiles(NoneType) did not match C++ signature: MolToSmiles(RDKit::ROMol mol, bool isomericSmiles=True, bool kekuleSmiles=False, int rootedAtAtom=-1, bool canonical=True, bool allBondsExplicit=False, bool allHsExplicit=False, bool doRandom=False)
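
For what it's worth, the traceback shows MolToSmiles receiving a NoneType. Chem.MolFromSmiles returns None when it cannot parse its input, so the failure is probably inside the UDF and not in the string comparison itself. Because Spark evaluates lazily, showing 20 rows runs the UDF on only those rows, while the filter has to run it on every row, including any invalid or null SMILES. Below is a minimal sketch of a None-safe variant. The names convert_smile_safe and add_converted_column, and the optional parse/write parameters (there only so the guard can be exercised without RDKit), are hypothetical and not part of the original code.

```python
def convert_smile_safe(smile, parse=None, write=None):
    """Return the canonical SMILES, or None if the input is null or unparsable.

    parse/write are hypothetical hooks; by default they resolve to RDKit's
    Chem.MolFromSmiles and Chem.MolToSmiles.
    """
    if parse is None or write is None:
        # Imported lazily so the module is resolved on the worker that runs the UDF.
        from rdkit import Chem
        parse, write = Chem.MolFromSmiles, Chem.MolToSmiles
    if smile is None:          # null value in the column
        return None
    mol = parse(smile)
    if mol is None:            # RDKit could not parse this string
        return None
    return write(mol)


def add_converted_column(df, column="smiles"):
    """Attach a nullable 'converted_smile' column to df (assumes a string column)."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    safe_udf = udf(convert_smile_safe, StringType())
    return df.withColumn("converted_smile", safe_udf(col(column)))
```

With this variant, rows that fail to convert get a null in converted_smile. An equality filter such as data_converted.converted_smile == converted_smile_in evaluates to null for those rows, so they are dropped from the result and no exception is raised.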