
The problem: a website I am trying to gather data from uses JavaScript to produce a graph. I'd like to pull the data that is used in the graph, but I am not sure where to start. For example, the data might look like this:

var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],
["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],
["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],
["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],
["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];

This is pricing data (Date, Price, Volume). I've found another question here, "Parsing variable data out of a js tag using python", which suggests using JSON and BeautifulSoup, but I am unsure how to apply it to this particular problem because the formatting is slightly different. In fact, the code here looks more like a Python list of lists than any kind of JSON dictionary.

I suppose I could read it in as a string and then use XPath and some funky string editing to convert it, but that seems like too much work for something that is already formatted as a JavaScript variable.

So, what can I do to pull this kind of organized data out of this variable using Python? (I am most familiar with Python and BS4.)

4 Answers

If your format really is just one or more var foo = [JSON array or object literal]; statements, you can write a dotall regex to extract them and then parse each one as JSON. For example:

>>> import json, re
>>> from pprint import pprint
>>> j = '''var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],
["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],
["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],
["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],
["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];'''
>>> # DOTALL lets "." span line breaks; MULTILINE lets "$" match at each line end
>>> values = re.findall(r'var.*?=\s*(.*?);\s*$', j, re.DOTALL | re.MULTILINE)
>>> for value in values:
...     pprint(json.loads(value))
[['Wed, 12 Jun 2013 01:00:00 +0000', 22.4916114807, '2 sold'],
 ['Fri, 14 Jun 2013 01:00:00 +0000', 27.4950008392, '2 sold'],
 ['Sun, 16 Jun 2013 01:00:00 +0000', 19.5499992371, '1 sold'],
 ['Tue, 18 Jun 2013 01:00:00 +0000', 17.25, '1 sold'],
 ['Sun, 23 Jun 2013 01:00:00 +0000', 15.5420341492, '2 sold'],
 ['Thu, 27 Jun 2013 01:00:00 +0000', 8.79045295715, '3 sold'],
 ['Fri, 28 Jun 2013 01:00:00 +0000', 10, '1 sold']]

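Of course, this makes some assumptions:

  • A semicolon at the end of a line must be an actual statement separator, not the middle of a string. This should be safe, because JS doesn't have Python-style multiline strings.
  • The code actually has a semicolon at the end of each statement, even though they are optional in JS. Most JS code has them, but that is obviously not guaranteed.
  • The array and object literals really are JSON-compatible. This is definitely not guaranteed; for example, JS can use single-quoted strings, but JSON can't. It does work for your example, though.
  • Your format really is this well-defined. For example, if there might be a statement like var line2 = [[1]] + line1; in the middle of your code, it is going to cause problems.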
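
As a minimal sketch of how those assumptions could be made explicit, the following collects every var NAME = VALUE; statement into a dict and skips any value that is not plain JSON instead of crashing on it. The function name extract_js_vars and the sample script are made up for illustration; they are not part of any library or of the page in the question.

```python
import json
import re

# Matches "var NAME = VALUE;" where the semicolon ends a line.
VAR_RE = re.compile(r'\bvar\s+(\w+)\s*=\s*(.*?);\s*$', re.DOTALL | re.MULTILINE)


def extract_js_vars(script):
    """Return {name: parsed value} for every JSON-compatible var statement."""
    found = {}
    for name, raw in VAR_RE.findall(script):
        try:
            found[name] = json.loads(raw)
        except json.JSONDecodeError:
            # Not a plain JSON literal (for example an expression); skip it.
            continue
    return found


# Hypothetical script body, including one statement that is not JSON.
sample = '''var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"]];
var line2 = [[1]] + line1;
var label = "price history";
'''

result = extract_js_vars(sample)
print(sorted(result))
```

Here line2 is silently dropped because it is an expression rather than a literal; whether to skip, log, or raise in that case depends on how much you trust the page.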

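Note that if the data might contain JavaScript literals that are not all valid JSON but are all valid Python literals (which is unlikely, but not impossible), you can use ast.literal_eval on them instead of json.loads. But I wouldn't do that unless you know this is the case.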
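
A small sketch of that fallback, assuming a hypothetical snippet that uses single-quoted strings (valid in JS and in Python, but not in JSON). JSON is tried first and ast.literal_eval is used only when that fails:

```python
import ast
import json

# Hypothetical snippet: single-quoted strings are not valid JSON.
raw = "[['Wed, 12 Jun 2013 01:00:00 +0000', 22.4916114807, '2 sold']]"

try:
    rows = json.loads(raw)
    parser_used = "json"
except json.JSONDecodeError:
    # literal_eval only accepts literals (lists, strings, numbers, ...),
    # so unlike eval it does not run arbitrary code.
    rows = ast.literal_eval(raw)
    parser_used = "ast.literal_eval"

print(parser_used, rows[0][1])
```

Trying JSON first matters because true, false, and null are valid JSON but are not Python literals, so ast.literal_eval would reject data that json.loads handles fine.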

answered 2013-08-21T22:08:33.517

OK, so there are a few ways to do this, but I ended up simply using a regular expression to find everything between line1= and ;

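import re

# Read page data as a string (sock is an already opened URL or file object;
# decode the result first if read() returns bytes)
pageData = sock.read()
# Match everything between "line1=" and the next ";". DOTALL lets "." cross
# line breaks, and the lazy ".*?" stops at the first semicolon.
p = re.compile(r'(?<=line1=)(.*?)(?=;)', re.DOTALL)
# find all instances of the regular expression in pageData
parsed = p.findall(pageData)
# evaluate the match as Python code => turn it into a Python list
# (risky: eval executes whatever the page contains)
newParsed = eval(parsed[0].strip())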

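Regular expressions are nice when the page is cleanly coded, but is this method any better (edit: or worse!) than the other answers here?

Edit: I ultimately used the following: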

import json
import re

# Read page data as a string (sock is an already opened URL or file object)
pageData = sock.read()
# Match everything between "line1=" and the next ";" (DOTALL so the match can span lines)
p = re.compile(r'(?<=line1=)(.*?)(?=;)', re.DOTALL)
# find all instances of the regular expression in pageData
parsed = p.findall(pageData)
# load as JSON instead of using eval to prevent risky execution of unknown code
newParsed = json.loads(parsed[0])
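
The steps above could be wrapped in a small helper along these lines. This is only a sketch: extract_var and the sample page are hypothetical names and data. One detail it relies on is that "." does not match a newline by default, so re.DOTALL is needed whenever the array spans several lines, as it does in the question.

```python
import json
import re


def extract_var(page_data, name):
    """Return the JSON-decoded value assigned to the JS variable `name`."""
    # re.escape guards against names containing regex metacharacters;
    # the lookbehind avoids matching the tail of a longer identifier.
    pattern = re.compile(
        r'(?<![\w$])' + re.escape(name) + r'\s*=\s*(.*?);', re.DOTALL
    )
    match = pattern.search(page_data)
    if match is None:
        raise ValueError("variable %r not found" % name)
    return json.loads(match.group(1))


# Hypothetical page content for illustration.
page_data = '''<script>
var line1=
[["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];
</script>'''

rows = extract_var(page_data, "line1")
print(len(rows), rows[-1])
```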
answered 2013-08-21T22:14:20.273

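The following makes a few assumptions, such as knowing how the page is formatted, but one way of getting your example into memory in Python looks like this: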

# example data
data = 'foo bar foo bar foo bar foo bar\r\nfoo bar foo bar foo bar foo bar \r\nvar line1=\r\n[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],\r\n["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],\r\n["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],\r\n["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],\r\n["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],\r\n["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],\r\n["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];\r\nfoo bar foo bar foo bar foo bar\r\nfoo bar foo bar foo bar foo bar'
# find your variable's start and end
x = data.find('line1=') + 6
y = data.find(';', x)
# so you can get just the relevant bit
interesting = data[x:y].strip()
# most dangerous step! don't do this on unknown sources
parsed = eval(interesting)
# maybe you'd want to use JSON instead, if the data has the right syntax
from json import loads as JSON
parsed = JSON(interesting)
# now parsed is your data
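
A possible variation on the find-based approach above, sketched with a shortened made-up data string: json.JSONDecoder().raw_decode parses one JSON value starting at a given index and reports where it ended, so there is no need to search for the closing semicolon (which also means a ";" inside a string value cannot cut the data short).

```python
import json

# Shortened, hypothetical page content.
data = 'foo bar\r\nvar line1=\r\n[["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],\r\n["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];\r\nfoo bar; baz'

marker = 'line1='
start = data.find(marker)
if start == -1:
    raise ValueError('marker not found')
start += len(marker)

# raw_decode does not skip leading whitespace, so step over it manually.
while data[start].isspace():
    start += 1

# Returns the decoded value and the index just past it.
parsed, end = json.JSONDecoder().raw_decode(data, start)
print(parsed[0][1], data[end])
```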
answered 2013-08-21T22:01:22.937

Assuming you have a Python variable holding a line or block of JavaScript as a string, such as "var line1 = [[a,b,c], [d,e,f]];", you can use the following few lines of code.

>>> code = """var line1 = [['a','b','c'], ['d','e','f'], ['g','h','i']];"""
>>> python_readable_code = code.strip("var ;")
>>> exec(python_readable_code)
>>> print(line1)
[['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]

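exec() runs code that is formatted as a string. In this case, it sets the variable line1 to a list of lists. As with eval, only do this with input you trust.

Then you can use something like this: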

for row in line1:  # "row" avoids shadowing the built-in list type
    print(row[0], row[1], row[2])
    # Or do something else with those values, like save them to a file
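
As one example of "something else", here is a sketch that turns each (Date, Price, Volume) row from the question into typed values. It assumes the dates stay in the RFC 2822 style shown in the question and that the volume always looks like "<number> sold"; the records name is made up.

```python
from email.utils import parsedate_to_datetime

# Two rows copied from the question's sample data.
line1 = [
    ["Wed, 12 Jun 2013 01:00:00 +0000", 22.4916114807, "2 sold"],
    ["Fri, 28 Jun 2013 01:00:00 +0000", 10, "1 sold"],
]

records = []
for date_text, price, volume_text in line1:
    records.append({
        # The dates are in RFC 2822 style, which email.utils can parse.
        "date": parsedate_to_datetime(date_text),
        "price": float(price),
        # Assumes the volume always looks like "<number> sold".
        "sold": int(volume_text.split()[0]),
    })

print(records[0]["date"].isoformat(), records[1]["sold"])
```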
answered 2013-08-21T22:03:44.890