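I originally asked this question about performing this task with the httr package, but I don't think it's possible using httr. So I've rewritten my code to use RCurl instead, but I'm still tripping up on something, probably related to the writefunction, and I really don't understand why.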

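You should be able to reproduce my work by using the 32-bit version of R, so that you hit the memory limit if you read anything into RAM. I need a solution that downloads directly to the hard disk.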

To start, this code works: the zipped file is appropriately saved to disk.

library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk

Now here's some RCurl code that does not work. As stated in the previous question, reproducing this exactly will require creating an extract on ipums.

your.email <- "email@address.com"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"

library(RCurl)

values <- 
    list(
        "login[email]" = your.email , 
        "login[password]" = your.password , 
        "login[is_for_login]" = 1
    )

curl = getCurlHandle()

curlSetOpt(
    cookiejar = 'cookies.txt', 
    followlocation = TRUE, 
    autoreferer = TRUE, 
    ssl.verifypeer = FALSE,
    curl = curl
)

params <- 
    list(
        "login[email]" = your.email , 
        "login[password]" = your.password , 
        "login[is_for_login]" = 1
    )

html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)

Now that I'm logged in, try the same commands as above, but using the curl object to keep the cookies.

filename <- tempfile()
f <- CFILE(filename, mode = "wb")

This line breaks:

curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)

# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) : 
  embedded nul in string: [[binary gibberish here]]
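
The error message itself can be reproduced without any network access. The sketch below is only an illustration in base R: it shows that converting bytes containing a NUL to a character string fails, while writing the same bytes to a file in binary form does not. The byte values are made up for the demonstration, and the sketch does not establish where inside RCurl such a conversion happens.

```r
# Five made-up bytes with a NUL in the middle, similar to what a
# compressed (binary) download may contain.
bytes <- as.raw(c(0x1f, 0x8b, 0x08, 0x00, 0x41))

# Converting the bytes to an R character string fails, because R strings
# cannot hold an embedded NUL. Capture the error message instead of stopping.
msg <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
print(msg)

# Writing the same bytes straight to a file in binary form works.
path <- tempfile()
writeBin(bytes, path)
print(file.info(path)$size)
```

This is consistent with the idea that the download must reach the disk as raw bytes and never pass through an R string.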

The answers to my previous post referred me to this C-level writefunction answer, but I'm clueless about how to re-create that curl_writer C program (on Windows?).

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)

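...or why it's even necessary, given that the five lines of code at the top of this question work without anything like getNativeSymbolInfo. I just don't understand why passing in that extra curl object, which stores the authentication/cookies and tells it not to verify SSL, would cause code that otherwise works to break?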

2 Answers

  1. Create a file named curl_writer.c from this link and save it to C:\<folder where you save your R files>

    #include <stdio.h>
    
    /**
     * Original code just sent some message to stderr
     */
    size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
        fwrite(buffer,size,nmemb,(FILE *)stream);
        return size * nmemb;
    }
    
  2. Open a command window, go to the folder where you saved curl_writer.c, and run the R compiler

    c:> cd "C:\<folder where you save your R files>"
    c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
    
  3. Open R and run your script

    C:> R
    
    your.email <- "email@address.com"
    your.password <- "password"
    extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"
    
    library(RCurl)
    
    values <- 
        list(
            "login[email]" = your.email , 
            "login[password]" = your.password , 
            "login[is_for_login]" = 1
        )
    
    curl = getCurlHandle()
    
    curlSetOpt(
        cookiejar = 'cookies.txt', 
        followlocation = TRUE, 
        autoreferer = TRUE, 
        ssl.verifypeer = FALSE,
        curl = curl
    )
    
    params <- 
        list(
            "login[email]" = your.email , 
            "login[password]" = your.password , 
            "login[is_for_login]" = 1
        )
    
    html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
    dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
    
    # Load the DLL you created
    # "writer" is the name of the function
    # "curl_writer" is the name of the dll
    dyn.load("curl_writer.dll")
    writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
    
    # Note that the "URL" parameter is upper case here; in your code it is lower case.
    # I'm not sure whether that has anything to do with the error.
    # "writer" is the symbol defined above
    f <- CFILE(filename <- tempfile(), "wb")
    curlPerform(URL=extract.path, writedata=f@ref, writefunction=writer, curl=curl)
    close(f)
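
After the download finishes, it may be worth confirming that the file on disk really is a gzip archive and not, say, an HTML login page. The sketch below is one way to check this in base R. `is_gzip_file` is a hypothetical helper name, not part of RCurl, and the demonstration uses a small locally created file, not the ipums extract.

```r
# Hypothetical helper: TRUE when the file starts with the gzip magic bytes 1f 8b.
is_gzip_file <- function(path) {
    header <- readBin(path, what = "raw", n = 2L)
    identical(header, as.raw(c(0x1f, 0x8b)))
}

# Self-contained demonstration: write a tiny gzip-compressed CSV...
demo_gz <- tempfile(fileext = ".csv.gz")
con <- gzfile(demo_gz, "w")
writeLines(c("a,b", "1,2"), con)
close(con)

# ...and a plain-text file for comparison.
demo_txt <- tempfile(fileext = ".html")
writeLines("<html>login page</html>", demo_txt)

print(is_gzip_file(demo_gz))
print(is_gzip_file(demo_txt))
```

In the script above, the same check would be `is_gzip_file(filename)` once `close(f)` has run.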
    
answered 2013-07-06T20:07:31.627

This is now possible with the httr package. Thanks, Hadley!

https://github.com/hadley/httr/issues/44
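
A minimal sketch of what this might look like, assuming a version of httr that provides `write_disk()` is installed. `download_to_disk` is a hypothetical wrapper written for illustration, not an httr function, and whether the ipums login flow from the question works unchanged through httr is an assumption that has not been verified here.

```r
# Hypothetical wrapper around httr: stream the response body of `url`
# to `destfile` on disk, without holding it in memory.
download_to_disk <- function(url, destfile = tempfile(), ...) {
    response <- httr::GET(url, httr::write_disk(destfile, overwrite = TRUE), ...)
    httr::stop_for_status(response)  # raise an R error on HTTP 4xx/5xx
    invisible(destfile)
}

# Usage sketch (needs network access, so it is left commented out):
# zip_path <- download_to_disk(
#     "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip",
#     tempfile(fileext = ".zip")
# )
# file.info(zip_path)$size
```

httr reuses one handle per host, so cookies set by an earlier `httr::POST()` login to the same host should still be available to the later `GET()`.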

answered 2014-10-02T08:59:54.543