
I have a problem. Because of a bug, I have a lot of invalid JSON strings like the following:

{
    "d": {
        "results": [
            {
                "__metadata": {
                    "uri": "https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non supporting iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=0&$top=1",
                    "type": "WebResult"
                },
                "ID": "7858fc9f-6bd5-4102-a835-0fa89e9f992a",
                "Title": "something good",
                "Description": "something "WRONG" here!",
                "DisplayUrl": "www.devx.com/Java/Article/27685/1954",
                "Url": "http://www.devx.com/Java/Article/27685/1954"
            }
        ],
        "__next": "https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non%20supporting%20iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=50"
    }
}

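As you can see, the Description field contains a bad string (double quotes inside a double-quoted value), so I cannot parse the JSON with PHP's json_decode; it returns NULL. I have 1 million of these broken JSON documents, each much larger than this one (about 10 times). How can I fix this in PHP?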


1 Answer


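In your case you can make use of the fact that a string in JSON cannot span more than one line. That gives you a convenient handle: a multi-line aware search and replace with a regular expression function such as preg_replace_callback in PHP.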

 /^\s+"[a-z_"]+": "([^"]*".*)",?$/mi

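Reading the pattern from left to right: whitespace at the beginning of the line; the member name as a string, in the form of a valid name (only letters and underscores here); the :, then the broken string up to the end of the line, optionally followed by a comma (,?).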
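To see which lines the pattern picks out, it can be run with preg_match_all before anything is replaced. This is only a small sketch: the sample in $like is a shortened, hypothetical version of the data in the question, and it assumes the JSON is pretty-printed with one member per line.

```php
<?php
// Shortened, hypothetical sample modelled on the data in the question.
$like = <<<'JSON'
{
    "d": {
        "results": [
            {
                "Title": "something good",
                "Description": "something "WRONG" here!",
                "Url": "http://www.devx.com/Java/Article/27685/1954"
            }
        ]
    }
}
JSON;

// The pattern from above, unchanged.
$pattern = '/^\s+"[a-z_"]+": "([^"]*".*)",?$/mi';

// Collect every line that matches; group 1 is the content of the broken string.
preg_match_all($pattern, $like, $matches);

print_r($matches[1]);
```

With this sample only the Description line should be reported, because it is the only string value with an extra double quote before the end of the line.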

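This regex already matches only the invalid lines. However, if your JSON also contains valid strings with an escaped quote (\"), this regex does not really work, because it matches those valid lines as well.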

So it is better to add some checks to make sure the replacement does what is intended.

$like = '... json-like but broken json string as in question ...';

// Fixing #1: member strings containing double-quotes on the same line.

$fix1Pattern   = '/^(\s+"[a-z_]+": ")([^"]*".*)(",?)$/mi';

$fix1Callback  = function ($matches) {
    list($full, $prefix, $string, $postfix) = $matches;
    $fixed = strtr($string, ['"' => '\"']);
    if (!is_string(json_decode("\"$fixed\""))) {
        throw new Exception('Fix #1 did not work as intended');
    }
    return "$prefix$fixed$postfix";
};


// apply fix1 onto the string

$buffer = preg_replace_callback($fix1Pattern, $fix1Callback, $like);


// test if it finally works

print_r(json_decode($buffer));
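If some strings already contain correctly escaped quotes (\"), escaping every quote would escape them a second time and the check in the callback would throw. The following is a minimal sketch of one possible variation, not part of the fix above. It assumes that a backslash directly in front of a quote always means the quote is already escaped, and that each member sits on its own line. The names $pattern and $callback and the sample data are made up for illustration.

```php
<?php
// Hypothetical sample: one broken value, one value that is already valid.
$like = <<<'JSON'
{
    "Broken": "something "WRONG" here!",
    "Valid": "already \"escaped\" here"
}
JSON;

// Capture everything between the opening quote and the last quote on the line.
$pattern = '/^(\s+"[a-z_]+": ")(.*)(",?)$/mi';

$callback = function ($matches) {
    list(, $prefix, $string, $postfix) = $matches;
    // strtr() tries the longest keys first, so escaped backslashes and
    // escaped quotes are copied as they are; only bare quotes get a backslash.
    $fixed = strtr($string, [
        '\\\\' => '\\\\',
        '\\"'  => '\\"',
        '"'    => '\\"',
    ]);
    // Same safety check as in fix #1.
    if (!is_string(json_decode("\"$fixed\""))) {
        throw new Exception('Escaping did not work as intended');
    }
    return "$prefix$fixed$postfix";
};

$buffer = preg_replace_callback($pattern, $callback, $like);

print_r(json_decode($buffer, true));
```

Both members should decode to strings that contain plain double quotes. The assumption is a real limitation: a stray backslash in front of a broken quote would be taken for a valid escape.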

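Keep in mind that this is limited. You will probably need to learn about regular expressions first, which are a world of their own. But the principle is usually very similar: you search the string for patterns that identify the broken parts, then you do some string manipulation to fix those parts.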

If the JSON string is broken in worse ways, it will need more love and probably cannot be solved easily with a regular expression alone.

Example output of the fix #1 code example with the data provided:

stdClass Object
(
    [d] => stdClass Object
        (
            [results] => Array
                (
                    [0] => stdClass Object
                        (
                            [__metadata] => stdClass Object
                                (
                                    [uri] => https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non supporting iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=0&$top=1
                                    [type] => WebResult
                                )

                            [ID] => 7858fc9f-6bd5-4102-a835-0fa89e9f992a
                            [Title] => something good
                            [Description] => something "WRONG" here!
                            [DisplayUrl] => www.devx.com/Java/Article/27685/1954
                            [Url] => http://www.devx.com/Java/Article/27685/1954
                        )

                )

            [__next] => https://api.datamarket.azure.com/Data.ashx/Bing/Search/Web?Query=u0027non%20supporting%20iframesu0027&Market=u0027it-ITu0027&Adult=u0027Offu0027&Options=u0027DisableLocationDetectionu0027&WebSearchOptions=u0027DisableQueryAlterationsu0027&$skip=50
        )

)
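For the large number of documents mentioned in the question, fix #1 could be wrapped in a helper that decodes first and repairs only when decoding fails, so that valid documents are never touched. This is only a sketch under assumptions: decodeWithRepair and the $documents array are hypothetical, and in practice the strings would come from files or a database.

```php
<?php
/**
 * Hypothetical helper: decode a document, applying fix #1 only when needed.
 * Returns the decoded array, or null if the document is still invalid.
 * (A document that is literally "null" would also be reported as null.)
 */
function decodeWithRepair($like)
{
    $decoded = json_decode($like, true);
    if (json_last_error() === JSON_ERROR_NONE) {
        return $decoded; // already valid, leave it alone
    }

    // Fix #1: escape double quotes inside single-line member strings.
    $pattern = '/^(\s+"[a-z_]+": ")([^"]*".*)(",?)$/mi';
    $buffer  = preg_replace_callback($pattern, function ($m) {
        return $m[1] . strtr($m[2], ['"' => '\\"']) . $m[3];
    }, $like);
    if ($buffer === null) {
        return null; // the regex engine itself reported an error
    }

    $decoded = json_decode($buffer, true);
    return json_last_error() === JSON_ERROR_NONE ? $decoded : null;
}

// Hypothetical input documents.
$documents = [
    'ok'       => "{\n    \"Title\": \"fine\"\n}",
    'broken'   => "{\n    \"Description\": \"something \"WRONG\" here!\"\n}",
    'hopeless' => "{\n    \"Description\": [\n}",
];

$failed = [];
foreach ($documents as $name => $like) {
    $data = decodeWithRepair($like);
    if ($data === null) {
        $failed[] = $name; // keep for manual inspection or a further fix
        continue;
    }
    // ... process $data here ...
}

print_r($failed);
```

Collecting the names of the documents that still fail makes it easier to see which other kinds of breakage exist before writing a fix #2.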
answered 2012-11-07T15:25:39.460