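My task is to standardize some address information. To do that, I am breaking address strings down into granular values (our address schema is very similar to Google's format).

Progress so far:
I am using PHP and am currently breaking out the Bldg, Suite, Room#, etc. information. Everything went well until I ran into floors. Most of the time, floor information is written as "Floor 10" or "Floor 86". Nice and easy. For everything up to this point, I can simply split the string on a keyword ("room", "floor", etc.).

The problem:
But then I noticed something in my test dataset. In some cases, the floor is written more like "2nd Floor".
That made me realize I need to be prepared for a wide range of variations in FLOOR information:
options like "3rd Floor", "22nd floor", and "1ST FLOOR". And what about spelled-out variants such as "Twelfth Floor"?
Man!! This could get messy very quickly.

My goal:
I am hoping someone knows of a library or something else that has already solved this problem.
Realistically, though, I would be more than happy with some good suggestions/guidance on how to elegantly split strings based on such varied criteria (while taking care to avoid false positives such as "3rd St").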

2 Answers

I would start by reading up on regex in PHP. For example:

$floorarray = preg_split('/\bfloor\b/i', $floorstring);

Other useful functions are preg_grep, preg_match, etc
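
As a rough sketch of how preg_match could pull a numeric floor out of a longer address line without tripping on street names like "3rd St", the patterns below only accept a number that sits directly next to the word "floor". The function name extractNumericFloor and the sample addresses are made up for illustration, and spelled-out floors are deliberately out of scope here.

```php
<?php
// Hypothetical helper (not part of any library): returns the floor number
// as an int, or null when no numeric floor designator is found.
function extractNumericFloor(string $address): ?int
{
    // Keyword first, then digits: "Floor 10", "floor #10"
    if (preg_match('/\bfloor\s*#?\s*(\d+)\b/i', $address, $m)) {
        return (int) $m[1];
    }
    // Digits, optional ordinal suffix, then keyword: "2nd Floor", "22ND FLOOR"
    if (preg_match('/\b(\d+)(?:st|nd|rd|th)?\s+floor\b/i', $address, $m)) {
        return (int) $m[1];
    }
    return null;
}

var_dump(extractNumericFloor('123 3rd St, 22nd Floor')); // int(22)
var_dump(extractNumericFloor('500 Main St Floor 86'));   // int(86)
var_dump(extractNumericFloor('77 3rd St'));              // NULL
```

Because the ordinal only counts when it is followed by "floor", "3rd St" never matches on its own.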

Edit: added a more complete solution.

This solution takes as an input a string describing the floor. It can be of various formats such as:

  • Floor 102
  • Floor One-hundred two
  • Floor One hundred and two
  • One-hundred second floor
  • 102nd floor
  • 102ND FLOOR
  • etc

Until I can look at an example input file, I am just guessing from your post that this will be adequate.

<?php

$errorLog = 'error-log.txt'; // a file to catalog bad entries with bad floors

// These are a few example inputs
$addressArray = array('Fifty-second Floor', 'somefloor', '54th floor', '52qd floor',
  'forty forty second floor', 'five nineteen hundredth floor', 'floor fifty-sixth second ninth');

foreach ($addressArray as $id => $address) {
  $floor = parseFloor($id, $address);
  if ( empty($floor) ) {
    error_log('Entry '.$id.' is invalid: '.$address."\n", 3, $errorLog);
  } else {
    echo 'Entry '.$id.' is on floor '.$floor."\n";
  }
}

function parseFloor($id, $address)
{
  $floorString = implode(preg_split('/(^|\s)floor($|\s)/i', $address));

  if ( preg_match('/^\s*(\d+)(st|nd|rd|th)?\s*$/i', $floorString, $matchArray) ) {
    // floorString contained a valid numerical floor (digits plus at most one ordinal suffix)
    $floor = $matchArray[1];
  } elseif ( ($floor = word2num($floorString)) != FALSE ) { // note assignment op not comparison
    // floorString contained a valid english ordinal for a floor
    ; // No need to do anything
  } else {
     // floorString did not contain a properly formed floor
    $floor = FALSE;
  }
  return $floor;
}

function word2num( $inputString )
{
  $cards = array('zero',
    'one',    'two',    'three',    'four',     'five',    'six',     'seven',     'eight',    'nine',     'ten',
    'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty');
  $cards[30] = 'thirty';  $cards[40] = 'forty';  $cards[50] = 'fifty'; $cards[60] = 'sixty';
  $cards[70] = 'seventy'; $cards[80] = 'eighty'; $cards[90] = 'ninety'; $cards[100] = 'hundred';
  $ords  = array('zeroth',
    'first',    'second',  'third',      'fourth',     'fifth',     'sixth',     'seventh',     'eighth',     'ninth',      'tenth',
    'eleventh', 'twelfth', 'thirteenth', 'fourteenth', 'fifteenth', 'sixteenth', 'seventeenth', 'eighteenth', 'nineteenth', 'twentieth');
  $ords[30] = 'thirtieth';  $ords[40] = 'fortieth';  $ords[50] = 'fiftieth';  $ords[60] =  'sixtieth';
  $ords[70] = 'seventieth'; $ords[80] = 'eightieth'; $ords[90] = 'ninetieth'; $ords[100] = 'hundredth';

  // break the string at any whitespace, dash, comma, or the word 'and'
  // (the hyphen sits last in the character class so it is read as a literal dash, not a range)
  $words = preg_split( '/([\s,-](?!and\s)|\sand\s)/i', $inputString );

  $sum = 0;
  foreach ($words as $word) {
    $word = strtolower($word);
    $value = array_search($word, $ords); // try the ordinal words
    if (!$value) { $value = array_search($word, $cards); } // try the cardinal words
    if (!$value) {
      // if $value is still false, it's not a known number word, so fail and exit
      return FALSE;
    }
    if ($value == 100) { $sum *= 100; }
    else { $sum += $value; }
  }

  return $sum;
}
?>

In the general case, parsing words into numbers is not easy. The best thread that I could find that discusses this is here. It is not nearly as easy as the inverse problem of converting numbers into words. My solution only works for numbers <2000, and it liberally interprets poorly formed constructs rather than tossing an error. Also, it is not resilient against spelling mistakes at all. For example:

  • forty forty second = 82
  • five nineteen hundredth = 2400
  • fifty-sixth second ninth = 67

If you have a lot of inputs and most of them are well formed, throwing errors for spelling mistakes is not really a big deal because you can manually correct the short list of problem entries. Silently accepting bad input, however, could be a real problem depending on your application. Just something to think about when deciding if it is worth it to make the conversion code more robust.
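
If silently accepting bad input is a concern, one option is to check the word order before summing. The following is only a sketch under a few assumptions of its own: it covers 1 to 999, accepts at most one "<1-9> hundred" part followed by an optional "and", then either a tens word with an optional unit or a single word from one to nineteen, and it lets only the last word be an ordinal. The name strictWord2num is hypothetical.

```php
<?php
// Hypothetical stricter variant of word2num() (an illustration only):
// returns an int from 1 to 999, or false when the words are not well formed.
function strictWord2num(string $input)
{
    $units = array_flip(['zero', 'one', 'two', 'three', 'four', 'five', 'six',
        'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen',
        'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']);
    unset($units['zero']); // leaves 'one' => 1 ... 'nineteen' => 19
    $tens = ['twenty' => 20, 'thirty' => 30, 'forty' => 40, 'fifty' => 50,
        'sixty' => 60, 'seventy' => 70, 'eighty' => 80, 'ninety' => 90];
    // Ordinals that cannot be turned into cardinals by stripping a suffix
    $irregular = ['first' => 'one', 'second' => 'two', 'third' => 'three',
        'fifth' => 'five', 'eighth' => 'eight', 'ninth' => 'nine', 'twelfth' => 'twelve'];

    $words = preg_split('/[\s,-]+/', strtolower(trim($input)), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($words);
    if ($n === 0) { return false; }

    // Only the final word may be an ordinal; convert it to its cardinal form.
    $last = $words[$n - 1];
    if (isset($irregular[$last])) {
        $words[$n - 1] = $irregular[$last];
    } elseif (substr($last, -4) === 'ieth') {
        $words[$n - 1] = substr($last, 0, -4) . 'y';   // twentieth -> twenty
    } elseif (substr($last, -2) === 'th') {
        $words[$n - 1] = substr($last, 0, -2);         // fourth -> four
    }

    $total = 0;
    $i = 0;
    // Optional "<1-9> hundred", optionally followed by "and"
    if ($n >= 2 && $words[1] === 'hundred') {
        $h = $units[$words[0]] ?? 0;
        if ($h < 1 || $h > 9) { return false; }
        $total = $h * 100;
        $i = 2;
        if ($i + 1 < $n && $words[$i] === 'and') { $i++; }
    }
    // Either a tens word with an optional 1-9 unit, or a single 1-19 word
    if ($i < $n && isset($tens[$words[$i]])) {
        $total += $tens[$words[$i]];
        $i++;
        if ($i < $n && isset($units[$words[$i]]) && $units[$words[$i]] <= 9) {
            $total += $units[$words[$i]];
            $i++;
        }
    } elseif ($i < $n && isset($units[$words[$i]])) {
        $total += $units[$words[$i]];
        $i++;
    }
    // Any leftover word means the phrase was malformed.
    return ($i === $n && $total > 0) ? $total : false;
}

var_dump(strictWord2num('Fifty-second'));        // int(52)
var_dump(strictWord2num('One hundred and two')); // int(102)
var_dump(strictWord2num('forty forty second'));  // bool(false)
```

With this grammar the three malformed examples listed above are rejected instead of being summed, at the cost of a narrower supported range.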

Answered 2013-02-26T17:46:28.240

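First, you need to list in detail all the possible input formats and decide how deep you want to go. If you treat spelled-out variants as invalid cases, you can apply a simple regular expression to capture the numbers and detect the tokens (room, floor, ...).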
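
A minimal sketch of that regex approach, assuming only digit-based values count as valid and spelled-out numbers are simply not detected. The keyword list (bldg, suite, room, floor) and the function name detectUnitTokens are illustrative assumptions, not a complete catalogue of real-world address formats.

```php
<?php
// Hypothetical helper: maps each detected keyword to its number.
function detectUnitTokens(string $address): array
{
    $keywords = 'bldg|suite|room|floor';
    $found = [];

    // Keyword followed by a number: "Bldg 7", "Suite 400", "Room #12"
    preg_match_all('/\b(' . $keywords . ')\b\.?\s*#?\s*(\d+)\b/i',
        $address, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $found[strtolower($m[1])] = (int) $m[2];
    }

    // Number with a mandatory ordinal suffix followed by a keyword: "2nd Floor".
    // The suffix is required so that "7 Suite 400" is not misread as suite 7.
    preg_match_all('/\b(\d+)(?:st|nd|rd|th)\s+(' . $keywords . ')\b/i',
        $address, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $key = strtolower($m[2]);
        if (!isset($found[$key])) {
            $found[$key] = (int) $m[1];
        }
    }
    return $found;
}

print_r(detectUnitTokens('Bldg 7, Suite 400, 2nd Floor, 15 3rd St'));
// bldg => 7, suite => 400, floor => 2 ("3rd St" is ignored)
```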

Answered 2013-02-26T17:16:32.647