I, um, seem to have gotten lost.

I believe my problem is in using PHP's DOMDocument class correctly to parse XML.

I have an XML spreadsheet coming from Excel which has headers for different columns. (It also has multiple worksheets, to help the end user in organizing the data.)

My end goal is markers on a map utilizing JavaScript.

Here is a simplified example of the XML file. Note that some of the data is strings, some is numeric, and some is HTML.

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook>
 <Worksheet ss:Name="data">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">lat</Data></Cell>
    <Cell><Data ss:Type="String">lng</Data></Cell>
    <Cell><Data ss:Type="String">boolean_1</Data></Cell>
    <Cell><Data ss:Type="String">boolean_2</Data></Cell>
    <Cell><Data ss:Type="String">Source_documents</Data></Cell>
    <Cell><Data ss:Type="String">description</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">35.032139998</Data></Cell>
    <Cell><Data ss:Type="Number">-117.346952</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="Number">0</Data></Cell>
    <Cell><ss:Data ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40"><Font html:Color="#000000">Copy here inside HTML </Font><I><Font html:Color="#000000">with more copy</Font></I></ss:Data></Cell>
    <Cell><Data ss:Type="String">Copy here without HTML</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">43.444</Data></Cell>
    <Cell><Data ss:Type="Number">-112.005</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="String">Diff Marker Src</Data></Cell>
    <Cell><Data ss:Type="String">Diff Marker Desc</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
 <Worksheet ss:Name="tags">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">tag_label</Data></Cell>
    <Cell><Data ss:Type="String">tag_category</Data></Cell>
    <Cell><Data ss:Type="String">tag_description</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">boolean_1</Data></Cell>
    <Cell><Data ss:Type="String">tag_cat_A</Data></Cell>
    <Cell><Data ss:Type="String">bool_1 desc</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">boolean_2</Data></Cell>
    <Cell><Data ss:Type="String">tag_cat_B</Data></Cell>
    <Cell><Data ss:Type="String">bool_2 desc</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>

I've been assuming that I need to convert the spreadsheet into either a JSON array, or a better-structured XML doc, that I can parse to create markers for a map. (JSON seems preferable to reduce data being transferred)

If that assumption is correct, I'd like to have a structure which looks kinda like this:

array => {
  data => {
    [0] => {
        lat => '35.032139998',
        lng => '-117.346952',
        booleans => {
            boolean_1 => true
        },
        Source_documents => '<Font html:Color="#000000">Copy here inside HTML </Font><I><Font html:Color="#000000">with more copy</Font></I>',
        'description' => 'Copy here without HTML'
    },
    [1] => {
        lat => '43.444',
        lng => '-112.005',
        booleans => {
            boolean_1 => true,
            boolean_2 => true
        },
        Source_documents => 'Diff Marker Src',
        'description' => 'Diff Marker Desc'
    }
  },
  tags => {
    'boolean_1' => {
        tag_category => 'tag_cat_A',
        'tag_description' => 'bool_1 desc'
    },
    'boolean_2' => {
        tag_category => 'tag_cat_B',
        'tag_description' => 'bool_2 desc'
    }
  }
}

I'm working in PHP, and attempting to transform the XML into JSON utilizing the DOMDocument class. SimpleXML worked fine for me until a new Excel doc was loaded which included the occasional HTML.

I have this PHP code so far:

function get_worksheet_table($file, $worksheet_name) {
  $dom = new DOMDocument;
  $dom->load($file);

  // returns a new instance of class DOMNodeList
  $worksheets = $dom->getElementsByTagName( 'Worksheet' );

  foreach($worksheets as $worksheet) {

    // check if right sheet
    if( $worksheet->getAttribute('ss:Name') == $worksheet_name) { 

      // trying to get entire node, or childNodeList, or ... ?
      // About here I am getting lost.
      $nodes = $worksheet->getElementsByTagName('Table')->item(0); 

      $table = new DOMDocument;
      $table->preserveWhiteSpace = false;
      $table->formatOutput = true;
      $table->createElement('Table');

      /*
         ITERATE THROUGH $nodes, ADD EACH CELL NODE'S CONTENTS 
         TO $table -- UNLESS IT HAS HTML, THEN USE DOMinnerHTML(node) 
         (DOMinnerHTML function @ http://php.net/manual/en/book.dom.php#89718)
       */

      return $table;
    }
  }
  return false;
}

$data = get_worksheet_table($file, 'data');
$tags = get_worksheet_table($file, 'tags');

From there, I'm trying to create associative arrays from $data and $tags, then output a big JSON statement to pass to my application.

But it is really a mess, and I'm, well like I said, I'm lost.

Questions:

  1. Does this look like I'm at least on the right track?
  2. How do I access the nodes properly? I seem to be getting all subnodes as one big text value.
  3. How do I iterate through the DOM to read each cell's text content where appropriate, while getting any children of a <Data> node back as a markup string rather than as child nodes?

Any pointers you might have toward better understanding how to parse XML with the DOMDocument class would be appreciated. I keep reading through the documentation, but it's eluding me.

Thank you so much for your time.

1 Answer

After considerably more research, I found a way to achieve what I want. I am not going to claim that this is the best possible method, by a long shot.

However, I was able to:

  1. parse an XML Spreadsheet, generated from Excel, into an array structured as I wanted;
  2. output that as JSON; and
  3. maintain any text styling as HTML within the generated output.

To be fair, I have not pushed the limits of the HTML—for example, we're really only messing with <b> and <i> tags. Font tags were coming in as well, and I decided to strip them.

I would not be surprised if there are cleaner, more elegant ways to do this (I'm pretty much getting out of an object and into an array as soon as possible), and I should also note that in my case I'm dealing with a relatively small data load. YMMV for larger projects, but if you are reading this far, then I hope this helps.

Here, then, is my function to generate an array of data from an XML Worksheet table:

/* array_from_worksheet_table()
 * Generate an array from an XML Worksheet
 * $file needs to be the full path to your file (e.g., '/Users/jeremy/www/cms/files/yourfile.xml')
 * $worksheet_name = the name of the worksheet tab
 */
function array_from_worksheet_table($file, $worksheet_name) {

  // https://stackoverflow.com/questions/7082401/avoid-domdocument-xml-warnings-in-php
  $previous_errors = libxml_use_internal_errors(true);

  $dom = new DOMDocument;
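  $loaded = $dom->load($file);

  // Uncomment to inspect any parse errors libxml collected:
  // print_r(libxml_get_errors());

  libxml_clear_errors();
  libxml_use_internal_errors($previous_errors);

  // Stop here if the file could not be loaded at all
  if( !$loaded ) {
    return false;
  }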


  // returns a new instance of class DOMNodeList
  $worksheets = $dom->getElementsByTagName( 'Worksheet' );

  foreach($worksheets as $worksheet) {
    if( $worksheet->getAttribute('ss:Name') == $worksheet_name) {

      // When we get a DOMNodeList, if we want to access the first item, we have to
      // then use ->item(0). Important once we want to access a deeper-level DOMNodeList
      $rows = $worksheet->getElementsByTagName('Table')->item(0)->getElementsByTagName('Row');

      $table = array();

      // Get our headings.
      // This assumes that the first row HAS our headings!
      $headings = $rows->item(0)->getElementsByTagName('Cell');

      // loop through table rows. Setting $i=1 instead of 0 means we skip the first row
      for( $i = 1; $i < $rows->length; $i++ ) {

        // this is our row of data
        $cells = $rows->item($i)->getElementsByTagName('Cell'); 

        // loop through each cell
        for( $c = 0; $c < $cells->length; $c++ ) {

          // check for data element in cell
          $celldata = $cells->item($c)->getElementsByTagName('Data');

          // If the cell has data, proceed
          if( $celldata->length ) {

            // Get HTML content of any strings
            if( $celldata->item(0)->getAttribute('ss:Type')== 'String' ) {

              // With PHP >= 5.3.6 you can use the DOMinnerHTML function
              // from https://stackoverflow.com/questions/2087103/ instead:
              // $value = xml_to_json::DOMinnerHTML( $celldata->item(0) );

              // Workaround for PHP < 5.3.6:
              // DOMNode::C14N canonicalizes the node into a string
              $value = $celldata->item(0)->C14N();

              // hack. remove tags like <ss:Data foo...> and </Data>
              // Necessary because C14N leaves outer tags (saveHTML did not)
              $value = preg_replace('/<([s\/:]+)?Data([^>]+)?>/i', '', $value);

              // Remove font tags from HTML. Bleah.
              $value = preg_replace('/<\/?font([^>]+)?>/i', '', $value);
            } else {
              $value = $cells->item($c)->nodeValue;
            }

            // grab label from first row
            $label = $headings->item($c)->nodeValue;

            $table[$i][$label] = $value;
          }
        }
      }
      return $table;
    }
  }
  return false;
}

This returned an array for a worksheet table, which I was then able to further manipulate.
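As an aside on the C14N-plus-regex step in the function above, here is a minimal sketch of another way to get a node's inner markup: serialize each child with DOMDocument::saveXML($node). The inner_xml() helper and the sample string are hypothetical, not part of my original code. The sample declares the urn:schemas-microsoft-com:office:spreadsheet namespace as Excel's XML Spreadsheet files normally do, and I have not tested this against a full workbook.

```php
<?php
// Hypothetical helper: returns the markup inside $node by serializing
// each of its children with DOMDocument::saveXML().
function inner_xml(DOMNode $node) {
  $out = '';
  foreach ($node->childNodes as $child) {
    $out .= $node->ownerDocument->saveXML($child);
  }
  return $out;
}

// Assumed sample: one cell whose <ss:Data> holds text plus an <I> element.
$ns  = 'urn:schemas-microsoft-com:office:spreadsheet';
$xml = '<Workbook xmlns="' . $ns . '" xmlns:ss="' . $ns . '">'
     . '<Cell><ss:Data ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40">'
     . 'Copy here <I>with more copy</I></ss:Data></Cell></Workbook>';

$dom = new DOMDocument;
$dom->loadXML($xml);

// Look the element up by namespace so the ss: prefix does not matter.
$data = $dom->getElementsByTagNameNS($ns, 'Data')->item(0);
$html = inner_xml($data);

echo $html, "\n";
```

Because only the children are serialized, the outer <ss:Data> tag never appears in the result, so the regex that strips it would not be needed. Font tags would still have to be removed separately.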

One task was re-organizing the resulting array so that my boolean values were all in a sub-array. First I removed all zero values, using remove_element_by_value($data, '0'). (I found that function at https://stackoverflow.com/a/4466181/156645.)
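To illustrate the zero-removal step, here is a rough stand-in built on array_filter. It is not the linked remove_element_by_value() function; drop_zero_cells() and the sample rows are hypothetical, and it assumes every cell value is a string, as nodeValue returns.

```php
<?php
// Hypothetical stand-in for remove_element_by_value(): removes every cell
// whose value is exactly the string '0' from each row, keeping the keys.
function drop_zero_cells(array $rows) {
  foreach ($rows as $i => $row) {
    $rows[$i] = array_filter($row, function ($value) {
      return $value !== '0';
    });
  }
  return $rows;
}

// Assumed sample rows, indexed from 1 like the worksheet function's output.
$data = array(
  1 => array('lat' => '35.032139998', 'boolean_1' => '1', 'boolean_2' => '0'),
  2 => array('lat' => '43.444', 'boolean_1' => '1', 'boolean_2' => '1'),
);

$data = drop_zero_cells($data);
print_r($data);
```

The strict comparison matters here: a loose check would also drop empty strings, which may be legitimate cell values.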

Then I compared array keys to the values found in my tags array, and appended them to each subarray, something like this ($long_codes was my simple array of the tag values):

if($data_array) {
  foreach($data_array as $key => $array) {
    foreach($array as $k => $val) {
      if( in_array($k, $long_codes)) {
        $data_array[$key]['Classify'][] = $k;
        unset($data_array[$key][$k]);
      }
    }
  }
}

Output was just echo json_encode($the_big_array), where the big array was just array('data' => $data_array, 'tags' => $tags_array).
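A small sketch of that final step, with made-up sample values. One thing to watch, under the assumption that the rows still carry the 1-based keys produced by array_from_worksheet_table(): json_encode turns an array whose keys do not start at 0 into a JSON object rather than a list, so this sketch re-indexes with array_values() first.

```php
<?php
// Assumed sample data; the row key starts at 1, as in the worksheet function.
$data_array = array(
  1 => array(
    'lat' => '43.444',
    'lng' => '-112.005',
    'Classify' => array('boolean_1', 'boolean_2'),
  ),
);
$tags_array = array(
  'boolean_1' => array('tag_category' => 'tag_cat_A', 'tag_description' => 'bool_1 desc'),
  'boolean_2' => array('tag_category' => 'tag_cat_B', 'tag_description' => 'bool_2 desc'),
);

// array_values() re-indexes the rows from 0 so "data" is encoded as a list.
$the_big_array = array(
  'data' => array_values($data_array),
  'tags' => $tags_array,
);

$json = json_encode($the_big_array);
echo $json, "\n";
```

On the JavaScript side, "data" can then be looped over as a plain array to create the markers, while "tags" stays an object keyed by tag label.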

Hope that helps somebody else!

Answered 2013-11-15T16:27:13.343