java - Parse fire dispatch website feed to use discrete elements contained within

Question

I would like to parse the following website and split each dispatch page into discrete elements, such as the time, date, address, and each individual unit dispatched to a call.

http://lebanonema.org/pager/html/monitor.html

I would like to use these discrete elements to display the pages on a different website.

For example I would like to turn

this:

20:15:09 22-07-13 POCSAG-1 West Cornwall Township SPANGLER RD HORSESHOE PIKE MV - Accident w/Injuries **NON EMERGENCY RESPONSE*** TK5 Fire-Box 37-03 EMS-Box 190-7 Station 05

<tr>
<td class="COL2">20:15:09</td>
<td class="COL3">22-07-13</td>
<td class="COL4">POCSAG-1</td>
<td class="COL7">
West Cornwall Township SPANGLER RD HORSESHOE PIKE MV - Accident w/Injuries **NON EMERGENCY RESPONSE*** TK5 Fire-Box 37-03 EMS-Box 190-7
<span class="M">Station 05</span>
</td>
</tr>

into individual elements that I could somehow use on another website, such as the following:

time:20:15:09
date:22-07-13
pageid:POCSAG-1
address:West Cornwall Township SPANGLER RD HORSESHOE PIKE
incident:MV - Accident w/Injuries
additional_details:**NON EMERGENCY RESPONSE***
responding_unit_1:TK5
responding_unit_2:
responding_unit_3:
etc...
fire_box:37-03 
ems_box:190-7
station:05

I have moderate experience with HTML, CSS, and Java, and I am open to learning much more. If someone can provide a code snippet that does what I am asking, I should be able to learn enough from it to complete the rest myself.

Please keep in mind that the site is constantly updated with new pages, so whatever method is used needs to accommodate that.

score 1 · Accepted Answer

You are actually asking two questions here. One is how to parse HTML; that is outlined in "How do you parse and process HTML/XML in PHP?" and has been answered extensively, so I skip that part. The other is how to parse a string.

Parsing a string depends entirely on the format of the string. In PHP this is normally done with the string functions and the regular expression functions. Consult the PHP manual for more information about these.
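Since you mention Java: as a small illustration of how much the format matters, here is a minimal sketch that turns the date and time columns into one timestamp. It assumes the feed prints dates as day-month-year with a two-digit year, which fits the examples but is not confirmed anywhere. The class name PageTimestamp is made up for this example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class PageTimestamp {
    // Assumption: "22-07-13" means 22 July 2013 (day-month-year, two-digit year).
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("dd-MM-yy HH:mm:ss");

    // Combines the date and time columns of one page into a single timestamp.
    public static LocalDateTime parse(String date, String time) {
        return LocalDateTime.parse(date + " " + time, FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("22-07-13", "20:15:09"));
    }
}
```

If the feed actually uses a different field order, only the pattern string needs to change.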

In addition to the functions outlined above, you also need the format specification of the string. So far your question only contains examples of the strings; a specification of which part is what, and what the decision criteria are, is missing.

You need to specify that first, and I would do so before writing the first line of code. You can then write it in any programming language you like. Whether that is PHP or Java is not that important; it is much more important that you have properly specified how the format works. You then encode that specification into code.
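To make "specify first, then encode" concrete, here is a rough Java sketch built on a guessed specification: the message ends with "Fire-Box <id> EMS-Box <id>", and the run of letters-then-digits tokens directly before "Fire-Box" are the responding units. These rules are my assumptions derived from the examples, not a confirmed format, and the names DispatchTextParser and Dispatch are invented for illustration. Everything before the units stays in one description field, because no rule is given for where the address ends and the incident type begins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DispatchTextParser {
    // Assumed specification, written down as one regular expression:
    //   description, then zero or more unit codes (letters followed by digits),
    //   then "Fire-Box <id>" and "EMS-Box <id>" at the end of the message.
    private static final Pattern MESSAGE = Pattern.compile(
            "^(?<description>.*?)\\s*"
          + "(?<units>(?:[A-Za-z]+\\d+\\s+)*)"
          + "Fire-Box\\s+(?<fireBox>\\S+)\\s+"
          + "EMS-Box\\s+(?<emsBox>\\S+)\\s*$",
            Pattern.DOTALL);

    // Plain value holder for the parsed parts of one message.
    public static final class Dispatch {
        public final String description;
        public final List<String> units;
        public final String fireBox;
        public final String emsBox;

        Dispatch(String description, List<String> units, String fireBox, String emsBox) {
            this.description = description;
            this.units = units;
            this.fireBox = fireBox;
            this.emsBox = emsBox;
        }
    }

    // Returns an empty Optional for messages that do not fit the assumed format,
    // for example free-text announcements without any box numbers.
    public static Optional<Dispatch> parse(String text) {
        Matcher m = MESSAGE.matcher(text.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        List<String> units = new ArrayList<>();
        String unitRun = m.group("units").trim();
        if (!unitRun.isEmpty()) {
            for (String unit : unitRun.split("\\s+")) {
                units.add(unit);
            }
        }
        return Optional.of(new Dispatch(
                m.group("description"), units, m.group("fireBox"), m.group("emsBox")));
    }

    public static void main(String[] args) {
        String text = "West Cornwall Township SPANGLER RD HORSESHOE PIKE MV - Accident w/Injuries "
                + "**NON EMERGENCY RESPONSE*** TK5 Fire-Box 37-03 EMS-Box 190-7";
        parse(text).ifPresent(d ->
                System.out.println(d.units + " fire_box=" + d.fireBox + " ems_box=" + d.emsBox));
    }
}
```

Tokens such as FG-3 in the other examples do not match the unit rule and therefore end up in the description; whether that is right is exactly the kind of decision the specification has to settle.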

Some rough example code (excerpt), to demonstrate how it could be done in PHP:

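// Fetch the raw page.
$url = 'http://lebanonema.org/pager/html/monitor.html';
$buffer = file_get_contents($url);

// Convert to UTF-8. utf8_encode() treats the input as ISO-8859-1 and is
// deprecated as of PHP 8.2; mb_convert_encoding($buffer, 'UTF-8', 'ISO-8859-1')
// is the equivalent call there.
$buffer = utf8_encode($buffer);

// Let Tidy repair the invalid HTML and emit well-formed XML.
$config = [
    'doctype'    => 'omit',
    'output-xml' => 1,
];
$buffer = tidy_repair_string($buffer, $config, 'utf8');

$xml = simplexml_load_string($buffer);

// Select the table rows that have more than one cell and decorate each one
// with a NodeParser. DecoratingIterator, SimpleXMLXPathIterator and NodeParser
// are not PHP built-ins; their definitions are not part of this excerpt.
$nodes = new DecoratingIterator(
    new SimpleXMLXPathIterator($xml, '//tr[count(td) > 1]'),
    'NodeParser'
);

// Print every parsed row as pretty-printed JSON.
foreach ($nodes as $index => $node) {
    echo $index, ': ', json_encode($node, JSON_PRETTY_PRINT), "\n";
}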

Example output:

0: {
    "date": "23-07-13",
    "time": "07:56:28",
    "pageid": "POCSAG-1",
    "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV -\nAccident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
    "station": "Station 31"
}
1: {
    "date": "23-07-13",
    "time": "07:56:26",
    "pageid": "POCSAG-1",
    "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV -\nAccident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
    "station": "Station 30"
}
2: {
    "date": "23-07-13",
    "time": "07:56:25",
    "pageid": "POCSAG-1",
    "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV -\nAccident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
    "station": "Sta 31 Siren"
}

...

497: {
    "date": "22-07-13",
    "time": "12:21:27",
    "pageid": "POCSAG-1",
    "text": "South Lebanon Township 1700 S LINCOLN AVE VA\nMedical CenterAFA - Auto Fire Alarm FG-4 E25 E26 W36 R25 TK26 TK36\nAmbCo190 Fire-Box 25-08 EMS-Box 190-4",
    "station": "Station 26"
}
498: {
    "date": "22-07-13",
    "time": "12:21:20",
    "pageid": "POCSAG-1",
    "text": "South Lebanon Township 1700 S LINCOLN AVE VA\nMedical CenterAFA - Auto Fire Alarm FG-4 E25 E26 W36 R25 TK26 TK36\nAmbCo190 Fire-Box 25-08 EMS-Box 190-4",
    "station": "Station 25"
}
499: {
    "date": "22-07-13",
    "time": "12:18:19",
    "pageid": "POCSAG-1",
    "text": "Company 34 Correction..No Training TOMORROW\nnight..Training Will Be Held Thursday At 1830",
    "station": "Station 34"
}

This example also shows that you need to deal with more than just the parsing, for example cleaning up invalid HTML (in PHP, Tidy can be used for this) and dealing with charset encodings.
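For the charset part, a comparable step in Java could look like the sketch below. It assumes, as the utf8_encode() call in the PHP excerpt does, that the page arrives as ISO-8859-1; if the server sends something else, the charset constant has to change. The class name FeedDecoder is hypothetical. Repairing invalid HTML is a separate problem that this sketch does not cover, since the XML parser in the Java standard library rejects markup that is not well-formed.

```java
import java.nio.charset.StandardCharsets;

public class FeedDecoder {
    // Assumption: the downloaded bytes are ISO-8859-1 (Latin-1).
    public static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Re-encodes Latin-1 bytes as UTF-8, comparable to what utf8_encode() does in PHP.
    public static byte[] toUtf8(byte[] latin1) {
        return decodeLatin1(latin1).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {'C', 'a', 'f', (byte) 0xE9}; // "Café" in ISO-8859-1
        System.out.println(decodeLatin1(raw).length() + " characters, "
                + toUtf8(raw).length + " bytes as UTF-8");
    }
}
```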

The NodeParser object just wraps a concrete <TR> element returned by the xpath() operation. This is basic SimpleXML parsing and has been outlined previously. As a bonus, the object implements the JsonSerializable interface so that it can easily be converted and displayed.
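The NodeParser class itself is not shown here, so the following is not its implementation, only a rough Java sketch of the same idea using the XPath support in the standard library. It assumes the markup has already been repaired into well-formed XML, and it takes the column classes (COL2, COL3, COL4, COL7) and the span class M from the row in the question. The class name RowParser is made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RowParser {
    // Turns every <tr> with more than one <td> into a map of named fields.
    // Assumption: the input is well-formed XML (already cleaned up).
    public static List<Map<String, String>> parseRows(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(
                    "//tr[count(td) > 1]", doc, XPathConstants.NODESET);

            List<Map<String, String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("date", cell(xpath, row, "td[@class='COL3']"));
                fields.put("time", cell(xpath, row, "td[@class='COL2']"));
                fields.put("pageid", cell(xpath, row, "td[@class='COL4']"));
                // Only the first text node, so the station span is kept separate.
                fields.put("text", cell(xpath, row, "td[@class='COL7']/text()[1]"));
                fields.put("station", cell(xpath, row, "td[@class='COL7']/span[@class='M']"));
                result.add(fields);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the page markup", e);
        }
    }

    // Evaluates a path relative to the row and collapses surrounding whitespace.
    private static String cell(XPath xpath, Element row, String path) throws Exception {
        return xpath.evaluate("normalize-space(" + path + ")", row);
    }

    public static void main(String[] args) {
        String xhtml = "<table><tr><td class=\"COL2\">20:15:09</td>"
                + "<td class=\"COL3\">22-07-13</td><td class=\"COL4\">POCSAG-1</td>"
                + "<td class=\"COL7\">West Cornwall Township SPANGLER RD "
                + "<span class=\"M\">Station 05</span></td></tr></table>";
        System.out.println(parseRows(xhtml));
    }
}
```

Keeping the row-to-fields mapping in one place like this means the text field can later be handed to a more detailed parser without touching the code that walks the table.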

Using a parser object allows you to change and tweak the parsing over time. For example, as this example code shows, the text has not been parsed any further so far (as the specification is missing).

I hope this is helpful and at least shows how it could be done.
