You are actually asking two questions here. One is how to parse HTML; that is outlined in "How do you parse and process HTML/XML in PHP?" and has been answered extensively, so I skip that part. The other is how to parse a string.
Parsing a string depends entirely on the format of the string. In PHP this is normally done with the string functions and the regular expression functions. Consult the PHP manual for more information about these.
Besides the functions I have just mentioned, you also need the format specification of the string. So far, your question only contains examples of the strings; the specification is missing: which part is what, and what the decision criteria are.
You need to specify that first, and I would do it before writing the first line of code. Once that is done, you can write it in any programming language you like. Whether it is PHP or Java is not that important; it is much more important that you have properly specified how it works. You then encode that processing into code.
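To illustrate both approaches, here is a minimal sketch that splits a timestamp of the kind shown in the example output further down. The layout `dd-mm-yy hh:mm:ss` and the variable names are my assumptions based on the sample values; the question does not specify them.

```php
<?php
// Assumed input; the layout "dd-mm-yy hh:mm:ss" is a guess from the sample data.
$input = '23-07-13 07:56:28';

// Variant 1: plain string functions. Short, but does no validation.
[$date, $time] = explode(' ', $input, 2);
[$day, $month, $year] = explode('-', $date);

// Variant 2: a regular expression with named groups, which also checks the shape.
$pattern = '/^(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{2}) (?<time>\d{2}:\d{2}:\d{2})$/';
if (preg_match($pattern, $input, $m)) {
    printf("day=%s month=%s year=%s time=%s\n", $m['day'], $m['month'], $m['year'], $m['time']);
} else {
    echo "input does not match the assumed format\n";
}
```

The string functions are fine when the input is known to be well-formed; the regular expression is the safer choice when the input may deviate from the format.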
Some rough example code (excerpt), to demonstrate how it could be done in PHP:
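$url = 'http://lebanonema.org/pager/html/monitor.html';
// Fetch the page and convert it from ISO-8859-1 to UTF-8
// (utf8_encode() did the same, but is deprecated as of PHP 8.2).
$buffer = file_get_contents($url);
$buffer = mb_convert_encoding($buffer, 'UTF-8', 'ISO-8859-1');
// Let Tidy repair the invalid HTML and emit well-formed XML.
$config = [
    'doctype'    => 'omit',
    'output-xml' => 1,
];
$buffer = tidy_repair_string($buffer, $config, 'utf8');
$xml = simplexml_load_string($buffer);
// DecoratingIterator, SimpleXMLXPathIterator and NodeParser are not built-in
// PHP classes; their definitions are not part of this excerpt.
$nodes = new DecoratingIterator(
    new SimpleXMLXPathIterator($xml, '//tr[count(td) > 1]'),
    'NodeParser'
);
foreach ($nodes as $index => $node) {
    echo $index, ': ', json_encode($node, JSON_PRETTY_PRINT), "\n";
}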
Example output:
0: {
    "date": "23-07-13",
    "time": "07:56:28",
    "pageid": "POCSAG-1",
    "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV -\nAccident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
    "station": "Station 31"
}
1: {
    "date": "23-07-13",
    "time": "07:56:26",
    "pageid": "POCSAG-1",
    "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV -\nAccident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
    "station": "Station 30"
}
2: {
    "date": "23-07-13",
    "time": "07:56:25",
    "pageid": "POCSAG-1",
    "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV -\nAccident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
    "station": "Sta 31 Siren"
}
...
497: {
    "date": "22-07-13",
    "time": "12:21:27",
    "pageid": "POCSAG-1",
    "text": "South Lebanon Township 1700 S LINCOLN AVE VA\nMedical CenterAFA - Auto Fire Alarm FG-4 E25 E26 W36 R25 TK26 TK36\nAmbCo190 Fire-Box 25-08 EMS-Box 190-4",
    "station": "Station 26"
}
498: {
    "date": "22-07-13",
    "time": "12:21:20",
    "pageid": "POCSAG-1",
    "text": "South Lebanon Township 1700 S LINCOLN AVE VA\nMedical CenterAFA - Auto Fire Alarm FG-4 E25 E26 W36 R25 TK26 TK36\nAmbCo190 Fire-Box 25-08 EMS-Box 190-4",
    "station": "Station 25"
}
499: {
    "date": "22-07-13",
    "time": "12:18:19",
    "pageid": "POCSAG-1",
    "text": "Company 34 Correction..No Training TOMORROW\nnight..Training Will Be Held Thursday At 1830",
    "station": "Station 34"
}
This example also shows that you need to deal with more than just the parsing: for example, cleaning up invalid HTML (in PHP, Tidy can be used for this) and dealing with character encodings.
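On the encoding side, an unconditional conversion like the one in the excerpt above garbles input that is already UTF-8. A slightly more defensive variant could look like the sketch below. The helper name `toUtf8` and the ISO-8859-1 fallback are my assumptions; in practice the source charset should come from the HTTP header or the document itself. It requires the mbstring extension.

```php
<?php
/**
 * Return $buffer as UTF-8 (hypothetical helper).
 * The fallback charset is an assumption, not something detected from the page.
 */
function toUtf8(string $buffer, string $fallbackCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($buffer, 'UTF-8')) {
        return $buffer; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($buffer, 'UTF-8', $fallbackCharset);
}

$latin1 = "caf\xE9";                  // "café" encoded as ISO-8859-1
echo bin2hex(toUtf8($latin1)), "\n";  // prints 636166c3a9
```

Note that the check is only a heuristic: a byte sequence can be valid UTF-8 by accident, so a declared charset should take precedence when one is available.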
The NodeParser object just wraps a concrete <TR> element returned by the xpath() operation; this is basic SimpleXML parsing and has been outlined previously. As a bonus, the object implements the JsonSerializable interface so that it can be easily converted and displayed.
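Since the class itself is not part of the excerpt, here is a hypothetical stand-in to show the idea. The class name `ExampleNodeParser`, the assumption that each row has five cells in the order date, time, pageid, text, station, and the sample row are all mine; the real page may be laid out differently.

```php
<?php
// Hypothetical sketch, not the NodeParser used for the output above.
final class ExampleNodeParser implements JsonSerializable
{
    private SimpleXMLElement $row;

    public function __construct(SimpleXMLElement $row)
    {
        $this->row = $row;
    }

    /** Direct text of the n-th <td> (0-based), or '' if the cell is missing. */
    private function cell(int $n): string
    {
        return isset($this->row->td[$n]) ? trim((string) $this->row->td[$n]) : '';
    }

    /** Assumed column order: date, time, pageid, text, station. */
    public function jsonSerialize(): array
    {
        return [
            'date'    => $this->cell(0),
            'time'    => $this->cell(1),
            'pageid'  => $this->cell(2),
            'text'    => $this->cell(3),
            'station' => $this->cell(4),
        ];
    }
}

// Made-up sample row, only to make the sketch runnable.
$xml = simplexml_load_string(
    '<table><tr><td>23-07-13</td><td>07:56:28</td><td>POCSAG-1</td>'
    . '<td>Example text</td><td>Station 31</td></tr></table>'
);
foreach ($xml->xpath('//tr[count(td) > 1]') as $index => $row) {
    echo $index, ': ', json_encode(new ExampleNodeParser($row), JSON_PRETTY_PRINT), "\n";
}
```

Here the rows are wrapped by hand inside the loop; the iterator classes in the excerpt do that wrapping for you.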
Using a parser object allows you to change and tweak the parsing over time. For example, as this example code shows, the text has not been parsed any further so far (as the specification is missing).
I hope this is helpful and at least shows how it could be done.
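Once a specification exists, such a rule can be added to the parser object in isolation. As a sketch only: suppose the specification said that box numbers appear as "Fire-Box" or "EMS-Box" followed by two groups of digits joined by a hyphen. That rule and the function name `parseBoxes` are my invention, guessed from the sample texts, and not something the question states.

```php
<?php
/**
 * Hypothetical rule: "Fire-Box <digits>-<digits>" and "EMS-Box <digits>-<digits>".
 * A box that does not occur in the text is reported as null.
 */
function parseBoxes(string $text): array
{
    $boxes = ['fire' => null, 'ems' => null];
    if (preg_match('/\bFire-Box (\d+-\d+)/', $text, $m)) {
        $boxes['fire'] = $m[1];
    }
    if (preg_match('/\bEMS-Box (\d+-\d+)/', $text, $m)) {
        $boxes['ems'] = $m[1];
    }
    return $boxes;
}

// Text taken from the example output above.
$text = "Jackson Township W LINCOLN AVE N LOCUST ST MV -\n"
      . "Accident w/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2";
print_r(parseBoxes($text)); // fire => 30-01, ems => 140-2
```

Messages without box numbers, such as the training notice at index 499, simply yield null for both entries.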