php - Regular expression Getting HTML Doctype

Question

My HTML code looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

or it can look like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

I want to extract the doctype from it, which would be "XHTML 1.0 Strict" for the first one and "HTML 4.0" for the second one. What regular expression would do this? I would like to use it with PHP's preg_match() function.

Please help me in this case.

score 3 · Accepted Answer

How about using DOMDocument and DOMDocumentType?

$xml = new DOMDocument(); 
$xml->loadHTMLFile($url);

$name = $xml->doctype->publicId; // -//W3C//DTD XHTML 1.0 Strict//EN

$xml->doctype now contains the following values:

DOMDocumentType Object
(
    [name] => html
    [entities] => (object value omitted)
    [notations] => (object value omitted)
    [publicId] => -//W3C//DTD XHTML 1.0 Strict//EN
    [systemId] => http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    [internalSubset] => 
    [nodeName] => html
    [nodeValue] => 
    [nodeType] => 10
    [parentNode] => (object value omitted)
    [childNodes] => 
    [firstChild] => 
    [lastChild] => 
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => 
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => 
    [baseURI] => 
    [textContent] => 
)

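So you can now easily extract the type:

$name = $xml->doctype->publicId;
// \s+ keeps the space after "DTD" out of the captured group
$name = preg_replace('~.*//DTD\s+(.*?)//.*~', '$1', $name);
echo $name;

This outputs XHTML 1.0 Strict. A working example was posted on phpfiddle.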
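
As a rough sketch of how the DOM approach could be wrapped for HTML held in a string, something like the following might work. The function name doctypeLabel and the sample markup are hypothetical, not part of the answer above, and the LIBXML_HTML_NODEFDTD flag is used on the assumption that a missing doctype should be reported as null rather than replaced by a default one.

```php
<?php
// Hypothetical helper (not from the answer above): returns the readable part of
// the doctype's public identifier, or null when there is none.
function doctypeLabel(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    // LIBXML_HTML_NODEFDTD: do not add a default doctype when the markup has none
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc->doctype === null || (string) $doc->doctype->publicId === '') {
        return null; // no doctype, or one without a public identifier (e.g. <!DOCTYPE html>)
    }
    // Capture the text between "//DTD " and the next "//"
    if (preg_match('~//DTD\s+(.*?)//~', $doc->doctype->publicId, $m)) {
        return trim($m[1]);
    }
    return null;
}

$sample = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">'
        . '<html><head><title>t</title></head><body></body></html>';
echo doctypeLabel($sample), PHP_EOL;
```

Returning null for documents without a public identifier leaves it to the caller to decide how to label them.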

score 3 · Accepted Answer

If the doctype will always take the form shown, you can use

'#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i'

So:

preg_match('#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i', $html, $match);
echo $match[0];
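
The lookbehind above expects exactly one space between the keywords. A variant with a capturing group and \s+ could tolerate extra whitespace or line breaks inside the declaration. This is only a sketch: the function name doctypeFromString is hypothetical, and it assumes the public identifier always starts with "-//W3C//DTD".

```php
<?php
// Hypothetical helper: extracts e.g. "XHTML 1.0 Strict" from a W3C doctype declaration.
function doctypeFromString(string $html): ?string
{
    // \s+ matches spaces and line breaks alike; the i flag ignores case.
    // The group stops at the first "/" or quote after "DTD ".
    $pattern = '~<!DOCTYPE\s+html\s+PUBLIC\s+"-//W3C//DTD\s+([^/"]+)//~i';
    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

// A doctype split across two lines still matches.
$multiLine = "<!DOCTYPE html\n    PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
           . "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">";
echo doctypeFromString($multiLine), PHP_EOL;
```

Checking the return value of preg_match() before reading the match avoids an undefined-index notice when the document has no such doctype.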

score 1 · Accepted Answer

// Case-insensitive substring check (public identifiers use upper case, e.g. "XHTML")
function contains($haystack, $needle){
    return stripos($haystack, $needle) !== false;
}
                $theDocType = "";
                $stringWithHTML = ""; // load some HTML in here from somewhere

                // Create DOM from HTML 
                $doc = new DOMDocument();
                //@$doc->loadHTMLFile("just_a_file.html");
                @$doc->loadHTML($stringWithHTML);

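                // Grab document type (doctype is null when the document has none)
                $dtName = $doc->doctype ? $doc->doctype->name : "";
                $dtPublic = $doc->doctype ? $doc->doctype->publicId : "";
                if( strtolower($dtName) === "html" && $dtPublic != ""){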
                    // HTML or XHTML?
                    if(contains($dtPublic,"xhtml")){
                        $theDocType = "XHTML 1.0";
                    }else{
                        $theDocType = "HTML 4.01";
                    }
                    // Which type?
                    if(contains($dtPublic,"strict")){
                        $theDocType .= " (Strict)";
                    }elseif(contains($dtPublic,"transitional")){
                        $theDocType .= " (Transitional)";
                    }elseif(contains($dtPublic,"frameset")){
                        $theDocType .= " (Frameset)";
                    }elseif(contains($dtPublic,"XHTML 1.1")){
                        $theDocType = "XHTML 1.1"; // XHTML 1.1 has no Strict/Transitional/Frameset variants
                    }
                }else{
                    $theDocType = "HTML 5";
                }

                // Result
                echo $theDocType;

This will output something like:
XHTML 1.1
HTML 5
HTML 4.01 (Strict)

score 0 · Accepted Answer

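I have used this thread in the past, but during testing I found problems with some long doctypes. Sometimes developers split the doctype across 2 or 3 lines. In that case, using a regular expression is not the best approach.

Below I have pasted an approach that handles doctypes written on one line or across several lines: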

<?php
class Doctype {
    public $html;
    public $doctype;
    public $version;
    // PHP 8 no longer treats a method named after the class as a constructor
    function __construct($html){
       $this->html = $html;
       $this->extractDoctype();
       $this->processDoctype();
    }
    private function extractDoctype(){
        $preDoctype = "";
        $preDoctypeValid = false;
        $lines = explode(PHP_EOL, $this->html);
        foreach ($lines as &$line) {
            $preDoctype = $preDoctype . $line;
            if(
                (strpos(strtolower($preDoctype), "<!doctype") !== false) && 
                (strpos(strtolower($preDoctype), ">") !== false)){
                $preDoctypeValid = true;
                break;
            }
        }
        if($preDoctypeValid){
            //Store only the pattern: <! doctype >
            $pos1 = strpos(strtolower($preDoctype), "<!doctype");
            $pos2 = strpos($preDoctype, ">", $pos1) + 1;
            // substr() takes a length, not an end offset
            $preDoctype = substr($preDoctype, $pos1, $pos2 - $pos1);
        }else{
            $preDoctype = "";
        }
        $this->doctype = $preDoctype;
    }
    private function processDoctype(){
        $version = "";

        $pattern_html5 = "/<!doctype\s+?html\s?>/i";
        if (preg_match($pattern_html5, strtolower($this->doctype))) {
            $version = "HTML5";
        }else if(strpos(strtolower($this->doctype), "xhtml") !== false){
            $version = "XHTML";     
        }else if(strpos(strtolower($this->doctype), "html") !== false){
            if(strpos(strtolower($this->doctype), "3.2") !== false){
                $version = "HTML 3.2";  
            }
            if(strpos(strtolower($this->doctype), "4.01") !== false){
                $version = "HTML 4.01"; 
            }
            if(strpos(strtolower($this->doctype), "2.0") !== false){
                $version = "HTML 2.0";  
            }
        }else{
            $version = "OTHER";
        }
        $this->version = $version;
    }
    public function getDoctype(){
        return $this->doctype;
    }
    public function getDoctypeVersion(){
        return $this->version;
    }
}
?>

https://github.com/jabrena/WTAnalyzer/blob/master/r_php/document/Doctype.class.php

score 0 · Accepted Answer

Try this:

<?php
   $html = file_get_contents("http://google.com");
   // [^>]* also matches line breaks, so a doctype split across lines is still found
   $found = preg_match("/<!DOCTYPE[^>]*>/i", $html, $matches);
   $doctype = $found ? $matches[0] : "";
?>

score 0 · Accepted Answer

'~<!doctype.*?//dtd\s+([^/]*)//EN[^>]*>~is'

This should work as a pattern for your examples.

Answered 2013-04-24T14:28:28.327

score 0 · Accepted Answer

This regular expression extracts everything between "DTD" and the next "/", without any syntax checking:

.*DTD\s+([^/]+)

This regular expression extracts the doctype and also checks some of the syntax in the string:

<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>
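
To try the two patterns against the doctypes from the question, a small check along these lines might help. The "~" delimiters, the array keys and the helper name extractWith are assumptions added for illustration. Both patterns are case-sensitive as written.

```php
<?php
// The two doctype declarations from the question.
$samples = [
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">',
];

// The two patterns from this answer, wrapped in "~" delimiters (an assumption;
// the answer gives the bare expressions only).
$patterns = [
    'loose'   => '~.*DTD\s+([^/]+)~',
    'checked' => '~<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>~',
];

// Hypothetical helper: returns the first captured group, or null if there is no match.
function extractWith(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) ? trim($m[1]) : null;
}

foreach ($samples as $sample) {
    foreach ($patterns as $label => $pattern) {
        echo $label, ': ', var_export(extractWith($pattern, $sample), true), PHP_EOL;
    }
}
```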
