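I am looking for an open-source document management system that can index all kinds of files (text: [pdf, doc...], images: [jpg, png, bmp...], video: [mov, mp4...]), and I came across Datafari.
It uses the Solr search engine and ManifoldCF to manage the connections to content repositories, and it has a Tika connector to help search the metadata.
I installed it and I am trying to set it up so that it finds images when I search on metadata criteria, but no luck so far.
I added a local repository containing an image with some metadata: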
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Artist" content="tarzan"/>
<meta name="date" content="2015-03-28T09:47:45"/>
<meta name="Print flags information" content="0 1 0 0 0 0 0 0 0 2"/>
<meta name="Slices" content="zebre (0,0,500,500) 1 Slices"/>
<meta name="ICC Untagged Profile" content="1"/>
<meta name="Compression Type" content="Baseline"/>
<meta name="subject" content="legs"/>
<meta name="subject" content="mammal"/>
<meta name="Image Description" content="this kind of animal is hard to see behind bar"/>
<meta name="Thumbnail Compression" content="JPEG (old-style)"/>
<meta name="Print flags" content="0 0 0 0 0 0 0 0 1"/>
<meta name="By-line" content="tarzan"/>
<meta name="Number of Components" content="3"/>
<meta name="Component 2" content="Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="Component 1" content="Y component: Quantization table 0, Sampling factors 1 horiz/1 vert"/>
<meta name="tiff:ResolutionUnit" content="Inch"/>
<meta name="Object Name" content="king of disguise"/>
<meta name="Seed number" content="1"/>
<meta name="X Resolution" content="72 dots per inch"/>
<meta name="IPTC-NAA record" content="160 bytes binary data"/>
<meta name="Unknown tag (0x043a)" content="[239 bytes]"/>
<meta name="Version Info" content="1 (Adobe Photoshop, Adobe Photoshop CS6) 1"/>
<meta name="Component 3" content="Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert"/>
<meta name="dc:title" content="king of disguise"/>
<meta name="modified" content="2015-03-28T09:47:45"/>
<meta name="Thumbnail Data" content="JpegRGB, 160x160, Decomp 76800 bytes, 1572865 bpp, 6513 bytes"/>
<meta name="tiff:BitsPerSample" content="8"/>
<meta name="Application Record Version" content="42432"/>
<meta name="Resolution Info" content="72.0x72.0 DPI"/>
<meta name="meta:author" content="tarzan"/>
<meta name="meta:creation-date" content="2015-03-28T09:47:45"/>
<meta name="Caption digest" content="[16 bytes]"/>
<meta name="Creation-Date" content="2015-03-28T09:47:45"/>
<meta name="resourceName" content="zebre.jpg"/>
<meta name="Orientation" content="Top, left side (Horizontal / normal)"/>
<meta name="tiff:Orientation" content="1"/>
<meta name="tiff:Software" content="Adobe Photoshop CS6 (Windows)"/>
<meta name="Thumbnail Offset" content="354 bytes"/>
<meta name="Color Transform" content="YCbCr"/>
<meta name="Global Angle" content="120"/>
<meta name="Author" content="tarzan"/>
<meta name="Exif Image Height" content="500 pixels"/>
<meta name="Software" content="Adobe Photoshop CS6 (Windows)"/>
<meta name="tiff:YResolution" content="72.0"/>
<meta name="Y Resolution" content="72 dots per inch"/>
<meta name="dc:description" content="this kind of animal is hard to see behind bars"/>
<meta name="Color transfer functions" content="[112 bytes]"/>
<meta name="Keywords" content="legs"/>
<meta name="Keywords" content="mammal"/>
<meta name="Data Precision" content="8 bits"/>
<meta name="Coded Character Set" content="%G"/>
<meta name="dc:creator" content="tarzan"/>
<meta name="tiff:ImageLength" content="500"/>
<meta name="description" content="this kind of animal is hard to see behind bars"/>
<meta name="JPEG quality" content="12 (Maximum), Standard format, 3 scans"/>
<meta name="dcterms:created" content="2015-03-28T09:47:45"/>
<meta name="dcterms:modified" content="2015-03-28T09:47:45"/>
<meta name="Last-Modified" content="2015-03-28T09:47:45"/>
<meta name="Last-Save-Date" content="2015-03-28T09:47:45"/>
<meta name="Thumbnail Length" content="6513 bytes"/>
<meta name="Color Space" content="Undefined"/>
<meta name="Credit" content="tarzan"/>
<meta name="Global Altitude" content="30"/>
<meta name="meta:save-date" content="2015-03-28T09:47:45"/>
<meta name="Country/Primary Location Name" content="kenya"/>
<meta name="Content-Length" content="93123"/>
<meta name="Content-Type" content="image/jpeg"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.jpeg.JpegParser"/>
<meta name="creator" content="tarzan"/>
<meta name="Color halftoning information" content="[72 bytes]"/>
<meta name="dc:subject" content="legs"/>
<meta name="dc:subject" content="mammal"/>
<meta name="tiff:XResolution" content="72.0"/>
<meta name="Date/Time" content="2015:03:28 09:47:45"/>
<meta name="Grid and guides information" content="[16 bytes]"/>
<meta name="Caption/Abstract" content="this kind of animal is hard to see behind bars"/>
<meta name="DCT Encode Version" content="1"/>
<meta name="Exif Image Width" content="500 pixels"/>
<meta name="Image Height" content="500 pixels"/>
<meta name="Pixel Aspect Ratio" content="1.0"/>
<meta name="Supplemental Category(s)" content="earthly creature"/>
<meta name="Image Width" content="500 pixels"/>
<meta name="Flags 0" content="64"/>
<meta name="Resolution Unit" content="Inch"/>
<meta name="Unknown tag (0x043b)" content="[557 bytes]"/>
<meta name="URL List" content="0"/>
<meta name="meta:keyword" content="legs"/>
<meta name="meta:keyword" content="mammal"/>
<meta name="Print Scale" content="Centered, Scale 1.0"/>
<meta name="tiff:ImageWidth" content="500"/>
<meta name="Flags 1" content="0"/>
<title>king of disguise</title>
</head>
<body/></html>
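To make the repeated tags easier to see, here is a minimal Python sketch that groups the <meta> tags of such a Tika XHTML document by name. It works on a trimmed copy of the output above; the helper name collect_meta is only illustrative and is not part of Tika, ManifoldCF or Datafari.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Trimmed copy of the Tika output above: only a few <meta> tags are kept.
TIKA_XHTML = """<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="subject" content="legs"/>
<meta name="subject" content="mammal"/>
<meta name="dc:title" content="king of disguise"/>
<meta name="meta:keyword" content="legs"/>
<meta name="meta:keyword" content="mammal"/>
<meta name="creator" content="tarzan"/>
</head>
<body/></html>"""


def collect_meta(xhtml):
    """Group the content of every <meta> tag by its name attribute."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xhtml.encode("utf-8"))
    fields = defaultdict(list)
    for meta in root.iter(XHTML_NS + "meta"):
        fields[meta.get("name")].append(meta.get("content"))
    return dict(fields)


fields = collect_meta(TIKA_XHTML)
for name, values in sorted(fields.items()):
    print(name, values)
```

Names such as subject and meta:keyword come out with two values each, which is why the matching Solr field has to be multi-valued.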
In the Solr schema.xml I added the field I need:
<fields>
...
<field name="subject" type="string" indexed="true" stored="true" multiValued="true" />
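Then I restarted the server.
In the ManifoldCF administration, in the job list, I added the Tika extractor transformation to my job. The pipeline is: my repository -> Tika Extractor -> DatafariSolr
I tried searching in the Solr interface: for q, I tried "subject:legs", and the data was retrieved in the Solr interface.
But in the Datafari search engine, I get no results.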
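For reference, the same fielded query can be written as a plain select URL, which makes it easier to compare with whatever request the Datafari UI sends. This is only a sketch: the base URL below is hypothetical (host, port and core name depend on the installation), and build_select_url is an illustrative helper, not a Solr or Datafari function.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical base URL: adjust host, port and core name to the installation.
SOLR_SELECT = "http://localhost:8080/solr/select"


def build_select_url(field, value, base=SOLR_SELECT):
    """Build a select URL for an exact fielded query such as subject:"legs"."""
    # Escape backslashes and double quotes so the value stays one quoted phrase.
    escaped = value.strip().replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "wt": "json", "indent": "true"}
    return base + "?" + urlencode(params)


url = build_select_url("subject", "legs")
print(url)

# Decode the query string again to check what Solr would receive as q.
print(parse_qs(urlsplit(url).query)["q"][0])
```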
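The Datafari help is not very helpful, and I had no more luck with the ManifoldCF documentation. I would like a real example of this kind of search by metadata. What should be modified and/or tested to see the image in the results?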
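Update after Olivier Tavard's answer:
Thank you for your help. This tool looks really promising, although I still have problems configuring it:
I can't find datafari/WebContent/js/search.js. Did you mean datafari/tomcat/webapps/Datafari/js/search.js?
I added what you suggested.
I also added the "description" and "creator" fields.
1 - In the Solr search:
- If I search for "animal" in q, I retrieve my image (but not with "animals"), which is already better than having to use "description:animal".
- But if I search for "legs", I retrieve nothing. Is it because there are several <meta> "subject" tags, and is there a different way to search them?
- If I search for "tarzan" (from the creator field), I don't retrieve anything either.
2 - In the Datafari UI search: the changes I made seem to have "broken" the search. When I search, the wheel keeps spinning. In the console I have: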
GET "http://localhost:8080/Datafari/css/menu.css" 404
L'utilisation d'XMLHttpRequest de façon synchrone sur le fil d'exécution principal est obsolète à cause de son impact négatif sur la navigation de l'utilisateur final.
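3 - I added another picture with other metadata for the same fields. In the Solr search, if I query "jpg", both pictures come up (OK), but in the JSON response the extra fields do not appear for the other picture!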
{
"responseHeader": {
"status": 0,
"QTime": 6,
"params": {
"indent": "true",
"q": "jpg\n",
"_": "1427968093838",
"wt": "json"
}
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"last_modified": "2015-03-28T09:47:45Z",
"id": "file:/home/olivier/Bureau/datafari/images/zebre.jpg",
"url": "file:/home/olivier/Bureau/datafari/images/zebre.jpg",
"source": "file",
"extension": "jpg",
"language": "en",
"content_en": [
""
],
"title_en": [
"zebre.jpg"
],
"title": [
"zebre.jpg"
],
"_version_": 1496971642075611100,
"allow_token_share": [
"__nosecurity__"
],
"deny_token_document": [
"__nosecurity__"
],
"deny_token_share": [
"__nosecurity__"
],
"allow_token_document": [
"__nosecurity__"
]
},
{
"last_modified": "2015-03-29T15:45:23Z",
"subject": [
"Description Mots clé"
],
"id": "file:/home/olivier/Bureau/datafari/metadata/image1toto.jpg",
"creator": [
"Description, IPTC - Auteur: beta"
],
"description": [
"Description Description : gamma"
],
"url": "file:/home/olivier/Bureau/datafari/metadata/image1toto.jpg",
"source": "file",
"extension": "jpg",
"language": "en",
"content_en": [
""
],
"title_en": [
"image1toto.jpg"
],
"title": [
"image1toto.jpg"
],
"_version_": 1497001790322770000,
"allow_token_share": [
"__nosecurity__"
],
"deny_token_document": [
"__nosecurity__"
],
"deny_token_share": [
"__nosecurity__"
],
"allow_token_document": [
"__nosecurity__"
]
}
]
},
"highlighting": {
"file:/home/olivier/Bureau/datafari/images/imagejpg.jpg": {
"content_fr": [
""
],
"content_en": [
""
]
},
"file:/home/olivier/Bureau/datafari/images/zebre.jpg": {
"content_fr": [
""
],
"content_en": [
""
]
},
"file:/home/olivier/Bureau/datafari/metadata/image1toto.jpg": {
"content_fr": [
""
],
"content_en": [
""
]
}
},
"spellcheck": {
"suggestions": []
},
"capsuleSearchComponent": {}
}
I'm confused.
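To pin down which document lacks which field, a small Python sketch like the following can be run against the JSON response. It uses a trimmed copy of the response above; the list of expected fields and the helper name missing_fields are my own choices for illustration.

```python
import json

# Trimmed copy of the response above: only the fields relevant here are kept.
RESPONSE = json.loads("""
{
  "response": {
    "numFound": 2,
    "docs": [
      {"id": "file:/home/olivier/Bureau/datafari/images/zebre.jpg",
       "title": ["zebre.jpg"]},
      {"id": "file:/home/olivier/Bureau/datafari/metadata/image1toto.jpg",
       "subject": ["Description Mots clé"],
       "creator": ["Description, IPTC - Auteur: beta"],
       "description": ["Description Description : gamma"],
       "title": ["image1toto.jpg"]}
    ]
  }
}
""")

# Fields added to schema.xml that every image is expected to carry.
EXPECTED = ("subject", "creator", "description")


def missing_fields(docs, expected=EXPECTED):
    """Map each document id to the expected fields it does not contain."""
    return {doc["id"]: [f for f in expected if f not in doc] for doc in docs}


report = missing_fields(RESPONSE["response"]["docs"])
for doc_id, missing in report.items():
    print(doc_id, "->", missing or "all expected fields present")
```

On this response, zebre.jpg is missing all three fields while image1toto.jpg has them.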
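Edit after Olivier Tavard's answer
Sorry for the late reply; I am busy with some urgent things at the moment and could not test/answer as I wanted.
I followed your steps (very instructive, thank you) and managed, to some extent, to get results in the client search :)
But:
1 - I have to use wildcards to find it in the Datafari GUI: for "horse in disguise" I have to type '**horse*' instead of 'horse'.
2 - How can I retrieve the data of fields that have several values (e.g. meta:keyword ...)?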
<meta name="meta:keyword" content="legs"/>
<meta name="meta:keyword" content="mammal"/>
3 - I have a "standard" installation, but I get a 404 for localhost:8080/Datafari/css/menu.css; maybe that is why the search wheel keeps spinning until I refresh the page.
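About point 1, one possible explanation (an assumption on my side, not something I have confirmed) is that the field is declared with type "string", as in my schema.xml snippet. Solr does not tokenize string fields, so the whole value is indexed as a single term and only a wildcard can match a word inside it. The toy Python model below only illustrates that difference; it does not call Solr, and both function names are made up.

```python
from fnmatch import fnmatchcase


def string_field_match(stored, query):
    """Toy model of an untokenized field: the whole value is one term."""
    return fnmatchcase(stored, query)


def text_field_match(stored, query):
    """Toy model of a tokenized field: every lowercased word is a term."""
    return any(fnmatchcase(tok, query.lower()) for tok in stored.lower().split())


value = "horse in disguise"
for query in ("horse", "*horse*"):
    print(query,
          "string:", string_field_match(value, query),
          "text:", text_field_match(value, query))
```

In this model 'horse' only matches the tokenized variant, while '*horse*' matches both, which is the behaviour I see in the GUI.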