(from github.com/manfred-w)
We are crawling a website that has HTML and PDF files (the PDF files are linked with <a href="...">).
Is it possible to take some metadata from the HTML file that links to a PDF file and store it with the PDF record?
Example:
- index.html has the meta keywords "test, news, other" and contains the link <a href="testdoc.pdf">PDF</a>.
- testdoc.pdf has no keywords. I would like to show the "test, news, other" keywords when the PDF file is found.
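To make the scenario concrete, here is a rough sketch of the behaviour I have in mind, written as plain Python outside of the crawler. It is not the crawler's API: the class name KeywordLinkParser, the function inherited_keywords and the http://example.com base URL are made up for illustration, and it only uses the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class KeywordLinkParser(HTMLParser):
    """Collects the meta keywords and the PDF links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Split "test, news, other" into a clean list of keywords.
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.lower().endswith(".pdf"):
                self.pdf_links.append(href)


def inherited_keywords(page_url, html):
    """Map each linked PDF URL to the keywords of the page that links to it."""
    parser = KeywordLinkParser()
    parser.feed(html)
    return {urljoin(page_url, href): list(parser.keywords)
            for href in parser.pdf_links}


# Hypothetical page matching the example above.
html = """<html><head><meta name="keywords" content="test, news, other"></head>
<body><a href="testdoc.pdf">PDF</a></body></html>"""
result = inherited_keywords("http://example.com/index.html", html)
print(result)
```

In the crawler, the equivalent would be to remember these keywords per linked URL while the HTML page is processed and to add them to the PDF record when that URL is fetched later.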
Is it possible to realize such a scenario?
Thanks a lot
Manfred