Is it possible to crawl dynamic content on websites?
Fess uses Playwright to crawl them.
To use the Playwright crawler, add the following setting to Config Parameters:
client.crawlerClients=playwright:http://.*,playwright:https://.*
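The value above reads as a comma-separated list of client:pattern pairs. Here is a minimal Python sketch of how such a value could be interpreted, assuming each entry is split at its first colon and the pattern must match the whole URL. The function names and the "default" fallback name are hypothetical, and this is not Fess's own parsing code:

```python
import re

def parse_crawler_clients(value):
    """Split 'name:regex,name:regex' into (name, compiled regex) pairs."""
    rules = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split at the first colon only; the URL pattern itself contains colons.
        name, pattern = entry.split(":", 1)
        rules.append((name, re.compile(pattern)))
    return rules

def pick_client(rules, url, default="default"):
    """Return the first client name whose pattern matches the full URL."""
    for name, pattern in rules:
        if pattern.fullmatch(url):
            return name
    # "default" is a placeholder name for "no rule matched" (assumption).
    return default

rules = parse_crawler_clients("playwright:http://.*,playwright:https://.*")
print(pick_client(rules, "https://example.com/app"))  # playwright
print(pick_client(rules, "file:///tmp/readme.txt"))   # default
```

Read this way, the setting routes every http and https URL to the Playwright client and leaves other schemes alone.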
Do I have to make any changes to the deployment or config of Fess itself for this to work? I am using the docker-fess repo v14.16.
You need to install Playwright. Please see the dockerfile.
I was able to get this dockerfile built but I am noticing warnings about missing packages for playwright. So I added the missing packages and am no longer getting errors, but I still cannot get dynamic content to work. Could you go into more detail about how to set this up/verify it is working?
I was unable to reproduce the issue. The crawler works fine with Playwright.
I cloned the docker-fess repo, changed into the compose folder, and ran docker compose up:
I see this in the logs. Is there a way to easily view these logs in the OpenSearch dashboard? Everything is returned as JSON, which isn't easy to follow.
{
"@timestamp": "2024-10-03T17:51:45.110Z",
"log.level": "DEBUG",
"message": "Failed to create Playwright instance.",
"ecs.version": "1.2.0",
"service.name": "fess",
"event.dataset": "crawler",
"process.thread.name": "Crawler-20241003175122-1-1",
"log.logger": "org.codelibs.fess.crawler.client.http.PlaywrightClient",
"error.type": "com.microsoft.playwright.PlaywrightException",
"error.message": "Error {\n message='\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\nβ Host system is missing dependencies to run browsers. β\nβ Please install them with the following command: β\nβ β\nβ sudo mvn exec:java -e -D exec.mainClass=com.microsoft.playwright.CLI -D exec.args=\"install-deps\" β\nβ β\nβ Alternatively, use apt: β\nβ sudo apt-get install libatk1.0-0\\ β\nβ libatk-bridge2.0-0\\ β\nβ libxkbcommon0\\ β\nβ libatspi2.0-0\\ β\nβ libxdamage1\\ β\nβ libgbm1\\ β\nβ libasound2 β\nβ β\nβ <3 Playwright Team β\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n name='Error\n stack='Error: \nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\nβ Host system is missing dependencies to run browsers. β\nβ Please install them with the following command: β\nβ β\nβ sudo mvn exec:java -e -D exec.mainClass=com.microsoft.playwright.CLI -D exec.args=\"install-deps\" β\nβ β\nβ Alternatively, use apt: β\nβ sudo apt-get install libatk1.0-0\\ β\nβ libatk-bridge2.0-0\\ β\nβ libxkbcommon0\\ β\nβ libatspi2.0-0\\ β\nβ libxdamage1\\ β\nβ libgbm1\\ β\nβ libasound2 β\nβ β\nβ <3 Playwright Team β\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n at validateDependenciesLinux (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/dependencies.js:216:9)\n at async Registry._validateHostRequirements (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:575:43)\n at async Registry._validateHostRequirementsForExecutableIfNeeded (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:673:7)\n at async Registry.validateHostRequirementsForExecutablesIfNeeded 
(/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:662:43)\n at async Chromium._launchProcess (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:175:7)\n at async Chromium._innerLaunch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:111:9)\n at async Chromium._innerLaunchWithRetries (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:92:14)\n at async ProgressController.run (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/progress.js:82:22)\n at async Chromium.launch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:61:21)\n at async BrowserTypeDispatcher.launch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/dispatchers/browserTypeDispatcher.js:35:21)\n}",
"error.stack_trace": "com.microsoft.playwright.PlaywrightException: Error {\n message='\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\nβ Host system is missing dependencies to run browsers. β\nβ Please install them with the following command: β\nβ β\nβ sudo mvn exec:java -e -D exec.mainClass=com.microsoft.playwright.CLI -D exec.args=\"install-deps\" β\nβ β\nβ Alternatively, use apt: β\nβ sudo apt-get install libatk1.0-0\\ β\nβ libatk-bridge2.0-0\\ β\nβ libxkbcommon0\\ β\nβ libatspi2.0-0\\ β\nβ libxdamage1\\ β\nβ libgbm1\\ β\nβ libasound2 β\nβ β\nβ <3 Playwright Team β\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n name='Error\n stack='Error: \nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\nβ Host system is missing dependencies to run browsers. β\nβ Please install them with the following command: β\nβ β\nβ sudo mvn exec:java -e -D exec.mainClass=com.microsoft.playwright.CLI -D exec.args=\"install-deps\" β\nβ β\nβ Alternatively, use apt: β\nβ sudo apt-get install libatk1.0-0\\ β\nβ libatk-bridge2.0-0\\ β\nβ libxkbcommon0\\ β\nβ libatspi2.0-0\\ β\nβ libxdamage1\\ β\nβ libgbm1\\ β\nβ libasound2 β\nβ β\nβ <3 Playwright Team β\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n at validateDependenciesLinux (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/dependencies.js:216:9)\n at async Registry._validateHostRequirements (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:575:43)\n at async Registry._validateHostRequirementsForExecutableIfNeeded (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:673:7)\n at async 
Registry.validateHostRequirementsForExecutablesIfNeeded (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:662:43)\n at async Chromium._launchProcess (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:175:7)\n at async Chromium._innerLaunch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:111:9)\n at async Chromium._innerLaunchWithRetries (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:92:14)\n at async ProgressController.run (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/progress.js:82:22)\n at async Chromium.launch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:61:21)\n at async BrowserTypeDispatcher.launch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/dispatchers/browserTypeDispatcher.js:35:21)\n}\n\tat com.microsoft.playwright.impl.WaitableResult.get(WaitableResult.java:56)\n\tat com.microsoft.playwright.impl.ChannelOwner.runUntil(ChannelOwner.java:120)\n\tat com.microsoft.playwright.impl.Connection.sendMessage(Connection.java:130)\n\tat com.microsoft.playwright.impl.ChannelOwner.sendMessage(ChannelOwner.java:106)\n\tat com.microsoft.playwright.impl.BrowserTypeImpl.launchImpl(BrowserTypeImpl.java:51)\n\tat com.microsoft.playwright.impl.BrowserTypeImpl.lambda$launch$0(BrowserTypeImpl.java:43)\n\tat com.microsoft.playwright.impl.LoggingSupport.withLogging(LoggingSupport.java:47)\n\tat com.microsoft.playwright.impl.ChannelOwner.withLogging(ChannelOwner.java:89)\n\tat com.microsoft.playwright.impl.BrowserTypeImpl.launch(BrowserTypeImpl.java:43)\n\tat com.microsoft.playwright.impl.BrowserTypeImpl.launch(BrowserTypeImpl.java:36)\n\tat 
org.codelibs.fess.crawler.client.http.PlaywrightClient.createPlaywrightWorker(PlaywrightClient.java:157)\n\tat org.codelibs.fess.crawler.client.http.PlaywrightClient.init(PlaywrightClient.java:142)\n\tat org.codelibs.fess.crawler.client.http.PlaywrightClient.execute(PlaywrightClient.java:264)\n\tat org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:154)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: com.microsoft.playwright.impl.DriverException: Error {\n message='\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\nβ Host system is missing dependencies to run browsers. β\nβ Please install them with the following command: β\nβ β\nβ sudo mvn exec:java -e -D exec.mainClass=com.microsoft.playwright.CLI -D exec.args=\"install-deps\" β\nβ β\nβ Alternatively, use apt: β\nβ sudo apt-get install libatk1.0-0\\ β\nβ libatk-bridge2.0-0\\ β\nβ libxkbcommon0\\ β\nβ libatspi2.0-0\\ β\nβ libxdamage1\\ β\nβ libgbm1\\ β\nβ libasound2 β\nβ β\nβ <3 Playwright Team β\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n name='Error\n stack='Error: \nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\nβ Host system is missing dependencies to run browsers. 
β\nβ Please install them with the following command: β\nβ β\nβ sudo mvn exec:java -e -D exec.mainClass=com.microsoft.playwright.CLI -D exec.args=\"install-deps\" β\nβ β\nβ Alternatively, use apt: β\nβ sudo apt-get install libatk1.0-0\\ β\nβ libatk-bridge2.0-0\\ β\nβ libxkbcommon0\\ β\nβ libatspi2.0-0\\ β\nβ libxdamage1\\ β\nβ libgbm1\\ β\nβ libasound2 β\nβ β\nβ <3 Playwright Team β\nββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n at validateDependenciesLinux (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/dependencies.js:216:9)\n at async Registry._validateHostRequirements (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:575:43)\n at async Registry._validateHostRequirementsForExecutableIfNeeded (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:673:7)\n at async Registry.validateHostRequirementsForExecutablesIfNeeded (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/registry/index.js:662:43)\n at async Chromium._launchProcess (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:175:7)\n at async Chromium._innerLaunch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:111:9)\n at async Chromium._innerLaunchWithRetries (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:92:14)\n at async ProgressController.run (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/progress.js:82:22)\n at async Chromium.launch (/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/browserType.js:61:21)\n at async BrowserTypeDispatcher.launch 
(/var/tmp/fess/fessTmpDir_20241003175122/playwright-java-2353819000671254235/package/lib/server/dispatchers/browserTypeDispatcher.js:35:21)\n}\n\tat com.microsoft.playwright.impl.Connection.dispatch(Connection.java:259)\n\tat com.microsoft.playwright.impl.Connection.processOneMessage(Connection.java:211)\n\tat com.microsoft.playwright.impl.ChannelOwner.runUntil(ChannelOwner.java:118)\n\t... 13 more\n"
}
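This does not answer the OpenSearch dashboard question, but one way to make records like the one above easier to follow from a terminal is to reduce each JSON line to a one-line summary. The sketch below is in Python and assumes one JSON object per line with the field names shown above. The function names are hypothetical, and the sample record is a trimmed copy of the log entry above:

```python
import json

def summarize(line):
    """Turn one JSON log line into a short, readable summary."""
    record = json.loads(line)
    summary = "{} {:<5} {} - {}".format(
        record.get("@timestamp", "?"),
        record.get("log.level", "?"),
        # Keep only the class name, not the full package path.
        record.get("log.logger", "?").rsplit(".", 1)[-1],
        record.get("message", ""),
    )
    error_type = record.get("error.type")
    if error_type:
        summary += " [" + error_type + "]"
    return summary

def summarize_lines(lines):
    """Summarize every line that parses as a JSON object; skip the rest."""
    out = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            out.append(summarize(line))
        except json.JSONDecodeError:
            continue
    return out

sample = json.dumps({
    "@timestamp": "2024-10-03T17:51:45.110Z",
    "log.level": "DEBUG",
    "message": "Failed to create Playwright instance.",
    "log.logger": "org.codelibs.fess.crawler.client.http.PlaywrightClient",
    "error.type": "com.microsoft.playwright.PlaywrightException",
})
for text in summarize_lines([sample, "not json"]):
    print(text)
```

To use it on a real log, pass the lines of the log file to summarize_lines, for example with open(path) where path points at the crawler log. The location of that file depends on the installation.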
I added the packages to the Dockerfile and these errors went away. From what I can tell, Playwright is now enabled, but I'm still not successfully scraping my dynamic content. Is there a way to force it to wait 3 seconds on the page before it moves on? It might not be waiting for the frontend React app to load all the way.
You don't seem to have replaced image with build in the compose.yaml file as I mentioned.
The above error occurs when using the Playwright Dockerfile out of the box. With my modifications, I do not see this error. I am not sure why you cannot reproduce it. Regardless, I was able to successfully crawl a site, but am running into a new problem.
I set up a simple client-side React app with "create-react-app". I can see that OpenSearch indexed the dynamic content: it appears in the "cache" field of the JSON response returned by the OpenSearch API below. Unlike other sites I've scraped with Playwright enabled, though, the "content" field is empty. Could you direct me to the next steps I should take to fix this issue?
Request sent to OpenSearch: GET /fess.search/_search?q="react"
{
"took": 36,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.45553336,
"hits": [
{
"_index": "fess.20241008170407108",
"_id": "1b7b1c1614abbb878e19d8ae7780e20e104c598ec6e92d668156ecde1d9aefca99a519d2b9e8b7d25fb4d332fd1774900570fa9fdc8ddf69aeaef412fa9e8bae",
"_score": 0.45553336,
"_source": {
"filetype": "html",
"expires": "2024-10-11T17:08:34.710Z",
"role": [
"Rguest"
],
"click_count": 0,
"title": "React App",
"content": "",
"has_cache": "true",
"segment": "20241008170832",
"digest": "Web site created using create-react-app",
"host": "host.docker.internal:3000",
"favorite_count": 0,
"important_content": null,
"lang": "en",
"content_length": "950",
"timestamp": "2024-10-08T17:09:00.328Z",
"virtual_host": [],
"cache": """<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><link rel="icon" href="/favicon.ico"><meta name="viewport" content="width=device-width,initial-scale=1"><meta name="theme-color" content="#000000"><meta name="description" content="Web site created using create-react-app"><link rel="apple-touch-icon" href="/logo192.png"><link rel="manifest" href="/manifest.json"><title>React App</title><script defer="defer" src="/static/js/main.3dd63bcb.js"></script><link href="/static/css/main.f855e6bc.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"><div class="App"><header class="App-header"><img src="/static/media/logo.6ce24c58023cc2f8fd88fe9d219db6c6.svg" class="App-logo" alt="logo"><p>Edit <code>src/App.js</code> and save to reload.</p><a class="App-link" href="https://reactjs.org" target="_blank" rel="noopener noreferrer">Learn React</a></header></div></div></body></html>""",
"created": "2024-10-08T17:09:00.328Z",
"label": [],
"doc_id": "27964480d639422bada08f6991802a71",
"url": "http://host.docker.internal:3000/",
"site": "host.docker.internal:3000/",
"config_id": "WALagSZIBheZZnvbXTbbC",
"anchor": [
"https://reactjs.org",
"http://host.docker.internal:3000/favicon.ico",
"http://host.docker.internal:3000/logo192.png",
"http://host.docker.internal:3000/manifest.json",
"http://host.docker.internal:3000/static/css/main.f855e6bc.css"
],
"boost": "1.0",
"mimetype": "text/html"
}
},
{...}
}
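To spot other documents with the same symptom, a search response like the one above can be scanned for hits whose "content" is empty even though a cache exists. This is a minimal Python sketch that assumes the response has already been loaded into a dict. The function name is hypothetical, and the sample is a trimmed copy of the hit above:

```python
def empty_content_hits(response):
    """Return URLs of hits whose 'content' is empty although a cache exists."""
    urls = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        content = source.get("content") or ""
        # has_cache is stored as the string "true" in the response above.
        if not content.strip() and source.get("has_cache") == "true":
            urls.append(source.get("url"))
    return urls

response = {
    "hits": {
        "hits": [
            {
                "_source": {
                    "url": "http://host.docker.internal:3000/",
                    "content": "",
                    "has_cache": "true",
                }
            }
        ]
    }
}
print(empty_content_hits(response))  # ['http://host.docker.internal:3000/']
```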
Web crawler config for client-side:
The contents of the following tags in fess_config.properties are ignored:
crawler.document.html.pruned.tags=noscript,script,style,header,footer,aside,nav,a[rel=nofollow]
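In the cached page above, everything the React app renders sits inside a header element, and header is one of the pruned tags, which would explain the empty "content" field. The Python sketch below illustrates that effect on a shortened copy of the cached HTML. It is a simplified stand-in, not Fess's extraction code: it handles plain tag names only (the a[rel=nofollow] selector is left out), and the class and function names are hypothetical:

```python
from html.parser import HTMLParser

PRUNED_TAGS = {"noscript", "script", "style", "header", "footer", "aside", "nav"}

class BodyTextExtractor(HTMLParser):
    """Collect body text, skipping anything nested inside a pruned tag."""

    def __init__(self, pruned_tags):
        super().__init__()
        self.pruned_tags = pruned_tags
        self.in_body = False
        self.pruned_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.pruned_tags:
            self.pruned_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.pruned_tags and self.pruned_depth > 0:
            self.pruned_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.pruned_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html, pruned_tags):
    parser = BodyTextExtractor(pruned_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Shortened copy of the cached page from the search response.
page = (
    '<html><head><title>React App</title></head><body>'
    '<noscript>You need to enable JavaScript to run this app.</noscript>'
    '<div id="root"><div class="App"><header class="App-header">'
    '<p>Edit <code>src/App.js</code> and save to reload.</p>'
    '<a class="App-link" href="https://reactjs.org">Learn React</a>'
    '</header></div></div></body></html>'
)

print(repr(visible_text(page, PRUNED_TAGS)))               # ''
print(repr(visible_text(page, PRUNED_TAGS - {"header"})))  # 'Edit src/App.js and save to reload. Learn React'
```

Removing header from the list (or moving the app's markup out of the header element) leaves text for the indexer to pick up.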
Hey, that was exactly it! Thanks so much.
Just one last question. Many of these config parameters are not documented. Is there an easy way to get a reference of all possible parameters? I saw that a few are listed in the config section, but things like client.crawlerClients were only mentioned in this thread. I looked at the source code and had trouble working out where the properties are defined and what they do. Are some of them for external libraries, and if so, are the parameters covered in those libraries' documentation?
Fess is an open-source project with many configuration options, so any contributions or improvements to the documentation are always appreciated. If you need further assistance, please consider reaching out to commercial support.
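Until fuller documentation exists, one rough way to see which keys a given installation defines is to list them straight from fess_config.properties. The Python sketch below uses a simplified properties parser (comments, key=value pairs, and backslash line continuations only) and hypothetical function names. In the sample text, only crawler.document.html.pruned.tags comes from this thread; example.other.key is a made-up placeholder. Per-crawl Config Parameters such as client.crawlerClients would not show up this way:

```python
def parse_properties(text):
    """Parse Java-style .properties text into a dict (simplified)."""
    props = {}
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if not line or line[0] in "#!":
            continue  # blank line or comment
        if line.endswith("\\"):
            pending = line[:-1]  # value continues on the next line
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def keys_with_prefix(props, prefix):
    """Return the sorted keys that start with the given prefix."""
    return sorted(key for key in props if key.startswith(prefix))

sample = """
# sample only; example.other.key is a hypothetical placeholder
crawler.document.html.pruned.tags=noscript,script,style,header,\\
footer,aside,nav,a[rel=nofollow]
example.other.key=value
"""

props = parse_properties(sample)
for key in keys_with_prefix(props, "crawler."):
    print(key, "=", props[key])
```

For a real installation, read the file's text with open(path).read() and pass it to parse_properties. The path depends on how Fess was installed.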
Hello! Congratulations on Fess. I don't have much experience with it yet, but it seems very powerful.
I can't get dynamic crawling to work.
I'm using the Docker installation with the Playwright build enabled, and I see this line in fess-crawler.log, so I assume everything is set up correctly:
{"@timestamp":"2024-11-08T13:11:06.112Z","log.level": "INFO","message":"…Reading jar:file:/usr/share/fess/app/WEB-INF/lib/fess-crawler-playwright-14.16.0.jar!/crawler/client++.xml", "ecs.version": "1.2.0","service.name":"fess","event.dataset":"crawler","process.thread.name":"main","log.logger":"org.lastaflute.di.core.factory.LaContainerFactory"}
Thanks for your help.