Default crawler does not run all Web Crawlers

(from github.com/abolotnov)
OK, it worked for some time and then stopped collecting documents into the index again.

I see a lot of stuff in the logs like this:

2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2019-01-29 16:41:14,384 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS
2019-01-29 16:41:14,426 [Crawler-20190129142043-32-7] DEBUG The url is null. (16074)
2019-01-29 16:41:14,426 [Crawler-20190129142043-169-8] DEBUG The url is null. (12341)
2019-01-29 16:41:14,426 [Crawler-20190129142043-213-5] DEBUG The url is null. (10563)
2019-01-29 16:41:14,426 [Crawler-20190129142043-168-7] DEBUG The url is null. (12360)
2019-01-29 16:41:14,426 [Crawler-20190129142043-43-8] DEBUG The url is null. (15927)
2019-01-29 16:41:14,431 [Crawler-20190129142043-32-1] DEBUG The url is null. (16073)
2019-01-29 16:41:14,431 [Crawler-20190129142043-43-3] DEBUG The url is null. (15928)
2019-01-29 16:41:14,501 [Crawler-20190129142043-168-4] DEBUG The url is null. (12360)

.crawl_queue has over 12K items in it.

(from github.com/marevol)
First, the crawler seems to be blocked, not stopped.
So, you need to check the last accesses in fess-crawler.log.

My understanding was that the default crawler will not stop crawling until the crawl_queue is empty.

Documents in the meta indices are removed when the crawler finishes all crawling.
Therefore, the queue is not empty while the crawler is either crawling or blocked.
The number of documents in the meta indices does not help solve your problem.
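
One way to see whether the crawl is actually progressing is to poll the size of the queue index over time. Here is a minimal sketch, assuming Elasticsearch answers HTTP on localhost:9200 and that the queue index is the .crawl_queue index mentioned above; a shrinking count means the crawler is still working, while a frozen count means it is stuck.

import json
import time
import urllib.request

# Assumption: the crawler queue lives in an index named ".crawl_queue" and
# Elasticsearch is reachable over HTTP on localhost:9200.
ES_COUNT_URL = "http://localhost:9200/.crawl_queue/_count"

for _ in range(10):
    with urllib.request.urlopen(ES_COUNT_URL, timeout=10) as resp:
        count = json.load(resp)["count"]
    print(time.strftime("%H:%M:%S"), "queue size:", count)
    time.sleep(60)  # one sample per minute is enough to see whether the count moves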

(from github.com/abolotnov)
OK, thank you for still helping me through this struggle. I have collected 2.1 GB worth of debug logs, and it is not possible to review the whole thing. Can you recommend the keywords/phrases I should be looking for?

(from github.com/marevol)

So, you need to check the last accesses in fess-crawler.log.

(from github.com/abolotnov)
Like, literally "last access"? cat fess-crawler.log | grep -i "last access" yields nothing.

Grepping for just "access" gives over 40K lines.

There are a few like this: 2019-01-29 15:09:51,796 [CoreLib-TimeoutManager] DEBUG Failed to access Elasticsearch stats.

But most are just 2019-01-29 14:22:31,704 [Crawler-20190129142043-17-1] DEBUG Accessing https://www.antheminc.com/ type of messages

or 2019-01-29 14:43:26,019 [Crawler-20190129142043-150-5] DEBUG Processing accessResult: AccessResultImpl [id=null, sessionId=20190129142043-150, ruleId=webHtmlRule ... messages

or 2019-01-29 14:44:39,229 [Crawler-20190129142043-153-6] DEBUG Storing accessResult: AccessResultImpl [id=null, sessionId=20190129142043-153, ruleId=webHtmlRule ... messages

(from github.com/marevol)
No.

As mentioned, the problem is that the access was blocked.

2019-01-29 01:19:25,945 [Crawler-20190129003919-213-1] INFO Checking URL: http://www.oreillyauto.com/robots.txt

I think it’s better to put your log files somewhere I can see them.


(from github.com/abolotnov)
Sure, here’s the link to the log gz: https://www.dropbox.com/s/bxj61eyr05rswy6/crawler.log.gz?dl=0

Let me know if you want me to grep something out of it; it’s big :frowning:

(from github.com/marevol)

2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> GET /ro-en/marketplace/sitemap.xml HTTP/1.1
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Host: www.ibm.com
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Connection: Keep-Alive
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> Accept-Encoding: gzip,deflate
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "GET /ro-en/marketplace/sitemap.xml HTTP/1.1[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Host: www.ibm.com[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Connection: Keep-Alive[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "[\r][\n]"

The above log means there was no response from www.ibm.com.
So I think it’s a problem with your network or the like, not Fess.
If you use a t?.* instance type in AWS, you need to change to another type.
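
A quick way to test that outside of Fess is to fetch the same URL directly from the crawler host. A minimal sketch (the full URL is reconstructed from the log excerpt above, so treat the scheme and path as assumptions):

import urllib.request

# URL rebuilt from the request shown in the log; adjust scheme/path if needed.
url = "https://www.ibm.com/ro-en/marketplace/sitemap.xml"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(resp.status, resp.headers.get("Content-Type"), len(resp.read()), "bytes")
except Exception as e:
    print("request failed:", e)

If this also hangs or times out from the same box, the problem is the network path rather than the crawler.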

(from github.com/abolotnov)
I use c.* and m.* and they are on 10G networks.

I wonder what exactly indicates that the request was blocked though?

The same worker/thread also had this in the logs for the same host:

2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> GET /uk-en/marketplace/sitemap.xml HTTP/1.1
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Host: www.ibm.com
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Connection: Keep-Alive
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Cookie: PHPSESSID=i4t0aiohuj5hcmrj98nohtmti1
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> Accept-Encoding: gzip,deflate
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "GET /uk-en/marketplace/sitemap.xml HTTP/1.1[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Host: www.ibm.com[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Connection: Keep-Alive[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Cookie: PHPSESSID=i4t0aiohuj5hcmrj98nohtmti1[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "Accept-Encoding: gzip,deflate[\r][\n]"
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> "[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "HTTP/1.1 200 OK[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Backside-Transport: OK OK[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Encoding: gzip[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Type: text/xml; charset=utf-8[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "ETag: W/"721e6-2nmQElG0g2j8bIaNrL6mOty9wBU"[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Global-Transaction-ID: 3344377617[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Length: 27965[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Date: Tue, 29 Jan 2019 14:53:27 GMT[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Connection: keep-alive[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Vary: Accept-Encoding[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Robots-Tag: noindex[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-Content-Type-Options: nosniff[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "X-XSS-Protection: 1; mode=block[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Content-Security-Policy: upgrade-insecure-requests[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "Strict-Transport-Security: max-age=31536000[\r][\n]"
2019-01-29 14:53:27,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 << "[\r][\n]"
2019-01-29 14:53:2

(from github.com/abolotnov)
Does it make sense to play with Excluded Failure Type?

(from github.com/abolotnov)
I guess what I could try instead is to build a single EC2 image with a clean install of FESS configured against a remote Elasticsearch instance. I would get something big, maybe 96 CPUs or so, for a large index thread pool and all, then spin up 100 instances of FESS and have each work on 100 sites at a time; FESS seems to handle 100 sites fairly well. If I go with a1.large, it’s a fair deal, I guess. I only collect about 1,000 pages per site, so it should not take too long. What do you think?

(from github.com/abolotnov)
So, blocked connections. If a crawler attempts to make a connection and the connection fails, the assumed behaviour is that the site does not get crawled, right? And this explains why some of the web crawlers get 0 documents indexed.

Ok. I have 76 crawler threads running at this point and 0 documents indexed. Why are they all running if they aren’t crawling anything?

(from github.com/abolotnov)
If this were a network issue, wouldn’t they have dropped at a timeout a LONG time ago and moved on with crawling, or completed it? The default crawler is still running, and no documents have been collected for 6 hours:

(from github.com/marevol)

Ok. I have 76 crawler threads running at this point and 0 documents indexed.

A thread name is printed in fess-crawler.log.

2019-01-29 14:43:11,735 [Crawler-20190129142043-145-3] DEBUG http-outgoing-917 >> "[\r][\n]"
...
2019-01-29 14:53:26,163 [Crawler-20190129142043-145-3] DEBUG http-outgoing-1013 >> GET /uk-en/marketplace/sitemap.xml HTTP/1.1

Crawler-20190129142043-145-3 is a thread name.
If there are no other log lines from that thread between them, the access was blocked for about 10 minutes and then resumed.
So I think the problem is outside of Fess.
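
A small script can make that check concrete: pick one thread name and print any long gaps between its consecutive fess-crawler.log entries (a sketch; the log path and the 5-minute threshold are just examples).

import re
from datetime import datetime

LOG = "fess-crawler.log"                 # path is an example
THREAD = "Crawler-20190129142043-145-3"  # thread name taken from the excerpt above
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[" + re.escape(THREAD) + r"\]")

prev = None
with open(LOG, "r", errors="replace") as f:
    for line in f:
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev and (ts - prev).total_seconds() > 300:  # flag gaps longer than 5 minutes
            print("{} -> {}  gap of {:.0f}s".format(prev, ts, (ts - prev).total_seconds()))
        prev = ts

A 10-minute gap like the one between the two requests above, with nothing else logged by that thread in between, is exactly the kind of stall being described.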


(from github.com/abolotnov)
I am not giving up yet, and now I think I have figured out a way to demonstrate my issue better.

So I created 500 crawlers; they run and crawl around 350 websites. Some are duplicates, and some don’t crawl anything after parsing robots.txt, which is probably a different question.

But it’s the same story again:

the default crawler is still running:

but documents haven’t been collected into the index for some time:

so I downloaded the logs and wrote a tiny parser to see whether each website has been crawled or not:

# orgs_to_index comes from my own data: each entry has a .website URL to check.
from urllib.parse import urlparse

with open('/srv/genderfair/docs/fess-crawler.log', 'r') as log:
    for o in orgs_to_index:
        counter = 0
        lines = []
        phrase = urlparse(o.website).netloc
        log.seek(0)
        for line in log:
            if phrase in line:
                counter = counter + 1
                lines.append(line)
        if counter >= 3:
            pass
            # print("{}: {}".format(phrase, counter))
        else:
            # fewer than 3 mentions: the site was picked up but never actually crawled
            print("--- {}: {} ---".format(phrase, counter))
            for l in lines:
                print(l)

and here is the outcome:

--- btplc.com: 1 ---
2019-02-05 19:57:19,889 [WebFsCrawler] INFO  Target URL: http://btplc.com/

--- gsk.com: 1 ---
2019-02-05 19:57:22,363 [WebFsCrawler] INFO  Target URL: http://gsk.com/

--- lloydsbankinggroup.com: 1 ---
2019-02-05 19:57:23,724 [WebFsCrawler] INFO  Target URL: http://lloydsbankinggroup.com/

--- abc.xyz: 2 ---
2019-02-05 19:57:22,264 [WebFsCrawler] INFO  Target URL: http://abc.xyz/

2019-02-05 19:57:22,281 [WebFsCrawler] INFO  Target URL: http://abc.xyz/

--- abc.xyz: 2 ---
2019-02-05 19:57:22,264 [WebFsCrawler] INFO  Target URL: http://abc.xyz/

2019-02-05 19:57:22,281 [WebFsCrawler] INFO  Target URL: http://abc.xyz/

--- barclays.com: 1 ---
2019-02-05 19:57:19,365 [WebFsCrawler] INFO  Target URL: http://barclays.com/

--- bp.com: 1 ---
2019-02-05 19:57:19,848 [WebFsCrawler] INFO  Target URL: http://bp.com/

--- shop.nordstrom.com: 1 ---
2019-02-05 19:57:23,122 [WebFsCrawler] INFO  Target URL: http://shop.nordstrom.com/

--- cnhindustrial.com: 1 ---
2019-02-05 19:57:20,442 [WebFsCrawler] INFO  Target URL: http://cnhindustrial.com/

--- bat.com: 1 ---
2019-02-05 19:57:19,911 [WebFsCrawler] INFO  Target URL: http://bat.com/

--- astrazeneca.com: 1 ---
2019-02-05 19:57:19,086 [WebFsCrawler] INFO  Target URL: http://astrazeneca.com/

--- aon.co.uk: 1 ---
2019-02-05 19:57:18,754 [WebFsCrawler] INFO  Target URL: http://aon.co.uk/

--- ri.telefonica.com.br: 1 ---
2019-02-05 19:57:26,959 [WebFsCrawler] INFO  Target URL: http://ri.telefonica.com.br/

--- pearson.com: 1 ---
2019-02-05 19:57:24,911 [WebFsCrawler] INFO  Target URL: http://pearson.com/

--- diageo.com: 1 ---
2019-02-05 19:57:21,207 [WebFsCrawler] INFO  Target URL: http://diageo.com/

--- corporate.mattel.com: 1 ---
2019-02-05 19:57:23,771 [WebFsCrawler] INFO  Target URL: http://corporate.mattel.com/

--- bbu.brookfield.com: 1 ---
2019-02-05 19:57:19,306 [WebFsCrawler] INFO  Target URL: http://bbu.brookfield.com/en/

--- nationalgrid.com: 1 ---
2019-02-05 19:57:24,200 [WebFsCrawler] INFO  Target URL: http://nationalgrid.com/

--- ccep.com: 1 ---
2019-02-05 19:57:20,133 [WebFsCrawler] INFO  Target URL: http://ccep.com/

This basically means that, other than picking up the configuration for each site, the crawlers make no attempt to collect content. What should I be looking at to cure this?

(from github.com/abolotnov)
I am confident there are no environment or network issues causing this. I am on a 30 vCPU / 60 GB box, and unless there is some additional configuration that isn’t mentioned in the FESS documentation, I have a proper setup.

This will probably be my last attempt to get help fixing this. I am sorry, @marevol, for being all over the place with this. If you can’t help, I will move on to another crawler solution. I like FESS, but I can’t get it to work, and all these pointers to look into the logs are not very helpful.

(from github.com/marevol)
For other network configurations, it may be client.xml.
You can put it in /usr/share/fess/app/WEB-INF/classes/crawler to override the default.

(from github.com/abolotnov)
Here’s the lsof -i output with all established connections that have a fess PID. Does it give you any pointers?

java      24406            fess   19u  IPv6  75158      0t0  TCP *:http-alt (LISTEN)
java      24406            fess  451u  IPv6  77689      0t0  TCP localhost:57658->localhost:9300 (ESTABLISHED)
java      24406            fess  452u  IPv6  77690      0t0  TCP localhost:57656->localhost:9300 (ESTABLISHED)
java      24406            fess  453u  IPv6  77691      0t0  TCP localhost:57660->localhost:9300 (ESTABLISHED)
java      24406            fess  454u  IPv6  77692      0t0  TCP localhost:57662->localhost:9300 (ESTABLISHED)
java      24406            fess  455u  IPv6  77693      0t0  TCP localhost:57664->localhost:9300 (ESTABLISHED)
java      24406            fess  456u  IPv6  77694      0t0  TCP localhost:57666->localhost:9300 (ESTABLISHED)
java      24406            fess  457u  IPv6  80177      0t0  TCP localhost:57672->localhost:9300 (ESTABLISHED)
java      24406            fess  458u  IPv6  80178      0t0  TCP localhost:57668->localhost:9300 (ESTABLISHED)
java      24406            fess  459u  IPv6  83747      0t0  TCP localhost:57670->localhost:9300 (ESTABLISHED)
java      24406            fess  460u  IPv6  83748      0t0  TCP localhost:57674->localhost:9300 (ESTABLISHED)
java      24406            fess  461u  IPv6  83749      0t0  TCP localhost:57676->localhost:9300 (ESTABLISHED)
java      24406            fess  462u  IPv6  83750      0t0  TCP localhost:57680->localhost:9300 (ESTABLISHED)
java      24406            fess  463u  IPv6  83751      0t0  TCP localhost:57678->localhost:9300 (ESTABLISHED)
java      30204            fess  435u  IPv6 173482      0t0  TCP localhost:59638->localhost:9300 (ESTABLISHED)
java      30204            fess  437u  IPv6 173481      0t0  TCP localhost:59636->localhost:9300 (ESTABLISHED)
java      30204            fess  438u  IPv6 173483      0t0  TCP localhost:59640->localhost:9300 (ESTABLISHED)
java      30204            fess  439u  IPv6 173484      0t0  TCP localhost:59642->localhost:9300 (ESTABLISHED)
java      30204            fess  440u  IPv6 173485      0t0  TCP localhost:59644->localhost:9300 (ESTABLISHED)
java      30204            fess  441u  IPv6 173486      0t0  TCP localhost:59646->localhost:9300 (ESTABLISHED)
java      30204            fess  442u  IPv6 173487      0t0  TCP localhost:59648->localhost:9300 (ESTABLISHED)
java      30204            fess  443u  IPv6 173488      0t0  TCP localhost:59650->localhost:9300 (ESTABLISHED)
java      30204            fess  444u  IPv6 173489      0t0  TCP localhost:59652->localhost:9300 (ESTABLISHED)
java      30204            fess  445u  IPv6 173490      0t0  TCP localhost:59654->localhost:9300 (ESTABLISHED)
java      30204            fess  446u  IPv6 173491      0t0  TCP localhost:59656->localhost:9300 (ESTABLISHED)
java      30204            fess  447u  IPv6 173492      0t0  TCP localhost:59660->localhost:9300 (ESTABLISHED)
java      30204            fess  448u  IPv6 173493      0t0  TCP localhost:59658->localhost:9300 (ESTABLISHED)
java      30204            fess  642u  IPv6 180538      0t0  TCP localhost:59664->localhost:9300 (ESTABLISHED)
java      30204            fess  643u  IPv6 180539      0t0  TCP localhost:59666->localhost:9300 (ESTABLISHED)
java      30204            fess  644u  IPv6 180540      0t0  TCP localhost:59668->localhost:9300 (ESTABLISHED)
java      30204            fess  645u  IPv6 180541      0t0  TCP localhost:59670->localhost:9300 (ESTABLISHED)
java      30204            fess  646u  IPv6 180542      0t0  TCP localhost:59672->localhost:9300 (ESTABLISHED)
java      30204            fess  647u  IPv6 180543      0t0  TCP localhost:59674->localhost:9300 (ESTABLISHED)
java      30204            fess  648u  IPv6 180544      0t0  TCP localhost:59676->localhost:9300 (ESTABLISHED)
java      30204            fess  649u  IPv6 180545      0t0  TCP localhost:59678->localhost:9300 (ESTABLISHED)
java      30204            fess  650u  IPv6 180546      0t0  TCP localhost:59680->localhost:9300 (ESTABLISHED)
java      30204            fess  651u  IPv6 180547      0t0  TCP localhost:59682->localhost:9300 (ESTABLISHED)
java      30204            fess  652u  IPv6 180548      0t0  TCP localhost:59684->localhost:9300 (ESTABLISHED)
java      30204            fess  653u  IPv6 180550      0t0  TCP localhost:59688->localhost:9300 (ESTABLISHED)
java      30204            fess  654u  IPv6 180549      0t0  TCP localhost:59686->localhost:9300 (ESTABLISHED)
java      30204            fess  657u  IPv6 228296      0t0  TCP ip-172-31-20-132.us-west-2.compute.internal:38480->a184-86-195-29.deploy.static.akamaitechnologies.com:http (ESTABLISHED)
java      30204            fess  658u  IPv6 200246      0t0  TCP ip-172-31-20-132.us-west-2.compute.internal:47044->a23-44-161-61.deploy.static.akamaitechnologies.com:http (ESTABLISHED)
java      30204            fess  659u  IPv6 247999      0t0  TCP ip-172-31-20-132.us-west-2.compute.internal:39516->a23-198-158-184.deploy.static.akamaitechnologies.com:http (ESTABLISHED)
java      30204            fess  666u  IPv6 183483      0t0  TCP ip-172-31-20-132.us-west-2.compute.internal:45490->a23-62-73-167.deploy.static.akamaitechnologies.com:http (ESTABLISHED)
java      30204            fess  674u  IPv6 296052      0t0  TCP ip-172-31-20-132.us-west-2.compute.internal:52440->a2-19-139-209.deploy.static.akamaitechnologies.com:http (ESTABLISHED)
java      30204            fess  678u  IPv6 144320      0t0  TCP ip-172-31-20-132.us-west-2.compute.internal:51350->a23-198-154-40.deploy.static.akamaitechnologies.com:http (ESTABLISHED)
java      30204            fess  679u  IPv6 226760      0t0  TCP ip-172-31-20-132.us-west-2.compute.internal:46396->a23-49-12-226.deploy.static.akamaitechnologies.com:http (ESTABLISHED)

(from github.com/abolotnov)
More:

ubuntu@ip-172-31-20-132:~$ ps -u fess
   PID TTY          TIME CMD
 24406 ?        00:02:28 java
 30204 ?        00:07:15 java
ubuntu@ip-172-31-20-132:~$ ps huH p 24406|wc -l
161
ubuntu@ip-172-31-20-132:~$ ps huH p 30204|wc -l
232

Doesn’t look like it has anything to do with network settings. My crawler config is:

Simultaneous crawler configs is set to 20, and each crawler is configured with 3 threads. It doesn’t look like a network max-connections issue.