This article walks through how Apache Nutch handles the generate.max.count parameter, following the relevant source code step by step.
The generate.max.count parameter is handled in Selector, an inner class of org.apache.nutch.crawl.Generator.
The relevant variable declarations in org.apache.nutch.crawl.Generator:
private HashMap<String, int[]> hostCounts = new HashMap<String, int[]>();
private int maxCount;
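The hostCounts map can be read as one small mutable record per host or domain. Judging from how the array is used in the reduce code of this article, index 0 holds the host's current segment number (starting at 1) and index 1 holds the number of URLs counted for that host in that segment. The following is a minimal sketch of that pattern, not Nutch code; the class HostCountSketch and the helper stateFor are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class HostCountSketch {

  // Hypothetical helper: returns the per-host state, creating {1, 0}
  // (segment 1, zero URLs counted) the first time a host is seen.
  static int[] stateFor(Map<String, int[]> hostCounts, String host) {
    return hostCounts.computeIfAbsent(host, h -> new int[] { 1, 0 });
  }

  public static void main(String[] args) {
    Map<String, int[]> hostCounts = new HashMap<>();

    int[] state = stateFor(hostCounts, "a.example");
    state[1]++; // mutates the array stored in the map, no second put() needed

    int[] again = stateFor(hostCounts, "a.example");
    System.out.println("segment=" + again[0] + ", count=" + again[1]);
  }
}
```

Storing an int[] instead of an Integer lets the code update the counters in place, which is why the excerpt only calls put once per host.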
In the configure method of the inner class Selector:
maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
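The call above reads the setting with a default of -1 when the property is absent. The sketch below imitates that lookup with java.util.Properties so it runs without Hadoop on the classpath. The class MaxCountConfigSketch and its helpers are hypothetical stand-ins, and treating a non-positive value as "no per-host limit" is an assumption made here for illustration rather than something shown in the excerpt.

```java
import java.util.Properties;

class MaxCountConfigSketch {

  static final String GENERATOR_MAX_COUNT = "generate.max.count";

  // Hypothetical stand-in for job.getInt(name, defaultValue).
  static int getInt(Properties conf, String name, int defaultValue) {
    String raw = conf.getProperty(name);
    return raw == null ? defaultValue : Integer.parseInt(raw.trim());
  }

  // Assumption: only a positive value switches the per-host limit on.
  static boolean limitEnabled(int maxCount) {
    return maxCount > 0;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    int unset = getInt(conf, GENERATOR_MAX_COUNT, -1);

    conf.setProperty(GENERATOR_MAX_COUNT, "100");
    int set = getInt(conf, GENERATOR_MAX_COUNT, -1);

    System.out.println(unset + " " + limitEnabled(unset));
    System.out.println(set + " " + limitEnabled(set));
  }
}
```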
Handling in the reduce method:
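/*
 * 1. Look up the int[] for the current host or domain. If it is null,
 *    create a new array and put it in the map. Then increment the
 *    second element of the array.
 */
// 1
int[] hostCount = hostCounts.get(hostordomain);
if (hostCount == null) {
  hostCount = new int[] { 1, 0 };
  hostCounts.put(hostordomain, hostCount);
}
hostCount[1]++; // increment hostCount
// 2. Check whether the segment indexed by hostCount[0] has already reached
//    the topN limit, i.e. segCounts[hostCount[0] - 1] >= limit.
// check if topN reached, select next segment if it is
while (segCounts[hostCount[0] - 1] >= limit // segCounts: URLs selected so far per segment
    && hostCount[0] < maxNumSegments) {
  hostCount[0]++;
  hostCount[1] = 0;
}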
// reached the limit of allowed URLs per host / domain
// see if we can put it in the next segment?
if (hostCount[1] >= maxCount) {
  if (hostCount[0] < maxNumSegments) {
    hostCount[0]++;
    hostCount[1] = 0;
  } else {
    if (hostCount[1] == maxCount + 1
        && LOG.isInfoEnabled()) {
      LOG.info("Host or domain "
          + hostordomain
          + " has more than "
          + maxCount
          + " URLs for all "
          + maxNumSegments
          + " segments. Additional URLs won't be included in the fetchlist.");
    }
    // skip this entry
    continue;
  }
}
entry.segnum = new IntWritable(hostCount[0]);
segCounts[hostCount[0] - 1]++;
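To see how the pieces of the reduce logic interact, it can help to replay them on a small in-memory list. The sketch below copies the branching of the excerpt above into a standalone method; it is a simulation under stated assumptions, not the Nutch implementation. The class MaxCountSketch, the method assignSegments, and the convention of returning 0 for a skipped URL are all hypothetical, and the logging branch is left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class MaxCountSketch {

  // Returns, for each input URL's host, the segment it is assigned to,
  // or 0 if the URL is skipped (hypothetical convention for this sketch).
  static int[] assignSegments(String[] hosts, int maxCount, long limit,
      int maxNumSegments) {
    Map<String, int[]> hostCounts = new HashMap<>();
    long[] segCounts = new long[maxNumSegments]; // URLs selected per segment
    int[] result = new int[hosts.length];

    for (int i = 0; i < hosts.length; i++) {
      int[] hostCount = hostCounts.get(hosts[i]);
      if (hostCount == null) {
        hostCount = new int[] { 1, 0 }; // {segment number, URL count}
        hostCounts.put(hosts[i], hostCount);
      }
      hostCount[1]++;

      // topN reached for this segment: move the host to the next one
      while (segCounts[hostCount[0] - 1] >= limit
          && hostCount[0] < maxNumSegments) {
        hostCount[0]++;
        hostCount[1] = 0;
      }

      // per-host limit reached: try the next segment, otherwise skip
      if (hostCount[1] >= maxCount) {
        if (hostCount[0] < maxNumSegments) {
          hostCount[0]++;
          hostCount[1] = 0;
        } else {
          continue; // result[i] stays 0
        }
      }
      result[i] = hostCount[0];
      segCounts[hostCount[0] - 1]++;
    }
    return result;
  }

  public static void main(String[] args) {
    String[] hosts = { "a.example", "a.example", "a.example",
        "b.example", "a.example", "a.example" };
    // maxCount = 2, limit = 100 per segment, two segments
    System.out.println(Arrays.toString(assignSegments(hosts, 2, 100, 2)));
    // prints [1, 2, 2, 1, 0, 0]
  }
}
```

Tracing the excerpt exactly as quoted, the second URL of a.example is already pushed to segment 2, because the counter is incremented before it is compared with maxCount using >=. Once the last segment is full for that host, further URLs are skipped.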