Commit Graph

263 Commits (e4ab6e27e4fd127d1feeea862a7ff10eb91c2ae7)

Author SHA1 Message Date
yihua.huang f5018d569e remove test_issue409 #409 8 years ago
yihua.huang 17ae500c77 test_issue409 #409 8 years ago
yihua.huang 4cd5b4f93e test_issue409 8 years ago
yihua.huang fb0acd710c complete SimpleHttpClientTest 8 years ago
yihua.huang d07941d900 SimpleHttpClientTest 8 years ago
yihua.huang f02f469c69 add test #570 8 years ago
yihua.huang 2d693580fc add test 8 years ago
yihua.huang b879b0eed0 fix redisscheduler #583 8 years ago
Yihua Huang 9903f0367d Merge pull request #570 from SoulZhong/master
Fix bug where formatter initialization was not passed its parameters
8 years ago
yihua.huang 2e35e149be for 0.7.1 8 years ago
yihua.huang 17478fcfc4 0.7.0 release 8 years ago
yihua.huang 49de9374cd new SimpleHttpClient #576 8 years ago
yihua.huang 7ffc6998ef add isExtractLinks to OOSpider #575 8 years ago
soul bc828e1384 Fix bug where formatter initialization was not passed its parameters 8 years ago
yihua.huang a8c2e6c729 alpha release 8 years ago
yihua.huang 3c1338193b for 0.7.0.alpha 8 years ago
yihua.huang d38d51dfcb fix javadoc 8 years ago
yihua.huang 1b04a7f2b3 #527 move logic check from downloader to spider 8 years ago
yihua.huang 74110e6ec5 remove useless file 8 years ago
yihua.huang b100dfe273 update version 8 years ago
yihua.huang c13110c4cb fix samples 8 years ago
yihua.huang d87c73b472 change check-and-set to atomic sadd for redis DuplicateRemover #368 8 years ago
yihua.huang aaccc93215 new version 8 years ago
yihua.huang 3e633c6871 version 8 years ago
yihua.huang f45e2f118b for release 8 years ago
Jsbd 6d78d51fc0 Merge branch 'master' into master 8 years ago
yihua.huang d69204b919 0.6.0 8 years ago
yihua.huang 97592d6720 Version 0.6.0 8 years ago
yihua.huang 00dfebbceb #424 remove guava dep and add fix docs 8 years ago
yihua.huang a960a39c44 fix compile error for example change 8 years ago
yihua.huang a3ee9e3d08 fix example 8 years ago
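Jsbd 1b886d48a2 Add a PhantomJSDownloader constructor that supports a custom crawl.js path, because when another project depends on this jar, the phantomjs command run via runtime.exec() cannot use the crawl.js inside the jar 8 years ago
Jsbd d1f2e65e5d Add a PhantomJSDownloader constructor that supports a custom crawl.js path, because when another project depends on this jar, the phantomjs command run via runtime.exec() cannot use the crawl.js inside the jar 8 years ago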
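Jsbd ebc61363c8 Add a new constructor to PhantomJSDownloader that supports a custom phantomjs command
Add a new constructor to PhantomJSDownloader that supports a custom phantomjs command
 example: 
   *    phantomjs.exe: supports the Windows environment
   *    phantomjs --ignore-ssl-errors=yes: ignores some errors when the crawled address is https
   *    /usr/local/bin/phantomjs: absolute path of the command, which avoids an IOException caused by system environment variables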
8 years ago
yihua.huang b92e6b04f0 #400 fix NPE caused by setting a custom DuplicateRemover on FileCacheQueueScheduler 8 years ago
Yihua Huang 1491033534 Merge pull request #377 from jerry-sc/monitor-bug
fix the monitor bug which the spider will terminate when a seed url with port
8 years ago
Jerry e56b8c3efc fix the monitor bug which the spider will terminate when a seed url with port 9 years ago
郭玉昆 700898fe8a fixed #301 fix the problem with extracting JSON data using annotations 9 years ago
Salon.sai f89a6a6826 add: redis scheduler with priority 9 years ago
Yihua Huang 37cb43b667 Merge pull request #176 from lavenderx/master
add PhantomJSDownloader
9 years ago
Linker Lin 047cb8ff8f updated versions to 0.5.4-SNAPSHOT 9 years ago
yihua.huang c0b8e8f8ae remove .classpath .project 9 years ago
yihua.huang a8e6de4b90 Merge branch 'master' of git.oschina.net:flashsword20/webmagic 9 years ago
yihua.huang ce5495ecd5 remove useless files 9 years ago
yihua.huang 8265c7dade remove submodules for release 9 years ago
yihua.huang 7edfa26f90 complete javadoc 9 years ago
yihua.huang 8b90b91e33 complete some javadoc 9 years ago
yihua.huang 2b556cf053 update version to 0.5.3-SNAPSHOT 9 years ago
yihua.huang 9c5716a543 complete javadoc 9 years ago
yihua.huang db3cbf6ca5 update version to 0.5.3-SNAPSHOT 9 years ago
yihua.huang 7586e3d75c add some test for github repo downloader 9 years ago
Yihua Huang cfde3b7657 Merge pull request #237 from SpenceZhou/master
Update RedisScheduler.java
9 years ago
SpenceZhou 165e5a72eb Update RedisScheduler.java
Fix bug in getting the total crawl count in redisscheduler
9 years ago
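x1ny 90e14b31b0 Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not shut down after the crawler finishes, so the program cannot exit; the IO streams are also never closed.

Solution:
Make FileCacheQueueScheduler implement the Closeable interface and shut down the thread and the streams in its close method.
Add a call that closes the scheduler to Spider's close method.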
9 years ago
edwardsbean 74962d69b9 fix bug:MultiPagePipeline and DoubleKeyMap concurrent bug 10 years ago
dolphineor 7628dc6b63 move PhantomJSDownloader to webmagic-extension 10 years ago
yihua.huang 8551b668a0 remove commented code 11 years ago
zhugw eb3c78b9d8 Update FileCacheQueueScheduler.java
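Isn't this more rigorous? Otherwise, when restarting after an interruption, the (first) entry URL is still added to the queue and written to the file.
However, another problem remains. Suppose the first pass has finished crawling everything (determined by spider.getStatus==Stopped), then sleeps for 24 hours and crawls again (by recursively calling the crawl method).
Unlike a restart after an interruption, here lineReader==cursor, so the queue is empty at initialization and the entry URL is already in the urls set, which causes the crawl thread to finish immediately. As a result, there is no way to crawl newly added content on the site.
Solution 1:
After detecting that crawling has finished, immediately overwrite the cursor file. On the second crawl the cursor is 0, so all URLs in urls.txt are put into the queue, and new URLs can be discovered through them.
Solution 2:
An optimization of Solution 1. Although Solution 1 meets the business requirement, it does a lot of wasted work, such as downloading, extracting, and persisting all the old target URLs again. New content generally appears under HelpUrl, for example a new post on some page or a few more pages of content. So on the second and later crawls, only the HelpUrl entries need to be put into the queue.

I would appreciate feedback: is my understanding above correct, are there cases I have not considered, or is there a simpler solution? Thanks!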
11 years ago
yihua.huang 42a30074c9 update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157 11 years ago
zhugw 1db940a088 Update FileCacheQueueScheduler.java
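While using it I found duplicate URLs in the urls.txt file. Tracing the source code showed that after the file is loaded at initialization, all URLs are read into a set, but when URLs to crawl are added later there is no check for whether they already exist in that set (that is, in the file), which leads to duplicate URLs in the file. I have modified the source accordingly and would ask the author to review it.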
11 years ago
yihua.huang 3734865a6a fix package name =.= 11 years ago
yihua.huang e7668e01b8 fix SourceRegion error and add some tests on it #144 11 years ago
yihua.huang 4e5ba02020 fix test cont' 11 years ago
yihua.huang 2fd8f05fe2 change path separator for variant OS #139 11 years ago
yihua.huang 8c33be48a6 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 5f8c3fd5c5 update version 11 years ago
yihua.huang 928f98dd93 auto create folder in JsonFilePipeline #122 11 years ago
yihua.huang 7fbe18b8c0 implementation of PageMapper #120 11 years ago
yihua.huang 5dc9fe95a9 interface of PageMapper #120 11 years ago
yihua.huang 7668731f08 update version to snapshot 11 years ago
yihua.huang 182dd51689 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 81e6e772ac versions back to 0.5.1 11 years ago
yihua.huang feb604da87 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 358e906379 [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 470750fc0d [maven-release-plugin] prepare release WebMagic-0.5.1 11 years ago
yihua.huang 186b90512e refactor redisscheduler #118 11 years ago
yihua.huang d1140b9e29 add bloom filter for scheduler #118 11 years ago
yihua.huang e8d4a9be2b fix remove duplicate error #117 11 years ago
yihua.huang 04ade75606 Merge branch 'stable' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-avalon/pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-lucene/pom.xml
	webmagic-samples/pom.xml
	webmagic-saxon/pom.xml
	webmagic-scripts/pom.xml
	webmagic-selenium/pom.xml
11 years ago
yihua.huang a08d8cb167 update version 11 years ago
yihua.huang 42a2676e8c update version 11 years ago
yihua.huang c25b32f1ca [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 7ff83bb11a [maven-release-plugin] prepare release WebMagic-0.5.0 11 years ago
yihua.huang 1104122979 more abstraction in scheduler 11 years ago
yihua.huang b0fb1c3e10 remove copy-dependcies plugin for m2e error 11 years ago
yihua.huang 94a67165e1 remove jmx server for simplify #98 11 years ago
yihua.huang 86a45a6643 change SpiderMonitor to singleton #98 11 years ago
yihua.huang ab4d36806e clean code 11 years ago
yihua.huang 04fde8203b add control for monitor 11 years ago
yihua.huang 2770811a10 update monitor example 11 years ago
yihua.huang 17e95f2a7f comments 11 years ago
yihua.huang 375e64e845 more monitor status 11 years ago
yihua.huang c6661899fd new thread pool #110 11 years ago
yihua.huang 179baa7a22 return when page is null 11 years ago
yihua.huang 4738ae2d14 change url find to match #94 11 years ago
yihua.huang f973889cda refactor subpageprossor etc. #94 11 years ago
yihua.huang acb63d55d7 some check and example #98 11 years ago
yihua.huang 11ba5beb42 [refactor]move monitor to webmagic-extension #98 11 years ago
yihua.huang b06aa489fb [BugFix]Only one url from sourceRegion can be extracted #107 11 years ago
yihua.huang 023c2ac84e spider config draft 11 years ago