Commit Graph

263 Commits (e4ab6e27e4fd127d1feeea862a7ff10eb91c2ae7)

Author SHA1 Message Date
yihua.huang 7586e3d75c add some test for github repo downloader 9 years ago
Yihua Huang cfde3b7657 Merge pull request #237 from SpenceZhou/master
Update RedisScheduler.java
9 years ago
SpenceZhou 165e5a72eb Update RedisScheduler.java
Fix a bug in RedisScheduler when getting the total crawl count
9 years ago
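x1ny 90e14b31b0 Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not shut down after the crawler finishes, so the program cannot exit; the I/O streams are also never closed.

Solution:
Make FileCacheQueueScheduler implement the Closeable interface and shut down the thread and the streams in its close method.
In Spider's close method, add a call that closes the scheduler.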
9 years ago
edwardsbean 74962d69b9 fix bug:MultiPagePipeline and DoubleKeyMap concurrent bug 10 years ago
dolphineor 7628dc6b63 move PhantomJSDownloader to webmagic-extension 10 years ago
yihua.huang 8551b668a0 remove commented code 11 years ago
zhugw eb3c78b9d8 Update FileCacheQueueScheduler.java
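Isn't this more rigorous? Otherwise, when restarting after an interruption, the (first) entry URL would still be added to the queue and written to the file.
However, another problem now exists: suppose the first pass has crawled everything (determined by spider.getStatus==Stopped), sleeps for 24 hours, and then crawls again (by recursively calling the crawl method).
This differs from restarting after an interruption: lineReader==cursor, so the queue is empty at initialization, and the entry URL is already in the urls set, so the crawl thread ends immediately. That leaves no way to crawl new content on the site.
Solution 1:
Once crawling is determined to be complete, immediately overwrite the cursor file. On the second crawl the cursor is 0, so all URLs in urls.txt are put into the queue, and new URLs can be discovered through them.
Solution 2:
An optimization of Solution 1. Solution 1 meets the business requirement but does a lot of wasted work, such as downloading, extracting, and persisting all the old target URLs again. New content usually shows up under HelpUrl, for example a new post on some page, or a few more pages of content. So from the second crawl onward, only the HelpUrl entries need to be put into the queue.

I would appreciate feedback: is my understanding above correct? Are there cases I have not considered, or is there a simpler solution? Thanks!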
11 years ago
yihua.huang 42a30074c9 update urls.contains to DuplicateRemover in FileCacheQueueScheduler #157 11 years ago
zhugw 1db940a088 Update FileCacheQueueScheduler.java
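While using it, I found duplicate URLs in the urls.txt file. Tracing the source code showed that after the file is loaded at initialization, all URLs are read into a set, but when URLs to be crawled are added later, there is no check for whether they already exist in that set (i.e. in the file), which leads to duplicate URLs in the file. I modified the source accordingly and would ask the author to review it.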
11 years ago
yihua.huang 3734865a6a fix package name =.= 11 years ago
yihua.huang e7668e01b8 fix SourceRegion error and add some tests on it #144 11 years ago
yihua.huang 4e5ba02020 fix test cont' 11 years ago
yihua.huang 2fd8f05fe2 change path seperator for varient OS #139 11 years ago
yihua.huang 8c33be48a6 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 5f8c3fd5c5 update version 11 years ago
yihua.huang 928f98dd93 auto create folder in JsonFilePipeline #122 11 years ago
yihua.huang 7fbe18b8c0 implementation of PageMapper #120 11 years ago
yihua.huang 5dc9fe95a9 interface of PageMapper #120 11 years ago
yihua.huang 7668731f08 update version to snapshot 11 years ago
yihua.huang 182dd51689 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 81e6e772ac versions back to 0.5.1 11 years ago
yihua.huang feb604da87 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 358e906379 [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 470750fc0d [maven-release-plugin] prepare release WebMagic-0.5.1 11 years ago
yihua.huang 186b90512e refactor redisscheduler #118 11 years ago
yihua.huang d1140b9e29 add bloom filter for scheduler #118 11 years ago
yihua.huang e8d4a9be2b fix remove duplicate error #117 11 years ago
yihua.huang 04ade75606 Merge branch 'stable' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-avalon/pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-lucene/pom.xml
	webmagic-samples/pom.xml
	webmagic-saxon/pom.xml
	webmagic-scripts/pom.xml
	webmagic-selenium/pom.xml
11 years ago
yihua.huang a08d8cb167 update verion 11 years ago
yihua.huang 42a2676e8c update version 11 years ago
yihua.huang c25b32f1ca [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 7ff83bb11a [maven-release-plugin] prepare release WebMagic-0.5.0 11 years ago
yihua.huang 1104122979 more abstraction in scheduler 11 years ago
yihua.huang b0fb1c3e10 remove copy-dependcies plugin for m2e error 11 years ago
yihua.huang 94a67165e1 remove jmx server for simplify #98 11 years ago
yihua.huang 86a45a6643 change SpiderMonitor to singleton #98 11 years ago
yihua.huang ab4d36806e clean code 11 years ago
yihua.huang 04fde8203b add control for monitor 11 years ago
yihua.huang 2770811a10 update monitor example 11 years ago
yihua.huang 17e95f2a7f comments 11 years ago
yihua.huang 375e64e845 more monitor status 11 years ago
yihua.huang c6661899fd new thread pool #110 11 years ago
yihua.huang 179baa7a22 return when page is null 11 years ago
yihua.huang 4738ae2d14 change url find to match #94 11 years ago
yihua.huang f973889cda refactor subpageprossor etc. #94 11 years ago
yihua.huang acb63d55d7 some check and example #98 11 years ago
yihua.huang 11ba5beb42 [refactor]move monitor to webmagic-extension #98 11 years ago
yihua.huang b06aa489fb [BugFix]Only one url from sourceRegion can be extracted #107 11 years ago
yihua.huang 023c2ac84e spider config draft 11 years ago
yihua.huang a5db6cf292 some monitor and JMX support #98 11 years ago
yihua.huang aae1ab2cd6 fix compile error 11 years ago
yihua.huang 1fbfc92de2 Inherit support of Field annotation in Model #103 11 years ago
yihua.huang a03f6a8431 eclipse project 11 years ago
yihua.huang 3a79b1b64a [Bugfix]formatter property does not work when field is String#100 11 years ago
Yihua Huang cc9d319fd9 Merge pull request #94 from sebastian1118/master
update:PatternHandler
11 years ago
yihua.huang 03c251237b add Json parse support 11 years ago
Tian 99e12aafaa update:PatternHandler 11 years ago
yihua.huang c1e7207869 add FileCacheQueueScheduler support for cycleRetryTimes 11 years ago
yihua.huang 969ad1766b change logger style to slf4j for cleaner code 11 years ago
yihua.huang 9b2cb43f47 ConfigurablePageProcessor #91 11 years ago
Bo LIANG 159eeea2f5 Remove unused variable to make the project cleaner. 11 years ago
yihua.huang c143fc662c add SubPageProcessor #86 11 years ago
Yihua Huang 474f785dab Merge pull request #86 from sebastian1118/master
new feature: PatternProcessor
11 years ago
Tian 38a12f8641 new feature: PatternProcessor 11 years ago
yihua.huang dafd0b5875 [BugFix]multi model in one pageprocessor will be skipped #85 11 years ago
yihua.huang a1c7e826f7 fix dep of slf4j-log4j12 11 years ago
yihua.huang f3c2503a29 add warning of slf4j #78 11 years ago
yihua.huang 8958d774f2 add default values for @Formatter 11 years ago
yihua.huang 6c11718566 Clean project structure #70 11 years ago
yihua.huang 98e2bba099 Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-scripts/pom.xml
11 years ago
yihua.huang 757cc9b942 [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 63ffb5c792 [maven-release-plugin] prepare release webmaigc-0.4.3 11 years ago
yihua.huang d5a978e00f update version back to 0.4.3 11 years ago
yihua.huang 0e98183f74 Change log4j to slf4j #55 11 years ago
yihua.huang fa33b15843 property loader 11 years ago
yihua.huang 362fdd0662 Merge branch 'master' of github.com:code4craft/webmagic 11 years ago
yihua.huang af809c4d55 update version to 0.5.0-snapshot 11 years ago
jon a722f9bb66 Fix NullPointerException thrown because the fileCursor file in FileCacheQueueScheduler was not initialized when reopened 11 years ago
yihua.huang 12a6390cbd update spring4 configuration 11 years ago
yihua.huang fc97cb58c5 update lib and version 11 years ago
yihua.huang d274310cb2 [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang e8c32a32dc [maven-release-plugin] prepare release webmagic-0.4.2 11 years ago
yihua.huang 486d9d276f #45 Remove multi in ExtractBy 11 years ago
yihua.huang e7083dc39d [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang ae623567b3 [maven-release-plugin] prepare release webmagic-0.4.1 11 years ago
yihua.huang 18a3af4a0a add more sample for jsonpath #42 11 years ago
yihua.huang 59ad4cad27 #42 Add jsonpath in annotation mode for json result 11 years ago
yihua.huang cf62d707e0 #36 Spider does not exit when success 11 years ago
yihua.huang a01312930a #39 Parsing html after page.getHtml() 11 years ago
yihua.huang f9daae39cf [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang fdb9441519 [maven-release-plugin] prepare release webmagic-0.4.0 11 years ago
yihua.huang 1d75ae7f5b rollback version to 0.4.0 because not deploy success 11 years ago
yihua.huang b838c4e433 #34 Close reader in FileCacheQueueScheduler 11 years ago
yihua.huang 775eb9732f [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 0b4fadc24d [maven-release-plugin] prepare release webmagic-0.4.0 11 years ago
yihua.huang fd6d2fd6f8 try to keepalive TCP connection 11 years ago
yihua.huang 425df08523 update version to 0.4.0 11 years ago
yihua.huang e046bb0723 remove useless code 11 years ago
yihua.huang 6e32a19f80 update api for direct download 11 years ago
yihua.huang 807aefe9df change EntityUtil to IOUtil because some encoding error 11 years ago
yihua.huang 8f774afc84 add direct download 11 years ago
yihua.huang 2e496402dc add more warning for 0.3.3 11 years ago
yihua.huang 1a2c84ea78 #27 add timeout config to site 11 years ago
yihua.huang 3b00190f99 api without implementation for #28: add specific url crawl 11 years ago
yihua.huang 4acbc19cee [maven-release-plugin] prepare for next development iteration 12 years ago
yihua.huang cc3b787991 [maven-release-plugin] prepare release webmagic-0.3.2 12 years ago
yihua.huang 6f18eec77e fix a test error 12 years ago
yihua.huang b131878123 add example 12 years ago
yihua.huang 95ab4edec3 some bugfix 12 years ago
yihua.huang 250cc5e662 change formatter to class 12 years ago
yihua.huang b18216245b add type convert 12 years ago
yihua.huang fb693a4ac4 [maven-release-plugin] prepare for next development iteration 12 years ago
yihua.huang bfaaa042b9 [maven-release-plugin] prepare release webmagic-parent-0.3.1 12 years ago
yihua.huang d7c7a78177 complete test cases 12 years ago
yihua.huang c17a31a21d fix null pointe exception #26 12 years ago
yihua.huang e7bf425df4 [maven-release-plugin] prepare for next development iteration 12 years ago
yihua.huang 77ff252316 [maven-release-plugin] prepare release webmagic-0.3.0 12 years ago
yihua.huang d141541ef3 add retry 12 years ago
yihua.huang aefd0569a5 update version 12 years ago
yihua.huang 194518fd82 add switch 12 years ago
yihua.huang 326b97c65a update 12 years ago
yihua.huang d7cd9e5747 update pom 12 years ago
yihua.huang 478ace7e97 add FilePageModelPipeline 12 years ago
yihua.huang ad66d33f38 [maven-release-plugin] prepare for next development iteration 12 years ago
yihua.huang 9dc6b11954 [maven-release-plugin] prepare release webmagic-parent-0.2.1 12 years ago
yihua.huang 4f62dfc8a4 release 12 years ago
yihua.huang 74c940c758 [maven-release-plugin] prepare for next development iteration 12 years ago
yihua.huang a4bb4e3429 [maven-release-plugin] prepare release webmagic-parent-0.2.1 12 years ago
yihua.huang 194f16aa75 update 12 years ago
yihua.huang 09ffd468c0 fix comments 12 years ago
yihua.huang c70ed57025 remove PriorityScheduler to core 12 years ago
yihua.huang 7003426898 update pom 12 years ago
yihua.huang 606417fdc7 update pom 12 years ago
yihua.huang d460e136ef update version 12 years ago
yihua.huang c79d6ecf09 complete all comments 12 years ago
yihua.huang 5073258237 closable 12 years ago
yihua.huang 5f1f4cbc46 update comments 12 years ago
yihua.huang 6cc1d62a08 bugfix: rawhtml do not work 12 years ago
yihua.huang a994b1c9fd complete extension comments in en 12 years ago
yihua.huang c59c1fe80d update comments 12 years ago
yihua.huang 59aad6a7f4 comments in english 12 years ago
yihua.huang e566a53936 update ignore test 12 years ago
yihua.huang 1148450ff9 update filecache to more useful 12 years ago
yihua.huang 3ba7a76f44 add combo extract to replace Extract2 Extract3... 12 years ago
yihua.huang 5cb45af3a4 +doc 12 years ago
yihua.huang a339e4ab5c add jsonpathselector 12 years ago
yihua.huang 9e82256ce3 update docs 12 years ago
yihua.huang f21097421b add new constructor to redisscheduler 12 years ago
yihua.huang 0f2c5b5723 update redisscheduler 12 years ago