631 Commits (e2cc7402ffb4759a75bfabd3905c1b79a6eace9d)

Author SHA1 Message Date
yihua.huang 6bd1eed25e fix duplicate call of onSuccess and onError
yihua.huang 3a589d4ca6 HttpRequestBody implements Serializable
yihua.huang 13cdf82695 update version to 0.7.2-SNAPSHOT
yihua.huang 1e9187f24e version 0.7.1
yihua.huang 592fa2c0f1 add site header test
yihua.huang 19d34dbb65 not add bracket to regex in RegexSelector
Yihua Huang b6b991a09b Merge pull request from zhuyuesut/master
Add support for zero-width assertions
yihua.huang bb0eb69acf update ZhihuPageProcessor example
yihua.huang 2e35e149be for 0.7.1
yihua.huang 17d8bfa907 docs and pgp version
yihua.huang 17478fcfc4 0.7.0 release
yihua.huang 636359300f add Site.disableCookieManagement
yihua.huang 49de9374cd new SimpleHttpClient
yihua.huang 8999ea9320 add public constructor for SimpleProxyProvider
yihua.huang a8c2e6c729 alpha release
yihua.huang 3c1338193b for 0.7.0.alpha
yihua.huang e8abc28072 add some log when crawler stop
zhuyue 9e1b7ed3f7 Update RegexSelector.java
zhuyue c80f25edbd Update RegexSelectorTest.java
Added a few simple tests
zhuyue c3183252ac Update RegexSelector.java
yihua.huang cbf80af5dd test for SimpleProxyProvider
yihua.huang eb632a93d3 SimpleProxyProvider
yihua.huang d38d51dfcb fix javadoc
GZhY 5f34adf938 Improve the test cases for LinksSelector.selectList
GZhY ce3f0ac239 Remove fixAllRelativeHrefs and fix SeleniumDownloader's dependency on fixAllRelativeHrefs
GZhY bc6e81e00f Fix an error in the comment of the checkElementAndConvert method
yihua.huang 4a2c0f4f97 add returnProxy for proxyProvider
yihua.huang 1b04a7f2b3 move logic check from downloader to spider
yihua.huang 0f4d6e8b12 remove port in UrlUtils.getDomain()
yihua.huang a1ae632b62 test for request cookies and headers
yihua.huang db67db8103 remove fixAllRelativeHrefs by default, get absolute urls for links()
yihua.huang abd020b45b some comments
yihua.huang 2622b448b8 fix test
yihua.huang b06a248c00 fix test
yihua.huang 1cfbd13aae refactor in httpclientdownloader
yihua.huang 83ada9749e fix test
yihua.huang fe95a6842f Refactor Request again: remove params, keep only HttpRequestBody
yihua.huang 395396c68e Add HttpRequestBody
xbynet c93a8a2722 Fix character encoding detection bug
yihua.huang 74110e6ec5 remove useless file
yihua.huang b100dfe273 update version
xbynet@outlook.com 1c24baa8d1 Request supports setting headers and cookies
Add support for XML and JSON parameters in POST requests
Page supports retrieving response headers
yihua.huang 6bd197859b fix test
yihua.huang f23e138c72 add response headers to Page
yihua.huang c13110c4cb fix samples
yihua.huang c51ac6017c remove Site.addStartRequest() etc.
yihua.huang 68050fc88e test pass
yihua.huang 474b7c9d57 refactor
yihua.huang 25c81013ca new proxy pool api
yihua.huang 46297deaa1 HttpUriRequestConverter
yihua.huang 1d86f7c048 compile passed in httpclientDownloader
yihua.huang b71f379512 fix
yihua.huang a7f9e7cad5 Refactor part of httpclient
yihua.huang 221c155060 move release connection before return proxy
yihua.huang 68beff42c5 add test
wuyifan 79522f941e Bug, add null check to site in HttpClientDownloader & HttpClientGenerator
yihua.huang e9341d0291 complete test
yihua.huang e7d35c4846 add params to all method of request
yihua.huang 75bad591d7 rewrite hashCode and equals for params
Yihua Huang 11c32669b2 Merge pull request from xbynet/master
Simplify POST parameter setup.
yihua.huang aa01e27779 change constructor for Proxy to public
mei 791520e6a0 fix a bug of RegexSelector when regex has zero-width assertions.
yihua.huang c175ea88c0 #more test
yihua.huang 9b964c0a99 test for
yihua.huang fc702fd3b6 introduce mockito for test
yihua.huang 5215a492cc remove duplicate check for POST request
yihua.huang 0a1fb19052 add tests
yihua.huang a2e7f0004b Merge branch 'master' of github.com:code4craft/webmagic
yihua.huang ef32571821 rewrite Request.equals and hashCode, add Method to check
yihua.huang 8b8f535c30 refactor:extract charset detect to utils
Ckex.zha e645524ad2 fix bug,set ExecutorService
yihua.huang a872a6480e fix code sample for github
yihua.huang 1d2171805f add test for
yihua.huang bbe0b52ddd remove synchronized in QueueScheduler
yihua.huang ad69963005 remove synchronize in Page
yihua.huang 3a796b9413 remove duplicate code
yihua.huang 42f1018010 remove messy code
xbynet 650468c0e4 Fix garbled Chinese parameters in POST requests
yihua.huang aaccc93215 new version
yihua.huang 3e633c6871 version
yihua.huang f45e2f118b for release
yihua.huang d60615f503 Fix null pointer when using cookies, caused by using startUrls without a domain set #438
yihua.huang 407fbb6130 refactor logger#445
Ckex.zha 0dc26c8ca0 optimize code.
Yihua Huang 4f76d62d4f Merge pull request from ckex/develop
Bypass the security certificate
Ckex.zha e4af05a6f2 Bypass the security certificate
xbynet@outlook.com c23627bf63 Fix post/redirect/post 302 redirect issue
yihua.huang d69204b919 0.6.0
yihua.huang 9bdb48b2d0 version 0.6.0
yihua.huang eeb607fd0e Revert exception throwing in Spider.processRequest() to the original logic
yihua.huang 97592d6720 Version 0.6.0
yihua.huang 00dfebbceb remove guava dep and add fix docs
yihua.huang c2531c6817 clean dependency
yihua.huang a960a39c44 fix compile error for example change
yihua.huang 7476ceccee more stable test
yihua.huang 5ce3fdfe5a some refactor in log
yihua.huang 98163a3e40 update examples
yihua.huang b090dcd20d specific error page for HttpClientDownloaderTest to avoid test error when local port is available
yihua.huang 8f942d6fe2 Fix threads not terminating when crawling https links, which kept the process running
yihua.huang dafd2b77ff fix GithubRepoPageProcessor in example
yihua.huang cfed860fb9 Merge branch 'master' of github.com:code4craft/webmagic
yihua.huang 2189aab652 fix test
Yihua Huang 1491033534 Merge pull request from jerry-sc/monitor-bug
fix the monitor bug which the spider will terminate when a seed url with port
yihua.huang 507556d0aa fix test: ProxyTest.testProxy() do not load exist proxy config
Jerry e56b8c3efc fix the monitor bug which the spider will terminate when a seed url with port
yihua.huang 448e528140 update StringUtils to apache lang3
yihua.huang 3e33959b7a fix javadoc
yihua.huang 8730e3e97a Merge branch 'fix' of git://github.com/kapsterio/webmagic into kapsterio-fix
yihua.huang 2400ff7e1a resolve conflict
yihua.huang b7f3c4bba0 Merge branch 'master' of git://github.com/hepan/webmagic into hepan-master
yihua.huang d8f978fd20 fix test in JsonPathSelectorTest
yihua.huang 61c28a0130 refactor on proxypool
yihua.huang b871b210c5 Merge branch 'proxy-strategy' of github.com:EdwardsBean/webmagic into EdwardsBean-proxy-strategy
yihua.huang b5413368de update ut
Jon 83c27ebbc4 Add IP proxy authentication
yihua.huang ca072c5575 fix URL regex in GithubRepoPageProcessor
hepan 89c6e52863 Add username/password authentication for proxies
Linker Lin 047cb8ff8f updated versions to 0.5.4-SNAPSHOT
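zhangheng09 6b179c3d55 This change is motivated by two points: 1) a proxy should be returned to the proxy pool as early as possible, right after the HTTP request completes; 2) HTTP proxies are a concern of HttpClientDownloader and should not be handled by Spider, which does not know that its downloader is an HttpClientDownloader
zhangheng09 5f106c9c69 When page is null, the response status is abnormal and an exception should be thrown; otherwise both the onSuccess and onError methods of SpiderListener are executed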
yihua.huang c0b8e8f8ae remove .classpath .project
yihua.huang a8e6de4b90 Merge branch 'master' of git.oschina.net:flashsword20/webmagic
yihua.huang 0fd4623f0a Merge branch 'osc'
yihua.huang ce5495ecd5 remove useless files
yihua.huang 8265c7dade remove submodules for release
yihua.huang 7edfa26f90 complete javadoc
yihua.huang 8b90b91e33 complete some javadoc
yihua.huang 2b556cf053 update version to 0.5.3-SNAPSHOT
yihua.huang 9c5716a543 complete javadoc
yihua.huang db3cbf6ca5 update version to 0.5.3-SNAPSHOT
yihua.huang 81ce1ffc5f fix ignore
yihua.huang 93764fa2c9 ignore some test
yihua.huang 5706bb90af update xsoup to 0.3.1
yihua.huang 7586e3d75c add some test for github repo downloader
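x1ny 90e14b31b0 Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not stopped after the crawler finishes, so the program cannot exit; the I/O streams are also left unclosed.

Solution:
Make FileCacheQueueScheduler implement the Closeable interface and shut down the thread and streams in the close method.
Add a call that closes the scheduler in Spider's close method.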
yihua.huang 56e0cd513a compile error fix
yihua.huang c5740b1840 change assert
yihua.huang 67eb632f4d test for issue
高军 590561a6e4 Fix bug where site.setHttpProxy() has no effect
edwardsbean 19474e4716 add SimpleProxyPool and IProxyPool
edwardsbean 4978665633 add retry sleep time
yihua.huang 8ffc1a7093 add NPE check for POST method
zhugw bc666e927d Update Site.java
The javadoc of setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, the source code does not appear to enforce any restriction to RedisScheduler. Is this javadoc out of date?
yihua.huang 147401ce5e remove duplicate setPath in ProxyPool
yihua.huang e7668e01b8 fix SourceRegion error and add some tests on it
yihua.huang 4446669c24 fix test
yihua.huang 9866297ec4 Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it.
yihua.huang 4e6e946dd7 more friendly exception message in PlainText
yihua.huang af9939622b move thread package out of selector (because it was added by mistake at the beginning)
yihua.huang eae37c868b new sample
yihua.huang b3a282e58d some fix for tests
yihua.huang 074d767f45 Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy
zwf 2f89cfc31a add test and fix bug of proxy module
yihua.huang eb89d66566 fix test
yihua.huang 5e8ca02ec6 contributor
yihua.huang 8c33be48a6 Merge branch 'stable' of github.com:code4craft/webmagic
yihua.huang 5f8c3fd5c5 update version
yihua.huang 7a64847a3c Bugfix: selector does not work well in element
yihua.huang 8d67fd0357 change back return proxy from spider to httpclientdownloader
yihua.huang 40bf8ca58f change return proxy from spider to httpclientdownloader
yihua.huang 1f21d9cc14 spell mistake fix
Yihua Huang e310139d00 Merge pull request from yxssfxwzy/proxy
Management of multiple proxies
yihua.huang b165090434 Bugfix:Type convert error in JsonPathSelector
yihua.huang a5d1b56e44 fix ut
yihua.huang 3939074a23 Bugfix: nodes() only return the first element
yihua.huang 41c2ea9498 refactor of selectable cont'
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
yihua.huang f9825c214a refactor selectable for html fragment
yihua.huang 03d26c169b Enhance auto charset detect
1. Only read from content once to fix stream closed exception
2. invite moco as server test
zwf c146e2c7b4 add proxy pool
yihua.huang 21982d3460 remove cpdetector temporary
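fengwuze fcbfb75608 Modify the code block that automatically obtains the charset from the web page, extracting it into a separate method.
fengwuze 95494d3c4d Add logic for handling meta tags.
Remaining issues:
3. When the page does not specify an encoding, cpdetector is needed, but cpdetector is not currently available in the Maven central repository, and it is unclear how to resolve this.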
yihua.huang dde2d89bbe Ignore content in json when bracket when remove padding
ywooer 259f0a16c5 Update FilePipeline.java
ywooer 26d38851b5 add charset to Writer
yihua.huang 7668731f08 update version to snapshot
yihua.huang 182dd51689 Merge branch 'stable' of github.com:code4craft/webmagic
yihua.huang 81e6e772ac versions back to 0.5.1
yihua.huang feb604da87 Merge branch 'stable' of github.com:code4craft/webmagic
yihua.huang 358e906379 [maven-release-plugin] prepare for next development iteration
yihua.huang 470750fc0d [maven-release-plugin] prepare release WebMagic-0.5.1
yihua.huang 01aec7e1ab extension point of geturl
yihua.huang ec1c2e8cbc test and so on
yihua.huang 4f22f1210e some bug fix
yihua.huang 56f033ce8d set setDuplicateRemover for chain api
yihua.huang d1140b9e29 add bloom filter for scheduler
yihua.huang 8e4814bdc5 fix path separator
yihua.huang e8d4a9be2b fix remove duplicate error
yihua.huang 04ade75606 Merge branch 'stable' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-avalon/pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-lucene/pom.xml
	webmagic-samples/pom.xml
	webmagic-saxon/pom.xml
	webmagic-scripts/pom.xml
	webmagic-selenium/pom.xml
yihua.huang a08d8cb167 update version
yihua.huang 42a2676e8c update version
yihua.huang c25b32f1ca [maven-release-plugin] prepare for next development iteration
yihua.huang 7ff83bb11a [maven-release-plugin] prepare release WebMagic-0.5.0
yihua.huang 1104122979 more abstraction in scheduler
yihua.huang 2770811a10 update monitor example
yihua.huang 5ecd909ef2 add timeout for wait/notify
yihua.huang c7afdb516e remove thread utils
yihua.huang 17e95f2a7f comments
yihua.huang 05eb7831b6 refactor and comments
yihua.huang 375e64e845 more monitor status