yihua.huang
6bd1eed25e
fix duplicate call of onSuccess and onError #605
8 years ago
yihua.huang
3a589d4ca6
HttpRequestBody implements Serializable #594
8 years ago
yihua.huang
13cdf82695
update version to 0.7.2-SNAPSHOT
8 years ago
yihua.huang
1e9187f24e
version 0.7.1
8 years ago
yihua.huang
592fa2c0f1
add site header test
8 years ago
yihua.huang
19d34dbb65
not add bracket to regex in RegexSelector #559
8 years ago
Yihua Huang
b6b991a09b
Merge pull request #556 from zhuyuesut/master
...
Add support for zero-width assertions
8 years ago
yihua.huang
bb0eb69acf
update ZhihuPageProcessor example
8 years ago
yihua.huang
2e35e149be
for 0.7.1
8 years ago
yihua.huang
17d8bfa907
docs and pgp version
8 years ago
yihua.huang
17478fcfc4
0.7.0 release
8 years ago
yihua.huang
636359300f
add Site.disableCookieManagement #577
8 years ago
yihua.huang
49de9374cd
new SimpleHttpClient #576
8 years ago
yihua.huang
8999ea9320
add public constructor for SimpleProxyProvider
8 years ago
yihua.huang
a8c2e6c729
alpha release
8 years ago
yihua.huang
3c1338193b
for 0.7.0.alpha
8 years ago
yihua.huang
e8abc28072
#552 add some log when crawler stop
8 years ago
zhuyue
9e1b7ed3f7
Update RegexSelector.java
8 years ago
zhuyue
c80f25edbd
Update RegexSelectorTest.java
...
Add a few simple tests
8 years ago
zhuyue
c3183252ac
Update RegexSelector.java
8 years ago
yihua.huang
cbf80af5dd
test for SimpleProxyProvider #535
8 years ago
yihua.huang
eb632a93d3
SimpleProxyProvider #535
8 years ago
yihua.huang
d38d51dfcb
fix javadoc
8 years ago
GZhY
5f34adf938
Improve the test cases for LinksSelector.selectList
8 years ago
GZhY
ce3f0ac239
Remove fixAllRelativeHrefs and fix SeleniumDownloader's dependency on fixAllRelativeHrefs
8 years ago
GZhY
bc6e81e00f
Fix an error in the comment of the checkElementAndConvert method
8 years ago
yihua.huang
4a2c0f4f97
add returnProxy for proxyProvider
8 years ago
yihua.huang
1b04a7f2b3
#527 move logic check from downloader to spider
8 years ago
yihua.huang
0f4d6e8b12
#525 remove port in UrlUtils.getDomain()
8 years ago
yihua.huang
a1ae632b62
test for request cookies and headers
8 years ago
yihua.huang
db67db8103
#523 remove fixAllRelativeHrefs by default, get absolute urls for links()
8 years ago
yihua.huang
abd020b45b
some comments
8 years ago
yihua.huang
2622b448b8
fix test
8 years ago
yihua.huang
b06a248c00
fix test
8 years ago
yihua.huang
1cfbd13aae
refactor in httpclientdownloader
8 years ago
yihua.huang
83ada9749e
fix test
8 years ago
yihua.huang
fe95a6842f
Refactor Request again: remove params, keep only HttpRequestBody
8 years ago
yihua.huang
395396c68e
Add HttpRequestBody
8 years ago
xbynet
c93a8a2722
Fix character encoding detection bug
8 years ago
yihua.huang
74110e6ec5
remove useless file
8 years ago
yihua.huang
b100dfe273
update version
8 years ago
xbynet@outlook.com
1c24baa8d1
Request supports setting headers and cookies
...
Add support for XML and JSON parameters in POST requests
Page supports getting response headers
8 years ago
yihua.huang
6bd197859b
fix test
8 years ago
yihua.huang
f23e138c72
add response headers to Page #508
8 years ago
yihua.huang
c13110c4cb
fix samples
8 years ago
yihua.huang
c51ac6017c
remove Site.addStartRequest() etc. #494
8 years ago
yihua.huang
68050fc88e
test pass
8 years ago
yihua.huang
474b7c9d57
refactor
8 years ago
yihua.huang
25c81013ca
new proxy pool api
8 years ago
yihua.huang
46297deaa1
HttpUriRequestConverter
8 years ago
yihua.huang
1d86f7c048
compile passed in httpclientDownloader
8 years ago
yihua.huang
b71f379512
fix
8 years ago
yihua.huang
a7f9e7cad5
Refactor part of httpclient
8 years ago
yihua.huang
221c155060
move release connection before return proxy #396
8 years ago
yihua.huang
68beff42c5
add test #493
8 years ago
wuyifan
79522f941e
Bug, add null check to site in HttpClientDownloader & HttpClientGenerator
8 years ago
yihua.huang
e9341d0291
complete test #447
8 years ago
yihua.huang
e7d35c4846
add params to all method of request #447
8 years ago
yihua.huang
75bad591d7
rewrite hashCode and equals for params #447
8 years ago
Yihua Huang
11c32669b2
Merge pull request #447 from xbynet/master
...
Simplify POST parameter setup.
8 years ago
yihua.huang
aa01e27779
change constructor for Proxy to public #490
8 years ago
mei
791520e6a0
fix a bug of RegexSelector when regex has zero-width assertions.
8 years ago
yihua.huang
c175ea88c0
#more test #484
8 years ago
yihua.huang
9b964c0a99
test for #484
8 years ago
yihua.huang
fc702fd3b6
introduce mockito for test
8 years ago
yihua.huang
5215a492cc
remove duplicate check for POST request #484
8 years ago
yihua.huang
0a1fb19052
add tests #483
8 years ago
yihua.huang
a2e7f0004b
Merge branch 'master' of github.com:code4craft/webmagic
8 years ago
yihua.huang
ef32571821
rewrite Request.equals and hashCode, add Method to check #483
8 years ago
yihua.huang
8b8f535c30
refactor:extract charset detect to utils
8 years ago
Ckex.zha
e645524ad2
fix bug,set ExecutorService
8 years ago
yihua.huang
a872a6480e
fix code sample for github #348
8 years ago
yihua.huang
1d2171805f
add test for #228
8 years ago
yihua.huang
bbe0b52ddd
remove synchronized in QueueScheduler #410
8 years ago
yihua.huang
ad69963005
remove synchronize in Page #411
8 years ago
yihua.huang
3a796b9413
remove duplicate code #421
8 years ago
yihua.huang
42f1018010
remove messy code
8 years ago
xbynet
650468c0e4
Fix garbled Chinese parameters in POST requests
8 years ago
yihua.huang
aaccc93215
new version
8 years ago
yihua.huang
3e633c6871
version
8 years ago
yihua.huang
f45e2f118b
for release
8 years ago
yihua.huang
d60615f503
Fix null pointer when using cookies, caused by the domain not being set when using startUrls #438
8 years ago
yihua.huang
407fbb6130
refactor logger#445
8 years ago
Ckex.zha
0dc26c8ca0
optimize code.
8 years ago
Yihua Huang
4f76d62d4f
Merge pull request #444 from ckex/develop
...
Bypass security certificate
8 years ago
Ckex.zha
e4af05a6f2
Bypass security certificate
8 years ago
xbynet@outlook.com
c23627bf63
Fix post/redirect/post 302 redirect issue
8 years ago
yihua.huang
d69204b919
0.6.0
8 years ago
yihua.huang
9bdb48b2d0
version 0.6.0
8 years ago
yihua.huang
eeb607fd0e
Revert exception throwing in Spider.processRequest() to the original logic
8 years ago
yihua.huang
97592d6720
Version 0.6.0
8 years ago
yihua.huang
00dfebbceb
#424 remove guava dep and add fix docs
8 years ago
yihua.huang
c2531c6817
clean dependency
8 years ago
yihua.huang
a960a39c44
fix compile error for example change
8 years ago
yihua.huang
7476ceccee
more stable test
8 years ago
yihua.huang
5ce3fdfe5a
some refactor in log
8 years ago
yihua.huang
98163a3e40
update examples
8 years ago
yihua.huang
b090dcd20d
specific error page for HttpClientDownloaderTest to avoid test error when local port is available
8 years ago
yihua.huang
8f942d6fe2
#419 Fix threads not terminating when crawling https links, which kept the process running indefinitely
8 years ago
yihua.huang
dafd2b77ff
fix GithubRepoPageProcessor in example
8 years ago
yihua.huang
cfed860fb9
Merge branch 'master' of github.com:code4craft/webmagic
8 years ago
yihua.huang
2189aab652
fix test
8 years ago
Yihua Huang
1491033534
Merge pull request #377 from jerry-sc/monitor-bug
...
fix the monitor bug where the spider terminates when a seed URL contains a port
8 years ago
yihua.huang
507556d0aa
fix test: ProxyTest.testProxy() do not load exist proxy config
8 years ago
Jerry
e56b8c3efc
fix the monitor bug where the spider terminates when a seed URL contains a port
9 years ago
yihua.huang
448e528140
update StringUtils to apache lang3 #314
9 years ago
yihua.huang
3e33959b7a
#319 fix javadoc
9 years ago
yihua.huang
8730e3e97a
Merge branch 'fix' of git://github.com/kapsterio/webmagic into kapsterio-fix
9 years ago
yihua.huang
2400ff7e1a
resolve conflict
9 years ago
yihua.huang
b7f3c4bba0
Merge branch 'master' of git://github.com/hepan/webmagic into hepan-master
9 years ago
yihua.huang
d8f978fd20
fix test in JsonPathSelectorTest #289
9 years ago
yihua.huang
61c28a0130
refactor on proxypool
9 years ago
yihua.huang
b871b210c5
Merge branch 'proxy-strategy' of github.com:EdwardsBean/webmagic into EdwardsBean-proxy-strategy
9 years ago
yihua.huang
b5413368de
update ut
9 years ago
Jon
83c27ebbc4
Add IP proxy authentication
9 years ago
yihua.huang
ca072c5575
fix URL regex in GithubRepoPageProcessor #305
9 years ago
hepan
89c6e52863
Add username/password authentication for proxies
9 years ago
Linker Lin
047cb8ff8f
updated versions to 0.5.4-SNAPSHOT
9 years ago
zhangheng09
6b179c3d55
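This change is motivated by two points: 1) a proxy should be returned to the proxy pool as early as possible once the HTTP request has completed; 2) HTTP proxies are the concern of HttpClientDownloader and should not be handled by Spider, which does not know that its downloader is an HttpClientDownloader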
9 years ago
zhangheng09
5f106c9c69
When page is null, the response status is abnormal and an exception should be thrown; otherwise both the onSuccess and onError methods of SpiderListener are executed
9 years ago
yihua.huang
c0b8e8f8ae
remove .classpath .project
9 years ago
yihua.huang
a8e6de4b90
Merge branch 'master' of git.oschina.net:flashsword20/webmagic
9 years ago
yihua.huang
0fd4623f0a
Merge branch 'osc'
9 years ago
yihua.huang
ce5495ecd5
remove useless files
9 years ago
yihua.huang
8265c7dade
remove submodules for release
9 years ago
yihua.huang
7edfa26f90
complete javadoc
9 years ago
yihua.huang
8b90b91e33
complete some javadoc
9 years ago
yihua.huang
2b556cf053
update version to 0.5.3-SNAPSHOT
9 years ago
yihua.huang
9c5716a543
complete javadoc
9 years ago
yihua.huang
db3cbf6ca5
update version to 0.5.3-SNAPSHOT
9 years ago
yihua.huang
81ce1ffc5f
fix ignore
9 years ago
yihua.huang
93764fa2c9
ignore some test
9 years ago
yihua.huang
5706bb90af
update xsoup to 0.3.1
9 years ago
yihua.huang
7586e3d75c
add some test for github repo downloader
9 years ago
x1ny
90e14b31b0
Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
...
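FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not stopped after the crawler finishes, so the program cannot exit; the IO streams are also never closed.
Solution:
Make FileCacheQueueScheduler implement the Closeable interface and shut down the thread and the streams in its close method.
Add closing of the scheduler to Spider's close method.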
9 years ago
yihua.huang
56e0cd513a
compile error fix
10 years ago
yihua.huang
c5740b1840
change assert #200
10 years ago
yihua.huang
67eb632f4d
test for issue #200
10 years ago
高军
590561a6e4
Fix bug where site.setHttpProxy() has no effect
10 years ago
edwardsbean
19474e4716
add SimpleProxyPool and IProxyPool
10 years ago
edwardsbean
4978665633
add retry sleep time
10 years ago
yihua.huang
8ffc1a7093
add NPE check for POST method
10 years ago
zhugw
bc666e927d
Update Site.java
...
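The javadoc of setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, the source code does not appear to enforce any such restriction, i.e. that it can only be used with RedisScheduler. Is this javadoc out of date?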
11 years ago
yihua.huang
147401ce5e
remove duplicate setPath in ProxyPool
11 years ago
yihua.huang
e7668e01b8
fix SourceRegion error and add some tests on it #144
11 years ago
yihua.huang
4446669c24
fix test
11 years ago
yihua.huang
9866297ec4
Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149
11 years ago
yihua.huang
4e6e946dd7
more friendly exception message in PlainText #144
11 years ago
yihua.huang
af9939622b
move thread package out of selector (because it was added by mistake at the beginning)
11 years ago
yihua.huang
eae37c868b
new sample
11 years ago
yihua.huang
b3a282e58d
some fix for tests #130
11 years ago
yihua.huang
074d767f45
Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy
11 years ago
zwf
2f89cfc31a
add test and fix bug of proxy module
11 years ago
yihua.huang
eb89d66566
fix test
11 years ago
yihua.huang
5e8ca02ec6
contributor
11 years ago
yihua.huang
8c33be48a6
Merge branch 'stable' of github.com:code4craft/webmagic
11 years ago
yihua.huang
5f8c3fd5c5
update version
11 years ago
yihua.huang
7a64847a3c
Bugfix: selector does not work well in element #113
11 years ago
yihua.huang
8d67fd0357
change back return proxy from spider to httpclientdownloader #128
11 years ago
yihua.huang
40bf8ca58f
change return proxy from spider to httpclientdownloader #128
11 years ago
yihua.huang
1f21d9cc14
spell mistake fix #128
11 years ago
Yihua Huang
e310139d00
Merge pull request #128 from yxssfxwzy/proxy
...
Management of multiple proxies
11 years ago
yihua.huang
b165090434
Bugfix:Type convert error in JsonPathSelector #129
11 years ago
yihua.huang
a5d1b56e44
fix ut #113
11 years ago
yihua.huang
3939074a23
Bugfix: nodes() only return the first element #113
11 years ago
yihua.huang
41c2ea9498
refactor of selectable cont' #113
...
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
11 years ago
yihua.huang
f9825c214a
refactor selectable for html fragment #113
11 years ago
yihua.huang
03d26c169b
Enhance auto charset detect #126
...
1. Only read from content once to fix stream closed exception
2. introduce moco as a test server
11 years ago
zwf
c146e2c7b4
add proxy pool
11 years ago
yihua.huang
21982d3460
remove cpdetector temporary #126
11 years ago
fengwuze
fcbfb75608
Extract the code block that automatically detects the charset from a web page into a separate method.
11 years ago
fengwuze
95494d3c4d
Add logic for handling meta.
...
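Remaining issues:
3. When a web page does not specify an encoding, cpdetector is needed, but cpdetector is currently not available in the Maven central repository, and it is unclear how to resolve this.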
11 years ago
yihua.huang
dde2d89bbe
Ignore brackets inside JSON content when removing padding #124
11 years ago
ywooer
259f0a16c5
Update FilePipeline.java
11 years ago
ywooer
26d38851b5
add charset to Writer
11 years ago
yihua.huang
7668731f08
update version to snapshot
11 years ago
yihua.huang
182dd51689
Merge branch 'stable' of github.com:code4craft/webmagic
11 years ago
yihua.huang
81e6e772ac
versions back to 0.5.1
11 years ago
yihua.huang
feb604da87
Merge branch 'stable' of github.com:code4craft/webmagic
11 years ago
yihua.huang
358e906379
[maven-release-plugin] prepare for next development iteration
11 years ago
yihua.huang
470750fc0d
[maven-release-plugin] prepare release WebMagic-0.5.1
11 years ago
yihua.huang
01aec7e1ab
extension point of geturl #118
11 years ago
yihua.huang
ec1c2e8cbc
test and so on
11 years ago
yihua.huang
4f22f1210e
some bug fix #118
11 years ago
yihua.huang
56f033ce8d
set setDuplicateRemover for chain api #118
11 years ago
yihua.huang
d1140b9e29
add bloom filter for scheduler #118
11 years ago
yihua.huang
8e4814bdc5
fix path separator
11 years ago
yihua.huang
e8d4a9be2b
fix remove duplicate error #117
11 years ago
yihua.huang
04ade75606
Merge branch 'stable' of github.com:code4craft/webmagic
...
Conflicts:
README.md
pom.xml
webmagic-avalon/pom.xml
webmagic-core/pom.xml
webmagic-extension/pom.xml
webmagic-lucene/pom.xml
webmagic-samples/pom.xml
webmagic-saxon/pom.xml
webmagic-scripts/pom.xml
webmagic-selenium/pom.xml
11 years ago
yihua.huang
a08d8cb167
update version
11 years ago
yihua.huang
42a2676e8c
update version
11 years ago
yihua.huang
c25b32f1ca
[maven-release-plugin] prepare for next development iteration
11 years ago
yihua.huang
7ff83bb11a
[maven-release-plugin] prepare release WebMagic-0.5.0
11 years ago
yihua.huang
1104122979
more abstraction in scheduler
11 years ago
yihua.huang
2770811a10
update monitor example
11 years ago
yihua.huang
5ecd909ef2
add timeout for wait/notify #111
11 years ago
yihua.huang
c7afdb516e
remove thread utils #110
11 years ago
yihua.huang
17e95f2a7f
comments
11 years ago
yihua.huang
05eb7831b6
refactor and comments #110
11 years ago
yihua.huang
375e64e845
more monitor status
11 years ago