Commit Graph

629 Commits (develop)

Author SHA1 Message Date
yihua.huang a7f9e7cad5 refactor part of httpclient 8 years ago
yihua.huang 221c155060 move release connection before return proxy #396 8 years ago
yihua.huang 68beff42c5 add test #493 8 years ago
wuyifan 79522f941e Bug, add null check to site in HttpClientDownloader & HttpClientGenerator 8 years ago
yihua.huang e9341d0291 complete test #447 8 years ago
yihua.huang e7d35c4846 add params to all method of request #447 8 years ago
yihua.huang 75bad591d7 rewrite hashCode and equals for params #447 8 years ago
Yihua Huang 11c32669b2 Merge pull request #447 from xbynet/master
Simplify setting POST parameters.
8 years ago
yihua.huang aa01e27779 change constructor for Proxy to public #490 8 years ago
mei 791520e6a0 fix a bug of RegexSelector when regex has zero-width assertions. 8 years ago
yihua.huang c175ea88c0 #more test #484 8 years ago
yihua.huang 9b964c0a99 test for #484 8 years ago
yihua.huang fc702fd3b6 introduce mockito for test 8 years ago
yihua.huang 5215a492cc remove duplicate check for POST request #484 8 years ago
yihua.huang 0a1fb19052 add tests #483 8 years ago
yihua.huang a2e7f0004b Merge branch 'master' of github.com:code4craft/webmagic 8 years ago
yihua.huang ef32571821 rewrite Request.equals and hashCode, add Method to check #483 8 years ago
yihua.huang 8b8f535c30 refactor:extract charset detect to utils 8 years ago
Ckex.zha e645524ad2 fix bug,set ExecutorService 8 years ago
yihua.huang a872a6480e fix code sample for github #348 8 years ago
yihua.huang 1d2171805f add test for #228 8 years ago
yihua.huang bbe0b52ddd remove synchronized in QueueScheduler #410 8 years ago
yihua.huang ad69963005 remove synchronize in Page #411 8 years ago
yihua.huang 3a796b9413 remove duplicate code #421 8 years ago
yihua.huang 42f1018010 remove messy code 8 years ago
xbynet 650468c0e4 fix garbled Chinese characters in POST parameters 8 years ago
yihua.huang aaccc93215 new version 8 years ago
yihua.huang 3e633c6871 version 8 years ago
yihua.huang f45e2f118b for release 8 years ago
yihua.huang d60615f503 fix NullPointerException when using cookies, caused by domain not being set when using startUrls #438 8 years ago
yihua.huang 407fbb6130 refactor logger#445 8 years ago
Ckex.zha 0dc26c8ca0 optimize code. 8 years ago
Yihua Huang 4f76d62d4f Merge pull request #444 from ckex/develop
Bypass security certificate
8 years ago
Ckex.zha e4af05a6f2 bypass security certificate 8 years ago
xbynet@outlook.com c23627bf63 fix post/redirect/post 302 redirect issue 8 years ago
yihua.huang d69204b919 0.6.0 8 years ago
yihua.huang 9bdb48b2d0 version 0.6.0 8 years ago
yihua.huang eeb607fd0e revert exception throwing in Spider.processRequest() to the original logic 8 years ago
yihua.huang 97592d6720 Version 0.6.0 8 years ago
yihua.huang 00dfebbceb #424 remove guava dep and add fix docs 8 years ago
yihua.huang c2531c6817 clean dependency 8 years ago
yihua.huang a960a39c44 fix compile error for example change 8 years ago
yihua.huang 7476ceccee more stable test 8 years ago
yihua.huang 5ce3fdfe5a some refactor in log 8 years ago
yihua.huang 98163a3e40 update examples 8 years ago
yihua.huang b090dcd20d specific error page for HttpClientDownloaderTest to avoid test error when local port is available 8 years ago
yihua.huang 8f942d6fe2 #419 fix threads not terminating when crawling https links, which kept the process running 8 years ago
yihua.huang dafd2b77ff fix GithubRepoPageProcessor in example 8 years ago
yihua.huang cfed860fb9 Merge branch 'master' of github.com:code4craft/webmagic 8 years ago
yihua.huang 2189aab652 fix test 8 years ago
Yihua Huang 1491033534 Merge pull request #377 from jerry-sc/monitor-bug
fix the monitor bug which the spider will terminate when a seed url with port
8 years ago
yihua.huang 507556d0aa fix test: ProxyTest.testProxy() do not load exist proxy config 8 years ago
Jerry e56b8c3efc fix the monitor bug which the spider will terminate when a seed url with port 8 years ago
yihua.huang 448e528140 update StringUtils to apache lang3 #314 9 years ago
yihua.huang 3e33959b7a #319 fix javadoc 9 years ago
yihua.huang 8730e3e97a Merge branch 'fix' of git://github.com/kapsterio/webmagic into kapsterio-fix 9 years ago
yihua.huang 2400ff7e1a resolve conflict 9 years ago
yihua.huang b7f3c4bba0 Merge branch 'master' of git://github.com/hepan/webmagic into hepan-master 9 years ago
yihua.huang d8f978fd20 fix test in JsonPathSelectorTest #289 9 years ago
yihua.huang 61c28a0130 refactor on proxypool 9 years ago
yihua.huang b871b210c5 Merge branch 'proxy-strategy' of github.com:EdwardsBean/webmagic into EdwardsBean-proxy-strategy 9 years ago
yihua.huang b5413368de update ut 9 years ago
Jon 83c27ebbc4 add authentication for IP proxies 9 years ago
yihua.huang ca072c5575 fix URL regex in GithubRepoPageProcessor #305 9 years ago
hepan 89c6e52863 add username/password authentication for proxies 9 years ago
Linker Lin 047cb8ff8f updated versions to 0.5.4-SNAPSHOT 9 years ago
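zhangheng09 6b179c3d55 Two reasons for this change: 1) a proxy should be returned to the proxy pool as early as possible, right after the http request has been executed; 2) http proxies are the concern of HttpClientDownloader and should not be handled by Spider, which does not know that its downloader is an HttpClientDownloader 9 years ago
zhangheng09 5f106c9c69 A null page means an abnormal response status, so an exception should be thrown; otherwise both the onSuccess and onError methods of SpiderListener are executed 9 years ago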
yihua.huang c0b8e8f8ae remove .classpath .project 9 years ago
yihua.huang a8e6de4b90 Merge branch 'master' of git.oschina.net:flashsword20/webmagic 9 years ago
yihua.huang 0fd4623f0a Merge branch 'osc' 9 years ago
yihua.huang ce5495ecd5 remove useless files 9 years ago
yihua.huang 8265c7dade remove submodules for release 9 years ago
yihua.huang 7edfa26f90 complete javadoc 9 years ago
yihua.huang 8b90b91e33 complete some javadoc 9 years ago
yihua.huang 2b556cf053 update version to 0.5.3-SNAPSHOT 9 years ago
yihua.huang 9c5716a543 complete javadoc 9 years ago
yihua.huang db3cbf6ca5 update version to 0.5.3-SNAPSHOT 9 years ago
yihua.huang 81ce1ffc5f fix ignore 9 years ago
yihua.huang 93764fa2c9 ignore some test 9 years ago
yihua.huang 5706bb90af update xsoup to 0.3.1 9 years ago
yihua.huang 7586e3d75c add some test for github repo downloader 9 years ago
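x1ny 90e14b31b0 Fix FileCacheQueueScheduler preventing the program from exiting normally and leaving streams unclosed
FileCacheQueueScheduler starts a thread that runs periodically to save data, but the thread is not stopped after the crawl finishes, so the program cannot exit; the IO streams are also never closed.

Solution:
Make FileCacheQueueScheduler implement the Closeable interface and shut down the thread and the streams in its close method.
Close the scheduler in Spider's close method.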
9 years ago
yihua.huang 56e0cd513a compile error fix 10 years ago
yihua.huang c5740b1840 change assert #200 10 years ago
yihua.huang 67eb632f4d test for issue #200 10 years ago
高军 590561a6e4 fix bug where site.setHttpProxy() has no effect 10 years ago
edwardsbean 19474e4716 add SimpleProxyPool and IProxyPool 10 years ago
edwardsbean 4978665633 add retry sleep time 10 years ago
yihua.huang 8ffc1a7093 add NPE check for POST method 10 years ago
zhugw bc666e927d Update Site.java
The javadoc of setCycleRetryTimes says: Set cycleRetryTimes times when download fail, 0 by default. Only work in RedisScheduler.
However, the source code does not appear to enforce that restriction, i.e. that it can only be used with RedisScheduler. So is this javadoc outdated?
11 years ago
yihua.huang 147401ce5e remove duplicate setPath in ProxyPool 11 years ago
yihua.huang e7668e01b8 fix SourceRegion error and add some tests on it #144 11 years ago
yihua.huang 4446669c24 fix test 11 years ago
yihua.huang 9866297ec4 Disable jsoup entity escape by Default. Set Html.DISABLE_HTML_ENTITY_ESCAPE to false to enable it. #149 11 years ago
yihua.huang 4e6e946dd7 more friendly exception message in PlainText #144 11 years ago
yihua.huang af9939622b move thread package out of selector (because it was added by mistake at the beginning) 11 years ago
yihua.huang eae37c868b new sample 11 years ago
yihua.huang b3a282e58d some fix for tests #130 11 years ago
yihua.huang 074d767f45 Merge branch 'proxy' of github.com:yxssfxwzy/webmagic into yxssfxwzy-proxy 11 years ago
zwf 2f89cfc31a add test and fix bug of proxy module 11 years ago
yihua.huang eb89d66566 fix test 11 years ago
yihua.huang 5e8ca02ec6 contributor 11 years ago
yihua.huang 8c33be48a6 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 5f8c3fd5c5 update version 11 years ago
yihua.huang 7a64847a3c Bugfix: selector does not works well in element #113 11 years ago
yihua.huang 8d67fd0357 change back return proxy from spider to httpclientdownloader #128 11 years ago
yihua.huang 40bf8ca58f change return proxy from spider to httpclientdownloader #128 11 years ago
yihua.huang 1f21d9cc14 spell mistake fix #128 11 years ago
Yihua Huang e310139d00 Merge pull request #128 from yxssfxwzy/proxy
Management of multiple proxies
11 years ago
yihua.huang b165090434 Bugfix:Type convert error in JsonPathSelector #129 11 years ago
yihua.huang a5d1b56e44 fix ut #113 11 years ago
yihua.huang 3939074a23 Bugfix: nodes() only return the first element #113 11 years ago
yihua.huang 41c2ea9498 refactor of selectable cont' #113
1. remove lazy init of Html
2. rename strings to sourceTexts for better meaning
3. make getSourceTexts abstract and DO NOT always store strings
4. instead store parsed elements of document in HtmlNode
11 years ago
yihua.huang f9825c214a refactor selectable for html fragment #113 11 years ago
yihua.huang 03d26c169b Enhance auto charset detect #126
1. Only read from content once to fix stream closed exception
2. invite moco as server test
11 years ago
zwf c146e2c7b4 add proxy pool 11 years ago
yihua.huang 21982d3460 remove cpdetector temporary #126 11 years ago
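fengwuze fcbfb75608 Extract the code block that automatically detects the charset from the web page into a separate method. 11 years ago
fengwuze 95494d3c4d Add logic for handling meta tags.
Remaining issues:
3. When the page does not specify an encoding, cpdetector is needed, but cpdetector is currently not available in the Maven Central repository, and it is unclear how to resolve this.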
11 years ago
yihua.huang dde2d89bbe Ignore content in json when bracket when remove padding #124 11 years ago
ywooer 259f0a16c5 Update FilePipeline.java 11 years ago
ywooer 26d38851b5 add charset to Writer 11 years ago
yihua.huang 7668731f08 update version to snapshot 11 years ago
yihua.huang 182dd51689 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 81e6e772ac versions back to 0.5.1 11 years ago
yihua.huang feb604da87 Merge branch 'stable' of github.com:code4craft/webmagic 11 years ago
yihua.huang 358e906379 [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 470750fc0d [maven-release-plugin] prepare release WebMagic-0.5.1 11 years ago
yihua.huang 01aec7e1ab extension point of geturl #118 11 years ago
yihua.huang ec1c2e8cbc test and so on 11 years ago
yihua.huang 4f22f1210e some bug fix #118 11 years ago
yihua.huang 56f033ce8d set setDuplicateRemover for chain api #118 11 years ago
yihua.huang d1140b9e29 add bloom filter for scheduler #118 11 years ago
yihua.huang 8e4814bdc5 fix path separator 11 years ago
yihua.huang e8d4a9be2b fix remove duplicate error #117 11 years ago
yihua.huang 04ade75606 Merge branch 'stable' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-avalon/pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-lucene/pom.xml
	webmagic-samples/pom.xml
	webmagic-saxon/pom.xml
	webmagic-scripts/pom.xml
	webmagic-selenium/pom.xml
11 years ago
yihua.huang a08d8cb167 update version 11 years ago
yihua.huang 42a2676e8c update version 11 years ago
yihua.huang c25b32f1ca [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 7ff83bb11a [maven-release-plugin] prepare release WebMagic-0.5.0 11 years ago
yihua.huang 1104122979 more abstraction in scheduler 11 years ago
yihua.huang 2770811a10 update monitor example 11 years ago
yihua.huang 5ecd909ef2 add timeout for wait/notify #111 11 years ago
yihua.huang c7afdb516e remove thread utils #110 11 years ago
yihua.huang 17e95f2a7f comments 11 years ago
yihua.huang 05eb7831b6 refactor and comments #110 11 years ago
yihua.huang 375e64e845 more monitor status 11 years ago
yihua.huang 018061d2cd fix error in thread pool 11 years ago
yihua.huang cdc423f2bf log 11 years ago
yihua.huang c6661899fd new thread pool #110 11 years ago
yihua.huang 179baa7a22 return when page is null 11 years ago
yihua.huang 0336f4cdb4 remove IllegalStateException when download error for less error log 11 years ago
yihua.huang 11ba5beb42 [refactor]move monitor to webmagic-extension #98 11 years ago
yihua.huang d61f65cef8 update mbean to mxbean #98 11 years ago
yihua.huang ad6a273b12 update test url 11 years ago
yihua.huang 30af23d003 split monitor to server and client mode #98 11 years ago
yihua.huang ced79630d3 specify jndi and jmx #98 11 years ago
yihua.huang 95d3802e77 add formdata support for post request #108 11 years ago
yihua.huang f49bb877c8 clean some code #109 11 years ago
yihua.huang e1aaf1dd11 fix mistake of guava Table #109 11 years ago
yihua.huang 8ba2da146c request method #108 and more cookie #109 config 11 years ago
yihua.huang b06aa489fb [BugFix]Only one url from sourceRegion can be extracted #107 11 years ago
Bo LIANG 08fa3b01c1 when download error, throw an exception instead of calling onError and returning peacefully. #105 11 years ago
yihua.huang 27b37e8164 extension point and sample for JMX support #98 11 years ago
yihua.huang a5db6cf292 some monitor and JMX support #98 11 years ago
yihua.huang f39aa435cf add null check #104 11 years ago
yihua.huang 42bbe40a37 [Bugfix]Urls will be lost when call setScheduler() #104 11 years ago
Bo LIANG 163773af6b combine two try-catch block into one, make it cleaner. 11 years ago
yihua.huang ec446277b1 some refactor in httpclientdownloader 11 years ago
yihua.huang a03f6a8431 eclipse project 11 years ago
yihua.huang 4a035e729a extension point for LocalDuplicatedRemovedScheduler #95 11 years ago
yihua.huang b249e49748 [Bugfix]loop error when add TargetRequest #99 11 years ago
Yihua Huang da2f023c12 Merge pull request #96 from ouyanghuangzheng/master
Revised several comments in Spider and site
11 years ago
yihua.huang f7950ebcab fix tests 11 years ago
愤怒的番茄 32ba1b8889 fix several comment issues 11 years ago
yihua.huang 84b897f83b update AngularJSProcessor 11 years ago
yihua.huang 03c251237b add Json parse support 11 years ago
愤怒的番茄 644e8d1f72 sync with the official source code 11 years ago
yihua.huang 969ad1766b change logger style to slf4j for cleaner code 11 years ago
yihua.huang 9b2cb43f47 ConfigurablePageProcessor #91 11 years ago
Bo LIANG b043ac76d6 change the formatter of log.
To use slf4j, we should insert {} into the formatter string.
11 years ago
yihua.huang 7aaf837e15 change logger to slf4j style for performance #84 11 years ago
yihua.huang f9b157951d Merge branch 'master' of github.com:code4craft/webmagic 11 years ago
yihua.huang 22c394e629 [doc] 11 years ago
Bo LIANG 762a3973fd Modify the log levels of LocalDuplicatedRemovedScheduler.java
The old version prints a debug level log each time the push method is
called. So when an html page has multiple links to the same
page, the debug log appears more than once. A log is also printed when a
duplicate URL is met, which is confusing.
I changed its level to trace, and a debug level log is now printed each
time a URL is actually pushed into the queue.
11 years ago
yihua.huang a1c7e826f7 fix dep of slf4j-log4j12 11 years ago
yihua.huang 01848301d4 encode illegal characters in url #80 11 years ago
yihua.huang 2780423e60 enable blank space in quotes in UrlUtils.fixAllRelativeHrefs #80 11 years ago
yihua.huang 97b6f46280 Bugfix: break loop in addTargetRequests #81 11 years ago
yihua.huang 8d8194bee4 Change HashMap to LinkedHashMap in ResultItems for same order of input and output #76 11 years ago
yihua.huang 8b35d79569 Do not cache document in Selectable for selected Html element #73 11 years ago
yihua.huang 6201fd6966 add worker as container 11 years ago
yihua.huang 6c11718566 Clean project structure #70 11 years ago
yihua.huang 9606a173cd fix ZipCodePageProcessor 11 years ago
yihua.huang 4f68368db0 Merge branch 'master' of git.oschina.net:flashsword20/webmagic
Conflicts:
	webmagic-core/src/main/java/us/codecraft/webmagic/selector/RegexSelector.java
11 years ago
yihua.huang 98e2bba099 Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
	webmagic-scripts/pom.xml
11 years ago
yihua.huang 757cc9b942 [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 63ffb5c792 [maven-release-plugin] prepare release webmaigc-0.4.3 11 years ago
yihua.huang 66d4d3c192 Merge branch 'master' into 0.4.x 11 years ago
yihua.huang af07280176 remove defend code for httpclient 4.3.1 because it is fixed in 4.3.3 #59 11 years ago
yihua.huang d5a978e00f update version back to 0.4.3 11 years ago
yihua.huang 55368919df add attribute 'text' support for CssSelector #66 11 years ago
yihua.huang 88b50d4182 bugfix: cycleTry will not work when spawnUrl is set to false #62 11 years ago
yihua.huang 2768a1cae4 add test for cycleTriedTimes and fix cycleTriedTimes inc error #60 11 years ago
yihua.huang bbd0d7e600 update httpclient version to 4.3.3 #59 11 years ago
yihua.huang 571061454a #58 add CYCLE_TRIED_TIMES support to QueueScheduler and PriorityScheduler 11 years ago
yihua.huang 0e98183f74 Change log4j to slf4j #55 11 years ago
yihua.huang fa33b15843 property loader 11 years ago
yihua.huang af809c4d55 update version to 0.5.0-snapshot 11 years ago
Almark Ming 2b46b11e55 Update RegexSelector.java
Optimize regex format check

Conflicts:
	webmagic-core/src/main/java/us/codecraft/webmagic/selector/RegexSelector.java
11 years ago
yihua.huang 2a8e1b654d Merge branch 'master' of git.oschina.net:flashsword20/webmagic into osc
Conflicts:
	pom.xml
11 years ago
Almark Ming 91ed66ecac Update RegexSelector.java 11 years ago
Almark Ming 83926970b2 Check valid left parenthesis 11 years ago
yihua.huang b51fb2696b update ut for cookie 11 years ago
yihua.huang ff2f588c41 #48 nullpointer exception 11 years ago
yihua.huang fc97cb58c5 update lib and version 11 years ago
yihua.huang 7c41bec92f Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
	README.md
	webmagic-samples/pom.xml
	webmagic-selenium/pom.xml
11 years ago
yihua.huang d274310cb2 [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang e8c32a32dc [maven-release-plugin] prepare release webmagic-0.4.2 11 years ago
yihua.huang 6a828e923c #46 Downloader thread hang up when timeout 11 years ago
shijinping 9a524aa364 read httpClient again inside the double-check 11 years ago
yihua.huang fd23cb6dc0 Merge branch 'master' of github.com:code4craft/webmagic
Conflicts:
	README.md
	pom.xml
	webmagic-samples/pom.xml
	webmagic-selenium/pom.xml
11 years ago
yihua.huang e7083dc39d [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang ae623567b3 [maven-release-plugin] prepare release webmagic-0.4.1 11 years ago
yihua.huang 59ad4cad27 #42 Add jsonpath in annotation mode for json result 11 years ago
yihua.huang c2d6d495b3 #41 add getThreadAlive(),getStatus,getPageCount() to spider 11 years ago
yihua.huang cf62d707e0 #36 Spider does not exit when success 11 years ago
yihua.huang a01312930a #39 Parsing html after page.getHtml() 11 years ago
yihua.huang f63d33b457 update some comments 11 years ago
yihua.huang 04fcf3193f #38 Change algorithm of SmartContentSelector 11 years ago
yihua.huang 296a68920e fix javadoc and add setPipelines() for spider 11 years ago
yihua.huang 47a0360783 #35 add status code to page 11 years ago
yihua.huang bc5c30de17 update scripts 11 years ago
yihua.huang f9daae39cf [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang fdb9441519 [maven-release-plugin] prepare release webmagic-0.4.0 11 years ago
yihua.huang 1d75ae7f5b rollback version to 0.4.0 because not deploy success 11 years ago
yihua.huang df8ca8ad09 add scripts 11 years ago
yihua.huang e40b48e77b Merge tag 'webmagic-0.4.0' of github.com:code4craft/webmagic
[maven-release-plugin]  copy for tag webmagic-0.4.0

Conflicts:
	pom.xml
	webmagic-core/pom.xml
	webmagic-extension/pom.xml
11 years ago
yihua.huang 775eb9732f [maven-release-plugin] prepare for next development iteration 11 years ago
yihua.huang 0b4fadc24d [maven-release-plugin] prepare release webmagic-0.4.0 11 years ago
yihua.huang fe6d9bb2e2 get keep-alive rework 11 years ago
yihua.huang fd6d2fd6f8 try to keepalive TCP connection 11 years ago
yihua.huang 425df08523 update version to 0.4.0 11 years ago
yihua.huang e046bb0723 remove useless code 11 years ago
yihua.huang 6e32a19f80 update api for direct download 11 years ago
yihua.huang 807aefe9df change EntityUtil to IOUtil because some encoding error 11 years ago
yihua.huang 00b0a751b4 #33 ignore 'content-encoding' when redirect 11 years ago
yihua.huang 8f774afc84 add direct download 11 years ago
yihua.huang c18b603399 optimize long compare 11 years ago