Release Notes
----
See the latest versions at [https://github.com/code4craft/webmagic/releases](https://github.com/code4craft/webmagic/releases).

*2012-9-4* `version0.3.0`

* Change the default XPath selector from HtmlCleaner to [Xsoup](https://github.com/code4craft/xsoup).

  [Xsoup](https://github.com/code4craft/xsoup) is an XPath selector based on Jsoup, written by me. It performs much better than HtmlCleaner: the time to process a page drops from 7~9 ms to 0.4 ms.

  If Xsoup is not stable enough for your use case, call `Spider.xsoupOff()` to turn it off, and please report an issue to me!

* Add cycle retry times for Site.

  When cycle retry times is set, Spider puts a URL whose download failed back into the scheduler and retries it after one cycle of the queue.
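
The cycle retry behaviour described above can be illustrated with a minimal sketch. It does not use WebMagic itself: `CycleRetryDemo`, `crawl`, and the `download` predicate are hypothetical names, and a plain FIFO queue stands in for the scheduler. The point is only that a failed URL goes to the back of the queue and is retried after the URLs ahead of it, up to the configured number of times.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical stand-in for a crawl loop; this is not WebMagic's Spider.
class CycleRetryDemo {

    /**
     * Drains the queue in FIFO order. A URL whose download fails is put back at
     * the tail of the queue, so it is retried only after the URLs queued ahead
     * of it. Returns the order in which download attempts were made.
     */
    static List<String> crawl(List<String> startUrls, int cycleRetryTimes,
                              Predicate<String> download) {
        Queue<String> queue = new ArrayDeque<>(startUrls);
        Map<String, Integer> retries = new HashMap<>();
        List<String> attempts = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            attempts.add(url);
            if (download.test(url)) {
                continue; // downloaded successfully
            }
            int used = retries.getOrDefault(url, 0);
            if (used < cycleRetryTimes) {
                retries.put(url, used + 1);
                queue.add(url); // back of the queue: retried one cycle later
            }
            // Otherwise the retry budget is exhausted and the URL is dropped.
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Simulated downloader: "b" fails on its first attempt, then succeeds.
        Map<String, Integer> calls = new HashMap<>();
        Predicate<String> flaky = url ->
                calls.merge(url, 1, Integer::sum) > 1 || !url.equals("b");
        System.out.println(crawl(Arrays.asList("a", "b", "c"), 2, flaky));
        // prints [a, b, c, b]
    }
}
```

In this sketch the retry of `b` happens after `c`, which is the difference between a cycle retry and an immediate retry of the same request.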

*2012-8-20* `version0.2.1`

Add ComboExtractor support for annotations.

Add request priority support (using `PriorityScheduler`).

Complete some I18n work (comments and documents).
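
To make the idea of request priority concrete, here is a minimal sketch of a priority-aware queue. It is not the code of `PriorityScheduler`: the class `PriorityQueueSchedulerSketch` and its `push`/`poll` methods are hypothetical, and it assumes that larger priority values are served first and that requests with equal priority keep their insertion order.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for a priority-aware scheduler; not WebMagic's PriorityScheduler.
class PriorityQueueSchedulerSketch {

    // A queued request: URL, priority, and an insertion counter used as a tie-breaker.
    private static final class Entry {
        final String url;
        final long priority;
        final long seq;

        Entry(String url, long priority, long seq) {
            this.url = url;
            this.priority = priority;
            this.seq = seq;
        }
    }

    // Higher priority first (assumption); equal priorities are served in insertion order.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingLong((Entry e) -> e.priority)
                    .reversed()
                    .thenComparingLong(e -> e.seq));
    private long counter = 0;

    void push(String url, long priority) {
        queue.add(new Entry(url, priority, counter++));
    }

    /** Returns the next URL to fetch, or null when nothing is queued. */
    String poll() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    public static void main(String[] args) {
        PriorityQueueSchedulerSketch scheduler = new PriorityQueueSchedulerSketch();
        scheduler.push("list-page", 0);
        scheduler.push("detail-1", 10);
        scheduler.push("detail-2", 10);
        scheduler.push("low", -5);
        String url;
        while ((url = scheduler.poll()) != null) {
            System.out.println(url); // detail-1, detail-2, list-page, low
        }
    }
}
```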

More convenient extractor API:

* Add attribute name selection for CSSSelector.
* The group of a regex selector can be specified.
* Add OrSelector.
* Add Selectors; use `import static Selectors.*` for a fluent API such as:

  `or(regex("<title>(.*)</title>"), xpath("//title"), $("title")).select(s);`

* Add JsonPathSelector for JSON parsing.
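
The "or" combinator and the selectable regex group from the list above can be sketched in plain Java. This is a simplified illustration, not the `Selectors` class: `SelectorSketch`, its nested `Selector` interface, and the two factory methods are hypothetical, and only regex selectors are modelled because XPath and CSS selection would need a parsing library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified selector combinators; not WebMagic's Selectors class.
final class SelectorSketch {

    /** Extracts a value from text, or returns null when nothing matches. */
    interface Selector {
        String select(String text);
    }

    /** Regex selector that returns the chosen capture group of the first match. */
    static Selector regex(String pattern, int group) {
        Pattern compiled = Pattern.compile(pattern, Pattern.DOTALL);
        return text -> {
            Matcher m = compiled.matcher(text);
            return m.find() ? m.group(group) : null;
        };
    }

    /** Tries each selector in order and returns the first non-null result. */
    static Selector or(Selector... selectors) {
        return text -> {
            for (Selector s : selectors) {
                String result = s.select(text);
                if (result != null) {
                    return result;
                }
            }
            return null;
        };
    }

    public static void main(String[] args) {
        Selector title = or(
                regex("<title>(.*?)</title>", 1),
                regex("<h1>(.*?)</h1>", 1));
        // No <title> element here, so the second selector supplies the value.
        System.out.println(title.select("<html><h1>Fallback heading</h1></html>"));
        // prints Fallback heading
    }
}
```

Static factory methods are what make the one-line fluent form possible: with a static import, the call site reads as a nested expression rather than a chain of constructors.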
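
*2012-8-9* `version0.2.0`

The theme of this release is "convenience" (the previous theme was "flexibility").

Added the webmagic-extension module.

Added annotation support: a crawler can now be written as a POJO plus annotations, which fits Java development habits better. The following is the complete code for crawling a blog:

    // Pages whose URL matches this pattern are extracted into this class.
    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
    public class OschinaBlog {

        // XPath expression
        @ExtractBy("//title")
        private String title;

        // CSS selector instead of XPath
        @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
        private String content;

        // multi = true collects every match into a list
        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
        private List<String> tags;

        public static void main(String[] args) {
            OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
                    new ConsolePageModelPipeline(), OschinaBlog.class)
                    .scheduler(new RedisScheduler("127.0.0.1")).thread(5).run();
        }
    }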
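
For readers curious how a POJO-plus-annotation model can be filled in, the following is a minimal sketch of the general technique (a runtime annotation read through reflection). It is not the webmagic-extension implementation: `ExtractByRegex`, `BlogModel`, and `AnnotationExtractorSketch` are hypothetical names, and a regular expression stands in for the XPath and CSS rules so that the example needs only the JDK.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical annotation: a regex stands in for the XPath/CSS rules of the real @ExtractBy.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ExtractByRegex {
    String value();
}

// Example page model: plain fields plus extraction rules.
class BlogModel {
    @ExtractByRegex("<title>(.*?)</title>")
    String title;

    @ExtractByRegex("<div class=\"content\">(.*?)</div>")
    String content;
}

class AnnotationExtractorSketch {

    /** Builds a model instance and fills every annotated String field from the HTML. */
    static <T> T extract(Class<T> type, String html) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                ExtractByRegex rule = field.getAnnotation(ExtractByRegex.class);
                if (rule == null) {
                    continue; // field carries no extraction rule
                }
                Matcher m = Pattern.compile(rule.value(), Pattern.DOTALL).matcher(html);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1)); // first capture group is the value
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        String html = "<title>Hello</title><div class=\"content\">First post</div>";
        BlogModel blog = extract(BlogModel.class, html);
        System.out.println(blog.title + " / " + blog.content);
        // prints Hello / First post
    }
}
```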
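
Added a `Spider.test(url)` method for debugging while developing a crawler.

Added Redis-based distributed support.

Added XPath 2.0 syntax support (webmagic-saxon module).

Added Selenium-based browser rendering support for crawling dynamically loaded content (webmagic-selenium module).

Fixed a bug where https was not supported.

Updated the documentation: [webmagic-0.2.0 user manual](http://code4craft.github.io/webmagic/).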
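
*2012-7-25* `version0.1.0`

First stable release.

Changed several APIs to improve extensibility. Each task is now assigned an ID, so different tasks can be told apart by ID.

Rewrote the Pipeline interface: extraction results are now wrapped in a ResultItems object instead of sharing one Page object, which makes the logic easier to separate.

Added download retries, gzip support, and custom UA/cookie support.

Added multi-threaded crawling; just specify the number of threads at initialization.

Added a jQuery-style CSS selector API; elements can be extracted with `page.getHtml().$("div.body")`.

Improved the documentation. Architecture overview: [Design and principles of webmagic: how to develop a Java crawler](http://my.oschina.net/flashsword/blog/145796). Javadoc: [http://code4craft.github.io/webmagic/docs](http://code4craft.github.io/webmagic/docs).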
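
The Pipeline change in this release, where results travel in their own container rather than in the page object, can be sketched roughly as follows. The types here (`ResultItemsSketch`, `PipelineSketch`, `CollectingPipeline`) are hypothetical, simplified stand-ins and not the actual WebMagic classes; they only show why a pipeline that receives a result container does not need to know anything about downloading or parsing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for extracted fields; the real ResultItems has a richer API.
class ResultItemsSketch {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    ResultItemsSketch put(String key, Object value) {
        fields.put(key, value);
        return this; // allow chained puts
    }

    Map<String, Object> getAll() {
        return fields;
    }
}

// Hypothetical pipeline contract: it sees only the results, never the page.
interface PipelineSketch {
    void process(ResultItemsSketch items);
}

// Example pipeline that formats each field as "key: value" and keeps the lines.
class CollectingPipeline implements PipelineSketch {
    final List<String> lines = new ArrayList<>();

    @Override
    public void process(ResultItemsSketch items) {
        items.getAll().forEach((key, value) -> lines.add(key + ": " + value));
    }
}

class PipelineDemo {
    public static void main(String[] args) {
        // The extraction stage produces a result container...
        ResultItemsSketch items = new ResultItemsSketch()
                .put("title", "Hello")
                .put("views", 3);
        // ...and the output stage consumes it independently.
        CollectingPipeline pipeline = new CollectingPipeline();
        pipeline.process(items);
        pipeline.lines.forEach(System.out::println);
        // prints "title: Hello" then "views: 3"
    }
}
```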