You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
Go to file
yihua.huang 7038c00a9a reformat 11 years ago
asserts add distributed architechure 11 years ago
en_docs update en_docs 11 years ago
webmagic-avalon Do not cache document in Selectable for selected Html element #73 11 years ago
webmagic-core Change HashMap to LinkedHashMap in ResultItems for same order of input and output #76 11 years ago
webmagic-extension add warning of slf4j #78 11 years ago
webmagic-samples Clean project structure #70 11 years ago
webmagic-saxon Clean project structure #70 11 years ago
webmagic-scripts Clean project structure #70 11 years ago
webmagic-selenium Clean project structure #70 11 years ago
zh_docs add warning of slf4j #78 11 years ago
.gitignore The SeleniumDownloader should call the setRawText 11 years ago
.gitmodules update submodule url 11 years ago
.travis.yml add travis ci support for submodule 11 years ago
README.md reformat 11 years ago
pom.xml update xsoup version to 0.2.2 11 years ago
release-note.md #34 Close reader in FileCacheQueueScheduler 11 years ago
user-manual.md update version in docs 11 years ago
webmagic-avalon.md scripts readme 11 years ago

README.md

logo

Readme in Chinese

User Manual (Chinese)

Build Status

A scalable crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simplify the development of a specific crawler.

Features:

  • Simple core with high flexibility.
  • Simple API for html extracting.
  • Annotation with POJO to customize a crawler, no configuration.
  • Multi-thread and Distribution support.
  • Easy to be integrated.

Install:

Add dependencies to your pom.xml:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.4.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.4.3</version>
</dependency>

WebMagic use slf4j with slf4j-log4j12 implementation. If you customized your slf4j implementation, please exclude slf4j-log4j12.

<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>

Get Started:

First crawler:

Write a class implements PageProcessor

public class OschinaBlogPageProcesser implements PageProcessor {

    private Site site = Site.me().setDomain("my.oschina.net");

    @Override
    public void process(Page page) {
        List<String> links = page.getHtml().links().regex("http://my\\.oschina\\.net/flashsword/blog/\\d+").all();
        page.addTargetRequests(links);
        page.putField("title", page.getHtml().xpath("//div[@class='BlogEntity']/div[@class='BlogTitle']/h1").toString());
        page.putField("content", page.getHtml().$("div.content").toString());
        page.putField("tags",page.getHtml().xpath("//div[@class='BlogTags']/a/text()").all());
    }

    @Override
    public Site getSite() {
        return site;

    }

    public static void main(String[] args) {
        Spider.create(new OschinaBlogPageProcesser()).addUrl("http://my.oschina.net/flashsword/blog")
             .addPipeline(new ConsolePipeline()).run();
    }
}
  • page.addTargetRequests(links)

    Add urls for crawling.

You can also use annotation way:

@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {

    @ExtractBy("//title")
    private String title;

    @ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
    private String content;

    @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
    private List<String> tags;

    public static void main(String[] args) {
        OOSpider.create(
        	Site.me(),
			new ConsolePageModelPipeline(), OschinaBlog.class).addUrl("http://my.oschina.net/flashsword/blog").run();
    }
}

Docs and samples:

The architecture of webmagic (refered to Scrapy)

image

Javadocs: http://code4craft.github.io/webmagic/docs/en/

There are some samples in webmagic-samples package.

Lisence:

Lisenced under Apache 2.0 lisence

Contributors:

Thanks these people for commiting source code, reporting bugs or suggesting for new feature:

Thanks:

To write webmagic, I refered to the projects below :

Mail-list:

https://groups.google.com/forum/#!forum/webmagic-java

Bitdeli Badge