Lucene 3.6 English Tokenization and Custom Lucene Analyzers - 爱华网

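I have recently been working on a text classification task that involves handling English words. Java's built-in string functions alone could probably do the job, but I wanted to try an open-source tool, so I looked into Lucene and found it very powerful. The documentation is fairly complete and the source code is available, which is great. When I have time I want to study it properly and try to build a search application.

I searched for examples of tokenizing with Lucene, but the code I copied did not work. Most of the material out there targets the 2.* releases, and I could not find examples for 3.*. So I read the documentation carefully and wrote a small program, which turned out to work.

I also found that Lucene itself provides many classes for tokenizing text, and that there are contrib modules as well. My guess is that these are contributions from other people rather than part of the core, though that is only a guess. For English there are two options: the org.apache.lucene.analysis.en package in contrib, or the org.apache.lucene.analysis package, which ships with quite a few classes of its own. I noticed that the contrib en package turns the word body into bodi, and I did not work out why. So I will use the analysis package, since my requirements are modest. Here is the code.

The en package:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class en1 {
        public static void main(String[] args) throws IOException {
            EnglishAnalyzer english = new EnglishAnalyzer(Version.LUCENE_31);
            String s = "Good Afternoon Doesn't IS a good body names NAMES1,671,000 hy body";
            TokenStream ts = english.tokenStream("", new StringReader(s));
            // The attribute instance is reused; it holds the current token's text.
            CharTermAttribute cab = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(cab.toString());
            }
            ts.end();
            ts.close();
        }
    }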
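
A likely explanation for the body to bodi result above is stemming rather than a bug: EnglishAnalyzer runs a Porter stemmer over its tokens, and the Porter algorithm rewrites a trailing y to i when the rest of the word contains a vowel. The following is a minimal, stdlib-only sketch of just that one rule (step 1c of the Porter algorithm). It is not Lucene's implementation, the class and method names are made up for illustration, and it assumes lowercase input.

```java
class PorterStep1cSketch {
    // True if the character at index i counts as a vowel under Porter's
    // definition: a, e, i, o, u, or y when it follows a consonant.
    public static boolean isVowel(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) {
            return true;
        }
        return c == 'y' && i > 0 && !isVowel(w, i - 1);
    }

    // Step 1c only: replace a trailing y with i if the part before it
    // contains at least one vowel. All other Porter steps are omitted.
    public static String step1c(String word) {
        if (!word.endsWith("y")) {
            return word;
        }
        String stem = word.substring(0, word.length() - 1);
        for (int i = 0; i < stem.length(); i++) {
            if (isVowel(stem, i)) {
                return stem + "i";
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"body", "happy", "sky", "by"}) {
            System.out.println(w + " -> " + step1c(w));
        }
    }
}
```

Under this rule body becomes bodi and happy becomes happi, while sky and by are left alone because nothing before the y is a vowel. If unstemmed tokens are needed, an analyzer without a stemming step avoids the change.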

The analysis package:

    import java.io.Reader;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class abc {
        private static String testString1 = "The quick brown fox jumped over the lazy dogs";
        private static String testString2 = "xy&z mail is - xyz@sohu.com";

        // Runs the analyzer over the text and prints each token followed by " ;".
        // The stream is consumed as a plain TokenStream; casting it to a concrete
        // filter class is unnecessary and depends on analyzer internals.
        private static void printTokens(Analyzer analyzer, String testString) throws Exception {
            Reader r = new StringReader(testString);
            TokenStream ts = analyzer.tokenStream("", r);
            CharTermAttribute cab = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print(cab.toString() + " ;");
            }
            ts.end();
            ts.close();
            System.out.println();
        }

        public static void testWhitespace(String testString) throws Exception {
            System.err.println("=====Whitespace analyzer====");
            System.err.println("Method: split on whitespace");
            printTokens(new WhitespaceAnalyzer(Version.LUCENE_36), testString);
        }

        public static void testSimple(String testString) throws Exception {
            System.err.println("=====Simple analyzer====");
            System.err.println("Method: split on whitespace and symbols");
            printTokens(new SimpleAnalyzer(Version.LUCENE_36), testString);
        }

        public static void testStop(String testString) throws Exception {
            System.err.println("=====stop analyzer====");
            // Stop words: is, are, in, on, the and similar words that carry little meaning.
            System.err.println("Method: split on whitespace and symbols, then drop stop words such as is, are, in, on, the");
            printTokens(new StopAnalyzer(Version.LUCENE_36), testString);
        }

        public static void testStandard(String testString) throws Exception {
            System.err.println("=====standard analyzer====");
            System.err.println("Method: mixed splitting, including stop-word removal; also handles Chinese");
            printTokens(new StandardAnalyzer(Version.LUCENE_36), testString);
        }

        public static void main(String[] args) throws Exception {
            String text = "i am lihan, i am a boy, i come from Beijing，我是来自北京的李晗";
            testWhitespace(text);
            testSimple(text);
            testStop(text);
            testStandard(text);
        }
    }
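
To make the differences between the first three analyzers concrete without needing Lucene on the classpath, here is a rough approximation in plain Java. It is a sketch under assumptions, not Lucene's code: AnalyzerSketch and its methods are hypothetical names, the stop list is a shortened illustrative subset, and StandardAnalyzer is left out because its grammar-based tokenization is not easily mimicked in a few lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnalyzerSketch {
    // Hypothetical, shortened stop list used only for this illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "are", "in", "is", "on", "the"));

    // Whitespace style: split on runs of whitespace, keep everything else as-is.
    public static List<String> whitespace(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Simple style: a token is a maximal run of letters, lowercased.
    public static List<String> simple(String text) {
        List<String> out = new ArrayList<String>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    // Stop style: the simple tokens minus anything in the stop list.
    public static List<String> stop(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : simple(text)) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "xy&z mail is - xyz@sohu.com";
        System.out.println(whitespace(s)); // [xy&z, mail, is, -, xyz@sohu.com]
        System.out.println(simple(s));     // [xy, z, mail, is, xyz, sohu, com]
        System.out.println(stop(s));       // [xy, z, mail, xyz, sohu, com]
    }
}
```

The example input is the testString2 value from the listing above. It shows why the choice matters for something like an e-mail address: whitespace splitting keeps xyz@sohu.com whole, while letter-based splitting breaks it into three tokens.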

Article URL (爱华网) » http://www.aihuau.com/a/25101016/322036.html

Note: this article was shared by the user 凡情坠念. If it infringes your legitimate rights, please contact us to have it removed.