Lucene Source Code Analysis (7): The Analyzer
2023-09-11 14:21:40
1. How the Analyzer Is Used

The Analyzer comes into play in IndexWriter's constructor:
```java
/**
 * Constructs a new IndexWriter per the settings given in <code>conf</code>.
 * If you want to make "live" changes to this writer instance, use
 * {@link #getConfig()}.
 *
 * <p>
 * <b>NOTE:</b> after this writer is created, the given configuration instance
 * cannot be passed to another writer.
 *
 * @param d
 *          the index directory. The index is either created or appended
 *          according <code>conf.getOpenMode()</code>.
 * @param conf
 *          the configuration settings according to which IndexWriter should
 *          be initialized.
 * @throws IOException
 *           if the directory cannot be read/written to, or if it does not
 *           exist and <code>conf.getOpenMode()</code> is
 *           <code>OpenMode.APPEND</code> or if there is any other low-level
 *           IO error
 */
public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
  enableTestPoints = isEnableTestPoints();
  conf.setIndexWriter(this); // prevent reuse by other instances
  config = conf;
  infoStream = config.getInfoStream();
  softDeletesEnabled = config.getSoftDeletesField() != null;
  // obtain the write.lock. If the user configured a timeout,
  // we wrap with a sleeper and this might take some time.
  writeLock = d.obtainLock(WRITE_LOCK_NAME);

  boolean success = false;
  try {
    directoryOrig = d;
    directory = new LockValidatingDirectoryWrapper(d, writeLock);
    analyzer = config.getAnalyzer();
    mergeScheduler = config.getMergeScheduler();
    mergeScheduler.setInfoStream(infoStream);
    codec = config.getCodec();
    OpenMode mode = config.getOpenMode();
    final boolean indexExists;
    final boolean create;
    if (mode == OpenMode.CREATE) {
      indexExists = DirectoryReader.indexExists(directory);
      create = true;
    } else if (mode == OpenMode.APPEND) {
      indexExists = true;
      create = false;
    } else {
      // CREATE_OR_APPEND - create only if an index does not exist
      indexExists = DirectoryReader.indexExists(directory);
      create = !indexExists;
    }

    // If index is too old, reading the segments will throw
    // IndexFormatTooOldException.

    String[] files = directory.listAll();

    // Set up our initial SegmentInfos:
    IndexCommit commit = config.getIndexCommit();

    // Set up our initial SegmentInfos:
    StandardDirectoryReader reader;
    if (commit == null) {
      reader = null;
    } else {
      reader = commit.getReader();
    }

    if (create) {
      if (config.getIndexCommit() != null) {
        // We cannot both open from a commit point and create:
        if (mode == OpenMode.CREATE) {
          throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() with OpenMode.CREATE");
        } else {
          throw new IllegalArgumentException("cannot use IndexWriterConfig.setIndexCommit() when index has no commit");
        }
      }

      // Try to read first. This is to allow create
      // against an index that's currently open for
      // searching. In this case we write the next
      // segments_N file with no segments:
      final SegmentInfos sis = new SegmentInfos(Version.LATEST.major);
      if (indexExists) {
        final SegmentInfos previous = SegmentInfos.readLatestCommit(directory);
        sis.updateGenerationVersionAndCounter(previous);
      }
      segmentInfos = sis;
      rollbackSegments = segmentInfos.createBackupSegmentInfos();

      // Record that we have a change (zero out all
      // segments) pending:
      changed();

    } else if (reader != null) {
      // Init from an existing already opened NRT or non-NRT reader:

      if (reader.directory() != commit.getDirectory()) {
        throw new IllegalArgumentException("IndexCommit's reader must have the same directory as the IndexCommit");
      }

      if (reader.directory() != directoryOrig) {
        throw new IllegalArgumentException("IndexCommit's reader must have the same directory passed to IndexWriter");
      }

      if (reader.segmentInfos.getLastGeneration() == 0) {
        // TODO: maybe we could allow this?  It's tricky...
        throw new IllegalArgumentException("index must already have an initial commit to open from reader");
      }

      // Must clone because we don't want the incoming NRT reader to "see" any changes this writer now makes:
      segmentInfos = reader.segmentInfos.clone();

      SegmentInfos lastCommit;
      try {
        lastCommit = SegmentInfos.readCommit(directoryOrig, segmentInfos.getSegmentsFileName());
      } catch (IOException ioe) {
        throw new IllegalArgumentException("the provided reader is stale: its prior commit file \"" + segmentInfos.getSegmentsFileName() + "\" is missing from index");
      }

      if (reader.writer != null) {

        // The old writer better be closed (we have the write lock now!):
        assert reader.writer.closed;

        // In case the old writer wrote further segments (which we are now dropping),
        // update SIS metadata so we remain write-once:
        segmentInfos.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
        lastCommit.updateGenerationVersionAndCounter(reader.writer.segmentInfos);
      }

      rollbackSegments = lastCommit.createBackupSegmentInfos();
    } else {
      // Init from either the latest commit point, or an explicit prior commit point:

      String lastSegmentsFile = SegmentInfos.getLastCommitSegmentsFileName(files);
      if (lastSegmentsFile == null) {
        throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
      }

      // Do not use SegmentInfos.read(Directory) since the spooky
      // retrying it does is not necessary here (we hold the write lock):
      segmentInfos = SegmentInfos.readCommit(directoryOrig, lastSegmentsFile);

      if (commit != null) {
        // Swap out all segments, but, keep metadata in
        // SegmentInfos, like version & generation, to
        // preserve write-once.  This is important if
        // readers are open against the future commit
        // points.
        if (commit.getDirectory() != directoryOrig) {
          throw new IllegalArgumentException("IndexCommit's directory doesn't match my directory, expected=" + directoryOrig + ", got=" + commit.getDirectory());
        }

        SegmentInfos oldInfos = SegmentInfos.readCommit(directoryOrig, commit.getSegmentsFileName());
        segmentInfos.replace(oldInfos);
        changed();

        if (infoStream.isEnabled("IW")) {
          infoStream.message("IW", "init: loaded commit \"" + commit.getSegmentsFileName() + "\"");
        }
      }

      rollbackSegments = segmentInfos.createBackupSegmentInfos();
    }

    commitUserData = new HashMap<>(segmentInfos.getUserData()).entrySet();

    pendingNumDocs.set(segmentInfos.totalMaxDoc());

    // start with previous field numbers, but new FieldInfos
    // NOTE: this is correct even for an NRT reader because we'll pull FieldInfos even for the un-committed segments:
    globalFieldNumberMap = getFieldNumberMap();

    validateIndexSort();

    config.getFlushPolicy().init(config);
    bufferedUpdatesStream = new BufferedUpdatesStream(infoStream);
    docWriter = new DocumentsWriter(flushNotifications, segmentInfos.getIndexCreatedVersionMajor(), pendingNumDocs,
        enableTestPoints, this::newSegmentName,
        config, directoryOrig, directory, globalFieldNumberMap);
    readerPool = new ReaderPool(directory, directoryOrig, segmentInfos, globalFieldNumberMap,
        bufferedUpdatesStream::getCompletedDelGen, infoStream, conf.getSoftDeletesField(), reader);
    if (config.getReaderPooling()) {
      readerPool.enableReaderPooling();
    }

    // Default deleter (for backwards compatibility) is
    // KeepOnlyLastCommitDeleter:

    // Sync'd is silly here, but IFD asserts we sync'd on the IW instance:
    synchronized(this) {
      deleter = new IndexFileDeleter(files, directoryOrig, directory,
                                     config.getIndexDeletionPolicy(),
                                     segmentInfos, infoStream, this,
                                     indexExists, reader != null);

      // We incRef all files when we return an NRT reader from IW, so all files must exist even in the NRT case:
      assert create || filesExist(segmentInfos);
    }

    if (deleter.startingCommitDeleted) {
      // Deletion policy deleted the "head" commit point.
      // We have to mark ourself as changed so that if we
      // are closed w/o any further changes we write a new
      // segments_N file.
      changed();
    }

    if (reader != null) {
      // We always assume we are carrying over incoming changes when opening from reader:
      segmentInfos.changed();
      changed();
    }

    if (infoStream.isEnabled("IW")) {
      infoStream.message("IW", "init: create=" + create + " reader=" + reader);
      messageState();
    }

    success = true;

  } finally {
    if (!success) {
      if (infoStream.isEnabled("IW")) {
        infoStream.message("IW", "init: hit exception on init; releasing write lock");
      }
      IOUtils.closeWhileHandlingException(writeLock);
      writeLock = null;
    }
  }
}
```
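Note that the constructor never receives the Analyzer directly: it pulls it out of the config with `analyzer = config.getAnalyzer()`. The following is a minimal sketch of how an Analyzer travels from user code into IndexWriter (class and field names here are illustrative; it assumes a recent Lucene release, where `ByteBuffersDirectory` is the in-memory Directory):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class AnalyzerWiringDemo {
    /** Indexes one document and returns the resulting doc count. */
    public static int indexOneDoc() throws Exception {
        Directory dir = new ByteBuffersDirectory();
        // The Analyzer rides inside IndexWriterConfig; the IndexWriter
        // constructor retrieves it via config.getAnalyzer().
        IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, conf)) {
            Document doc = new Document();
            doc.add(new TextField("body", "Lucene analyzes text into terms", Field.Store.NO));
            writer.addDocument(doc); // the Analyzer runs here, at indexing time
            writer.commit();
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.numDocs();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexOneDoc());
    }
}
```

Because the config object is bound to one writer (`conf.setIndexWriter(this)` above), reusing the same IndexWriterConfig for a second writer fails; a fresh config must be built per writer.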
2. The Definition of Analyzer
```java
/**
 * An Analyzer builds TokenStreams, which analyze text. It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * In order to define what analysis is done, subclasses must define their
 * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
 * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
 * <p>
 * Simple example:
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new FooTokenizer(reader);
 *     TokenStream filter = new FooFilter(source);
 *     filter = new BarFilter(filter);
 *     return new TokenStreamComponents(source, filter);
 *   }
 *   {@literal @Override}
 *   protected TokenStream normalize(TokenStream in) {
 *     // Assuming FooFilter is about normalization and BarFilter is about
 *     // stemming, only FooFilter should be applied
 *     return new FooFilter(in);
 *   }
 * };
 * </pre>
 * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
 * <p>
 * For some concrete implementations bundled with Lucene, look in the analysis modules:
 * <ul>
 *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
 *       Analyzers for indexing content in different languages and domains.
 *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
 *       Exposes functionality from ICU to Apache Lucene.
 *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
 *       Morphological analyzer for Japanese text.
 *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
 *       Dictionary-driven lemmatization for the Polish language.
 *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
 *       Analysis for indexing phonetic signatures (for sounds-alike search).
 *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
 *       Analyzer for Simplified Chinese, which indexes words.
 *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
 *       Algorithmic Stemmer for the Polish Language.
 * </ul>
 */
```
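The Javadoc's FooTokenizer/FooFilter example can be made concrete with components that ship with Lucene. Below is a sketch, not the canonical implementation: it assumes a recent Lucene where `WhitespaceTokenizer` has a no-arg constructor, `LowerCaseFilter` lives directly in `org.apache.lucene.analysis`, and `normalize` takes a field name as its first parameter (older releases differ on all three points); the `terms` helper is illustrative, not part of the Analyzer API:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LowercaseWhitespaceAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // The Tokenizer is the source; filters wrap it, as in the Javadoc example.
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream filter = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, filter);
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        // Lowercasing is normalization, so it also applies to query terms.
        return new LowerCaseFilter(in);
    }

    /** Illustrative helper: collects the terms produced for the given text. */
    public static List<String> terms(String text) throws Exception {
        List<String> out = new ArrayList<>();
        try (Analyzer analyzer = new LowercaseWhitespaceAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                       // mandatory TokenStream lifecycle step
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }
}
```

The reset/incrementToken/end sequence is the consumer contract the Javadoc alludes to: the same reused TokenStreamComponents are rewound for every call to `tokenStream(String, Reader)`.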
As the Javadoc shows, Analyzer offers different implementations for different languages.
Among these modules, analyzers-common abstracts out the base Analyzer classes, as shown in the figure below.
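To make these per-language differences concrete, here is a small comparison of two analyzers bundled in the common module (a sketch that assumes the analysis-common artifact is on the classpath; the `analyze` helper is illustrative, not a Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerComparison {
    /** Runs the given analyzer over the text and collects the emitted terms. */
    public static List<String> analyze(Analyzer analyzer, String text) throws Exception {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("f", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String text = "Running searches";
        // StandardAnalyzer tokenizes and lowercases only:
        System.out.println(analyze(new StandardAnalyzer(), text));
        // EnglishAnalyzer additionally stems and drops English stopwords:
        System.out.println(analyze(new EnglishAnalyzer(), text));
    }
}
```

The two analyzers share the same TokenStream contract but compose different filter chains, which is exactly the "policy for extracting index terms" the Javadoc describes.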