Using Lily HBase Indexer to index the HBase row key as a field in Solr

Indexing data into Solr at the same time it is written to HBase is a good idea, and Lily HBase Indexer does exactly that.



The documentation is available at  .

That article, which explains the morphline indexer configuration, mentions an attribute named unique-key-field.



This attribute specifies the name of the document identifier field used in Solr.
The default value for this field is "id".


On reflection, isn't this just the uniqueKey defined in Solr's field configuration file, schema.xml?




<?xml version="1.0"?>
<indexer table="indextest" mapping-type="column" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper" unique-key-field="id">
<param name="morphlineFile" value="morphlines.conf"/>
</indexer>


Global indexer attributes

The following is a list of attributes that can be set on the top-level <indexer> element in an indexer configuration.


The table attribute specifies the name of the HBase table to be indexed by the indexer. It is the only mandatory attribute in the indexer element.


The mapping-type attribute has two possible values: row, or column. This attribute specifies whether row-based or column-based indexing is to be performed.

Row-based indexing treats all data within a single HBase row as input for a single document in Solr. This is the kind of indexing that would be used for an HBase table that contains a separate entity in each row, e.g. a table containing users.

Column-based indexing treats each HBase cell as input for a single document in Solr. This approach could be used for example in a messaging platform where a single user’s messages are all stored in a single row, with each message being stored in a separate cell.

The default mapping-type value is row.
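As a rough illustration of the difference between the two mapping types, the Python sketch below models an HBase table as nested dictionaries and counts the documents each mode would produce. The table layout, the function names, and the document shapes are assumptions made for illustration; they are not the indexer's actual data structures.

```python
# Toy model of an HBase table: row key -> {"family:qualifier": cell value}.
table = {
    "user1": {"messages:m1": "hi", "messages:m2": "bye"},
    "user2": {"messages:m1": "hello"},
}


def row_based_docs(table):
    """Row-based mapping: one document per row, fed by all cells of that row."""
    return [{"row": row, "cells": dict(cells)} for row, cells in table.items()]


def column_based_docs(table):
    """Column-based mapping: one document per cell."""
    return [
        {"row": row, "column": column, "value": value}
        for row, cells in table.items()
        for column, value in cells.items()
    ]


print(len(row_based_docs(table)))     # 2 rows  -> 2 documents
print(len(column_based_docs(table)))  # 3 cells -> 3 documents
```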


The read-row attribute has two possible values: dynamic, or never.

This attribute is only important when using row-based indexing. It specifies whether or not the indexer should re-read data from HBase in order to perform indexing.

When set to “dynamic”, the indexer will read the necessary data from a row if a partial update to the row is performed in HBase. In dynamic mode, the row will not be re-read if all data needed to perform indexing is included in the row update.

If this attribute is set to never, a row will never be re-read by the indexer.

The default setting is “dynamic”.
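The decision described above can be sketched in a few lines of Python. This is only a model of the documented behaviour: the function name and the idea of comparing two sets of column names are assumptions, not the indexer's real implementation.

```python
def needs_reread(read_row, needed_columns, updated_columns):
    """Return True if the row would have to be re-read from HBase.

    read_row        -- "dynamic" or "never", as in the indexer attribute
    needed_columns  -- columns required to build the Solr document
    updated_columns -- columns contained in the row update
    """
    if read_row == "never":
        return False
    if read_row != "dynamic":
        raise ValueError("read-row must be 'dynamic' or 'never'")
    # In dynamic mode a re-read is only needed for a partial update,
    # i.e. when some needed column is missing from the update.
    return bool(set(needed_columns) - set(updated_columns))


print(needs_reread("dynamic", ["data:title", "data:subject"], ["data:title"]))  # True
print(needs_reread("dynamic", ["data:title"], ["data:title"]))                  # False
print(needs_reread("never", ["data:title", "data:subject"], ["data:title"]))    # False
```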


The mapper attribute allows the user to specify a custom mapper class that will create a Solr document from a HBase Result object. The mapper class must implement the



By default, the built-in


is used.


The unique-key-formatter attribute specifies the name of the class used to format HBase row keys (as well as column families and column qualifiers) as text. A textual representation of these pieces of information is needed for indexing in Solr, as all data in Solr is textual, but row keys, column families, and column qualifiers are byte arrays.

A unique-key-formatter class must implement the



The default value of this attribute is com.ngdata.hbaseindexer.uniquekey.StringUniqueKeyFormatter. The StringUniqueKeyFormatter simply treats row keys and other byte arrays as strings.

If your row keys, column families, or qualifiers can’t simply be used as strings, consider using the com.ngdata.hbaseindexer.uniquekey.HexUniqueKeyFormatter.
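To see why a string-based formatter is not always enough, the Python sketch below contrasts decoding a row key as text with rendering it as hexadecimal. It only illustrates the idea: the function names are hypothetical, and the exact output format of the real StringUniqueKeyFormatter and HexUniqueKeyFormatter classes is not reproduced here.

```python
def as_string_key(row_key: bytes) -> str:
    """Treat the bytes as UTF-8 text; fails on bytes that are not valid UTF-8."""
    return row_key.decode("utf-8")


def as_hex_key(row_key: bytes) -> str:
    """Render every byte as two hex digits; works for any byte array."""
    return row_key.hex()


print(as_string_key(b"row6"))       # row6
print(as_hex_key(b"\x00\xff\x10"))  # 00ff10
```

A key such as b"\x00\xff\x10" has no faithful text form, so only the hex rendering is usable for it. Python's strict decoder raises an error in that case; other decoders may silently substitute characters instead.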


The unique-key-field attribute specifies the name of the document identifier field used in Solr.

The default value for this field is “id”.


The row-field attribute specifies the name of the Solr field to be used for storing an HBase row key.

This field is only important when doing column-based indexing. In order for the indexer to be able to delete all documents for a single row from the index, it needs to be able to find all documents for the row in Solr. When this attribute is populated in the indexer definition, its value is used as the name of a field in Solr to store the encoded row key.

By default, this attribute is empty, meaning that the row key is not stored in Solr. The consequence of this is that deleting a complete row or complete column family in HBase will not delete the indexed documents in Solr.
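The consequence can be modelled with a small in-memory "index". In the Python sketch below, the document ids, the field name custom-row, and the delete_row helper are all hypothetical; the point is only that without a stored row key there is nothing to match the documents of a deleted row against.

```python
def delete_row(docs, row_key, row_field=None):
    """Return the documents that remain after the HBase row `row_key` is deleted.

    If no row field is configured, the documents of the row cannot be
    found, so nothing is removed.
    """
    if row_field is None:
        return list(docs)
    return [doc for doc in docs if doc.get(row_field) != row_key]


# Column-based indexing: several documents can originate from the same row.
docs = [
    {"id": "doc-1", "custom-row": "user1", "text": "hi"},
    {"id": "doc-2", "custom-row": "user1", "text": "bye"},
    {"id": "doc-3", "custom-row": "user2", "text": "hello"},
]

print(len(delete_row(docs, "user1", row_field="custom-row")))  # 1
print(len(delete_row(docs, "user1")))                          # 3
```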


The column-family-field specifies the name of the Solr field to be used for storing the HBase column family name.

See the description of the row-field attribute for more information.

By default, this attribute is empty, so the column-family name is not saved in Solr.


The table-name-field specifies the name of the Solr field to be used for storing the name of the HBase table where a record is stored.

By default, this attribute is empty, so the name of the HBase table is not stored unless this setting is explicitly set in the indexer config.

Elements within the indexer definition

There are three types of elements that can be used within an indexer configuration: <field>, <extract>, and <param>.


The field element defines a single field to be indexed in Solr, as well as where in HBase its contents are taken from and how they are interpreted. There are typically one or more fields listed in an indexer configuration, one for each Solr field to be stored.

The field element has four attributes, listed below.


The name attribute specifies the name of a Solr field in which to store data. A field with a matching name should be defined in the Solr schema.

The name attribute is mandatory.


The value attribute specifies the data to be used from HBase for populating the field in Solr. It takes the form of a column family name and qualifier, separated by a colon.

The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier expressions will be used.

The following are examples of valid value attributes:

  • mycolumnfamily:myqualifier
  • mycolumnfamily:my*
  • mycolumnfamily:*
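The wildcard rule for value expressions can be sketched as a simple prefix match. The Python function below is an illustration under that assumption; its name and signature are hypothetical and it is not the indexer's own matching code.

```python
def matches(expression: str, family: str, qualifier: str) -> bool:
    """Check whether an HBase column matches a "family:qualifier" expression.

    A trailing asterisk in the qualifier part acts as a wildcard.
    """
    expr_family, _, expr_qualifier = expression.partition(":")
    if family != expr_family:
        return False
    if expr_qualifier.endswith("*"):
        # Everything before the asterisk must be a prefix of the qualifier.
        return qualifier.startswith(expr_qualifier[:-1])
    return qualifier == expr_qualifier


print(matches("mycolumnfamily:myqualifier", "mycolumnfamily", "myqualifier"))  # True
print(matches("mycolumnfamily:my*", "mycolumnfamily", "myother"))              # True
print(matches("mycolumnfamily:my*", "mycolumnfamily", "other"))                # False
print(matches("mycolumnfamily:*", "mycolumnfamily", "anything"))               # True
```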


The source attribute determines what portion of an HBase KeyValue will be used as indexing content.

It has two possible values: value and qualifier.

When value is specified (which is the case by default), then the cell value is used as input for indexing.

When qualifier is specified, then the column qualifier is used as input for indexing.


The type attribute defines the datatype of the content in HBase.

Because all data is stored in HBase as byte arrays, but all content in Solr is indexed as text, a method for converting from byte arrays to the actual datatype is needed.

The value of this field can be any of the datatypes supported by the HBase Bytes class: int, long, string, boolean, float, double, short, or bigdecimal.
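The Python sketch below illustrates the kind of conversion the type attribute selects. It assumes the values were written with the HBase Bytes class, which stores fixed-width numbers in big-endian byte order; the decode_cell function is hypothetical, and bigdecimal is left out to keep the example short.

```python
import struct

# Big-endian struct formats for the fixed-width types named above
# (assumption: the data was written with the HBase Bytes class).
_FORMATS = {"short": ">h", "int": ">i", "long": ">q", "float": ">f", "double": ">d"}


def decode_cell(raw: bytes, type_name: str):
    """Convert a raw HBase cell value into a Python value of the given type."""
    if type_name == "string":
        return raw.decode("utf-8")
    if type_name == "boolean":
        if len(raw) != 1:
            raise ValueError("expected a single byte for a boolean")
        return raw[0] != 0
    if type_name in _FORMATS:
        return struct.unpack(_FORMATS[type_name], raw)[0]
    raise ValueError(f"unsupported type: {type_name}")


print(decode_cell(b"\x00\x00\x00\x2a", "int"))        # 42
print(decode_cell(struct.pack(">f", 1.5), "float"))   # 1.5
print(decode_cell(b"Good Luck!", "string"))           # Good Luck!
```

The same four bytes read as a string instead of an int would give unreadable text, which is why the indexer needs to be told the type.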

If the Bytes-based representation has not been used for storing data in HBase, the name of a custom class can be specified for this attribute. The custom class must implement the




The <param> element defines a key-value pair that will be supplied to custom classes that implement the



<param> elements can also be nested in a <field> element.

The element has two attributes: name and value. Both are mandatory.

Example configuration

The example configuration below demonstrates all elements and attributes that can be used to configure an indexer.

   Do row-based indexing on table "table1", never re-reading updated content.
   Store the unique document id in Solr field called "custom-id".
   Additionally store the row key in a Solr field called "custom-row", and store the
   column family in a Solr field called "custom-family".

   Perform conversion of byte array keys using the class "com.mycompany.MyKeyFormatter".

  <!-- A float-based field taken from any qualifier in the column family "colfam" -->
  <field name="field1" value="colfam:*" source="qualifier" type="float"/>

  <param name="globalKeyA" value="globalValueA"/>
  <param name="globalKeyB" value="globalValueB"/>




Configuring and using Lily HBase Indexer (Key-Value Store Indexer) in CDH

CDH version: 4.5.7



So I looked into Lily HBase Indexer as a way to build Solr indexes for the data in HBase.


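1. Enable replication on the HBase column families

Make sure cluster-wide HBase replication is enabled. Use the HBase shell to define the replication settings for the column families.

For each existing table, set REPLICATION_SCOPE on every column family that needs to be indexed by issuing commands of the following form: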

$ hbase shell
hbase shell> disable 'record'
hbase shell> alter 'record', {NAME => 'data', REPLICATION_SCOPE => 1}
hbase shell> enable 'record'

For each new table, set REPLICATION_SCOPE on every column family that needs to be indexed by issuing a command of the following form:

$ hbase shell
hbase shell> create 'record', {NAME => 'data', REPLICATION_SCOPE => 1}

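2. Create the corresponding SolrCloud collection

The SolrCloud collection used for HBase indexing must have a Solr schema that accommodates the types of the HBase column families and qualifiers to be indexed. Create the SolrCloud collection with commands of the following form: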

    $ solrctl instancedir --generate /etc/solr/conf/record
    $ edit /etc/solr/conf/record/conf/schema.xml 
    $ solrctl instancedir --create record /etc/solr/conf/record
    $ solrctl collection --create record

3. Create the Lily HBase Indexer configuration

$ vi /etc/solr/conf/morphline-hbase-mapper.xml

<?xml version="1.0"?>
<indexer table="record" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

   <!-- The relative or absolute path on the local file system to the morphline configuration file. -->
   <!-- Use relative path "morphlines.conf" for morphlines managed by Cloudera Manager -->
   <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>

   <!-- The optional morphlineId identifies a morphline if there are multiple morphlines in morphlines.conf -->
   <!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>


4. Create the Morphline configuration file

In the CDH management UI, go to the Key-Value Store Indexer panel -> Configuration -> Service-Wide -> Morphlines -> Morphlines File.
$ vi /etc/hbase-solr/conf/morphlines.conf

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:title"
              outputField : "title"
              type : string
              source : value
            }
            {
              inputColumn : "data:subject"
              outputField : "subject"
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      #{ readAvroContainer {} }
      #{
      #  extractAvroPaths {
      #    paths : {
      #      data : /user_name
      #    }
      #  }
      #}

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]





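inputColumn: the HBase column to be written to Solr. The value consists of the column family and the column qualifier, separated by ':'. The qualifier may use the wildcard '*': for example, data:* reads every HBase column in the column family data, and data:my* reads the columns in the family data whose qualifier starts with my.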




5. Register the Lily HBase Indexer configuration with the Lily HBase Indexer Service

Once all the previous steps are complete, register the Lily HBase Indexer configuration file in ZooKeeper with the following command:

hbase-indexer add-indexer -n recordIndexer \
  -c /etc/solr/conf/morphline-hbase-mapper.xml \
  --connection-param solr.zk=xhadoop1:2181,xhadoop2:2181,xhadoop3:2181,xhadoop4:2181,xhadoop5:2181/solr \
  --connection-param solr.collection=record \
  --zookeeper xhadoop1:2181,xhadoop2:2181,xhadoop3:2181,xhadoop4:2181,xhadoop5:2181




put 'record','row6','data:title','Good Luck!My love'

{
  "responseHeader": {
    "status": 0,
    "QTime": 14,
    "params": {
      "q": "title:Good",
      "_": "1447069417938",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 0.15342641,
    "docs": [
      {
        "id": "row6",
        "title": [ "Good Luck!My love" ],
        "_version_": 1517358513903370200
      }
    ]
  }
}



HBase has two special tables: -ROOT- and .META.