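Indexing the HBase rowkey as a Solr field with Lily HBase Indexer

Indexing data into Solr at the same time it is written to HBase is a good idea, and Lily HBase Indexer does exactly that.

While designing the processing flow, though, I could not find anything online about how to index the HBase rowkey as a Solr field.

So I went looking through the morphline material instead and, sure enough, found some clues.

The reference is at https://github.com/NGDATA/hbase-indexer/wiki/Indexer-configuration .

That page, which explains the morphline indexer configuration, mentions an attribute called unique-key-field.

The original text reads: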

unique-key-field

This attribute specifies the name of the document identifier field used in Solr.
The default value for this field is "id".

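In other words, this attribute sets the name of the field that identifies a document in Solr, and it defaults to "id".

On reflection, this is surely the uniqueKey declared in schema.xml, Solr's field configuration file.

So when configuring morphline-mapper.xml, if the rowkey should appear in the indexed result, either define a dedicated field for it in schema.xml or use the uniqueKey directly.

If the uniqueKey is the field that receives the rowkey, remember not to supply a uniqueKey value yourself when loading data into HBase; write only the other regular columns.

morphline-mapper.xml template: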

<?xml version="1.0"?>
<indexer table="indextest" mapping-type="column" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper" unique-key-field="id">
<param name="morphlineFile" value="morphlines.conf"/>
</indexer>
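
To check which Solr field an indexer definition will use for the document identifier, the XML can be read programmatically. The following is a minimal sketch, not part of hbase-indexer itself: the helper name unique_key_field is hypothetical, and the fallback to "id" simply mirrors the default stated in the quoted documentation.

```python
import xml.etree.ElementTree as ET

# The template from this post, with plain ASCII quotes.
INDEXER_XML = """<?xml version="1.0"?>
<indexer table="indextest" mapping-type="column"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper"
         unique-key-field="id">
  <param name="morphlineFile" value="morphlines.conf"/>
</indexer>
"""


def unique_key_field(indexer_xml: str) -> str:
    """Return the Solr identifier field named by an indexer definition.

    Hypothetical helper: falls back to "id", the documented default,
    when the attribute is absent.
    """
    root = ET.fromstring(indexer_xml)
    return root.get("unique-key-field", "id")


if __name__ == "__main__":
    print(unique_key_field(INDEXER_XML))
```

The value returned here should match the uniqueKey declared in the collection's schema.xml.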

The full text of the reference follows. It is straightforward, so it is reproduced as-is.

Global indexer attributes

The following is a list of attributes that can be set on the top-level <indexer> element in an indexer configuration.

table

The table attribute specifies the name of the HBase table to be indexed by the indexer. It is the only mandatory attribute in the indexer element.

mapping-type

The mapping-type attribute has two possible values: row, or column. This attribute specifies whether row-based or column-based indexing is to be performed.

Row-based indexing treats all data within a single HBase row as input for a single document in Solr. This is the kind of indexing that would be used for an HBase table that contains a separate entity in each row, e.g. a table containing users.

Column-based indexing treats each HBase cell as input for a single document in Solr. This approach could be used for example in a messaging platform where a single user’s messages are all stored in a single row, with each message being stored in a separate cell.

The default mapping-type value is row.

read-row

The read-row attribute has two possible values: dynamic, or never.

This attribute is only important when using row-based indexing. It specifies whether or not the indexer should re-read data from HBase in order to perform indexing.

When set to “dynamic”, the indexer will read the necessary data from a row if a partial update to the row is performed in HBase. In dynamic mode, the row will not be re-read if all data needed to perform indexing is included in the row update.

If this attribute is set to never, a row will never be re-read by the indexer.

The default setting is “dynamic”.

mapper

The mapper attribute allows the user to specify a custom mapper class that will create a Solr document from a HBase Result object. The mapper class must implement the com.ngdata.hbaseindexer.parse.ResultToSolrMapper interface.

By default, the built-in com.ngdata.hbaseindexer.parse.DefaultResultToSolrMapper is used.

unique-key-formatter

The unique-key-formatter attribute specifies the name of the class used to format HBase row keys (as well as column families and column qualifiers) as text. A textual representation of these pieces of information is needed for indexing in Solr, as all data in Solr is textual, but row keys, column families, and column qualifiers are byte arrays.

A unique-key-formatter class must implement the com.ngdata.hbaseindexer.uniquekey.UniqueKeyFormatter interface.

The default value of this attribute is com.ngdata.hbaseindexer.uniquekey.StringUniqueKeyFormatter. The StringUniqueKey formatter simply treats row keys and other byte arrays as strings.

If your row keys, column families, or qualifiers can’t simply be used as strings, consider using the com.ngdata.hbaseindexer.uniquekey.HexUniqueKeyFormatter.
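
The difference between the two formatters can be illustrated outside Java. The sketch below is only an analogue, assuming a string formatter that decodes bytes as UTF-8 and a hex formatter that writes two hex digits per byte; the function names are hypothetical, and the exact output of the real Java classes is not reproduced here.

```python
def format_key_as_string(key: bytes) -> str:
    """Analogue of a string formatter: treat the bytes as UTF-8 text."""
    return key.decode("utf-8")


def format_key_as_hex(key: bytes) -> str:
    """Analogue of a hex formatter: two hex digits per byte, always valid text."""
    return key.hex()


if __name__ == "__main__":
    text_key = b"row6"
    binary_key = b"\x00\x01\xff"  # not valid UTF-8

    print(format_key_as_string(text_key))
    print(format_key_as_hex(binary_key))
    try:
        format_key_as_string(binary_key)
    except UnicodeDecodeError:
        print("binary key cannot be used as a string; use a hex form instead")
```

Row keys built from packed numbers or hashes fall into the second case, which is when a hex-style formatter is the safer choice.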

unique-key-field

This attribute specifies the name of the document identifier field used in Solr.

The default value for this field is “id”.

row-field

The row-field attribute specifies the name of the Solr field to be used for storing an HBase row key.

This field is only important when doing column-based indexing. In order for the indexer to be able to delete all documents for a single row from the index, it needs to be able to find all documents for the row in Solr. When this attribute is populated in the indexer definition, its value is used as the name of a field in Solr to store the encoded row key.

By default, this attribute is empty, meaning that the row key is not stored in Solr. The consequence of this is that deleting a complete row or complete column family in HBase will not delete the indexed documents in Solr.

column-family-field

The column-family-field specifies the name of the Solr field to be used for storing the HBase column family name.

See the description of the row-field attribute for more information.

By default, this attribute is empty, so the column-family name is not saved in Solr.

table-name-field

The table-name-field specifies the name of the Solr field to be used for storing the name of the HBase table where a record is stored.

By default, this attribute is empty, so the name of the HBase table is not stored unless this setting is explicitly set in the indexer config.

Elements within the indexer definition

There are three types of elements that can be used within an indexer configuration: <field>, <extract>, and <param>.

<field>

The field element defines a single field to be indexed in Solr, as well as where its contents are to be taken from and interpreted from HBase. There are typically one or more fields listed in an indexer configuration — one for each Solr field to be stored.

The field element has four attributes, listed below.

name

The name attribute specifies the name of a Solr field in which to store data. A field with a matching name should be defined in the Solr schema.

The name attribute is mandatory.

value

The value attribute specifies the data to be used from HBase for populating the field in Solr. It takes the form of a column family name and qualifier, separated by a colon.

The qualifier portion can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and qualifier expressions will be used.

The following are examples of valid value attributes:

  • mycolumnfamily:myqualifier
  • mycolumnfamily:my*
  • mycolumnfamily:*
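
The matching rule described for the value attribute (exact family, qualifier optionally ending in an asterisk) can be sketched as follows. This is an illustration of the rule as stated in the text, not the indexer's own code; the function name matches_value_expr is hypothetical.

```python
def matches_value_expr(value_expr: str, family: str, qualifier: str) -> bool:
    """Return True if family:qualifier is selected by a value expression.

    The family must match exactly. A qualifier ending in "*" is treated
    as a prefix wildcard, so "my*" matches "myqualifier" and "*" matches all.
    """
    expr_family, _, expr_qualifier = value_expr.partition(":")
    if expr_family != family:
        return False
    if expr_qualifier.endswith("*"):
        return qualifier.startswith(expr_qualifier[:-1])
    return qualifier == expr_qualifier


if __name__ == "__main__":
    for expr in ("mycolumnfamily:myqualifier", "mycolumnfamily:my*", "mycolumnfamily:*"):
        print(expr, matches_value_expr(expr, "mycolumnfamily", "myqualifier"))
```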

source

The source attribute determines what portion of an HBase KeyValue will be used as indexing content.

It has two possible values: value and qualifier.

When value is specified (which is the case by default), then the cell value is used as input for indexing.

When qualifier is specified, then the column qualifier is used as input for indexing.

type

The type attribute defines the datatype of the content in HBase.

Because all data is stored in HBase as byte arrays, but all content in Solr is indexed as text, a method for converting from byte arrays to the actual datatype is needed.

The value of this field can be one of any of the datatypes supported by the HBase Bytes class: int, long, string, boolean, float, double, short, or bigdecimal.

If the Bytes-based representation has not been used for storing data in HBase, the name of a custom class can be specified for this attribute. The custom class must implement the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
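
To make the byte-array conversion concrete, here is a minimal sketch of decoding cell values by type name. It assumes the data was written with the HBase Bytes class, which stores numbers in big-endian order and strings as UTF-8; bigdecimal is left out, and the names DECODERS and decode_cell are hypothetical.

```python
import struct

# Decoders for values assumed to be written by HBase's Bytes class:
# fixed-width numbers are big-endian, strings are UTF-8,
# and a boolean is a single byte that is non-zero when true.
DECODERS = {
    "int": lambda raw: struct.unpack(">i", raw)[0],
    "long": lambda raw: struct.unpack(">q", raw)[0],
    "short": lambda raw: struct.unpack(">h", raw)[0],
    "float": lambda raw: struct.unpack(">f", raw)[0],
    "double": lambda raw: struct.unpack(">d", raw)[0],
    "boolean": lambda raw: raw[0] != 0,
    "string": lambda raw: raw.decode("utf-8"),
}


def decode_cell(raw: bytes, type_name: str):
    """Convert a raw cell value to a Python value for the given type name."""
    return DECODERS[type_name](raw)


if __name__ == "__main__":
    print(decode_cell(b"\x00\x00\x00\x2a", "int"))
    print(decode_cell(struct.pack(">d", 1.5), "double"))
    print(decode_cell(b"Good Luck!", "string"))
```

Choosing the wrong type for a column does not fail loudly in every case: four arbitrary bytes always decode to some int, so the type in the indexer definition has to match how the data was written.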

<param>

The <param> element defines a key-value pair that will be supplied to custom classes that implement the com.ngdata.hbaseindexer.Configurable interface.

<param> elements can also be nested in a <field> element.

The element has two attributes: name and value. Both are mandatory.

Example configuration

The example configuration below demonstrates all elements and attributes that can be used to configure an indexer.


<!--
   Do row-based indexing on table "table1", never re-reading updated content.
   Store the unique document id in Solr field called "custom-id".
   Additionally store the row key in a Solr field called "custom-row", and store the
   column family in a Solr field called "custom-family".

   Perform conversion of byte array keys using the class "com.mycompany.MyKeyFormatter".
-->
<indexer
    table="table1"
    mapping-type="row"
    read-row="never"
    unique-key-field="custom-id"
    row-field="custom-row"
    column-family-field="custom-family"
    table-name-field="custom-table"
    unique-key-formatter="com.mycompany.MyKeyFormatter"
    >

  <!-- A float-based field taken from any qualifier in the column family "colfam" -->
  <field name="field1" value="colfam:*" source="qualifier" type="float"/>

  <param name="globalKeyA" value="globalValueA"/>
  <param name="globalKeyB" value="globalValueB"/>

</indexer>

 

 

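Configuring and using Lily HBase Indexer (Key-Value Store Indexer) in CDH

CDH version: 4.5.7

A while ago I was learning Solr and built a test system in which exact queries go to HBase and fuzzy queries go to Solr.

Loading data into HBase through the native API was reasonably efficient, but writing data to Solr to build the index frequently threw exceptions.

So I looked into Lily HBase Indexer as a way to build the Solr index for the data in HBase.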

 

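1. Enable replication on the HBase column families

Make sure cluster-wide HBase replication is enabled. Use the HBase shell to define the replication settings for column families.

For each existing table, set REPLICATION_SCOPE on every column family that needs to be indexed, by issuing commands of the following form: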

$ hbase shell
hbase shell> disable 'record'
hbase shell> alter 'record', {NAME => 'data', REPLICATION_SCOPE => 1}
hbase shell> enable 'record'

For each new table, set REPLICATION_SCOPE on every column family that needs to be indexed, by issuing a command of the following form:

$ hbase shell
hbase shell> create 'record', {NAME => 'data', REPLICATION_SCOPE => 1}

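2. Create the corresponding SolrCloud collection

A SolrCloud collection used for HBase indexing must have a Solr schema that accommodates the types of the HBase column families and qualifiers to be indexed. Create the SolrCloud collection with commands of the following form:

<1> Generate the instance configuration files:
    $ solrctl instancedir --generate /etc/solr/conf/record
<2> Edit the collection's schema.xml to suit your own requirements.
    $ edit /etc/solr/conf/record/conf/schema.xml
<3> Upload the instance configuration files to ZooKeeper:
    $ solrctl instancedir --create record /etc/solr/conf/record
<4> Create the collection from the instance configuration:
    $ solrctl collection --create record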

3. Create the Lily HBase Indexer configuration

$ vi /etc/solr/conf/morphline-hbase-mapper.xml

<?xml version="1.0"?>
<indexer table="record" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">

   <!-- The relative or absolute path on the local file system to the morphline configuration file. -->
   <!-- Use relative path "morphlines.conf" for morphlines managed by Cloudera Manager -->
   <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>

   <!-- The optional morphlineId identifies a morphline if there are multiple morphlines in morphlines.conf -->
   <!-- <param name="morphlineId" value="morphline1"/> -->

</indexer>

4. Create the Morphline configuration file

In the CDH management UI, go to the Key-Value Store Indexer panel -> Configuration -> Service-Wide -> Morphlines -> Morphlines File.
$ vi /etc/hbase-solr/conf/morphlines.conf

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]

    commands : [                    
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:title"
              outputField : "title"
              type : string 
              source : value
            }

            {
              inputColumn : "data:subject"
              outputField : "subject"
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      #for avro use with type : "byte[]" in extractHBaseCells mapping above
      #{ readAvroContainer {} } 
      #{ 
      #  extractAvroPaths {
      #    paths : { 
      #      data : /user_name      
      #    }
      #  }
      #}

      { logTrace { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

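id: the identifier of the current morphline within the morphlines file.

importCommands: the package paths of the commands to import.

extractHBaseCells: this command reads HBase column data and writes it into a SolrInputDocument object. It must contain zero or more mappings objects.

mappings: specifies the field mappings for HBase column qualifiers.

inputColumn: the HBase column to be written to Solr. The value consists of a column family and a column qualifier separated by ':'. The qualifier may use the wildcard '*': for example, data:* reads every HBase column in the data family, and data:my* reads the columns in the data family whose qualifier starts with my.

outputField: the name of the output field for the record read by the morphline. It must match a field name in Solr's schema.xml, otherwise the data will not be written correctly.

type: defines the data type of the HBase data being read. HBase stores everything as byte[], while Solr indexes all content as text, so a way of converting byte[] into the actual data type is needed, and the type parameter provides it. The supported types are byte[] (copies the HBase byte[] data unchanged), int, long, string, boolean, float, double, short and bigdecimal. A custom type can also be specified by implementing the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.

source: specifies which part of the HBase KeyValue is used as indexing input. The options are 'value' and 'qualifier': value uses the HBase cell value as the indexing input, and qualifier uses the column qualifier.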
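
How inputColumn, outputField and source combine can be sketched with a toy model. The code below is not the morphline implementation: it assumes every cell is a UTF-8 string, represents a row as a dict keyed by (family, qualifier), and uses the hypothetical function name extract_cells.

```python
from collections import defaultdict


def extract_cells(cells, mappings):
    """Toy model of the mappings: build {outputField: [values]} from a row.

    cells    -- dict mapping (family, qualifier) to the raw cell bytes
    mappings -- list of dicts with inputColumn, outputField and source keys
    """
    doc = defaultdict(list)
    for (family, qualifier), value in cells.items():
        for mapping in mappings:
            want_family, _, want_qualifier = mapping["inputColumn"].partition(":")
            if want_family != family:
                continue
            if want_qualifier.endswith("*"):
                if not qualifier.startswith(want_qualifier[:-1]):
                    continue
            elif want_qualifier != qualifier:
                continue
            # source selects which part of the cell becomes the field content.
            if mapping["source"] == "value":
                content = value.decode("utf-8")
            else:
                content = qualifier
            doc[mapping["outputField"]].append(content)
    return dict(doc)


if __name__ == "__main__":
    row = {("data", "title"): b"Good Luck!My love", ("data", "subject"): b"greeting"}
    rules = [
        {"inputColumn": "data:title", "outputField": "title", "source": "value"},
        {"inputColumn": "data:*", "outputField": "columns", "source": "qualifier"},
    ]
    print(extract_cells(row, rules))
```

Because a wildcard can select several cells, an output field may receive more than one value, which is why the corresponding Solr field would need to be multi-valued in that case.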

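5. Register the Lily HBase Indexer configuration with the Lily HBase Indexer Service

Once all the previous steps are complete, the Lily HBase Indexer configuration has to be registered in ZooKeeper, using the following command: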

hbase-indexer add-indexer -n recordIndexer -c /etc/solr/conf/morphline-hbase-mapper.xml --connection-param solr.zk=xhadoop1:2181,xhadoop2:2181,xhadoop3:2181,xhadoop4:2181,xhadoop5:2181/solr --connection-param solr.collection=record --zookeeper xhadoop1:2181,xhadoop2:2181,xhadoop3:2181,xhadoop4:2181,xhadoop5:2181
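
A command this long is easy to mistype, so it can help to assemble it from its parts. The sketch below only builds the argument list and prints it; it does not run anything. It assumes the long options are written with two ASCII hyphens, reuses the host names and paths from this post, and the function name build_add_indexer_command is hypothetical.

```python
import shlex


def build_add_indexer_command(name, config_path, solr_zk, collection, zookeeper):
    """Assemble the argument list for the add-indexer command shown in this post."""
    return [
        "hbase-indexer", "add-indexer",
        "-n", name,
        "-c", config_path,
        "--connection-param", f"solr.zk={solr_zk}",
        "--connection-param", f"solr.collection={collection}",
        "--zookeeper", zookeeper,
    ]


# The five ZooKeeper hosts used in this post, all on port 2181.
ZK_HOSTS = ",".join(f"xhadoop{i}:2181" for i in range(1, 6))

ARGV = build_add_indexer_command(
    name="recordIndexer",
    config_path="/etc/solr/conf/morphline-hbase-mapper.xml",
    solr_zk=ZK_HOSTS + "/solr",  # the Solr znode lives under /solr
    collection="record",
    zookeeper=ZK_HOSTS,
)

if __name__ == "__main__":
    print(shlex.join(ARGV))
```

Generating the host list in one place avoids the two ZooKeeper lists drifting apart through a typo in a port number.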

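Once the five steps above have been completed in order, you can test that data written to HBase is indexed into Solr.

6. Write data to HBase

Write data to the HBase table configured for indexing, as follows:

put 'record','row6','data:title','Good Luck!My love'
A few seconds after the write, the inserted data can be found in the corresponding Solr collection, which shows that the configuration works.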

{
  "responseHeader": {
    "status": 0,
    "QTime": 14,
    "params": { "q": "title:Good", "_": "1447069417938", "wt": "json" }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 0.15342641,
    "docs": [
      { "id": "row6", "title": [ "Good Luck!My love" ], "_version_": 1517358513903370200 }
    ]
  }
}
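
Since the rowkey ends up in the id field, a Solr hit can be turned back into an HBase rowkey for the exact lookup. The sketch below parses a response shaped like the one in this post and returns the matching rowkeys. It works on a stored copy of the response rather than calling Solr, and the function name matched_row_keys is hypothetical.

```python
import json

# A copy of the response shown in this post, with plain ASCII quotes.
SAMPLE_RESPONSE = """
{"responseHeader": {"status": 0, "QTime": 14,
                    "params": {"q": "title:Good", "_": "1447069417938", "wt": "json"}},
 "response": {"numFound": 1, "start": 0, "maxScore": 0.15342641,
              "docs": [{"id": "row6", "title": ["Good Luck!My love"],
                        "_version_": 1517358513903370200}]}}
"""


def matched_row_keys(response_text: str, key_field: str = "id"):
    """Return the HBase rowkeys stored in key_field for every matching document."""
    body = json.loads(response_text)
    if body["responseHeader"]["status"] != 0:
        raise RuntimeError("Solr reported a non-zero status")
    return [doc[key_field] for doc in body["response"]["docs"]]


if __name__ == "__main__":
    print(matched_row_keys(SAMPLE_RESPONSE))
```

Each returned key can then be used for a direct get against the HBase table, which is the split between fuzzy and exact queries described at the start of this post.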

The relationship between ZooKeeper and user tables in HBase

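HBase has two special tables: -ROOT- and .META.

.META. records the region information of the user tables, and the .META. table itself can be split.
-ROOT- records the region information of .META., that is, which RegionServers the .META. table has been assigned to.
ZooKeeper records the location of -ROOT-.

Before accessing user data, a client first contacts ZooKeeper, then reads the -ROOT- table, then the .META. table, and only then knows where the user data lives. This takes several network round trips, although the client caches the results.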
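
The lookup chain and the effect of the client cache can be sketched with a toy model. Everything here is illustrative: the class LookupClient, the dictionaries standing in for ZooKeeper, -ROOT- and .META., and the server names are all hypothetical, and regions are reduced to one location per table.

```python
class LookupClient:
    """Toy model of the ZooKeeper -> -ROOT- -> .META. -> user table lookup."""

    def __init__(self, zookeeper, root_table, meta_table):
        self.zookeeper = zookeeper    # knows where -ROOT- is
        self.root_table = root_table  # knows where .META. is
        self.meta_table = meta_table  # knows where user tables are
        self.cache = {}
        self.network_calls = 0

    def locate(self, table):
        """Return the server holding the table, using the cache when possible."""
        if table in self.cache:
            return self.cache[table]
        self.network_calls += 1       # 1. ask ZooKeeper for -ROOT-
        _root_server = self.zookeeper["-ROOT-"]
        self.network_calls += 1       # 2. read -ROOT- to find .META.
        _meta_server = self.root_table[".META."]
        self.network_calls += 1       # 3. read .META. to find the user table
        server = self.meta_table[table]
        self.cache[table] = server
        return server


if __name__ == "__main__":
    client = LookupClient(
        zookeeper={"-ROOT-": "rs1"},
        root_table={".META.": "rs2"},
        meta_table={"record": "rs3"},
    )
    print(client.locate("record"), client.network_calls)
    print(client.locate("record"), client.network_calls)
```

The second call returns from the cache without any further lookups, which is why the multi-step chain costs little in steady state.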