1 Filters
HBase offers more than simple put, delete, and get/scan operations: it also provides more advanced filters (Filter) for queries.
Filters can restrict results by column family, column, version, and other conditions. Because HBase keeps its data sorted in three dimensions (row key, column, version), filters can do this work efficiently. An RPC request that carries a filter ships that filter to every RegionServer involved, so the filtering happens on the server side, which also reduces the amount of data sent over the network. Both of HBase's read operations, get() and scan(), accept filters.
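As a rough sketch of how a filter is attached on the client side (assuming the older HTable-based Java client API that matches the 0.9x-era shell syntax used below, and the test1 table with the sf column family created in the examples that follow; the class name FilterScanSketch is only for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test1");   // table created in the examples below
        try {
            Scan scan = new Scan();
            // The filter object is serialized with the scan RPC and evaluated on each
            // RegionServer, so only matching cells travel back to the client.
            scan.setFilter(new ValueFilter(CompareOp.EQUAL,
                    new BinaryComparator(Bytes.toBytes("sku188"))));
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                System.out.println(r);
            }
            scanner.close();
        } finally {
            table.close();
        }
    }
}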
The main filter classes
1. Comparison Filters
1.1 RowFilter
1.2 FamilyFilter
1.3 QualifierFilter
1.4 ValueFilter
1.5 DependentColumnFilter
2. Dedicated Filters
2.1 SingleColumnValueFilter
2.2 SingleColumnValueExcludeFilter
2.3 PrefixFilter
2.4 PageFilter
2.5 KeyOnlyFilter
2.6 FirstKeyOnlyFilter
2.7 TimestampsFilter
2.8 RandomRowFilter
3. Decorating Filters
3.1 SkipFilter
3.2 WhileMatchFilter
The following examples show these filters in action:
create 'test1', 'lf', 'sf'
-- lf: column family of LONG values (binary value)
-- sf: column family of STRING values
Load the test data:
put 'test1', 'user1|ts1', 'sf:c1', 'sku1'
put 'test1', 'user1|ts2', 'sf:c1', 'sku188'
put 'test1', 'user1|ts3', 'sf:s1', 'sku123'
put 'test1', 'user2|ts4', 'sf:c1', 'sku2'
put 'test1', 'user2|ts5', 'sf:c2', 'sku288'
put 'test1', 'user2|ts6', 'sf:s1', 'sku222'
The rowkey records which user (userX) acted and when (tsX).
The column name records what action was taken, and the value records which product it involved (value: skuXXX); for example, c1: click from homepage; c2: click from ad; s1: search from homepage; b1: buy.
1. Which cells have the value sku188?
scan test1, FILTER=>"ValueFilter(=,binary:sku188)" ROW COLUMN+CELL user1|ts2 column=sf:c1, timestamp=1409122354918, value=sku188
2. Which cells have a value containing 88?
scan test1, FILTER=>"ValueFilter(=,substring:88)" ROW COLUMN+CELL user1|ts2 column=sf:c1, timestamp=1409122354918, value=sku188 user2|ts5 column=sf:c2, timestamp=1409122355030, value=sku288
3. Users who came in via an ad click (column c2) whose value contains 88
scan test1, FILTER=>"ColumnPrefixFilter(c2) AND ValueFilter(=,substring:88)" ROW COLUMN+CELL user2|ts5 column=sf:c2, timestamp=1409122355030, value=sku2884.通过搜索进来的(column为s)值包含123或者222的用户
scan test1, FILTER=>"ColumnPrefixFilter(s) AND ( ValueFilter(=,substring:123) OR ValueFilter(=,substring:222) )" ROW COLUMN+CELL user1|ts3 column=sf:s1, timestamp=1409122354954, value=sku123 user2|ts6 column=sf:s1, timestamp=1409122355970, value=sku222
5. Find all rows whose rowkey starts with user1
scan test1, FILTER => "PrefixFilter (user1)" ROW COLUMN+CELL user1|ts1 column=sf:c1, timestamp=1409122354868, value=sku1 user1|ts2 column=sf:c1, timestamp=1409122354918, value=sku188 user1|ts3 column=sf:s1, timestamp=1409122354954, value=sku123
6. FirstKeyOnlyFilter: a rowkey can have many columns, and the same column under the same rowkey can hold several versions; this filter returns only the first version of the first column (the first key) of each row.
KeyOnlyFilter: return only the key, not the value.
scan test1, FILTER=>"FirstKeyOnlyFilter() AND ValueFilter(=,binary:sku188) AND KeyOnlyFilter()" ROW COLUMN+CELL user1|ts2 column=sf:c1, timestamp=1409122354918, value=
7. Starting from user1|ts2, find all rows whose rowkey starts with user1
scan test1, {STARTROW=>user1|ts2, FILTER => "PrefixFilter (user1)"} ROW COLUMN+CELL user1|ts2 column=sf:c1, timestamp=1409122354918, value=sku188 user1|ts3 column=sf:s1, timestamp=1409122354954, value=sku123
8. Starting from user1|ts2, scan until the rowkey reaches user2 (the stop row is exclusive)
scan test1, {STARTROW=>user1|ts2, STOPROW=>user2} ROW COLUMN+CELL user1|ts2 column=sf:c1, timestamp=1409122354918, value=sku188 user1|ts3 column=sf:s1, timestamp=1409122354954, value=sku123
9. Find rows whose rowkey contains ts3
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.filter.RowFilter
scan 'test1', {FILTER => RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'), SubstringComparator.new('ts3'))}
ROW              COLUMN+CELL
 user1|ts3       column=sf:s1, timestamp=1409122354954, value=sku123
10. Find rows whose rowkey contains ts
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.filter.RowFilter
scan 'test1', {FILTER => RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'), SubstringComparator.new('ts'))}
ROW              COLUMN+CELL
 user1|ts1       column=sf:c1, timestamp=1409122354868, value=sku1
 user1|ts2       column=sf:c1, timestamp=1409122354918, value=sku188
 user1|ts3       column=sf:s1, timestamp=1409122354954, value=sku123
 user2|ts4       column=sf:c1, timestamp=1409122354998, value=sku2
 user2|ts5       column=sf:c2, timestamp=1409122355030, value=sku288
 user2|ts6       column=sf:s1, timestamp=1409122355970, value=sku222
Add one more test row:
put 'test1', 'user2|err', 'sf:s1', 'sku999'
11. Find rows whose rowkey matches the pattern user&lt;digits&gt;|ts&lt;digits&gt;; the test row just added does not match the regular expression, so it is not returned
import org.apache.hadoop.hbase.filter.RegexStringComparator
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.filter.RowFilter
scan 'test1', {FILTER => RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'), RegexStringComparator.new('^user\d+\|ts\d+$'))}
ROW              COLUMN+CELL
 user1|ts1       column=sf:c1, timestamp=1409122354868, value=sku1
 user1|ts2       column=sf:c1, timestamp=1409122354918, value=sku188
 user1|ts3       column=sf:s1, timestamp=1409122354954, value=sku123
 user2|ts4       column=sf:c1, timestamp=1409122354998, value=sku2
 user2|ts5       column=sf:c2, timestamp=1409122355030, value=sku288
 user2|ts6       column=sf:s1, timestamp=1409122355970, value=sku222
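The same regular-expression match in the Java client uses RowFilter with RegexStringComparator; note that the backslashes must be doubled inside a Java string literal. A rough sketch, same assumptions as the first sketch:

// Keep rows whose key matches user<digits>|ts<digits>, so 'user2|err' is skipped.
Scan scan = new Scan();
scan.setFilter(new RowFilter(CompareOp.EQUAL,
        new RegexStringComparator("^user\\d+\\|ts\\d+$")));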
Add another test row:
put 'test1', 'user1|ts9', 'sf:b1', 'sku1'
12. Find cells in columns starting with b1 whose value is sku1
scan test1, FILTER=>"ColumnPrefixFilter(b1) AND ValueFilter(=,binary:sku1)" ROW COLUMN+CELL user1|ts9 column=sf:b1, timestamp=1409124908668, value=sku1
13. Find rows where column b1 has the value sku1 (using SingleColumnValueFilter)
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'test1', {COLUMNS => 'sf:b1', FILTER => SingleColumnValueFilter.new(Bytes.toBytes('sf'), Bytes.toBytes('b1'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('sku1'))}
ROW              COLUMN+CELL
 user1|ts9       column=sf:b1, timestamp=1409124908668, value=sku1
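One caveat with SingleColumnValueFilter in the Java client: by default, rows that do not contain the tested column at all still pass the filter (the shell example above hides this because COLUMNS is restricted to sf:b1, so such rows produce no output anyway). Calling setFilterIfMissing(true) drops them explicitly. A rough sketch, same assumptions as the first sketch:

// Keep only rows where sf:b1 exists and equals 'sku1'.
SingleColumnValueFilter scvf = new SingleColumnValueFilter(
        Bytes.toBytes("sf"), Bytes.toBytes("b1"),
        CompareOp.EQUAL, Bytes.toBytes("sku1"));
scvf.setFilterIfMissing(true);   // drop rows that have no sf:b1 column
Scan scan = new Scan();
scan.setFilter(scvf);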