首页 > 其他分享> > Flink实战（九十九）：双流join（二）双流 Join 的3种操作示例

Flink实战（九十九）：双流join（二）双流 Join 的3种操作示例

2021-06-10 20:01:41 作者：互联网

声明：本系列博客是根据SGG的视频整理而成，非常适合大家入门学习。

《2021年最新版大数据面试题全面开启更新》

在数据库中的静态表上做 OLAP 分析时，两表 join 是非常常见的操作。同理，在流式处理作业中，有时也需要在两条流上做 join 以获得更丰富的信息。Flink DataStream API 为用户提供了3个算子来实现双流 join，分别是：

join()
coGroup()
intervalJoin()

本文举例说明它们的使用方法，顺便聊聊比较特殊的 interval join 的原理。

1 准备数据

从 Kafka 分别接入点击流和订单流，并转化为 POJO。

复制代码

DataStream<String> clickSourceStream = env
  .addSource(new FlinkKafkaConsumer011<>(
    "ods_analytics_access_log",
    new SimpleStringSchema(),
    kafkaProps
  ).setStartFromLatest());
DataStream<String> orderSourceStream = env
  .addSource(new FlinkKafkaConsumer011<>(
    "ods_ms_order_done",
    new SimpleStringSchema(),
    kafkaProps
  ).setStartFromLatest());

DataStream<AnalyticsAccessLogRecord> clickRecordStream = clickSourceStream
  .map(message -> JSON.parseObject(message, AnalyticsAccessLogRecord.class));
DataStream<OrderDoneLogRecord> orderRecordStream = orderSourceStream
  .map(message -> JSON.parseObject(message, OrderDoneLogRecord.class));

复制代码

2 join()

join() 算子提供的语义为"Window join"，即按照指定字段和（滚动/滑动/会话）窗口进行 inner join，支持处理时间和事件时间两种时间特征。
以下示例以10秒滚动窗口，将两个流通过商品 ID 关联，取得订单流中的售价相关字段。

复制代码

clickRecordStream
  .join(orderRecordStream)
  .where(record -> record.getMerchandiseId())
  .equalTo(record -> record.getMerchandiseId())
  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
  .apply(new JoinFunction<AnalyticsAccessLogRecord, OrderDoneLogRecord, String>() {
    @Override
    public String join(AnalyticsAccessLogRecord acce***ecord, OrderDoneLogRecord orderRecord) throws Exception {
      return StringUtils.join(Arrays.asList(
        acce***ecord.getMerchandiseId(),
        orderRecord.getPrice(),
        orderRecord.getCouponMoney(),
        orderRecord.getRebateAmount()
      ), '\t');
    }
  })
  .print().setParallelism(1);

复制代码

简单易用。

3 coGroup()

只有 inner join 肯定还不够，如何实现 left/right outer join 呢？答案就是利用 coGroup() 算子。它的调用方式类似于 join() 算子，也需要开窗，但是 CoGroupFunction 比 JoinFunction 更加灵活，可以按照用户指定的逻辑匹配左流和/或右流的数据并输出。
以下的例子就实现了点击流 left join 订单流的功能，是很朴素的 nested loop join 思想（二重循环）。

复制代码

clickRecordStream
  .coGroup(orderRecordStream)
  .where(record -> record.getMerchandiseId())
  .equalTo(record -> record.getMerchandiseId())
  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
  .apply(new CoGroupFunction<AnalyticsAccessLogRecord, OrderDoneLogRecord, Tuple2<String, Long>>() {
    @Override
    public void coGroup(Iterable<AnalyticsAccessLogRecord> acce***ecords, Iterable<OrderDoneLogRecord> orderRecords, Collector<Tuple2<String, Long>> collector) throws Exception {
      for (AnalyticsAccessLogRecord acce***ecord : acce***ecords) {
        boolean isMatched = false;
        for (OrderDoneLogRecord orderRecord : orderRecords) {
          // 右流中有对应的记录
          collector.collect(new Tuple2<>(acce***ecord.getMerchandiseName(), orderRecord.getPrice()));
          isMatched = true;
        }
        if (!isMatched) {
          // 右流中没有对应的记录
          collector.collect(new Tuple2<>(acce***ecord.getMerchandiseName(), null));
        }
      }
    }
  })
  .print().setParallelism(1);

复制代码

4 intervalJoin()

join() 和 coGroup() 都是基于窗口做关联的。但是在某些情况下，两条流的数据步调未必一致。例如，订单流的数据有可能在点击流的购买动作发生之后很久才被写入，如果用窗口来圈定，很容易 join 不上。所以 Flink 又提供了"Interval join"的语义，按照指定字段以及右流相对左流偏移的时间区间进行关联，即：

right.timestamp ∈ [left.timestamp + lowerBound; left.timestamp + upperBound]

interval join 也是 inner join，虽然不需要开窗，但是需要用户指定偏移区间的上下界，并且只支持事件时间。
示例代码如下。注意在运行之前，需要分别在两个流上应用 assignTimestampsAndWatermarks() 方法获取事件时间戳和水印。

复制代码

clickRecordStream
  .keyBy(record -> record.getMerchandiseId())
  .intervalJoin(orderRecordStream.keyBy(record -> record.getMerchandiseId()))
  .between(Time.seconds(-30), Time.seconds(30))
  .process(new ProcessJoinFunction<AnalyticsAccessLogRecord, OrderDoneLogRecord, String>() {
    @Override
    public void processElement(AnalyticsAccessLogRecord acce***ecord, OrderDoneLogRecord orderRecord, Context context, Collector<String> collector) throws Exception {
      collector.collect(StringUtils.join(Arrays.asList(
        acce***ecord.getMerchandiseId(),
        orderRecord.getPrice(),
        orderRecord.getCouponMoney(),
        orderRecord.getRebateAmount()
      ), '\t'));
    }
  })
  .print().setParallelism(1);

复制代码

由上可见，interval join 与 window join 不同，是两个 KeyedStream 之上的操作，并且需要调用 between() 方法指定偏移区间的上下界。如果想令上下界是开区间，可以调用 upperBoundExclusive()/lowerBoundExclusive() 方法。

interval join 的实现原理

以下是 KeyedStream.process(ProcessJoinFunction) 方法调用的重载方法的逻辑。

复制代码

public <OUT> SingleOutputStreamOperator<OUT> process(
        ProcessJoinFunction<IN1, IN2, OUT> processJoinFunction,
        TypeInformation<OUT> outputType) {
    Preconditions.checkNotNull(processJoinFunction);
    Preconditions.checkNotNull(outputType);
    final ProcessJoinFunction<IN1, IN2, OUT> cleanedUdf = left.getExecutionEnvironment().clean(processJoinFunction);
    final IntervalJoinOperator<KEY, IN1, IN2, OUT> operator =
        new IntervalJoinOperator<>(
            lowerBound,
            upperBound,
            lowerBoundInclusive,
            upperBoundInclusive,
            left.getType().createSerializer(left.getExecutionConfig()),
            right.getType().createSerializer(right.getExecutionConfig()),
            cleanedUdf
        );
    return left
        .connect(right)
        .keyBy(keySelector1, keySelector2)
        .transform("Interval Join", outputType, operator);
}

复制代码

可见是先对两条流执行 connect() 和 keyBy() 操作，然后利用 IntervalJoinOperator 算子进行转换。在 IntervalJoinOperator 中，会利用两个 MapState 分别缓存左流和右流的数据。

复制代码

private transient MapState<Long, List<BufferEntry<T1>>> leftBuffer;
private transient MapState<Long, List<BufferEntry<T2>>> rightBuffer;

@Override
public void initializeState(StateInitializationContext context) throws Exception {
    super.initializeState(context);
    this.leftBuffer = context.getKeyedStateStore().getMapState(new MapStateDescriptor<>(
        LEFT_BUFFER,
        LongSerializer.INSTANCE,
        new ListSerializer<>(new BufferEntrySerializer<>(leftTypeSerializer))
    ));
    this.rightBuffer = context.getKeyedStateStore().getMapState(new MapStateDescriptor<>(
        RIGHT_BUFFER,
        LongSerializer.INSTANCE,
        new ListSerializer<>(new BufferEntrySerializer<>(rightTypeSerializer))
    ));
}

复制代码

其中 Long 表示事件时间戳，List<BufferEntry<T>> 表示该时刻到来的数据记录。当左流和右流有数据到达时，会分别调用 processElement1() 和 processElement2() 方法，它们都调用了 processElement() 方法，代码如下。

复制代码

@Override
public void processElement1(StreamRecord<T1> record) throws Exception {
    processElement(record, leftBuffer, rightBuffer, lowerBound, upperBound, true);
}

@Override
public void processElement2(StreamRecord<T2> record) throws Exception {
    processElement(record, rightBuffer, leftBuffer, -upperBound, -lowerBound, false);
}

@SuppressWarnings("unchecked")
private <THIS, OTHER> void processElement(
        final StreamRecord<THIS> record,
        final MapState<Long, List<IntervalJoinOperator.BufferEntry<THIS>>> ourBuffer,
        final MapState<Long, List<IntervalJoinOperator.BufferEntry<OTHER>>> otherBuffer,
        final long relativeLowerBound,
        final long relativeUpperBound,
        final boolean isLeft) throws Exception {
    final THIS ourValue = record.getValue();
    final long ourTimestamp = record.getTimestamp();
    if (ourTimestamp == Long.MIN_VALUE) {
        throw new FlinkException("Long.MIN_VALUE timestamp: Elements used in " +
                "interval stream joins need to have timestamps meaningful timestamps.");
    }
    if (isLate(ourTimestamp)) {
        return;
    }
    addToBuffer(ourBuffer, ourValue, ourTimestamp);
    for (Map.Entry<Long, List<BufferEntry<OTHER>>> bucket: otherBuffer.entries()) {
        final long timestamp  = bucket.getKey();
        if (timestamp < ourTimestamp + relativeLowerBound ||
                timestamp > ourTimestamp + relativeUpperBound) {
            continue;
        }
        for (BufferEntry<OTHER> entry: bucket.getValue()) {
            if (isLeft) {
                collect((T1) ourValue, (T2) entry.element, ourTimestamp, timestamp);
            } else {
                collect((T1) entry.element, (T2) ourValue, timestamp, ourTimestamp);
            }
        }
    }
    long cleanupTime = (relativeUpperBound > 0L) ? ourTimestamp + relativeUpperBound : ourTimestamp;
    if (isLeft) {
        internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_LEFT, cleanupTime);
    } else {
        internalTimerService.registerEventTimeTimer(CLEANUP_NAMESPACE_RIGHT, cleanupTime);
    }
}

复制代码

这段代码的思路是：

1. 取得当前流 StreamRecord 的时间戳，调用 isLate() 方法判断它是否是迟到数据（即时间戳小于当前水印值），如是则丢弃。
2. 调用 addToBuffer() 方法，将时间戳和数据一起插入当前流对应的 MapState。
3. 遍历另外一个流的 MapState，如果数据满足前述的时间区间条件，则调用 collect() 方法将该条数据投递给用户定义的 ProcessJoinFunction 进行处理。collect() 方法的代码如下，注意结果对应的时间戳是左右流时间戳里较大的那个。

private void collect(T1 left, T2 right, long leftTimestamp, long rightTimestamp) throws Exception {
    final long resultTimestamp = Math.max(leftTimestamp, rightTimestamp);
    collector.setAbsoluteTimestamp(resultTimestamp);
    context.updateTimestamps(leftTimestamp, rightTimestamp, resultTimestamp);
    userFunction.processElement(left, right, context, collector);
}

4 调用 TimerService.registerEventTimeTimer() 注册时间戳为 timestamp + relativeUpperBound 的定时器，该定时器负责在水印超过区间的上界时执行状态的清理逻辑，防止数据堆积。注意左右流的定时器所属的 namespace 是不同的，具体逻辑则位于 onEventTime() 方法中。

复制代码

@Override
public void onEventTime(InternalTimer<K, String> timer) throws Exception {
    long timerTimestamp = timer.getTimestamp();
    String namespace = timer.getNamespace();
    logger.trace("onEventTime @ {}", timerTimestamp);
    switch (namespace) {
        case CLEANUP_NAMESPACE_LEFT: {
            long timestamp = (upperBound <= 0L) ? timerTimestamp : timerTimestamp - upperBound;
            logger.trace("Removing from left buffer @ {}", timestamp);
            leftBuffer.remove(timestamp);
            break;
        }
        case CLEANUP_NAMESPACE_RIGHT: {
            long timestamp = (lowerBound <= 0L) ? timerTimestamp + lowerBound : timerTimestamp;
            logger.trace("Removing from right buffer @ {}", timestamp);
            rightBuffer.remove(timestamp);
            break;
        }
        default:
            throw new RuntimeException("Invalid namespace " + namespace);
    }
}

复制代码

标签：join,示例,timestamp,双流,ourTimestamp,record,new,final
来源： https://blog.51cto.com/u_9928699/2892540