Samza: Writing to HDFS

Updated: 2018-08-22 18:20

The samza-hdfs module implements a Samza Producer that writes to HDFS. The current implementation includes a ready-to-use HdfsSystemProducer and three HdfsWriters: one writes messages of raw bytes to a SequenceFile of BytesWritable keys and values; another writes UTF-8 Strings to a SequenceFile with LongWritable keys and Text values; the last writes out Avro data files, including automatically reflected schemas for POJO objects.

Configuring the HdfsSystemProducer

You can configure the HdfsSystemProducer like any other Samza system, using configuration keys and values set in a job.properties file. You might configure the system producer for use by your StreamTasks like this:

# set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs-clickstream'
systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

# define a serializer/deserializer for the hdfs-clickstream system
# DO NOT define (i.e. comment out) a SerDe when using the AvroDataFileHdfsWriter so it can reflect the schema
systems.hdfs-clickstream.samza.msg.serde=some-serde-impl

# consumer configs not needed for HDFS system, reader is not implemented yet

# Assign a Metrics implementation via a label we defined earlier in the props file
systems.hdfs-clickstream.streams.metrics.samza.msg.serde=some-metrics-impl

# Assign the implementation class for this system's HdfsWriter
systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.TextSequenceFileHdfsWriter
#systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.AvroDataFileHdfsWriter

# Set compression type supported by chosen Writer. Only BLOCK compression is supported currently
# AvroDataFileHdfsWriter supports snappy, bzip2, deflate or none (null, anything other than the first three)
systems.hdfs-clickstream.producer.hdfs.compression.type=snappy

# The base dir for HDFS output. The default Bucketer for SequenceFile HdfsWriters
# is currently /BASE/JOB_NAME/DATE_PATH/FILES, where BASE is set below
systems.hdfs-clickstream.producer.hdfs.base.output.dir=/user/me/analytics/clickstream_data

# Assign the implementation class for the HdfsWriter's Bucketer
systems.hdfs-clickstream.producer.hdfs.bucketer.class=org.apache.samza.system.hdfs.writer.JobNameDateTimeBucketer

# Configure the DATE_PATH the Bucketer will set to bucket output files by day for this job run.
systems.hdfs-clickstream.producer.hdfs.bucketer.date.path.format=yyyy_MM_dd

# Optionally set the max output bytes (records for AvroDataFileHdfsWriter) per file.
# A new file will be cut and output continued on the next write call each time this many bytes
# (records for AvroDataFileHdfsWriter) are written.
systems.hdfs-clickstream.producer.hdfs.write.batch.size.bytes=134217728
#systems.hdfs-clickstream.producer.hdfs.write.batch.size.records=10000 

The above configuration assumes that Metrics and Serde implementations have been properly configured against the some-serde-impl and some-metrics-impl labels elsewhere in the same job.properties file. Each of these properties has a reasonable default, so you can omit any that you don't need to customize for your job run.
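
For illustration, the two serde labels referenced above might be registered elsewhere in the same job.properties along these lines. This is a minimal sketch: the factory classes shown are examples chosen to match the TextSequenceFileHdfsWriter above (a plain String payload serde and Samza's metrics-snapshot serde), not the only valid choices.

# Hypothetical registrations for the serde labels referenced by the hdfs-clickstream system
serializers.registry.some-serde-impl.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.some-metrics-impl.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory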
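
As a usage sketch, a StreamTask can write to the system configured above simply by sending to its system name. The class name and stream name below are hypothetical, and the message is forwarded unchanged on the assumption that the configured HdfsWriter and serde handle serialization.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ClickstreamArchiverTask implements StreamTask {
  // "hdfs-clickstream" matches the system name configured above;
  // the stream name is a placeholder for this sketch.
  private static final SystemStream HDFS_OUTPUT =
      new SystemStream("hdfs-clickstream", "clickstream-events");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Forward the incoming message as-is; the configured HdfsWriter
    // decides how it ends up in SequenceFiles or Avro data files.
    collector.send(new OutgoingMessageEnvelope(HDFS_OUTPUT, envelope.getMessage()));
  }
}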
