这里写目录标题
- 一、 写入hive 配置
- 1.1 权限报错信息 :
- 1.2 hive 中文件格式
- 1.3 注意区别以下建表语句
- A、构建ORC 格式分区表
- B. 构建默认文件格式分区表
- C.构建非分区表
- 二、dataX 配置hive 分区表导入 配置
- 2.1 检查hive 表分区是否存在
一、 写入hive 配置
dataX 写入hive 不支持直接配置select 语句,必须配置一个json 任务,推荐使用hdfswriter
"writer": {"name": "hdfswriter","parameter": {"defaultFS": "hdfs://hadoop.fancv.com:9000","fileType": "text","path": "/user/hive/warehouse/fancv_center_devdb.db/test_p_text/ct=2020","fileName": "test_p_file","column": [{"name": "id","type": "INT"},{"name": "name","type": "STRING"}],"writeMode": "append","fieldDelimiter": "\u0001","compress":"GZIP"}}
hdfswriter 时直接写文件,所以回遇到权限问题,上一篇文章中 直接粗暴解决 chmod
但系统和内hive表一般时动态增加的,总不能一直靠着命令行解决问题吧,后边研究了Hadoop的权限校验。
发现可以在系统中配置环境变量指定用户,这样就可以在dolphins中通过dataX 直接向hive 中写入数据了
1.1 权限报错信息 :
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=default, access=WRITE, inode="/user/hive/warehouse/sz_center_devdb.db/test_p":anonymous:supergroup:drwxr-xr-xat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:336)at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:241)at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1909)at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1893)at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1852)at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:323)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2635)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2577)at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:532)at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1020)at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:948)at java.security.AccessController.doPrivileged(Native Method)at javax.security.auth.Subject.doAs(Subject.java:422)at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2952)at org.apache.hadoop.ipc.Client.call(Client.java:1476)at org.apache.hadoop.ipc.Client.call(Client.java:1407)at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)at com.sun.proxy.$Proxy10.create(Unknown Source)at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)at java.lang.reflect.Method.invoke(Method.java:498)at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)at com.sun.proxy.$Proxy11.create(Unknown Source)at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1623)... 14 moreat com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:40)at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.textFileStartWrite(HdfsHelper.java:317)at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:360)at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)at java.lang.Thread.run(Thread.java:750)Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=default, access=WRITE, inode="/user/hive/warehouse/sz_center_devdb.db/test_p":anonymous:supergroup:drwxr-xr-xat org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496)at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:336)at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:241)at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1909)at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1893)at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1852)at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:323)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2635)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2577)at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:494)at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:532)at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1020)at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:948)at java.security.AccessController.doPrivileged(Native Method)at javax.security.auth.Subject.doAs(Subject.java:422)at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2952)
这里可以通过设置环境变量的方式解决
详情参考: fancv.com
参考资料:https://www.cnblogs.com/chhyan-dream/p/12929362.html
1.2 hive 中文件格式
hive 文件保存有格式区别,如果在dataX 任务里,没有细心核对hive 文件格式,虽然表面上dataX 任务执行成功,但是hive的文件却被破坏了,导致不可用。
报错信息如下:
org.jkiss.dbeaver.model.exec.DBCException: SQL 错误: java.io.IOException: org.apache.orc.FileFormatException: Malformed ORC file hdfs://shangzhi165:9000/user/hive/warehouse/sz_center_devdb.db/test_p/ct=2020/test_p_file__4f4b257d_b0ad_446c_a3ea_e5ea2e81580c.gz. Invalid postscript length 0at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCResultSetImpl.nextRow(JDBCResultSetImpl.java:183)at org.jkiss.dbeaver.model.impl.jdbc.struct.JDBCTable.readData(JDBCTable.java:195)at org.jkiss.dbeaver.ui.controls.resultset.ResultSetJobDataRead.lambda$0(ResultSetJobDataRead.java:123)at org.jkiss.dbeaver.model.exec.DBExecUtils.tryExecuteRecover(DBExecUtils.java:173)at org.jkiss.dbeaver.ui.controls.resultset.ResultSetJobDataRead.run(ResultSetJobDataRead.java:121)at org.jkiss.dbeaver.ui.controls.resultset.ResultSetViewer$ResultSetDataPumpJob.run(ResultSetViewer.java:5062)at org.jkiss.dbeaver.model.runtime.AbstractJob.run(AbstractJob.java:105)at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.orc.FileFormatException: Malformed ORC file hdfs://shangzhi165:9000/user/hive/warehouse/sz_center_devdb.db/test_p/ct=2020/test_p_file__4f4b257d_b0ad_446c_a3ea_e5ea2e81580c.gz. Invalid postscript length 0at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:300)at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:286)at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:374)at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCResultSetImpl.next(JDBCResultSetImpl.java:272)at org.jkiss.dbeaver.model.impl.jdbc.exec.JDBCResultSetImpl.nextRow(JDBCResultSetImpl.java:180)... 7 more
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.orc.FileFormatException: Malformed ORC file hdfs://shangzhi165:9000/user/hive/warehouse/sz_center_devdb.db/test_p/ct=2020/test_p_file__4f4b257d_b0ad_446c_a3ea_e5ea2e81580c.gz. Invalid postscript length 0at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:465)at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:309)at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:905)at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)at java.lang.reflect.Method.invoke(Method.java:498)at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)at java.security.AccessController.doPrivileged(Native Method)at javax.security.auth.Subject.doAs(Subject.java:422)at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)at com.sun.proxy.$Proxy37.fetchResults(Unknown Source)at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:561)at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:786)at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1837)at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1822)at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: org.apache.orc.FileFormatException: Malformed ORC file hdfs://shangzhi165:9000/user/hive/warehouse/sz_center_devdb.db/test_p/ct=2020/test_p_file__4f4b257d_b0ad_446c_a3ea_e5ea2e81580c.gz. Invalid postscript length 0at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:602)at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:509)at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2691)at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:460)... 24 more
Caused by: java.lang.RuntimeException: org.apache.orc.FileFormatException:Malformed ORC file hdfs://shangzhi165:9000/user/hive/warehouse/sz_center_devdb.db/test_p/ct=2020/test_p_file__4f4b257d_b0ad_446c_a3ea_e5ea2e81580c.gz. Invalid postscript length 0at org.apache.orc.impl.ReaderImpl.ensureOrcFooter(ReaderImpl.java:260)at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:570)at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:368)at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:96)at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1971)at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:776)at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:344)at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:540)... 29 more
以上就是把ORC 的文件当错text 来处理,导致异常。
我们要在配置datax 任务的时候,一定要仔细核对。
1.3 注意区别以下建表语句
A、构建ORC 格式分区表
CREATE TABLE IF NOT EXISTS `test_p` (`id` int COMMENT 'date in file', `name` string COMMENT 'appname' )
COMMENT 'cleared log of origin log'
PARTITIONED BY (`ct` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES ('creator'='c-chenjc', 'crate_time'='2018-06-07')
;
B. 构建默认文件格式分区表
CREATE TABLE IF NOT EXISTS `test_p_text` (`id` int COMMENT 'date in file', `name` string COMMENT 'appname' )
COMMENT 'cleared log of origin log'
PARTITIONED BY (`ct` string
)
C.构建非分区表
CREATE TABLE IF NOT EXISTS `my_test_p` (`id` int COMMENT 'date in file', `name` string COMMENT 'appname' ,`ct` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES ('creator'='c-chenjc', 'crate_time'='2018-06-07')
;
二、dataX 配置hive 分区表导入 配置
dataX 导入hive 分区表必须指定分区,否则的话,任务执行成功,没有返回异常,但是呢,数据没写入。
示例:
{"job": {"setting": {"speed": {"channel": 1}},"content": [{"reader": {"name": "mysqlreader","parameter": {"column": ["id","name"],"password": "xxxxx$xx","username": "xxxx","where": "","connection": [{"jdbcUrl": ["jdbc:mysql://x x xx:3306/web_magic"],"table": ["mysql_test_p"]}]}},"writer": {"name": "hdfswriter","parameter": {"column": [{"name": "id","type": "int"},{"name": "name","type": "string"}],"compress": "","defaultFS": "hdfs://xxxxx:xxx","fieldDelimiter": ",","fileName": "test_p","fileType": "text","path": "/user/hive/warehouse/testjar.db/test_p/ct=shanghai","writeMode": "append"}}}]}
}
注意:你在导入的时候这个分区还必须存在,否则报异常。
2.1 检查hive 表分区是否存在
show partitions test_p;