====== Troubleshooting Common Problems ======
===== OneHotEncoder error inside DataFrameMapper =====
==== Example code ====
# ############################################################
# File: 02-preprocess-04-OneHotEncoder
# Author: jinlong.hao
# Date: 2019-12-04
# OneHotEncoder: encodes categorical data as discrete dummy variables
# sklearn 0.19 and earlier accept only integer data; 0.20 and later also accept string data
# Desc:
# 1. import statements
# 2. build the sample data
# 3. transform with DataFrameMapper
# 4. transform with DataFrameMapper combined with OneHotEncoder
# ############################################################
# 1. import statements
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer
from sklearn_pandas import DataFrameMapper
import numpy as np
import pandas as pd
from sklearn2pmml.preprocessing import CutTransformer

# 2. Load the sample data
df = pd.DataFrame({
    'age': [3, 3, 7, 4, 2, 4],
    'salary': [36, 39, 17, 82, 42, 10],
    'name': ['james', 'jucy', 'jessica', 'tony', 'steve', 'jam']
})

# 3. Use the encoders with DataFrameMapper, one transformer per column
dataFrameMapper = DataFrameMapper([
    (['age'], OneHotEncoder(handle_unknown='ignore')),
    (['age'], LabelBinarizer()),
    (['name'], OneHotEncoder(handle_unknown='ignore'))
], df_out=True)
dataFrameMapper.fit_transform(df)

# 4. OneHotEncoder cannot be chained after another transformer in DataFrameMapper; this raises an error
dataFrameMapper2 = DataFrameMapper([
    (['age'], [
        CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']),
        OneHotEncoder()
    ])  # raises ValueError; use LabelBinarizer() instead
], df_out=True)
dataFrameMapper2.fit_transform(df)
==== Error message ====
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array
(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite,
ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
519 "Reshape your data either using array.reshape(-1, 1) if "
520 "your data has a single feature or array.reshape(1, -1) "
--> 521 "if it contains a single sample.".format(array))
522
523 # in the future np.flexible dtypes will be handled like object dtypes
ValueError: ['age']: Expected 2D array, got 1D array instead:
array=['1' '1' '3' '2' '1' '2' '2' '3' '1'].
Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
==== Solution ====
When several transformers are chained for one column in DataFrameMapper, do not use OneHotEncoder; use LabelBinarizer instead. The modified code is:
# 4. OneHotEncoder cannot be chained in DataFrameMapper; use LabelBinarizer instead
dataFrameMapper2 = DataFrameMapper([
    # (['age'], [
    #     CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']),
    #     OneHotEncoder()]
    # ),  # raises an error; use LabelBinarizer() instead
    ('age', [
        CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']),
        LabelBinarizer()
    ])
], df_out=True)
dataFrameMapper2.fit_transform(df)
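The error message indicates that OneHotEncoder received a 1-D array. The following is a minimal sketch of that behaviour using scikit-learn alone, assuming the binning step hands a 1-D array of labels to the next transformer in the chain. The array `labels_1d` is made up to resemble that output and is not taken from the mapper itself.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# 1-D array of bin labels, shaped like the array printed in the error message
labels_1d = np.array(['1', '1', '3', '2', '1', '2'])

# OneHotEncoder validates its input as 2-D and rejects the 1-D array
try:
    OneHotEncoder().fit_transform(labels_1d)
    one_hot_error = None
except ValueError as exc:
    one_hot_error = str(exc)

# LabelBinarizer is designed for 1-D label arrays and returns a dense 0/1 matrix
binarized = LabelBinarizer().fit_transform(labels_1d)

# Reshaping to a single column also satisfies OneHotEncoder outside a mapper chain
one_hot = OneHotEncoder().fit_transform(labels_1d.reshape(-1, 1)).toarray()
```

Both encodings produce the same 6 x 3 indicator matrix here; inside a DataFrameMapper chain, LabelBinarizer simply avoids the need for the reshape.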
===== net.razorvine.pickle.PickleException when generating PMML with CutTransformer =====
==== Example code ====
# #########################################
# file: 09-sample-02-basic-01
# author: jinlong.hao
# date: 2019-12-06
# desc: basic verification code for sklearn2pmml
# content:
# ########################################
# 1. import
import sklearn
import sklearn.impute
import sklearn.ensemble
import sklearn.linear_model
import sklearn.preprocessing  # provides LabelEncoder and LabelBinarizer used below
import sklearn2pmml
import sklearn2pmml.preprocessing
from sklearn2pmml.preprocessing import ReplaceTransformer
import sklearn_pandas
import pandas as pd
import numpy as np

# 2. Load the data
# 2.1 Load the feature data
train_x = pd.DataFrame({
    'phone_brand': ['Huawei', 'Huawei', 'Apple', 'Apple', '360', '8848', np.nan],
    'phone_price': [2403, 1123, 4823, 2223, np.nan, 1583, 2222]
})
train_x = pd.DataFrame(data=train_x.values, columns=train_x.columns)
# 2.2 train_y
train_y = pd.DataFrame({
    'result': [0, 1, 0, 1, 0, 0, 1]
})
train_y = pd.DataFrame(data=train_y.values, columns=train_y.columns)

# 3. Build the DataFrameMapper preprocessing step
dataFrameMapper = sklearn_pandas.DataFrameMapper([
    (['phone_brand'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value='others'),
        ReplaceTransformer(pattern='^(?!Huawei|Apple).*', replacement='others'),
        sklearn.preprocessing.LabelEncoder(),
        sklearn.preprocessing.LabelBinarizer()
    ]),
    (['phone_price'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value=0),
        sklearn2pmml.preprocessing.CutTransformer([-1, 1000, 10000]),
        sklearn.preprocessing.LabelEncoder(),
        sklearn.preprocessing.LabelBinarizer()
    ])
])
dataFrameMapper.fit_transform(train_x)

# 4. Train the model
# 4.1 Build the logistic regression classifier
logistic_classifier = sklearn.linear_model.LogisticRegression()
# 4.2 Build the logistic regression pipeline
logistic_pipeline = sklearn2pmml.PMMLPipeline([
    ('mapper', dataFrameMapper),
    ('classifier', logistic_classifier)
])
# 4.3 Run the training
logistic_pipeline.fit(train_x, train_y)

# 5. Export the model to PMML
sklearn2pmml.sklearn2pmml(logistic_pipeline, '02-basic-01.pmml')
==== Error message ====
Standard output is empty
Standard error:
十二月 09, 2019 5:32:59 下午 org.jpmml.sklearn.Main run
信息: Parsing PKL..
十二月 09, 2019 5:32:59 下午 org.jpmml.sklearn.Main run
严重: Failed to parse PKL
net.razorvine.pickle.PickleException: expected zero arguments for construction
of ClassDict (for pandas._libs.interval.Interval)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
at numpy.core.NDArrayUtil.readObject(NDArrayUtil.java:378)
at numpy.core.TypeDescriptor.read(TypeDescriptor.java:163)
at numpy.core.NDArrayUtil.parseArray(NDArrayUtil.java:214)
at numpy.core.NDArrayUtil.parseData(NDArrayUtil.java:189)
at joblib.NumpyArrayWrapper.toArray(NumpyArrayWrapper.java:43)
at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:88)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
at org.jpmml.sklearn.Main.run(Main.java:104)
at org.jpmml.sklearn.Main.main(Main.java:94)
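==== Solution ====
Specify the labels parameter when using CutTransformer. Judging from the exception above, bins without labels are stored as pandas Interval objects, which the PKL parser cannot construct. The modified code is: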
# 3. Build the DataFrameMapper preprocessing step
dataFrameMapper = sklearn_pandas.DataFrameMapper([
    (['phone_brand'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value='others'),
        ReplaceTransformer(pattern='^(?!Huawei|Apple).*', replacement='others'),
        sklearn.preprocessing.LabelEncoder(),
        sklearn.preprocessing.LabelBinarizer()
    ]),
    (['phone_price'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value=0),
        sklearn2pmml.preprocessing.CutTransformer(bins=[-1, 1000, 10000],
                                                  labels=['1', '2']),
        # sklearn.preprocessing.LabelEncoder(),  # no longer needed once labels are specified
        sklearn.preprocessing.LabelBinarizer()
    ])
])
dataFrameMapper.fit_transform(train_x)
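The effect of the labels parameter can be sketched with pandas alone, on the assumption that CutTransformer delegates the binning to `pandas.cut` (the `pandas._libs.interval.Interval` class named in the exception points that way). The `prices` list below is illustrative and is not the training data.

```python
import pandas as pd

prices = [2403, 0, 1583, 500]
bins = [-1, 1000, 10000]

# Without labels, every bin is represented by a pandas Interval object
default_bins = pd.cut(prices, bins=bins)
interval_categories = list(default_bins.categories)

# With explicit labels, the categories are plain strings
labelled_bins = pd.cut(prices, bins=bins, labels=['1', '2'])
string_categories = list(labelled_bins.categories)
```

Plain string categories pickle as ordinary Python strings, which is presumably why the PKL parser no longer fails once labels are given.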
===== InvalidResultException when publishing a model =====
==== Symptom ====
When the model is published, the UI shows the following error: Field "XXXXXX" cannot accept user input value NaN.
The backend log reports:
Servlet.service() for servlet [dispatcherServlet] in context with path [/register] threw exception [Request processing failed; nested exception is org.jpmml.evaluator.InvalidResultException: Field "近一个月访问url个数" cannot accept user input value NaN] with root cause
org.jpmml.evaluator.InvalidResultException: Field "近一个月访问url个数" cannot accept user input value NaN
at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:221) ~[pmml-evaluator-1.4.11.jar!/:na]
at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:137) ~[pmml-evaluator-1.4.11.jar!/:na]
at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111) ~[pmml-evaluator-1.4.11.jar!/:na]
at org.jpmml.evaluator.InputField.prepare(InputField.java:70) ~[pmml-evaluator-1.4.11.jar!/:na]
at org.jpmml.evaluator.ModelEvaluator.verify(ModelEvaluator.java:475) ~[pmml-evaluator-1.4.11.jar!/:na]
at org.jpmml.evaluator.ModelEvaluator.verify(ModelEvaluator.java:78) ~[pmml-evaluator-1.4.11.jar!/:na]
at cn.eppdev.mlib.basic.util.EppdevMlibPmmlUtils.getEvaluatorByContent(EppdevMlibPmmlUtils.java:105) ~[eppdev-mlib-basic-utils-1.0.0.jar!/:1.0.0]
at cn.eppdev.mlib.basic.util.EppdevMlibPmmlUtils.valid(EppdevMlibPmmlUtils.java:42) ~[eppdev-mlib-basic-utils-1.0.0.jar!/:1.0.0]
at cn.eppdev.mlib.register.admin.service.AdminModelService.insert(AdminModelService.java:88) ~[classes!/:1.0.0]
at cn.eppdev.mlib.register.admin.service.AdminModelService$$FastClassBySpringCGLIB$$4dcc733.invoke() ~[classes!/:1.0.0]
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:749) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:294) ~[spring-tx-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98) ~[spring-tx-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at cn.eppdev.mlib.register.admin.service.AdminModelService$$EnhancerBySpringCGLIB$$c913079a.insert() ~[classes!/:1.0.0]
at cn.eppdev.mlib.register.admin.web.AdminModelController.doAddModel(AdminModelController.java:126) ~[classes!/:1.0.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_232-ea]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_232-ea]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_232-ea]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_232-ea]
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:189) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:800) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1038) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1005) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:908) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:660) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:882) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:741) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) ~[tomcat-embed-websocket-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:92) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1417) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232-ea]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232-ea]
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232-ea]
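==== Root cause ====
This is a jpmml bug. When the model is saved as a PMML file, the verification data that is stored in the PMML automatically
writes missing values as NaN, while the PMML declares the field as an int field, so the self-check run while publishing the PMML fails.
Sample verification data: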
| 懂车帝DCC订单 | 女 | 20-25 | 中 | 欧珀 | NaN | NaN | 0.12311307178035973 | 0.8768869282196402 |
| 懂车帝DCC订单 | 男 | 20-25 | 中 | 苹果 | 0 | 0 | 0.11877290234764215 | 0.8812270976523577 |
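As a rough analogy only (in Python, not the Java evaluator, whose internals are not shown here), the text NaN parses as a floating-point number but not as an integer. That is the kind of mismatch an int field runs into when its verification value is NaN. The helper `parses_as` is a hypothetical name used just for this illustration.

```python
def parses_as(value_text, converter):
    """Return True if converter accepts value_text, False if it raises ValueError."""
    try:
        converter(value_text)
        return True
    except ValueError:
        return False


# 'NaN' is a valid floating-point literal but not a valid integer literal
nan_is_float = parses_as('NaN', float)
nan_is_int = parses_as('NaN', int)

# An ordinary value such as '0' is accepted as an integer
zero_is_int = parses_as('0', int)
```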
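==== Solution ====
There are three possible fixes:
- Manually delete the NaN entries from the verification data; either delete only that field's value or delete the whole record.
- Change the PMML export so that the PMML no longer contains verification data.
- During modelling, fill the missing values first (using the DataFrame.fillna method), but still keep the missing-value imputation step in the preprocessing pipeline so that real scoring does not fail.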
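For the first two options, the verification data can also be removed from an already exported file. The following is a minimal sketch using only the Python standard library, assuming the verification data sits in a ModelVerification element as defined by the PMML schema. The function names `local_name` and `strip_model_verification` and the simplified PMML fragment are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element tag without its '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def strip_model_verification(pmml_text):
    """Return the PMML document with every ModelVerification element removed."""
    root = ET.fromstring(pmml_text)
    # Keep the PMML namespace as the default one so the output has no 'ns0:' prefixes
    if root.tag.startswith('{'):
        ET.register_namespace('', root.tag[1:].split('}', 1)[0])
    # Snapshot the elements first so removals do not disturb the traversal
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == 'ModelVerification':
                parent.remove(child)
    return ET.tostring(root, encoding='unicode')


# Tiny hand-written, simplified PMML fragment used only for illustration
sample_pmml = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <RegressionModel functionName="classification">
    <ModelVerification recordCount="1">
      <InlineTable><row><x>NaN</x></row></InlineTable>
    </ModelVerification>
  </RegressionModel>
</PMML>"""

cleaned_pmml = strip_model_verification(sample_pmml)
```

If the verification data was embedded by calling the pipeline's `verify` method before export, leaving that call out is presumably the simpler route; the sketch above is for files that have already been generated.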