Troubleshooting Common Problems

OneHotEncoder Fails Inside a DataFrameMapper

Example code

# ############################################################
# File: 02-preprocess-04-OneHotEncoder
# Author: jinlong.hao
# Date: 2019-12-04
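# OneHotEncoder: encodes categorical data as dummy (one-hot) variables
#     sklearn 0.19 and earlier only supports integer data; 0.20 and later also supports string data
# Desc: 
#    1. Imports
#    2. Build the sample data
#    3. Transform with DataFrameMapper
#    4. Transform with DataFrameMapper combined with OneHotEncoder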
# ############################################################
 
# 1. Imports
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer
from sklearn_pandas import DataFrameMapper
import numpy as np
import pandas as pd
from sklearn2pmml.preprocessing import CutTransformer
 
# 2. Build the sample data
df = pd.DataFrame({
    'age': [3, 3, 7, 4, 2, 4],
    'salary': [36, 39, 17, 82, 42, 10],
    'name': ['james', 'jucy', 'jessica', 'tony', 'steve', 'jam']
})
 
# 3. Use the encoders together with DataFrameMapper
dataFrameMapper = DataFrameMapper([
    (['age'], OneHotEncoder(handle_unknown='ignore')),
    (['age'], LabelBinarizer()),
    (['name'], OneHotEncoder(handle_unknown='ignore'))
], df_out=True)
dataFrameMapper.fit_transform(df)
 
# 4. OneHotEncoder cannot be chained after another transformer in DataFrameMapper; this raises an error
dataFrameMapper2 = DataFrameMapper([
    (['age'], [
        CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']), 
        OneHotEncoder()]
    )  # raises an error; use LabelBinarizer() instead
], df_out=True)
 
dataFrameMapper2.fit_transform(df)

Error message

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array
(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, 
ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    519  "Reshape your data either using array.reshape(-1, 1) if "
    520  "your data has a single feature or array.reshape(1, -1) "
--> 521  "if it contains a single sample.".format(array))
    522 
    523  # in the future np.flexible dtypes will be handled like object dtypes

ValueError: ['age']: Expected 2D array, got 1D array instead:
array=['1' '1' '3' '2' '1' '2' '2' '3' '1'].
Reshape your data either using array.reshape(-1, 1) if your data has a single 
feature or array.reshape(1, -1) if it contains a single sample.

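Solution

When several transformers are chained for one column in DataFrameMapper, do not use OneHotEncoder; use LabelBinarizer instead. The traceback shows why: OneHotEncoder expects a 2D array but receives a 1D array from the preceding step, whereas LabelBinarizer accepts 1D input. Note that the column selector also changes from ['age'] to 'age'. The modified code is as follows: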

# 4. OneHotEncoder cannot be chained in DataFrameMapper; use LabelBinarizer instead
dataFrameMapper2 = DataFrameMapper([
    #(['age'], [
    #    CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']), 
    #    OneHotEncoder()]
    #),  # raises an error; use LabelBinarizer() instead
    ('age', [
        CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']), 
        LabelBinarizer()]
    ) 
], df_out=True)
 
dataFrameMapper2.fit_transform(df)
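
The difference between the two encoders can be reproduced without DataFrameMapper or CutTransformer. The following is a minimal sketch using only scikit-learn and NumPy; the `labels` array is illustrative data shaped like the 1D array in the traceback, and the `reshape(-1, 1)` step is shown only to illustrate the input-shape requirement, not as a tested fix inside DataFrameMapper.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# 1D array of bin labels, shaped like the array that reached OneHotEncoder
# in the traceback (the values here are illustrative).
labels = np.array(['1', '1', '3', '2', '1', '2'])

# OneHotEncoder validates its input as a 2D array and rejects 1D input.
try:
    OneHotEncoder().fit_transform(labels)
    one_hot_error = None
except ValueError as exc:
    one_hot_error = str(exc)

# LabelBinarizer is designed for 1D label arrays, so the same data works.
binarized = LabelBinarizer().fit_transform(labels)

# Reshaping to a single column makes OneHotEncoder accept the data as well.
encoded = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()

print(one_hot_error)
print(binarized.shape, encoded.shape)
```

One caveat when swapping the encoders: with exactly two classes, LabelBinarizer produces a single 0/1 column rather than two columns, so the output width can differ from what OneHotEncoder would give.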

net.razorvine.pickle.PickleException When Generating PMML With CutTransformer

Example code

# #########################################
# file: 09-sample-02-basic-01
# author: jinlong.hao
# date: 2019-12-06
# desc: basic sklearn2pmml validation code
# content: 
# ########################################
 
# 1. import
import sklearn
import sklearn.impute
import sklearn.ensemble
import sklearn.linear_model
import sklearn.preprocessing
import sklearn2pmml
import sklearn2pmml.preprocessing
from sklearn2pmml.preprocessing import ReplaceTransformer
import sklearn_pandas
import pandas as pd
import numpy as np
 
# 2. Load the data
# 2.1 Load the feature data
train_x = pd.DataFrame({
    'phone_brand': ['Huawei', 'Huawei', 'Apple', 'Apple', '360', '8848', np.nan],
    'phone_price': [2403, 1123,  4823, 2223, np.nan, 1583, 2222]
})
train_x = pd.DataFrame(data=train_x.values, columns=train_x.columns)
 
# 2.2 train_y
train_y = pd.DataFrame({
    'result': [0, 1, 0, 1, 0, 0, 1]
})
train_y=pd.DataFrame(data=train_y.values, columns=train_y.columns)
 
# 3. Build the DataFrameMapper preprocessing step
dataFrameMapper = sklearn_pandas.DataFrameMapper([
    (['phone_brand'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value='others')
        ,ReplaceTransformer(pattern='^(?!Huawei|Apple).*', replacement='others')
        ,sklearn.preprocessing.LabelEncoder()
        ,sklearn.preprocessing.LabelBinarizer()
    ]),
    (['phone_price'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value=0)
        ,sklearn2pmml.preprocessing.CutTransformer([-1, 1000, 10000])
        ,sklearn.preprocessing.LabelEncoder()
        ,sklearn.preprocessing.LabelBinarizer()
    ])
])
dataFrameMapper.fit_transform(train_x)
 
# 4. Train the model
# 4.1 Build the logistic regression classifier
logistic_classifier = sklearn.linear_model.LogisticRegression()
 
# 4.2 Build the logistic regression pipeline
logistic_pipeline = sklearn2pmml.PMMLPipeline([
    ('mapper', dataFrameMapper),
    ('classifier', logistic_classifier)
]) 
# 4.3 Run the training
logistic_pipeline.fit(train_x, train_y)
 
# 5. Export the model to PMML
sklearn2pmml.sklearn2pmml(logistic_pipeline, '02-basic-01.pmml')

Error message

Standard output is empty
Standard error:
Dec 09, 2019 5:32:59 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Dec 09, 2019 5:32:59 PM org.jpmml.sklearn.Main run
SEVERE: Failed to parse PKL
net.razorvine.pickle.PickleException: expected zero arguments for construction 
of ClassDict (for pandas._libs.interval.Interval)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
	at numpy.core.NDArrayUtil.readObject(NDArrayUtil.java:378)
	at numpy.core.TypeDescriptor.read(TypeDescriptor.java:163)
	at numpy.core.NDArrayUtil.parseArray(NDArrayUtil.java:214)
	at numpy.core.NDArrayUtil.parseData(NDArrayUtil.java:189)
	at joblib.NumpyArrayWrapper.toArray(NumpyArrayWrapper.java:43)
	at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:88)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
	at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
	at org.jpmml.sklearn.Main.run(Main.java:104)
	at org.jpmml.sklearn.Main.main(Main.java:94)

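Solution

Specify the labels parameter when using CutTransformer. The error message names pandas._libs.interval.Interval, which indicates that the pickled transformer contains pandas Interval objects that the converter cannot read; explicit labels avoid them. The modified code is as follows: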

# 3. Build the DataFrameMapper preprocessing step
dataFrameMapper = sklearn_pandas.DataFrameMapper([
    (['phone_brand'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value='others')
        ,ReplaceTransformer(pattern='^(?!Huawei|Apple).*', replacement='others')
        ,sklearn.preprocessing.LabelEncoder()
        ,sklearn.preprocessing.LabelBinarizer()
    ]),
    (['phone_price'], [
        sklearn.impute.SimpleImputer(strategy='constant', fill_value=0)
        ,sklearn2pmml.preprocessing.CutTransformer(bins=[-1, 1000, 10000],
                                                   labels=['1', '2'])
        # ,sklearn.preprocessing.LabelEncoder()  # no longer needed once labels is specified
        ,sklearn.preprocessing.LabelBinarizer()
    ])
])
dataFrameMapper.fit_transform(train_x)
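
The effect of the labels parameter can be seen with pandas alone. The sketch below assumes that CutTransformer delegates binning to pandas.cut, which is consistent with the Interval class named in the error message but is not verified here; the `prices` values are illustrative.

```python
import pandas as pd

prices = pd.Series([2403, 1123, 0, 1583])
bins = [-1, 1000, 10000]

# Without labels, each bin is represented by a pandas Interval object.
without_labels = pd.cut(prices, bins=bins)

# With labels, each bin is represented by the plain string that was supplied.
with_labels = pd.cut(prices, bins=bins, labels=['1', '2'])

print(type(without_labels.iloc[0]).__name__)
print(with_labels.tolist())
```

Plain string labels are ordinary values for any later encoder, which is also why the separate LabelEncoder step becomes unnecessary in the modified pipeline.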

InvalidResultException When Publishing a Model

Symptom

When the model is published, the UI shows the following error: Field "XXXXXX" cannot accept user input value NaN. The backend log reports:

Servlet.service() for servlet [dispatcherServlet] in context with path [/register] threw exception [Request processing failed; nested exception is org.jpmml.evaluator.InvalidResultException: Field "近一个月访问url个数" cannot accept user input value NaN] with root cause

org.jpmml.evaluator.InvalidResultException: Field "近一个月访问url个数" cannot accept user input value NaN
	at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:221) ~[pmml-evaluator-1.4.11.jar!/:na]
	at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:137) ~[pmml-evaluator-1.4.11.jar!/:na]
	at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111) ~[pmml-evaluator-1.4.11.jar!/:na]
	at org.jpmml.evaluator.InputField.prepare(InputField.java:70) ~[pmml-evaluator-1.4.11.jar!/:na]
	at org.jpmml.evaluator.ModelEvaluator.verify(ModelEvaluator.java:475) ~[pmml-evaluator-1.4.11.jar!/:na]
	at org.jpmml.evaluator.ModelEvaluator.verify(ModelEvaluator.java:78) ~[pmml-evaluator-1.4.11.jar!/:na]
	at cn.eppdev.mlib.basic.util.EppdevMlibPmmlUtils.getEvaluatorByContent(EppdevMlibPmmlUtils.java:105) ~[eppdev-mlib-basic-utils-1.0.0.jar!/:1.0.0]
	at cn.eppdev.mlib.basic.util.EppdevMlibPmmlUtils.valid(EppdevMlibPmmlUtils.java:42) ~[eppdev-mlib-basic-utils-1.0.0.jar!/:1.0.0]
	at cn.eppdev.mlib.register.admin.service.AdminModelService.insert(AdminModelService.java:88) ~[classes!/:1.0.0]
	at cn.eppdev.mlib.register.admin.service.AdminModelService$$FastClassBySpringCGLIB$$4dcc733.invoke(<generated>) ~[classes!/:1.0.0]
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:749) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:294) ~[spring-tx-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98) ~[spring-tx-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at cn.eppdev.mlib.register.admin.service.AdminModelService$$EnhancerBySpringCGLIB$$c913079a.insert(<generated>) ~[classes!/:1.0.0]
	at cn.eppdev.mlib.register.admin.web.AdminModelController.doAddModel(AdminModelController.java:126) ~[classes!/:1.0.0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_232-ea]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_232-ea]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_232-ea]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_232-ea]
	at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:189) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:800) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1038) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1005) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:908) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:660) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:882) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:741) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) ~[tomcat-embed-websocket-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:92) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1417) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232-ea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232-ea]
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-embed-core-9.0.14.jar!/:9.0.14]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232-ea]

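Root cause analysis

This is a jpmml bug. When the model's PMML file is saved, the verification data that is automatically stored in the PMML writes missing values as the literal NaN, while the PMML declares the field as an integer field. The self-check that runs when the PMML is published therefore fails. Sample verification data: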

	<row>
		<data:来源四级>懂车帝DCC订单</data:来源四级>
		<data:性别></data:性别>
		<data:年龄>20-25</data:年龄>
		<data:消费能力></data:消费能力>
		<data:终端品牌>欧珀</data:终端品牌>
		<data:近一周访问url天数>NaN</data:近一周访问url天数>
		<data:近一个月访问url个数>NaN</data:近一个月访问url个数>
		<data:probability_false>0.12311307178035973</data:probability_false>
		<data:probability_true>0.8768869282196402</data:probability_true>
	</row>
	<row>
		<data:来源四级>懂车帝DCC订单</data:来源四级>
		<data:性别></data:性别>
		<data:年龄>20-25</data:年龄>
		<data:消费能力></data:消费能力>
		<data:终端品牌>苹果</data:终端品牌>
		<data:近一周访问url天数>0</data:近一周访问url天数>
		<data:近一个月访问url个数>0</data:近一个月访问url个数>
		<data:probability_false>0.11877290234764215</data:probability_false>
		<data:probability_true>0.8812270976523577</data:probability_true>
	</row>

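Solution

There are three possible solutions:

  1. Manually remove the NaN entries from the verification data. You can delete only the affected field's value, or delete the entire row.
  2. Adjust the PMML export so that the PMML no longer contains verification data.
  3. During modeling, fill the missing values up front (using the dataframe.fillna method), but keep the missing-value imputation step in the preprocessing pipeline so that real scoring does not fail.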
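
The first option can be scripted instead of edited by hand. The following is a minimal sketch using only the Python standard library. The `SAMPLE` document, its namespace URI, and its field names are assumptions made for illustration, and `local_name` and `drop_nan_rows` are hypothetical helper names; the sketch deletes whole rows and does not update any record count the enclosing element may carry.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the verification table; the namespace URI and the field
# names are assumptions made for this sketch, not copied from a real file.
SAMPLE = """<InlineTable xmlns:data="http://jpmml.org/jpmml-model/InlineTable">
  <row><data:visits>NaN</data:visits><data:brand>A</data:brand></row>
  <row><data:visits>0</data:visits><data:brand>B</data:brand></row>
</InlineTable>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def drop_nan_rows(root):
    """Remove every <row> element that has a cell whose text is 'NaN'."""
    removed = 0
    # Materialize the iterator first because the tree is modified in the loop.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) != 'row':
                continue
            if any((cell.text or '').strip() == 'NaN' for cell in child):
                parent.remove(child)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = drop_nan_rows(root)
remaining_rows = [el for el in root.iter() if local_name(el.tag) == 'row']
print(removed_count, len(remaining_rows))
```

For a real file, the same function could be applied to the root returned by `ET.parse(path).getroot()` and the tree written back out. ElementTree rewrites namespace prefixes on output unless they are registered with `ET.register_namespace`, so check the result before publishing it.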
model/errors.txt · Last modified: 2020/07/12 12:07 (external edit)