====== Troubleshooting Common Problems ======

===== Error when using OneHotEncoder in DataFrameMapper =====

==== Sample code ====

  # ############################################################
  # File: 02-preprocess-04-OneHotEncoder
  # Author: jinlong.hao
  # Date: 2019-12-04
  # OneHotEncoder: discretizes the data into dummy variables
  #   sklearn 0.19 and earlier support only integer data; 0.20 and later also support string data
  # Desc:
  #   1. import statements
  #   2. Build the data
  #   3. Transform with DataFrameMapper
  #   4. Transform with DataFrameMapper combined with OneHotEncoder
  # ############################################################

  # 1. import statements
  from sklearn.preprocessing import OneHotEncoder, LabelBinarizer
  from sklearn_pandas import DataFrameMapper
  import numpy as np
  import pandas as pd
  from sklearn2pmml.preprocessing import CutTransformer

  # 2. Load the sample data
  df = pd.DataFrame({
      'age': [3, 3, 7, 4, 2, 4],
      'salary': [36, 39, 17, 82, 42, 10],
      'name': ['james', 'jucy', 'jessica', 'tony', 'steve', 'jam']
  })

  # 3. Use the encoders together with DataFrameMapper
  dataFrameMapper = DataFrameMapper([
      (['age'], OneHotEncoder(handle_unknown='ignore')),
      (['age'], LabelBinarizer()),
      (['name'], OneHotEncoder(handle_unknown='ignore'))
  ], df_out=True)
  dataFrameMapper.fit_transform(df)

  # 4. OneHotEncoder cannot be chained after another transformer in DataFrameMapper; this raises an error
  dataFrameMapper2 = DataFrameMapper([
      (['age'], [
          CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']),
          OneHotEncoder()]
      )  # raises an error; use LabelBinarizer() instead
  ], df_out=True)
  dataFrameMapper2.fit_transform(df)

==== Error message ====

  /usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array (array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
      519 "Reshape your data either using array.reshape(-1, 1) if "
      520 "your data has a single feature or array.reshape(1, -1) "
  --> 521 "if it contains a single sample.".format(array))
      522
      523 # in the future np.flexible dtypes will be handled like object dtypes

  ValueError: ['age']: Expected 2D array, got 1D array instead:
  array=['1' '1' '3' '2' '1' '2' '2' '3' '1'].
  Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

==== Solution ====

When several transformers are chained in a DataFrameMapper, do not use OneHotEncoder; use LabelBinarizer instead. As the error message shows, OneHotEncoder expects a 2D array but receives the 1D output of the preceding CutTransformer step, whereas LabelBinarizer accepts 1D input. The modified code is as follows:

  # 4. OneHotEncoder cannot be chained in DataFrameMapper; use LabelBinarizer instead
  dataFrameMapper2 = DataFrameMapper([
      # (['age'], [
      #     CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']),
      #     OneHotEncoder()]
      # ),  # raises an error; use LabelBinarizer() instead
      ('age', [
          CutTransformer(bins=[0, 3, 5, 200], labels=['1', '2', '3']),
          LabelBinarizer()]
      )
  ], df_out=True)
  dataFrameMapper2.fit_transform(df)

===== net.razorvine.pickle.PickleException when generating PMML with CutTransformer =====

==== Sample code ====

  # #########################################
  # file: 09-sample-02-basic-01
  # author: jinlong.hao
  # date: 2019-12-06
  # desc: basic sklearn2pmml verification code
  # content:
  # ########################################

  # 1. import
  import sklearn
  import sklearn.impute
  import sklearn.ensemble
  import sklearn.linear_model
  import sklearn.preprocessing
  import sklearn2pmml
  import sklearn2pmml.preprocessing
  from sklearn2pmml.preprocessing import ReplaceTransformer
  import sklearn_pandas
  import pandas as pd
  import numpy as np

  # 2. Load the data
  # 2.1 Load the feature data
  train_x = pd.DataFrame({
      'phone_brand': ['Huawei', 'Huawei', 'Apple', 'Apple', '360', '8848', np.nan],
      'phone_price': [2403, 1123, 4823, 2223, np.nan, 1583, 2222]
  })
  train_x = pd.DataFrame(data=train_x.values, columns=train_x.columns)

  # 2.2 train_y
  train_y = pd.DataFrame({
      'result': [0, 1, 0, 1, 0, 0, 1]
  })
  train_y = pd.DataFrame(data=train_y.values, columns=train_y.columns)

  # 3. Build the DataFrameMapper preprocessing pipeline
  dataFrameMapper = sklearn_pandas.DataFrameMapper([
      (['phone_brand'], [
          sklearn.impute.SimpleImputer(strategy='constant', fill_value='others'),
          ReplaceTransformer(pattern='^(?!Huawei|Apple).*', replacement='others'),
          sklearn.preprocessing.LabelEncoder(),
          sklearn.preprocessing.LabelBinarizer()
      ]),
      (['phone_price'], [
          sklearn.impute.SimpleImputer(strategy='constant', fill_value=0),
          sklearn2pmml.preprocessing.CutTransformer([-1, 1000, 10000]),
          sklearn.preprocessing.LabelEncoder(),
          sklearn.preprocessing.LabelBinarizer()
      ])
  ])
  dataFrameMapper.fit_transform(train_x)

  # 4. Train the model
  # 4.1 Build the logistic regression classifier
  logistic_classifier = sklearn.linear_model.LogisticRegression()

  # 4.2 Build the logistic regression pipeline
  logistic_pipeline = sklearn2pmml.PMMLPipeline([
      ('mapper', dataFrameMapper),
      ('classifier', logistic_classifier)
  ])

  # 4.3 Run the training
  logistic_pipeline.fit(train_x, train_y)

  # 5. Export the model as PMML
  sklearn2pmml.sklearn2pmml(logistic_pipeline, '02-basic-01.pmml')

==== Error message ====

  Standard output is empty
  Standard error:
  Dec 09, 2019 5:32:59 PM org.jpmml.sklearn.Main run
  INFO: Parsing PKL..
  Dec 09, 2019 5:32:59 PM org.jpmml.sklearn.Main run
  SEVERE: Failed to parse PKL
  net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pandas._libs.interval.Interval)
      at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
      at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
      at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213)
      at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
      at numpy.core.NDArrayUtil.readObject(NDArrayUtil.java:378)
      at numpy.core.TypeDescriptor.read(TypeDescriptor.java:163)
      at numpy.core.NDArrayUtil.parseArray(NDArrayUtil.java:214)
      at numpy.core.NDArrayUtil.parseData(NDArrayUtil.java:189)
      at joblib.NumpyArrayWrapper.toArray(NumpyArrayWrapper.java:43)
      at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:88)
      at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
      at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:98)
      at org.jpmml.sklearn.Main.run(Main.java:104)
      at org.jpmml.sklearn.Main.main(Main.java:94)

==== Solution ====

Specify the labels parameter when using CutTransformer. The exception message indicates that, without labels, the bins are stored as pandas Interval objects, which the converter fails to unpickle. The modified code is as follows:

  # 3. Build the DataFrameMapper preprocessing pipeline
  dataFrameMapper = sklearn_pandas.DataFrameMapper([
      (['phone_brand'], [
          sklearn.impute.SimpleImputer(strategy='constant', fill_value='others'),
          ReplaceTransformer(pattern='^(?!Huawei|Apple).*', replacement='others'),
          sklearn.preprocessing.LabelEncoder(),
          sklearn.preprocessing.LabelBinarizer()
      ]),
      (['phone_price'], [
          sklearn.impute.SimpleImputer(strategy='constant', fill_value=0),
          sklearn2pmml.preprocessing.CutTransformer(bins=[-1, 1000, 10000], labels=['1', '2']),
          # sklearn.preprocessing.LabelEncoder(),  # no longer needed once labels are specified
          sklearn.preprocessing.LabelBinarizer()
      ])
  ])
  dataFrameMapper.fit_transform(train_x)

===== InvalidResultException when publishing a model =====

==== Symptoms ====

When the model is published, the UI shows the following error:

  Field "XXXXXX" cannot accept user input value NaN

The backend log reports the following error:

Servlet.service() for servlet [dispatcherServlet] in context with path [/register] threw exception [Request processing failed; nested exception is org.jpmml.evaluator.InvalidResultException: Field "近一个月访问url个数" cannot accept user input value NaN] with root cause org.jpmml.evaluator.InvalidResultException: Field "近一个月访问url个数" cannot accept user input value NaN at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:221) ~[pmml-evaluator-1.4.11.jar!/:na] at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:137) ~[pmml-evaluator-1.4.11.jar!/:na] at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111) ~[pmml-evaluator-1.4.11.jar!/:na] at org.jpmml.evaluator.InputField.prepare(InputField.java:70) ~[pmml-evaluator-1.4.11.jar!/:na] at org.jpmml.evaluator.ModelEvaluator.verify(ModelEvaluator.java:475) ~[pmml-evaluator-1.4.11.jar!/:na] at org.jpmml.evaluator.ModelEvaluator.verify(ModelEvaluator.java:78) ~[pmml-evaluator-1.4.11.jar!/:na] at cn.eppdev.mlib.basic.util.EppdevMlibPmmlUtils.getEvaluatorByContent(EppdevMlibPmmlUtils.java:105) ~[eppdev-mlib-basic-utils-1.0.0.jar!/:1.0.0] at 
cn.eppdev.mlib.basic.util.EppdevMlibPmmlUtils.valid(EppdevMlibPmmlUtils.java:42) ~[eppdev-mlib-basic-utils-1.0.0.jar!/:1.0.0] at cn.eppdev.mlib.register.admin.service.AdminModelService.insert(AdminModelService.java:88) ~[classes!/:1.0.0] at cn.eppdev.mlib.register.admin.service.AdminModelService$$FastClassBySpringCGLIB$$4dcc733.invoke() ~[classes!/:1.0.0] at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:749) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:294) ~[spring-tx-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:98) ~[spring-tx-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688) ~[spring-aop-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at cn.eppdev.mlib.register.admin.service.AdminModelService$$EnhancerBySpringCGLIB$$c913079a.insert() ~[classes!/:1.0.0] at cn.eppdev.mlib.register.admin.web.AdminModelController.doAddModel(AdminModelController.java:126) ~[classes!/:1.0.0] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_232-ea] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_232-ea] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_232-ea] at 
java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_232-ea] at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:189) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:800) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1038) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1005) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:908) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at javax.servlet.http.HttpServlet.service(HttpServlet.java:660) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:882) ~[spring-webmvc-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at 
javax.servlet.http.HttpServlet.service(HttpServlet.java:741) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) ~[tomcat-embed-websocket-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:92) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at 
org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) ~[spring-web-5.1.4.RELEASE.jar!/:5.1.4.RELEASE] at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199) ~[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) 
[tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:834) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1417) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232-ea] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232-ea] at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-embed-core-9.0.14.jar!/:9.0.14] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232-ea]

==== Root cause analysis ====

This is a bug in jpmml. When the model is saved as a PMML file, the verification data that is stored automatically in the PMML writes missing values directly as NaN, while the PMML treats that field as an int field. As a result, the self-check performed while publishing the PMML fails.

Sample verification data:

  懂车帝DCC订单 20-25 欧珀 NaN NaN 0.12311307178035973 0.8768869282196402
  懂车帝DCC订单 20-25 苹果 0 0 0.11877290234764215 0.8812270976523577

==== Solution ====

There are three possible solutions:

  - Manually delete the verification data that contains NaN. You can delete only the value of that field, or delete the entire record.
  - Adjust the PMML output so that the PMML no longer contains verification data.
  - During modeling, process the data first (using the dataframe.fillna method), but still keep the missing-value imputation step in the data preprocessing so that actual scoring does not fail.
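
The first option (deleting whole verification records that contain NaN) can be scripted instead of done by hand. The following is only a minimal sketch that uses the Python standard library. It assumes that the verification data is stored as ''row'' elements inside an ''InlineTable'' under ''ModelVerification'', and that the cells use the ''data:'' namespace shown in ''DATA_NS''; check both against your own PMML file. ''SAMPLE_PMML'', the field name ''url_count'' and the helper function names are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the inline table cells (the "data:" prefix); verify it in your file.
DATA_NS = "http://jpmml.org/jpmml-model/InlineTable"

# Hypothetical, heavily trimmed PMML fragment used only for this illustration.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_3" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.3">
  <RegressionModel functionName="classification">
    <ModelVerification recordCount="2" fieldCount="2">
      <VerificationFields>
        <VerificationField field="url_count" column="data:url_count"/>
        <VerificationField field="probability(1)" column="data:probability_1"/>
      </VerificationFields>
      <InlineTable>
        <row><data:url_count>NaN</data:url_count><data:probability_1>0.8768</data:probability_1></row>
        <row><data:url_count>0</data:url_count><data:probability_1>0.8812</data:probability_1></row>
      </InlineTable>
    </ModelVerification>
  </RegressionModel>
</PMML>"""


def local_name(tag):
    """Return the tag name without the '{namespace}' prefix added by ElementTree."""
    return tag.rsplit("}", 1)[-1]


def drop_nan_verification_rows(pmml_text):
    """Delete every verification row that has a NaN cell; return (xml, removed_count)."""
    root = ET.fromstring(pmml_text)
    # Keep the original prefixes on output, because the "column" attributes
    # refer to cells by their prefixed name (for example "data:url_count").
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    ET.register_namespace("data", DATA_NS)

    removed = 0
    # Collect the elements first so the tree is not modified while it is being iterated.
    verifications = [e for e in root.iter() if local_name(e.tag) == "ModelVerification"]
    for verification in verifications:
        tables = [e for e in verification if local_name(e.tag) == "InlineTable"]
        for table in tables:
            rows = [e for e in table if local_name(e.tag) == "row"]
            for row in rows:
                if any((cell.text or "").strip() == "NaN" for cell in row):
                    table.remove(row)
                    removed += 1
            # Keep the declared record count consistent with the remaining rows.
            remaining = sum(1 for e in table if local_name(e.tag) == "row")
            verification.set("recordCount", str(remaining))
    return ET.tostring(root, encoding="unicode"), removed


cleaned_pmml, removed_rows = drop_nan_verification_rows(SAMPLE_PMML)
print(removed_rows)
```

The sketch removes the whole record instead of a single cell, so every remaining row still has a value for each verification field. Keep a backup of the original PMML file before overwriting it with the cleaned text.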